📅 17 June, 2024 📍 Seattle, US
Workshop on Computer Vision for Mixed Reality
In conjunction with CVPR 2024
Call for Papers


Virtual Reality (VR) technologies have the potential to transform the way we use computing to interact with our environment, do our work, and connect with each other. VR devices provide users with immersive experiences at the cost of blocking the visibility of the surrounding environment. With the advent of passthrough techniques such as those in Quest 3 and Apple Vision Pro, users can now build deeply immersive experiences that mix the virtual and the real world into one, often called Mixed Reality (MR). MR poses a set of unique research problems in computer vision that are not covered by VR. Our focus is on capturing the real environment around the user with cameras that are placed away from the user's eyes, reconstructing that environment with high fidelity, and augmenting it with virtual objects and effects, all in real time. We aim to offer the research community an opportunity to deeply understand the unique challenges of Mixed Reality and to research novel methods encompassing View Synthesis, Scene Understanding, and efficient On-Device AI, among other things.


📍 Summit 332, Seattle Convention Center 📺 Zoom link (TBD)
08:00-08:15 am Rakesh Ranjan Opening Remarks
08:15-09:00 am Douglas Lanman [Keynote] Taking a Small Step in a Different Direction
The computer vision community has recently made rapid and significant progress on the grand challenge of novel view synthesis. New frameworks — including multiplane images, neural radiance fields, and Gaussian splatting — may ultimately provide the foundation for tomorrow’s volumetric video systems. When viewed with emerging mixed reality (MR) headsets, such frameworks may unlock fully immersive forms of today’s television and film content.
Yet, these emerging view synthesis frameworks do not fully meet the needs of MR headsets. In addition to capturing and viewing entire environments across broad viewpoint changes, MR fundamentally needs computer vision systems that can also reproject from headset-mounted sensors to the perspective of the viewer’s eyes. In this talk, we aim to inspire a greater focus in the computer vision community on developing view synthesis algorithms that can achieve this ‘small step’ in perspective with algorithms that may fundamentally differ from emerging frameworks (due to the need to achieve this transformation in real time, with limited computing resources, and at a fidelity approaching that of human vision).
We start with a systems-level view of this problem: examining whether hardware modifications alone might eliminate the need for real-time view reprojection for MR, based on recent psychophysical studies determining the threshold of detectability for perspective distortions. We’ll also review our latest progress on meeting this system-level challenge, covering our ‘neural passthrough’ and ‘reverse passthrough’ headset prototypes, as well as early demonstrations of mixed reality stylization and editing systems that can be applied in combination with real-time passthrough reprojection algorithms. We conclude by looking towards the larger problems in this space, such as building volumetric capture and real-time view synthesis methods that match the limits of human perception, including the challenges of variable-focus, wide-field-of-view, and high-dynamic-range imaging.
09:00-09:30 am Nima Kalantari Reconstructing 3D Scenes from Sparse Images
Reconstructing the visual appearance of scenes has a wide range of applications, including virtual/augmented reality, e-commerce, and video conferencing. In recent years, the field of novel view synthesis has seen significant progress with the introduction of approaches like neural radiance fields. However, accurately reconstructing 3D scenes still requires a large number of input images, which is not feasible in most practical scenarios. In this talk, I will discuss our recent efforts to reconstruct 3D scenes from only a few images or even a single image. Specifically, I will first discuss our work on novel view synthesis from a few images using 3D Gaussian splatting. Then, I will talk about our approach to handling view-dependent highlights in single-image view synthesis.
09:30-10:00 am Federico Tombari 3D scene understanding with neural representations for Augmented Reality
Neural representations have shown tremendous progress and represent a promising tool for novel applications in the space of Augmented and Mixed Reality. In this talk, I will give an overview of the use of neural representations for AR/XR applications, with a focus on 3D scene understanding and on common tasks such as novel view synthesis, 3D semantic segmentation, and 3D asset generation. For each of these three tasks, I will first highlight some important practical limitations of current neural representations. I will then show solutions designed to overcome such limitations, which include mobile novel view synthesis at high frame rates, open-set 3D scene segmentation with radiance fields, and realistic 3D asset generation from text prompts.
10:00-10:30 am Lei Xiao Exploring Neural Rendering for Mixed Reality
In the realm of Mixed Reality, the pursuit of perceptually realistic 3D reconstruction and rendering of dynamic environments represents a significant research challenge. This is a crucial step towards our ultimate aspiration of passing the Visual Turing Test on headsets. In this talk, we will share our experiences and learnings on this subject.
We will touch upon a variety of specific challenges we have encountered, such as gaze-contingent rendering, real-time supersampling, real-time passthrough view synthesis, online video depth estimation, and dynamic object reconstruction. Additionally, we will share our explorations in the creative domain of 3D stylization, and our initial steps towards text-driven realistic 3D editing.
10:30-11:15 am Poster Spotlight + Break Location: 4E Workshop Posters
11:15-11:45 am Natalia Neverova Generative AI for 3D content creation
Scaling XR Metaverse applications will require the development of fast and performant models for immersive content creation, capable of generating and editing individual 3D assets, animated 3D characters, and eventually whole 3D worlds. In this presentation, we will talk about the first foundational building blocks that we are developing as part of this journey, from generating shapes and texturing to creating full 3D assets with PBR materials, starting with textual descriptions and visuals.
11:45 am-12:15 pm Noam Aigerman Manipulating, Deforming and Controlling 3D Objects with Machine Learning
Production of 3D content relies on the ability to manipulate 3D objects by “deforming” them, i.e., moving around 3D points on the object: each frame in an animation sequence is a deformation of a base model; alternatively, generation of 3D shapes often relies on “sculpting” the object from other shapes through deformation, or otherwise adding details to an existing object. Thus, enabling neural networks to directly deform 3D objects can automate and improve such applications, making the learning of deformations a heavily researched area. However, devising learning-based methods that accurately and robustly produce deformations meeting practical application needs is a challenging and unsolved task, especially when considering less-explicit 3D representations, such as NeRFs, SDFs and Gaussian Splats. This talk aims to give an overview of the specific challenges that need to be overcome for a practical framework for learning deformations, as well as the recent directions my work has taken to tackle them.
12:15-12:45 pm Laura Leal-Taixe Efficient Annotations for the Trackers of Tomorrow
Multi-object tracking is an essential task for mixed reality, which aims at seamlessly merging the virtual and the real world and therefore needs a good understanding of the dynamics of the real world. Tracking algorithms thrive on training with large-scale datasets, but video annotation is very time-consuming.
Surprisingly few works explore how to label tracking datasets both efficiently and comprehensively. In this work, we introduce SPAM, a tracking data engine that provides high-quality labels with minimal human intervention. SPAM is built around two key insights: i) most tracking scenarios can be easily resolved. To take advantage of this, we utilize a pre-trained model to generate high-quality pseudo-labels, reserving human involvement for a smaller subset of more difficult instances; ii) handling the spatiotemporal dependencies of track annotations across time can be elegantly and efficiently formulated through graphs. Therefore, we use a unified graph formulation to address the annotation of both detections and identity association for tracks across time. Based on these insights, SPAM produces high-quality annotations at a fraction of the ground-truth labeling cost.


Douglas Lanman (Keynote)

Director of Research, Meta Reality Labs

Natalia Neverova

Meta, GenAI

Noam Aigerman

University of Montreal

Lei Xiao

Meta Reality Labs Research

Laura Leal-Taixe

Nvidia Research & TUM

Nima Kalantari

Texas A&M University

Federico Tombari


Call for Papers

Important Dates:

  • Paper submission deadline: March 15, 2024 (PST)
  • Notification to authors: April 1, 2024
  • Camera-ready deadline: April 10, 2024

Topics of Interest include:

  • Real-time View Synthesis for Passthrough
  • Depth Estimation for Stereoscopic Reconstruction
  • 3D capture, reconstruction and rendering for virtual objects
  • 3D Scene Reconstruction
  • SLAM
  • Scene understanding
  • Stylization for Passthrough
  • Novel Applications of Mixed Reality in areas such as Healthcare, Manufacturing, etc.

Submission Guidelines:

  • We invite submissions of up to 8 pages (excluding references), as well as 4-page extended abstracts.
  • Submitted manuscripts should follow the CVPR 2024 paper template.
  • If you have other media to attach (videos, etc.), please feel free to add anonymized links.
  • Submissions will be rejected without review if they:
      1. Contain more than 8 pages (excluding references).
      2. Violate the double-blind policy.
      3. Violate the dual-submission policy for papers with more than 4 pages excluding references.


Rakesh Ranjan


Peter Vajda


Xiaoyu Xiang


Vikas Chandra


Andrea Colaco

