CARFF: Conditional Auto-encoded Radiance Field for 3D Scene Forecasting

1Harvard University, 2Columbia University, 3UC Berkeley, 4Avataar.ai, 5Toyota Research Institute
Accepted to ECCV 2024

Abstract

We propose CARFF: Conditional Auto-encoded Radiance Field for 3D Scene Forecasting, a method for predicting future 3D scenes given past observations. Our method maps 2D ego-centric images to a distribution over plausible 3D latent scene configurations and predicts the evolution of hypothesized scenes through time. The latents condition a global Neural Radiance Field (NeRF) to represent a 3D scene model, enabling explainable predictions and straightforward downstream planning. This approach models the world as a partially observable Markov decision process (POMDP) and handles complex scenarios with uncertainty in both environmental states and dynamics. Specifically, we employ two-stage training of a Pose-Conditional VAE (PC-VAE) and a NeRF to learn 3D representations, and auto-regressively predict latent scene representations using a mixture density network (MDN). We demonstrate the utility of our method in the CARLA driving simulator, where CARFF enables efficient trajectory and contingency planning in complex multi-agent autonomous driving scenarios involving occlusions.

Two-Stage Training and Inference

CARFF's two-stage training process

The PC-VAE encodes images into Gaussian latent distributions. Upper right: the pose-conditional decoder stochastically decodes a sampled latent, given the camera pose, into an image; the decoded reconstruction and the ground-truth image form the MSE loss for the PC-VAE. Lower right: a NeRF is trained by conditioning on latent variables sampled from the optimized Gaussian parameters, which characterize the per-timestamp distributions derived from the PC-VAE. The NeRF is trained with its own separate MSE loss.
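
To make the two-stage procedure concrete, here is a minimal PyTorch-style sketch under our own assumptions: the pcvae and nerf modules, the data fields yielded by each loader, and the KL regularizer with weight beta are illustrative stand-ins (the caption above mentions only the MSE terms), not the paper's released code.

import torch
import torch.nn.functional as F

def train_stage1_pcvae(pcvae, loader, opt, beta=1e-4):
    """Stage 1: learn a Gaussian latent per image with a pose-conditional decoder."""
    for image, pose in loader:
        mu, logvar = pcvae.encode(image)                      # Gaussian latent parameters
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterized sample
        recon = pcvae.decode(z, pose)                         # decode conditioned on camera pose
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()
        loss = F.mse_loss(recon, image) + beta * kl           # MSE reconstruction + assumed KL term
        opt.zero_grad(); loss.backward(); opt.step()

def train_stage2_nerf(nerf, pcvae, loader, opt):
    """Stage 2: condition a global NeRF on latents sampled from the frozen PC-VAE."""
    for pixels, rays, image in loader:                        # ground-truth pixels for the rays
        with torch.no_grad():                                 # PC-VAE is frozen in this stage
            mu, logvar = pcvae.encode(image)                  # per-timestamp Gaussian parameters
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        loss = F.mse_loss(nerf.render(rays, z), pixels)       # separate MSE loss for the NeRF
        opt.zero_grad(); loss.backward(); opt.step()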

CARFF's auto-regressive inference pipeline

The input image is encoded by the pre-trained PC-VAE to obtain a latent distribution, which is fed into our MDN. The MDN predicts a mixture of Gaussians, from which a predicted latent is sampled and used to render a 3D view of the scene. To predict auto-regressively, we probe the NeRF for the location of the car and feed this information back to the pre-trained encoder to predict the scene at the next timestamp.
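
The inference loop can be sketched as follows. This is an illustrative PyTorch-style rollout: pcvae, mdn, and nerf are hypothetical stand-ins for the pre-trained modules, and their interfaces (probe_car_location, render_view) are assumptions, not the paper's actual API.

import torch

def rollout(image, pcvae, mdn, nerf, camera_pose, steps=5):
    """Auto-regressive belief rollout over hypothetical module interfaces."""
    obs, latents = image, []
    for _ in range(steps):
        mu, logvar = pcvae.encode(obs)                     # latent distribution for the scene
        pi, means, sigmas = mdn(mu, logvar)                # mixture of Gaussians over next latents
        k = torch.distributions.Categorical(pi).sample()   # pick a mixture component
        z = torch.normal(means[k], sigmas[k])              # sample the predicted latent
        latents.append(z)
        car_xy = nerf.probe_car_location(z)                # probe the NeRF for the car's location
        obs = nerf.render_view(camera_pose, z, at=car_xy)  # feed this view back to the encoder
    return latents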

Results

Accuracy and recall curves from predicted beliefs

As the number of samples increases from 0 to 50, we plot the coverage of the belief state achieved under partial observation (recall) and the proportion of correct beliefs sampled under full observation (accuracy) for predicted beliefs. The curves show an ideal margin between the two for both Multi-Scene CARLA datasets used to train our model.
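
For intuition, the two curves could be computed along these lines. This is a hedged sketch: the sampler callables, the scene labels, and the equality-based matching criterion are our own illustrative assumptions, not the paper's evaluation code.

import numpy as np

def belief_curves(sample_partial, sample_full, plausible, true_state, n_max=50):
    """sample_partial(n) / sample_full(n) draw n belief samples from the model
    under partial / full observation; `plausible` is the set of scene labels
    consistent with the partial view, `true_state` the label under full view."""
    recall, accuracy = [], []
    for n in range(1, n_max + 1):
        part, full = sample_partial(n), sample_full(n)
        # Recall: fraction of plausible states covered by the sampled beliefs.
        recall.append(np.mean([s in part for s in plausible]))
        # Accuracy: fraction of sampled beliefs that match the true state.
        accuracy.append(np.mean([s == true_state for s in full]))
    return np.array(recall), np.array(accuracy)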

CARFF planning with controllers

CARFF-based controllers outperform baseline controllers, choosing the optimal action in potential collision scenarios in all 30 trials conducted.
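
A CARFF-based contingency controller can be pictured as the following sketch; sample_beliefs, occupies_path, the density threshold, and the two-action interface are hypothetical stand-ins for the paper's planner, not its actual implementation.

import torch

def sample_beliefs(image, pcvae, mdn, n):
    # Hypothetical: draw n latents from the MDN's predicted mixture
    # (cf. the rollout sketch above).
    mu, logvar = pcvae.encode(image)
    pi, means, sigmas = mdn(mu, logvar)
    ks = torch.distributions.Categorical(pi).sample((n,))
    return [torch.normal(means[k], sigmas[k]) for k in ks]

def occupies_path(nerf, z, ego_path, thresh=10.0):
    # Hypothetical check: query the latent-conditioned density field at points
    # along the planned path; high density means another agent occupies it.
    return any(nerf.density(p, z) > thresh for p in ego_path)

def choose_action(image, pcvae, mdn, nerf, ego_path, n_samples=50):
    """Brake if any sampled future scene places another car on the ego path."""
    for z in sample_beliefs(image, pcvae, mdn, n_samples):  # predicted scene latents
        if occupies_path(nerf, z, ego_path):
            return "brake"    # contingency: an occluded car may emerge
    return "proceed"          # no hypothesized scene blocks the path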


BibTeX

@misc{yang2024carff,
  title={CARFF: Conditional Auto-encoded Radiance Field for 3D Scene Forecasting},
  author={Jiezhi Yang and Khushi Desai and Charles Packer and Harshil Bhatia and Nicholas Rhinehart and Rowan McAllister and Joseph Gonzalez},
  year={2024},
  eprint={2401.18075},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}