EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting

Daiwei Zhang1, Gengyan Li1,2, Jiajie Li1, Mickaël Bressieux1, Otmar Hilliges1, Marc Pollefeys1,3, Luc Van Gool1,4,5, Xi Wang1
1ETH Zürich, 2Google, 3Microsoft, 4KU Leuven, 5INSAIT, Sofia
3DV 2025

EgoGaussian simultaneously reconstructs 3D scenes and dynamically tracks 3D object motion from RGB egocentric input alone.

Abstract

Human activities are inherently complex, often involving numerous object interactions. To better understand these activities, it is crucial to model the interactions with the environment through the dynamic changes they induce. The recent availability of affordable head-mounted cameras and egocentric data offers a more accessible and efficient means to understand human-object interactions in 3D environments. However, most existing methods for human activity modeling neglect dynamic interactions with objects and produce only static representations. The few solutions that do model dynamics often require inputs from multiple sources, such as multi-camera setups, depth-sensing cameras, or kinesthetic sensors.

To this end, we introduce EgoGaussian, the first method capable of simultaneously reconstructing the 3D scene and dynamically tracking 3D object motion from RGB egocentric input alone. We leverage the uniquely discrete nature of Gaussian Splatting to segment the dynamically interacting object from the background, keeping explicit representations of both. Our approach employs a clip-level online learning pipeline that exploits the dynamic nature of human activities, allowing us to reconstruct the temporal evolution of the scene in chronological order and track rigid object motion. EgoGaussian shows significant improvements in both dynamic object and background reconstruction quality compared to the state-of-the-art. We also qualitatively demonstrate the high quality of the reconstructed models.

Dynamic reconstruction

Our compact scene representation enables the rendering of novel views of the dynamic scene from arbitrary viewpoints.


Results on HOI4D and EPIC-KITCHENS. For each sequence we show the input video alongside renderings from a fixed camera, a panoptic view, and a free-view camera path.

Novel View Synthesis

We compare our method with two state-of-the-art dynamic 3D-GS-based methods, both of which use deformation fields to model monocular dynamic scenes. To ensure a fair comparison, we additionally modify both methods to mask out gradients on the segmented human body, as EgoGaussian does.
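Concretely, this modification amounts to excluding human pixels from the photometric loss, so that no gradients from hand or body regions flow into the scene Gaussians. Below is a minimal sketch, assuming a PyTorch training loop and a per-frame binary human mask from an off-the-shelf segmenter; the names are illustrative and not the actual EgoGaussian or baseline code.

import torch

# Hedged sketch: `rendered` and `gt` are (3, H, W) images, `human_mask` is a
# (H, W) binary mask with 1 on hand/body pixels. Names are illustrative only.
def masked_photometric_loss(rendered, gt, human_mask):
    """L1 loss restricted to non-human pixels, so body regions contribute
    no gradient to the scene Gaussians."""
    valid = (1.0 - human_mask).unsqueeze(0)          # (1, H, W), broadcast over RGB
    diff = torch.abs(rendered - gt) * valid          # zero out human pixels
    return diff.sum() / valid.sum().clamp(min=1.0)   # mean over valid pixels only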

Comparison panels: Input, EgoGaussian, 4DGS w/o hands (Wu et al., 2023), and Deformable-3DGS w/o hands (Yang et al., 2023).

Dynamic Object Modeling

Deformation-field-based methods also struggle to track and model the rigid motion of the manipulated object on its own.

Comparison panels: Input, EgoGaussian, 4DGS, and Deformable-3DGS.

Approach

Static Reconstruction

We use a clip-level online learning pipeline to jointly reconstruct the static background and segment out the dynamic object.
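For intuition only, the loop below sketches one way such a clip-level pass could look: clips are processed in chronological order, the background is refined on static clips, and Gaussians whose residual error the static model cannot explain during interaction clips are flagged as the dynamic object. The helpers fit_fn and error_fn, the clip.is_static flag, and the threshold are assumptions for illustration, not the paper's implementation.

import torch

def online_static_dynamic_split(clips, num_gaussians, fit_fn, error_fn, threshold=0.05):
    """Process clips chronologically. `fit_fn(clip, frozen)` is assumed to optimize
    the non-frozen Gaussians on a clip; `error_fn(clip)` is assumed to return a
    per-Gaussian photometric residual. Both are caller-supplied stand-ins."""
    dynamic = torch.zeros(num_gaussians, dtype=torch.bool)
    for clip in clips:                        # clips in chronological order
        if clip.is_static:
            fit_fn(clip, frozen=dynamic)      # refine background Gaussians only
        else:
            residual = error_fn(clip)         # (num_gaussians,) residual error
            dynamic |= residual > threshold   # unexplained Gaussians -> dynamic object
    return dynamic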

Dynamic Reconstruction

We use a sequential pipeline with regularization from previous frames, allowing us to simultaneously estimate per-frame object poses and iteratively refine the object's shape.
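As a rough illustration of the rigid-motion fitting, the sketch below optimizes a per-frame rotation (quaternion) and translation for the object Gaussians, initialized from and regularized toward the previous frame's pose. render_and_loss is a hypothetical stand-in for the differentiable rasterizer plus photometric loss; none of the names below are the actual EgoGaussian code.

import torch
import torch.nn.functional as F

def quaternion_to_matrix(q):
    """Convert a unit quaternion (w, x, y, z) into a 3x3 rotation matrix."""
    w, x, y, z = q
    return torch.stack([
        torch.stack([1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)]),
        torch.stack([2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)]),
        torch.stack([2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]),
    ])

# Hedged sketch: fits the rigid pose of the object Gaussians for one frame,
# starting from (and staying close to) the previous frame's estimate.
def fit_rigid_pose(object_xyz, prev_quat, prev_trans, render_and_loss,
                   steps=200, lr=1e-3, smooth_weight=0.1):
    quat = prev_quat.clone().requires_grad_(True)     # (4,) init from previous frame
    trans = prev_trans.clone().requires_grad_(True)   # (3,)
    opt = torch.optim.Adam([quat, trans], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        q = F.normalize(quat, dim=0)                  # keep a unit quaternion
        moved = object_xyz @ quaternion_to_matrix(q).T + trans
        loss = render_and_loss(moved)                 # photometric loss on this frame
        # Regularization from the previous frame: keep the pose change small.
        loss = loss + smooth_weight * ((q - F.normalize(prev_quat, dim=0)).pow(2).sum()
                                       + (trans - prev_trans).pow(2).sum())
        loss.backward()
        opt.step()
    return quat.detach(), trans.detach()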

BibTeX

@article{zhang2024egogaussian,
  title={EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting},
  author={Zhang, Daiwei and Li, Gengyan and Li, Jiajie and Bressieux, Micka{\"e}l and Hilliges, Otmar and Pollefeys, Marc and Van Gool, Luc and Wang, Xi},
  journal={arXiv preprint arXiv:2406.19811},
  year={2024}
}