
Language-Guided 4D Object Reconstruction from Videos

dc.contributor.author: Hu, Xiao
dc.contributor.supervisor: Lang, Jochen
dc.date.accessioned: 2026-01-28T16:01:19Z
dc.date.available: 2026-01-28T16:01:19Z
dc.date.issued: 2026-01-28
dc.description.abstract: Creating 4D assets from real-world data is a crucial problem in computer vision, with applications in Augmented Reality (AR), Virtual Reality (VR), and animation. However, this task is often cost-prohibitive, as it requires capturing both detailed 3D object structure and accurate motion within a complex 3D environment. This thesis proposes a framework that converts object-centric monocular videos into 4D representations. We tackle two primary challenges: 1) extracting the key object from the video, and 2) capturing and representing the dynamic object. For the first challenge, Referring Video Object Segmentation (RVOS) is employed, using language prompts to identify target objects while ignoring distracting backgrounds. We first analyze the limitations of existing RVOS methods, particularly their temporal consistency, and then enhance them by improving their understanding of temporal context. Additionally, we adapt previous frameworks into online methods capable of processing video frames in real time, which significantly increases their usability. For the second challenge, 3D Gaussian Splatting (3DGS), a popular method for 3D scene representation, serves as the baseline for dynamic object representation. We identify a main limitation of 3DGS: large object motions are handled poorly during reconstruction. To address this, we design a motion-deformation-decoupled dynamic 3DGS structure that estimates the object's large overall motion and its local deformation separately, representing highly dynamic objects more faithfully. Additionally, casual monocular video inherently limits spatio-temporal observations: only part of the object can be observed at any point in time, which leads to undesired artifacts in weakly observed areas. To overcome this limitation, we propose a novel framework that leverages the prior of a 2D generative model to impose additional constraints on the weakly observed spatio-temporal areas.
In summary, this thesis takes a casually captured monocular video, along with a descriptive sentence identifying the desired object, as input, and produces a complete 4D object with accurate 3D structure and corresponding motion, advancing the usability of 4D reconstruction in practical applications.
dc.identifier.uri: http://hdl.handle.net/10393/51324
dc.identifier.uri: https://doi.org/10.20381/ruor-31712
dc.language.iso: en
dc.publisher: Université d'Ottawa / University of Ottawa
dc.rights: Attribution 4.0 International
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject: Vision Language Understanding
dc.subject: 4D Reconstruction
dc.title: Language-Guided 4D Object Reconstruction from Videos
dc.type: Thesis
thesis.degree.discipline: Génie / Engineering
thesis.degree.level: Doctoral
thesis.degree.name: PhD
uottawa.department: Science informatique et génie électrique / Electrical Engineering and Computer Science

Files

Original bundle

Name: Hu_Xiao_2026_thesis.pdf
Size: 145.25 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 6.65 KB
Description: Item-specific license agreed upon to submission