
Language-Guided 4D Object Reconstruction from Videos

dc.contributor.author: Hu, Xiao
dc.contributor.supervisor: Lang, Jochen
dc.date.accessioned: 2026-01-28T16:01:19Z
dc.date.available: 2026-01-28T16:01:19Z
dc.date.issued: 2026-01-28
dc.description.abstract: Creating 4D assets from real-world data is a crucial problem in computer vision, with applications in Augmented Reality (AR), Virtual Reality (VR), and animation. However, this task is often cost-prohibitive, as it requires capturing both detailed 3D object structure and accurate motion within a complex 3D environment. This thesis proposes a framework that converts object-centric monocular videos into 4D representations. We tackle two primary challenges: 1) extracting the key object from the video, and 2) capturing and representing the dynamic object. For the first challenge, Referring Video Object Segmentation (RVOS) is employed, using language prompts to identify target objects while ignoring distracting backgrounds. We first analyze the limitations of existing RVOS methods, particularly their temporal consistency, and then enhance them by improving their understanding of temporal context. Additionally, we adapt previous frameworks into online methods capable of processing video frames in real time, which significantly increases their usability. For the second challenge, 3D Gaussian Splatting (3DGS), a popular method for 3D scene representation, serves as the baseline for dynamic object representation. We identify a main limitation of 3DGS: large object motions are handled poorly during reconstruction. To address this, we design a motion-deformation-decoupled dynamic 3DGS structure that estimates the object's large overall motion and its local deformation separately, representing highly dynamic objects more faithfully. Additionally, casual monocular video inherently limits spatio-temporal observations: only part of the object can be observed at any point in time, which leads to undesired artifacts in weakly observed areas. To overcome this limitation, we propose a novel framework that leverages the prior of a 2D generative model to impose additional constraints on the weakly observed spatio-temporal areas.
In summary, this thesis takes a casually captured monocular video, along with a descriptive sentence identifying the desired object, as input, and produces a complete 4D object with accurate 3D structure and corresponding motion, advancing the usability of 4D reconstruction in practical applications.
dc.identifier.uri: http://hdl.handle.net/10393/51324
dc.identifier.uri: https://doi.org/10.20381/ruor-31712
dc.language.iso: en
dc.publisher: Université d'Ottawa / University of Ottawa
dc.rights: Attribution 4.0 International
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject: Vision Language Understanding
dc.subject: 4D Reconstruction
dc.title: Language-Guided 4D Object Reconstruction from Videos
dc.type: Thesis
thesis.degree.discipline: Génie / Engineering
thesis.degree.level: Doctoral
thesis.degree.name: PhD
uottawa.department: Science informatique et génie électrique / Electrical Engineering and Computer Science

Files

Original bundle

Name: Hu_Xiao_2026_thesis.pdf
Size: 145.25 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 6.65 KB
Description: Item-specific license agreed upon to submission