Latent Region-Wise Image Composition Using Multimodal Large Language Model

dc.contributor.author: Chen, Tan
dc.contributor.supervisor: Laganière, Robert
dc.date.accessioned: 2026-06-29T19:13:56Z
dc.date.issued: 2026-06-29
dc.description.abstract: Recent advances in text-to-image generative models, particularly latent diffusion models and more recent rectified-flow-based variants, have substantially improved the ability to synthesize images from textual prompts. Notably, models such as IP-Adapter provide effective image-prompt conditioning and can capture coarse global appearance cues from reference images. However, these models typically condition generation at the image level, which may limit their ability to faithfully reproduce fine-grained regional details or spatially localized features from the reference. As a result, they may struggle to preserve specific object identities or detailed visual elements from individual reference images, especially when multiple diverse regions or entities are involved. To address this limitation, we propose a novel training-free plugin that adopts a segmented binding and generation scheme. Specifically, we divide the input prompt and image space into semantically meaningful subregions and associate each region with its own reference image. A Multimodal Large Language Model (MLLM) is used as a planner to parse the prompt into region-wise descriptions, while each region is conditioned separately during the diffusion process. This enables more targeted control over specific elements in each region using dedicated reference images, thereby improving the alignment between the generated image and the intended local semantics. In our experiments, the proposed approach achieves stronger visual and semantic consistency than baseline methods such as SDXL, EasyRef, Gen-4 by Runway, and DALL·E 2, particularly in multi-entity and identity-preserving scenarios. Our method provides a flexible and interpretable framework for region-aware, reference-driven image synthesis.
dc.identifier.uri: http://hdl.handle.net/10393/51789
dc.language.iso: en
dc.publisher: Université d'Ottawa / University of Ottawa
dc.rights: Attribution-NonCommercial 4.0 International (en)
dc.rights.uri: http://creativecommons.org/licenses/by-nc/4.0/
dc.subject: Image Generation
dc.title: Latent Region-Wise Image Composition Using Multimodal Large Language Model
dc.type: Thesis (en)
thesis.degree.discipline: Génie / Engineering
thesis.degree.level: Masters
thesis.degree.name: MCS
uottawa.department: Science informatique et génie électrique / Electrical Engineering and Computer Science

Files

Original bundle

Name: Chen_Tan_2026_thesis.pdf
Size: 69.11 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 2.51 KB
Format: Item-specific license agreed upon to submission