Latent Region-Wise Image Composition Using Multimodal Large Language Model

Chen, Tan

dc.contributor.author	Chen, Tan
dc.contributor.supervisor	Laganière, Robert
dc.date.accessioned	2026-06-29T19:13:56Z
dc.date.issued	2026-06-29
dc.description.abstract	Recent advances in text-to-image generative models, particularly latent diffusion models and more recent rectified-flow-based variants, have substantially improved the ability to synthesize images from textual prompts. Notably, models such as IP-Adapter provide effective image-prompt conditioning and can capture coarse global appearance cues from reference images. However, these models typically condition generation at the image level, which may limit their ability to faithfully reproduce fine-grained regional details or spatially localized features from the reference. As a result, they may struggle to preserve specific object identities or detailed visual elements from individual reference images, especially when multiple diverse regions or entities are involved. To address this limitation, we propose a novel training-free plugin that adopts a segmented binding and generation scheme. Specifically, we divide the input prompt and image space into semantically meaningful subregions and associate each region with its own reference image. A Multimodal Large Language Model (MLLM) is used as a planner to parse the prompt into region-wise descriptions, while each region is conditioned separately during the diffusion process. This enables more targeted control over specific elements in each region using dedicated reference images, thereby improving the alignment between the generated image and the intended local semantics. In our experiments, the proposed approach achieves stronger visual and semantic consistency than baseline methods such as SDXL, EasyRef, Gen-4 by Runway, and DALL·E 2, particularly in multi-entity and identity-preserving scenarios. Our method provides a flexible and interpretable framework for region-aware, reference-driven image synthesis.
dc.identifier.uri	http://hdl.handle.net/10393/51789
dc.identifier.uri	https://doi.org/10.20381/ruor-32045
dc.language.iso	en
dc.publisher	Université d'Ottawa / University of Ottawa
dc.rights	Attribution-NonCommercial 4.0 International	en
dc.rights.uri	http://creativecommons.org/licenses/by-nc/4.0/
dc.subject	Image Generation
dc.title	Latent Region-Wise Image Composition Using Multimodal Large Language Model
dc.type	Thesis	en
thesis.degree.discipline	Génie / Engineering
thesis.degree.level	Masters
thesis.degree.name	MCS
uottawa.department	Science informatique et génie électrique / Electrical Engineering and Computer Science

Files

Original bundle

Name: Chen_Tan_2026_thesis.pdf
Size: 69.11 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 2.51 KB
Format: Item-specific license agreed to upon submission

Collections

- Theses, 2011 -