Latent Region-Wise Image Composition Using Multimodal Large Language Model

Chen, Tan

Files

Main file: Chen_Tan_2026_thesis.pdf (69.11 MB)

Date

2026-06-29

Authors

Chen, Tan

Publisher

Université d'Ottawa / University of Ottawa

Creative Commons License

Attribution-NonCommercial 4.0 International

Abstract

Recent advances in text-to-image generative models, particularly latent diffusion models and the newer rectified-flow-based variants, have substantially improved the ability to synthesize images from textual prompts. Methods such as IP-Adapter add effective image-prompt conditioning and can capture coarse global appearance cues from reference images. However, these methods typically condition generation at the image level, which may limit their ability to faithfully reproduce fine-grained regional details or spatially localized features from the reference. As a result, they may struggle to preserve specific object identities or detailed visual elements from individual reference images, especially when multiple diverse regions or entities are involved. To address this limitation, we propose a training-free plugin that adopts a segmented binding and generation scheme. Specifically, we divide the input prompt and the image space into semantically meaningful subregions and associate each region with its own reference image. A Multimodal Large Language Model (MLLM) serves as a planner that parses the prompt into region-wise descriptions, and each region is conditioned separately during the diffusion process. This enables more targeted control over specific elements in each region using dedicated reference images, thereby improving the alignment between the generated image and the intended local semantics. In our experiments, the proposed approach achieves stronger visual and semantic consistency than baseline methods such as SDXL, EasyRef, Gen-4 by Runway, and DALL·E 2, particularly in multi-entity and identity-preserving scenarios. Our method provides a flexible and interpretable framework for region-aware, reference-driven image synthesis.
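
The abstract does not spell out how the separately conditioned regions are merged, so the following is only a minimal illustrative sketch of one common way to combine region-wise results: mask-weighted blending on a latent grid. Everything here is an assumption made for illustration and is not taken from the thesis: the `Region` class, the `region_mask` and `compose` helpers, the rectangular boxes, the file names, and the use of a single-channel grid of floats as a stand-in for a real latent tensor.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical planned subregion: a box on the latent grid plus its conditioning."""
    box: tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive
    prompt: str                     # region-wise description, e.g. produced by a planner
    reference: str                  # identifier of the reference image bound to this region


def region_mask(region, height, width):
    """Binary mask that is 1.0 inside the region's box and 0.0 elsewhere."""
    top, left, bottom, right = region.box
    return [[1.0 if top <= y < bottom and left <= x < right else 0.0
             for x in range(width)]
            for y in range(height)]


def compose(predictions, regions, background, height, width):
    """Blend one prediction per region into a single grid.

    Overlapping regions are averaged; cells covered by no region keep
    the background value.
    """
    masks = [region_mask(r, height, width) for r in regions]
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            weight = sum(m[y][x] for m in masks)
            if weight == 0.0:
                out[y][x] = background[y][x]
            else:
                blended = sum(m[y][x] * p[y][x] for m, p in zip(masks, predictions))
                out[y][x] = blended / weight
    return out


# Toy usage: two side-by-side regions on a 2x4 grid. The constant grids stand in
# for the outputs of two separately conditioned denoising passes.
H, W = 2, 4
regions = [
    Region((0, 0, 2, 2), "a red fox", "fox.png"),
    Region((0, 2, 2, 4), "a grey owl", "owl.png"),
]
predictions = [[[1.0] * W for _ in range(H)], [[2.0] * W for _ in range(H)]]
background = [[0.0] * W for _ in range(H)]
latent = compose(predictions, regions, background, H, W)
```

In an actual diffusion pipeline a blend of this kind would typically be applied to multi-channel latents at each denoising step, with one conditioned pass per region; the toy grid above only shows the masking arithmetic.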

Keywords

Image Generation

URI

http://hdl.handle.net/10393/51789
https://doi.org/10.20381/ruor-32045

Collections

- Theses, 2011 -
