Latent Region-Wise Image Composition Using Multimodal Large Language Model
Publisher
Université d'Ottawa / University of Ottawa
Abstract
Recent advances in text-to-image generative models, particularly latent diffusion models and more recent rectified-flow-based variants, have substantially improved the ability to synthesize images from textual prompts. Notably, models such as IP-Adapter provide effective image-prompt conditioning and can capture coarse global appearance cues from reference images. However, these models typically condition generation at the image level, which may limit their ability to faithfully reproduce fine-grained regional details or spatially localized features from the reference. As a result, they may struggle to preserve specific object identities or detailed visual elements from individual reference images, especially when multiple diverse regions or entities are involved.
To address this limitation, we propose a novel training-free plugin that adopts a segmented binding and generation scheme. Specifically, we divide the input prompt and image space into semantically meaningful subregions and associate each region with its own reference image. A Multimodal Large Language Model (MLLM) is used as a planner to parse the prompt into region-wise descriptions, while each region is conditioned separately during the diffusion process. This enables more targeted control over specific elements in each region using dedicated reference images, thereby improving the alignment between the generated image and the intended local semantics.
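The region-wise binding described above can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how regions are represented or how per-region results are merged, so the rectangular boxes, the `Region` and `compose` names, and the mask-weighted averaging are all assumptions made for illustration. Plain nested lists stand in for a latent grid, and constant grids stand in for the per-region denoiser outputs.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """One planned subregion (hypothetical structure, not from the source)."""
    description: str  # region-wise text, as a planner might produce it
    reference: str    # identifier of the reference image bound to this region
    box: tuple        # (top, left, bottom, right) in latent cells, end-exclusive


def region_mask(region, height, width):
    """Binary mask that is 1.0 inside the region's box and 0.0 elsewhere."""
    top, left, bottom, right = region.box
    return [
        [1.0 if top <= r < bottom and left <= c < right else 0.0
         for c in range(width)]
        for r in range(height)
    ]


def compose(predictions, regions, background, height, width):
    """Merge per-region predictions into one latent grid.

    Assumption: overlapping regions are averaged, and cells covered by no
    region fall back to the background prediction.
    """
    masks = [region_mask(reg, height, width) for reg in regions]
    out = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            weight = sum(m[r][c] for m in masks)
            if weight == 0.0:
                out[r][c] = background[r][c]
            else:
                total = sum(m[r][c] * p[r][c]
                            for m, p in zip(masks, predictions))
                out[r][c] = total / weight
    return out


# Toy example: a 2x4 latent grid split into a left and a right region.
H, W = 2, 4
regions = [
    Region("a red fox", "fox_reference.png", (0, 0, 2, 2)),
    Region("a snowy owl", "owl_reference.png", (0, 2, 2, 4)),
]
# Stand-ins for the denoiser output under each region's own conditioning.
left_pred = [[1.0] * W for _ in range(H)]
right_pred = [[2.0] * W for _ in range(H)]
background = [[0.0] * W for _ in range(H)]

latent = compose([left_pred, right_pred], regions, background, H, W)
```

In a real diffusion pipeline this merge would presumably run at each denoising step on multi-channel latents, with each prediction conditioned on that region's description and reference image; the toy grid only shows how spatial masks keep each reference's influence local.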
In our experiments, the proposed approach achieves stronger visual and semantic consistency than baseline methods such as SDXL, EasyRef, Gen-4 by Runway, and DALL·E 2, particularly in multi-entity and identity-preserving scenarios. Our method provides a flexible and interpretable framework for region-aware, reference-driven image synthesis.
Keywords
Image Generation

