Introducing New Modality to Frozen Large Language Models

   
1. University of California, Los Angeles     2. University of Wisconsin-Madison     3. Adobe Research

We introduce X-Fusion - a novel framework that adapts pretrained LLMs (e.g., LLaMA) to new modalities (e.g., vision)
while retaining their language capabilities and world knowledge!

📜 Abstract

We propose X-Fusion, a framework that extends pretrained Large Language Models (LLMs) for multimodal tasks while preserving their language capabilities. X-Fusion employs a dual-tower design with modality-specific weights, keeping the LLM’s parameters frozen while integrating vision-specific information for both understanding and generation. Our experiments demonstrate that X-Fusion consistently outperforms alternative architectures on both image-to-text and text-to-image tasks. We find that incorporating understanding-focused data improves generation quality, reducing image data noise enhances overall performance, and feature alignment accelerates convergence for smaller models but has minimal impact on larger ones. Our findings provide valuable insights into building efficient unified multimodal models.

⚡ X-Fusion Capabilities


📐 X-Fusion

Model Architecture

We propose X-Fusion to address two key challenges: (i) retaining the language abilities of the pretrained LLM while (ii) equipping it with image generation and understanding capabilities. To achieve this, we introduce a Dual-Tower architecture that freezes all language weights and adds separate, trainable vision weights in each layer to process visual information for the LLM. This approach aligns text and vision features not only at the input and output levels, but also at the intermediate processing level.

To show the effectiveness of our approach, we implement alternative transformer block variants for multimodal integration as baselines, including (a) Single Tower, (b) Gated Layer, and (c) Dual Projection. We find that the Dual-Tower architecture achieves the best performance among them on both image generation and understanding tasks.
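Below is a minimal PyTorch-style sketch of how one Dual-Tower layer could be wired; the class and argument names (DualTowerLayer, is_image, and the per-block interfaces) are our own illustration under assumed shapes, not the released implementation.

```python
import torch
import torch.nn as nn

class DualTowerLayer(nn.Module):
    """One Dual-Tower-style layer: a frozen language block paired with a
    trainable vision block. Both see the full multimodal sequence; each
    token keeps the output of the tower matching its modality."""

    def __init__(self, frozen_llm_block: nn.Module, vision_block: nn.Module):
        super().__init__()
        self.llm_block = frozen_llm_block   # pretrained weights, kept frozen
        self.vision_block = vision_block    # new weights, trained for vision
        for p in self.llm_block.parameters():
            p.requires_grad = False

    def forward(self, hidden: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
        # hidden:   (batch, seq_len, dim) mixed text/image token features
        # is_image: (batch, seq_len) boolean mask, True at image positions
        text_out = self.llm_block(hidden)      # frozen language tower
        image_out = self.vision_block(hidden)  # trainable vision tower
        # Route every position to the output of its own tower.
        return torch.where(is_image.unsqueeze(-1), image_out, text_out)
```

The design choice captured here is that both towers process the entire multimodal sequence, so cross-modal alignment can happen inside every layer rather than only at the input or output.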

Training Pipeline
Conceptual comparison of four model architecture baselines. Here, we illustrate how each layer processes the sequential multimodal features. (a) Single Tower: directly fine-tune the pretrained LLM; (b) Gated Layer: duplicate each LLM layer as a gated vision layer; (c) Dual Projection: duplicate the QKV matrices and MLP layers for the vision modality; (d) Dual Tower: duplicate the full transformer block for the vision modality.

Training Recipe

Alongside the model architecture, we also study the training recipe for X-Fusion, focusing on two key questions:
  1. How does noise level affect performance in a joint training setting?
  2. Does multitask training provide mutual benefits between tasks?
We first study the effect of the noise level in the training data. While diffusion-based image modeling benefits from noisy inputs for generation, excessive noise can degrade visual quality and hinder image understanding. In contrast to the conventional choice of capping the noise at \( t = 50\% T \) to reduce distortion for visual understanding, we find that using clean images for visual understanding improves performance on both tasks. The following figure compares performance across maximum noise levels for the image understanding (I2T) task: 0% (clean images), 25%, 50%, 75%, and 100% (the standard generation setting).
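As a rough illustration of this knob (not the paper's training code), the diffusion timestep can be drawn per task, with the I2T branch capped at a chosen maximum noise fraction; the constant T and the helper name below are assumptions for the sketch.

```python
import torch

T = 1000  # total diffusion timesteps (illustrative value)

def sample_timesteps(batch_size: int, task: str, i2t_max_noise: float = 0.0) -> torch.Tensor:
    """T2I (generation) samples t over the full range; I2T (understanding)
    caps t at i2t_max_noise * T. The ablation above finds i2t_max_noise = 0
    (clean images) works best for both tasks."""
    if task == "t2i":
        return torch.randint(0, T, (batch_size,))
    max_t = int(i2t_max_noise * T)
    if max_t == 0:
        return torch.zeros(batch_size, dtype=torch.long)  # clean conditioning images
    return torch.randint(0, max_t, (batch_size,))
```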

We then study the effect of the data ratio in the training mixture. To investigate the synergy between visual understanding and generation from the data perspective, we begin with training data composed entirely of T2I examples (100% T2I, 0% I2T, denoted 100/0 for short), then progressively decrease the proportion of T2I data while increasing I2T data (i.e., 66/33, 50/50, 33/66) until the dataset consists solely of I2T examples (0/100). We find an asymmetric relationship: understanding data benefits generation, but generation data does not enhance understanding.
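A minimal sketch of how such a mixing ratio could be realized when drawing training examples; the helper name and its argument are placeholders for illustration.

```python
import random

def sample_task(t2i_fraction: float) -> str:
    """Choose the task for the next training example.
    t2i_fraction = 1.0 gives the 100/0 mix, 0.66 gives 66/33,
    0.5 gives 50/50, and 0.0 gives 0/100 (pure I2T)."""
    return "t2i" if random.random() < t2i_fraction else "i2t"

# Example: roughly a 66/33 T2I/I2T mixture.
tasks = [sample_task(0.66) for _ in range(6)]
```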

❗ Please refer to the main paper for detailed ablation experiments! ❗

🖼️ Experimental Results

Text-to-Image Generation








Fine-tuning X-Fusion on Downstream Tasks

Can X-Fusion extend its capabilities to other downstream vision-and-language tasks? We fine-tuned our model on four tasks simultaneously (image editing, localization, in/out-painting, and visual question answering) using internal datasets for 20k training steps. In the following figure, we demonstrate that our unified X-Fusion model can handle multiple tasks without introducing task-specific weights.





Transfer from Pretrained Diffusion Model

One main advantage of our dual-tower design is that it allows non-identical block designs in the language and vision towers, since each block in each tower processes the entire feature sequence independently. With this advantage, we can transfer image generation capability from a large-scale pretrained diffusion model built on diffusion transformers (DiT). We trained a variant of X-Fusion that uses a Llama3 model as the language tower and an in-house pretrained text-to-image DiT model as the vision tower, denoted X-Fusion (Pretrained DiT), for 50K steps. We show that X-Fusion (Pretrained DiT) obtains stronger image generation capability than the vanilla X-Fusion.
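A minimal sketch of how such a variant could be assembled, reusing the DualTowerLayer sketched above: the frozen language tower comes from the LLM's blocks, while the vision tower is initialized from pretrained DiT blocks rather than copies of the LLM layers. All names are illustrative and assume matching depth and hidden size.

```python
import torch.nn as nn

def build_dual_tower_from_dit(llm_blocks: nn.ModuleList,
                              dit_blocks: nn.ModuleList) -> nn.ModuleList:
    """Pair each frozen LLM block with a pretrained DiT block acting as the
    trainable vision tower (assumes the two stacks have the same depth and
    hidden size, or that a projection handles any mismatch)."""
    assert len(llm_blocks) == len(dit_blocks), "towers are paired layer by layer"
    return nn.ModuleList(
        DualTowerLayer(llm_block, dit_block)
        for llm_block, dit_block in zip(llm_blocks, dit_blocks)
    )
```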

📥 BibTeX


        to be updated
  

💌 Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Thank you (.❛ ᴗ ❛.).