We propose X-Fusion, a framework that extends pretrained Large Language Models (LLMs) to multimodal tasks while preserving their language capabilities. X-Fusion employs a dual-tower design with modality-specific weights, keeping the LLM's parameters frozen while integrating vision-specific information for both understanding and generation. Our experiments demonstrate that X-Fusion consistently outperforms alternative architectures on both image-to-text and text-to-image tasks. We find that incorporating understanding-focused data improves generation quality, that reducing noise in image data enhances overall performance, and that feature alignment accelerates convergence for smaller models while having minimal impact on larger ones. Our findings provide valuable insights into building efficient unified multimodal models.
We propose X-Fusion to address two key challenges: (i) retaining the language abilities of the pretrained LLM while (ii) equipping it with image generation and understanding capabilities. To achieve this, we introduce a Dual-Tower architecture that freezes all language weights and adds separate vision weights in each layer to process visual information for the LLM. This approach aligns text and vision features not only at the input or output level, but also at the intermediate processing level.
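The sketch below illustrates this idea in PyTorch: each layer pairs a frozen text tower with a trainable vision tower of the same shape, routes each token to the tower matching its modality, and lets both towers attend over the full mixed sequence so features interact at every intermediate layer. This is a minimal, hypothetical sketch; the module names, dimensions, and routing logic are our own illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

class DualTowerBlock(nn.Module):
    """One transformer layer with a frozen text tower and a trainable vision tower (illustrative sketch)."""

    def __init__(self, dim: int = 1024, num_heads: int = 16):
        super().__init__()
        # Frozen language weights (in practice, copied from the pretrained LLM layer).
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.text_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Separate, trainable vision weights with the same structure.
        self.vision_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vision_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Freeze the language tower; only the vision tower receives gradient updates.
        for p in list(self.text_attn.parameters()) + list(self.text_mlp.parameters()):
            p.requires_grad = False

    def forward(self, x: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim) mixed text/image tokens; is_image: (batch, seq) boolean modality mask.
        # Both towers attend over the full mixed sequence, so text and vision
        # features are aligned at every intermediate layer, not just at the ends.
        text_out, _ = self.text_attn(x, x, x)
        vision_out, _ = self.vision_attn(x, x, x)
        attn_out = torch.where(is_image.unsqueeze(-1), vision_out, text_out)
        h = x + attn_out
        # Route each token through the MLP of its own modality.
        mlp_out = torch.where(is_image.unsqueeze(-1), self.vision_mlp(h), self.text_mlp(h))
        return h + mlp_out
```

Stacking such blocks keeps the frozen language path identical to the original LLM for text-only inputs, while the vision tower learns to process image tokens in context.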
To demonstrate the effectiveness of our approach, we compare the Dual-Tower design against alternative transformer block variants for multimodal integration, including (a) Single Tower, (b) Gated Layer, and (c) Dual Projection. Among these, the Dual-Tower architecture achieves the best performance on both image generation and understanding tasks.
to be updated
This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.