In this work, we explore an untapped signal in diffusion model inference. While all previous methods generate images independently at inference, we instead ask whether samples can be generated collaboratively. We propose Group Diffusion, which unlocks the attention mechanism to be shared across images rather than limited to the patches within a single image. This enables images to be jointly denoised at inference time, exploiting both intra- and inter-image correspondence. We observe a clear scaling effect: larger group sizes yield stronger cross-sample attention and better generation quality. Furthermore, we introduce a quantitative measure of this behavior and show that its strength closely correlates with FID. Built on standard diffusion transformers, our GroupDiff achieves up to a 32.2% FID improvement on ImageNet 256x256. Our work reveals cross-sample inference as an effective, previously unexplored mechanism for generative modeling.
Standard diffusion models process batches of images in parallel, but each image is denoised in isolation. We identify this as a missed opportunity: can images in a batch collaborate? Our method, Group Diffusion, denoises a group of related samples jointly.
This simple design allows our method to be seamlessly plugged into existing diffusion transformer architectures. In our implementation, we adopt DiT, SiT, and JiT as baselines to validate the effectiveness of Group Diffusion.
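To make the mechanism concrete, the sketch below shows one way a standard self-attention block in a diffusion transformer can be widened to a group: the token sequences of the images in a group are concatenated along the sequence dimension before attention, so every patch can attend to patches of the other images. The module name, its arguments, and the convention that consecutive batch entries form a group are our illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupAttention(nn.Module):
    """Self-attention shared across the images of a group (illustrative sketch).

    Tokens of the images in a group are concatenated along the sequence
    dimension, so attention covers both intra- and inter-image pairs.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, group_size: int) -> torch.Tensor:
        # x: (B, N, D) patch tokens; consecutive batch entries form a group.
        B, N, D = x.shape
        G = group_size
        # Merge each group into one long token sequence: (B // G, G * N, D).
        x = x.view(B // G, G * N, D)
        qkv = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B // G, G * N, self.num_heads, -1).transpose(1, 2)
                   for t in qkv)
        out = F.scaled_dot_product_attention(q, k, v)  # intra- and inter-image attention
        out = out.transpose(1, 2).reshape(B // G, G * N, D)
        out = self.proj(out)
        # Split back into per-image token sequences: (B, N, D).
        return out.view(B, N, D)
```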
During training, we dynamically construct groups of related images to enable effective cross-sample interaction. For each target image, we retrieve a set of semantically or visually similar candidates using cosine similarity on pre-trained embeddings (e.g., CLIP or DINO). We then randomly sample from these candidates to form a group, ensuring the model processes a coherent set of samples that can mutually aid the denoising process.
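The grouping can be sketched as a simple nearest-neighbor lookup over pre-computed embeddings. The helper below is a hypothetical illustration: it assumes an (N, D) matrix of L2-normalized CLIP or DINO features for the training set and draws each group from the anchor's top-k most similar images.

```python
import torch

def build_group(features: torch.Tensor, anchor_idx: int,
                group_size: int = 4, top_k: int = 32) -> torch.Tensor:
    """Return the indices of a group of related images for one anchor (sketch).

    features: (N, D) L2-normalized embeddings (e.g. CLIP or DINO) of the
    training set. The anchor is always included; the remaining members are
    sampled at random from its top-k cosine-similarity neighbors.
    """
    sims = features @ features[anchor_idx]        # cosine similarity, shape (N,)
    sims[anchor_idx] = float("-inf")              # exclude the anchor itself
    candidates = sims.topk(top_k).indices         # most similar images
    picks = candidates[torch.randperm(top_k)[: group_size - 1]]
    return torch.cat([torch.tensor([anchor_idx]), picks])
```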
We propose two variants: GroupDiff-f and GroupDiff-l. GroupDiff-f employs group attention for both the conditional and unconditional branches, whereas GroupDiff-l applies group attention exclusively to the unconditional branch. This distinction leads to a significant difference in training efficiency: since the unconditional branch is trained only about 10% of the time (the standard Classifier-Free Guidance dropout probability), GroupDiff-l can be trained with a group size of one for the vast majority of iterations (90%). This design makes GroupDiff-l computationally lightweight, with training cost comparable to the baseline.
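The efficiency argument can be made concrete with a sketch of one training step. Under the usual 10% condition dropout used for Classifier-Free Guidance, a GroupDiff-l-style model only pays for group attention when the label is dropped; the branching below is our illustration of that scheme, with `model`, the loss, and the pre-grouped batch left as assumed placeholders.

```python
import torch

P_UNCOND = 0.1  # standard CFG condition-dropout probability

def training_step(model, grouped_images, labels, group_size=4):
    """One illustrative GroupDiff-l-style training step (placeholder interfaces).

    Group attention is only used on the unconditional branch, which is sampled
    ~10% of the time; the other ~90% of steps run with group size 1.
    """
    if torch.rand(()).item() < P_UNCOND:
        # Unconditional branch: drop labels and denoise the group jointly.
        loss = model(grouped_images, labels=None, group_size=group_size)
    else:
        # Conditional branch: standard per-image denoising.
        loss = model(grouped_images, labels=labels, group_size=1)
    return loss
```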
Our experiments validate the core hypothesis: images can effectively collaborate during inference. By analyzing the internal attention weights, we observe that the model naturally establishes both intra-image and inter-image correspondence without explicit supervision.
We further investigate how the model utilizes the group context. We hypothesize two operating modes: a "distributed" mode, in which attention is spread roughly evenly across the samples in the group, versus a "neighbor-focused" mode, in which the model attends primarily to the most similar sample. To quantify this, we introduce the Cross-Sample Attention Score, which measures how strongly the attention peaks on the nearest neighbor:
\[
S_{\text{cross}} = \frac{P_{\text{cross-max}} - P_{\text{cross-mean}}}{P_{\text{cross-max}}},
\]
where \( P_{\text{cross-max}} \) and \( P_{\text{cross-mean}} \) denote the maximum and mean cross-sample attention weights from one image to the others in the group. A score close to 1 indicates attention is highly focused on a single related sample, while a score near 0 implies uniform attention distribution.
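To make the measurement concrete, here is a minimal sketch of the score for a single image, assuming attention weights already averaged over heads and using the normalized-gap form of the equation above; the function name, tensor layout, and the per-image aggregation (summing the attention mass sent to each image's tokens) are our assumptions.

```python
import torch

def cross_sample_attention_score(attn: torch.Tensor, group_size: int, idx: int) -> float:
    """Cross-Sample Attention Score of image `idx` in its group (illustrative).

    attn: (G*N, G*N) attention matrix over the concatenated token sequences of
    a group of size G >= 2, already averaged over heads. The cross-sample
    weight for another image is the attention mass sent to its tokens.
    """
    G, N = group_size, attn.shape[0] // group_size
    rows = attn[idx * N:(idx + 1) * N]                         # queries of image `idx`
    per_image = rows.view(N, G, N).sum(-1).mean(0)             # (G,) mass sent to each image
    cross = torch.cat([per_image[:idx], per_image[idx + 1:]])  # drop the image itself
    p_max, p_mean = cross.max(), cross.mean()
    return ((p_max - p_mean) / p_max).item()                   # ~1: focused, ~0: uniform
```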
Our analysis reveals that GroupDiff strongly encourages this neighbor-focused attention. Distinct clusters, corresponding to different group sizes, emerge in the plot; increasing the group size consistently produces stronger cross-sample attention. Moreover, even within each cluster, higher scores correlate with lower FID, confirming that the score reliably reflects generation quality.
Crucially, we discover that larger group sizes inherently encourage this neighbor-focused attention mechanism. This leads to a consistent scaling behavior: as we scale up the group size, the Cross-Sample Attention Score increases, directly translating to improved generation quality. We show selected examples of class-conditional generation using GroupDiff-f with group sizes of {1, 2, 4, 8} without CFG.
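As a usage illustration, joint generation only changes the shape of the batch handed to the denoiser. A hypothetical Euler sampling loop for a flow-matching backbone (as in SiT) might look like the sketch below; `model`, its signature, and the latent shape are placeholders rather than the actual interface.

```python
import torch

@torch.no_grad()
def sample_group(model, class_id: int, group_size: int = 4,
                 num_steps: int = 50, latent_shape=(4, 32, 32)):
    """Jointly denoise `group_size` images of one class with an Euler sampler (sketch)."""
    x = torch.randn(group_size, *latent_shape)
    labels = torch.full((group_size,), class_id, dtype=torch.long)
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        # The whole group is passed as one batch; group attention lets the
        # samples attend to each other at every denoising step.
        t = torch.full((group_size,), t_cur.item())
        v = model(x, t, labels, group_size=group_size)  # predicted velocity
        x = x + (t_next - t_cur) * v                    # Euler update toward t = 0
    return x
```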
We evaluate GroupDiff-4 on top of different baselines in two distinct settings, training from scratch and plug-and-play finetuning, and benchmark it against leading generative systems on ImageNet 256x256. The table below reports FID (lower is better), Inception Score (IS), Precision, and Recall; GroupDiff consistently improves performance across settings and outperforms state-of-the-art baselines.
| Method | Epochs | FID | IS | Prec. | Recall |
|---|---|---|---|---|---|
| With semantic feature distillation | | | | | |
| DDT-XL | 800 | 1.26 | 310.6 | 0.79 | 0.65 |
| SiT-XL/2 + REPA-E | 800 | 1.26 | 314.9 | 0.79 | 0.66 |
| SiT-XL/2 + REPA | 800 | 1.42 | 305.7 | 0.80 | 0.64 |
| + Our GroupDiff-4 (finetune) | 800+100 | 1.14 | 315.3 | 0.77 | 0.66 |
| Without semantic feature distillation | | | | | |
| ADM | 400 | 4.59 | 186.7 | 0.82 | 0.52 |
| VDM++ | - | 2.12 | 267.7 | - | - |
| LDM-4 | 1400 | 3.60 | 247.7 | 0.87 | 0.48 |
| MDTv2-XL/2 | 900 | 1.58 | 314.7 | 0.79 | 0.65 |
| VAR-d30 | 350 | 1.92 | 323.1 | 0.82 | 0.59 |
| LlamaGen-3B | - | 2.18 | 263.3 | 0.81 | 0.58 |
| RandAR-XXL | 300 | 2.15 | 322.0 | 0.79 | 0.62 |
| MaskDiT | 1600 | 2.28 | 276.6 | 0.89 | 0.61 |
| DiT-XL/2 | 1400 | 2.27 | 278.2 | 0.83 | 0.57 |
| + Our GroupDiff-4 (from scratch) | 800 | 1.66 | 279.4 | 0.83 | 0.57 |
| + Our GroupDiff-4 (finetune) | 1400+100 | 1.55 | 285.4 | 0.80 | 0.63 |
| SiT-XL/2 | 1400 | 2.06 | 270.3 | 0.82 | 0.59 |
| + SRA | 800 | 1.58 | 311.4 | 0.80 | 0.63 |
| + Dispersive Loss | 1200 | 1.97 | - | - | - |
| + Our GroupDiff-4 (from scratch) | 800 | 1.63 | 283.2 | 0.81 | 0.64 |
| + Our GroupDiff-4 (finetune) | 1400+100 | 1.40 | 290.7 | 0.79 | 0.64 |
We show uncurated generation results of GroupDiff-4 with guidance.