In this work, we explore an untapped signal in diffusion model inference. While all previous methods generate images independently at inference, we instead ask whether samples can be generated collaboratively. We propose Group Diffusion, which unlocks the attention mechanism to be shared across images rather than limited to the patches within a single image. This enables images to be jointly denoised at inference time, exploiting both intra- and inter-image correspondence. We observe a clear scaling effect: larger group sizes yield stronger cross-sample attention and better generation quality. Furthermore, we introduce a quantitative measure of this behavior and show that its strength closely correlates with FID. Built on standard diffusion transformers, our GroupDiff achieves up to a 32.2% FID improvement on ImageNet 256x256. Our work reveals cross-sample inference as an effective, previously unexplored mechanism for generative modeling.
Standard diffusion models process batches of images in parallel, but each image is denoised in isolation. We identify this as a missed opportunity: can images in a batch collaborate? Our method, Group Diffusion, denoises a group of related samples jointly.
This simple design allows our method to be seamlessly plugged into existing diffusion transformer architectures. In our implementation, we adopt DiT, SiT, and JiT as baselines to validate the effectiveness of Group Diffusion.
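To make the mechanism concrete, the sketch below shows one way a standard self-attention block in a diffusion transformer can be widened to a group: the token sequences of the images in a group are concatenated along the sequence dimension before attention, so every patch can attend to patches of the other images. The module name, its arguments, and the convention that consecutive batch entries form a group are our illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupAttention(nn.Module):
    """Self-attention shared across the images of a group (illustrative sketch).

    Tokens of the images in a group are concatenated along the sequence
    dimension, so attention covers both intra- and inter-image pairs.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, group_size: int) -> torch.Tensor:
        # x: (B, N, D) patch tokens; consecutive batch entries form a group.
        B, N, D = x.shape
        G = group_size
        # Merge each group into one long token sequence: (B // G, G * N, D).
        x = x.view(B // G, G * N, D)
        qkv = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B // G, G * N, self.num_heads, -1).transpose(1, 2)
                   for t in qkv)
        out = F.scaled_dot_product_attention(q, k, v)  # intra- and inter-image attention
        out = out.transpose(1, 2).reshape(B // G, G * N, D)
        out = self.proj(out)
        # Split back into per-image token sequences: (B, N, D).
        return out.view(B, N, D)
```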
During training, we dynamically construct groups of related images to enable effective cross-sample interaction. For each target image, we retrieve a set of semantically or visually similar candidates using cosine similarity on pre-trained embeddings (e.g., CLIP or DINO). We then randomly sample from these candidates to form a group, ensuring the model processes a coherent set of samples that can mutually aid the denoising process.
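The grouping can be sketched as a simple nearest-neighbor lookup over pre-computed embeddings. The helper below is a hypothetical illustration: it assumes an (N, D) matrix of L2-normalized CLIP or DINO features for the training set and draws each group from the anchor's top-k most similar images.

```python
import torch

def build_group(features: torch.Tensor, anchor_idx: int,
                group_size: int = 4, top_k: int = 32) -> torch.Tensor:
    """Return the indices of a group of related images for one anchor (sketch).

    features: (N, D) L2-normalized embeddings (e.g. CLIP or DINO) of the
    training set. The anchor is always included; the remaining members are
    sampled at random from its top-k cosine-similarity neighbors.
    """
    sims = features @ features[anchor_idx]        # cosine similarity, shape (N,)
    sims[anchor_idx] = float("-inf")              # exclude the anchor itself
    candidates = sims.topk(top_k).indices         # most similar images
    picks = candidates[torch.randperm(top_k)[: group_size - 1]]
    return torch.cat([torch.tensor([anchor_idx]), picks])
```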
We propose two variants: GroupDiff-f and GroupDiff-l. GroupDiff-f employs group attention for both the conditional and unconditional branches, whereas GroupDiff-l applies group attention exclusively to the unconditional branch. This distinction leads to a significant difference in training efficiency: since the unconditional branch is trained only about 10% of the time (the standard Classifier-Free Guidance dropout probability), GroupDiff-l can be trained with a group size of one for the vast majority of iterations (90%). This design makes GroupDiff-l computationally lightweight, with training cost comparable to the baseline.
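The efficiency argument can be made concrete with a sketch of one training step. Under the usual 10% condition dropout used for Classifier-Free Guidance, a GroupDiff-l-style model only pays for group attention when the label is dropped; the branching below is our illustration of that scheme, with `model`, the loss, and the pre-grouped batch left as assumed placeholders.

```python
import torch

P_UNCOND = 0.1  # standard CFG condition-dropout probability

def training_step(model, grouped_images, labels, group_size=4):
    """One illustrative GroupDiff-l-style training step (placeholder interfaces).

    Group attention is only used on the unconditional branch, which is sampled
    ~10% of the time; the other ~90% of steps run with group size 1.
    """
    if torch.rand(()).item() < P_UNCOND:
        # Unconditional branch: drop labels and denoise the group jointly.
        loss = model(grouped_images, labels=None, group_size=group_size)
    else:
        # Conditional branch: standard per-image denoising.
        loss = model(grouped_images, labels=labels, group_size=1)
    return loss
```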
Our experiments validate the core hypothesis: images can effectively collaborate during inference. By analyzing the internal attention weights, we observe that the model naturally establishes both intra-image and inter-image correspondence without explicit supervision.
We further investigate how the model utilizes the group context. We hypothesize two operating modes: a "distributed" mode, in which attention is spread roughly evenly across the samples in the group, versus a "neighbor-focused" mode, in which the model attends primarily to the most similar sample. To quantify this, we introduce the Cross-Sample Attention Score, which measures how strongly the attention peaks on the nearest neighbor:
\[
S_{\text{cross}} = \frac{P_{\text{cross-max}} - P_{\text{cross-mean}}}{P_{\text{cross-max}}},
\]
where \( P_{\text{cross-max}} \) and \( P_{\text{cross-mean}} \) denote the maximum and mean cross-sample attention weights from one image to the others in the group. A score close to 1 indicates attention is highly focused on a single related sample, while a score near 0 implies uniform attention distribution.
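To make the measurement concrete, here is a minimal sketch of the score for a single image, assuming attention weights already averaged over heads and using the normalized-gap form of the equation above; the function name, tensor layout, and the per-image aggregation (summing the attention mass sent to each image's tokens) are our assumptions.

```python
import torch

def cross_sample_attention_score(attn: torch.Tensor, group_size: int, idx: int) -> float:
    """Cross-Sample Attention Score of image `idx` in its group (illustrative).

    attn: (G*N, G*N) attention matrix over the concatenated token sequences of
    a group of size G >= 2, already averaged over heads. The cross-sample
    weight for another image is the attention mass sent to its tokens.
    """
    G, N = group_size, attn.shape[0] // group_size
    rows = attn[idx * N:(idx + 1) * N]                         # queries of image `idx`
    per_image = rows.view(N, G, N).sum(-1).mean(0)             # (G,) mass sent to each image
    cross = torch.cat([per_image[:idx], per_image[idx + 1:]])  # drop the image itself
    p_max, p_mean = cross.max(), cross.mean()
    return ((p_max - p_mean) / p_max).item()                   # ~1: focused, ~0: uniform
```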
Our analysis reveals that GroupDiff strongly encourages this neighbor-focused attention. Distinct clusters, corresponding to different group sizes, emerge in the plot; increasing the group size consistently produces stronger cross-sample attention. Moreover, even within each cluster, higher scores correlate with lower FID, confirming that the score reliably reflects generation quality.
Crucially, we discover that larger group sizes inherently encourage this neighbor-focused attention mechanism. This leads to a consistent scaling behavior: as we scale up the group size, the Cross-Sample Attention Score increases, directly translating to improved generation quality. We show selected examples of class-conditional generation using GroupDiff-f with group sizes of {1, 2, 4, 8} without CFG.
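As a usage illustration, joint generation only changes the shape of the batch handed to the denoiser. A hypothetical Euler sampling loop for a flow-matching backbone (as in SiT) might look like the sketch below; `model`, its signature, and the latent shape are placeholders rather than the actual interface.

```python
import torch

@torch.no_grad()
def sample_group(model, class_id: int, group_size: int = 4,
                 num_steps: int = 50, latent_shape=(4, 32, 32)):
    """Jointly denoise `group_size` images of one class with an Euler sampler (sketch)."""
    x = torch.randn(group_size, *latent_shape)
    labels = torch.full((group_size,), class_id, dtype=torch.long)
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        # The whole group is passed as one batch; group attention lets the
        # samples attend to each other at every denoising step.
        t = torch.full((group_size,), t_cur.item())
        v = model(x, t, labels, group_size=group_size)  # predicted velocity
        x = x + (t_next - t_cur) * v                    # Euler update toward t = 0
    return x
```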
We evaluate GroupDiff-4 on top of different baselines in two distinct settings, training from scratch and plug-and-play finetuning, and benchmark it against leading generative systems on ImageNet 256x256. The table below reports FID (lower is better), Inception Score (IS), Precision, and Recall; GroupDiff consistently improves performance across settings and outperforms state-of-the-art baselines.
| Method | Epochs | FID | IS | Prec. | Recall |
|---|---|---|---|---|---|
| With semantic feature distillation | | | | | |
| DDT-XL | 800 | 1.26 | 310.6 | 0.79 | 0.65 |
| SiT-XL/2 + REPA-E | 800 | 1.26 | 314.9 | 0.79 | 0.66 |
| SiT-XL/2 + REPA | 800 | 1.42 | 305.7 | 0.80 | 0.64 |
| + Our GroupDiff-4 (finetune) | 800+100 | 1.14 | 315.3 | 0.77 | 0.66 |
| Without semantic feature distillation | | | | | |
| ADM | 400 | 4.59 | 186.7 | 0.82 | 0.52 |
| VDM++ | - | 2.12 | 267.7 | - | - |
| LDM-4 | 1400 | 3.60 | 247.7 | 0.87 | 0.48 |
| MDTv2-XL/2 | 900 | 1.58 | 314.7 | 0.79 | 0.65 |
| VAR-d30 | 350 | 1.92 | 323.1 | 0.82 | 0.59 |
| LlamaGen-3B | - | 2.18 | 263.3 | 0.81 | 0.58 |
| RandAR-XXL | 300 | 2.15 | 322.0 | 0.79 | 0.62 |
| MaskDiT | 1600 | 2.28 | 276.6 | 0.89 | 0.61 |
| DiT-XL/2 | 1400 | 2.27 | 278.2 | 0.83 | 0.57 |
| + Our GroupDiff-4 (from scratch) | 800 | 1.66 | 279.4 | 0.83 | 0.57 |
| + Our GroupDiff-4 (finetune) | 1400+100 | 1.55 | 285.4 | 0.80 | 0.63 |
| SiT-XL/2 | 1400 | 2.06 | 270.3 | 0.82 | 0.59 |
| + SRA | 800 | 1.58 | 311.4 | 0.80 | 0.63 |
| + Dispersive Loss | 1200 | 1.97 | - | - | - |
| + Our GroupDiff-4 (from scratch) | 800 | 1.63 | 283.2 | 0.81 | 0.64 |
| + Our GroupDiff-4 (finetune) | 1400+100 | 1.40 | 290.7 | 0.79 | 0.64 |
We show uncurated generation results of GroupDiff-4 with guidance.