Learn Diffusion

Lesson 3 • 2 min

DiT Architecture

Transformers for diffusion

Assembly line vs all-in-one machine

Older diffusion models used a U-Net, which works like an assembly line: the image is processed at several scales, then the pieces are combined. DiT (Diffusion Transformer) is like an all-in-one machine: everything flows through the same stack of transformer blocks.

DiT replaced the traditional U-Net backbone with pure transformers. Why? Transformers scale predictably: make the model bigger and train it longer, and sample quality keeps improving. They are also well understood from language models, so the same tooling and scaling recipes carry over.


DiT block
import torch.nn as nn

class DiTBlock(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Time embedding -> per-token scale and shift (simplified modulation;
        # the real DiT uses adaLN and layer norms, omitted here for brevity)
        self.modulate = nn.Linear(dim, 2 * dim)

    def forward(self, x, time_embed, text_embed):
        # 1. Self-attention (patches see each other)
        x = self.self_attn(x, x, x)[0] + x

        # 2. Cross-attention (patches see the text tokens)
        x = self.cross_attn(x, text_embed, text_embed)[0] + x

        # 3. Feed-forward (process each patch independently)
        x = self.feed_forward(x) + x

        # Time embedding modulates everything
        # (so the model knows which denoising step it's on)
        scale, shift = self.modulate(time_embed).chunk(2, dim=-1)
        return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
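
Stacking the blocks

The block above is simply repeated to form the whole network. Here is a minimal sketch, not the paper's actual code, of how a full DiT forward pass might look: patchify the noisy latent into tokens, run them through a stack of identical DiTBlocks, then project back to pixel space. The name SimpleDiT and all the sizes (384-dim tokens, 12 blocks, 2x2 patches, 4 latent channels, 77 text tokens) are illustrative assumptions, and it reuses the DiTBlock defined above.

import torch
import torch.nn as nn

class SimpleDiT(nn.Module):
    def __init__(self, dim=384, depth=12, patch_size=2, channels=4):
        super().__init__()
        self.patch_size = patch_size
        # Patchify: every patch_size x patch_size patch becomes one token
        self.patchify = nn.Conv2d(channels, dim, patch_size, stride=patch_size)
        # The same block type, repeated; no down/upsampling as in a U-Net
        self.blocks = nn.ModuleList(DiTBlock(dim) for _ in range(depth))
        # Predict the noise for every pixel inside each patch
        self.to_pixels = nn.Linear(dim, channels * patch_size ** 2)

    def forward(self, noisy_latent, time_embed, text_embed):
        b, c, h, w = noisy_latent.shape
        p = self.patch_size
        x = self.patchify(noisy_latent)            # (B, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)           # (B, num_patches, dim)
        for block in self.blocks:
            x = block(x, time_embed, text_embed)
        x = self.to_pixels(x)                      # (B, num_patches, c*p*p)
        # Unpatchify back to an image-shaped noise prediction
        x = x.reshape(b, h // p, w // p, c, p, p)
        return x.permute(0, 3, 1, 4, 2, 5).reshape(b, c, h, w)

# Example usage with dummy tensors (sizes are illustrative)
model = SimpleDiT()
latent = torch.randn(2, 4, 32, 32)       # noisy latent images
t_emb = torch.randn(2, 384)              # time-step embedding
txt_emb = torch.randn(2, 77, 384)        # text-token embeddings
print(model(latent, t_emb, txt_emb).shape)  # torch.Size([2, 4, 32, 32])

Notice there is no downsampling or upsampling anywhere: unlike a U-Net, every token passes through the same blocks at the same resolution, which is why growing the model is mostly a matter of adding blocks and width.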

Quick Win

You understand DiT: a transformer-based architecture that replaced U-Net for better scaling and quality.