Lesson 3 • 2 min
DiT Architecture
Transformers for diffusion
Assembly line vs all-in-one machine
Earlier diffusion models used a U-Net, which works like an assembly line: the image is processed at several scales, then the results are combined. DiT (Diffusion Transformer) is like an all-in-one machine: everything flows through the same stack of transformer blocks.
DiT replaces the traditional U-Net backbone with a pure transformer. Why? Transformers scale predictably: make the model bigger and quality keeps improving, following the same scaling behavior seen in language models. They are also well understood and heavily optimized thanks to years of LLM research.
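To make "the same transformer blocks" concrete, here is a minimal sketch of the first step DiT performs: splitting the noisy latent image into patches and flattening each patch into a token, so the transformer can treat the image like a sequence. The latent shape and patch size below are illustrative assumptions, not values from the lesson.

import torch

# Hypothetical noisy latent: 4 channels, 32x32 (batch, channels, height, width)
latent = torch.randn(1, 4, 32, 32)
p = 2                                               # assumed patch size
patches = latent.unfold(2, p, p).unfold(3, p, p)    # (1, 4, 16, 16, 2, 2)
tokens = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 16 * 16, 4 * p * p)
print(tokens.shape)                                 # torch.Size([1, 256, 16]): 256 patch tokens

Each token is then projected to the transformer's hidden size and fed through a stack of blocks like the one below.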
Compare U-Net vs DiT architectures
DiT block
class DiTBlock:
    def forward(self, x, time_embed, text_embed):
        # 1. Self-attention (patches see each other)
        x = self_attention(x) + x
        # 2. Cross-attention (patches see text)
        x = cross_attention(x, text_embed) + x
        # 3. Feed-forward (process each patch)
        x = feed_forward(x) + x
        # Time embedding modulates everything
        # (model knows which denoising step it's on)
        return modulate(x, time_embed)
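Below is a runnable PyTorch sketch of the block above. It follows the lesson's simplified structure; the dimensions, head count, and the DiTBlockSketch/modulation names are illustrative assumptions, and the official DiT applies the time-conditioned scale and shift inside each sub-layer (adaptive LayerNorm) rather than once at the end as simplified here.

import torch
import torch.nn as nn

class DiTBlockSketch(nn.Module):
    # Illustrative sizes; real models use larger dims and many stacked blocks
    def __init__(self, dim=384, heads=6):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Time embedding -> per-channel scale and shift (adaLN-style modulation)
        self.modulation = nn.Sequential(nn.SiLU(), nn.Linear(dim, 2 * dim))

    def forward(self, x, time_embed, text_embed):
        # 1. Self-attention: image patches attend to each other
        x = self.self_attn(x, x, x)[0] + x
        # 2. Cross-attention: patches attend to the text tokens
        x = self.cross_attn(x, text_embed, text_embed)[0] + x
        # 3. Feed-forward: each patch processed independently
        x = self.ff(x) + x
        # Time embedding modulates the result so the block knows the denoising step
        scale, shift = self.modulation(time_embed).unsqueeze(1).chunk(2, dim=-1)
        return x * (1 + scale) + shift

x = torch.randn(1, 256, 384)     # 256 image-patch tokens
t = torch.randn(1, 384)          # timestep embedding
txt = torch.randn(1, 77, 384)    # text tokens from the prompt encoder
print(DiTBlockSketch()(x, t, txt).shape)   # torch.Size([1, 256, 384]), same shape as x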
Quick Win
You understand DiT: a transformer-based architecture that replaced the U-Net for better scaling and quality.