Lesson 1 • 3 min
The Full Pipeline
End-to-end walkthrough
Let's trace what happens when you type "a cat wearing sunglasses on a beach" and hit generate. Every concept we've learned comes together.
Complete pipeline
import torch

# Assumes tokenizer, text_encoder, transformer, scheduler, and vae are
# already loaded, and that Image names the output image type.
def generate(prompt: str) -> Image:
    # STEP 1: Tokenization
    # "a cat wearing sunglasses" → [1, 5847, 5765, 41031]
    tokens = tokenizer.encode(prompt)

    # STEP 2: Text Encoding
    # Token IDs → contextual embeddings
    # Each token becomes a 768-dim vector
    text_embed = text_encoder(tokens)

    # STEP 3: Start with Noise
    # Random Gaussian noise in latent space
    latent = torch.randn(1, 4, 128, 128)

    # STEP 4: Iterative Denoising (8 steps)
    for step in range(8):
        # S3-DiT predicts noise to remove
        # Uses self-attention + cross-attention
        noise_pred = transformer(
            latent,
            text_embed,
            step
        )
        # Remove predicted noise
        latent = scheduler.step(latent, noise_pred, step)

    # STEP 5: VAE Decode
    # 128×128 latent → 1024×1024 RGB
    image = vae.decode(latent)
    return image

Quick Win
You can now trace the full pipeline: tokenize → encode → noise → denoise × 8 → decode.
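To check the numbers in the pipeline, here is a small sketch that traces only the tensor shapes through each stage, without running any model. It uses the sizes quoted in the walkthrough (768-dim embeddings, a 1×4×128×128 latent, 8 denoising steps, a 1024×1024 RGB output). The names `trace_shapes` and `VAE_SCALE`, the batch-first and channels-first layouts, and the 4-token prompt length are assumptions made for illustration, not part of the lesson's code.

```python
from math import prod

# Sizes taken from the walkthrough. VAE_SCALE is inferred from 1024 / 128.
LATENT_SHAPE = (1, 4, 128, 128)  # assumed layout: batch, channels, height, width
EMBED_DIM = 768                  # width of each token's text embedding
NUM_STEPS = 8                    # denoising iterations
VAE_SCALE = 8                    # assumed spatial upsampling factor of the decoder
RGB_CHANNELS = 3


def trace_shapes(num_tokens: int) -> list[tuple[str, tuple[int, ...]]]:
    """Return (stage, shape) pairs for one pass through the pipeline."""
    batch, _, height, width = LATENT_SHAPE
    stages = [
        ("tokens", (batch, num_tokens)),
        ("text_embed", (batch, num_tokens, EMBED_DIM)),
        ("noise latent", LATENT_SHAPE),
    ]
    # Denoising keeps the latent shape fixed; only the values change.
    stages += [(f"denoise step {i + 1}", LATENT_SHAPE) for i in range(NUM_STEPS)]
    stages.append(
        ("image", (batch, RGB_CHANNELS, height * VAE_SCALE, width * VAE_SCALE))
    )
    return stages


stages = trace_shapes(num_tokens=4)
for name, shape in stages:
    print(f"{name:>15}: {shape}")

# How many output image values each latent value has to account for.
compression = prod(stages[-1][1]) / prod(LATENT_SHAPE)
print(f"image values per latent value: {compression:.0f}")
```

The point of the trace is that all 8 denoising steps operate on the small latent, and the full-resolution image only appears at the final decode, which is why working in latent space is cheaper than denoising pixels directly.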