
Lesson 1 • 3 min

The Full Pipeline

End-to-end walkthrough

Let's trace what happens when you type "a cat wearing sunglasses on a beach" and hit generate. Every concept we've learned comes together.


Complete pipeline
import torch
from PIL import Image

# Assumes the pipeline components (tokenizer, text_encoder, transformer,
# scheduler, vae) have already been loaded.
def generate(prompt: str) -> Image.Image:
    # STEP 1: Tokenization
    # "a cat wearing sunglasses" → [1, 5847, 5765, 41031]
    tokens = tokenizer.encode(prompt)

    # STEP 2: Text Encoding
    # Token IDs → contextual embeddings
    # Each token becomes a 768-dim vector
    text_embed = text_encoder(tokens)

    # STEP 3: Start with Noise
    # Random Gaussian noise in latent space
    latent = torch.randn(1, 4, 128, 128)

    # STEP 4: Iterative Denoising (8 steps)
    for step in range(8):
        # S3-DiT predicts the noise to remove,
        # using self-attention + cross-attention over the text embeddings
        noise_pred = transformer(
            latent,
            text_embed,
            step
        )
        # Remove the predicted noise (see the scheduler sketch below)
        latent = scheduler.step(latent, noise_pred, step)

    # STEP 5: VAE Decode
    # 128×128 latent (4 channels) → 1024×1024 RGB: an 8× spatial upscale
    image = vae.decode(latent)

    return image
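
Step 4 is where most of the work happens, so it's worth peeking inside the scheduler.step call. The sketch below is illustrative only: the exact update rule depends on which scheduler is used, and sigmas is an assumed list of decreasing per-step noise levels, not part of the pipeline above.

Scheduler step (illustrative sketch)
import torch

# Hypothetical Euler-style update. sigmas holds one noise level per step,
# decreasing toward 0.0 (9 values for 8 steps).
def euler_step(latent: torch.Tensor, noise_pred: torch.Tensor,
               step: int, sigmas: list[float]) -> torch.Tensor:
    sigma, sigma_next = sigmas[step], sigmas[step + 1]
    # Move the latent along the predicted-noise direction, from the current
    # noise level down to the next one. Because sigma_next < sigma, this
    # subtracts a fraction of the predicted noise each step.
    return latent + (sigma_next - sigma) * noise_pred

After 8 of these updates the noise level has dropped to nearly zero and the latent is ready for the VAE decoder.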

Quick Win

You can now trace the full pipeline: tokenize → encode → noise → denoise × 8 → decode.
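
And if the five components were actually loaded, running the whole pipeline is a single call. One assumption here: vae.decode hands back a PIL image (as the type hint suggests), so the result can be saved directly.

Using the pipeline
image = generate("a cat wearing sunglasses on a beach")
# tokenize: prompt → token IDs
# encode:   one 768-dim vector per token
# noise:    1 × 4 × 128 × 128 latent
# denoise:  8 scheduler steps on that latent
# decode:   1024 × 1024 RGB image
image.save("cat-beach.png")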