Lesson 2 • 3 min
Self-Attention
How patches relate to each other
Everyone looking at everyone
Imagine a room where everyone can see everyone else, and they're all updating their opinions based on what others think. Self-attention lets every part of the image "see" every other part.
In image generation, the picture is divided into patches (like tiles). Self-attention lets each patch weigh information from every other patch when computing its updated representation. This is what gives the generated image global coherence.
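To make "patches" concrete, here is a minimal NumPy sketch of the tiling step itself. The image, sizes, and variable names are illustrative, not from any particular library:

import numpy as np

# Illustrative: split a 1024x1024 grayscale image into 16x16 tiles
image = np.random.rand(1024, 1024)
patch = 16
grid = 1024 // patch                 # 64 patches per side

# Rearrange into (4096, 256): one flattened 16x16 patch per row
patches = (image.reshape(grid, patch, grid, patch)
                .transpose(0, 2, 1, 3)
                .reshape(grid * grid, patch * patch))
print(patches.shape)                 # (4096, 256)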
A minimal NumPy sketch of the computation, where the weight matrices stand in for learned linear layers:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Divide a 1024x1024 image into 16x16 patches
# That's 64x64 = 4096 patches
def self_attention(patches, W_q, W_k, W_v):
    # Each patch creates a query, key, and value vector
    # via learned linear projections
    Q = patches @ W_q  # "What am I looking for?"
    K = patches @ W_k  # "What do I contain?"
    V = patches @ W_v  # "What information do I offer?"

    # Every patch attends to every other patch; dividing by
    # sqrt(d_k) keeps the dot products in a stable range
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    attention_matrix = softmax(scores, axis=-1)  # 4096 x 4096

    # Aggregate information: each output row is a weighted
    # mix of all patches' values
    return attention_matrix @ V

Why is this powerful? A patch showing a cat's ear can attend to the patch showing its body, learning that the two should share the same fur color. Without self-attention, patches would be processed independently, and that coherence would be lost.
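To sanity-check the shapes, here is a quick, purely illustrative run with random inputs and small random projection weights, reusing the self_attention function above:

import numpy as np

rng = np.random.default_rng(0)
d = 256                                    # flattened 16x16 patch -> 256 values
patches = rng.standard_normal((4096, d))   # stand-in for real patch embeddings
W_q = rng.standard_normal((d, d)) * 0.02   # small random init, illustrative only
W_k = rng.standard_normal((d, d)) * 0.02
W_v = rng.standard_normal((d, d)) * 0.02

out = self_attention(patches, W_q, W_k, W_v)
print(out.shape)                           # (4096, 256): one updated vector per patch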
Quick Win
You understand self-attention: every patch considers every other patch, enabling global coherence in generated images.