Lesson 2 • 3 min
Self-Attention
How patches relate to each other
Everyone looking at everyone
Imagine a room where everyone can see everyone else, and they're all updating their opinions based on what others think. Self-attention lets every part of the image "see" every other part.
In image generation, the picture is divided into patches (like tiles). Self-attention lets each patch weigh information from every other patch when computing its updated representation. This is what gives the generated image global coherence.
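To make "patches" concrete, here is a minimal NumPy sketch of the tiling step itself. The image, sizes, and variable names are illustrative, not from any particular library:

import numpy as np

# Illustrative: split a 1024x1024 grayscale image into 16x16 tiles
image = np.random.rand(1024, 1024)
patch = 16
grid = 1024 // patch                 # 64 patches per side

# Rearrange into (4096, 256): one flattened 16x16 patch per row
patches = (image.reshape(grid, patch, grid, patch)
                .transpose(0, 2, 1, 3)
                .reshape(grid * grid, patch * patch))
print(patches.shape)                 # (4096, 256)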
A minimal NumPy sketch of the computation, where the weight matrices stand in for learned linear layers:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Divide a 1024x1024 image into 16x16 patches
# That's 64x64 = 4096 patches
def self_attention(patches, W_q, W_k, W_v):
    # Each patch creates a query, key, and value vector
    # via learned linear projections
    Q = patches @ W_q  # "What am I looking for?"
    K = patches @ W_k  # "What do I contain?"
    V = patches @ W_v  # "What information do I offer?"

    # Every patch attends to every other patch; dividing by
    # sqrt(d_k) keeps the dot products in a stable range
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    attention_matrix = softmax(scores, axis=-1)  # 4096 x 4096

    # Aggregate information: each output row is a weighted
    # mix of all patches' values
    return attention_matrix @ V

Why is this powerful? A patch showing a cat's ear can attend to the patch showing its body, learning that the two should share the same fur color. Without self-attention, patches would be processed independently, and that coherence would be lost.
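To sanity-check the shapes, here is a quick, purely illustrative run with random inputs and small random projection weights, reusing the self_attention function above:

import numpy as np

rng = np.random.default_rng(0)
d = 256                                    # flattened 16x16 patch -> 256 values
patches = rng.standard_normal((4096, d))   # stand-in for real patch embeddings
W_q = rng.standard_normal((d, d)) * 0.02   # small random init, illustrative only
W_k = rng.standard_normal((d, d)) * 0.02
W_v = rng.standard_normal((d, d)) * 0.02

out = self_attention(patches, W_q, W_k, W_v)
print(out.shape)                           # (4096, 256): one updated vector per patch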
Quick Win
You understand self-attention: every patch considers every other patch, enabling global coherence in generated images.