Learn Diffusion
Lesson 1 • 2 min

Words to Numbers

Tokenization basics

Think of a library catalog

Every book has a unique ID number (ISBN). When you search, the system uses IDs, not titles. Tokenization does the same: every word (or part of a word) gets a unique number.

But here's the twist: tokens aren't always full words. Common words like "the" get one token. Rare words get split into pieces. "Photorealistic" might become ["photo", "real", "istic"].
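
The splitting described above can be sketched with a greedy longest-match rule: at each position, take the longest piece found in the vocabulary. This is a simplification for illustration, not the algorithm of any particular model; real tokenizers typically learn their pieces from data. The names `toyVocab` and `splitIntoPieces`, the four vocabulary entries, and their IDs are all made up for this sketch.

```javascript
// Toy vocabulary: every piece and its ID are made up for illustration.
const toyVocab = new Map([
  ["the", 0],
  ["photo", 1],
  ["real", 2],
  ["istic", 3],
]);

// Split one lowercase word into the longest vocabulary pieces, left to right.
function splitIntoPieces(word, vocab) {
  const pieces = [];
  let start = 0;
  while (start < word.length) {
    let end = word.length;
    // Shrink the candidate until it matches a vocabulary entry.
    while (end > start && !vocab.has(word.slice(start, end))) {
      end -= 1;
    }
    if (end === start) {
      // Nothing matched here, so this sketch marks the whole word as unknown.
      return ["<unk>"];
    }
    pieces.push(word.slice(start, end));
    start = end;
  }
  return pieces;
}

console.log(splitIntoPieces("the", toyVocab)); // [ 'the' ]
console.log(splitIntoPieces("photorealistic", toyVocab)); // [ 'photo', 'real', 'istic' ]
```

A common word that sits in the vocabulary comes back as a single piece, while a longer word is covered by several smaller pieces, each of which has its own ID.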

Tokenization example
// Input text
const prompt = "a cat wearing sunglasses"

// After tokenization (illustrative IDs; actual values depend on the tokenizer)
const tokens = [64, 2857, 5765, 41031]
// "a" → 64
// "cat" → 2857
// "wearing" → 5765
// "sunglasses" → 41031

// Vocabulary size: ~50,000 tokens
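
To make the lookup concrete, here is a minimal sketch of encoding and decoding with a four-entry toy vocabulary. It reuses the IDs from the example above and assumes whitespace-separated words that are all in the vocabulary; `vocab`, `idToWord`, `encode`, and `decode` are hypothetical names, not part of any real tokenizer library.

```javascript
// Toy vocabulary reusing the IDs from the lesson's example.
const vocab = new Map([
  ["a", 64],
  ["cat", 2857],
  ["wearing", 5765],
  ["sunglasses", 41031],
]);

// Reverse lookup so IDs can be turned back into text.
const idToWord = new Map([...vocab].map(([word, id]) => [id, word]));

// Text to IDs: split on spaces and look each word up.
function encode(text) {
  return text.split(" ").map((word) => {
    if (!vocab.has(word)) throw new Error(`not in toy vocabulary: ${word}`);
    return vocab.get(word);
  });
}

// IDs back to text: look each ID up and rejoin with spaces.
function decode(ids) {
  return ids.map((id) => idToWord.get(id)).join(" ");
}

const ids = encode("a cat wearing sunglasses");
console.log(ids); // [ 64, 2857, 5765, 41031 ]
console.log(decode(ids)); // a cat wearing sunglasses
```

Because every ID maps to exactly one entry, the round trip from text to IDs and back is lossless for anything the vocabulary covers.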

Quick Win

You now understand tokenization: text becomes a sequence of integer IDs from a fixed vocabulary.