Building the GPT Tokenizer with Byte Pair Encoding
by @andrejkarpathy
ABOUT THIS SKILL
Tokenization is the necessary but gnarly bridge between raw text and transformer inputs; many LLM oddities trace back to tokenizer choices.
TECHNIQUES
KEY PRINCIPLES (12)
Tokens are the atomic unit of large language models.
Everything—training data size, context length, embedding tables—is expressed in tokens. GPT-2 used 50,257 tokens; GPT-4 uses ~100k.
Why: The transformer sees and attends to discrete tokens, not characters or bytes.
"tokens are this like fundamental unit um the atom of uh large language models if you will"
Character-level tokenization is simple but inefficient.
A 65-character vocabulary trained on Shakespeare yields exactly one token per character, so 1,000 characters become 1,000 tokens.
Why: Long sequences exhaust context length and waste attention capacity.
"here we had a very naive tokenization process that was a character level tokenizer"
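The character-level scheme above can be sketched in a few lines (a minimal illustration; the variable names are ours, not the lecture's):

```python
# Character-level tokenizer sketch: every distinct character is one token.
text = "hii there"
chars = sorted(set(text))                 # tiny vocabulary (65 chars for Shakespeare)
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    return [stoi[c] for c in s]           # one token per character

def decode(ids):
    return "".join(itos[i] for i in ids)

ids = encode(text)
assert decode(ids) == text
assert len(ids) == len(text)              # 1,000 characters would cost 1,000 tokens
```

The round trip is lossless, but the 1:1 token-to-character ratio is exactly the inefficiency BPE is designed to remove.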
BPE iteratively merges the most frequent adjacent byte pairs.
Start with 256 raw bytes; repeatedly replace the top pair with a new token ID, growing vocabulary and shrinking sequence length.
Why: Achieves a tunable trade-off between sequence length and vocabulary size.
"we iteratively find the pair of uh tokens that occur the most frequently and then once we've identified that pair we repl replace that pair with just a single new token"
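One BPE merge step can be sketched as two small helpers, in the spirit of the lecture (a minimal sketch; not a verbatim copy of any particular implementation):

```python
def get_stats(ids):
    # count occurrences of each adjacent pair
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    # replace every occurrence of `pair` with the new token id `idx`
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = [1, 2, 3, 1, 2]
stats = get_stats(ids)
top = max(stats, key=stats.get)           # most frequent pair: (1, 2)
assert merge(ids, top, 256) == [256, 3, 256]
```

Training repeats this loop: count pairs, pick the most frequent, mint a new token id, and record the merge. Each iteration grows the vocabulary by one and shortens the sequence.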
Training the tokenizer is a separate, one-time preprocessing stage.
The tokenizer has its own training corpus; the merges dictionary is frozen before LLM training begins.
Why: Allows curated data mixtures (code, multilingual text) to optimize token density per domain.
"tokenizer is a completely separate object from the large language model itself"
Decoding concatenates byte fragments then UTF-8 decodes with error replacement.
Invalid UTF-8 sequences produced by the model are replaced with the Unicode replacement character.
Why: Not every byte sequence is valid UTF-8; strict decoding would crash generation.
"the standard practice is to basically uh use errors equals replace"
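Decoding can be sketched as byte concatenation followed by lenient UTF-8 decoding (a minimal sketch; the toy merged token is illustrative):

```python
# Decoding sketch: map each token id to its byte fragment, concatenate,
# then UTF-8 decode with errors="replace" so invalid sequences can't crash.
vocab = {i: bytes([i]) for i in range(256)}   # base vocabulary: the raw bytes
vocab[256] = vocab[104] + vocab[105]          # illustrative merged token: b"hi"

def decode(ids):
    b = b"".join(vocab[i] for i in ids)
    return b.decode("utf-8", errors="replace")

assert decode([256, 33]) == "hi!"
# a lone continuation byte is not valid UTF-8; it becomes U+FFFD
assert decode([128]) == "\ufffd"
```

With strict decoding, a single malformed byte emitted by the model would raise an exception mid-generation; `errors="replace"` degrades gracefully instead.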
Encoding walks merges in training order to guarantee determinism.
Greedy left-to-right merging respects the order in which merges were learned; stops when no mergeable pairs remain.
Why: Ensures identical token sequences for the same text across runs.
"we want to find the pair or like the a key inside stats that has the lowest index in the merges"
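The encoding loop can be sketched as follows (a self-contained sketch with a toy merges table; helper names are illustrative):

```python
# Encoding sketch: repeatedly apply the earliest-learned merge present in the
# sequence (lowest index in `merges`), exactly mirroring training order.
merges = {(104, 105): 256, (256, 33): 257}    # toy table: "hi" -> 256, "hi!" -> 257

def merge(ids, pair, idx):
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(idx); i += 2
        else:
            out.append(ids[i]); i += 1
    return out

def encode(text):
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        # pick the pair with the lowest merge index; inf if not mergeable
        pair = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break                              # no mergeable pairs remain
        ids = merge(ids, pair, merges[pair])
    return ids

assert encode("hi!") == [257]                  # (104,105)->256 first, then (256,33)->257
```

Because the `min` over merge indices always replays merges in the order they were learned, the same text deterministically produces the same token sequence.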
Whitespace handling dramatically affects code performance.
GPT-2 tokenized every Python indent space individually, bloating sequences; GPT-4 groups multiple spaces into single tokens.
Why: Fewer tokens per line of code let the model fit roughly twice as much code into the same context window.
"the improvement in the python coding ability from GPT-2 to GPT-4 is not just a matter of the language model... a lot of the improvement here is also coming from the design of the tokenizer"
Case and position sensitivity create spurious token splits.
"egg" at start of string = 2 tokens; " egg" with leading space = 1 token; capitalized "Egg" = different token.
Why: The model must learn equivalence purely from co-occurrence statistics.
"for the same concept egg depending on if it's in the beginning of a sentence at the end of a sentence lowercase uppercase or mixed all this will be uh basically very different tokens"