
Building a Transformer Language Model from Scratch

by @andrejkarpathy

AI · ★★★★☆ · principles

ABOUT THIS SKILL

A deep dive into the mechanics of the Transformer architecture, explaining how self-attention enables tokens to communicate and how to train a character-level language model on Shakespeare.

TECHNIQUES

character-level tokenization · self-attention · matrix multiplication · batched training · cross-entropy loss · Adam optimizer · positional encoding · masked fill · softmax normalization

KEY PRINCIPLES (10)

Language Modeling

Language models predict the next token by modeling sequences probabilistically.

Given a sequence prefix, the model outputs a probability distribution over the next token, allowing multiple valid completions.

Why: Natural language is inherently ambiguous and context-dependent; probabilistic modeling captures this variability.

"chat GPT is a probabilistic system and for any one prompt it can give us multiple answers"
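A minimal sketch of what "probabilistic system" means in practice: turn the model's output scores (logits) into a probability distribution with softmax and sample from it, so the same prefix can yield different continuations. The vocabulary size and logit values here are hypothetical.

```python
import math
import random

def softmax(logits):
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits a model might emit over a 4-token vocabulary.
logits = [2.0, 1.0, 0.1, -1.0]
probs = softmax(logits)

# Sampling (rather than taking the argmax) is what lets one prompt
# produce multiple valid completions.
random.seed(0)
next_token = random.choices(range(len(probs)), weights=probs)[0]
```

Greedy decoding (argmax) would always pick token 0 here; sampling occasionally picks the lower-probability tokens, which is the behavior the quote describes.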

Transformer Architecture

The Transformer uses self-attention to let every token aggregate information from all previous tokens.

Each token produces a query and key; their dot-product yields affinities that weight how much each past token contributes.

Why: This allows parallel computation and long-range dependencies without recurrence, making training efficient and scalable.

"the Transformer neural network will look at the characters that I've highlighted and is going to predict that g is likely to come next"
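A toy illustration of the query/key mechanism described above, using hypothetical 2-d vectors: each token's query is dotted with the keys of every token up to and including itself, producing the raw affinities that (after scaling, masking, and softmax) weight the aggregation.

```python
# Hypothetical 2-d query and key vectors for three tokens.
queries = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
keys    = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Affinity of each token t (its query) with every token s <= t (their keys).
# Token t therefore gathers information only from itself and the past.
affinities = [[dot(queries[t], keys[s]) for s in range(t + 1)]
              for t in range(len(queries))]
```

In the full architecture these scores are scaled by 1/sqrt(head_size) and normalized with softmax; this sketch isolates just the dot-product step.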

Tokenization

Tokenization converts raw text into integer sequences via a vocabulary.

Character-level tokenization maps each character to an integer; sub-word tokenizers like BPE trade vocabulary size for shorter sequences.

Why: Neural networks operate on numeric tensors; tokenization bridges discrete text and continuous mathematics.

"tokenize the input text now when people say tokenize they mean convert the raw text as a string to some sequence of integers"
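The character-level scheme can be sketched in a few lines, following the video's approach: build the vocabulary from the unique characters of the corpus, then map characters to integers and back. The sample string is hypothetical.

```python
# Build a character-level vocabulary from a (toy) corpus.
text = "hii there"
chars = sorted(set(text))                       # unique characters, sorted
stoi = {ch: i for i, ch in enumerate(chars)}    # string -> integer
itos = {i: ch for ch, i in stoi.items()}        # integer -> string

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

ids = encode("hii there")
roundtrip = decode(ids)
```

Sub-word tokenizers like BPE follow the same encode/decode contract but with a vocabulary of tens of thousands of entries, producing shorter integer sequences.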

Training Efficiency

Training processes small random chunks rather than full sequences.

Chunks of length block_size are sampled; each chunk contains block_size individual examples (context → next token).

Why: GPU memory and compute are limited; chunking keeps utilization high while still covering the data distribution.

"we never actually feed entire text into a Transformer all at once that would be computationally very expensive"
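A minimal sketch of chunked sampling (the names `block_size`, `batch_size`, and `get_batch` follow the video's conventions; the toy data is a stand-in for the encoded text). Note how the targets are the inputs shifted by one, so each chunk packs `block_size` context-to-next-token examples.

```python
import random

data = list(range(100))   # stand-in for the integer-encoded corpus
block_size = 8            # context length of each chunk
batch_size = 4            # chunks per batch

def get_batch():
    xs, ys = [], []
    for _ in range(batch_size):
        i = random.randrange(len(data) - block_size)
        xs.append(data[i : i + block_size])            # inputs
        ys.append(data[i + 1 : i + block_size + 1])    # targets, shifted by one
    return xs, ys

random.seed(0)
xb, yb = get_batch()
```

Because the targets are shifted copies of the inputs, position t in a chunk supplies the example "given data[i : i+t+1], predict data[i+t+1]" at every t simultaneously.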

Positional Information

Positional encodings inject order information because self-attention is permutation-invariant.

A second embedding table maps each position index (0…block_size-1) to a vector that is added to the token embedding.

Why: Without positional cues, the model cannot distinguish “dog bites man” from “man bites dog.”

"we're not just encoding the identity of these tokens but also their position"
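The two-table scheme can be sketched with toy 3-d embeddings (the table values below are hypothetical): look up each token's identity vector, look up its position's vector, and add them elementwise. Repeating a token at two positions shows that the sums differ.

```python
block_size, vocab_size, n_embd = 4, 5, 3

# Hypothetical embedding tables; in a real model both are learned parameters.
tok_emb_table = [[float(v + d) for d in range(n_embd)] for v in range(vocab_size)]
pos_emb_table = [[0.1 * (p + d) for d in range(n_embd)] for p in range(block_size)]

tokens = [2, 2, 3]   # token 2 appears at positions 0 and 1
x = [[tok_emb_table[t][d] + pos_emb_table[p][d] for d in range(n_embd)]
     for p, t in enumerate(tokens)]
```

Without the position table, the two occurrences of token 2 would be indistinguishable to the permutation-invariant attention layers; with it, `x[0]` and `x[1]` differ.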

Causal Masking

Future tokens must not influence current predictions.

A lower-triangular mask sets the pre-softmax attention scores to −∞ at all future positions, so softmax assigns those positions zero weight.

Why: During generation the model only has access to past context; training must mirror this constraint.

"the token at the fifth location it should not communicate with tokens in the sixth seventh and eighth location"
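A sketch of the masking step with hypothetical uniform scores: future positions are overwritten with −∞ before softmax, which maps them to exactly zero weight, leaving each token attending only to itself and the past.

```python
import math

T = 4
# Hypothetical raw affinity scores; all equal, to make the effect easy to see.
scores = [[1.0] * T for _ in range(T)]

# Causal mask: for row t, positions s > t are in the future.
for t in range(T):
    for s in range(t + 1, T):
        scores[t][s] = float("-inf")

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]   # exp(-inf) == 0.0
    z = sum(exps)
    return [e / z for e in exps]

weights = [softmax(row) for row in scores]
```

Row 0 puts all its weight on itself; row 3 spreads weight uniformly over positions 0..3, and no row ever assigns weight to a future position.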

Matrix Multiplication Trick

Weighted averages across time can be computed in one batched matrix multiply.

Construct a lower-triangular weight matrix, normalize with softmax, then multiply with value vectors to aggregate past context.

Why: Loops over time steps are slow; the trick yields O(T²) parallel computation on GPU.

"we can do these averages in this incremental fashion because we just get um and we can manipulate that based on the elements of a"
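The trick in miniature, in pure Python for clarity: build a lower-triangular matrix, normalize its rows (uniform averaging here; softmax over masked scores in the real model), and one matrix multiply then produces every running average at once. The value vectors are hypothetical.

```python
T = 3
# Lower-triangular ones: row t "sees" positions 0..t.
tril = [[1.0 if s <= t else 0.0 for s in range(T)] for t in range(T)]
# Normalize each row to sum to 1, turning the multiply into an average.
wei = [[v / sum(row) for v in row] for row in tril]

values = [[2.0], [4.0], [6.0]]   # one (1-d) value vector per token

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

out = matmul(wei, values)        # out[t] = mean of values[0..t]
</imports>```

The same shape of computation, done with batched tensor multiplies on a GPU, replaces an O(T) Python loop per token with one parallel operation.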

Loss Function

Cross-entropy measures how well predicted logits match the true next token.

After reshaping logits to (B*T, vocab_size) and targets to (B*T), cross-entropy gives a single scalar loss.

Why: It directly optimizes the negative log-likelihood, aligning with the probabilistic objective.

"loss is the cross entropy on the predictions and the targets"
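A stdlib sketch of that reshape-then-score step, with hypothetical logits: flatten (B, T, V) logits to (B*T, V) and (B, T) targets to (B*T), then average the negative log-probability of each true token.

```python
import math

B, T, V = 1, 2, 3
logits = [[[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]]   # hypothetical model output, (B, T, V)
targets = [[0, 1]]                               # true next tokens, (B, T)

# Reshape to (B*T, V) and (B*T,), as described above.
flat_logits = [row for batch in logits for row in batch]
flat_targets = [t for batch in targets for t in batch]

def log_softmax(row):
    m = max(row)
    z = math.log(sum(math.exp(v - m) for v in row))
    return [v - m - z for v in row]

# Cross-entropy = mean negative log-probability assigned to the true token.
losses = [-log_softmax(row)[t] for row, t in zip(flat_logits, flat_targets)]
loss = sum(losses) / len(losses)
```

Each position already puts most probability mass on its target here, so the loss is small but nonzero; a perfectly confident model would drive it toward 0.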

WHAT'S INSIDE

PRINCIPLES: 10 · TECHNIQUES: 9 · EXPERT QUOTES

This is a structured knowledge base — not a prompt file. Your AI retrieves principles semantically, understands the reasoning behind each technique, and connects to related skills via a knowledge graph.

Compatible with OpenClaw · Claude · ChatGPT

principles · semantic retrieval · knowledge graph

Free during beta · Sign in to save to dashboard