
Building and Understanding Large Language Models from Internet Text to Conversational Assistant

by @andrejkarpathy

AI · ★★★★☆ · principles

ABOUT THIS SKILL

A comprehensive walk-through of how modern LLMs like ChatGPT are created, from scraping and filtering the internet to post-training the base model into a helpful assistant.

TECHNIQUES

tokenization · pre-training · post-training · few-shot prompting · in-context learning · regeneration sampling · conversation simulation

KEY PRINCIPLES (14)

Data Curation

High-quality, diverse text is essential for model knowledge.

Start with Common Crawl, then aggressively filter for quality, language, PII, and duplicates to shrink 2.7 B web pages down to ~44 TB of text.

Why: Garbage in, garbage out; the model can only learn what it sees.

"we want large diversity of high quality documents and we want many many of them"
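The filtering stages above can be sketched as a toy pipeline. This is a minimal illustration with hypothetical heuristics (the thresholds, the SSN-style PII regex, and the exact-hash dedup are my assumptions, not the actual Common Crawl/FineWeb pipeline; language detection is omitted since it needs a classifier):

```python
import hashlib
import re

def curate(pages):
    """Toy sketch of quality, PII, and duplicate filtering (hypothetical heuristics)."""
    seen = set()  # content hashes for exact-duplicate detection
    for url, text in pages:
        if len(text) < 200:                      # quality: drop near-empty pages
            continue
        if sum(c.isalpha() for c in text) / len(text) < 0.6:
            continue                             # quality: mostly non-prose content
        if re.search(r"\b\d{3}-\d{2}-\d{4}\b", text):
            continue                             # PII: crude SSN-style pattern
        h = hashlib.sha256(text.encode()).hexdigest()
        if h in seen:                            # dedup: skip exact duplicates
            continue
        seen.add(h)
        yield url, text

pages = [
    ("a.com", "short"),
    ("b.com", "A long enough page of readable English prose. " * 10),
    ("c.com", "A long enough page of readable English prose. " * 10),  # duplicate of b.com
]
kept = list(curate(pages))
print(len(kept))  # only b.com survives
```

Real pipelines use fuzzy (MinHash-style) deduplication and model-based quality classifiers, but the shape is the same: a cascade of cheap filters that discards most of the crawl.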

Tokenization

Text must be converted into a finite sequence of symbols the network can ingest.

Use byte-pair encoding to trade vocabulary size against sequence length; GPT-4's tokenizer has a vocabulary of 100,277 tokens.

Why: Neural nets expect fixed vocabularies and shorter sequences reduce compute.

"we have to decide what are the symbols and then we have to represent our data as one-dimensional sequence of those symbols"

Pre-training Objective

Train the network to predict the next token given preceding context.

Randomly sample windows of up to ~8,000 tokens, feed them into a Transformer, and iteratively nudge all 405 B parameters so the probability of the correct next token rises.

Why: Next-token prediction implicitly forces the model to learn grammar, facts, reasoning, and world knowledge.

"we want to model the statistical relationships of how these tokens follow each other in the sequence"
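The "nudge parameters so the correct next-token probability rises" step boils down to softmax cross-entropy and a gradient step. A minimal sketch with hand-set logits (the numbers are hypothetical; real models compute logits with a Transformer and update billions of weights via backpropagation):

```python
import math

# Hypothetical logits over a 4-token vocabulary for one context window
logits = [1.0, 0.5, -0.2, 0.1]
target = 0  # index of the true next token in the training data

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax(logits)
p_before = probs[target]
loss = -math.log(p_before)          # cross-entropy on the correct next token

# For softmax cross-entropy, d(loss)/d(logits) = probs - one_hot(target)
grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

# One SGD nudge: the correct token's logit rises, the others fall
lr = 0.5
logits = [x - lr * g for x, g in zip(logits, grad)]
p_after = softmax(logits)[target]
print(p_after > p_before)           # True: the true next token got more likely
```

Pre-training is this loop repeated over trillions of sampled windows until the loss stops falling.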

Compute Scaling

Training is fundamentally a GPU parallel-matrix-multiplication problem.

Modern runs use thousands of H100 GPUs in data centers; the cost of training a GPT-2-class model has fallen from ~$40k in 2019 to roughly $100 today thanks to better data and faster hardware.

Why: More flops → more tokens processed per second → faster convergence to lower loss.

"the more gpus you have the more tokens you can try to predict and improve on"

Base Model Nature

A base model is an internet-text simulator, not an assistant.

It remixes and hallucinates; answers depend on prompt statistics, not intent.

Why: It was trained only to continue documents, not to satisfy user goals.

"it is a token simulator right it's an internet text token simulator"

Knowledge Storage

All world knowledge is compressed into the parameter weights.

The 405 B parameters act like a lossy zip file of the ~15-trillion-token training set; frequent patterns are recalled accurately, rare ones fuzzily.

Why: Distributed representations allow generalization beyond verbatim regurgitation.

"these 405 billion parameters is a kind of compression of the internet"
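The "lossy zip" framing can be made concrete with back-of-envelope arithmetic from the figures above. Assuming bf16 weights at 2 bytes per parameter (my assumption; the source does not specify precision):

```python
# Model size: 405 billion parameters at 2 bytes each (bf16, assumed)
params = 405e9
model_bytes = params * 2            # ~0.81 TB of weights

# Training set: the ~44 TB of filtered text mentioned above
dataset_bytes = 44e12

ratio = dataset_bytes / model_bytes
print(round(ratio))                 # the weights are ~54x smaller than the text
```

A ~54× size reduction cannot store the text verbatim, which is why frequent facts come back crisply while rare ones come back fuzzily.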

Stochastic Generation

Sampling from the probability distribution makes outputs non-deterministic.

Each token choice is a biased coin flip; same prompt can yield different continuations.

Why: Randomness enables creative, varied outputs instead of memorized text.

"the system here is stochastic so for the same prefix of tokens we're always getting a different answer"
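The "biased coin flip" per token is just inverse-CDF sampling from the model's output distribution. A minimal sketch over a hypothetical 4-token vocabulary (the words and probabilities are made up for illustration):

```python
import random

vocab = ["the", "cat", "sat", "mat"]
probs = [0.5, 0.3, 0.15, 0.05]      # hypothetical next-token distribution

def sample(probs):
    r = random.random()             # the biased coin flip
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1           # guard against floating-point rounding

random.seed(0)
outs = {vocab[sample(probs)] for _ in range(20)}
print(outs)  # several distinct tokens from the same fixed distribution
```

Because every token is drawn this way, the same prompt can branch into many different continuations; greedy decoding (always taking the argmax) would instead make outputs deterministic.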

In-Context Learning

The model can learn tasks on the fly from prompt examples.

Supply 5–10 input-output pairs in the prompt; the base model continues the pattern without weight updates.

Why: Attention layers treat the prompt as additional training data.

"it is learning sort of in place that there's some kind of a algorithmic pattern going on in my data"
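A few-shot prompt makes this concrete: the input-output pairs define the task entirely in-context, and a base model is expected to continue the pattern with no weight updates (the English-French task here is a hypothetical example, not from the source):

```python
# Few-shot prompt construction: the examples ARE the task specification
examples = [("cheese", "fromage"), ("bread", "pain"), ("water", "eau")]

prompt = "\n".join(f"English: {en}\nFrench: {fr}" for en, fr in examples)
prompt += "\nEnglish: apple\nFrench:"   # the model should continue with "pomme"
print(prompt)
```

Fed to a base model, the statistically likely continuation of this document is the French translation, which is why few-shot prompting works without any fine-tuning.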

WHAT'S INSIDE

PRINCIPLES
14
TECHNIQUES
7
EXPERT QUOTES

This is a structured knowledge base — not a prompt file. Your AI retrieves principles semantically, understands the reasoning behind each technique, and connects to related skills via a knowledge graph.

Compatible with OpenClaw · Claude · ChatGPT

principles · semantic retrieval · knowledge graph

Free during beta · Sign in to save to dashboard