Deep Dive into LLMs (Karpathy)

Published

March 1, 2025

These are my notes on Andrej Karpathy (2025). There isn’t much new information in this video if you’re already familiar with LLMs, but Karpathy’s presentation always makes it worth watching, and it’s a good refresher on the fundamentals.

Gathering data.

LLMs are trained on a dataset derived from the whole internet. The FineWeb dataset is a representative example of such data. FineWeb has 15 trillion tokens and takes up about 44 TB.

A dataset like this is built from sources like CommonCrawl and involves many processing steps: URL filtering, deduplication, language selection (e.g., keeping only pages that are at least ~60% English), and PII cleanup.
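
To make those steps concrete, here is a toy sketch of such a cleaning pipeline. Every heuristic below is a drastically simplified stand-in for the real thing (FineWeb uses a fastText language classifier, MinHash deduplication, and many more filters):

```python
import hashlib
import re

seen_hashes = set()

def is_english(text: str, threshold: float = 0.60) -> bool:
    # Stand-in for a real language classifier: fraction of ASCII letters.
    letters = sum(c.isascii() and c.isalpha() for c in text)
    return letters / max(len(text), 1) >= threshold

def is_duplicate(text: str) -> bool:
    # Exact dedup via hashing; real pipelines also do fuzzy (MinHash) dedup.
    h = hashlib.sha256(text.encode()).hexdigest()
    if h in seen_hashes:
        return True
    seen_hashes.add(h)
    return False

def scrub_pii(text: str) -> str:
    # Toy PII cleanup: mask anything that looks like an email address.
    return re.sub(r"\S+@\S+\.\S+", "<EMAIL>", text)

def clean(pages):
    for page in pages:
        if is_english(page) and not is_duplicate(page):
            yield scrub_pii(page)
```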

Tokenization.

The raw text is converted to tokens using byte-pair encoding (BPE). This algorithm repeatedly replaces the most common adjacent byte/token pair with a new token, until the desired vocabulary size is reached. GPT-4, for example, has a vocabulary size of 100,277. The process of tokenization can be visualized on the Tiktokenizer site.
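
A toy version of the training loop, assuming the textbook BPE algorithm (real tokenizers like tiktoken work the same way in principle but are far more optimized):

```python
from collections import Counter

def train_bpe(text: str, vocab_size: int):
    ids = list(text.encode("utf-8"))          # start from raw bytes (ids 0..255)
    merges = {}                                # (id, id) pair -> new token id
    next_id = 256
    while next_id < vocab_size:
        pairs = Counter(zip(ids, ids[1:]))     # count adjacent pairs
        if not pairs:
            break
        pair = pairs.most_common(1)[0][0]      # the most frequent pair
        merges[pair] = next_id
        # Replace every occurrence of `pair` with the new token id.
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
        next_id += 1
    return merges, ids
```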

Neural network training.

The input to the neural network is the context window of text, for example 8192 tokens. The output is a prediction of the next token. This prediction is expressed as a probability for each of the possible tokens (the entire vocabulary) to be the next token. Inference is simply sampling from this probability distribution.
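Concretely, sampling from that distribution might look like the sketch below. The temperature parameter is my addition (not discussed at this point in the video), and `logits` stands in for the network’s raw scores over the whole vocabulary:

```python
import math
import random

def sample_next_token(logits: list[float], temperature: float = 1.0) -> int:
    # Softmax turns raw scores into a probability distribution.
    scaled = [x / temperature for x in logits]
    m = max(scaled)                            # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Inference = sampling a token id from this distribution.
    return random.choices(range(len(probs)), weights=probs, k=1)[0]
```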

The structure of LLMs can be explored with Brendan Bycroft’s interactive LLM visualization (bbycroft.net/llm), which Karpathy demos in the video.

GPT-2.

This is the first “recognizably modern” LLM. It had 1.6B parameters, was trained on 100B tokens, and had a 1,024-token context window. The llm.c repo implements the entire training code in ~1,200 lines of C. GPT-2 originally cost ~$40,000 to train, but the same run can now be done for a few hundred dollars on rented Nvidia GPUs.

Llama 3 and base model inference.

A base model is just a “compression of the internet”: it “dreams internet pages”. It can’t answer questions; it will just start yapping even on simple prompts like “what is 2+2?”.

Hyperbolic is one place to run inference on base models.

A base model can regurgitate its training data sometimes. For example, if you give it the first sentence of a Wikipedia article it might complete the rest of the article exactly.

Even a base model can be made useful through in-context learning. For example, if the prompt is 10 pairs of English-Korean word translations, the model will pick up on the pattern and auto-complete the 11th word’s translation. You can also turn a base model into an assistant by giving it a long template of a human/assistant interaction.
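
For example, the few-shot prompt could be built and sent to a base-model completion endpoint like this. The URL, model name, and API key are placeholders (many hosts, including Hyperbolic, expose an OpenAI-style completions API, but check your provider’s docs):

```python
import requests

# Few-shot prompt: the base model picks up the pattern and completes it.
pairs = [("apple", "사과"), ("water", "물"), ("house", "집")]  # 10 pairs in practice
prompt = "\n".join(f"English: {en} -> Korean: {ko}" for en, ko in pairs)
prompt += "\nEnglish: book -> Korean:"

resp = requests.post(
    "https://api.example.com/v1/completions",   # placeholder endpoint
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"model": "base-model-name", "prompt": prompt, "max_tokens": 5},
)
print(resp.json()["choices"][0]["text"])        # expected: " 책"
```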

Post-training.

The base model is further trained on a new dataset to turn it into a conversational assistant. The data used for this has new special tokens like this:

<|im_start|>user<|im_sep|>What's a good name for a baby?<|im_end|>

where “im” stands for “imaginary monologue”.
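
A conversation is flattened into this token format before training or inference. A minimal sketch, assuming the <|im_start|>/<|im_sep|>/<|im_end|> scheme from the video:

```python
def render_chat(messages: list[dict]) -> str:
    # Flatten a conversation into the special-token format.
    out = []
    for msg in messages:
        out.append(f"<|im_start|>{msg['role']}<|im_sep|>{msg['content']}<|im_end|>")
    # Leave the assistant turn open so the model completes it at inference time.
    out.append("<|im_start|>assistant<|im_sep|>")
    return "".join(out)

print(render_chat([{"role": "user", "content": "What's a good name for a baby?"}]))
```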

The InstructGPT paper (Ouyang et al. 2022) discusses how this was done for GPT-3. OpenAI hired expert human labelers to write hundreds of thousands of prompts and responses. So when we talk to ChatGPT, it’s useful to think of it as chatting with an “instant simulation” of one of these humans, rather than with an omniscient, magical “AI”.

Extensive human labeling is no longer needed. We can now use LLMs themselves to produce the fine-tuning dataset. See UltraChat for an example.

Hallucinations and tool use.

Hallucinations were a big problem in the models from a couple of years ago, but they can be mitigated now. The mitigation works something like this:

  1. Take a random paragraph from Wikipedia and generate three factual questions based on it.
  2. Ask the model these questions to probe its knowledge.
  3. Augment the training dataset with examples of the model getting things right or wrong, based on what we learned about its knowledge. For example, if it doesn’t know who won the 1996 Cricket World Cup, add a training example where it answers “I don’t know” to that particular question. (A sketch of this loop follows the list.)
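
A minimal sketch of that loop, with hypothetical helpers: `make_questions` and `grade` would in practice be a stronger LLM, and `generate` is a call to the model being probed:

```python
def probe_knowledge(paragraph, generate, make_questions, grade):
    training_examples = []
    for question, true_answer in make_questions(paragraph, n=3):
        # Sample several answers to estimate whether the model really knows.
        answers = [generate(question) for _ in range(5)]
        if all(grade(answer, true_answer) for answer in answers):
            target = true_answer        # model knows the fact: keep it confident
        else:
            target = "I don't know."    # model is unsure: teach it to refuse
        training_examples.append((question, target))
    return training_examples
```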

The second enhancement/mitigation is tool use. Give the model training examples of the form <SEARCH_START> ... <SEARCH_END>, indicating that the model should call a tool. During inference, when tokens of that form appear, pause generation, run the tool, insert its output into the context, and continue generating.
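
A sketch of that inference-time loop; `generate_until` and `web_search` are hypothetical helpers, and the token names follow the convention above:

```python
def run_with_tools(prompt, generate_until, web_search):
    context = prompt
    while True:
        # Generate until the model either finishes or emits a search call.
        chunk = generate_until(context, stop="<SEARCH_END>")
        context += chunk
        if "<SEARCH_START>" in chunk:
            query = chunk.split("<SEARCH_START>")[-1]
            # Run the tool and splice its output into the context, then resume.
            context += "<SEARCH_END>" + f"[RESULT: {web_search(query)}]"
        else:
            return context              # no tool call: generation is complete
```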

References

Karpathy, Andrej. 2025. “Deep Dive into LLMs Like ChatGPT.” https://www.youtube.com/watch?v=7xTGNNLPyMI.
Ouyang, Long, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, et al. 2022. “Training Language Models to Follow Instructions with Human Feedback.” arXiv. https://doi.org/10.48550/arXiv.2203.02155.