In this series of posts we'll follow along with Andrej Karpathy's series of videos about building language models. In this post we're implementing the first model from the video: a character-level bigram model.
Consider text in a language to be a stream of tokens. The set of all possible tokens is the vocabulary. A token can be any sequence of bytes. The simplest possible tokens are just single ASCII characters.
The task of language modeling is to learn the probability distribution:
\[
P(t_{n+1} | t_1 ... t_n)
\]
where:
\(t_1 ... t_n\) is the context, a sequence of previous tokens.
\(t_{n+1}\) is the next token.
Dataset
In this series we’ll use a dataset of Bollywood movie names with the goal of making more movie names (hence “makemore”).
import pandas as pd

def read_movie_names() -> list[str]:
    movies = pd.read_csv("data/movies.csv")

    def has_special_chars(name: str) -> bool:
        AZ = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ ")
        return len(set(name) - AZ) > 0

    return [
        n.upper()
        for n in list(movies.query("Language == 'hindi'")["Movie Name"])
        if not has_special_chars(n.upper())
    ]

MOVIES = read_movie_names()
print(f"Number of names: {len(MOVIES)}")
sample_names = "\n\t".join(MOVIES[5584:5589])
print(f"Sample names:\n\n\t{sample_names}")
Number of names: 12900
Sample names:
BAAZIGAR
PARDES
ANURAG MAURYA ACT
BLACK MARKET
AKHIYON SE GOLI MAARE
We’ll write a small utility function to “unit-test” code in this notebook.
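The verify helper itself isn't reproduced in this excerpt; here is a minimal sketch of what such a function might look like, assuming it simply prints a check mark or a cross for a boolean condition (as the cell outputs below suggest):

def verify(condition: bool) -> None:
    # Print a check mark when the condition holds and a cross otherwise.
    print("✅" if condition else "❌")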
The first model we build will be based on bigrams, two-character strings. The set of tokens will be: {'A', 'B', ..., 'Z', ' ', '.'}. The size of this vocabulary is 28. The special token . will be used to indicate the beginning and end of each word.
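As a small illustration (this snippet is mine, not from the post), wrapping a name in the special token and reading off consecutive pairs of characters gives its bigrams:

name = "." + "SHOLAY" + "."
bigrams = [name[i : i + 2] for i in range(len(name) - 1)]
# ['.S', 'SH', 'HO', 'OL', 'LA', 'AY', 'Y.']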
The probability function "learned" by this model will just be the frequency of bigrams occurring in the training set. For that we first need a way to turn tokens into numbers and vice versa.
Encoding and decoding
We’ll map tokens to numbers using the functions below. The unit-test assertions explain how they work.
def stoi(s: str) -> int:
    assert len(s) == 1
    if s == ".":
        return 27
    elif s == " ":
        return 26
    else:
        return ord(s) - ord("A")

verify(all([
    stoi('A') == 0,
    stoi('Z') == 25,
    stoi(' ') == 26,
    stoi('.') == 27
]))
✅
def itos(i: int) -> str:
    if i == 27:
        return "."
    elif i == 26:
        return " "
    else:
        return chr(i + ord("A"))

verify(all([
    itos(0) == 'A',
    itos(25) == 'Z',
    itos(26) == ' ',
    itos(27) == '.'
]))
✅
Training
The “training” of this model just involves counting the frequencies of each bigram in the training corpus (set of movie names) and computing probabilities by normalizing. The code below is vectorized using PyTorch for performance. The broadcasting semantics of PyTorch/NumPy code can be tricky, so we have to be careful when using arguments like keepdim=True. In the video Karpathy mentions how leaving out keepdim=True in the call to sum() runs without crashing but yields the wrong probabilities.
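To see that pitfall on a tiny example (this snippet is mine, not from the video): without keepdim the vector of row sums broadcasts along the wrong dimension, so the rows of the result no longer sum to 1.

import torch

counts = torch.tensor([[1., 3.],
                       [4., 2.]])

right = counts / counts.sum(1, keepdim=True)  # sums have shape (2, 1): each row is divided by its own total
wrong = counts / counts.sum(1)                # sums have shape (2,): broadcast as (1, 2), dividing columns instead

print(right.sum(1))  # tensor([1., 1.])
print(wrong.sum(1))  # tensor([0.7500, 1.3333]) -- rows are no longer probability distributions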
import torch

def train(corpus: list[str]) -> torch.tensor:
    N_TOKENS = 28
    counts = torch.zeros((N_TOKENS, N_TOKENS))
    for name in corpus:
        name = "." + name + "."
        for c, c_next in zip(name, name[1:]):
            counts[stoi(c), stoi(c_next)] += 1
    # Normalize the counts into probabilities on each row
    return counts / counts.sum(1, keepdim=True)

model = train(MOVIES)
We can visualize the probability distribution computed by this model.
import matplotlib.pyplot as plt

plt.imshow(model)
Sampling
Given a context (a single letter), we generate the next token (letter in this case) by taking a sample from the probability distribution of the next token conditioned on the context. This distribution is the row vector model[stoi(context), :] and multinomial takes a sample from it.
To generate a new movie name we start with the special token . and sample to get the next token. We sample again using this new token as the context, and keep going until the sampling returns the special token . again. We do this n times to generate many movie names.
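The sample helper that generate calls below isn't shown in this excerpt; a minimal sketch consistent with the description above, using torch.multinomial, might be:

def sample(model: torch.tensor, context: str) -> str:
    # Row of next-token probabilities conditioned on the single-character context.
    probs = model[stoi(context), :]
    # Draw one token index from that distribution and convert it back to a character.
    idx = torch.multinomial(probs, num_samples=1).item()
    return itos(idx)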
def generate(model: torch.tensor, n: int) -> list[str]:
    names = []
    for i in range(n):
        result = ""
        token = "."
        while True:
            token = sample(model, token)
            if token == '.':
                break
            result += token
        names.append(result)
    return names

generate(model, 5)
['BAAJJAADABHEEKAN I SHO PYAHOBRAATAKAISSUGUL ANTUS BAAUBARESWA',
'SASCAAAM E',
'ERON',
'EE DIJ BLOFASWHIAL',
'PUN']
Given that we're only using bigrams, it's unlikely that the model will even come up with real words, but we can see that it's getting close to generating syllables that kinda sound like Hindi/Urdu.
Loss function
Let us consider the entire set of movie names and call it the corpus. We can think of each bigram in it as a training example. Since our language models only look at their limited context, it doesn’t matter that these examples come from different “documents” (individual movie names).
Now consider an arbitrary language model \(M\) (i.e., not necessarily bigram). As defined above, \(M\) should allow us to compute the probability of any \((t_1 ... t_n, t_{n+1})\) sequence. That is, given a context \(t_1 ... t_n\), the probability of every possible next token \(t_{n+1}\).
Given such a model, we can define the likelihood that it produces our corpus \(c\). This likelihood is the probability that all of the (context, next_token) pairs found in the corpus occur together. Thus the likelihood is computed as the product of the model's probabilities for all such pairs.
\[
L = \prod_{i} p(x_i)
\]
where each \(p(x_i)\) is the probability of the \(i\)th training example.
A couple of things to note here:
When a bigram appears multiple times in the corpus, there is one term for each occurrence when computing \(L\).
Since we're using the product rule to multiply the probabilities, there is an implicit assumption that all training examples are independent. That is, we're assuming that the occurrence of one bigram has no influence on others.
Let's work this out for the bigram model. Consider the corpus SHOLAY. It consists of the bigrams {'.S', 'SH', 'HO', 'OL', 'LA', 'AY', 'Y.'}; we've added an extra token . to signify the beginning and end. Assume now that M is a Python function that returns the probability of the next letter given a letter, so for example M('S', 'H') is the probability that H follows an S. The likelihood of SHOLAY under this model is therefore:

L = M('.', 'S') * M('S', 'H') * M('H', 'O') * M('O', 'L') * M('L', 'A') * M('A', 'Y') * M('Y', '.')
(This might seem a bit pointless and circular for the bigram frequency language model, but it sets up a concept that will be useful later for the neural network models.)
We'll now make a couple of tweaks to make the floating-point math better behaved. Instead of working with likelihoods, which are very small numbers (because they are products of many numbers less than \(1\)), we will compute the negative log-likelihood.
The average negative log-likelihood is our loss function and has an intuitive meaning. It is a single number that allows us to measure how well the model has learnt to approximate its training set.
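Written out (using \(N\) here for the number of training examples), the loss is:
\[
\mathrm{loss} = -\frac{1}{N} \sum_{i=1}^{N} \log p(x_i)
\]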
We’ll use the loss function to train the simplest possible neural network to do the same job of predicting the next character. This “network” will consist of a single matrix of weights, \(W\).
We want the output of the network to be a probability vector, so that gives us part of the shape of \(W\).
\[
\mathrm{shape\,of}\,\, W = (?, 28)
\]
It doesn’t make sense to just feed the token IDs (like 5) to the network, because there is no ordering relationship among the tokens. That is, A is not in any way “lesser than” S for our purposes here. Thus we’ll one-hot encode the input token. This means that the shape of \(W\) will be (28, 28).
Finally, since the output of the network will have arbitrarily large numbers (because the parameters in \(W\) are arbitrary), we need to turn the output vector into a probability vector (sums to 1) by using the softmax operation.
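As a rough sketch of the forward pass described above, applied to a single training example (the variable names and the use of torch.nn.functional.one_hot are my own, not necessarily what the final training function will use):

import torch
import torch.nn.functional as F

N_TOKENS = 28
W = torch.randn((N_TOKENS, N_TOKENS), requires_grad=True)  # the single matrix of weights

# One (context, next token) training example, e.g. the bigram 'SH'.
x = F.one_hot(torch.tensor(stoi("S")), num_classes=N_TOKENS).float()  # one-hot input, shape (28,)
y = stoi("H")

logits = x @ W                        # shape (28,): one arbitrary real number per possible next token
probs = torch.softmax(logits, dim=0)  # exponentiate and normalize so the entries sum to 1
loss = -probs[y].log()                # negative log-likelihood of the correct next token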
We’ll now put all of this together into a single training function.