Makemore 1A: exercises

Author

Vikas Gorur

Published

August 9, 2025

This is a follow-up to the post makemore 1. In this post we’ll work through the exercises suggested in the video.

E01

Train a trigram language model, i.e. take two characters as an input to predict the 3rd one. Feel free to use either counting or a neural net. Evaluate the loss; Did it improve over a bigram model?

We’ll rewrite the counting-based training code to be generic so that it works for n-grams of any size. First, a function to return all the n-gram training examples as a tensor.

def ngrams(corpus: list[str], n: int) -> torch.Tensor:
    "Returns the n-grams present in the corpus as a tensor, one row per n-gram"
    # Add start/end tokens to each name
    padded_names = ["." * (n-1) + name + "." for name in corpus]
    
    # Initialize tensor to store all n-grams
    # Each n-gram will be represented as a row of n integers (STOI mappings)
    total_ngrams = sum(len(name) - n + 1 for name in padded_names)
    result = torch.zeros((total_ngrams, n), dtype=torch.long)
    
    # Fill the tensor with n-grams
    idx = 0
    for name in padded_names:
        for i in range(len(name) - n + 1):
            # Extract the n-gram and convert each character to its STOI index
            ngram = name[i:i+n]
            for j, char in enumerate(ngram):
                result[idx, j] = STOI[char]
            idx += 1
            
    return result
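To make the padding and windowing concrete, here is a minimal sketch on a toy corpus. The three-character VOCAB and the helper name toy_ngrams are assumptions for illustration only; the real STOI comes from the previous post.

```python
import torch

# Toy vocabulary, assumed for illustration only
VOCAB = [".", "a", "b"]
STOI = {ch: i for i, ch in enumerate(VOCAB)}

def toy_ngrams(corpus, n):
    "Same padding and sliding-window idea as ngrams(), built from a list of rows"
    rows = []
    for name in corpus:
        # n-1 start tokens and one end token
        padded = "." * (n - 1) + name + "."
        for i in range(len(padded) - n + 1):
            rows.append([STOI[ch] for ch in padded[i:i + n]])
    return torch.tensor(rows, dtype=torch.long)

trigrams = toy_ngrams(["ab"], 3)
print(trigrams)
# "..ab." yields the windows "..a", ".ab", "ab."
# tensor([[0, 0, 1],
#         [0, 1, 2],
#         [1, 2, 0]])
```

Each name of length k contributes k + 1 rows regardless of n, so the bigram and trigram datasets have the same number of examples.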

Next, given the training set of ngrams, return the tensor of probabilities:

def train_ngram(ngrams: torch.Tensor, n: int) -> torch.Tensor:
    """
    Train an n-gram model using pre-computed n-grams tensor.
    
    Args:
        ngrams: Tensor of shape (N, n) where N is number of n-grams and
                each row contains n STOI indices representing an n-gram
        n: The size of n-grams (e.g., 2 for bigrams, 3 for trigrams, etc.)
    
    Returns:
        Tensor of shape (V, V, ..., V) containing normalized n-gram
        probabilities, where V is vocabulary size
    """
    N_TOKENS = len(VOCAB)
    
    # Create a dense tensor of counts with n dimensions, each of size N_TOKENS.
    # index_put_ with accumulate=True adds 1 for every occurrence of an n-gram.
    shape = tuple([N_TOKENS] * n)
    counts = torch.zeros(shape)
    
    # Split the ngrams tensor into n columns for index_put_
    indices = tuple(ngrams[:, i] for i in range(n))
    counts.index_put_(
        indices,
        torch.ones(len(ngrams)),
        accumulate=True
    )
    
    # Normalize the counts into probabilities
    # Add a small epsilon to avoid division by zero
    # Sum over the last dimension for normalization
    return counts / (counts.sum(-1, keepdim=True) + 1e-10)
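The key step above is index_put_ with accumulate=True, which adds up repeated n-grams instead of overwriting them. A small sketch on made-up bigrams (the vocabulary size of 2 and the data are assumptions for illustration):

```python
import torch

V = 2  # toy vocabulary size (assumption)
bigrams = torch.tensor([[0, 1], [0, 1], [1, 0]])  # (0, 1) appears twice

counts = torch.zeros((V, V))
indices = tuple(bigrams[:, i] for i in range(2))
counts.index_put_(indices, torch.ones(len(bigrams)), accumulate=True)
print(counts)
# tensor([[0., 2.],
#         [1., 0.]])

# Plain indexed addition does not accumulate duplicate indices
naive = torch.zeros((V, V))
naive[indices] += 1
print(naive[0, 1].item())  # 1.0, the repeated bigram is counted only once

# Normalize each row (the last dimension) into probabilities
probs = counts / (counts.sum(-1, keepdim=True) + 1e-10)
```

This is why the counting code cannot simply use counts[indices] += 1.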

Finally, the loss function for n-grams.

def ngram_loss(model: torch.Tensor, data: torch.Tensor, n: int) -> float:
    indices = tuple(data[:, i] for i in range(n))
    
    probs = model[indices]
    logprobs = torch.log(probs + 1e-10)
    
    return -logprobs.mean().item()
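Written out, the quantity this function computes is the average negative log-likelihood over the N rows of data. The notation is mine: \(c^{(i)}_j\) is the j-th character of the i-th n-gram and \(\epsilon\) is the small constant added in the code.

\[
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log\left( P\left(c^{(i)}_n \mid c^{(i)}_1, \ldots, c^{(i)}_{n-1}\right) + \epsilon \right), \qquad \epsilon = 10^{-10}
\]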

Now we can train a trigram model and compute its loss.

bigrams = ngrams(MOVIES, 2)
trigrams = ngrams(MOVIES, 3)

bmodel = train_ngram(bigrams, 2)
tmodel = train_ngram(trigrams, 3)

print(f"bigram loss = {ngram_loss(bmodel, bigrams, 2)}")
print(f"trigram loss = {ngram_loss(tmodel, trigrams, 3)}")
bigram loss = 2.5086231231689453
trigram loss = 2.1968438625335693

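The loss is lower for the trigram model, as we would expect: conditioning on two characters instead of one gives it the opportunity to pick up more of the structure present in the dataset. Note that both losses here are measured on the same data the models were trained on.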

E02

Split up the dataset randomly into 80% train set, 10% dev set, 10% test set. Train the bigram and trigram models only on the training set. Evaluate them on dev and test splits. What can you see?

We’ll write a function to split the dataset.

def split_dataset(
        X: torch.Tensor,
        train: float,
        dev: float,
        test: float
    ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:

    "Split X by row into the 3 datasets"

    assert abs(train + dev + test - 1.0) < 1e-5, "Proportions must sum to 1"
    n = len(X)
    
    # Calculate indices for splits
    train_idx = int(n * train)
    dev_idx = train_idx + int(n * dev)
    
    # Create random permutation of indices
    perm = torch.randperm(n)
    
    # Split the data using the permuted indices
    train_data = X[perm[:train_idx]]
    dev_data = X[perm[train_idx:dev_idx]]
    test_data = X[perm[dev_idx:]]
    
    return train_data, dev_data, test_data
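The same permutation-and-slice idea on ten toy rows, as a quick sanity check of the split sizes. The seed and the toy tensor are assumptions, added only to make the sketch repeatable.

```python
import torch

torch.manual_seed(0)  # assumed seed, only to make the split repeatable

X = torch.arange(20).reshape(10, 2)  # 10 toy rows
n = len(X)
perm = torch.randperm(n)

train_end = int(n * 0.8)
dev_end = train_end + int(n * 0.1)

train_rows = X[perm[:train_end]]
dev_rows = X[perm[train_end:dev_end]]
test_rows = X[perm[dev_end:]]  # any rounding remainder lands in the test split

print(len(train_rows), len(dev_rows), len(test_rows))  # 8 1 1
```

Because every row index appears exactly once in perm, the three splits are disjoint and together cover the whole dataset.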

Now we’ll write a function to train an n-gram model on just the training set and compute the loss for all the splits.

def train_dev_test(n: int):
    data = ngrams(MOVIES, n)
    train_set, dev_set, test_set = split_dataset(data, 0.8, 0.1, 0.1)

    model = train_ngram(train_set, n)

    train_loss = ngram_loss(model, train_set, n)
    dev_loss = ngram_loss(model, dev_set, n)
    test_loss = ngram_loss(model, test_set, n)

    print(f"Training set loss: {train_loss:.4f}")
    print(f"Development set loss: {dev_loss:.4f}")
    print(f"Test set loss: {test_loss:.4f}")

For bigrams:

train_dev_test(2)
Training set loss: 2.5090
Development set loss: 2.5243
Test set loss: 2.5005

For trigrams:

train_dev_test(3)
Training set loss: 2.1925
Development set loss: 2.3611
Test set loss: 2.3867

What do we learn from this?

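The gap between training loss and dev/test loss is much larger for the trigram model (roughly 0.17 to 0.19) than for the bigram model (under 0.02). I interpret this as overfitting: the trigram model has far more parameters (28³ versus 28² table entries), so it picks up “more” of the structure that’s present specifically in the training set, and that structure doesn’t fully carry over to the dev and test sets.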
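One factor that may contribute to the trigram gap is n-grams that appear in the dev or test set but never in the training set. This is my reading of the loss function, not something measured above. The toy model and data below are made up to show how large the penalty for a single unseen n-gram is.

```python
import math
import torch

# Toy bigram "model": context 0 was only ever followed by token 1 in training
model = torch.tensor([[0.0, 1.0],
                      [0.5, 0.5]])

seen = torch.tensor([[0, 1]])    # appeared in training
unseen = torch.tensor([[0, 0]])  # never appeared in training

def nll(model, data):
    "Average negative log-likelihood, with the same epsilon as ngram_loss"
    probs = model[data[:, 0], data[:, 1]]
    return -torch.log(probs + 1e-10).mean().item()

seen_loss = nll(model, seen)      # essentially 0
unseen_loss = nll(model, unseen)  # about 23.03, i.e. -ln(1e-10)
print(seen_loss, unseen_loss)
```

A handful of such examples can noticeably raise an average loss that otherwise sits around 2.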

E03

Use the dev set to tune the strength of smoothing (or regularization) for the trigram model - i.e. try many possibilities and see which one works best based on the dev set loss. What patterns can you see in the train and dev set loss as you tune this strength? Take the best setting of the smoothing and evaluate on the test set once and at the end. How good of a loss do you achieve?

We will control regularization by replacing the counts initialization in the train_ngram function with this line:

counts = torch.full(shape, r)

This fills the counts tensor with a provided value r instead of zero, so every possible n-gram starts with a pseudo-count of r (additive smoothing). We can now try training the trigram model with different values of r and compute the training and dev set losses.

r =  0.01  losses: [train: 2.1891, dev: 2.2900]
r =  0.10  losses: [train: 2.1958, dev: 2.2773]
r =  0.50  losses: [train: 2.2183, dev: 2.2857]
r =  1.00  losses: [train: 2.2405, dev: 2.3014]
r =  5.00  losses: [train: 2.3547, dev: 2.3995]
r = 10.00  losses: [train: 2.4445, dev: 2.4817]
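A sweep like this can be sketched as follows. This is a self-contained sketch, not the exact code behind the numbers above: the random data stands in for the real trigrams, and train_smoothed and loss are hypothetical helpers that mirror train_ngram and ngram_loss.

```python
import torch

torch.manual_seed(0)  # assumed seed for repeatability

V, n = 5, 3  # toy vocabulary size and n-gram order (assumptions)
data = torch.randint(0, V, (500, n))  # random stand-in for real trigrams
train_set, dev_set = data[:400], data[400:]

def train_smoothed(ngrams, n, vocab_size, r):
    "Count n-grams starting from r instead of 0, then normalize"
    counts = torch.full((vocab_size,) * n, float(r))
    indices = tuple(ngrams[:, i] for i in range(n))
    counts.index_put_(indices, torch.ones(len(ngrams)), accumulate=True)
    return counts / counts.sum(-1, keepdim=True)

def loss(model, data, n):
    "Average negative log-likelihood; no epsilon needed since r > 0"
    indices = tuple(data[:, i] for i in range(n))
    return -torch.log(model[indices]).mean().item()

results = {}
for r in [0.01, 0.1, 0.5, 1.0, 5.0, 10.0]:
    model = train_smoothed(train_set, n, V, r)
    results[r] = (loss(model, train_set, n), loss(model, dev_set, n))
    print(f"r = {r:5.2f}  losses: [train: {results[r][0]:.4f}, dev: {results[r][1]:.4f}]")

# Pick the setting with the lowest dev loss; the test set is not touched here
best_r = min(results, key=lambda r: results[r][1])
```

The test set should only be evaluated once, with best_r, after the sweep is finished.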

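A value like 0.1 is obviously not a “count”, but it doesn’t matter. We can think of r simply as a hyperparameter that affects the final probabilities. The pattern in the table is that the training loss rises steadily as r grows, while the dev loss first falls and then rises, reaching its minimum at r = 0.10. We’ll now evaluate the test set loss with the best setting: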

2.1236484050750732

Surprisingly, the test loss is lower than both the training and dev losses. The most likely explanation is that we got lucky with the random split.

E04

We saw that our 1-hot vectors merely select a row of W, so producing these vectors explicitly feels wasteful. Can you delete our use of F.one_hot in favor of simply indexing into rows of W?

(Refer to the train function from the previous post).

X is a 1-d tensor containing the integer code (0-27) of the first character of every bigram training example. Xenc is a tensor of shape (N, 28) where each row is a one-hot encoded training example. W has shape (28, 28). For a given training example, Xenc @ W just picks out one row of W and returns it as the logits. Thus the current code is:

Xenc = F.one_hot(X, num_classes=len(VOCAB)).float()

# In forward()
logits = Xenc @ W

This can be replaced with a single line using PyTorch’s advanced indexing. We will use X as the index into W. Each element of X picks out one row of W, and the selected rows are stacked into a result of shape (N, 28).

logits = W[X]
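A quick check that the two forms agree. The random W and the handful of indices are made up for the sketch.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # assumed seed for repeatability
V = 28
W = torch.randn(V, V)
X = torch.tensor([0, 5, 27, 5])  # toy character codes; repeats are fine

via_one_hot = F.one_hot(X, num_classes=V).float() @ W
via_index = W[X]

print(via_index.shape)  # torch.Size([4, 28])
print(torch.allclose(via_one_hot, via_index))  # True
```

Indexing also skips building the (N, 28) one-hot matrix and the matrix multiplication, which is the waste the exercise refers to.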

E05

Look up and use F.cross_entropy instead. You should achieve the same result. Can you think of why we’d prefer to use F.cross_entropy instead?
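A minimal sketch of the comparison, assuming the loss from the previous post is the usual softmax followed by the mean negative log-likelihood (without any regularization term). The logits and targets here are random stand-ins.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # assumed seed for repeatability
logits = torch.randn(6, 28)
Y = torch.randint(0, 28, (6,))

# Manual version: softmax, then mean negative log-likelihood of the targets
counts = logits.exp()
probs = counts / counts.sum(1, keepdim=True)
manual = -probs[torch.arange(len(Y)), Y].log().mean()

builtin = F.cross_entropy(logits, Y)
print(torch.allclose(manual, builtin, atol=1e-5))  # True

# Large logits overflow exp() in float32, but cross_entropy stays finite
big = torch.tensor([[100.0, 0.0]])
target = torch.tensor([0])
big_counts = big.exp()
big_manual = -(big_counts / big_counts.sum(1, keepdim=True))[0, 0].log()
big_builtin = F.cross_entropy(big, target)
print(big_manual.item(), big_builtin.item())
```

Two reasons to prefer F.cross_entropy follow from this: it is numerically stable for large logits, and it avoids materializing the intermediate counts and probs tensors.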

E06

Meta-exercise! Think of a fun/interesting exercise and complete it.

What is a language model that can achieve a loss of \(0\) on the training set?