Path to LLMs
Last updated: Oct 18, 2024.
This post is my attempt to draw the shortest path from knowing a little bit of ML to understanding state-of-the-art language models. It includes both milestone papers and the best resources I’ve found for understanding each concept. I also like knowing the history of things, so there are a bunch of papers that may only be of historical interest.
This is a personal path, with the goal of being a reasonably good practitioner of ML, not a researcher. Finally, “path” is a misnomer. It’s more like a garden to get lost in.
Math background
There is no end to the amount of math one could learn before studying ML, and usually the more I learn the more it seems to help. However, I’ve also found that it’s ok to “lazy-load” the required math once you’ve acquired a decent intuition in each of the major areas. This section is therefore just a list of the areas of math that can be helpful, along with the best resources I’ve found for learning them.
Ever since I discovered computers, my identity has been “programmer”. The book by Jeremy Kun (2021) changed my relationship to math and gave me the confidence to read ML textbooks and papers. It helped me reconnect with my teenage self, who found math playful and was excited by it rather than scared off by the notation. This is a life-changing book.
Probability
Probability is the foundation for all of ML, statistics, and science. It’s also way more complicated than our brief encounter with it in high school or college makes us believe. I’m always on the lookout for books and articles that help in developing a good intuition for probability.
The textbook (Hamming 1991) is one of the best introductions. It is rigorous enough for us engineers but, more importantly, has long passages that explain the intuition behind the ideas.
Information Theory
Information seems like the most natural concept with which to try to understand ML and stats. Many of the questions of interest can be posed as information-theoretic questions: “what has a model learnt?”, “what did this experiment tell us?”, “how much can a model of a certain size learn?”
(Cover and Thomas 2005) and (MacKay 2003) are two useful textbooks.
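To make “information” concrete: the entropy of a distribution is the average number of bits needed to encode samples drawn from it, and the cross-entropy loss used to train language models measures exactly this quantity against the model’s predictions. A minimal sketch (the coin distributions are my own toy examples):

```python
import math

def entropy(p):
    """Entropy in bits: the average information per sample from distribution p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]

print(entropy(fair_coin))    # 1.0 bit: maximally uncertain
print(entropy(biased_coin))  # ~0.47 bits: more predictable, less information
```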
Linear Algebra
Linear Algebra has the worst branding in all of math. It’s more exciting to think of the subject as “thinking in high-dimensional spaces”. Everything in ML deals with vectors in impossibly high-dimensional spaces (for example, each token in GPT-3 is drawn from a vocabulary of roughly 50,000, so before embedding it is a one-hot vector in a ~50,000-dimensional space).
The video series “Essence of Linear Algebra” by (3blue1brown 2016) was the first time linear algebra made any intuitive sense to me.
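One way to make the high-dimensional picture concrete: a token starts out as a one-hot vector in vocabulary space, and an embedding layer is just a matrix that maps it into a lower-dimensional space. A toy numpy sketch (the sizes are made up, much smaller than any real model’s):

```python
import numpy as np

vocab_size, embed_dim = 1_000, 8              # toy sizes; real models are far larger
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embed_dim))  # the embedding matrix

token_id = 42
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

# Multiplying a one-hot vector by E is the same as selecting row 42:
assert np.allclose(one_hot @ E, E[token_id])
```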
Calculus
ML papers are full of complicated equations with symbols from multivariate and matrix calculus. This might give the impression that one needs a full undergrad course in these topics before making any progress, but I don’t buy it. I think one can get by for a long time with just an intuitive grasp of the derivative (gradient) of a complicated function and the chain rule for computing it.
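As a quick sanity check on that intuition, here’s the chain rule for f(g(x)) verified against a finite-difference approximation (f and g are my own toy functions):

```python
import math

def f(u): return math.sin(u)
def g(x): return x ** 2

def analytic_grad(x):
    # Chain rule: d/dx f(g(x)) = f'(g(x)) * g'(x)
    return math.cos(g(x)) * 2 * x

def numeric_grad(x, h=1e-6):
    # Central finite difference: no calculus required, just two evaluations
    return (f(g(x + h)) - f(g(x - h))) / (2 * h)

x = 1.3
print(analytic_grad(x), numeric_grad(x))  # the two agree to ~6 decimal places
```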
Optimization in ML
The goal of all ML training is to find an acceptably low value of the loss function. This is the part of ML that I find easiest to treat as a black box.
(Bottou, Curtis, and Nocedal 2016) is a great overview of the various optimization methods used in ML.
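To peek inside the black box just once: every training loop is some variant of “nudge the parameters a little against the gradient”. A minimal sketch of gradient descent on a one-parameter quadratic loss (the loss and learning rate are made up for illustration):

```python
def loss(w):
    return (w - 3.0) ** 2       # minimum at w = 3

def grad(w):
    return 2.0 * (w - 3.0)      # derivative of the loss

w, lr = 0.0, 0.1                # initial weight and learning rate
for step in range(50):
    w -= lr * grad(w)           # the core of (S)GD: step against the gradient

print(w, loss(w))               # w is now very close to 3, loss close to 0
```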
Automatic Differentiation
AD is the key to training large neural networks. AD libraries automatically figure out the gradient of the loss function as long as the computation of the loss function is expressed in a form that the library expects. For example, in PyTorch the computation is expressed as tensor operations.
(Baydin et al. 2015) is a great survey of the various AD methods. For ML training we care about “reverse mode”. (Paszke et al. 2019) describes PyTorch, the most widely used library for deep learning in production.
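Here’s what reverse mode looks like in practice in PyTorch: build the loss out of tensor operations, call backward(), and read the gradient off the parameters. A minimal sketch, with a one-weight “model”:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)   # a parameter we want to train
x, target = torch.tensor(3.0), torch.tensor(9.0)

loss = (w * x - target) ** 2                # loss built from tensor operations
loss.backward()                             # reverse-mode AD through the graph

# By hand: d(loss)/dw = 2 * (w*x - target) * x = 2 * (6 - 9) * 3 = -18
print(w.grad)                               # tensor(-18.)
```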
What are neural networks?
The first neural network was the perceptron (see Nilsson 2010, sec. 4.2.1), a single-layer network built to identify objects in 20x20 pixel images. I find it fascinating to note that most of the early work on neural networks was done by people trying to understand human cognition by building a model of computation different from the familiar digital (von Neumann) computer. From that perspective, current LLMs running on GPUs are just one physical realization of the model of computation.
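For concreteness, the perceptron’s entire learning algorithm fits in a few lines: predict with a thresholded dot product and, when wrong, nudge the weights toward the misclassified example. A toy sketch (learning the AND function rather than 20x20 images):

```python
import numpy as np

# Toy data: learn the AND function. Labels are +1 / -1.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])

w, b = np.zeros(2), 0.0
for _ in range(10):                      # a few passes over the data
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:       # misclassified (or on the boundary)
            w += yi * xi                 # the perceptron update rule
            b += yi

print([int(np.sign(w @ xi + b)) for xi in X])  # [-1, -1, -1, 1]
```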
The key algorithm for training neural networks is backpropagation. This algorithm has apparently been invented independently many times. (Rumelhart, Hinton, and Williams 1986) is one of the widely cited descriptions of it.
(LeCun et al. 1989) is one of the first examples of using neural networks and backpropagation to solve the recognizably modern problem of handwriting recognition. An interesting companion piece is the blog post (Karpathy 2022a) that re-implements the network described in the original paper and illustrates the massive difference in training time made possible by modern hardware.
Another milestone in the deep learning revolution is AlexNet (Krizhevsky, Sutskever, and Hinton 2012), in which a deep learning model beat all previous computer vision models on image recognition by a significant margin. This paper also illustrates the coming together of three factors that make deep learning practical and remain true to this day: (1) massive datasets, (2) GPUs for efficient matrix computations, and (3) libraries that make automatic differentiation easy.
What is language modeling?
The task of language modeling is to learn a probability distribution from a corpus: the conditional probability of the next token given the sequence of previous tokens. A short introduction to language modeling is in (Hang Li 2022).
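The simplest possible instance of this definition is a count-based bigram model: estimate the probability of the next token given only the previous one, straight from corpus counts. A minimal sketch on a toy corpus:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ran".split()

# Count bigrams: how often does each token follow each previous token?
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def p_next(prev, nxt):
    """Conditional probability P(nxt | prev), estimated from counts."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total

print(p_next("the", "cat"))  # 2/3: "the" is followed by "cat" twice, "mat" once
```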
The roots of this go back to Markov analyzing Pushkin’s poetry to settle a debate about free will(!), as described in (Hayes 2013).
The classic (Shannon 1948) paper that invented information theory also considers language modeling, as does his subsequent paper (Shannon 1951). In the second paper he describes an experiment to estimate the entropy of the English language by giving humans (his wife and another couple) the task of predicting the next letter of a passage of text, essentially treating them like modern LLMs!
The story of Shannon’s living-room experiments is told in an entertaining profile (Horgan 1992).
How is language modeling done with neural networks?
(Bengio et al. 2003) introduced the idea of using a neural network to model language, as well as the idea of a “distributed representation”, also known as word embeddings. The goal of an embedding is to turn words and phrases into vectors in a high-dimensional space.
A big step forward in embeddings was Google’s word2vec paper (Mikolov et al. 2013), which contained the famous example vec("King") - vec("Man") + vec("Woman") ~= vec("Queen"). Embeddings on their own are an incredibly useful tool for building products because they capture a general notion of semantic “distance” between words, sentences, or entire documents.
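The “distance” in question is usually cosine similarity, and the famous analogy is literally vector arithmetic followed by a nearest-neighbor lookup. A sketch with made-up 3-dimensional vectors standing in for real embeddings:

```python
import numpy as np

# Made-up 3-d vectors standing in for real (hundreds-of-dimensions) embeddings.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = emb["king"] - emb["man"] + emb["woman"]
nearest = max(emb, key=lambda w: cosine(emb[w], target))
print(nearest)  # "queen"
```

(Real analogy evaluations exclude the query words from the candidate set; with these toy vectors “queen” wins regardless.)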
Recurrent Neural Nets (RNNs) were one solution to the problem of capturing the sequential nature of language. The historical roots of this approach are in the cognitive science paper (Elman 1990). The blog post (Karpathy 2015) illustrates the “unreasonable effectiveness” of RNNs.
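The core of an RNN is a single equation applied at every position in the sequence: the new hidden state mixes the current input with the previous hidden state, h_t = tanh(W_xh x_t + W_hh h_{t-1} + b). A minimal numpy sketch with made-up dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16             # made-up sizes

W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

def rnn_step(x, h):
    """One step: mix the current input with the previous hidden state."""
    return np.tanh(W_xh @ x + W_hh @ h + b)

h = np.zeros(hidden_dim)                  # initial hidden state
for x in rng.normal(size=(5, input_dim)): # a toy sequence of 5 inputs
    h = rnn_step(x, h)                    # h now summarizes the whole sequence
```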
The playlist “Neural Networks: Zero to Hero” (Karpathy 2022b) is a step-by-step walkthrough of building something like GPT-2, starting from nothing but knowledge of Python. This entire post is, in a sense, all the supplementary reading I’m doing to fully understand the videos in this playlist.
Large Language Models
Everything in this section is just the starting point for deeper rabbit holes.
“Attention is all you need” (Vaswani et al. 2017) contains the core DNA of all current LLMs. Everything I described above in this post is my attempt to reach a full understanding of this landmark paper.
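The heart of the paper is one formula, scaled dot-product attention: Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. A minimal numpy sketch (single head, no masking, made-up shapes):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # how much each query attends to each key
    return softmax(scores) @ V        # weighted average of the values

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8                   # made-up shapes: 4 tokens, 8-dim head
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))
print(attention(Q, K, V).shape)       # (4, 8): one output vector per token
```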
“State of GPT” (Karpathy 2023) is the best one-hour introduction to the architecture, training, and capabilities of LLMs. The talk is accessible to any working programmer and assumes no previous knowledge of LLMs or neural networks.
(3blue1brown 2024) is a great video series on neural networks and deep learning, with the most recent installments focusing on LLMs.
The papers on open-source LLMs have a wealth of detail on training data and methodology. See Llama 2 (Touvron et al. 2023) and Mistral 7B (Jiang et al. 2023).
The effectiveness of neural networks is extremely dependent on the quantity and quality of the training data. The broader lesson, that general methods which scale with data and computation beat hand-engineered cleverness, has been rediscovered so often that it has a name: “the bitter lesson” (Sutton 2019). An intriguing related fact about LLMs is the existence of “scaling laws” that describe the optimal model size and number of training tokens for a given compute budget (Hoffmann et al. 2022).
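As a back-of-the-envelope example of what the scaling laws say: the Hoffmann et al. (“Chinchilla”) result works out to roughly 20 training tokens per parameter at the compute-optimal point, and training compute is commonly approximated as C ≈ 6·N·D FLOPs. Both numbers are rules of thumb, not exact results:

```python
def chinchilla_estimate(n_params):
    """Rough compute-optimal token count and training FLOPs for a model size."""
    tokens = 20 * n_params          # ~20 tokens per parameter (rule of thumb)
    flops = 6 * n_params * tokens   # C = 6 N D, a standard approximation
    return tokens, flops

tokens, flops = chinchilla_estimate(70e9)         # a 70B-parameter model
print(f"{tokens:.2e} tokens, {flops:.2e} FLOPs")  # ~1.4e12 tokens, ~5.9e23 FLOPs
```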
The wide applicability of LLMs is a result of their ability to learn to perform tasks from just a handful of examples in the prompt (“few-shot learning”). This discovery is related in the GPT-3 paper (Brown et al. 2020); (Kojima et al. 2022) goes further, showing that models can tackle some reasoning tasks with no examples at all (“zero-shot”).
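In practice, few-shot learning is just prompt construction: the “training examples” are placed directly in the context and the model continues the pattern. A sketch of what such a prompt looks like (the translation task echoes the GPT-3 paper; no particular model’s API is assumed):

```python
# A few-shot prompt is just text: the "training examples" live in the context.
prompt = """Translate English to French.

English: cheese
French: fromage

English: bread
French: pain

English: water
French:"""

# Sent to any LLM completion endpoint, the expected continuation is " eau".
print(prompt)
```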
Training LLMs is an incredibly complicated systems engineering problem. This blog post by the lead of PyTorch (Chintala 2024) and the infrastructure section of the Llama 3 paper (Dubey et al. 2024) provide insight into what it takes.
[ to be continued … ]