Why Machines Learn
Last updated: Nov 27, 2024.
These are interesting things mentioned in Why Machines Learn: The Elegant Maths behind Modern AI, Ananthaswamy (2024).
Imprinting: Konrad Lorenz discovered that ducklings imprint on the first moving thing they see after hatching. More interestingly, they can imprint on relations. If upon hatching they see two moving red objects, they will later follow any two objects that match each other in color, even if that color is not red. More about this in his Nobel lecture (Konrad Lorenz 1973a) and biography (Konrad Lorenz 1973b).
The first artificial neuron: The paper about the first artificial neuron was a collaboration between McCulloch, a professor in his mid-40s, and Pitts, a teenage prodigy who was hanging around the university and was adopted into the McCulloch home. The paper itself (McCulloch and Pitts 1943) is impenetrable, written in a formal math style reminiscent of Principia Mathematica. The important conclusion, though, is that combinations of the artificial neuron can implement any boolean logic.
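A McCulloch-Pitts unit is just a threshold on a weighted sum. A quick sketch of how combinations of such units give boolean logic (my own toy code, not the paper’s notation):

```python
def mp_neuron(inputs, weights, threshold):
    """McCulloch-Pitts unit: fires (returns 1) iff the weighted sum reaches the threshold."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)

def AND(a, b): return mp_neuron([a, b], [1, 1], threshold=2)
def OR(a, b):  return mp_neuron([a, b], [1, 1], threshold=1)
def NOT(a):    return mp_neuron([a],    [-1],   threshold=0)

# Any boolean function can be built from these; XOR needs a small network of units.
def XOR(a, b): return AND(OR(a, b), NOT(AND(a, b)))

print([XOR(a, b) for a in (0, 1) for b in (0, 1)])   # [0, 1, 1, 0]
```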
Hebbian learning can be understood as the memorable phrase “neurons that fire together, wire together”.
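In symbols, the usual textbook form of the rule (my notation, not the book’s): if a presynaptic neuron with activity \(x_i\) feeds a postsynaptic neuron with activity \(y_j\), the connecting weight grows in proportion to their joint activity,

\[ \Delta w_{ij} = \eta \, x_i \, y_j, \]

where \(\eta\) is a small learning rate.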
The Mark I perceptron was a hardware implementation that could recognize handwritten characters from a 20x20 image. It was a 3-layer neural network, although only one layer had adjustable weights (in hardware, using DC motors to drive potentiometers, essentially volume knobs!). The operator’s manual (Hay, Lynch, and Smith 1960) has all the fascinating details.
William Rowan Hamilton discovered quaternions and etched their defining equation on a bridge in Dublin. He’s also responsible for the notions of “scalar” and “vector”.
A hyperplane, such as the one learnt by a perceptron, can be described by a vector that is orthogonal to it (unique up to scaling). This is in fact the vector of weights, \(w\).
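One way to see this (standard linear algebra, nothing specific to the book): take any two points \(x_1\) and \(x_2\) on the hyperplane \(\{x : w \cdot x + b = 0\}\). Subtracting their defining equations gives

\[ w \cdot (x_1 - x_2) = 0, \]

so \(w\) is orthogonal to every direction lying within the hyperplane.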
The perceptron learning rule is simple, but it’s remarkable that it always converges if the dataset is linearly separable. The lecture notes (Kilian Weinberger 2024) have an accessible proof of this theorem, also reproduced in the book.
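For reference, a minimal sketch of the learning rule itself (my own numpy version; the book and the lecture notes use slightly different notation):

```python
import numpy as np

def train_perceptron(X, y, epochs=100):
    """Perceptron learning rule: for each misclassified point,
    nudge the weight vector towards (or away from) that point.
    X: (n_samples, n_features), y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:   # misclassified (or on the boundary)
                w += yi * xi                     # rotate w towards the correct side
                b += yi
                errors += 1
        if errors == 0:                          # converged: every point classified correctly
            break
    return w, b

# Tiny linearly separable example: AND-like data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])
w, b = train_perceptron(X, y)
print(w, b, np.sign(X @ w + b))   # predictions match the labels
```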
When two dialup modems are trying to establish a connection they send a pre-agreed signal that lets them configure (learn) an adaptive filter to filter out the noise particular to that line. This is the source of some of the weird sounds we used to hear! The course notes in (UC Berkeley 2011) have more details.
Adaptive filters use the mean-squared error (MSE) as the cost function. When reading about linear regression I’ve always wondered why we can’t just use the absolute difference. One reason to prefer MSE is that it’s differentiable everywhere. Another reason mentioned in this book is that the MSE punishes outliers much more than the absolute difference (a residual of 10 contributes 100 to the squared error but only 10 to the absolute error).
The idea of stochastic gradient descent had already been invented in the 1960s by the ADALINE project at Stanford, which tried to solve some of the same problems as the perceptron machine.
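ADALINE’s update, the LMS or “delta” rule, is exactly a stochastic gradient step on the squared error, and it is also what the adaptive filters above do. A minimal sketch (mine, with made-up data, not the book’s code):

```python
import numpy as np

def lms(X, y, lr=0.01, epochs=50):
    """ADALINE / LMS: one stochastic gradient step on the squared
    error per sample. X: (n_samples, n_features), y: real targets."""
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):      # visit samples in random order
            err = y[i] - X[i] @ w              # prediction error for this sample
            w += lr * err * X[i]               # step along -gradient of err^2 (factor 2 folded into lr)
    return w

# Recover a known linear relationship from noisy observations
rng = np.random.default_rng(1)
true_w = np.array([2.0, -3.0])
X = rng.normal(size=(200, 2))
y = X @ true_w + 0.1 * rng.normal(size=200)
print(lms(X, y))   # should be close to [2, -3]
```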
Paul Erdős wasn’t convinced at first that switching doors in the Monty Hall problem was the right solution. There is hope for all of us!
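A quick simulation makes the “always switch” answer hard to argue with (my own sketch, not from the book):

```python
import random

def monty_hall(trials=100_000):
    """Estimate the win rate of the 'switch' strategy."""
    switch_wins = 0
    for _ in range(trials):
        car = random.randrange(3)                 # door hiding the car
        pick = random.randrange(3)                # contestant's first pick
        # Host opens a door that is neither the pick nor the car
        opened = next(d for d in range(3) if d != pick and d != car)
        # Switching means taking the remaining closed door
        final = next(d for d in range(3) if d != pick and d != opened)
        switch_wins += (final == car)
    return switch_wins / trials

print(monty_hall())   # ~0.667: switching wins about 2/3 of the time
```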
Bayesian statistics was used by Frederick Mosteller and David Wallace in the early 1960s to figure out the authorship of The Federalist Papers.
A concise way to remember Principal Component Analysis (PCA): The eigenvectors of a covariance matrix are the principal components of the original data matrix.
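In code, that one-sentence summary looks roughly like this (a numpy sketch of my own, not the book’s derivation):

```python
import numpy as np

def pca(X, k):
    """Project X (n_samples, n_features) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                       # center the data
    cov = np.cov(Xc, rowvar=False)                # covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigh: symmetric matrix, eigenvalues ascending
    order = np.argsort(eigvals)[::-1][:k]         # indices of the top-k eigenvalues
    components = eigvecs[:, order]                # principal components as columns
    return Xc @ components, components

# Example: 2d data stretched along one axis
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])
projected, components = pca(X, k=1)
print(components)   # first principal component is approximately the stretched axis
```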
The support vector machine (SVM) overcomes the linear-separability limitation of the perceptron. It finds an optimal (maximum-margin) separating hyperplane, if necessary by projecting the dataset into a much higher-dimensional space and finding the hyperplane there. The algorithm works by minimizing a cost function related to the weight vector while simultaneously satisfying a set of constraints, one per data point.
The constrained optimization required by SVMs uses the technique of Lagrange multipliers. It consists of defining a new function, the Lagrange function, that encodes all the constraints, and then finding its extrema.
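For the hard-margin SVM this works out to the standard textbook formulation (assuming labels \(y_i \in \{-1, +1\}\)): minimize \(\frac{1}{2}\lVert w \rVert^2\) subject to \(y_i (w \cdot x_i + b) \ge 1\) for every data point, which gives the Lagrange function

\[ L(w, b, \alpha) = \frac{1}{2}\lVert w \rVert^2 - \sum_i \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right], \qquad \alpha_i \ge 0. \]

At the solution, only the points with \(\alpha_i > 0\) matter; these are the support vectors.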
The optimal separating hyperplane of the SVM depends only on the dot products of a few “support vectors” that anchor the margin, hence the name. However, computing the dot products in higher dimensions can be expensive (this is not very convincing to me – is it still true?). The solution is to use the kernel trick.
A kernel is a function \(K(x_i, x_j)\) such that, given two vectors \(x_i\) and \(x_j\) and a map \(\phi(x)\) that transforms each vector into a higher-dimensional space, it computes the dot product in that higher-dimensional space while only ever working with the lower-dimensional vectors:
\[ K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j) \] The kernel trick was suggested by the French scientist Isabelle Guyon, working with Bernard Boser and Vladimir Vapnik. The trick apparently works even when projecting into an infinite-dimensional space (a Hilbert space); the kernel used in that case is the “radial basis function” (RBF) kernel.
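A concrete check of this identity is easiest with a simple polynomial kernel rather than the RBF (this example is mine): for 2d vectors, \(K(x, z) = (1 + x \cdot z)^2\) corresponds to an explicit feature map \(\phi\) into 6 dimensions, and the two computations agree.

```python
import numpy as np

def phi(x):
    """Explicit 6-dimensional feature map whose dot product equals (1 + x.z)^2 for 2d inputs."""
    x1, x2 = x
    s = np.sqrt(2)
    return np.array([1.0, s * x1, s * x2, x1**2, x2**2, s * x1 * x2])

def K(x, z):
    """Polynomial kernel: computed entirely in the original 2d space."""
    return (1.0 + x @ z) ** 2

x = np.array([0.3, -1.2])
z = np.array([2.0, 0.7])
print(phi(x) @ phi(z))   # dot product in the 6d space
print(K(x, z))           # the same number, never leaving 2d
```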
With an RBF kernel one can always find a separating hyperplane in the corresponding infinite-dimensional space. This means that SVMs are also universal function approximators, just like deep neural networks.
The original SVM paper is (Boser, Guyon, and Vapnik 1992).
Glass is a disordered solid — without an ordered crystalline structure yet not a liquid. By analogy, certain magnetic materials that have atoms or ions with randomly oriented magnetic moments (which arise due to ‘spin’) are called spin glasses.
A simple mathematical model of spin glasses assumes that the spin of each element in a 2d or 3d array depends only on the spins of its neighbors. If such a material starts out in an arbitrary state, the spins will flip until the entire system reaches the state of lowest energy, which happens when the spins are all aligned.
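Written out, the usual form of this model (the Ising form; my notation, not the book’s) assigns a configuration of spins \(s_i \in \{-1, +1\}\) the energy

\[ E = -\sum_{\langle i, j \rangle} J_{ij} \, s_i s_j, \]

where the sum runs over neighboring pairs. With positive couplings \(J_{ij}\), the energy is lowest when neighboring spins agree, which is why the system settles into an aligned state.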
The physicist John Hopfield (Nobel in Physics, 2024) was thinking about the problem of associative memory. How is it that when given a fragment of an image or a hint of a smell, we can recall an entire vivid memory?
The solution was the Hopfield network. The connections in such a network are arranged similarly to the spin glasses. For a given configuration, the weights of the neurons represent the “memory” of the network. If the memory puts the network into its lowest energy state, any distorted (noisy) version of the same memory would put the network in a higher energy state. The network however can find its way back to the lowest energy state through a dynamical process, like any physical system finding its equilibrium. Thus we can think of the Hopfield network as a system that stores a memory and can retrieve that memory when given a fragment of it.
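A toy version of this in code (my own sketch, using the standard Hebbian storage rule and asynchronous updates; the book describes the mechanism, not an implementation):

```python
import numpy as np

def store(patterns):
    """Hebbian storage: weights are the sum of outer products of the stored patterns."""
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for p in patterns:
        W += np.outer(p, p)
    np.fill_diagonal(W, 0)            # no self-connections
    return W

def recall(W, state, steps=100):
    """Asynchronous updates: flip one neuron at a time towards lower energy."""
    state = state.copy()
    rng = np.random.default_rng(0)
    for _ in range(steps):
        i = rng.integers(len(state))
        state[i] = 1 if W[i] @ state >= 0 else -1
    return state

# Store one pattern, then recover it from a corrupted copy
pattern = np.array([1, 1, -1, -1, 1, -1, 1, -1])
W = store(pattern[None, :])
noisy = pattern.copy()
noisy[:2] *= -1                       # flip two bits
print(recall(W, noisy))               # settles back to `pattern`
```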
Hopfield could only publish his paper (Hopfield 1982) because he was a member of the National Academy of Sciences and thus had the privilege of publishing without peer review. “No refereed journal would have accepted it”. 🙃
George Cybenko proved in 1989 that a neural network with just one hidden layer and an arbitrarily large number of neurons can approximate any continuous function to arbitrary accuracy. This is the universal approximation theorem.
The proof uses the idea that functions are vectors in an infinite-dimensional space. It’s not a constructive proof but one by contradiction: it assumes that the functions computed by single-hidden-layer networks cannot get arbitrarily close to every function in that space (that they are not dense) and arrives at a contradiction.
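Even though the proof is non-constructive, the flavor of the result can be seen in a toy construction (mine, not Cybenko’s): a difference of two shifted, steep sigmoids makes a localized “bump”, and a weighted row of such bumps, i.e. a single hidden layer, can trace out a continuous curve.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bump(x, center, half_width, sharpness=500.0):
    """Two steep sigmoids subtracted: ~1 inside [center - half_width, center + half_width], ~0 outside."""
    return (sigmoid(sharpness * (x - (center - half_width)))
            - sigmoid(sharpness * (x - (center + half_width))))

# Approximate f(x) = sin(2*pi*x) on [0, 1] as a weighted sum of bumps,
# i.e. a single hidden layer of 2 * n_bumps sigmoid units.
x = np.linspace(0.0, 1.0, 1000)
n_bumps = 50
centers = (np.arange(n_bumps) + 0.5) / n_bumps
approx = sum(np.sin(2 * np.pi * c) * bump(x, c, half_width=0.5 / n_bumps) for c in centers)
print(np.max(np.abs(approx - np.sin(2 * np.pi * x))))   # small residual; finer, steeper bumps shrink it
```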
A deterministic algorithm for updating the weights of a neural network suffers from the problem of symmetry. If the initial weights are all assigned the same value (say, 0), they will be updated by the learning rule in the same way and thus effectively become redundant. The simple way to solve this problem is to initialize the weights randomly, an idea that first occurred to Rumelhart.
The backprop paper (Rumelhart, Hinton, and Williams 1986) was the “last time” it was discovered. See (Liu 2023) for a detailed history of backpropagation.
David Hubel and Torsten Wiesel did experiments on cats and figured out, essentially, that neurons in the visual cortex respond to certain “features” in the visual field, like edges.
Yann LeCun at Bell Labs in 1988 designed a neural network to recognize handwritten digits from a US Postal Service dataset (is this the origin of the famous MNIST dataset?). He wrote a compiler in Lisp that would take the architecture of a neural network and generate C code to implement it.
His network, LeNet, was a convolutional neural network. A 2d convolution of an image involves moving a small kernel (say 4x4) over the pixels of the image and, at each position, producing a new pixel as a weighted sum of the pixels under the kernel (the related pooling operation instead takes, say, their average or max). The end result is a slightly smaller image. While these kernels could be handcrafted, it’s more scalable to let the neural net learn them. This is how a neural net learns which features are important.
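A bare-bones 2d convolution, just to make the sliding-window picture concrete (my own sketch with a hand-crafted edge-detecting kernel, not LeNet’s code):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image` and take a weighted sum (dot product) at each position."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge detector: responds where pixel values change from left to right
image = np.zeros((6, 6))
image[:, 3:] = 1.0                   # left half dark, right half bright
edge_kernel = np.array([[-1.0, 1.0],
                        [-1.0, 1.0]])
print(conv2d(image, edge_kernel))    # nonzero only along the vertical edge
```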
In 2009, Fei-Fei Li presented an immense dataset of labeled images, ImageNet, along with an associated challenge: use the 1.2 million images, binned into 1,000 categories, to train an algorithm, then test it on 100,000 unseen images. AlexNet, which used GPUs for training, won the contest in 2012 by a wide margin, kicking off the modern deep learning revolution.
Grokking is a rather strange property of deep neural networks: a network can seem to merely memorize its training data and then, long after the training error has flattened out, suddenly start to generalize. The cutting-edge models today have parameters in the billions or trillions, sometimes outnumbering the instances of data used to train them. In theory these networks should simply overfit the data and fail to generalize, but that is not what happens.
One hypothesis is that some kind of implicit regularization is happening in the training process. See also the illuminating short paper (Breiman, “Reflections After Refereeing Papers for NIPS”) that questions the usefulness of “theory” when doing ML research.
LLMs are an instance of self-supervised learning. The pre-training helps the network learn some structure of language (or images, speech, …) and fine-tuning later guides it towards a purpose.
The theory of all this is far, far behind experimentation.