Quanta Magazine

In machine learning, the sheer size of artificial neural networks, and their outsized success, creates conceptual problems. AlexNet, the network that won the annual image recognition contest in 2012, had 60 million parameters. These parameters, fine-tuned during training, allowed AlexNet to recognize images it had never seen before. Two years later, VGG won the same competition with more than 130 million parameters. Some artificial neural networks, or ANNs, now have billions of parameters.
These huge networks, which have proved astonishingly successful at tasks like recognizing speech, classifying images and translating text from one language to another, have come to dominate machine learning. Yet they remain mysterious: the source of their extraordinary power is not well understood.

Researchers have now shown that idealized versions of these powerful networks are mathematically equivalent to older, simpler machine learning models called kernel machines. If this equivalence can be extended beyond idealized networks, it may explain how practical ANNs achieve such astonishing results.

Part of what makes artificial neural networks so fascinating is that they seem to defy traditional machine learning theory, which leans heavily on statistics and probability theory. Machine learning models, neural networks included, are trained to recognize patterns in data and to make predictions about new, unseen data.

In the standard way of thinking, too few parameters yield a model that is too simple and fails to capture the patterns in the data it was trained on. Too many parameters make the model overly complex: it learns the training data at such a fine grain that it cannot generalize when shown new data, a failure known as overfitting. It is a balance between fitting the data too well and not fitting it well enough; as Mikhail Belkin, a machine learning researcher at the University of California, San Diego, put it, you want to be somewhere in the middle.
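As a concrete illustration of that balance, here is a toy sketch, not taken from the article, that fits noisy synthetic data with polynomials of different degrees; all the numbers are invented purely for illustration.

```python
# A minimal sketch of the underfitting/overfitting trade-off:
# fit made-up noisy data with polynomials of increasing degree.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 15)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)

for degree in (1, 3, 14):
    coeffs = np.polyfit(x_train, y_train, degree)    # tune the parameters
    test_error = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, round(test_error, 4))

# Typically: degree 1 underfits, degree 3 is about right, and degree 14
# hugs the noisy training points but errs badly on points in between.
```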

By all accounts, deep neural networks like VGG have far too many parameters and should overfit. Yet they don't. Instead, such networks generalize astonishingly well to new data, and no one has quite figured out why. It hasn't been for lack of trying. Naftali Tishby, a computer scientist at the Hebrew University of Jerusalem who died in August 2021, argued that deep neural networks first fit the training data and then discard irrelevant information (by passing it through an information bottleneck), which helps them generalize. But others have argued that this doesn't happen in all deep neural networks, and the idea remains controversial.

Now, the mathematical equivalence between kernel machines and idealized neural networks is providing clues to why and how these over-parameterized networks arrive at (or converge to) their solutions. Kernel machines are algorithms that find patterns in data by projecting the data into extremely high dimensions. By studying the mathematically tractable kernel equivalents of idealized neural networks, researchers are learning why deep nets, despite their shocking complexity, converge during training to solutions that generalize well to unseen data.

A neural network is a bit like a Rube Goldberg machine, Belkin said: you don't know which part of it really matters. Because kernel methods don't have all that complexity, he thinks that reducing neural networks to kernel methods makes it possible to isolate the engine of what is going on.

Find the Line

Kernel methods, or kernel machines, rely on an area of mathematics with a long history. It goes back to the 19th-century German mathematician Carl Friedrich Gauss, who came up with the Gaussian kernel, which maps a variable x to a function with the familiar bell-curve shape. Kernels were later put to use by the English mathematician James Mercer to solve integral equations. By the 1960s, kernels had entered machine learning as a way to handle data that could not be classified by simple techniques.

Understanding kernel methods starts with linear classifiers, a type of machine learning algorithm. Say you want to distinguish cats from dogs using data in only two dimensions: the size of the snout, plotted on the x-axis, and the size of the ears, plotted on the y-axis. Plot this labeled data on the xy-plane, and the cats should fall in one cluster and the dogs in another.

Using the labeled data, a linear classifier can be trained to find a straight line that separates the two clusters. This amounts to finding the coefficients of the equation that describes the line. Given new, unlabeled data, the classifier can then tell whether it represents a cat or a dog by checking which side of the line it falls on.
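To make that concrete, here is a minimal sketch, not taken from the article, of a linear classifier trained with the classic perceptron rule on made-up two-dimensional snout-versus-ear data; every number and name below is invented for illustration.

```python
# Train a linear classifier (perceptron) on synthetic "snout size vs. ear size" data.
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: cats cluster around (2, 6), dogs around (6, 2).
cats = rng.normal(loc=[2.0, 6.0], scale=0.8, size=(50, 2))
dogs = rng.normal(loc=[6.0, 2.0], scale=0.8, size=(50, 2))
X = np.vstack([cats, dogs])
y = np.array([-1] * 50 + [1] * 50)      # -1 = cat, +1 = dog

w = np.zeros(2)                         # the line's coefficients (weights)
b = 0.0                                 # and its intercept
for _ in range(100):                    # perceptron training loop
    for xi, yi in zip(X, y):
        if yi * (xi @ w + b) <= 0:      # point is on the wrong side of the line
            w += yi * xi
            b += yi

def classify(point):
    return "dog" if point @ w + b > 0 else "cat"

print(classify(np.array([5.5, 1.5])))   # a big-snouted, small-eared animal: "dog"
```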

Cat and dog lovers, however, would be horrified by such an oversimplification. The real data about the snouts and ears of the many breeds of cats and dogs almost certainly cannot be separated by a straight line. In situations like this, where the data is not linearly separable, it can be transformed, or projected, into a higher-dimensional space. One simple way to do that is to multiply the values of two features to create a third; perhaps it is something about the relationship between the sizes of the snouts and ears that distinguishes cats from dogs.

In general, it is easier to find a linear separator in this higher-dimensional space. When that separating hyperplane is projected back down to lower dimensions, it takes the form of a nonlinear function, with curves and wiggles, that divides the original lower-dimensional data into two groups.
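As a toy illustration (again, not something from the article), the sketch below shows two-dimensional data that no straight line can split becoming perfectly separable by a flat plane once a third, product feature is added; the data set is synthetic and chosen purely to make the point.

```python
# Data that is not linearly separable in 2D becomes separable in 3D.
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sign(X[:, 0] * X[:, 1])          # labels: quadrants 1 & 3 vs. 2 & 4

# Project each 2D point into 3D by appending the product of its two features.
X3 = np.column_stack([X, X[:, 0] * X[:, 1]])

# In 3D, the flat plane "third coordinate = 0" already separates the classes.
predictions = np.sign(X3[:, 2])
print((predictions == y).mean())        # 1.0: every point correctly classified
```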

With real data, however, calculating the coefficients of a hyperplane in high dimensions is often computationally inefficient, and sometimes impossible. Kernel machines get around that problem.

Kernel of Truth

The power of kernel machines comes from their ability to do two things. First, they map every point of a low-dimensional data set to a point in higher dimensions. The dimensionality of this hyperspace can vary depending on the mapping, which can pose a problem: finding the coefficients of the separating hyperplane requires computing an inner product for every pair of high-dimensional features, and that becomes intractable when the data is projected into infinitely many dimensions.

So the second thing kernel machines do is this: given two low-dimensional data points, they use a kernel function to spit out a number that equals the inner product of the corresponding higher-dimensional features. Crucially, this trick lets the algorithm find the coefficients of the hyperplane without ever stepping into the high-dimensional space.
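Here is a minimal numerical sketch of that trick, with a feature map and kernel chosen for illustration rather than taken from the article: the polynomial kernel k(x, y) = (x · y)^2, evaluated entirely in two dimensions, returns exactly the inner product of an explicit three-dimensional feature map.

```python
# The kernel trick: compute a high-dimensional inner product without
# ever constructing the high-dimensional features.
import numpy as np

def phi(v):
    """Explicit feature map into 3D (never needed in practice)."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def kernel(x, y):
    """Kernel function: works entirely in the original 2D space."""
    return (x @ y) ** 2

x = np.array([1.5, -0.5])
y = np.array([0.3, 2.0])

print(phi(x) @ phi(y))   # inner product in the higher-dimensional space
print(kernel(x, y))      # the same number, computed without leaving 2D
```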

The beauty of the kernel trick, said Bernhard Boser of the University of California, Berkeley, is that all the computations happen in the low-dimensional space rather than in the potentially infinite-dimensional one.

Boser, together with his colleagues Isabelle Guyon and Vladimir Vapnik, invented a class of kernel machines called support vector machines (SVMs) in the late 1980s and early 1990s, when they were at Bell Labs in Holmdel, New Jersey. Kernel machines of various kinds had been part of machine learning since the 1960s, but it was the invention of SVMs that put them at the center of attention. SVMs proved extremely powerful and came to be widely used in fields as varied as bioinformatics (for detecting similarities between protein sequences and predicting their functions), machine vision and handwriting recognition.

SVMs dominated machine learning until 2012, when AlexNet ushered in the era of deep neural networks. As the machine learning community pivoted to ANNs, SVMs fell by the wayside. But they (and kernel machines in general) remain powerful models with much to teach us. They can do more than use the kernel trick to find a separating hyperplane.

A really powerful kernel can map data into a kernel space that is nearly infinite in size and extremely expressive, said Chiyuan Zhang, a researcher on Google Research's Brain Team. In that hidden space there is always a linear separator, and in fact there are many to choose from. Kernel theory makes it possible to pick a good one by restricting the search space, a form of regularization that effectively limits the number of parameters a model can use and keeps it from overfitting. Zhang wondered whether deep neural networks might be doing something similar.

Deep neural networks are made of layers of artificial neurons: an input layer, an output layer, and at least one hidden layer sandwiched between them. The more hidden layers, the deeper the network. The network's parameters represent the strengths of the connections between these neurons. Training a network for, say, image recognition involves repeatedly showing it images that have already been correctly categorized and settling on values for its parameters. Once trained, the ANN represents a model, a function that converts an input (say, an image) into an output (a label or category).
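The sketch below is one way to picture that structure in code; it is an illustrative toy, not any network from the article, with layer sizes chosen arbitrarily. The weight matrices and bias vectors are the parameters the prose refers to, and the forward function is the input-to-output map that the trained network represents.

```python
# A tiny fully connected network: its parameters are the weights and
# biases connecting one layer of neurons to the next.
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    # One layer's parameters: a weight matrix and a bias vector.
    return rng.normal(0, 1 / np.sqrt(n_in), (n_in, n_out)), np.zeros(n_out)

# An input of 4 features, two hidden layers of 8 neurons each, one output.
layers = [init_layer(4, 8), init_layer(8, 8), init_layer(8, 1)]

def forward(x):
    """The function the network represents: maps an input to an output."""
    h = x
    for i, (W, b) in enumerate(layers):
        h = h @ W + b
        if i < len(layers) - 1:          # nonlinearity on the hidden layers
            h = np.maximum(h, 0.0)       # ReLU activation
    return h

print(forward(np.array([0.5, -1.2, 3.0, 0.1])))
```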

In 2017, Zhang and colleagues ran a series of empirical tests on networks such as AlexNet and VGG to see whether the algorithms used to train these ANNs were somehow effectively reducing the number of tunable parameters, a phenomenon called implicit regularization. That would mean that the training regime was rendering these networks incapable of overfitting.

The team found that this was not the case. Using cleverly manipulated data sets, Zhang's team showed that AlexNet and similar ANNs can indeed overfit and fail to generalize. Yet the same networks, trained with the same algorithm on unaltered data, did not overfit; instead, they generalized well. So implicit regularization could not be the answer. The finding, Zhang said, demanded a better explanation of generalization in deep neural networks.

Infinite Neurons

Meanwhile, studies were showing that larger neural networks often generalize better than smaller ones. That was a hint that ANNs might be understood using a strategy borrowed from physics, where studying limiting cases can often simplify a problem. Yasaman Bahri, a researcher on Google Research's Brain Team, noted that physicists often make sense of a system by examining extreme cases, for instance asking what happens when the number of particles goes to infinity, because statistical effects can become easier to handle in such limits. So, mathematically speaking, what happens to a neural network when the number of neurons in each of its layers becomes infinite?

In 1994, Radford Neal, now a professor emeritus at the University of Toronto, asked exactly that question for a network with a single hidden layer. He showed that if the network's weights are initialized with certain statistical properties, then at initialization (before any training) the network is mathematically equivalent, as a function, to a well-known kernel method called a Gaussian process. More than 20 years later, Bahri's group and two others showed that the same holds for idealized infinite-width deep neural networks with many hidden layers.

This had an unexpected implication. Ordinarily, even after a deep net has been trained, there is no analytical mathematical expression that can be used to make predictions about unseen data; you simply have to run the deep net and see what it says. But in the idealized scenario, the network at initialization is equivalent to a Gaussian process, so you can discard the neural network altogether and just train the kernel machine instead.

Once you have mapped the network to a Gaussian process, Bahri said, you can calculate analytically what its predictions should be.
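For the curious, this is what "calculate analytically" looks like in the standard Gaussian process setting; the formula below is the textbook kernel regression predictor, stated under the usual assumptions rather than quoted from the article. Given training inputs $X$ with labels $y$, a kernel function $k$, and an assumed noise level $\sigma$, the prediction at a new point $x_*$ is

$$ f(x_*) = k(x_*, X)\,\bigl[k(X, X) + \sigma^2 I\bigr]^{-1} y, $$

where $k(x_*, X)$ is the vector of kernel values between the new point and each training point, and $k(X, X)$ is the matrix of kernel values between all pairs of training points. No iterative training is required; the prediction is a single linear-algebra calculation.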

As significant as this result was, though, it did not mathematically describe what happens during the most common form of training used in practice, and it was not clear how the equivalence could be extended to that setting.

Start the Descent

Part of the mystery concerns the algorithm used to train deep neural networks, called gradient descent. The word "descent" refers to the way the network, during training, moves through a complex, high-dimensional landscape full of hills and valleys, where each location represents the error the network makes for a given set of parameter values. Once the parameters are suitably tuned, the ANN eventually reaches a spot called the global minimum, where it classifies the training data as accurately as it can. Training a network is an optimization problem: finding the global minimum, at which the trained network represents a nearly optimal function mapping inputs to outputs. It is a complicated process and difficult to analyze.
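Here is a minimal sketch of gradient descent itself, not drawn from the article: it tunes the two parameters of a straight-line model on a tiny invented data set by repeatedly stepping downhill against the gradient of the squared error. Real deep nets do the same thing with millions or billions of parameters.

```python
# Gradient descent on a two-parameter model y = w*x + b.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])     # roughly y = 2x + 1, with noise

w, b = 0.0, 0.0                        # start somewhere on the error landscape
lr = 0.05                              # step size

for step in range(2000):
    pred = w * x + b
    err = pred - y
    # Gradients of the mean squared error with respect to w and b.
    grad_w = 2 * np.mean(err * x)
    grad_b = 2 * np.mean(err)
    # Step "downhill" against the gradient.
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)   # close to 2 and 1: near the bottom of the error landscape
```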

No theory had proved that an algorithm such as gradient descent can converge to the global minimum, said Simon Du, a machine learning specialist at the University of Washington, Seattle. By 2018, researchers were beginning to understand why it does.

As with many scientific breakthroughs, several groups arrived at a possible answer at the same time, all based on mathematical analyses of infinite-width networks and how they relate to the better-understood kernel machines. Around the time that Du's and other groups' papers appeared, Arthur Jacot, then a graduate student in Switzerland, presented his group's work at NeurIPS 2018.

While the various teams' work differed in its details and framing, the upshot was the same: deep neural networks of infinite width, whose weights are initialized with certain statistical properties, are equivalent to kernel machines not only at initialization but throughout the training process. A key assumption is that each individual weight changes vanishingly little during training, even though the collective effect of all those infinitesimal changes is significant. Jacot and his collaborators at the Swiss Federal Institute of Technology Lausanne proved that an idealized infinite-width deep neural network is equivalent to a kernel that does not change during training. It does not even depend on the training data; the kernel function depends only on the architecture of the neural network, such as its depth and type of connectivity. Because of certain geometric properties it has, the team named their kernel the neural tangent kernel.
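For readers who want to see it written down, the neural tangent kernel is usually defined as follows; this is the standard definition from the literature, not an equation given in the article. For a network computing $f(x; \theta)$ with parameters $\theta$, the kernel evaluated on two inputs $x$ and $x'$ is

$$ \Theta(x, x') = \nabla_\theta f(x; \theta) \cdot \nabla_\theta f(x'; \theta), $$

the inner product of the network's parameter gradients at the two inputs. The infinite-width result says that this kernel stays fixed at its initialization value for the entire course of gradient descent training.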

In certain cases, Jacot said, neural networks really do behave like kernel methods, and establishing that is a first step toward comparing the two approaches to understand their similarities and differences.

How to Get to All ANNs

The most important consequence of this result is that it explains why deep neural networks, at least in this idealized setting, converge to a solution. Convergence is hard to prove mathematically when you look at an ANN in parameter space, with its enormous number of parameters and its complicated loss landscape. But because the idealized deep net is equivalent to a kernel machine, you can train either one, and each will end up finding the nearly optimal function that converts inputs into outputs.

As the infinite-width neural network's function evolves during training, it tracks the evolution of the equivalent kernel machine's function. Viewed in function space, both the neural network and its kernel counterpart roll down a simple, bowl-shaped landscape in some hyper-dimensional space. It is easy to prove that gradient descent will take you to the bottom of the bowl, the global minimum. At least in this idealized scenario, Du said, global convergence can be proved, and that is why people in the learning theory community are so excited.

Not everyone is convinced that this equivalence between kernels and neural networks will hold for practical networks, which have finite widths and whose parameters can change significantly during training. There are still some dots to connect, Zhang said. He also admits to feeling a bit deflated that the equivalence strips neural networks of some of their mystique: reducing them to kernel machines would mean the old theory suffices, which to him makes them less interesting.

Others are more enthusiastic. Belkin, for one, thinks that kernel methods, old as they are, remain incompletely understood. His empirical work has shown that kernel methods do not overfit and generalize well to test data without any need for regularization, much like neural networks and quite unlike what traditional learning theory would predict. Understanding kernel methods, Belkin said, may give us the key to unlocking the magic box of neural networks.

Because researchers have a firmer mathematical grasp of kernels, it is easier to use them to analyze neural nets. Kernels are also easier to work with empirically: they are simpler, require less random initialization of parameters and perform better. Researchers are eager to keep probing the links between kernels and realistic networks.

If a full and complete equivalence can be established, Belkin said, it could change the game.