Sohl-Dickstein used the principles of diffusion to build a generative algorithm. The idea is to first turn the complex images in a training data set into simple noise, and then teach the system to reverse that process, turning noise back into images.

Here is how it works. The algorithm first takes an image from the training data set. If each of the image's million pixels has some value, we can plot the image as a dot in million-dimensional space. At every time step the algorithm adds a little noise to each pixel, equivalent to the spread of ink after one small time step. As the process goes on, the pixel values bear less and less relation to their values in the original image. At each time step the algorithm also nudges every pixel value slightly toward the origin, the zero value on all those axes; this nudging prevents the values from growing too large.

Do this for all the images in the data set, and the complex arrangement of dots in million-dimensional space turns into a simple, normal distribution of dots around the origin.

This sequence of transformations very slowly turns your data distribution into a big ball of noise, a simple distribution you can easily sample from.
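To make the forward process concrete, here is a minimal sketch in Python of the noising procedure described above. The specific noise level, step count, and toy "image" are illustrative assumptions, not details from the original work.

```python
import numpy as np

def forward_diffusion(x0, num_steps=1000, beta=0.02, rng=None):
    """Gradually turn an image (a point in high-dimensional space) into noise.

    At each step, every pixel is nudged slightly toward zero (the sqrt(1 - beta)
    factor) and a small amount of Gaussian noise is added, like ink spreading
    a little further with each small time step.
    """
    rng = rng or np.random.default_rng(0)
    x = x0.astype(np.float64)
    for _ in range(num_steps):
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)
    return x

# Toy "image": a 10,000-pixel stand-in for a million-pixel photograph.
image = np.full(10_000, 0.7)
noisy = forward_diffusion(image)
print(noisy.mean(), noisy.std())  # close to 0 and 1: a sample from the simple distribution
```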

[Photo: a young man in a white collared shirt, a dark dress jacket, and round glasses]

Next comes the machine learning part: give a neural network the noisy images obtained in the forward pass and train it to predict the less noisy images that came one step earlier. It will make mistakes at first, so you adjust the network's parameters until it does better. Eventually, the neural network can reliably turn a noisy image, representative of a sample from the simple distribution, all the way back into an image representative of a sample from the complex distribution.
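Here is a toy sketch of that training loop, written in PyTorch (the choice of framework, the tiny two-dimensional "images", and the small fully connected network are all assumptions made for illustration). The network is shown a noisy point at step t and trained to predict the slightly less noisy point at step t - 1.

```python
import torch
from torch import nn

dim, num_steps, beta = 2, 100, 0.02
denoiser = nn.Sequential(nn.Linear(dim + 1, 64), nn.ReLU(), nn.Linear(64, dim))
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

def noisy_pair(x0, t):
    """Run the forward (noising) process and return two consecutive points:
    the image after t - 1 steps and the image after t steps."""
    x = x0
    for _ in range(t - 1):
        x = (1 - beta) ** 0.5 * x + beta ** 0.5 * torch.randn_like(x)
    x_prev = x
    x_next = (1 - beta) ** 0.5 * x_prev + beta ** 0.5 * torch.randn_like(x_prev)
    return x_prev, x_next

for step in range(5000):
    x0 = 0.1 * torch.randn(128, dim) + torch.tensor([2.0, -1.0])  # toy "data set"
    t = torch.randint(1, num_steps + 1, (1,)).item()              # pick a random step
    x_prev, x_next = noisy_pair(x0, t)
    t_feature = torch.full((x0.shape[0], 1), t / num_steps)       # tell the network which step
    pred = denoiser(torch.cat([x_next, t_feature], dim=1))
    loss = nn.functional.mse_loss(pred, x_prev)                   # mistakes -> adjust parameters
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```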

The trained network is a generative model in its own right. You no longer need an original image at all: you have a full mathematical description of the simple distribution, so you can sample from it directly, and the neural network can turn that sample into a final image resembling those in the training data set.
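Continuing the same toy setup, generation then amounts to drawing a sample from the simple distribution and repeatedly applying the trained network, one step at a time. This is only a schematic sketch; real diffusion samplers such as DDPM also inject a little fresh noise at each reverse step.

```python
# Sampling with the toy denoiser trained above: start from pure noise and walk
# backward through the steps, letting the network undo one noising step at a time.
with torch.no_grad():
    x = torch.randn(16, dim)                       # direct sample from the simple distribution
    for t in range(num_steps, 0, -1):
        t_feature = torch.full((x.shape[0], 1), t / num_steps)
        x = denoiser(torch.cat([x, t_feature], dim=1))
print(x.mean(dim=0))  # drifts toward the toy data mean, roughly [2.0, -1.0]
```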

Sohl-Dickstein recalled the first outputs of the model: if he squinted, he said, he could convince himself that a colored blob looked like a truck. Having spent so much time staring at different patterns of pixels and trying to figure out what was going on, he was really happy with the result.

Envisioning the Future

Although diffusion models could sample over the entire distribution without getting stuck producing only a subset of images, the images looked worse than those from competing generative models, and the process was much too slow. At the time, the work was not seen as exciting.

It would take two students, neither of whom knew the other, to connect the dots from this initial work to modern-day diffusion models. The first was Yang Song. Song and his adviser published a method for building generative models that didn't estimate the probability distribution of the data itself (the high-dimensional surface) but instead estimated the gradient of the distribution (think of it as the slope of the high-dimensional surface).
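To illustrate what "the slope of the surface" means, here is a small example of my own (not from Song's paper): for a one-dimensional Gaussian the score, the gradient of the log probability, has a closed form, and following it pushes a point uphill toward regions of high probability.

```python
import numpy as np

mu, sigma = 2.0, 0.5

def score(x):
    """Gradient of log p(x) for a Gaussian N(mu, sigma^2): the slope of the log-density."""
    return -(x - mu) / sigma**2

x = -3.0                    # start far from where the data lives
for _ in range(200):
    x += 0.01 * score(x)    # climb the slope
print(x)                    # ends up near mu = 2.0
```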

Song found that his technique worked best if he first perturbed each image in the training data set with increasing levels of noise, then asked his neural network to predict the original image using the gradients of the distribution, effectively denoising it. Once trained, the network could take a noisy image sampled from a simple distribution and progressively turn it into an image representative of the training data set. The image quality was great, but the model was painfully slow to sample from. And he did all this without knowing about Sohl-Dickstein's work; he wasn't aware of diffusion models at all, he said. After his paper was published, Sohl-Dickstein emailed him to point out that their models had very strong connections.
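The gradient alone is enough to generate samples. Here is a schematic sketch of Langevin dynamics, the kind of sampler used with score-based models, shown with the same closed-form Gaussian score rather than a trained network (an assumption made for illustration). Each step drifts along the score and injects a little fresh noise, and the many small steps required also hint at why sampling was slow.

```python
import numpy as np

mu, sigma = 2.0, 0.5
rng = np.random.default_rng(0)

def score(x):
    return -(x - mu) / sigma**2      # gradient of log p(x) for N(mu, sigma^2)

eps = 1e-3                            # step size
x = rng.standard_normal(10_000)       # start from a simple distribution
for _ in range(5_000):                # many tiny steps: one reason sampling is slow
    x = x + 0.5 * eps * score(x) + np.sqrt(eps) * rng.standard_normal(x.shape)

print(x.mean(), x.std())              # approaches mu = 2.0 and sigma = 0.5
```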

The second student saw those connections and realized that Song's work could improve Sohl-Dickstein's diffusion models. Jonathan Ho had recently finished his doctoral work on generative modeling at the University of California, Berkeley, but kept working on it; he considered it the most beautiful part of machine learning.

Ho redesigned and updated Sohl-Dickstein's model using some of Song's ideas and other advances from the world of neural networks. He knew that to get the community's attention he needed to make the model generate great-looking samples, and he was certain that this was the most important thing he could do.

His instincts were correct. Ho and his colleagues announced the new and improved diffusion model, which became such a landmark that researchers now refer to it simply as DDPM (denoising diffusion probabilistic model). On a benchmark of image quality, these models matched or surpassed all other generative models. It wasn't long before the big players took notice: DALL·E 2, Stable Diffusion, Imagen and other models all use some variation of DDPM.

[Photo: a close-up of a young man with dark hair and dark eyes, looking into the camera with a slight smile]

Modern diffusion models have one more main ingredient: large language models (LLMs), generative models trained on text from the internet. Ho, by then a research scientist at a stealth company, and a colleague at a research company showed how to combine information from an LLM with an image-generating diffusion model, so that text can guide the diffusion process and hence image generation. This process of guided diffusion is behind the success of DALL·E 2.
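Here is a schematic of how text can steer the denoising process. The sketch shows classifier-free guidance, one common form of guided diffusion; the dummy model and the stand-in embedding are hypothetical placeholders so the example runs, not the actual DALL·E 2 or Imagen architecture.

```python
import torch

def model(x_t, t, text_embedding=None):
    """Hypothetical stand-in for a large text-conditioned denoising network."""
    base = 0.1 * x_t
    return base if text_embedding is None else base + 0.05 * text_embedding

def guided_noise_prediction(x_t, t, text_embedding, guidance_scale=7.5):
    """Predict the noise twice, with and without the prompt, and amplify the
    difference so each denoising step is pulled toward images matching the text."""
    eps_uncond = model(x_t, t, text_embedding=None)
    eps_cond = model(x_t, t, text_embedding=text_embedding)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

x_t = torch.randn(1, 3, 64, 64)              # a noisy "image"
prompt = torch.randn(1, 3, 64, 64)           # stand-in for an LLM text embedding
eps = guided_noise_prediction(x_t, t=500, text_embedding=prompt)
```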

Ho said these models have gone far beyond his expectations, and that he would not pretend he knew what was about to happen.

Generating Problems

The images from DALL·E 2 and similar models are not perfect. Large language models can reflect the racist and sexist biases of the texts they are trained on, texts often taken off the internet; an LLM that learns a probability distribution over such text internalizes those biases. Diffusion models, likewise, are often trained on un-curated images taken off the internet. Combining LLMs with today's diffusion models can therefore result in images that reflect society's ills.

Anandkumar has experienced this firsthand. When she tried to create a stylized version of herself, she was shocked: many of the images, she said, were sexually suggestive. She is not the only one.

These biases can be mitigated by checking both the inputs and the outputs of the models, which is an extremely difficult task. Nothing is a substitute for careful and extensive safety testing, Ho said, and this remains a big challenge for the field.

Anandkumar still believes in the power of generative modeling. She likes Richard Feynman's quote, "What I cannot create, I do not understand." That increased understanding makes it possible to produce synthetic training data for under-represented classes in predictive tasks, such as darker skin tones for facial recognition. Generative models may also give us insight into how our brains deal with noise, and building more sophisticated models could give artificial intelligence similar capabilities.

We are only beginning to explore what generative artificial intelligence can do.