Imagine going to your local hardware store and seeing a new kind of hammer on the shelf. It pounds faster and more accurately than other hammers, and it has rendered many of them obsolete. And there's more: with a few tweaks, the tool can be changed into a saw that cuts at least as fast and as accurately as any other option out there. A hammer like that, you might think, could drive the convergence of all tools into a single device.

A similar story is playing out among the tools of artificial intelligence. The versatile new hammer is a kind of artificial neural network called a transformer, which learns from existing data and uses what it has learned to perform a task. Originally designed to handle language, it has since begun making inroads into other areas of the field.

The transformer debuted in 2017 in a paper that declared that "Attention Is All You Need." In other approaches, a language model would first group together words that sit close to one another. The transformer, by contrast, runs processes so that every element in the input data connects, or pays attention, to every other element, a mechanism researchers call self-attention. That means the transformer can see traces of the entire data set as soon as it starts training.
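
To make the self-attention idea concrete, here is a minimal sketch in Python with NumPy. It is a simplified, untrained illustration rather than the code from the paper, and the toy dimensions and random projection matrices are assumptions chosen for the example: every token is scored against every other token, and those scores decide how much each one contributes to the output.

```python
import numpy as np

def softmax(scores):
    # Turn each row of raw scores into weights that sum to 1.
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token vectors.

    tokens: array of shape (sequence_length, d_model)
    Wq, Wk, Wv: projection matrices (learned in a real model, random here)
    """
    queries = tokens @ Wq
    keys = tokens @ Wk
    values = tokens @ Wv
    # Every token scores its relevance to every other token ...
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    weights = softmax(scores)  # ... giving a full n-by-n attention map,
    return weights @ values    # which mixes information across all positions.

# Toy example: 5 "word" vectors with 8 features each.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)  # (5, 8): one updated vector per token
```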

Before transformers came along, one computer scientist observed, natural language processing was in some sense a latecomer compared with other areas of AI. Transformers changed that.

Transformers quickly became the front-runner for applications like word recognition that focus on analyzing and predicting text. That led to a wave of tools, like OpenAI's Generative Pre-trained Transformer 3 (GPT-3), which trains on hundreds of billions of words and generates coherent new text to an unnerving degree.
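
As a loose illustration of that generate-the-next-word loop, here is a toy Python sketch. The simple word-pair counts below stand in for the transformer that GPT-3 actually uses, and the tiny corpus is invented for the example; the point is only to show how new text can be sampled, one word at a time, from patterns in training data.

```python
import random
from collections import defaultdict

# Invented miniature "training corpus" for illustration only.
corpus = ("the owl spied a squirrel and the owl grabbed "
          "the tail of the squirrel").split()

# Record which words follow which in the corpus.
next_words = defaultdict(list)
for current, following in zip(corpus, corpus[1:]):
    next_words[current].append(following)

def generate(start, length=8):
    # Repeatedly sample a plausible next word, append it, and continue.
    words = [start]
    for _ in range(length):
        candidates = next_words.get(words[-1])
        if not candidates:
            break
        words.append(random.choice(candidates))
    return " ".join(words)

print(generate("the"))  # e.g. "the owl grabbed the tail of the squirrel and"
```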

The success of transformers prompted the AI crowd to ask what else they could do. The answer is unfolding now: researchers report that transformers are proving surprisingly versatile. In some tasks, neural nets that use transformers have become faster and more accurate than nets that don't, and emerging work suggests transformers can handle even more.

"Transformers seem to be quite transformational across many problems in machine learning, including computer vision." said Vladimir Haltakov, who works on computer vision related to self-driving cars at BMW.

Just 10 years ago, the disparate subfields of artificial intelligence had little to say to each other. But the transformer suggests the possibility of a convergence, said Atlas Wang, a computer scientist at the University of Texas, Austin.

From Language to Vision

One of the most promising steps toward expanding the range of transformers began just months after the release of "Attention Is All You Need." Alexey Dosovitskiy, a computer scientist working on computer vision, was, like nearly everyone else in the field, using convolutional neural networks (CNNs), which for years had propelled all major leaps forward in deep learning and computer vision.

CNNs work by repeatedly applying filters to the pixels of an image to build up a recognition of features. It's thanks to convolutions that photo apps can organize your library by faces or tell one kind of object apart from another. For vision tasks, CNNs had come to be seen as indispensable.

One of the biggest challenges in the field was to scale up CNNs to train on ever-larger data sets without piling on processing time. Dosovitskiy and his colleagues, he said, were clearly inspired by what was happening in language. They wondered whether they could do something similar in vision: after all, if transformers could handle big data sets of words, why not pictures?

The researchers presented the resulting model, called the Vision Transformer (ViT), at a conference in May 2021. Its architecture was almost identical to that of the original transformer, with only minor changes allowing it to analyze images instead of words.

The team knew they couldn't exactly mimic the language approach, since applying self-attention to every individual pixel would take far too much computing time. Instead, they divided the larger image into square units, or tokens. The token size is somewhat arbitrary: it can be made larger or smaller depending on the resolution of the original image. But by processing pixels in groups and applying self-attention to each group, the ViT was able to quickly spit out increasingly accurate classifications.
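
To give a sense of that patch-into-token step, here is a brief Python sketch with NumPy. It is a simplified illustration rather than the ViT authors' code, and the 16-pixel patch size, the 224-by-224 image, and the random projection are assumptions chosen for the example.

```python
import numpy as np

def image_to_tokens(image, patch_size=16):
    """Split an image of shape (height, width, channels) into square patches.

    Each flattened patch becomes one "token" vector, analogous to a word.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    patches = []
    for row in range(0, h, patch_size):
        for col in range(0, w, patch_size):
            patch = image[row:row + patch_size, col:col + patch_size, :]
            patches.append(patch.reshape(-1))  # flatten the square patch
    return np.stack(patches)  # shape: (num_patches, patch_size * patch_size * c)

# Toy example: a 224x224 RGB image becomes a sequence of 196 patch tokens,
# which a (here random) linear projection maps to the model's embedding width.
image = np.random.rand(224, 224, 3)
tokens = image_to_tokens(image)
embeddings = tokens @ np.random.rand(tokens.shape[1], 768)
print(tokens.shape, embeddings.shape)  # (196, 768) (196, 768)
```

From there, the sequence of patch embeddings is processed with the same kind of self-attention machinery sketched earlier.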

The transformer classified images with over 90% accuracy, a far better result than anything Dosovitskiy expected, propelling it quickly to the top of the pack at the ImageNet classification challenge, a seminal image recognition contest. ViT's success suggested that maybe convolutions aren't as fundamental to computer vision as researchers had thought.

CNNs are likely to be replaced by vision transformers or their derivatives in the future, according to Neil Houlsby, who worked with Dosovitskiy to develop ViT. Those future models, he said, may be pure transformers or approaches that add self-attention to existing models.

Additional results strengthen these predictions. At the start of 2022, an updated version of ViT was second on the ImageNet benchmark only to a newer approach that combines CNNs with transformers; CNNs without transformers barely made the top 10.

How Transformers Work

The ImageNet results showed that transformers could compete with leading CNNs. But Maithra Raghu, a computer scientist at Google's Mountain View office, wanted to know whether they perceive images the same way CNNs do. Neural nets are notorious for being indecipherable black boxes, but there are ways to peek inside, such as examining the net's input and output, layer by layer, to see how the training data flows through. Raghu's group did just that.

Her group identified ways in which self-attention leads to a different means of perception. The power of a transformer comes from how it processes an image's data. A CNN identifies features like corners or lines by building its way up from the local to the global. In a transformer, by contrast, even the first layer of information processing makes connections between distant image locations. If a CNN's approach is like starting at a single point and zooming out, a transformer slowly brings the whole fuzzy image into focus.

The difference is easier to grasp in the realm of language. Consider these sentences: "The owl spied a squirrel. It tried to grab it with its talons but only got the end of its tail." What do those "it"s refer to? A CNN that focuses only on the words immediately around each "it" would struggle, but a transformer connecting every word to every other word could discern that the owl did the grabbing and the squirrel lost part of its tail.
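
As a rough structural illustration, the Python snippet below shows which words each mechanism can relate in a single layer; it is a toy sketch, not a trained model, and the window size of three words is an assumption for the example. A convolution only reaches a small fixed neighborhood, while self-attention reaches every word at once.

```python
words = ("The owl spied a squirrel . "
         "It tried to grab it with its talons "
         "but only got the end of its tail .").split()

def conv_reach(i, kernel_size=3):
    # A 1-D convolution at position i sees only a small neighborhood of words.
    half = kernel_size // 2
    return words[max(0, i - half): i + half + 1]

def attention_reach(i):
    # Self-attention at position i can weigh every word in the sequence.
    return list(words)

it_index = words.index("It")  # the first ambiguous pronoun
print(conv_reach(it_index))                # ['.', 'It', 'tried']: "owl" is out of reach
print("owl" in attention_reach(it_index))  # True: available from the very first layer
```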

An infographic explaining the differences in how a convolutional neural network and a transformer process images.

Now that it was clear that transformers processed images differently from convolutional networks, researchers grew even more excited. The transformer's versatility in moving from a one-dimensional string, like a sentence, to a two-dimensional array, like an image, suggests that it could handle data of many other flavors. Wang thinks the transformer may be a big step toward a kind of convergence of neural net architectures, resulting in a universal approach to computer vision, and perhaps to other AI tasks as well.

Convergence Coming

Now researchers want to use transformers to create new images. Language tools such as GPT-3 can generate new text from their training data; in a paper presented last year, Wang combined two transformer models in an effort to do the same for images. When the double transformer network trained on the faces of more than 200,000 celebrities, it synthesized new facial images. The invented celebrities are at least as convincing as those created by CNNs, according to the inception score, a standard way of evaluating images generated by a neural net.

The transformer's success at generating images is even more surprising than its ability to classify them, even though the generative approach closely resembles the one used for classification.

Both Wang and Raghu see potential for new uses of transformers in multimodal processing, in which a single model handles several kinds of input at once. Transformers, they suggest, offer a way to combine multiple input sources.