Text-to-image generators are the hot new trend in artificial intelligence. Feed these programs any text you like and they will generate strikingly accurate pictures that match the description. They can match a range of styles, from oil paintings to CGI renders and even photographs, and in many ways the only limit is your imagination.
The leader in the field has been DALL-E, a program created by OpenAI. But Imagen, which Google announced yesterday, has overtaken DALL-E in the quality of its output.
To understand the amazing capability of these models, it is best to look over some of the images they can generate. You can see more examples on Google's dedicated landing page for Imagen.
The text at the bottom of the image was the prompt fed into the program and the picture above was the output. That's all it takes. You type what you want to see and the program will generate it. It's pretty fantastic, right?
These pictures are impressive in their accuracy and coherence, but they should be taken with a pinch of salt. Research teams usually cherry-pick the best results, so these pictures may not represent the system's average output.
Remember: Google is only showing off the very best images
Images generated by text-to-image models often look unfinished, smeared, or blurry, problems we have seen in pictures generated by OpenAI's DALL-E program. More trouble spots for text-to-image systems are covered in this interesting thread, which shows the tendency of the system to misunderstand prompts and struggle with both text and faces.
According to DrawBench, a new benchmark Google created for this project, Imagen produces better images than DALL-E 2.
DrawBench is a simple metric: a list of 200 text prompts that were fed into Imagen and other text-to-image generators, with the output from each program judged by human raters. Humans preferred the output from Imagen to that of rivals, as shown in the graphs below.
It will be hard to judge this for ourselves, as the Imagen model is not available to the public. There is a good reason for this. Text-to-image models have a range of troubling applications. Imagine a system that creates images for fake news, hoaxes, or harassment, for example. These systems also encode social biases, and their output is often racist, sexist, or otherwise toxic.
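To make the idea of a human-preference benchmark concrete, here is a minimal sketch of how rater votes could be tallied into a preference share per model. It is not Google's DrawBench code. The function name, the model labels, the prompts, and the votes are all made up for illustration, and the sketch assumes each rater judgment is recorded as a single vote for one model.

```python
from collections import Counter


def preference_rates(votes):
    """Return the share of rater votes each model received.

    `votes` is a list of (prompt, preferred_model) pairs, one per judgment.
    This is a hypothetical format, not the one DrawBench uses.
    """
    counts = Counter(model for _prompt, model in votes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {model: n / total for model, n in counts.items()}


# Made-up votes for illustration only; these are not DrawBench results.
example_votes = [
    ("a red cube on top of a blue sphere", "model_a"),
    ("a red cube on top of a blue sphere", "model_b"),
    ("a storefront with a hand-painted sign", "model_a"),
    ("a storefront with a hand-painted sign", "model_a"),
]

rates = preference_rates(example_votes)
print(rates)
```

A share computed this way only says which output raters liked more on the chosen prompts. It says nothing about how good the average image is, which is why the choice of prompts and raters matters.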
A lot of this is due to how the systems are programmed. They are trained on huge amounts of data which they study for patterns and learn to replicate. The models need a lot of data, and most researchers have decided that it's too difficult to filter this input. They get huge amounts of data from the web, and as a result their models ingest all the bile you would expect to find online.
The large-scale data requirements of text-to-image models have led researchers to rely heavily on large, mostly uncurated, web-scraped datasets. These datasets tend to reflect social stereotypes, oppressive viewpoints, and derogatory, or otherwise harmful, associations to marginalized identity groups.
The adage of computer scientists still applies in the world of artificial intelligence: garbage in, garbage out.
According to Google, Imagen encodes several social biases and stereotypes, including an overall bias towards generating images of people with lighter skin tones and a tendency for images portraying different professions to align with Western gender stereotypes.
Researchers have found the same thing while evaluating DALL-E. Ask DALL-E to generate images of a flight attendant, and almost all the subjects will be women. Ask for pictures of a CEO, and you get a bunch of white men.
OpenAI decided not to release DALL-E publicly, but it did give access to some people. It also filters certain text inputs in an attempt to stop the model being used to generate racist, violent, or pornographic imagery. The history of artificial intelligence tells us that text-to-image models will become public at some point in the future, with all the troubling implications that wider access brings.
Google has concluded that Imagen is not suitable for public use at the moment, and the company says it plans to develop a new way to benchmark social and cultural bias in future work. For now, we will have to be satisfied with the company's upbeat selection of images. That is just the tip of the iceberg, though: an iceberg made from the unintended consequences of technological research, if Imagen wants to have a go at generating that.