DALL-E 2 results for “Teddy bears mixing sparkling chemicals as mad scientists, steampunk.”
OpenAI

DALL-E 2, the new version of OpenAI's text-to-image generation program, is a higher-resolution and lower-latency version of the original system. New capabilities include editing an existing image. The tool isn't being released directly to the public, but researchers can sign up to preview the system online, and OpenAI hopes to later make it available for use in third-party apps.

The original DALL-E, a portmanteau of the artist Salvador Dalí and the robot WALL-E, debuted in January 2021. It was a limited test of artificial intelligence's ability to visually represent concepts, from mundane depictions of a mannequin in a flannel shirt to “a giraffe made of a turtle.” OpenAI said at the time that it would continue to build on the system while examining potential dangers like bias in image generation or the production of misinformation. The company is attempting to address those issues using technical safeguards and a new content policy, while also reducing its computing load and pushing forward the basic capabilities of the model.

A DALL-E 2 result for “Shiba Inu dog wearing a beret and black turtleneck.”

One of the new DALL-E 2 features, inpainting, applies DALL-E's text-to-image capabilities on a more granular level. Users can start with an existing picture, select an area, and tell the model to edit it. You can block out a painting on a living room wall and replace it with a different picture, for instance, or add a vase of flowers to a coffee table. The model can fill in (or remove) objects while accounting for details like the direction of shadows in a room. Another feature, Variations, is something like an image search tool for pictures that don't exist: users upload a starting image, and the model creates a range of variations on it. They can also blend two images, generating pictures that have elements of both. The generated images are 1,024 x 1,024 pixels, a leap over the 256 x 256 pixels the original model produced.
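
OpenAI later exposed this kind of mask-based editing through its public API. The sketch below shows roughly how the workflow fits together, assuming the current openai Python SDK; the file names, prompt, and coordinates are illustrative. The mask is simply a copy of the image with the region to repaint made fully transparent.

```python
# Hedged sketch of mask-based inpainting via the openai Python SDK.
# File names and coordinates are illustrative, not from the article.
from openai import OpenAI
from PIL import Image

# Build the mask: fully transparent pixels mark the region to repaint.
room = Image.open("living_room.png").convert("RGBA")
mask = room.copy()
transparent = Image.new("RGBA", (280, 180), (0, 0, 0, 0))
mask.paste(transparent, (420, 80))  # illustrative box around the wall painting
mask.save("mask.png")

client = OpenAI()
result = client.images.edit(
    image=open("living_room.png", "rb"),
    mask=open("mask.png", "rb"),
    prompt="a framed landscape painting on the wall",
    n=1,
    size="1024x1024",
)
print(result.data[0].url)  # URL of the edited image
```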

DALL-E 2 builds on CLIP, a computer vision system that OpenAI also announced last year. The original DALL-E's word-matching approach didn't necessarily capture the qualities humans found most important, and its predictive process limited the realism of the images. CLIP was designed to look at images and summarize their contents the way a human would, and OpenAI inverted that process to create “unCLIP,” which starts with a description and works its way toward an image. DALL-E 2 generates the image using a process called diffusion, which OpenAI researcher Prafulla Dhariwal describes as starting with a “bag of dots” and then filling in a pattern with greater and greater detail.
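
To make the “bag of dots” description concrete, here is a toy sketch of diffusion sampling in Python/PyTorch. It is illustrative only: the noise schedule is a common textbook default, and the placeholder noise predictor stands in for the large trained network that a system like DALL-E 2 conditions on text.

```python
# Toy sketch of DDPM-style diffusion sampling: start from pure noise
# (the "bag of dots") and iteratively denoise it into a sample.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # noise schedule (illustrative default)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def model(x, t):
    # Placeholder noise predictor; a real system uses a trained network here.
    return torch.zeros_like(x)

x = torch.randn(1, 3, 64, 64)              # step 0: pure noise
for t in reversed(range(T)):
    eps = model(x, t)                      # predict the noise present in x
    coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
    mean = (x - coef * eps) / torch.sqrt(alphas[t])   # DDPM posterior mean
    noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
    x = mean + torch.sqrt(betas[t]) * noise           # re-inject a little noise
# With a trained model, x would now be an image filled in with detail.
```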

An existing image of a room with a flamingo added in one corner.

A draft paper on unCLIP notes that it is partly resistant to a very funny weakness of CLIP: the fact that people can fool the model's identification abilities by labeling one object (like an apple) with a word indicating something else (like an iPod). The variations tool is still able to generate pictures of apples with high probability even when given such a mislabeled picture; per the paper, the model never produces pictures of iPods, despite the very high predicted probability of this caption.
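
For the curious, the typographic attack itself is easy to reproduce against the open-source CLIP model (the clip package OpenAI released). The sketch below is a standard zero-shot comparison; the file name is an illustrative stand-in for a photo of an apple with a paper “iPod” label stuck to it.

```python
# Sketch of CLIP's typographic weakness, assuming the openai/CLIP package
# (pip install git+https://github.com/openai/CLIP.git). The image file name
# is illustrative, not from the article.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("apple_with_ipod_label.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of an apple", "a photo of an iPod"]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

# With the handwritten label attached, CLIP typically assigns most of the
# probability to "iPod," even though the object is an apple.
print(probs)
```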

DALL-E's full model was never released publicly, but other developers have built their own versions of it over the past year. One of the most popular mainstream applications is Wombo's Dream mobile app, which generates pictures of whatever users describe in a variety of art styles. OpenAI isn't releasing any new models today, but developers could use its technical findings to update their own work.

A DALL-E 2 result for “a bowl of soup that looks like a monster, knitted out of wool.”

OpenAI has implemented some built-in safeguards. The model was trained on data from which some objectionable material had been weeded out, limiting its ability to produce objectionable content. Output carries a watermark indicating the AI-generated nature of the work, although it could theoretically be cropped out. As a preemptive anti-abuse measure, the model also can't generate recognizable faces based on a name; even asking for the Mona Lisa would apparently return a variant of the actual face from the painting.

DALL-E 2 will be tested by some partners, with caveats. Users are banned from uploading or generating images that could cause harm, such as hate symbols, nudity, or obscene gestures. “We hope to keep doing a staged process here, so we can keep evaluating from the feedback we get how to release this,” says Dhariwal.

Additional reporting by James Vincent.