Google today detailed Imagen Video, an artificial intelligence system that can generate video clips from a text prompt. The results aren't perfect: the looping clips the system generates tend to have artifacts and noise.
Text-to-video systems aren't new; a group of researchers from Tsinghua University and the Beijing Academy of Artificial Intelligence previously released a tool that can translate text into short, high-fidelity clips. But Imagen Video appears to be a significant leap over the previous state of the art.
Matthew Guzdial, an assistant professor at the University of Alberta who studies artificial intelligence and machine learning, said it was a clear improvement. But he noted that, even though Google's comms team is selecting the best outputs, there's still weird blurriness in the clips. In his view, this won't be used in animation or TV in the near future, though it could be embedded in tools to speed things up.
Imagen Video is a diffusion model, like OpenAI's DALL-E 2 and Stable Diffusion. A diffusion model generates new data (in this case, video) by learning how to "destroy" and "recover" many existing samples of data. As it's fed the existing samples, the model gets better at recovering the data it had destroyed.
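To make the "destroy and recover" idea concrete, here's a minimal, self-contained sketch of a diffusion-style training step: a toy denoiser is trained to predict the noise blended into fake data. The network, shapes, and noise schedule are illustrative stand-ins, not Google's actual implementation.

```python
# Minimal sketch of the diffusion idea: corrupt ("destroy") samples with
# noise, then train a network to recover them. Toy illustration only --
# not Google's actual Imagen Video code.
import torch
import torch.nn as nn

denoiser = nn.Sequential(  # stand-in for the real video U-Net
    nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64)
)
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

def training_step(clean_batch):
    # Forward process: blend each sample with Gaussian noise at a
    # random strength t in [0, 1] (the "destroy" step).
    t = torch.rand(clean_batch.shape[0], 1)
    noise = torch.randn_like(clean_batch)
    noisy = (1 - t) * clean_batch + t * noise
    # Reverse process: the model learns to predict the injected noise,
    # which amounts to learning how to "recover" the sample.
    pred = denoiser(noisy)
    loss = nn.functional.mse_loss(pred, noise)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

batch = torch.randn(32, 64)  # fake "data" standing in for video frames
print(training_step(batch))
```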
The system takes a text description and generates a 16-frame, three-frames-per-second video at a resolution of 24 by 48 pixels. It then upscales the video and "predicts" additional frames, producing a final 128-frame, 24-frames-per-second video at 1280 by 768.
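For a sense of scale, here's a small sketch computing how much spatial and temporal upsampling that cascade implies. The numbers come from the description above; the stage breakdown is illustrative.

```python
# Rough arithmetic for the upsampling cascade described above.
base  = {"frames": 16,  "fps": 3,  "width": 48,   "height": 24}
final = {"frames": 128, "fps": 24, "width": 1280, "height": 768}

temporal  = final["frames"] / base["frames"]   # 8x more frames
fps_gain  = final["fps"] / base["fps"]         # 8x the frame rate
spatial_w = final["width"] / base["width"]     # ~26.7x wider
spatial_h = final["height"] / base["height"]   # 32x taller

print(f"{temporal:.0f}x frames, {fps_gain:.0f}x fps, "
      f"{spatial_w:.1f}x width, {spatial_h:.0f}x height")
```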
Imagen Video was trained on 14 million video-text pairs and 60 million image-text pairs, as well as the publicly available LAION-400M image-text dataset, which enabled it to generalize to a range of aesthetics. (Stable Diffusion was trained on a portion of LAION.) In experiments, the researchers were able to create videos in the style of Van Gogh's paintings. They also claim that Imagen Video demonstrated an understanding of depth and three-dimensionality, allowing it to create videos like drone flythroughs that capture objects from different angles.
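As a toy illustration of training jointly on video and image data, the sketch below samples a training source in proportion to its size. The mixture weights are made up for illustration and are not Google's actual data-sampling scheme.

```python
# Toy sketch of mixing the training sources named above. Sampling in
# proportion to dataset size is an assumption, not Google's recipe.
import random

sources = {
    "video_text_pairs": 14_000_000,
    "image_text_pairs": 60_000_000,
    "laion_400m": 400_000_000,
}
total = sum(sources.values())
weights = {name: n / total for name, n in sources.items()}

def sample_source():
    # Draw a training source per step; a real pipeline would then load
    # a batch from that source.
    return random.choices(list(weights), weights=list(weights.values()))[0]

print(sample_source())
```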
Unlike today's image-generating systems, Imagen Video can also render text legibly. Stable Diffusion and DALL-E 2 struggle to translate prompts like "a logo for 'Diffusion'" into readable type, at least judging by the paper.
That's not to say Imagen Video is without limitations. Even the cherry-picked clips are jittery and distorted in parts, as Guzdial alluded to, with objects that blend together in physically unnatural ways.
The problem of text-to-video is not solved, and we're unlikely to get something matching DALL-E 2 or Midjourney in quality anytime soon.
To improve upon this, the Imagen Video team plans to combine forces with the researchers behind Phenaki, another Google system that can turn long, detailed prompts into two-minute-plus videos.
There's a chance a collaboration between the teams could lead to something more capable: Phenaki prioritizes coherency and length. The system can turn a scene of a person riding a motorcycle into a minutes-long film, and it's amazing how closely the generated videos follow the long and nuanced text descriptions that prompted them.
Here's an example of a prompt fed to Phenaki:
Lots of traffic in futuristic city. An alien spaceship arrives to the futuristic city. The camera gets inside the alien spaceship. The camera moves forward until showing an astronaut in the blue room. The astronaut is typing in the keyboard. The camera moves away from the astronaut. The astronaut leaves the keyboard and walks to the left. The astronaut leaves the keyboard and walks away. The camera moves beyond the astronaut and looks at the screen. The screen behind the astronaut displays fish swimming in the sea. Crash zoom into the blue fish. We follow the blue fish as it swims in the dark ocean. The camera points up to the sky through the water. The ocean and the coastline of a futuristic city. Crash zoom towards a futuristic skyscraper. The camera zooms into one of the many windows. We are in an office room with empty desks. A lion runs on top of the office desks. The camera zooms into the lion’s face, inside the office. Zoom out to the lion wearing a dark suit in an office room. The lion wearing looks at the camera and smiles. The camera zooms out slowly to the skyscraper exterior. Timelapse of sunset in the modern city.
Here's the generated video.
Image Credits: Google
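Phenaki's approach, per its paper, is to generate video in chunks, conditioning each chunk on one piece of the story and on previously generated frames. The sketch below shows only that control flow, with a stub in place of the model (which isn't public); `generate_chunk` and the frame strings are hypothetical.

```python
# Sketch of Phenaki-style story conditioning: generate the video chunk
# by chunk, one sentence at a time, carrying earlier frames forward as
# context. `generate_chunk` is a stub, not the real model.
from typing import List

def generate_chunk(sentence: str, context_frames: List[str]) -> List[str]:
    # Stub: a real implementation would run the video model here,
    # conditioned on the text and the last few generated frames.
    return [f"frame({sentence[:24]!r}, t={i})" for i in range(3)]

def generate_story(prompt: str) -> List[str]:
    frames: List[str] = []
    for sentence in filter(None, (s.strip() for s in prompt.split("."))):
        frames.extend(generate_chunk(sentence, frames[-2:]))
    return frames

video = generate_story(
    "Lots of traffic in futuristic city. "
    "An alien spaceship arrives to the futuristic city."
)
print(len(video), video[0])
```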
The researchers note that the data used to train the system contained problematic content, which could lead Imagen Video to produce violent or sexually explicit clips. Google says it won't release the Imagen Video model or its source code until these concerns are mitigated, and unlike Meta, it isn't providing a sign-up form to register interest.
With text-to-video tech progressing at a rapid clip, it might not be long before an open source model emerges.