In imitation learning, neural networks learn to perform tasks by watching humans do them, an approach already used to train AI systems to drive cars and control robot arms.
The internet is full of video, and the researchers hope to use that resource to do for imitation learning what vast amounts of online text did for large language models like GPT-3. In the last few years, striking capabilities have emerged from big models trained on enormous swathes of the internet, in part because those models capture what humans do when they go online.
The problem with existing approaches to imitation learning is that video demonstrations need to be labeled at each step: doing this makes this happen, doing that makes that happen, and so on. Annotating by hand like that is a lot of work, so Baker and his colleagues wanted to find a way to use the millions of videos already online.
The video pre-training approach gets around this problem by training another neural network to label the videos automatically. The team hired crowd workers to play the game and recorded their keystrokes and mouse clicks alongside the video of their play. The researchers then used the 2,000 hours of annotated gameplay to train a model that matches actions to their outcomes in the game: clicking a certain mouse button, for example, makes the character swing its axe.
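To make that labeling step concrete, here is a minimal sketch in PyTorch of training such a labeler, a model that looks at a short clip of gameplay and predicts which key press or mouse click the player was making. The architecture, action-set size, clip length, and random stand-in data are illustrative assumptions, not OpenAI's actual implementation.

```python
# Sketch of the labeling model: given a short window of gameplay frames,
# predict which keyboard/mouse action the player took. Architecture, shapes,
# and the random stand-in data are illustrative assumptions only.
import torch
import torch.nn as nn

NUM_ACTIONS = 32        # assumed size of a discretized keyboard/mouse action set
FRAMES_PER_CLIP = 8     # assumed temporal context around the labeled moment

class ActionLabeler(nn.Module):
    def __init__(self):
        super().__init__()
        # Small 3D conv stack over (time, height, width); the real model is far larger.
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.head = nn.Linear(64, NUM_ACTIONS)

    def forward(self, clips):                  # clips: (batch, 3, time, height, width)
        features = self.encoder(clips).flatten(1)
        return self.head(features)             # logits over the action set

labeler = ActionLabeler()
optimizer = torch.optim.Adam(labeler.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# Stand-in for the 2,000 hours of crowd-worker clips paired with recorded actions.
clips = torch.randn(4, 3, FRAMES_PER_CLIP, 64, 64)
actions = torch.randint(0, NUM_ACTIONS, (4,))

loss = loss_fn(labeler(clips), actions)   # supervised learning on the annotated play
loss.backward()
optimizer.step()
print(f"labeling loss: {loss.item():.3f}")
```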
The next step was to use this model to generate action labels for tens of thousands of hours of unlabeled video taken from the internet and then train the Minecraft bot on this much larger dataset.
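A similarly hedged sketch of that second stage: the trained labeler assigns pseudo-action labels to frames scraped from unlabeled internet video, and the bot is then trained by behavioral cloning to reproduce those labeled actions. The tiny linear models and random tensors here stand in for the real networks and the scraped video.

```python
# Sketch of the second stage: pseudo-label unlabeled internet gameplay with the
# trained labeler, then train the bot (policy) to imitate those actions.
# Tiny models and random tensors are illustrative stand-ins only.
import torch
import torch.nn as nn

NUM_ACTIONS = 32

labeler = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, NUM_ACTIONS))  # assumed pretrained
policy = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, NUM_ACTIONS))   # the Minecraft bot

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# Stand-in for a batch of frames taken from unlabeled internet video.
internet_frames = torch.randn(16, 3, 64, 64)

# 1) Pseudo-labeling: the labeler guesses the action taken at each frame.
with torch.no_grad():
    pseudo_actions = labeler(internet_frames).argmax(dim=1)

# 2) Behavioral cloning: train the policy to predict those actions from the frames.
loss = loss_fn(policy(internet_frames), pseudo_actions)
loss.backward()
optimizer.step()
print(f"behavioral cloning loss: {loss.item():.3f}")
```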
Peter Stone is the executive director of Sony Artificial Intelligence America and has previously worked on imitation learning.