This kind of intuitive reasoning is called common sense. It includes knowing that the world is three-dimensional and that objects don't disappear when they pass out of view. It lets us predict where a bouncing ball or a speeding bike will be a moment from now. And if we hear a metallic crash from the kitchen, we can make an educated guess that someone has dropped a pan, because we know what kinds of objects make that noise.
Common sense tells us which events are likely and which are not. It lets us foresee the consequences of our actions and plan accordingly.
Common sense is difficult to teach to machines. Neural networks typically need to see thousands of examples before they begin to spot such patterns.
Much of common sense comes down to predicting what will happen next, and that ability, LeCun argues, is the essence of intelligence. To train their models, he and a handful of other researchers use video clips. Imagine holding up a pen and letting it go. Common sense tells you the pen will fall, but not the exact position where it will land; predicting that would require crunching some tough physics equations.
LeCun is therefore trying to train a neural network that focuses only on the relevant aspects of the world: predicting that the pen will fall, but not exactly how it falls. This network would play a role similar to the world model that animals use.
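The idea of predicting in an abstract space rather than in full detail can be made concrete with a toy sketch. This is not LeCun's actual architecture; every function and field name below is a hypothetical illustration of discarding irrelevant detail before predicting.

```python
# Hypothetical sketch: predict abstract outcomes, not exact trajectories.

def encode(observation):
    """Map a rich observation to an abstract state: keep only whether
    the pen is held and its rough height; discard texture, angle, etc."""
    return {"held": observation["held"], "height": observation["height"]}

def predict(state):
    """World-model step in the abstract space: a released pen falls.
    We predict *that* it falls, not its precise physics."""
    if state["held"]:
        return state  # nothing changes while the pen is held
    return {"held": False, "height": max(0.0, state["height"] - 1.0)}

# A rich observation full of details the prediction does not need
obs = {"held": False, "height": 3.0, "texture": "matte", "angle": 17.2}
state = encode(obs)
for _ in range(5):
    state = predict(state)
print(state)  # the pen ends up on the floor: height 0.0
```

The point of the sketch is that the predictor never sees the texture or angle at all; the encoder's job is to throw that detail away so prediction stays tractable.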
LeCun claims to have built an early version of this world model that can recognize objects, and he now wants to train it to make predictions. What he doesn't yet know is how the configurator should work. LeCun thinks of it as a neural network acting as a controller: it would decide what kind of predictions the world model should be making, and what level of detail it should focus on to make those predictions possible.
But LeCun doesn't know how to train a neural network to do that job. "We don't have a good recipe yet to make this work," he says.
The world model and configurator are two key pieces in a larger system, known as a cognitive architecture, that includes other neural networks.
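The division of labor between configurator and world model can be sketched in a few lines. None of these classes come from LeCun's published system; they are hypothetical names that only illustrate the controller relationship described above, under the assumption that the configurator's job is to pick a detail level for the world model's predictions.

```python
# Hypothetical sketch of a configurator steering a world model.

class WorldModel:
    """Predicts the next abstract state at an externally chosen detail level."""
    def __init__(self):
        self.detail = "coarse"

    def predict(self, state):
        if self.detail == "coarse":
            return {"pen": "falling"}  # just the outcome
        # fine detail: also track the pen's rough height
        return {"pen": "falling", "height": state.get("height", 0) - 1}

class Configurator:
    """Controller: decides what the world model should focus on per task."""
    def configure(self, model, task):
        model.detail = "fine" if task == "catch the pen" else "coarse"

model, config = WorldModel(), Configurator()
config.configure(model, "will it fall?")
print(model.predict({"height": 3}))  # {'pen': 'falling'}
config.configure(model, "catch the pen")
print(model.predict({"height": 3}))  # {'pen': 'falling', 'height': 2}
```

In a full cognitive architecture, the other modules would sit alongside these two, with the configurator tuning each of them for the task at hand.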