Facebook introduces dataset and benchmarks to make AI more ‘egocentric’

Facebook today announced Ego4D, a long-term project aimed at solving AI research challenges in egocentric perception, or first-person perspective. The goal is to teach AI systems to understand and interact with the world the way humans do, rather than in the third-person, omniscient way most AI systems do today.

Facebook claims that AI that understands the world from a first-person point of view could enable previously unimaginable augmented and virtual reality (AR/VR) experiences. But the computer vision models that would form the foundation of this AI have historically learned from millions of photos and videos captured in third person. Next-generation AI systems, Facebook says, may need to learn from a different kind of data: first-person video.

Ego4D is a collaboration between universities and labs across nine countries, which together collected more than 2,200 hours of first-person video featuring over 700 participants going about their daily lives in 73 cities. Facebook funded the project through academic grants to each of the participating universities. In addition to this work, researchers from Facebook Reality Labs (Facebook's AR- and VR-focused research division) used Vuzix Blade smartglasses to collect another 400 hours of first-person video data in staged environments in research labs.

Collecting data

Kristen Grauman, a lead research scientist at Facebook, says that today's computer vision systems don't relate to first- and third-person perspectives the way people do. A computer vision system strapped to a rollercoaster will likely have no idea what it's looking at, even if it was trained on thousands of images and videos of rollercoasters shot from the ground.

In a statement, Grauman said that for AI systems to interact with the world the way humans do, the field needs to shift to a paradigm of first-person perception. That means teaching AI to understand daily life through human eyes, in the context of real-time motion, interaction, and multisensory observation.

Ego4D was created to address challenges in embodied AI, a field that aims to develop AI systems with a physical or virtual embodiment, such as robots. Embodied AI draws on the theory of embodied cognition, which holds that many features of psychology, human or otherwise, are shaped by the whole body of an organism. By applying this logic to AI, researchers hope to improve how systems such as chatbots, robots, autonomous vehicles, and smartglasses interact with people and their environments.

For Ego4D, teams at the partner universities handed out off-the-shelf head-mounted cameras (including ZShades and WeeViews) and other sensors to research participants so they could capture unscripted, first-person video of their everyday lives. The universities were:

University of Bristol
Georgia Tech
Carnegie Mellon University
Indiana University
International Institute of Information Technology
King Abdullah University of Science and Technology
University of Minnesota
National University of Singapore
University of Tokyo
University of Catania
Universidad de los Andes

Participants were asked to record eight-minute clips of everyday activities such as grocery shopping, cooking, playing games, and engaging in group activities with family and friends. Ego4D captures where the camera wearer chooses to look in a given environment, what they do with their hands (and the objects in front of them), and how they interact with other people.

Some of the footage was paired with 3D scans, motion data from inertial measurement units, and eye tracking. The video was de-identified in a three-step process: human review of all video files, automated blurring, and a final human review of the automated blurring. Facebook said that audio and unblurred faces were included only for participants who consented to share them.
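
For illustration only, an automated blurring pass of the kind described above might look something like the following minimal sketch, which uses OpenCV's stock face detector. It is an assumption about the general approach, not Facebook's actual de-identification tooling.

```python
# Hypothetical sketch of an automated face-blurring pass; not Ego4D's actual pipeline.
import cv2

# Haar cascade face detector bundled with OpenCV.
face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def blur_faces(frame):
    """Detect faces in a BGR frame and blur each detected region."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        roi = frame[y:y + h, x:x + w]
        frame[y:y + h, x:x + w] = cv2.GaussianBlur(roi, (51, 51), 30)
    return frame

def deidentify_video(in_path, out_path):
    """Read a video, blur faces frame by frame, and write the result."""
    reader = cv2.VideoCapture(in_path)
    fps = reader.get(cv2.CAP_PROP_FPS)
    width = int(reader.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(reader.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height))
    while True:
        ok, frame = reader.read()
        if not ok:
            break
        writer.write(blur_faces(frame))
    reader.release()
    writer.release()
```

In practice, the human review steps on either side of a pass like this are what catch the detector's misses, which is presumably why the process is described as three steps rather than one.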

Potential bias

Poor representation within computer vision datasets can cause harm, particularly because the AI field generally lacks clear descriptions of bias. Research has shown that ImageNet and OpenImages, two large image datasets, are U.S.- and Euro-centric and encode human-like biases about race, ethnicity, gender, and weight. Models trained on these datasets perform worse on images from the Global South. For example, images of grooms are classified with lower accuracy when they come from Ethiopia and Pakistan than when they come from the United States. And because images of concepts like "wedding" and "spices" look different across cultures, object recognition systems can fail to classify many of these objects when they come from the Global South.

Tech giants have put flawed models into production before. Zoom's virtual backgrounds and Twitter's automatic photo-cropping tool have been shown to disfavor people with darker skin. Google Photos once labeled Black people as "gorillas," and Google Cloud Vision, Google's computer vision service, labeled an image of a dark-skinned person holding a thermometer "gun" while labeling a similar image of a light-skinned person "electronic device." An audit has also revealed that OpenAI's Contrastive Language-Image Pre-training (CLIP) is susceptible to biases against people of certain genders and ages.

Facebook says Ego4D participants were recruited via word of mouth, ads, and community bulletin boards in the U.K. and other countries. They spanned a range of ages (97 were over 50 years old), professions (including bakers, carpenters, landscapers, and mechanics), and genders (45% were female, one identified as nonbinary, and three preferred not to say). According to the company, the project is expanding to include data from partners in additional countries, such as Colombia and Rwanda.

Facebook declined to say whether it took accessibility into account for users with mobility issues or other disabilities. People with disabilities may have gaits or movement patterns that look different to an algorithm trained on footage of able-bodied people. They may also have a stagger, slurred speech, or other characteristics that algorithms handle worse when the training data doesn't include enough footage of people who exhibit them.

To their credit, Facebook researchers and other contributors acknowledge that the Ego4D data carries biases. The locations are far from covering the entire globe, they write, and the majority of camera wearers are located in urban areas or college towns. Because of the pandemic, there is ample footage of stay-at-home scenarios such as cooking and cleaning, but less footage from public events. And because battery life made daylong filming impossible, the Ego4D videos tend to capture the more active parts of participants' days.

Benchmarks

Beyond the data collection, Ego4D introduces a set of new benchmark tasks, which Grauman believes is just as important. She said a major accomplishment of the project has been defining what it even means to have intelligent egocentric perception: recalling the past, anticipating the future, and interacting with people and objects.

These are the benchmarks:

Episodic memory: AI could answer freeform questions and extend personal memory by retrieving key moments from past video. The model must localize the answer to a query within past video frames and, where relevant, provide 3D spatial directions in the environment (a rough illustration of such a query follows this list).

Forecasting: AI could anticipate how the camera wearer's actions will affect the future state of the world, such as where they are likely to move and what objects they are likely to touch. Forecasting actions requires not only recognizing what has happened but looking ahead to anticipate next moves.

Hand-object interaction: Understanding how people interact with everyday objects is essential for coaching and instructing. AI must recognize objects, detect grasps, and detect changes in object state. Robot learning is another motivation for this thrust, since a robot could gain experience vicariously by observing people in video.

Audiovisual diarization (AVD): Humans use sound to understand the world and to identify who said what and when. AI of the future could do the same.

Social interaction: Understanding social interaction is a key component of any intelligent AI assistant. A socially intelligent AI would recognize who is talking to whom and who is paying attention to whom.
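
To make the episodic memory task more concrete, here is a minimal, hypothetical sketch in Python. The record fields, names, and the use of temporal intersection-over-union as a localization score are illustrative assumptions, not Ego4D's actual annotation schema or evaluation protocol.

```python
# Hypothetical sketch of an episodic-memory-style query; field names and the
# scoring choice are illustrative, not Ego4D's actual schema or metrics.
from dataclasses import dataclass


@dataclass
class MemoryQuery:
    video_id: str          # which first-person video the query refers to
    question: str          # freeform natural-language question
    answer_start_s: float  # ground-truth start of the answer window (seconds)
    answer_end_s: float    # ground-truth end of the answer window (seconds)


def temporal_iou(pred_start: float, pred_end: float,
                 gt_start: float, gt_end: float) -> float:
    """Intersection-over-union of a predicted and a ground-truth time window,
    a common way to score temporal localization."""
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = max(pred_end, gt_end) - min(pred_start, gt_start)
    return inter / union if union > 0 else 0.0


# Example: "Where did I last see my keys?" answered by a four-second window.
query = MemoryQuery("video_0042", "Where did I last see my keys?", 312.0, 316.0)
print(temporal_iou(310.0, 315.0, query.answer_start_s, query.answer_end_s))  # 0.5
```

The key idea the sketch captures is that the model's output is not a text answer but a span of past video (and potentially a location), which is then scored against a human-annotated ground-truth window.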

Building these benchmarks required annotating the Ego4D data with labels. Labels, the annotations from which AI models learn relationships in data, also bear the hallmarks of inequality. Amazon Mechanical Turk is a major venue for crowdsourced labeling work, but an estimated less than 2% of Mechanical Turk workers come from the Global South; the vast majority are based in the U.S. and India.

Facebook says it used third-party annotators, who were instructed to watch a five-minute clip, summarize it, and then rewatch it, pausing to write sentences about what the camera wearer did. According to the company, it collected a wide variety of label types, including narrations describing the camera wearer's activity, spatial and temporal labels on objects and actions, and multimodal speech transcription. In total, the consortium transcribed thousands of hours of video and compiled millions of annotations.
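
Purely as an illustration of what a single narration label of this kind could look like when stored, here is a small sketch; the structure and field names are assumptions, not the consortium's actual annotation format.

```python
# Hypothetical, simplified narration record; not the actual Ego4D annotation format.
import json

narration = {
    "video_id": "video_0042",        # which first-person clip this belongs to
    "timestamp_s": 84.2,             # moment in the video the sentence describes
    "narration": "The camera wearer opens the refrigerator and takes out a carton of milk.",
    "annotator_id": "annotator_17",  # who wrote the sentence
}

print(json.dumps(narration, indent=2))
```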

Ego4D's annotations were done by crowdsourced workers at two sites in Africa. The Ego4D researchers note in the paper that this means the language-based narrations will be at least subtly biased toward their local word choices.

Future steps

It is still early days, but Facebook says it is developing assistant-inspired research prototypes that can draw on knowledge rooted in the physical environment. Grauman said that not only will AI start to understand the world around it better, it could one day be personalized at an individual level, knowing your favorite coffee mug or guiding the itinerary for your next family trip.

Facebook says the Ego4D university consortium will release its data in the coming months. The company also plans to launch a challenge inviting researchers to develop AI that understands daily activities from a first-person perspective.

These efforts coincide with the recent rebranding of Facebook's VR social network, Facebook Horizon, to Horizon Worlds. Horizon Worlds, which remains in closed beta, aims to give developers creation tools for designing environments comparable to those in rival apps such as Rec Room, Microsoft-owned AltSpace, and VRChat. If Ego4D achieves its goals, it could give Facebook an edge in a lucrative market; Rec Room and VRChat have billion-dollar valuations despite being pre-revenue.

For now, this is a large, clean dataset that isn't especially interesting or notable on its own, Mike Cook, an AI researcher at Queen Mary University, told VentureBeat via email. What it does imply, he said, is significant investment in the future of egocentric AI and in the idea of cameras recording our lives from a first-person perspective. Cook argued that this isn't actually solving a pressing or important problem in AI unless you're a major tech company that wants to sell wearable cameras, though he added that it does offer some insight into Facebook's future plans.

Egocentric, perspective-aware AI is just one piece of Facebook's vision of the metaverse, a virtual world for games, entertainment, and more. Its Quest VR headsets and upcoming AR glasses are also key parts of that vision. The social network recently launched Ray-Ban Stories, smartglasses developed in collaboration with Ray-Ban that capture photos and videos with built-in cameras and microphones. And Facebook continues to refine the technologies it acquired from Ctrl-labs, a New York-based startup developing a wristband that translates neuromuscular signals into machine-interpretable commands.

Realizing Facebook's vision for the metaverse, however, has been slowed by technical and political obstacles.

CEO Mark Zuckerberg recently described AR glasses as one of the most difficult technical challenges of the next decade and said the technology is still years away from consumers. Meanwhile, Facebook's VR headsets have yet to overcome the limitations plaguing the wider industry, such as blurry imagery and virtual reality sickness.

It's also unclear how a slowdown in internal product development might affect Facebook's metaverse-related efforts. The Wall Street Journal reported last week that Facebook has delayed rolling out some products amid articles and hearings about internal documents indicating harms caused by its platforms. According to the report, a company team is reviewing all internal research that could damage Facebook's image if made public, and the company is conducting reputational reviews to examine how it might be criticized.

Likely in a bid to preempt criticism of its VR and AR programs, Facebook says it is soliciting proposals for related research. As for Ego4D, the dataset isn't planned to be made fully public; instead, the company says researchers must request limited access to the data and review and agree to each Ego4D partner's license terms. Facebook also says it has placed restrictions on the use of images from the dataset, preventing the training of algorithms on headshots.