Facebook is researching AI systems that see, hear, and remember everything you do

Facebook has been investing heavily in augmented reality, including building its own AR glasses with Ray-Ban. Right now, these gadgets can only record and share images, but what does the company think such devices will be used for in the future?
The scope of Facebook's ambitions is evident in a new research project led by its AI team. The company envisions AI systems that continuously analyze people's lives through first-person video, recording what they see, hear, and do in order to help them with everyday tasks. Facebook's researchers have outlined a series of skills they want these systems to develop, including episodic memory (answering questions like "Where did I leave my keys?") and audio-visual diarization (remembering who said what, and when).

There are many ways this kind of research could be applied in the future.

Facebook stresses that, right now, none of these tasks can be reliably achieved by any AI system. It is clear, however, that the company sees this kind of functionality as the future of AR computing. Kristen Grauman, a Facebook AI researcher, said the project reflects how the company is thinking about augmented reality and what it would like to do with it down the road.

Such ambitions have serious privacy implications. Privacy experts are already worried about how Facebook's AR glasses could allow wearers to covertly record members of the public. Those concerns will only be magnified if future versions of the hardware not only record footage but analyze and transcribe it, turning wearers into walking surveillance machines.

Facebook's research project is called Ego4D, which refers to the analysis and interpretation of egocentric, or first-person, video. It consists of two main components: an open dataset of egocentric video, and a set of benchmarks that Facebook believes AI systems should be able to tackle in the future.

Facebook collected 3,205 hours of first-person footage from around the globe

The dataset is the largest of its kind, and Facebook partnered with 13 universities to collect it. In total, 855 participants in nine different countries recorded the footage. The universities, not Facebook, were responsible for collecting the data. Participants, some of whom were paid, wore GoPro cameras and AR glasses to capture unscripted activity, ranging from baking and construction work to socializing with friends and playing with pets. All of the footage was de-identified by the universities, which included blurring the faces of bystanders and removing personally identifiable information.

Grauman says the dataset is the first of its kind in both scale and diversity, exposing AI systems to footage from around the world, including places like the United Kingdom and Sicily.

The second component of Ego4D is the set of tasks, or benchmarks, that Facebook wants researchers to try to solve using AI systems trained on its dataset. The company describes these as:

Episodic memory: What happened when? (e.g., "Where did I leave my keys?")
Forecasting: What am I likely to do next? (e.g., "Wait, you've already added salt to this recipe.")
Hand and object manipulation: What am I doing? (e.g., "Teach me how to play the drums.")
Audio-visual diarization: Who said what, and when? (e.g., "Who said what during class?")
Social interaction: Who is interacting with whom? (e.g., "Help me hear the person speaking to me at this noisy restaurant.")
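To make the first of these tasks more concrete, here is a minimal sketch of how an episodic memory query might be answered. This is not Facebook's method and is not part of Ego4D: the EpisodicMemoryIndex class, the bag-of-words embed() stub, and the sample moments are all illustrative assumptions, standing in for a real vision-language model that would embed video frames and text queries into a shared space.

    import zlib
    from dataclasses import dataclass

    import numpy as np

    @dataclass
    class Moment:
        """One remembered slice of first-person video (hypothetical)."""
        timestamp_s: float     # when the moment was captured
        description: str       # e.g., output of a captioning model
        embedding: np.ndarray  # vector used for retrieval

    def embed(text: str, dim: int = 256) -> np.ndarray:
        # Stand-in for a real vision-language encoder: a deterministic
        # bag-of-words hash embedding. Texts that share words end up
        # with similar vectors, which is all this demo needs.
        vec = np.zeros(dim)
        for word in text.lower().split():
            rng = np.random.default_rng(zlib.crc32(word.encode()))
            vec += rng.standard_normal(dim)
        return vec

    class EpisodicMemoryIndex:
        """Stores embedded moments and answers 'where/when' queries."""

        def __init__(self) -> None:
            self.moments: list[Moment] = []

        def add(self, moment: Moment) -> None:
            self.moments.append(moment)

        def query(self, text: str, top_k: int = 1) -> list[Moment]:
            # Rank stored moments by cosine similarity to the query.
            q = embed(text)
            def cosine(m: Moment) -> float:
                return float(m.embedding @ q /
                             (np.linalg.norm(m.embedding) * np.linalg.norm(q)))
            return sorted(self.moments, key=cosine, reverse=True)[:top_k]

    # Index two moments, then ask the classic Ego4D-style question.
    index = EpisodicMemoryIndex()
    for t, desc in [(12.5, "picked up the keys from the hallway table"),
                    (340.0, "set my keys down on the kitchen counter")]:
        index.add(Moment(t, desc, embed(desc)))

    best = index.query("where did I leave my keys")[0]
    print(f"Keys last seen at t={best.timestamp_s}s: {best.description}")

In a real system, the embeddings would come from a model trained on footage like Ego4D's, and the index would be filled continuously from the glasses' video stream; retrieval by similarity over timestamped moments is simply one plausible shape for the episodic memory task.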

Right now, AI systems would find tackling any of these problems incredibly difficult, but creating datasets and benchmarks is a tried-and-tested method of spurring development in AI.

Indeed, the recent AI boom is often credited to one particular dataset and its associated annual competition: ImageNet. The ImageNet dataset consists of pictures of a huge variety of objects, which researchers used to train AI systems to identify them. The competition's 2012 winning entry used deep learning to decisively outperform its rivals, inaugurating the current era of research.

Facebook hopes its Ego4D project will have a similar effect on the world of augmented reality. Systems trained on Ego4D might one day be used not only in wearable cameras but also in home assistant robots, which likewise rely on first-person cameras to navigate the world around them.

Grauman says the project has the chance to catalyze work in this area in a way that hasn't been possible before, moving the field from analyzing piles of photos and videos that humans took with a specific purpose to the fluid, continuous first-person visual stream that AR systems and robots will need to understand in the context of ongoing activity.

Many will be concerned by Facebook's development of AI surveillance systems

While the tasks Facebook outlines seem practical, many will be worried by the company's interest in this area. Facebook's record on privacy is dismal, marked by data leaks and a $5 billion fine from the FTC. It has also been shown repeatedly that the company values growth and engagement over users' well-being across many domains. With this in mind, it is worrying that the Ego4D benchmarks lack prominent privacy safeguards. The audio-visual diarization task, for example, which involves transcribing what different people say, makes no mention of removing data about people who don't want to be recorded.

When asked about these issues, a Facebook spokesperson said privacy safeguards would likely be introduced further down the line. The spokesperson said that as companies use this benchmark and dataset to develop commercial applications, they are expected to build in such safeguards. For example, before AR glasses enhance someone's voice, they could ask that person's permission, or they could limit the device's range so it only picks up sounds from people the wearer is already interacting with or who are in their immediate vicinity.

For now, though, such safeguards are only hypothetical.