Microsoft's Project Alexandria parses documents using unsupervised learning

Microsoft's Project Alexandria research initiative was launched in 2014 by the company's Cambridge research division. It focuses on discovering entities of information and their properties. Alexandria grew out of the research lab's work on knowledge mining using probabilistic programming, with the goal of automatically building a complete knowledge base from a set of documents.

The recently launched Microsoft Viva Topics uses Alexandria technology to organize large amounts of content and expertise in an organization. The Alexandria team is responsible for identifying topics and metadata, using AI to parse content from datasets.

VentureBeat interviewed Naomi Moneypenny (Viva Topics director of product development), John Winn (Alexandria project lead), and Yordan Zaykov (Alexandria engineering manager) via email to get a sense of how far Alexandria has come. They discussed the project's goals, the major breakthroughs so far, and the challenges the development team faces that future innovation could overcome.

Understanding knowledge

Finding information within an enterprise can be difficult, and numerous studies suggest this hurts productivity. One survey found that employees could save four to six hours per week if they did not have to search for the information they need. Forrester also estimates that new employee onboarding could take 20% to 35% less time in common scenarios.

Alexandria addresses this problem in two ways: topic mining and topic linking. Topic mining is the process of discovering topics in documents and maintaining and updating those topics as they change. Topic linking brings together knowledge from different sources into one knowledge base.

When Winn first started out, machine learning was applied primarily to arrays of data such as images and audio.
Winn said he was interested in applying machine learning to more structured things, such as collections, strings, and objects with types and properties. Machine learning of this kind is well suited to knowledge mining, he explained, because knowledge has a rich and complex structure, and that structure must be captured to represent the world accurately and meet users' expectations.

Alexandria was created to extract knowledge automatically into a knowledge base. At first, it focused on mining knowledge from Wikipedia. A few years ago, the project moved to the enterprise, where it works with data such as emails and messages.

Winn described the transition to the enterprise as exciting. With public knowledge, human editors can create and maintain the knowledge base manually. Within an organization, he said, it is important to have the knowledge base created automatically so that knowledge is discoverable and usable for work. The knowledge base can still be edited manually to fill gaps and correct errors, and Alexandria's machine learning is designed to learn and improve from such feedback.

Knowledge mining

Alexandria performs topic mining and linking with a machine learning approach called probabilistic programming. A probabilistic program describes the process by which topics and their properties come to be mentioned in documents. The same program can then be run backwards to extract topics from documents. The advantage of this approach is that information about the task is built into the probabilistic program itself rather than supplied as labeled data. This allows the process to run unsupervised, meaning it can complete these tasks automatically, with no human input.

Much has been accomplished since the project's inception. The team developed numerous statistical types that let machine learning extract and represent large numbers of entities and properties, Zaykov said.
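
To make the idea of running a generative program backwards concrete, here is a toy sketch in Python. It is not Alexandria's model: the candidate names, the 0.9 accuracy figure, and the `posterior_owner` helper are all assumptions chosen for illustration. The forward direction states how the owner named in a document is generated from the true owner; Bayes' rule inverts that description to infer the true owner from what the documents say, with no labeled data involved.

```python
from collections import Counter

def posterior_owner(mentions, candidates, accuracy=0.9):
    """Infer the true project owner from noisy document mentions via Bayes' rule.

    Forward (generative) model, assumed purely for illustration:
      - the true owner is drawn uniformly from `candidates` (at least two);
      - each document names the true owner with probability `accuracy`,
        otherwise it names one of the other candidates uniformly at random.
    """
    error = (1.0 - accuracy) / (len(candidates) - 1)
    counts = Counter(mentions)
    scores = {}
    for owner in candidates:
        # Likelihood of every observed mention if `owner` were the true owner.
        likelihood = 1.0
        for named, n in counts.items():
            likelihood *= (accuracy if named == owner else error) ** n
        # Multiply by the uniform prior to get the unnormalized posterior.
        scores[owner] = likelihood / len(candidates)
    total = sum(scores.values())
    return {owner: score / total for owner, score in scores.items()}

# Hypothetical data: three documents agree, one records a different owner.
candidates = ["Jane Smith", "Sam Lee", "Ana Ruiz"]
mentions = ["Jane Smith", "Jane Smith", "Sam Lee", "Jane Smith"]

posterior = posterior_owner(mentions, candidates)
best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 4))
```

Here the single conflicting mention barely shifts the result, which is the sense in which a probabilistic model can absorb an occasional recording error. A production system would use a probabilistic programming framework and a far richer model than this hand-written enumeration.
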
The team also created a rigorous conflation algorithm to determine whether information from multiple sources refers to the same entity. On the engineering side, Zaykov said, the algorithms had to be parallelized and distributed across machines so they could operate on large data, such as the entire internet or all the documents in an organization.

To reduce the amount of information to be processed, Alexandria first runs a query engine that can scale to more than a billion documents. It extracts the snippets from each document that have a high probability of containing knowledge. If the engine parsed a document about a company initiative called Project Alpha, for example, it would extract phrases likely to contain entity information, such as "Project Alpha will be released by Jane Smith on 9/12/2021."

Parsing a snippet requires identifying which parts of the text correspond to particular property values. In this approach, the model works with a set of patterns, or templates, such as "Project {name} will be released on {date}" for a project's release date. By matching a template to the text, the process can identify which parts of the text correspond to which properties. Alexandria uses unsupervised learning to create templates from both structured and unstructured text, and the model can work with many thousands of templates.

Next comes linking, in which duplicate or overlapping entities are identified and merged using a clustering process. Alexandria typically merges hundreds or thousands of items to create entries, Winn said, and each entry includes a description of the extracted entity. Alexandria's probabilistic program can also help correct human errors, such as documents in which a project owner was recorded incorrectly.
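
The template matching described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Alexandria's implementation: the two templates, the `template_to_regex` and `extract` helpers, and the use of regular expressions are all hypothetical, and the snippet is modeled on the Project Alpha example. The real system learns its templates without supervision and matches them probabilistically rather than exactly.

```python
import re

# Hypothetical hand-written templates; Alexandria learns its own from data.
TEMPLATES = [
    "Project {name} will be released on {date}",
    "Project {name} will be released by {owner} on {date}",
]

def template_to_regex(template):
    """Compile a '{placeholder}' template into a regex with one named group per placeholder."""
    # re.split with a capturing group alternates literal text and placeholder names.
    parts = re.split(r"\{(\w+)\}", template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)       # literal text must match exactly
        else:
            pattern += f"(?P<{part}>.+?)"    # placeholder captures a property value
    return re.compile(pattern)

def extract(snippet, templates=TEMPLATES):
    """Return the properties from the first template matching the whole snippet, else None."""
    text = snippet.strip().rstrip(".")
    for template in templates:
        match = template_to_regex(template).fullmatch(text)
        if match:
            return match.groupdict()
    return None

print(extract("Project Alpha will be released by Jane Smith on 9/12/2021."))
```

Each placeholder becomes a named capture group, so a successful match directly yields a mapping from property names to the pieces of text that fill them. Exact matching is brittle against rephrasing, which is one reason a learned, probabilistic set of thousands of templates is more practical than a short hand-written list.
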
The linking process can also incorporate knowledge from other sources, even knowledge that was not extracted from a document. Wherever the information comes from, all of it can be linked together into a single, unified knowledge base.

Real-world applications

As Alexandria shifted to the enterprise, the team began exploring ways that employees could benefit from organizational knowledge. One of these experiences became Viva Topics, a module of Viva, Microsoft's collaboration platform that brings together knowledge and communication.

Viva Topics taps Alexandria to organize information into topics delivered through apps such as SharePoint, Microsoft Search, Office, Yammer, Teams, and Outlook. Extracted projects, events, and organizations are presented in contextually aware cards, along with related metadata about people, content, acronyms, definitions, and conversations.

With Viva Topics, companies can use Microsoft's AI technology to do much of the heavy lifting, Moneypenny said. This frees people to contribute their own perspectives and to generate new knowledge and ideas from the work of others. Viva Topics customers are organizations of all sizes that face similar challenges, such as onboarding new employees, supporting people who change roles, scaling individuals' knowledge, and transferring knowledge more quickly between teams.

Alexandria faces technical challenges, but also opportunities, Winn and Zaykov say. In the near future, the team plans to develop a schema tailored to each organization's needs. This would allow employees to find, for example, all events of a particular type (such as machine learning talks) taking place at a particular time (the next two weeks) in a particular place (the downtown building).

The Alexandria team also aims to build a knowledge base that understands what users are trying to achieve and automatically provides information relevant to those goals.
Winn describes this as a switch from passive to proactive use of knowledge. The idea is to move from passively recording the knowledge in an organization to actively supporting the work being done.

Winn said it is possible to learn from past examples what steps are needed to reach particular goals, and then to help track and assist with those steps. This is especially useful for someone taking on a task for the first time, because it lets them draw on the organization's knowledge of how to complete the task, what actions are required, and what has and has not worked in the past.