Everyone should decide how their digital data are used, not just tech companies

Meredith Whittaker, Minderoo Research Professor at New York University, is co-founder and faculty director of the AI Now Institute in New York. Jathan Sadowski is a research fellow in the Emerging Technologies Research Lab and in the Centre of Excellence for Automated Decision-Making and Society at Monash University in Melbourne, Australia.

Data collected from smartphones, sensors and consumer habits can reveal a lot about society. Too few people have a say in how these data are created and used.

Taiwan's innovative civic data culture has shaped its swift and effective response to the pandemic. Credit: Ceng Shou Yi/NurPhoto via Getty

In the past, researchers might have surveyed hundreds of people to find out how bad weather affects commuting patterns: which modes of transport people use and when they travel. Today, data on the movements of millions are available, often in real time, from location trackers installed in vehicles and phones. In principle, these data could be combined with COVID-19 vaccination rates to examine how immunization affects commuters' return to workplaces, or with weather records to see whether more people now work remotely when it rains than did a few years ago.

That is the theory, at least. In practice, this vision is rarely realized.

The vast majority of data available to computational social scientists today were generated to answer questions unrelated to their research. Instead, the data bear the marks of their original purpose: targeted advertising, say, or personalized insurance premiums. Although these data can, with care, be repurposed to answer other questions, such as those related to obesity, crucial gaps often remain, and scientists must resort to workarounds to extract meaning from what they have. Analysts examining transport patterns in Greater Sydney, Australia, for example, had to make do with mobile-phone ping data of low spatial and temporal quality, purchased at a high price from a telecommunications provider.

The current model, in which the digital traces of our daily lives are monopolized and controlled by corporations, threatens society's ability to conduct the independent, rigorous research needed to address pressing issues. It limits which information can be accessed and which questions can easily be asked, hindering progress in understanding complex phenomena such as how vaccination coverage affects behaviour and how algorithms shape the spread of misinformation.

Here, we argue that behavioural data should instead be created, managed and curated in public data trusts.

Access denied

The political economy of data puts social scientists in an awkward position. Access comes with conditions. Companies take an active interest in which research questions are asked (or not asked), which data can be accessed and how they are analysed. Even when gatekeepers grant access, scientists often cannot tell what information is missing or how the data were generated.

At best, this has a chilling effect on scholarship. Some studies will simply not be undertaken if they might damage a data provider's reputation or bottom line. Researchers may feel pressure to align their questions and findings with the priorities and values of technology companies.
Researchers could also be denied access to data because their results are unflattering, threatening their ability to continue their work as well as their standing with peers and within their institutions. A report published in March on Facebook's Responsible AI team revealed how narrowly the problems and solutions it could investigate were scoped: its work was not directed at curbing the hate speech and disinformation that drive engagement.

Relying on the data holdings of private companies can also undermine scientific rigour. Researchers may be contractually barred from reproducing or validating others' results. Health researchers, for example, found significant racial bias in the data used to train a widely used commercial algorithm: around US$1,800 less per year was spent on treating Black patients than on white patients with the same level of need. Uncovering that bias, which the company disputes, required an independent audit of records at a large university hospital.

The status quo presents serious problems. Big tech's unscrupulous practices are eroding the reputation and credibility of long-standing techniques for collecting demographic data, predicting risk and studying patterns of behaviour. And the dominance of a few data-rich companies is shaping computational social science: many PhDs and tenure cases are now built on industry partnerships, which provide funding, data, publications and prestige.

Data pipeline

The problems are not limited to access to proprietary data. Fundamental questions surround the entire pipeline: how these data come into being and where they go.

Companies construct and control the data they consider valuable; tech giants prize behavioural information about individuals as a new asset class. Because such data are readily available, they can skew research agendas. Computational social scientists often use social-media data as a proxy for other factors, such as mobility or health, even when those data are poorly suited to the questions being asked.

Data built on harmful biases and faulty assumptions can also contaminate insights. AI researchers have found that large data sets such as ImageNet, used for more than a decade to train and assess machine-learning systems, contain racist and sexist stereotypes, which are then carried forward into software6,7.

Democratic governance

These fundamental issues cannot be addressed without a radical shift away from the monopolization of data. Systems are needed that allow social phenomena to be analysed in a more ethical, fair and scientifically sound manner. Just as patented knowledge enters the public domain when intellectual-property rights expire, so should companies' behavioural data.

A better model is collective stewardship of data pipelines in public trusts, subject to scientific oversight and democratic accountability. Existing work opens the door to such instruments. Element AI and Nesta have released a report outlining trusts as a policy tool for pooling the rights of data subjects and setting terms of use (see go.nature.com/3decirk); one of us (S.V.) took part in the workshop that prompted the report.

Francesca Bria, a champion of data sovereignty, helped to create Barcelona's smart-city initiative, which gives residents control over their data. Credit: Matthias Balk/dpa/Alamy

A promising approach has been pioneered in Barcelona, Spain, which in 2017 created a city data commons that gives residents control over how data about their communities are produced and a say in decisions about how they are governed.
Barcelona's open-data portal now contains 503 data sets about the municipality, including real-time information on the city's bicycle-sharing programme.

Such democratic control is necessary to protect the people these data are meant to serve. Public governance brings extra rights and rules, including protections against discrimination and rights to due process, as well as greater accountability. These protections are generally more extensive than the obligations placed on private firms, although they vary between countries and regions.

Collective stewardship also highlights the social value of information: what matters is not only what data reveal about one person, but what they reveal about our shared conditions and connections8. Rather than focusing on individual rights, a public trust should represent the values and interests of all the groups affected by downstream data products. Clearview AI in New York, for example, scraped photos from social-media and other websites to build powerful facial-recognition software. The people pictured had no say in this; nor did those exposed to the software through the police forces and businesses that bought it.

Public ownership of data brings its own challenges. Governments sometimes use data to inflict serious harm, for instance by targeting marginalized communities, and can use authoritarian measures to escape accountability. Public data trusts must therefore be built for democratic governance: representative of, and responsive to, the communities they are created to serve.

Strict silos must also prevent other government agencies, such as the military or police, from accessing or influencing the public data pipeline. For example, Singapore used mobile-phone data for contact tracing during the COVID-19 pandemic, but citizens' trust was undermined when it emerged that the police had used the same data in a murder investigation.

Three steps

We recommend three steps for policymakers and institutions to safeguard behavioural data as a public good.

Establish public infrastructure. Measurement, computing and storage systems must be funded and maintained to support large data sets suitable for both qualitative and quantitative research. These resources should be directed to communities and organizations already engaged in such practices, including Indigenous peoples working to manage, classify and control their own knowledge according to principles of data sovereignty. The infrastructure must be backed by robust participation mechanisms, so that the people affected by the data can set the agenda and challenge or remedy inaccurate or harmful uses.

Shift power. Policies are needed to transfer data from private entities to public institutions, along with details of the measurement methods, collection processes and storage environments. There is precedent: private companies are already granted time-limited rights to intangible assets, which eventually become available to the public. The Hatch-Waxman Act, for example, governs intellectual property in generic-drug manufacturing. We propose that companies be granted a limited monopoly over the data they create and own; after a set period, say three years, these data would become a public resource. The policy could also apply to models that the data were used to train or inform, which might need to be destroyed if retaining them would pose an undue threat to the people the data describe.
There is precedent for this, too: in May, the US Federal Trade Commission ordered a company to destroy facial-recognition algorithms that had been built using deceptively obtained photos. Companies could be offered incentives and favourable terms for handing data sets and metadata over to archives, universities or other public institutions.

Expand governance. Dedicated institutions with the capacity to manage data in the public interest must be established. There is no need to start from scratch: in the United States, the Library of Congress, the National Science Foundation and the National Institutes of Health could all serve as models, and each could have representatives on public data trusts.

These institutions would employ database managers versed in the ethical standards of library science, trained to balance the benefits of curating knowledge against the risks of sharing information. Experts in quantitative and qualitative measurement would develop new ways of generating data, working closely with communities and researchers to identify socially minded questions worth asking.

Following the example of the US Census Bureau's sworn statistical officers, computational social scientists would assess the sensitivity of each data source. Data from low-sensitivity sources might be published in aggregated, anonymized form; access to sensitive data, including personally identifiable information, would be tightly protected. A public data trust could invite advocacy and community groups to help define protocols for consent and dispute, agendas for data construction and research goals, and requirements for accessing data.

Demand change

We are not alone in this fight: witness the sprawling antitrust investigations and actions against platforms such as Google, Facebook and Amazon in the United States, Australia, China and the European Union. The COVID-19 pandemic has added momentum, too. The science academies of the G7 group of nations have called for a mechanism that would oblige private and public organizations to share data during emergencies (see go.nature.com/2sjqj2v).

Scientists whose work depends on large proprietary data sets should raise awareness of corporate data gatekeeping on social media and at conferences such as NeurIPS, and share their own experiences of navigating these difficult ethical decisions. They should press universities to change their data-ownership policies and join community groups fighting for justice over the harms caused by surveillance.

Representatives of academic associations and of government agencies such as national libraries and census offices should form an interdisciplinary working group to develop policies for public data trusts. Computational social scientists should play their part as public stewards of a valuable resource that allows us to know ourselves and our societies.