What? Me Worry? What to Do About Privacy, Big Data, and Statistical Research
This column is written to inform ASA members about what the ASA is doing to promote the inclusion of statistics in policymaking and the funding of statistics research. To suggest science policy topics for the ASA to address, contact ASA Director of Science Policy Steve Pierson at email@example.com.
Just as statistical scientists bring invaluable skills to Big Data from the perspective of data quality and analysis, they are essential from the privacy perspective, as well. In this guest column, Julia Lane and Victoria Stodden—chair and member of the ASA Privacy and Confidentiality Committee, respectively—discuss the complex privacy issues inherent in Big Data and outline the challenges to statistical scientists for addressing these issues.
Julia Lane is a senior managing economist at the American Institutes for Research; a professor of economics at BETA, University of Strasbourg/CNRS; a researcher at the Observatoire des Sciences et des Techniques, Paris; and a professor at the Melbourne Institute of Applied Economic and Social Research, University of Melbourne. She is the chair of the ASA’s Committee on Privacy and Confidentiality.
Victoria Stodden is an assistant professor of statistics at Columbia University. She developed the award-winning reproducible research standard and serves on the ASA’s Committee on Privacy and Confidentiality. She is a member of the National Science Foundation’s Advisory Committee on Cyberinfrastructure and the Mathematics and Physical Sciences Directorate Subcommittee on Support for the Statistical Sciences.
Big Data have not only brought statistics increasing relevance and importance, but have also led to statisticians, and even statistics itself, being called sexy! The seemingly unlimited potential for new types of data to predict human behavior has changed the practice of business and government, enabling marketing experts to tell stores your daughter is pregnant before she has even told you, permitting city managers to optimize city evacuations in disasters, and allowing some data-savvy politicians to predict the outcome of political races. Statistical research can be transformed and used to inform the new open-government imperatives, but only if statisticians act wisely.
The Big Data euphoria must be tempered by an examination of the critical privacy issues raised by the collection of massive amounts of data on human beings—often without their knowledge, much less their explicit consent. The release of crowd-sourced pictures after the Boston Marathon bombings had devastating consequences for at least one innocent person, even as it helped identify the alleged murderer. Although there is little evidence statistical research would generate similar effects, the resulting privacy concerns have the potential to substantially inhibit important statistical analysis. Following are three reasons for concern:
- Privacy concerns could stop bona-fide data collection and statistical research in its tracks.
- Institutional review boards, uncertain of appropriate rules and safe dissemination practices, could overprotect or underprotect statistical data. The current reliance on HIPAA rules, which identify a subset of data elements that are privacy protected, is neither necessary nor sufficient to protect confidentiality.
- Research might not be replicated because research data are held by private data collectors who, citing privacy concerns, do not make the data broadly accessible.
As statisticians, we should worry. It is imperative we develop a sensible structure for data access that ensures the goal of good science is attained while protecting confidentiality and respecting individual agency.
We know the risk to privacy will continue to increase. The volume and type of data used for social and behavioral science research will have many new types of re-identifying elements, and the potential for re-identification will increase with more and better types of matching tools and algorithms. Fortunately, the same technological change that has led to increased potential for loss of confidentiality and other harms also has led to enormous advances in the tools available to protect confidentiality.
We need an aggressive research agenda that builds an understanding of the legal and regulatory framework within which data are being collected, used, and shared. We need to ask and answer key questions. What does informed consent mean in the new environment? Do people “own” the data collected on them? What are researchers’ and institutions’ duties to protect the data held? How do we design effective studies within the context of Big Data with confidentiality concerns? How can we facilitate verifiability in findings? What are the practical options?
We also need a deep analysis of the relationships between data sets and the potential for re-identification. Communities working on these problems exist, but they are substantially siloed across research areas and practical applications, which include the successful development of remote-access secure data enclaves.
We need to build an understanding of the features of reproducible science, particularly in the computer science and legal communities. Mechanisms for sharing data safely, perhaps with different privacy-protecting layers, need to be developed. Data and software archiving questions naturally arise, including data compression, warehousing infrastructure, and the need for software executability and long-term viability. Replication of Big Data results raises new statistical questions around reliability, stability of methods, and confidence in findings, all affected by privacy-driven limits on information flows. Innovative “walled gardens” and other quasi-open structures must be developed to maximize the potential for verification of findings in the face of privacy restrictions. Finally, legal issues regarding code development and data procurement, curation, sharing, and ownership need to be resolved and reconciled with existing privacy protections. Rights to re-use data and code, both for research validation and for original research, must be established in ways that protect privacy.
Active engagement by the research community is vital to shape policy and build much stronger and responsible research infrastructures. The American Statistical Association’s Committee on Privacy and Confidentiality, on which we both serve, has contributed by sharing information, organizing sessions at the Joint Statistical Meetings, and producing (together with NYU’s Center for Urban Science and Progress) a book about the topic. But the entire community must respond to the challenge of Big Data and actively exploit the potential to advance science. An exciting, and maybe even sexy, future lies ahead for statistics.