What? Me Worry? What to Do About Privacy, Big Data, and Statistical Research

1 December 2013
This column is written to inform ASA members about what the ASA is doing to promote the inclusion of statistics in policymaking and the funding of statistics research. To suggest science policy topics for the ASA to address, contact ASA Director of Science Policy Steve Pierson at pierson@amstat.org.

Julia Lane is a senior managing economist at the American Institutes for Research, a professor of economics at BETA University of Strasbourg CNRS, Chercheur, Observatoire des Sciences et des Techniques, Paris, and a professor at Melbourne Institute of Applied Economics and Social Research, University of Melbourne. She is the chair of the ASA’s Committee on Privacy and Confidentiality.

Victoria Stodden is an assistant professor of statistics at Columbia University. She developed the award-winning reproducible research standard and serves on the ASA’s Committee on Privacy and Confidentiality. She is a member of the National Science Foundation’s Advisory Committee on Cyberinfrastructure and the Mathematics and Physical Sciences Directorate Subcommittee on Support for the Statistical Sciences.

Just as statistical scientists bring invaluable skills to Big Data from the perspective of data quality and analysis, they are essential from the privacy perspective, as well. In this guest column, Julia Lane and Victoria Stodden—chair and member of the ASA Privacy and Confidentiality Committee, respectively—discuss the complex privacy issues inherent in Big Data and outline the challenges to statistical scientists for addressing these issues.

Big Data have not only brought statistics to increasing relevance and importance, but have also led to references to statisticians and statistics as sexy! The seemingly unlimited potential for new types of data to predict human behavior has changed the practice of business and government, enabling marketing experts to tell stores your daughter is pregnant before she has even told you, permitting city managers to optimize city evacuations in disasters, and allowing some data-savvy politicians to predict the outcome of political races. Statistical research can be transformed and used to inform the new open government imperatives, but only if statisticians act wisely.

The Big Data euphoria must be tempered by an examination of the critical privacy issues raised by the collection of massive amounts of data on human beings—often without their knowledge, much less their explicit consent. The release of crowd-sourced pictures after the Boston Marathon bombings had devastating consequences for at least one innocent person, even as it helped identify the alleged murderer. Although there is little evidence statistical research would generate similar effects, the resulting privacy concerns have the potential to substantially inhibit important statistical analysis. Following are three reasons for concern:

  • Privacy concerns could stop bona fide data collection and statistical research in their tracks.
  • Institutional review boards, uncertain of appropriate rules and safe dissemination practices, could overprotect or underprotect statistical data. The HIPAA rules currently relied upon, which designate a subset of data elements as privacy protected, are neither necessary nor sufficient to protect confidentiality.
  • Research might not be replicated because research data are held in the hands of private data collectors, who cite privacy concerns and therefore do not make the data broadly accessible.

As statisticians, we should worry. It is imperative we develop a sensible structure for data access that ensures the goal of good science is attained while protecting confidentiality and respecting individual agency.

We know the risk to privacy will continue to increase. The volume and type of data used for social and behavioral science research will have many new types of re-identifying elements, and the potential for re-identification will increase with more and better types of matching tools and algorithms. Fortunately, the same technological change that has led to increased potential for loss of confidentiality and other harms also has led to enormous advances in the tools available to protect confidentiality.

We need an aggressive research agenda that builds an understanding of the legal and regulatory framework within which data are being collected, used, and shared. We need to ask and answer key questions. What does informed consent mean in the new environment? Do people “own” the data collected on them? What are researchers’ and institutions’ duties to protect the data held? How do we design effective studies within the context of Big Data with confidentiality concerns? How can we facilitate verifiability in findings? What are the practical options?

We also need a deep analysis of the relationships between data sets and the potential for re-identification. Communities working on these questions exist, but they are substantially siloed into activities in different research areas and practical applications, including the successful development of remote access secure data enclaves.

We need to build an understanding of the features of reproducible science, particularly in the computer science and legal communities. Mechanisms for sharing data safely, perhaps with different privacy-protecting layers, need to be developed. Data and software archiving questions naturally arise, including data compression, warehousing infrastructure, and the need for software executability and long-term viability. Replication of Big Data results raises new statistical questions around reliability, stability of methods, and confidence in findings, all of which are affected when privacy concerns restrict the flow of information. Innovative “walled gardens” and other quasi-open structures must be developed to maximize the potential for verification of findings in the face of privacy restrictions. Finally, legal issues regarding code development and data procurement, curation, sharing, and ownership need to be resolved and reconciled with existing privacy protections. Rights to re-use data and code for research validation purposes and original research, while protecting privacy, must be established.

Active engagement by the research community is vital to shape policy and build stronger and more responsible research infrastructures. The American Statistical Association’s Committee on Privacy and Confidentiality, on which we both serve, has contributed by sharing information, organizing sessions at the Joint Statistical Meetings, and producing (together with NYU’s Center for Urban Science and Progress) a book about the topic. But the entire community must respond to the challenge of Big Data and actively exploit the potential to advance science. An exciting, and maybe even sexy, future lies ahead for statistics.

One Comment »

  • william e winkler said:

    The authors have done a very nice job of addressing some of the potential legal/regulatory issues.

    Providing data of suitable quality for valid analyses while preserving privacy has been an open problem for 30+ years.

    In the mid-1990s, Latanya Sweeney (then a CS Ph.D. student at MIT, now a Harvard professor) took an ‘anonymized’ set of health data from Massachusetts State employees and showed how to re-identify most of them using a Massachusetts voter registration database. The health data had been anonymized by removing individuals’ names, SSNs, health insurance IDs, doctors’ names, hospital names, and just about any other identifiers that individuals could think of. For analytic purposes, ZIP codes, sex, and date of birth were left in the file, and these fields were used to re-identify more than 70% of the individuals in the file, including the Governor.
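
    The kind of linkage described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Sweeney's actual procedure: the records, the field names, and the `link_records` helper are all invented here. A masked row counts as re-identified only when exactly one public record shares its ZIP code, sex, and date of birth.

```python
from collections import defaultdict

# Quasi-identifiers left in the "anonymized" file (field names are hypothetical).
QUASI_IDENTIFIERS = ("zip", "sex", "dob")


def link_records(masked, public, keys=QUASI_IDENTIFIERS):
    """Map the index of each masked row to a name when exactly one
    public record shares all of its quasi-identifier values."""
    index = defaultdict(list)
    for person in public:
        index[tuple(person[k] for k in keys)].append(person["name"])

    matches = {}
    for i, row in enumerate(masked):
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match is a re-identification
            matches[i] = candidates[0]
    return matches


# Entirely invented toy data.
health = [
    {"zip": "02138", "sex": "M", "dob": "1950-02-03", "diagnosis": "A"},
    {"zip": "02139", "sex": "F", "dob": "1980-01-15", "diagnosis": "B"},
    {"zip": "02139", "sex": "F", "dob": "1975-03-02", "diagnosis": "C"},
]
voters = [
    {"name": "Voter One", "zip": "02138", "sex": "M", "dob": "1950-02-03"},
    {"name": "Voter Two", "zip": "02139", "sex": "F", "dob": "1980-01-15"},
    {"name": "Voter Three", "zip": "02139", "sex": "F", "dob": "1980-01-15"},
    {"name": "Voter Four", "zip": "02140", "sex": "M", "dob": "1990-09-09"},
]

reidentified = link_records(health, voters)
```

    In this toy data only the first health record is re-identified. The second matches two voters and so stays ambiguous, and the third matches none.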

    The first issue with any public-use file is providing a file that allows reproduction of 1-2 (but hopefully more) analyses that might be performed on the original, non-public file prior to the ‘masking’ of certain fields to prevent re-identification. If the file has valid analytic properties, then the data producer should (attempt to) justify that re-identification of even a small proportion of individuals is exceptionally difficult or impossible. Unfortunately, many CS and other researchers have shown how to re-identify individuals in seemingly innocuous files.

    Narayanan, A., and Shmatikov, V. (2009), “De-anonymizing Social Networks,” Proceedings of the 2009 30th IEEE Symposium on Security and Privacy (SP ’09), 173-187, http://www.cs.utexas.edu/~shmat/shmat_oak09.pdf

    There are many other examples. At present, there is no known method for anonymizing a file that prevents re-identification. A promising method is differential privacy, which has absolute guarantees on privacy but is still being researched for significant enhancements to try to assure valid analytic properties.

    Dwork, C. (2006), “Differential Privacy,” 33rd International Colloquium on Automata, Languages and Programming – ICALP 2006, Part II, 1-12.
    Dwork, C. (2008), “Differential Privacy: A Survey of Results,” in (M. Agrawal et al., eds.) TAMC 2008, LNCS 4978, 1-19.
    Dwork, C. and Yekhanin, S. (2008), “New Efficient Attacks on Statistical Disclosure Control Mechanisms,” Advances in Cryptology—CRYPTO 2008, to appear, also at http://research.microsoft.com/research/sv/DatabasePrivacy/dy08.pdf .
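
    To make the idea concrete, here is a minimal sketch of the Laplace mechanism, one standard way of achieving differential privacy for a counting query. It is not taken from the papers above as written: the helper names `laplace_noise` and `private_count`, the toy records, and the choice of epsilon are assumptions made for illustration. A counting query changes by at most 1 when one person is added or removed, so the noise scale is 1/epsilon.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponentials with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records satisfying `predicate`.

    Smaller epsilon means more noise: stronger privacy, less accuracy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for r in records if predicate(r))
    # Sensitivity of a counting query is 1, so the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Invented toy records.
rows = [{"age": 30}, {"age": 41}, {"age": 52}, {"age": 67}]
noisy = private_count(rows, lambda r: r["age"] > 40, epsilon=0.5)
```

    The tension mentioned above shows up directly in `epsilon`: a small value protects individuals better but makes each released count less useful for analysis.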

    During the 1980s, the NSF very heavily funded database research into privacy. By 1990, researchers at leading universities and at IBM had concluded that the problem was likely impossible. I had refereed a survey/overview paper on the subject which was of substantial interest to the data-producing agencies because they wanted to give out public-use data rather than the tabulations (in published reports) that they had done for years.

    Dinur and Nissim provided rigorous theory demonstrating how to assure privacy by adding (very carefully chosen) noise to the answers to queries. The number of queries also had to be restricted.
    Dinur, I., and Nissim, K. (2003), “Revealing Information while Preserving Privacy,” ACM PODS Conference, 202-210.
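
    The two ingredients mentioned here, perturbed answers and a cap on the number of queries, can be combined in a small sketch. It is only an illustration of the interface, not the construction in the paper: the `NoisyQueryServer` class, the uniform noise, and the parameter values are assumptions, and no formal privacy guarantee is claimed for these particular choices.

```python
import random


class NoisyQueryServer:
    """Answer subset-sum queries with bounded noise, up to a fixed query limit."""

    def __init__(self, values, max_queries, noise_bound, seed=None):
        self._values = list(values)      # the confidential data, never released
        self._remaining = max_queries    # how many queries may still be answered
        self._noise_bound = noise_bound  # largest absolute perturbation per answer
        self._rng = random.Random(seed)

    def subset_sum(self, indices):
        """Return the sum of the selected entries plus bounded random noise."""
        if self._remaining <= 0:
            raise RuntimeError("query limit reached")
        self._remaining -= 1
        exact = sum(self._values[i] for i in indices)
        return exact + self._rng.uniform(-self._noise_bound, self._noise_bound)


# Invented toy database of 0/1 attributes.
server = NoisyQueryServer([1, 0, 1, 1], max_queries=2, noise_bound=0.5, seed=1)
first_answer = server.subset_sum([0, 2, 3])
second_answer = server.subset_sum([1])
```

    A third call to `server.subset_sum` would raise `RuntimeError`. Without such a limit, repeating the same query and averaging the answers would wash out the noise.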