
Finding the De-Anonymization Needle in the SEER Haystack

1 September 2022

Chris Barker, Statistical Consulting Section Chair-Elect and Monica Johnston, M. Lee & Company

    As chair-elect of the Statistical Consulting Section, I want to share the motivation for one of my several forthcoming section initiatives. The initiative arises because I recently needed a crash course in concepts entirely new to me: data privacy, anonymization and de-anonymization, identification, de-identification, and re-identification, and statistical disclosure. Based on what I learned, I am inviting interested statisticians to help develop a “data privacy toolbox” that members of the consulting section, and indeed all statisticians at the ASA, can use in their day-to-day work. The toolbox may be used at a leisurely pace, rather than as a crash course. Volunteers need not be section members, though I encourage joining the section.

    Defining and measuring the success of the toolbox is an additional objective for the group.

    Privacy may no longer exist in the 21st century. Bill Gates observed in a 2013 Wired article, “Historically, privacy was almost implicit, because it was hard to find and gather information.” Today, de-anonymization (i.e., re-identification), in which “adversaries” match anonymized (i.e., de-identified) data against publicly available, or auxiliary, data with relative ease, may lead to identifying that “anonymous” person in terms of actual name, address, employer, and more.
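
    The linkage idea is simple enough to show in a few lines of R. The sketch below is purely illustrative: the data frames, column names, and choice of quasi-identifiers (zip, birth_year, sex) are all invented, and nothing here is drawn from SEER, CTG, or any real release.

        # Toy illustration of re-identification by linkage: an "anonymized"
        # release still carries quasi-identifiers that also appear in a
        # public auxiliary file. All data and column names are invented.
        anonymized <- data.frame(
          zip        = c("60601", "60601", "94305"),
          birth_year = c(1957, 1983, 1990),
          sex        = c("F", "M", "F"),
          diagnosis  = c("melanoma", "lymphoma", "leukemia")
        )

        auxiliary <- data.frame(  # e.g., a public voter or profile list
          name       = c("A. Smith", "B. Jones", "C. Lee"),
          zip        = c("60601", "60601", "94305"),
          birth_year = c(1957, 1983, 1990),
          sex        = c("F", "M", "F")
        )

        # An exact join on the quasi-identifiers attaches names to the
        # supposedly anonymous rows
        linked <- merge(anonymized, auxiliary,
                        by = c("zip", "birth_year", "sex"))
        print(linked)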

    The ethics of de-anonymization and its implications for those de-anonymized have been studied by independent ethicists and other experts, including those at the US Census Bureau. One data use ethics issue is the unambiguous violation of explicitly stated terms of use for the Surveillance, Epidemiology, and End Results (SEER) and clinicaltrials.gov (CTG) databases.

    One paper in particular, “Do Firms Underinvest in Long-Term Research? Evidence from Cancer Clinical Trials” by Eric Budish, Benjamin Roin, and Heidi Williams in the American Economic Review, states plainly that the researchers and their organizations linked SEER and CTG, with no reference to the terms of use. When asked, neither the authors, nor Nobel Prize–winning journal editor Esther Duflo, nor Nobel Prize–winning American Economic Association president David Card provided proof that the authors had permission of any kind from the federal agencies overseeing SEER and CTG. This creates a risk that adversaries will re-identify oncology patients by linking to auxiliary data.

    Statisticians working with any data from humans may need to update their understanding of anonymization, de-anonymization, and statistical disclosure.

    The Needle in the SEER Haystack

    Paraphrasing Bruce Schneier in his Wired article, “Why ‘Anonymous’ Data Sometimes Isn’t”: much of what we have learned about anonymization of patient data and “statistical disclosure” may be completely outdated, or simply wrong.

    SEER data is “anonymized”: information permitting identification of individual patients (name, address, credit cards, salary, etc.) has been removed. Anonymization, however, can fail. In 2006, Netflix launched the Netflix Prize and provided anonymized customer data to the public. As Arvind Narayanan and Vitaly Shmatikov report in their 2008 IEEE Symposium on Security and Privacy paper, “Robust De-Anonymization of Large Sparse Datasets,” they were able to link the Netflix database with the Internet Movie Database (IMDb), identify Netflix clients’ actual names and addresses, and receive confirmation from Netflix management that the individuals had been correctly identified.
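
    The heart of the Narayanan–Shmatikov attack is a similarity score between sparse records: an anonymized record is claimed as a match for a public profile only when its score clearly stands apart from the runner-up. A stripped-down R sketch of that idea follows; the ratings, user IDs, and the cutoff of 2 are fabricated for illustration, and the published algorithm is more sophisticated (it weights rare items more heavily and allows approximate matches on dates and ratings).

        # Toy sparse-record matching: score each candidate in the
        # "anonymized" set against a target auxiliary record by counting
        # overlapping (item, rating) pairs, then accept the best match
        # only if it clearly beats the runner-up (the eccentricity test).
        score <- function(target, candidate) {
          shared <- intersect(names(target), names(candidate))
          sum(target[shared] == candidate[shared])
        }

        # Named vectors: names are movie IDs, values ratings (fabricated)
        aux_record <- c(m101 = 5, m207 = 3, m333 = 4)  # public profile
        anon_db <- list(
          u1 = c(m101 = 5, m207 = 3, m333 = 4, m900 = 2),
          u2 = c(m101 = 2, m450 = 5),
          u3 = c(m207 = 3, m333 = 1)
        )

        scores <- sapply(anon_db, score, target = aux_record)
        best   <- sort(scores, decreasing = TRUE)
        # Declare a match only when the top score is well separated
        if (best[1] - best[2] >= 2) names(best)[1] else NA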

    A critical caveat to my work here is that there is no direct proof of a de-anonymization, since that could be established only by directly contacting the patients involved. Briefly, I inspected (using SAS and R) data sets prepared by the authors and available for download by anyone with internet access; no password is required, and there is no method to track downloads. I found 40 unique clinical trials with exactly one patient (sample size n=1) linked to SEER patient-level data with a large number of covariates that an adversary could use to link to auxiliary data for de-anonymization. I have no way to contact the individual patients, and I turned my discovery over to the experts at SEER and CTG.
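
    For readers who want to picture the check, a minimal R sketch follows. The file and variable names (linked_seer_ctg.csv, trial_id, patient_id) are hypothetical stand-ins, not the names in the authors’ actual download.

        # Sketch of the check described above, with invented names:
        # count patients per trial in the linked analysis file and list
        # the trials that contain exactly one patient.
        linked <- read.csv("linked_seer_ctg.csv")  # hypothetical file

        counts <- aggregate(patient_id ~ trial_id, data = linked,
                            FUN = function(x) length(unique(x)))
        names(counts)[2] <- "n_patients"

        singletons <- subset(counts, n_patients == 1)
        nrow(singletons)  # the article reports finding 40 such trials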

    The patient data I found are at potentially very high risk of de-anonymization. Given the detailed data available, it may be possible to de-anonymize the 40 patients in these n=1 trials.
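
    One standard way to quantify that risk is k-anonymity: for each record, count how many records in the file share its exact combination of quasi-identifiers. A record with k = 1 is unique in the file and is the most exposed to linkage. A minimal check in R, with invented column names:

        # Count how many records share each quasi-identifier combination;
        # records with k = 1 are unique and carry the highest linkage risk.
        k_anonymity <- function(df, quasi_ids) {
          # build one key per row ("\r" separator avoids collisions)
          key <- do.call(paste, c(df[quasi_ids], sep = "\r"))
          ave(seq_along(key), key, FUN = length)
        }

        df <- data.frame(
          age_group = c("70-74", "70-74", "35-39"),
          county    = c("X", "X", "Y"),
          histology = c("A", "A", "B")
        )
        df$k <- k_anonymity(df, c("age_group", "county", "histology"))
        subset(df, k == 1)  # the uniquely identifiable rows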

    I specifically asked the authors, Duflo, and Card to make the report themselves, in compliance with the SEER terms of use, which require notifying SEER of de-anonymizations. In the absence of their replies, I intervened, guided by the ethical principles of the ASA, and reported the matter to SEER and CTG.

    As a courtesy, I specifically informed the Division of Cancer Control and Population Sciences, the director of the National Library of Medicine, and the privacy experts at SEER and CTG that I did not expect or need to know how the matter was handled.

    Proof of Concept of De-Anonymization of SEER Using Certain CTG Trials

    My background is in pharmaceutical clinical trials, where we routinely blind patients, investigators, and sponsors. Anonymization and de-anonymization differ from blinding and unblinding. The two share a common characteristic in that individual patient identifiers are removed by an anonymization algorithm, sometimes referred to in the privacy literature as “catch and release.” They differ in that de-anonymization may occur for only a single patient, for several patients, or possibly for all patients, whereas clinical trial unblinding is applied to all patient data at one time, in what is called a “database unlock.” Patient identifiers are not included in journal publications. Protection of patient data in pharmaceutical clinical trials is addressed by the European Medicines Agency (2019) and the European General Data Protection Regulation.
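
    A sketch of the “catch and release” step in R, with invented column names: direct identifiers are stripped before release, while the analysis covariates are retained, which is precisely why quasi-identifiers rather than names drive re-identification risk.

        # Minimal "catch and release": strip direct identifiers before
        # release. The retained covariates can still act as
        # quasi-identifiers for a linkage attack.
        direct_ids <- c("name", "address", "ssn")

        raw <- data.frame(
          name = "J. Doe", address = "1 Main St", ssn = "000-00-0000",
          age = 67, stage = "III", site = "lung"
        )
        released <- raw[, setdiff(names(raw), direct_ids)]
        released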

    I believe I have discovered the first-ever publication of a “proof of concept” for a de-anonymization algorithm in a prominent peer-reviewed economics policy journal, the American Economic Review. The proof of concept of de-anonymization of SEER anonymized data arises from combining two crown-jewel data sets of the federal statistical system, SEER and CTG, in violation of the terms of use. In fact, as many as four federal-level databases may be involved.

    Clinicaltrials.gov, in a small number of situations, has a type of “patient-level” data: a small number of clinical trials in which the final sample size is one patient (n=1). Based on my experience, I pose broader questions that I do not attempt to answer. How important is the discovery of a proof of concept to the ongoing initiatives by pharmaceutical companies to provide pristine anonymized clinical trial data sets to external experts? Does the proof of concept increase the risk that some of those experts may attempt to de-anonymize the data? And does the proof of concept scale up to larger clinical trials?

    I turned the matter over entirely to the Division of Cancer Control and Population Sciences/SEER and the National Library of Medicine/CTG for their privacy experts to address.

    Synergism with Other ASA Committees

    At the outset, I recognized the concept of a data privacy toolbox might overlap with the work of other ASA committees. To avoid duplication of effort, I invite Statistical Consulting Section members and members of ASA committees such as Data Privacy, Record Linkage, Epidemiology, and Ethics to collaborate on this initiative.

    For information, contact me, Chris Barker.

    An earlier version of this article did not include Monica Johnston as co-author. This has been corrected.


    Comments

    • Stanley Guinn, ASA member said:

      To the authors,
      This reads like a great, much-needed project. Since I work with government entities and data as a data analyst, I look forward to seeing what fruits come from this data security analysis. Awesome article, by the way! Stan

    • Monica Johnston said:

      Thank you, Stan. I believe that using the data privacy toolbox will become an important part of working ethically and a source of support for statistical consultants, statisticians, and data scientists/analysts. Ideally, we’ll have broad input on the development of the data privacy toolbox. Please reach out if you’re interested in co-developing the data privacy toolbox; there are various ways to contribute even if you can make only a small time/duration commitment.