
JSM Session Tackles Differential Privacy

1 November 2020
Saki Kinney, ASA Committee on Privacy and Confidentiality

    As the Census Bureau begins its transition to differential privacy to protect individuals in its many data products from re-identification, the effect of differential privacy on data users is drawing increasing attention.

    Differential privacy (DP) is a framework involving perturbative methods of statistical disclosure control that provides a formal privacy guarantee—a quantifiable measure of disclosure risk that does not rely on assumptions about information held by potential attackers attempting record linkage. It also allows users to make inferences from the data that take into account the data protection methods applied to the data, something not typically true when methods like top-coding, suppression, or data swapping are used. This transition to differential privacy entails a wholesale change in the generation and consumption of statistical information. There remain many unsolved challenges, and addressing them is an active area of research.
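
    As a concrete illustration of a perturbative method with a formal guarantee, the sketch below applies the Laplace mechanism, a standard textbook DP technique, to a single count. It is not a method described in the session and not the Census Bureau's algorithm; the function names, the sensitivity of 1, and the epsilon values are assumptions chosen for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) variate.

    The difference of two independent exponentials with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Return an epsilon-DP noisy count (hypothetical helper).

    Adding or removing one person changes a count by at most 1, so the
    sensitivity defaults to 1 and the noise scale is sensitivity / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(2020)  # fixed seed only so the demo is repeatable
    for eps in (0.1, 1.0, 10.0):
        print(eps, round(private_count(1000, eps, rng), 1))
```

    Smaller values of epsilon add more noise and give stronger protection. Because the noise distribution is public, an analyst can in principle account for it when making inferences.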

    The ASA Committee on Privacy and Confidentiality (CPC) saw this as a critical and interesting area of discussion and organized an invited session for JSM 2020, “Private Data for the Public Good: Formal Privacy in Survey Organizations,” chaired by Tom Krenzke from Westat. In this session, experts from different sectors discussed active research and challenges for agencies and data users.

    Frauke Kreuter of the University of Maryland, University of Mannheim, and IAB Germany kicked off the session by discussing the impact of DP on social science research. In a DP world, she said, social science research will be substantially transformed. Researchers will need to learn new methods to compensate for new nonsampling errors and may need to limit the scope and complexity of their research in some cases. Another challenge Kreuter mentioned is that the push toward differential privacy coincides with a push to integrate data from multiple sources, including administrative data, which adds a layer of complexity and is also an area of active work. The addition of nonsampling error will also increase confidence interval widths, necessitating larger sample sizes, which can translate to large amounts of money in a survey setting. Without additional funding to increase sample sizes, widespread DP adoption could be detrimental to social science research, according to Kreuter.
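
    A back-of-the-envelope calculation shows why added noise translates into larger samples. The sketch assumes a sample mean of values clipped to a known range, released with Laplace noise, and a normal-approximation interval. None of these specifics come from Kreuter's talk, and the function names and parameters are hypothetical.

```python
import math


def ci_half_width(sigma, n, epsilon=None, value_range=1.0, z=1.96):
    """Approximate 95% CI half-width for a mean, with optional DP noise.

    Assumptions (illustrative only): values are clipped to an interval of
    width `value_range`, so changing one record moves the mean by at most
    value_range / n; Laplace noise with scale value_range / (n * epsilon)
    is added; the interval uses a normal approximation.
    """
    sampling_var = sigma ** 2 / n
    noise_var = 0.0
    if epsilon is not None:
        scale = value_range / (n * epsilon)
        noise_var = 2.0 * scale ** 2  # variance of Laplace(0, scale)
    return z * math.sqrt(sampling_var + noise_var)


def sample_size_for(target, sigma, epsilon, value_range=1.0):
    """Smallest n whose half-width is at most `target` (simple search)."""
    n = 1
    while ci_half_width(sigma, n, epsilon, value_range) > target:
        n += 1
    return n
```

    Comparing `sample_size_for(0.02, 0.25, None)` with `sample_size_for(0.02, 0.25, 0.1)` shows how many more respondents the same precision requires once noise is added under these assumptions.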

    Quentin Brummet of NORC discussed practical issues for users of DP methods—such as interpreting epsilon and choosing tuning parameters—and gave an example of estimating child care costs from the 2012 National Survey of Early Care and Education. He provided empirical results showing how well different approaches worked at preserving regression coefficients. Like Kreuter, he found that if users adjust their analysis process, DP will perform better, and thus there is another tradeoff to be made (in addition to the traditional risk-utility tradeoff) between privacy protection (as measured by epsilon) and the range of analyses that can be performed. He found that implementing DP effectively requires deep subject matter knowledge.

    Aleksandra Slavkovic of Penn State described a framework to optimize statistical inference under DP using statistical principles from measurement error, robustness, and likelihood-based inferences. As mentioned by the previous speakers, new methods are needed to analyze DP-protected data that account for added bias and variance. In the proposed likelihood-based framework, privacy mechanisms are a family of conditional probability distributions. Since the parameters of DP are disclosable, they can be used to obtain valid inferences by incorporating the marginal likelihood of the privacy mechanism. The likelihood is generally intractable but can be approximated by different methods.
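
    In symbols, and using notation that is an assumption here rather than taken from the talk, let $x$ be the confidential data with model $f(x \mid \theta)$, and let $\eta(s \mid x)$ be the publicly known privacy mechanism that produces the released output $s$. The marginal likelihood available to the analyst integrates over the unobserved confidential data:

```latex
L(\theta; s) \;=\; \int f(x \mid \theta)\, \eta(s \mid x)\, dx
```

    The integral runs over every confidential data set the model allows, which is why it is generally intractable and has to be approximated.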

    The last presenter was John Abowd of the Census Bureau, who is leading the bureau’s transition to DP. He pointed out that many criticisms of DP also apply to other methods of statistical disclosure limitation. User feedback is critical and should be built into the agency process. It will be easier to integrate DP into new products where the alternative is no data, and indeed, DP will be more successful if it is engineered from the start, rather than tacked on to a survey post hoc. Abowd also mentioned that moving to DP will simplify the workflow due to a single protection method protecting a single file that would be used for all data products. A prototype model being developed for the American Community Survey (ACS), which initially may use synthetic data without DP until DP methods can handle complex survey data, includes the use of a validation server, which would allow users to submit specific analyses conducted on the synthetic data to be run on the gold standard data and then released, subject to disclosure review.

    The discussant was Jerry Reiter of Duke University. He echoed Kreuter’s and Brummet’s points about the impact of DP on social science research. He liked the idea of coupling (potentially differentially private) synthetic data with a validation server. He also suggested that a closely related idea, verification servers (which he proposed in 2009) that provide a measure of similarity between the original and synthetic data, might be easier to implement and use less privacy budget.

    Several speakers mentioned methods for analyzing DP-protected data that account for noise, which are straightforward to implement for simple parametric inferences; however, DP often requires post-processing for analytic consistency, which can introduce biases that are more difficult to capture without consuming more privacy budget. One suggestion is to consider the use of multiple implicates, dividing the privacy budget among the implicates. This is an ongoing research area.
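
    The cost of dividing a budget can be seen with simple arithmetic. Assuming basic sequential composition and Laplace noise on a single statistic (both assumptions for illustration; the speakers did not specify a mechanism), each of m implicates receives epsilon / m. The helper names below are hypothetical.

```python
def per_implicate_epsilon(total_epsilon, m):
    """Split a total budget evenly across m implicates (basic composition)."""
    if m < 1:
        raise ValueError("m must be at least 1")
    return total_epsilon / m


def noise_variance_single(total_epsilon, sensitivity=1.0):
    """Laplace noise variance when one release uses the whole budget."""
    return 2.0 * (sensitivity / total_epsilon) ** 2


def noise_variance_of_average(total_epsilon, m, sensitivity=1.0):
    """Noise variance of the average of m releases at epsilon / m each."""
    eps_i = per_implicate_epsilon(total_epsilon, m)
    per_release_var = 2.0 * (sensitivity / eps_i) ** 2
    return per_release_var / m  # independent noise averages down by 1/m
```

    Under these assumptions, averaging m implicates leaves m times the noise variance of a single release at the full budget, so any benefit of implicates would have to come from elsewhere, for example from estimating variability across them.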

    A number of questions were raised in the discussion period, and, as a feature of the online JSM, discussion continued in the chat window. Participants noted that the ACS system described is a prototype (a recent Council of Professional Associations on Federal Statistics (COPAFS)/Federal Committee on Statistical Methodology (FCSM) webinar gave a target date of 2025) and that consistency requirements are imposed when the tabulation microdata file is created. Public Use Microdata Sample (PUMS) files are created via sampling, which means they are immediately inconsistent with tabulations.

    A concern about data integration was also raised. Privacy-loss budgets set for individual data sets may add up to a larger total budget when those data sets are used as auxiliary sources for other data sets. Opportunity Atlas and Post-Secondary Employment Outcomes were provided as examples of privacy-loss budgets applied to complicated linked data products, as was the aforementioned COPAFS/FCSM webinar.

    Related legal issues of opt-in consent models (used by the European GDPR) and legal penalties, which already exist and only work when they are enforceable, were also raised by speakers and attendees as options to increase privacy protection without compromising data quality. However, they come with their own set of challenges and limitations.

    The slides for each presentation in the JSM invited session are available. View a related COPAFS/FCSM webinar on privacy in the American Community Survey.
