
New Developments in Synthetic Data Generation: Privacy Day Webinar Summary

1 April 2022
Leah von der Heyde

On January 28, the ASA Committee on Privacy and Confidentiality hosted a webinar, “New Developments in Synthetic Data Generation,” in observance of International Privacy Day. The webinar was moderated by committee chair Saki Kinney.

As the world grapples with how best to share confidential data, there is a rising interest in synthetic data methods. Synthetic data methods are used to protect the confidentiality of microdata units by replacing observed values with simulated ones. The webinar featured three perspectives and examples of ways in which synthetic data can be leveraged to increase public access to information from sensitive data, such as clinical data and official statistics.

First, Brett Beaulieu-Jones, instructor of biomedical informatics at Harvard Medical School, gave a talk titled “The Potential of Privacy-Preserving Generative Deep Neural Networks to Support Clinical Data Sharing.”

Using clinical trial data as an example, Beaulieu-Jones highlighted the challenge researchers face: sharing medical microdata to accelerate scientific progress while preserving patient privacy.

He explained how generative adversarial networks (GANs), pairs of deep neural networks trained against each other, could help overcome this challenge when used in their auxiliary classifier (AC) variant to create synthetic data. However, as he pointed out, the usefulness of synthetic data hinges on its privacy-preserving properties: GANs are not automatically immune to privacy issues, as membership inference attacks can be trained with only black-box access to the target model. Therefore, Beaulieu-Jones trained the AC-GANs under differential privacy to generate synthetic data that closely resembles the patients’ trial data.
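The summary does not say which mechanism was used to train the networks under differential privacy. One common approach, assumed here purely for illustration, is to clip each example's gradient and add Gaussian noise before averaging. The sketch below shows only that single step in plain Python; the function names and parameters (`max_norm`, `noise_multiplier`) are hypothetical and not taken from the talk.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale a gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average.

    Assumption: noise standard deviation is noise_multiplier * max_norm,
    as in the usual clip-and-noise recipe. This is not the speaker's code.
    """
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    noisy = [s + rng.gauss(0.0, noise_multiplier * max_norm) for s in summed]
    return [v / len(clipped) for v in noisy]


# Usage: two toy per-example gradients, one of which exceeds the clipping norm.
rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]
update = private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng)
```

Clipping bounds how much any single patient record can move the model, and the noise masks what remains of that influence. A real training loop would also need to track the cumulative privacy budget, which this sketch omits.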

As a result, when presented with the synthetic data created by the AC-GANs, human experts had trouble differentiating synthetic blood pressure data from real data, pointing to the potential of synthetic data for sharing individual-level patient data while preserving privacy.

Next, Monika Hu, assistant professor of statistics at Vassar College, gave a talk titled “Incorporating Disclosure Risk in Designing Data Synthesis Models.”

Hu first explained the tradeoff between the utility and the disclosure risk of synthetic data. To avoid designing a new synthesis model every time the disclosure risk becomes too high, Hu presented a novel approach: incorporating disclosure risk as a weight in the likelihood function of an already existing synthesis model. Because any record with high risk needs more privacy protection, the likelihood contribution of such a record can be reduced by assigning it a low weight, which provides more protection in the resulting synthetic data.
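The weighting idea can be sketched with a toy example. The code below is a minimal illustration under stated assumptions, not Hu's model: it assumes a normal likelihood with known spread, a hypothetical mapping from risk to weight (`1 - risk`), and made-up income values. Each record's log-density is multiplied by its weight, so high-risk records pull less on the fitted parameters.

```python
import math


def risk_to_weight(risk):
    """Map a disclosure risk in [0, 1] to a likelihood weight in [0, 1].

    Assumption: weight = 1 - risk, chosen only for illustration.
    """
    return max(0.0, min(1.0, 1.0 - risk))


def weighted_log_likelihood(values, weights, mu, sigma):
    """Sum of per-record normal log-densities, each scaled by its weight."""
    total = 0.0
    for y, w in zip(values, weights):
        log_density = (-0.5 * math.log(2 * math.pi * sigma ** 2)
                       - (y - mu) ** 2 / (2 * sigma ** 2))
        total += w * log_density
    return total


def weighted_mean_estimate(values, weights):
    """Value of mu that maximizes the weighted normal log-likelihood."""
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)


# Usage: hypothetical incomes (in thousands); the last record is an outlier
# that is easy to identify, so it is given a high risk score.
incomes = [40.0, 42.0, 38.0, 41.0, 400.0]
risks = [0.1, 0.1, 0.1, 0.1, 0.9]
weights = [risk_to_weight(r) for r in risks]

unweighted_mu = sum(incomes) / len(incomes)
weighted_mu = weighted_mean_estimate(incomes, weights)
```

In this toy setting, the weighted estimate sits much closer to the bulk of the data than the unweighted one, so synthetic values drawn from the fitted model reveal less about the outlying record. In a Bayesian synthesis model, the same weights would enter the likelihood before posterior sampling.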

Illustrating the method with a model for family income data from the Bureau of Labor Statistics’ Consumer Expenditure Survey, Hu explained this weight could be applied to any Bayesian synthesis model with high utility, presenting a framework to achieve the desired tradeoff balance.

In presenting results, she pointed out that while using unweighted synthetic data already significantly reduces the average and top risk when compared to real data, weighting synthetic data in such a way further reduces the risk.

Finally, Aaron R. Williams, senior data scientist in the Income and Benefits Policy Center at the Urban Institute, presented “Fully Synthetic Microdata for Public Policy Analysis.” Williams elaborated on the advantages of tree-based methods, specifically classification and regression trees (CART), for generating synthetic tax data, which does not fit common distributions and contains complex nonlinear relationships between variables. He went on to explain that, rather than drawing values directly from the confidential data, the method adds more noise to values from percentiles with greater variation, which are potentially more sensitive than values from percentiles with lower variation.

Applying these methods to two IRS data sets of different sizes and complexity, Williams showed that the large majority of the original and synthetic variables match rather closely when looking at correlation matrices and tax microsimulations. He pointed out that more work is needed to optimize the synthetic data’s accuracy for large and complex data sets.
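A heavily simplified sketch of tree-based synthesis follows. It is a stand-in rather than the method from the talk: it uses a single-split regression tree (a stump) instead of a full CART, and it scales noise by the standard deviation within each leaf instead of by percentile. All names and data values are hypothetical.

```python
import random
import statistics


def sum_squared_error(ys):
    """Squared deviations of ys around their own mean."""
    mean = statistics.fmean(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_split(xs, ys):
    """Return the threshold on x that minimizes the total squared error of y."""
    candidates = sorted(set(xs))[:-1]  # keep both sides of the split non-empty
    if not candidates:
        raise ValueError("need at least two distinct predictor values")
    best_threshold, best_sse = None, float("inf")
    for threshold in candidates:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        sse = sum_squared_error(left) + sum_squared_error(right)
        if sse < best_sse:
            best_threshold, best_sse = threshold, sse
    return best_threshold


def synthesize(xs, ys, new_xs, noise_scale, rng):
    """Draw a donor value from the matching leaf and perturb it.

    Assumption: noise standard deviation is noise_scale times the leaf's
    standard deviation, so leaves with more variation get more noise.
    """
    threshold = best_split(xs, ys)
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    synthetic = []
    for x in new_xs:
        leaf = left if x <= threshold else right
        donor = rng.choice(leaf)
        spread = statistics.pstdev(leaf)
        synthetic.append(donor + rng.gauss(0.0, noise_scale * spread))
    return synthetic


# Usage: made-up predictor and outcome values with two clear groups.
xs = [1, 2, 3, 10, 11, 12]
ys = [10.0, 11.0, 12.0, 100.0, 110.0, 120.0]
rng = random.Random(42)
synthetic_ys = synthesize(xs, ys, new_xs=[2, 11], noise_scale=0.5, rng=rng)
```

Because the noise is tied to local spread, values in the tightly clustered leaf are perturbed only slightly, while values in the more variable leaf receive larger perturbations. With `noise_scale=0`, the sketch reduces to sampling observed values directly from the leaf, which is what the perturbation is meant to avoid.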

Closing his talk, Williams spoke about ongoing work on a formally private validation server, highlighting its usefulness for extending access to government data when combined with fully synthetic data.

After their presentations, the speakers answered questions from the audience.
