
Summary of JSM 2019 Session on Formal Privacy: Making an Impact at Large Organizations

1 January 2020
The ASA Privacy and Confidentiality Committee is sponsoring a webinar on Privacy Day, January 28. The speaker is Michael Hawes, senior advisor for data access and privacy at the US Census Bureau. Details will be provided on the committee’s website.

With the growing amount of data collected every day, data confidentiality is increasingly at risk. Many traditional approaches to statistical disclosure control are no longer deemed sufficient to protect the confidentiality of the data. Formal privacy guarantees are provable guarantees that typically hold regardless of the assumed knowledge and attack strategy of a malicious user. Such guarantees are especially important for large producers of statistics, such as national statistical agencies and large private companies. These organizations are increasingly designing and engineering improved disclosure limitation systems, with strong consideration for formal privacy.

To learn more about this, the Committee on Privacy and Confidentiality organized a Joint Statistical Meetings topic-contributed session, Formal Privacy: Making an Impact at Large Organizations. The session brought together four experts from large organizations who have developed, proposed, and implemented formal privacy models or variants of differential privacy. The presentations described challenges, discussed how the challenges were met, and provided an outlook for future implementation of formal privacy.

Lars Vilhuber of Cornell University, a member of the Committee on Privacy and Confidentiality, organized the session. The committee’s co-chair, Aleksandra Slavkovic of The Pennsylvania State University, moderated the panel.

Simson Garfinkel of the US Census Bureau gave a talk titled “Deploying Differential Privacy for the 2020 Census of Population and Housing.” The 2020 decennial census requires an actual enumeration of the population, and the data are collected under a pledge of confidentiality.

The 2010 Census data released to the public used a disclosure avoidance technique called household swapping. Swapping was limited to households within a state and of the same size. However, the swapping rate is confidential.

More recently, the Census Bureau conducted a reconstruction attack of the 2010 Census and re-identified data from 17% of the US population. The Census Bureau began to look for new approaches and has adopted differential privacy for the 2020 Census and Economic Census. The bureau is also working toward a similar solution for the American Community Survey, though no final decisions have been made.

Garfinkel noted that, despite its size, the decennial census is the easiest to make differentially private. There are only six variables per person: age, sex, race, ethnicity, relationship to householder, and location. There are no weights since it is a census.

The Disclosure Avoidance System (DAS) developed by the bureau allows it to enforce global confidentiality protections that rely on injections of formally private noise. The advantages of noise injection with formal privacy are transparency, tunable privacy guarantees (privacy guarantees do not depend on external data), protection against accurate database reconstruction, and protection of individual data. The challenges are that the entire country must be processed at once for best accuracy and every use of confidential data must be tallied in the privacy-loss budget. To do this, the Census Bureau created new differential privacy algorithms and processing systems (the aforementioned DAS) that produce accurate statistics for large populations (e.g., states and counties), constructed protected microdata that can be used for any tabulation without additional privacy loss, and fit the system into the decennial census production system.
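
To make the idea of formally private noise injection concrete, here is a minimal sketch of adding Laplace noise to a single count query. This is not the Census Bureau's DAS; the function name and parameter values are illustrative, and the point is only how the noise scale ties sensitivity to the privacy-loss parameter epsilon.

```python
import numpy as np

def laplace_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """Return an epsilon-differentially private version of a count.

    One person can change the count by at most `sensitivity` (here 1),
    so Laplace noise with scale sensitivity / epsilon protects the query.
    """
    rng = rng or np.random.default_rng()
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: a block-level population count released with epsilon = 0.5.
# Every such release consumes part of the overall privacy-loss budget,
# which is why the DAS must account for each use of the confidential data.
noisy_count = laplace_count(true_count=127, epsilon=0.5)
```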

The basic approach to creating a differentially private decennial census is to treat the entire census as a set of queries on histograms. The selected queries measure six geolevels (nation, state, county, tract, block group, block) and allow thousands of queries per geounit, resulting in billions of queries, and billions of histogram cells, overall.

The Census Bureau first created a block-by-block algorithm designed to protect each block independently by measuring queries for each block, privatizing the answers, and then converting the results back to microdata. It also developed a top-down mechanism that first generates a national histogram without geographic identifiers and then allocates counts to each geography from the “top down.” This approach is easy to parallelize, and each geounit can have its own strategy selection. Using the high-dimensional matrix mechanism, it benefits from parallel composition at each geolevel and reduced variance for many aggregate regions.
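
The following is a highly simplified sketch of the top-down idea, not the bureau's actual algorithm: take a noisy measurement at the parent geolevel, take noisy measurements of the children, and then force the child estimates to be consistent with the parent total. All names and numbers are illustrative; the real DAS solves a constrained optimization and produces nonnegative integer microdata.

```python
import numpy as np

rng = np.random.default_rng()

def measure(count, epsilon):
    """Take a noisy (Laplace) measurement of a single count."""
    return count + rng.laplace(scale=1.0 / epsilon)

def allocate_top_down(parent_total, child_counts, epsilon):
    """Measure each child geography with noise, then rescale the children
    so they sum to the already-privatized parent total.

    Because the children partition disjoint populations, their measurements
    compose in parallel at each geolevel.
    """
    noisy_children = np.array([measure(c, epsilon) for c in child_counts])
    noisy_children = np.clip(noisy_children, 0.0, None)
    total = noisy_children.sum()
    return noisy_children * (parent_total / total) if total > 0 else noisy_children

state_total = measure(4_500_000, epsilon=1.0)          # parent geolevel
county_estimates = allocate_top_down(
    state_total, [1_200_000, 2_000_000, 1_300_000], epsilon=0.5
)
```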

The Census Bureau then tested both algorithms on the 1940 census data, available at IPUMS. It turns out the advantages of the “top down” mechanism outweigh the disadvantages when compared to the “block-by-block” mechanism on various measures, and the Census Bureau has opted to implement the “top-down” algorithm. Various runs of the 1940 data through the DAS, covering various values of the privacy parameter epsilon, were released to the public and are available to researchers.

Garfinkel also noted several organizational challenges. For one, all uses of confidential data need to be tracked and accounted for. Ideally, all desired queries (tables) should be known in advance, together with their desired accuracy. Furthermore, verifying that the implementation is correct is itself a challenge. Finally, traditional tabulations rely on data quality checks, but under differential privacy, these must be conducted without looking at the confidential raw data! The largest policy challenge, however, is the choice and allocation of the privacy budget.

Finally, data users’ concerns are even more challenging, as is determining the right value of epsilon. See Disclosure Avoidance and the 2020 Census for more information about differential privacy and the 2020 Census.

Ilya Mironov, recently at Google and now at Facebook, gave a talk titled “Differential Privacy in the Industry: Challenges and Successes.” A differential privacy framework measures the privacy guarantees provided by an algorithm. In this context, he described modalities of privacy, as practiced at Google. To frame the discussion, he provided a cross-classification of various algorithms by where the data are stored (distributed or centrally) and by what use is made of the algorithm and the data (statistics/analytics or machine learning). Mironov said, “Statistics is old school and machine learning is where industry is heading.” This raises an important question for our statistics community: Why such a perception of statistics?

The goal of distributed data analytics is to learn about data held at distributed sources, such as individual devices (or other distributed data or database settings). Mironov described the use of the RAPPOR (randomized aggregatable privacy-preserving ordinal response) algorithm in the Google Chrome browser, which has inspired new theory and applications. The main challenges are that the absolute error increases with the square root of N and there is privacy loss over time.
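
A minimal sketch of the basic randomized-response idea underlying RAPPOR (not Google's production implementation): each device flips its true bit with some probability before reporting, and the aggregator debiases the sum of the reports. The flip probability and device count below are illustrative; the example also shows why the absolute error of the estimate grows on the order of the square root of N.

```python
import numpy as np

rng = np.random.default_rng()

def randomize(bit, p_flip):
    """Each device reports its true bit with prob 1 - p_flip, else the opposite."""
    return bit if rng.random() > p_flip else 1 - bit

def estimate_total(reports, p_flip):
    """Debias the sum of randomized reports to estimate the true count."""
    n = len(reports)
    # E[report] = bit * (1 - p_flip) + (1 - bit) * p_flip, so invert the linear map.
    return (sum(reports) - n * p_flip) / (1 - 2 * p_flip)

true_bits = rng.integers(0, 2, size=100_000)             # simulated devices
reports = [randomize(b, p_flip=0.25) for b in true_bits]
est = estimate_total(reports, p_flip=0.25)
# The error of `est` relative to true_bits.sum() is on the order of sqrt(N),
# the accuracy challenge Mironov noted for local, distributed analytics.
```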

He then described the development of a new software stack called Cobalt as part of the new Fuchsia operating system, still within the context of statistical analysis of distributed data (distributed analytics). It is also based on randomized response. The main challenge is determining who anonymizes the data; the anonymization methodology must be transparent, and there are various options enforced by organizational methods.

Mironov then turned to data analytics on centrally stored data, which, according to him, is the “standard setting” in the differential privacy world. Examples include privacy integrated queries (PINQ), an early implementation of a data analysis platform designed to provide unconditional privacy guarantees for the records of the underlying data sets. The main challenges and risks are mission creep and the expense of implementing the platform over time, which forces analysts to make choices.

There are two main approaches to differentially private machine learning (in the context of centrally stored data): a family of algorithms called private aggregation of teacher ensembles (PATE) and the differentially private stochastic gradient descent (DP-SGD) method. PATE uses a collection of hundreds of teacher models to train a student model. DP-SGD privatizes each gradient update by clipping per-example gradients and adding noise. According to Mironov, DP-SGD is a better fit for the standard machine learning pipeline.
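
A schematic sketch of a single DP-SGD update step, framework-agnostic and not any particular library's implementation: per-example gradients are clipped to an L2 norm bound (which bounds the sensitivity of the batch gradient), Gaussian noise scaled to that bound is added, and an ordinary gradient step is taken. The function name and hyperparameters are illustrative.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, clip_norm, noise_multiplier, lr, rng):
    """One DP-SGD update on a batch of per-example gradients.

    1. Clip each example's gradient to L2 norm `clip_norm`.
    2. Sum the clipped gradients and add Gaussian noise with standard
       deviation `noise_multiplier * clip_norm`.
    3. Average and take an ordinary gradient step.
    """
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    noisy_sum = np.sum(clipped, axis=0) + rng.normal(
        scale=noise_multiplier * clip_norm, size=params.shape
    )
    return params - lr * noisy_sum / len(per_example_grads)

rng = np.random.default_rng()
params = np.zeros(10)
grads = [rng.normal(size=10) for _ in range(32)]   # stand-in per-example gradients
params = dp_sgd_step(params, grads, clip_norm=1.0, noise_multiplier=1.1, lr=0.1, rng=rng)
```

The privacy loss of a full training run is then obtained by composing the per-step guarantees across all iterations, for example with Rényi DP accounting of the kind discussed in Kasiviswanathan's talk below.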

Mironov also said, “Right now, machine learning is more of an art than a science, which requires adjustments to models to train the models for privacy.” Again, this is a sentiment familiar to the statistics community and often heard when describing data analysis with real data versus pure mathematical modeling.

Juan M. Lavista Ferres of Microsoft gave a talk titled “Differential Privacy in Windows 10, and Why Many DP Implementations Fail.” Introduced in 2015, Windows 10 is a series of personal computer operating systems produced by Microsoft. Microsoft collects metrics anonymously as part of telemetry, a service that gathers technical data about how Windows 10 devices and their related software are working and periodically sends this data to Microsoft to fix issues that occur. Users have the option to opt out of telemetry, and hundreds of millions of devices do not opt out. The problem, as Ferres showed, is that the information from opt-out machines is not missing at random.

In telemetry, data is systematically collected many times across the lifetime of a device, which creates a privacy leakage problem. To address this challenge, Microsoft developed a solution that provides the needed signal without affecting the privacy of individuals. Part of the solution is to discretize the numbers into buckets, representing each value by a discrete approximation. Using a new approach to the local differential privacy (LDP) model, differential privacy is adapted for repeated collection of counter data and is applied before the data are transmitted. Windows 10 includes an API allowing developers to leverage a built-in differential privacy solution.
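
A minimal sketch of the local model Ferres described, for illustration only and not Microsoft's telemetry code: the device maps its raw counter into a bucket and perturbs the bucket index with k-ary randomized response before anything leaves the device. The bucket edges, epsilon, and function names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng()
BUCKETS = [0, 1, 5, 10, 50, 100]   # illustrative counter buckets

def bucketize(value):
    """Map a raw counter to the index of the largest bucket edge <= value."""
    return max(i for i, edge in enumerate(BUCKETS) if value >= edge)

def ldp_report(value, epsilon):
    """Report a bucket index under epsilon-local DP via k-ary randomized response."""
    k = len(BUCKETS)
    true_idx = bucketize(value)
    p_truth = np.exp(epsilon) / (np.exp(epsilon) + k - 1)
    if rng.random() < p_truth:
        return true_idx
    # Otherwise report one of the other k - 1 buckets uniformly at random.
    others = [i for i in range(k) if i != true_idx]
    return int(rng.choice(others))

# The device runs ldp_report before transmission; the server only ever sees
# randomized buckets and aggregates many reports to estimate the histogram.
report = ldp_report(value=37, epsilon=1.0)
```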

Turning to implementation challenges, Ferres stated that many differential privacy projects fail because customers do not understand the solution. Ninety percent of developers surveyed had never heard of DP; once introduced to it, they often think it is a magic box that can solve all their problems. A common frustration is that they can query global models but not the individual data, since the data are not accessible in raw format.

Ferres also explained that he is passionate about DP because it can provide data-driven input to health issues such as Sudden Infant Death Syndrome (SIDS). The current approach for accessing data for research at the US Centers for Disease Control and Prevention requires writing scripts, submitting them to a trusted curator, seeking approval, and finally being able to run the script. This process takes three months and costs $900 for each script. “Research doesn’t work if every query takes three months to run,” Ferres said. He concluded by noting, “Differential privacy can be an amazing tool for opening these data sets while preserving the privacy of the individuals.”

Shiva Kasiviswanathan of Amazon stated that differential privacy provides provable protection and allows clear quantification of privacy losses; however, there are challenges with implementing differential privacy at Amazon. Some are technology-oriented, while others are based on human and cultural factors:

  • Different teams own different services, so differential privacy products have to be negotiated across teams
  • The teams do not have proper differentially private data cleaning and exploration tools
  • Software developers want code they can start with, not technical papers
  • Explaining the legal implications of differential privacy is challenging

There is a large body of research that has been developed to design algorithms and tools to achieve differential privacy, understand the privacy-utility tradeoffs in different data access setups, and integrate differential privacy with machine learning and statistical inference. Amazon is working to address privacy challenges, especially by building differential privacy tools that are accessible to developers (both within and outside of Amazon).

Kasiviswanathan mentioned the autodp package, maintained by Yu-Xiang Wang on GitHub. It implements Rényi DP (which goes back to Mironov) and is particularly useful when a data set is accessed by a sequence of randomized mechanisms. The approach weighs the tradeoffs through a privacy calibrator that numerically calibrates noise to privacy requirements. Amazon is working to integrate this with Apache MXNet, a fast and scalable training and inference framework with an easy-to-use, concise API for machine learning.
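
To illustrate the accounting idea, here is a short sketch of Rényi-DP composition for a sequence of Gaussian mechanisms and its conversion to an (epsilon, delta) guarantee. This is not the autodp API itself; the function names, noise levels, and grid of orders are assumptions made for the example, using the standard RDP expressions for the Gaussian mechanism.

```python
import numpy as np

def gaussian_rdp(alpha, sigma, sensitivity=1.0):
    """Renyi DP of the Gaussian mechanism at order alpha (sensitivity-1 query)."""
    return alpha * sensitivity**2 / (2 * sigma**2)

def compose_and_convert(sigmas, delta, alphas=np.arange(2, 128)):
    """Compose a sequence of Gaussian mechanisms in RDP, then convert to (eps, delta)-DP.

    RDP composes by simple addition at each order alpha; the reported epsilon
    is the smallest value over the grid of orders.
    """
    eps_per_alpha = [
        sum(gaussian_rdp(a, s) for s in sigmas) + np.log(1 / delta) / (a - 1)
        for a in alphas
    ]
    return min(eps_per_alpha)

# Example: ten queries answered with Gaussian noise sigma = 4,
# converted to an (epsilon, 1e-6)-DP guarantee for the whole sequence.
eps = compose_and_convert(sigmas=[4.0] * 10, delta=1e-6)
```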

Kasiviswanathan briefly described other privacy projects at Amazon, such as participant roles and analysis of false discovery rates.

Slavkovic moderated the discussion at the end of the session, which focused on topics including achieving higher accuracy in large aggregations (e.g., large cities), defining federated learning (combining traditional and differential privacy methods), and how the privacy-loss budget will be set.
