
Prescribing Privacy: Human and Computational Resource Limitations

1 September 2022

Jingchen (Monika) Hu is an associate professor of statistics at Vassar College. Her research focuses on statistical data privacy methods, mainly synthetic data and differential privacy. She teaches a senior seminar on statistical data privacy and engages undergraduate students in learning cutting-edge methods.


Claire McKay Bowen is a principal research associate in the Center on Labor, Human Services, and Population and leads the Statistical Methods Group at the Urban Institute. Her research focuses on developing and assessing the quality of differentially private data synthesis methods and science communication. 


What Statisticians and Data Scientists Can Do

President Biden’s day-one Executive Order on Advancing Racial Equity and Support for Underserved Communities Through the Federal Government committed federal agencies and White House offices to actively pursuing more equitable engagement and outcomes for people of color and underserved communities. However, many statistical agencies do not collect or release detailed demographic data and statistics due to growing concerns about disclosure risks. For instance, people of color with low incomes are more susceptible to privacy attacks because they more heavily rely on smartphones for internet access and provide more personal information for free cell phone app services, according to Mary Madden in the report Privacy, Security, and Digital Inequality. Such information collection makes them more easily identifiable, especially if they are located in rural geographic locations.

To address these issues, some public policymakers propose agencies review and update their privacy protection methodologies with more modern data privacy and confidentiality techniques. Yet, many agencies cannot update their privacy protection policies due to the lack of both human and computational resources. This leaves some asking, “Why?” and, more importantly, “What can we do?”

Human Resources

On the human resource side, there exists a gap between the growing demand for professionals in modern data privacy and confidentiality techniques and the training of such professionals at educational institutions and in the workplace. These professionals should be experts in privacy and confidentiality techniques who can design and implement tailored approaches to specific data sets, evaluate the effectiveness of the approaches, and potentially provide training on the methods to colleagues. 

The workforce demand spans federal statistical agencies, local and state government entities, and private sector organizations. At the federal statistical agency level, trained experts routinely design, implement, and evaluate these techniques and approaches, and a disclosure review board sometimes makes the final decisions. In many agencies with fewer resources, such as local governments, however, data privacy and confidentiality work requires establishing consulting relationships with outside experts, and a disclosure review board is far from being established. At private sector organizations, large and small, active recruiting of trained experts in data privacy and confidentiality has continued despite stalled recruitment efforts overall.

A search on LinkedIn with the keyword “privacy” showed more than 210,000 results at the time of this writing, which includes privacy, security, and decentralized learning at Microsoft Research; privacy engineering at Amazon Business; information governance and privacy at PwC; and privacy solutions architect at Google. This search alone demonstrates the enormous demand for all types of data privacy experts, such as those in privacy law, cybersecurity, and statistical privacy.

When it comes to training professionals in data privacy and confidentiality techniques, little is happening in statistics and data science, especially at the non-PhD levels. Most of the data privacy and confidentiality courses focus on differential privacy and appear in computer science PhD programs for graduate computer science students. At the undergraduate level, there are occasional seminar courses taught by professors who conduct research in the area.

These advanced-level courses cover the nuts and bolts of learning and implementing the techniques, although not necessarily the theoretical underpinnings. Yet, given the growing interest and demand, most undergraduate courses on data privacy and confidentiality are introductory, open to students from all backgrounds, and not necessarily designed and taught by professors trained in the area.

This means students in these courses typically do not get into the details of how to perform the techniques in practice. Nevertheless, it is encouraging to see that many technical online courses on this topic are being offered for professionals, which again demonstrates the enormous workforce demand to train more professionals.

Computational Resources

On the computational side, the lack of readily available computational tools hinders professionals' ability to implement more modern data privacy and confidentiality methods. They might not have the proper computing environment to run these methods or the technical background (expert knowledge and/or programming skills) to hand code them. Moreover, hand coding is more error-prone and often less efficient.

As mentioned before, trained statistics and data science professionals in data privacy and confidentiality should understand the nuts and bolts of these methods, but not necessarily the theoretical underpinnings. For example, we do not need to know how to build a bike in order to ride it. 

If all we need are bikes, then are there enough bikes for people to ride? Unfortunately, few bikes exist.

In the statistical field, one software tool is synthpop, an R package for synthetic data generation: it creates a ‘fake’ data set from a statistical model fitted to the confidential data, aiming to preserve the same statistical features and data structure. The synthpop package also measures data usefulness. However, it lacks the functionality to evaluate the level of protection the generated synthetic data sets provide.
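To make the idea concrete, here is a minimal from-scratch sketch (in Python, not synthpop's actual R interface) of sequential synthesis on a hypothetical two-variable data set: model the first variable parametrically, model the second conditional on the first, then sample both from the fitted models. The variable names and distributions are illustrative assumptions, not real data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy "confidential" data: age and income (hypothetical values).
age = rng.normal(45, 12, size=1000)
income = 20000 + 900 * age + rng.normal(0, 5000, size=1000)

# Step 1: model the first variable parametrically (here, a normal fit)
# and draw synthetic values from the fitted distribution.
syn_age = rng.normal(age.mean(), age.std(), size=len(age))

# Step 2: model income conditional on age (ordinary least squares),
# then draw synthetic income from the fitted line plus residual noise.
slope, intercept = np.polyfit(age, income, 1)
resid_sd = np.std(income - (intercept + slope * age))
syn_income = intercept + slope * syn_age + rng.normal(0, resid_sd, size=len(age))

# The synthetic data set preserves broad statistical features (means,
# correlation) without reproducing any real record.
print(round(np.corrcoef(age, income)[0, 1], 2),
      round(np.corrcoef(syn_age, syn_income)[0, 1], 2))
```

Because no synthetic record corresponds to a real person, release of such data can reduce disclosure risk, though (as the article notes) measuring how much protection it actually provides is the harder, less-tooled problem.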

Another research group out of Harvard University started OpenDP, which it describes as “a community effort to build trustworthy, open-source software tools for statistical analysis of sensitive private data.” OpenDP has partnered with Microsoft, engaged with the broader data privacy and confidentiality community, and created a GitHub repo. However, the platform is still under development and not ready for primetime.
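The core technique behind tools like OpenDP is differential privacy. A minimal from-scratch sketch (this is an illustration, not OpenDP's API) of its simplest instrument, the Laplace mechanism: add noise scaled to a query's sensitivity divided by the privacy budget epsilon. The data and query below are hypothetical.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release true_value with Laplace noise of scale sensitivity/epsilon."""
    return true_value + rng.laplace(0.0, sensitivity / epsilon)

rng = np.random.default_rng(0)

# Hypothetical confidential data: incomes clipped to [0, 100000].
incomes = np.clip(rng.normal(50000, 20000, size=500), 0, 100000)

# A counting query ("how many people earn over $60k?") has sensitivity 1:
# adding or removing one person changes the count by at most 1.
true_count = int((incomes > 60000).sum())
private_count = laplace_mechanism(true_count, sensitivity=1, epsilon=1.0, rng=rng)

print(true_count, round(private_count, 1))
```

The released noisy count limits what any attacker can learn about a single individual, with smaller epsilon meaning more noise and stronger protection; production libraries exist precisely because composing many such releases correctly is easy to get wrong by hand.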

Despite the demand for data privacy and confidentiality software, developing these tools remains challenging, in part because there is not enough funding and time to support this type of work.

What can we, as statisticians and data scientists, do to address these resource challenges?

There is much that could be done to advance the field. Following are a few steps we recommend starting with:

  • Incorporate data privacy and confidentiality into undergraduate curricula in ways that go beyond a basic introduction, such as applying appropriate methods to real data and evaluating their effectiveness.
  • Establish more of a presence in the space through research, teaching, and science communication. The ratio of statisticians to computer scientists working in the area often feels like 1 to 20.
  • Focus on how to translate theory into applications and deployment, rather than only the theory.
  • Advocate for more funding for applied research and deployment (i.e., computational tools and educational resources), instead of only new method development.

These steps alone will not solve all the human and computational resource limitations, but we believe they will help alleviate them.

