Jordan Urges Both Computational, Inferential Thinking in Data Science
Steve Pierson, ASA Director of Science Policy
Michael I. Jordan—chair of the University of California, Berkeley Department of Statistics—presented at the National Science Foundation (NSF) in late January for the NSF Data Science Seminar Series with a talk titled “Computational Thinking, Inferential Thinking, and Data Science.” He spoke to a room full of NSF program officers and outside attendees, as well as those attending via webinar.
In a concise 25-minute presentation, Jordan made his case for why one should think of data science as the combination of computational and inferential thinking, noting that the most appealing challenge in Big Data for him is the potential for personalization. He began with the challenge that the core theories of computer science and statistics were developed separately, leaving an oil-and-water problem to be surmounted. As an example, he noted that core statistical theory has no place for runtime and other computational resources, while core computational theory has no place for statistical risk.
To lay the basis for this theme, Jordan defined computational thinking as including abstraction, modularity, scalability, robustness, and similar concepts. He described inferential thinking as (i) considering the real-world phenomenon behind the data; (ii) considering the sampling pattern that gave rise to the data; and (iii) developing procedures that will go “backward” from the data to the underlying phenomenon. Inferential thinking, he said, is not merely computing “statistics” or running machine-learning algorithms; there must be a focus on error bars and confidence intervals on any outputs.
To convey the importance of combining inferential and computational thinking, Jordan looked at examples of database queries in which privacy should be protected, with the amount of protection tunable by each individual: perhaps little protection where the data may be used to improve a medical treatment for a loved one, or a lot of protection in the case of a company trying to sell a product. The inferential aspect of the problem is to consider the relationship of those in the database to the population, while the computational side is to consider the relationship of the results from a privatized database to those from the originating database. The goal is to make privatized database outputs as close as possible to what one would get with access to the data of the whole population (and not just the originating database, as those ignoring the inferential aspects might be inclined to do).
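The article does not say which privacy mechanism Jordan discussed. As a rough illustration only, the sketch below swaps in the Laplace mechanism from differential privacy, where a hypothetical parameter epsilon plays the role of the tunable amount of protection (smaller epsilon means more noise and more privacy). The population, the database, and every function name here are made up for the example. It prints the three quantities the paragraph distinguishes: the population value, the database value, and the privatized database value.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, lower, upper, epsilon, rng):
    """Mean of values clipped to [lower, upper], plus Laplace noise.

    Assumption for this sketch: the database size is public, so replacing
    one record changes the clipped mean by at most (upper - lower) / n.
    """
    clipped = [min(max(v, lower), upper) for v in values]
    n = len(clipped)
    sensitivity = (upper - lower) / n
    return sum(clipped) / n + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical data: a large population and a small database sampled from it.
rng = random.Random(0)
population = [rng.random() for _ in range(100_000)]
database = rng.sample(population, 500)

population_mean = sum(population) / len(population)
database_mean = sum(database) / len(database)

print("population mean:", round(population_mean, 4))
print("database mean:  ", round(database_mean, 4))
for epsilon in (0.1, 1.0, 10.0):
    released = private_mean(database, 0.0, 1.0, epsilon, rng)
    print("epsilon =", epsilon, "privatized mean:", round(released, 4))
```

The gap between the population mean and the database mean is the inferential (sampling) error, and the gap between the database mean and the privatized mean is the error introduced by the privacy mechanism. Comparing the privatized output with the population value, rather than only with the database value, accounts for both.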
For the remaining 35 minutes of his seminar, Jordan took questions from the audience. The questions ranged from the role of NSF in bringing together the two ways of thinking and the recent advance regarding the Chinese game Go to the role of industry in advancing data science and the data science challenges in academia and education. On the last of these, he pointed to the University of California, Berkeley’s new Data Science Education Program, which has been well received by students. Noting that Bayesian thinking is included in the course, he said he thinks of Bayesian thinking as a combination of cognitive science and statistics.
Later, Jordan noted the large discrepancy in the data science community between those who know the Fast Fourier Transform—one of the key accomplishments to come out of the signal processing community in the last 100 years—and those who know the bootstrap method—one of the key accomplishments to come out of the statistics community in the last 100 years—and urged all data scientists to know the latter method just as they know the former.
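For readers less familiar with the bootstrap, the following is a minimal sketch of one common variant, the percentile bootstrap, used here to put an interval around a sample mean. It is an illustration rather than anything shown in the talk. The data values, the function name bootstrap_ci, and the choices of 2,000 resamples and a 95 percent level are all assumptions made for the example.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = random.Random(seed)
    n = len(data)
    # Resample the data with replacement, recompute the statistic on each
    # resample, and sort the resulting estimates.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    # Read the interval endpoints off the empirical quantiles.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Hypothetical measurements, for illustration only.
sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7, 3.9, 3.1]
low, high = bootstrap_ci(sample)
print("sample mean:", statistics.mean(sample))
print("approximate 95% interval:", low, high)
```

The appeal of the method is that the same resampling loop works for statistics with no convenient formula for their standard error, such as a median; passing stat=statistics.median is the only change needed.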
To further explain inferential thinking, Jordan relayed the personal story of how he became involved in data science when a medical doctor reported concerns from an imaging diagnostic. He questioned the geneticist thoroughly about the imaging and asked about the risk factors involved. Through further questioning and research, he learned that the data processing algorithms weren’t updated when the resolution of the camera was increased considerably. In the end, he learned that the signals about which his doctors were concerned were just noise—an artifact of the outdated data processing algorithms.
As part of the ASA’s effort to raise the profile of statistics in data science, ASA staff recommended to the NSF that a statistician be included as a speaker for its data science webinar series.