Statistics for Data Science
Many years ago, I had the good fortune to benefit from my MS adviser’s encouragement to pursue a PhD—and further benefit from his wide familiarity with many departments of statistics. I learned statistics departments varied in size (from tiny to large), emphasis (almost entirely theoretical, very applied, and somewhere in between), and placement of students (academe versus industry or government). And so it is with many fields: Some departments of French include other languages, while others do not. Or some focus on 19th century literature versus existentialism. I suspect we will see the same diversity in data science programs.
The University of Virginia recently announced a large gift to start a school of data science. One of the first questions raised by the faculty was, “What exactly is data science?” One definition emerges from the National Science Foundation (NSF)–funded Transdisciplinary Research in Principles of Data Science (TRIPODS) projects, which bring together statistics, mathematics, and theoretical computer science. To this, I would add the subject-matter discipline for which the data were collected and the ethical foundations for appropriate use. (Ethics is an essential part of statistics, too.)
Most of us would agree the primary reason one collects data is to draw inferences and insights from it. So, our discipline plays a central role in data science, as many others have noted. But the relative contributions of mathematics, computer science, and statistics—and the disciplines of application—will inevitably shape the styles of departments or schools of data science. Just as statistics departments have distinguished themselves with different emphases, we can expect to see diversity develop in data science programs. In any data science program, however, statistics must play some role. Data science must include solid probabilistic and statistical foundations for drawing inferences from data. How much of a role?
To help the ASA think about the possible roles and levels of integration of statistics and data science, the ASA Board of Directors last November recommended the creation of the Ad-Hoc Advisory Committee on Statistics and Data Science. This committee will be co-chaired by ASA Board member Mark Glickman and former ASA Board member Kathy Ensor. The committee’s charge includes the development of recommendations and the creation of a plan for the ASA to interact with data science.
The members of the committee bring a wide range of talents and opinions. Some may think statistics departments are destined to become obsolete and should join data science programs now, while others may want to focus on welcoming data scientists into our discipline—just as we have welcomed mathematicians, epidemiologists, psychologists, political and social scientists, chemists, and all sorts of scientists into the statistical community.
The first question for the committee to address is: What are the unique contributions of statistics to data science? The committee’s response to this issue may itself be valuable—if only to remind us that statisticians cannot afford to approach our discipline too narrowly, nor to train our students too narrowly.
I recently heard that industry relies on data scientists to formalize the process from idea conception to production of data-focused inventions. As John Tukey said, “Finding the question is often more important than finding the answer” (cited by D.R. Brillinger in the Annals of Statistics, 2002, p. 1571). Isn’t this thought process part of our training as statisticians, or have educators of statistics surrendered this thought process to data scientists?
Which route will our discipline take? Fifteen years ago, 2004 ASA President and renowned scientist Bradley Efron remarked at an ASA Board dinner, “People have been predicting the demise of statistics for years: First it was computer science, then it was artificial intelligence, then it was expert systems, then it was operations research … And guess what? We’re still here.” Would “data science” be in that list today? Or have we finally encountered the discipline that will ultimately lead to our demise?
I do not think so. As an example, J.H. Friedman, T. Hastie, and R. Tibshirani developed statistical foundations for boosting, a classification technique proposed in the computer science literature, in “Additive Logistic Regression: A Statistical View of Boosting,” published in the Annals of Statistics (2000), 28:2, pp. 337-407. Right now, Efron is preparing a talk in which he relates big data prediction algorithms, deep learning, etc., to classical statistics. Similar efforts by statisticians are underway.
I hope our discipline will not be subsumed into data science, but will retain its distinctiveness. Data science needs statistics, just as our discipline benefits from interactions with mathematics and computer science. The challenge for all of us is to recognize the potential “flavors” of data science programs that will develop and identify the ways in which we want to participate in those programs. Our members, and our statistical colleagues worldwide, have adapted well to change and recognized—and embraced—the diversity of directions our discipline has taken.
I firmly believe we will do so here, as well.
More than 50 years ago, John Tukey was invited to give the commencement address to New Bedford High School (from which he would have graduated some 30 years earlier had he not been home-schooled). His remarks included the following:
Perhaps this is how “data science” arose—to tackle problems using different types of data to answer various questions. These are statistical problems that call for statistical approaches, with computer science and mathematics providing the tools. Coming up with new ways of thinking leading to possibly new approaches to address the uses of large databases is, as Tukey would say, the hard part—but this has always been the hard part. This is our job as statisticians.
If you have ideas for Mark Glickman and Kathy Ensor’s Ad Hoc Advisory Committee on Data Science, please share them by emailing Donna LaLonde or Ron Wasserstein.
Data science can be done with or without mathematics, statistics, or probability. As a Ph.D. statistician turned data scientist, I have spent the last 20 years developing sound technology that doesn’t require statistical training to understand or implement. Here are three examples:
1) Gentle Approach to Linear Algebra, with Machine Learning Applications
(link: https://dsc.news/2K89vvT)
2) Confidence Intervals Without Pain
(link: https://dsc.news/2PUhCNh)
3) State-of-the-Art Machine Learning Automation with HDT
(link: https://dsc.news/2WhBK2f)
The advantage of my approach is that it can be understood by most professionals (economists, physicists, biologists, engineers) and easily automated. I will publish a book about it in the next six months. All the content is already written and tested; I just have to put everything together.