Statistics for Data Science

1 May 2019
Karen Kafadar

Many years ago, I had the good fortune to benefit from my MS adviser’s encouragement to pursue a PhD—and further benefit from his wide familiarity with many departments of statistics. I learned statistics departments varied in size (from tiny to large), emphasis (almost entirely theoretical, very applied, and somewhere in between), and placement of students (academe versus industry or government). And so it is with many fields: Some departments of French include other languages, while others do not. Or some focus on 19th century literature versus existentialism. I suspect we will see the same diversity in data science programs.

The University of Virginia recently announced a large gift to start a school of data science. One of the first questions raised by the faculty was, “What exactly is data science?” One definition emerges from the National Science Foundation (NSF)–funded Transdisciplinary Research in Principles of Data Science (TRIPODS) projects, which bring together statistics, mathematics, and theoretical computer science. To this, I would add the subject-matter discipline for which the data were collected and the ethical foundations for appropriate use. (Ethics is an essential part of statistics, too.)

Most of us would agree the primary reason one collects data is to draw inferences and insights from it. So, our discipline plays a central role in data science, as many others have noted. But the relative contributions of mathematics, computer science, and statistics—and the disciplines of application—will inevitably shape the styles of departments or schools of data science. Just as statistics departments have distinguished themselves with different emphases, we can expect to see diversity develop in data science programs. In any data science program, however, statistics must play some role. Data science must include solid probabilistic and statistical foundations for drawing inferences from data. How much of a role?

To help the ASA think about the possible roles and levels of integration of statistics and data science, the ASA Board of Directors last November recommended the creation of the Ad-Hoc Advisory Committee on Statistics and Data Science. This committee will be co-chaired by ASA Board member Mark Glickman and former ASA Board member Kathy Ensor. The committee’s charge includes the development of recommendations and the creation of a plan for the ASA to interact with data science.

The members of the committee bring a wide range of talents and opinions. Some may think statistics departments are destined to become obsolete and should join data science programs now, while others may want to focus on welcoming data scientists into our discipline—just as we have welcomed mathematicians, epidemiologists, psychologists, political and social scientists, chemists, and all sorts of scientists into the statistical community.

The first question for the committee to address is: What are the unique contributions of statistics to data science? The committee’s response to this issue may itself be valuable—if only to remind us that statisticians cannot afford to approach our discipline too narrowly, nor to train our students too narrowly.

I recently heard industry relies on data scientists to formalize the process from idea-conception to production of data-focused inventions. As John Tukey said, “Finding the question is often more important than finding the answer” (cited by D.R. Brillinger in the Annals of Statistics, 2002, p.1571). Isn’t this thought process part of our training as statisticians, or have educators of statistics surrendered this thought process to data scientists?

Which route will our discipline take? Fifteen years ago, 2004 ASA President and renowned scientist Bradley Efron remarked at an ASA Board dinner, “People have been predicting the demise of statistics for years: First it was computer science, then it was artificial intelligence, then it was expert systems, then it was operations research … And guess what? We’re still here.” Would “data science” be in that list today? Or have we finally encountered the discipline that will ultimately lead to our demise?

I do not think so. As an example, J.H. Friedman, T. Hastie, and R. Tibshirani developed statistical foundations for boosting (a classification method proposed in the computer science literature) in “Additive Logistic Regression: A Statistical View of Boosting,” published in 2000 in the Annals of Statistics, 28:2, pp. 337-407. Right now, Efron is preparing a talk in which he relates big data prediction algorithms, deep learning, etc., to classical statistics. Similar efforts by statisticians are underway.

I hope our discipline will not be subsumed into data science, but will retain its distinctiveness. Data science needs statistics, just as our discipline benefits from interactions with mathematics and computer science. The challenge for all of us is to recognize the potential “flavors” of data science programs that will develop and identify the ways in which we want to participate in those programs. Our members, and our statistical colleagues worldwide, have adapted well to change and recognized—and embraced—the diversity of directions our discipline has taken.

I firmly believe we will do so here, as well.

More than 50 years ago, John Tukey was invited to give the commencement address to New Bedford High School (from which he would have graduated some 30 years earlier had he not been home-schooled). His remarks included the following:

The Chinese have a curse: “May your children live in interesting times!” My parents, your parents, and most parents for the last and next few centuries have had—or will have—children who live in interesting times. That means there have been problems, there are problems, and there will be problems—many of them very serious.
It was once fashionable to believe in progress and the near utopia that would soon be with us. Then it was fashionable to say that the world was horrible and getting much more so with inevitable rapidity. I tell you now that it is not true that problems will soon disappear—and equally not true that they will get much, much worse. They will change, which means that we will always be replacing familiar problems—problems that we know something about tackling—by new ones that we do not yet know how to deal with. The most painful things are not the problems, but the need to find new ways of thought, new things to be done, and new kinds of social organization. The need to change is ever painful, and it is the essential feature of interesting times.

Perhaps this is how “data science” arose—to tackle problems using different types of data to answer various questions. These are statistical problems that call for statistical approaches, with computer science and mathematics providing the tools. Coming up with new ways of thinking leading to possibly new approaches to address the uses of large databases is, as Tukey would say, the hard part—but this has always been the hard part. This is our job as statisticians.

If you have ideas for Mark Glickman and Kathy Ensor’s Ad Hoc Advisory Committee on Data Science, please share them by emailing Donna LaLonde or Ron Wasserstein.

One Comment »

  • Vincent Granville said:

    Data science can be done with or without mathematics, statistics, or probability. As an ex-Ph.D. statistician turned data scientist, I have spent the last 20 years developing sound technologies that don’t require statistical training to be understood or implemented. Here are three examples:

    1) Gentle Approach to Linear Algebra, with Machine Learning Applications
    (link: https://dsc.news/2K89vvT)

    2) Confidence Intervals Without Pain
    (link: https://dsc.news/2PUhCNh)

    3) State-of-the-Art Machine Learning Automation with HDT
    (link: https://dsc.news/2WhBK2f)

    The advantage of my approach is that it can be understood by most professionals (economists, physicists, biologists, engineers) and easily automated. I will publish a book about it in the next six months: all the content is already written and tested, and I just have to put everything together.