## The Identity of Statistics in Data Science

Tommy Jones is the director of data science at Impact Research, LLC. He holds an MS in mathematics and statistics from Georgetown University and a BA in economics from the College of William and Mary. He is a PhD student in the George Mason University Department of Computational and Data Sciences. He specializes in statistical models of language and time series modeling.

Data science has been generating considerable interest inside and outside of the statistics community. Within the statistics community, there is a debate about whether data science and statistics are distinct disciplines. This conversation about data science betrays an anxiety about our (statisticians’) identity.

In a July 2013 article in *Amstat News*, “Aren’t We Data Science?” former ASA president Marie Davidian summarizes these concerns: “I’ve been told of university administrators who have stated their perceptions that statistics is relevant only to ‘small data’ and ‘traditional’ ‘tools’ for their analysis, while data science is focused on Big Data, Big Questions, and innovative new methods.”

Similarly, Norman Matloff titled his November 2014 editorial in *Amstat News* “Statistics Losing Ground to Computer Science.” He raised many good points, but his title cuts to the heart of our anxiety. Does “data science” mean we’re being replaced?

I believe this anxiety stems from an overly broad definition of statistics and an unclear definition of data science. For my part, I’ve come to see data science as supply chain management for “data products.” This supply chain starts with real-world problems and ends with a report, business decision, or software. The middle contains lots of statistics, databases, programming, communicating, etc. Data science is fundamentally multidisciplinary. “But,” you may ask, “isn’t that just statistics?”

Davidian’s article is titled “Aren’t We Data Science?” after all. Randy Bartlett answered “We Are Data Science” in a subsequent Statistician’s View. This “everything data” definition of statistics is popular among statisticians. Former ASA President Robert Rodriguez championed this view in 2012, offering ASA as a “big tent.” The popular blog Simply Statistics states, “Whenever someone does something with data, we should claim them as a statistician.”

There is historical precedent for this claim. Statistics as a discipline originated in the 18th century. Least squares dates to the early 1800s with Gauss and Legendre. We statisticians were the only data game in town, even as statistics became tied with mathematical probability in the 19th and early 20th centuries.

Yet times have changed. Judging by current statistics curricula, statistics is more closely tied to the mathematics of probability than to fundamentals of data management. Survey the requirements of most graduate statistics programs. There is a core of courses in measure-theoretic probability, theoretical statistics, and linear models. I am not saying computation, database management, and application foci are absent. But the degree to which such courses are emphasized, or even offered at all, is highly variable. What proportion of programs require a scientific databases course or a high-performance computing course? We are well trained in quantifying uncertainty and deriving asymptotics. We are poorly trained in the tools of modern data management.

What has driven this structural break? Data have proliferated. This isn’t about the volume of data in a “Big Data” sense, but rather that data are more popular. More data sets exist. More people are analyzing data. It is no longer the case (if it ever was) that only scientists, trained to deal with complexity, are the consumers of data products. The need for compelling visualizations and narratives to convey complicated stories has increased.

As models have become more accurate, they have also become more complex.

Ensembles of models are often better predictors than any single model. Ensembles are empirically accurate, but their asymptotic properties are often unknown. And an additional question arises: Asymptotic to what? One could take any or all of the number of observations, predictors, models in the ensemble, etc. to infinity and possibly arrive at different solutions. In the age of Big Data, asymptotic properties matter.
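The claim that ensembles often beat single models can be illustrated with a toy simulation. This is a minimal sketch under a strong assumption that is not from the article: each hypothetical "model" predicts the truth plus its own independent, unbiased noise. Real ensemble members are usually correlated, so the gains are typically smaller. The function name `ensemble_mse` and its parameters are invented for illustration.

```python
import random
import statistics


def ensemble_mse(n_models, n_points=2000, noise_sd=1.0, seed=0):
    """Mean squared error when n_models noisy predictions are averaged per point."""
    rng = random.Random(seed)
    squared_errors = []
    for _ in range(n_points):
        truth = rng.uniform(-1.0, 1.0)
        # Assumption: every model is unbiased and its noise is independent
        # of the other models' noise.
        predictions = [truth + rng.gauss(0.0, noise_sd) for _ in range(n_models)]
        ensemble_prediction = statistics.fmean(predictions)
        squared_errors.append((ensemble_prediction - truth) ** 2)
    return statistics.fmean(squared_errors)


# Error as the number of models grows while the number of points stays fixed.
for m in (1, 10, 100):
    print(m, round(ensemble_mse(m), 3))
```

Under these assumptions the error shrinks roughly in proportion to one over the number of models. Note that this sends only one quantity (the ensemble size) to infinity, which is one of several possible answers to "asymptotic to what?"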

Finally, data are bigger in a Big Data sense. Storing, moving, and processing terabytes of data is neither simple nor all “statistical” in nature. There has long been a working relationship between statistics and computer science. But now software engineering knowledge is required if any useful analysis is to come from a Big Data project.

Whither statistics?

#### The More Things Change, the More They Stay the Same

In an age of Big Data, I believe statistics’ focus on probability and asymptotic properties is more valuable, not less. As we move toward more complex statistical and machine-learned models, there is still a need to understand the properties of and to get inferences from these models. A (computational) data scientist once told me “statisticians will be the ones to help us figure this mess out.” These are questions at the heart of theoretical statistics.

And in a world that is streaming data, careful research design and data collection are as important as ever. A biased sample is still biased if it has a million observations. This is especially important when the data are born of the Internet and people implicitly or explicitly opt in. These are challenges survey statisticians face regularly.
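The point that sample size does not cure selection bias can be sketched with a small simulation. Everything here is a hypothetical setup, not something taken from the article: a population score drawn from a standard normal distribution, and an opt-in rule in which people with higher scores are more likely to respond (a logistic response probability). The name `opt_in_means` is invented for illustration.

```python
import math
import random
import statistics


def opt_in_means(n, seed=0):
    """Return (population_mean, opt_in_sample_mean) for a simulated opt-in survey."""
    rng = random.Random(seed)
    # Hypothetical population: scores drawn from Normal(0, 1).
    population = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Assumed opt-in mechanism: response probability rises with the score.
    sample = [x for x in population if rng.random() < 1.0 / (1.0 + math.exp(-x))]
    return statistics.fmean(population), statistics.fmean(sample)


# The gap between the two means does not close as n grows.
for n in (1_000, 100_000, 1_000_000):
    true_mean, biased_mean = opt_in_means(n)
    print(n, round(true_mean, 3), round(biased_mean, 3))
```

More observations make the biased estimate more precise, but it is a precise estimate of the wrong quantity. Only a better design or an explicit model of the response mechanism addresses the gap.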

Recent research by statisticians is tackling some of these issues. Gerard Biau, Luc Devroye, and Gabor Lugosi have demonstrated the consistency of averaging classifiers. Stefan Wager, Trevor Hastie, and Bradley Efron propose methods to get prediction errors of bootstrapped and bagged learners. Abhijit Dasgupta et al. show how to estimate effect size using nonparametric, “black box” models. Andrew Womack, Elias Moreno, and George Casella have shown that a popular model for text mining is an inconsistent estimator.

#### But Sometimes, Things Just Change

While many of the fundamental problems facing statisticians are the same, the applications and environment are different. Statistics education, particularly at the graduate level, must adapt. As data get “bigger” and research and applications become more multidisciplinary, the need for statisticians to communicate and collaborate with a wide range of professionals and laypeople increases.

Statistics education should require minimum competency in fundamentals of computer science. ASA’s recent statement, “The Role of Statistics in Data Science,” highlights three data science skillsets: database management, statistics and machine learning, and distributed and parallel systems. Statisticians must work closely with software engineers to develop solutions that scale. We must understand the code so that scaled solutions still have desirable statistical properties. I believe that statisticians should have minimum foundational training in database management and high-performance computing.

In addition, examples and applications in introductory statistics courses may need updates. For example, ensemble methods will be at least as important as linear regression in the coming years. We may consider teaching concepts like Zipf’s and Heaps’ laws early on, as analyses of linguistic data are growing more common.
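For readers unfamiliar with the two laws: Zipf's law says a word's frequency falls off roughly as a power of its frequency rank, and Heaps' law says the number of distinct words grows roughly as a power of the number of tokens read. The sketch below computes the two raw quantities those laws describe. The helper names and the toy sentence are made up for illustration, and a text this small cannot confirm either law.

```python
from collections import Counter


def rank_frequency(tokens):
    """Zipf view: word counts ordered from most to least frequent."""
    return [count for _, count in Counter(tokens).most_common()]


def vocabulary_growth(tokens):
    """Heaps view: number of distinct words seen after each token."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


# Toy corpus, purely illustrative.
tokens = "the cat sat on the mat and the dog sat on the log".split()

print(rank_frequency(tokens))      # [4, 2, 2, 1, 1, 1, 1, 1]
print(vocabulary_growth(tokens))   # [1, 2, 3, 4, 4, 5, 6, 6, 7, 7, 7, 7, 8]
```

On a real corpus, one would plot both sequences on log-log axes and check whether they look approximately linear.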

It is an exciting time to be a statistician. Statistical models and methods are applied in ways unimaginable only a decade ago. Airplanes fly themselves; doctors use statistical models to aid diagnoses; scientific research involves mining massive data sets. The importance of these tasks makes understanding our models an imperative. Yet, fundamental statistical properties of these models remain little understood.

I am not convinced that statistics is data science. But I am convinced that the fundamentals of probability and mathematical statistics taught today add tremendous value and cement our identity as statisticians in data science.

#### Further Reading

“Aren’t We Data Science?”

“Statistics Losing Ground to Computer Science”

“We Are Data Science”

“ASA President Robert N. Rodriguez Calls for Creating the ‘Big Tent’ for All Statisticians”

“Statistics/Statisticians Need Better Marketing”

“A Very Short History of Data Science”

“The Role of Statistics in Data Science”

Donglin Yan said: I agree that statistics is not data science. I worked as a “data analyst” shortly after I got my MS degree in statistics. For many people in industry, when they say data analysis, they mean data cleaning, data organization, and very, very basic statistics and fancy charts (like means and bubble plots). Typically, this involves obtaining data from large, complex databases using query languages like SQL. In that data analyst job, I barely used any statistical models because people don’t really care about p-values. Also, with the size of current datasets, p-values are always very small. The models and analysis methods that most people learned at school are not very useful, since the simple model and the more valid, complex models tend to give the same conclusion when the sample size is large.

Kuonen said: Please find at https://goo.gl/pojVGJ and/or http://goo.gl/dsXco1 my view on big data, data science and statistics.

Amos Odeleye said: Statistics is different from Data Science. Data Science uses Statistics as one of the many tools in its toolbox (in my opinion and experience, the major tool) in addition to other tools from Mathematics, Computer Science, and so on. There is and will always be a distinct and foundational role of Statistics and Statisticians. However, we may see other evolving roles of Statistical applications as we see with Data Science. Here is how I see the whole thing called Data Science: one day, Statistics grew out of Mathematics and in our day, Data Science grew out of Statistics+. I was trained as a “traditional” Statistician and currently practice as a Data Scientist, with the majority of my tools coming from Statistical methods. How we practice as Statisticians depends on the industry/business focus of our applications.

Thomas Billings said: > We are poorly trained in the tools of modern data management.

This is an understatement; most academic programs have a theoretical orientation. Meanwhile, in the real world, 75+% of the effort in doing an analysis is finding the relevant data, cleaning and vetting it. Additionally, many decisions are made in this stage that fundamentally impact the final analysis. This effort is often not documented, reducing the reproducibility of an analysis.

Statisticians should know SQL, at least 1 stat system besides R (R is required but statisticians should be able to use >1 system), and have some practical experience with databases.

Randy said: I enjoyed reading Thomas Speidel’s take. He is another applied statistician/statistical data scientist and involved in what is going on in the field.

Time to Embrace a New Identity?

https://www.linkedin.com/pulse/time-embrace-new-identity-thomas-speidel?trk=prof-post

Randy Bartlett, Ph.D. said: RE: Debate anyone?

RESP: This is not a referendum on whether we need the term ‘data science.’ We do not. It has been thrust upon us like ‘Six Sigma.’ If applied statisticians were to cease their extensive involvement, then we would be excluded from many statistics problems just like we were with Six Sigma.

Here is the real debate. Are there statistics problems relabeled as ‘data science?’ If yes, then we continue. If no, then why are we discussing this?

RE: Yet times have changed. Judging by the current statistics curricula, statistics is more closely tied to the mathematics of probability than to …

RESP: … than to applied statistics. During the past decades, the training has become more relevant for academic statistics and less relevant for applied statisticians. In the field, we seldom prove the CLT or derive a new MLE–for us, this is a sideshow and not the main show. Much of our training happens after grad school. A large minority of academic statisticians realize that the curriculum is about half right. See p. 146 of ‘A Practitioner’s Guide To Business Analytics’ for references (Dr. Ronald Snee said it well).

RE: … than to fundamentals of data management

RESP: Mostly, we applied statisticians want to pursue statistics problems (no matter how they are relabeled) and not data management problems. If statistics problems are in data science, then people do us a disservice by debating whether we should be involved.

I suspect that many are disconnected from parts of the ongoing conversation. I wrote a blog series that captures most of the ongoing mischaracterizations about applied statistics, see Datafloq: https://datafloq.com/read/author/randy-bartlett/279. Their purpose is to mis-position applied statisticians in the marketplace.

RE: In the age of Big Data, asymptotic properties matter.

RESP: An interesting idea.

Vincent Granville said: As a data scientist, I work on making models (actually absence of models, but rather data-driven inference systems) simpler, not more complicated, and fit for black-box processing of big data in production mode. That is, robust more so than accurate. And I also work on designing a new statistical framework that is free of mathematics, probabilities, random variables, and so on. Even to compute confidence intervals or more elaborate forecasting systems. It will be published in my upcoming book, “data science 2.0”.
