## Data Science: The Evolution or the Extinction of Statistics?

Jennifer Lewis Priestley is a professor of applied statistics and data science at Kennesaw State University, where she is the director of the Center for Statistics and Analytical Services. She oversees the PhD program in advanced analytics and data science and teaches courses in applied statistics at the undergraduate, master’s, and PhD levels.

As a discipline, I think statistics—and by association statisticians—are going through a midlife crisis. Just look around a typical university. Where is statistics housed? Mathematics? The business school? Engineering? Humanities? All of the above? Who are we? This crisis of identity has been accelerated by this new term “data science.” Is it a discipline? Is it an application of statistics? Is it an application of computer science? Is it a buzzword just having its moment?

I agree with Tommy Jones in his *Amstat News* article, “The Identity of Statistics in Data Science,” when he says the “… conversation around data science betrays an anxiety about our identity.”

As the director of one of the country’s first PhD programs in data science and a professor of statistics, I believe data science is the full-length mirror we have needed to hold up in front of our discipline for a long time so we can examine how we look from multiple angles. As any middle-aged woman will tell you, full-length mirrors contribute to anxiety.

As we turn in front of this mirror, there are angles that are not working for us. Theoretical statistics is increasingly a bastion of academia. While there will always be a need for PhD theoretical statisticians in universities, a BS in theoretical statistics—defined by derivations of theorems and execution of formulas completely by hand with no experience with real data—does not prepare undergraduates to work in a 21st-century economy. And, as most theoretical statisticians will tell you, if someone does want to pursue a PhD in statistics, they are better served pursuing an undergraduate degree in mathematics.

The other side of this angle is “business statistics” (“statistics-for-students-who-could-not-handle-the-math-in-real-statistics” in most universities), where students work with Excel spreadsheets characterized by 100 rows and three columns and they generate means and standard deviations—and in the advanced course, pivot tables. These courses also do a huge disservice to students and, similarly, do not prepare graduates to work in a 21st-century economy.

The 100% theoretical approach to statistics and the “statistics lite” approach are both bad for our discipline for the same reason—data. Neither approach prepares students to work with real-world data. If you scan typical job advertisements, any position hiring a statistician will likely include required skills such as programming, analytical software experience (e.g., SAS), database management (e.g., SQL), and writing and communication. This is because the days of being a “data diva” are over—statisticians in most companies are expected to have some ability to extract, transform, load, clean, analyze, and model data, and to “tell the story” of their results. This is particularly true in small companies. And even if they don’t perform every step in that chain for every project, developing a working knowledge of how data are collected, stored, extracted, cleaned … makes for better models … and more complete communication of results.

But as we pivot in the mirror, data science is also allowing us to show off angles of our discipline that are sorely needed—by everyone. At my university, we have an MS in applied statistics. We routinely have companies from industries as diverse as health care, retail, finance, and energy recruiting the same student. Why would companies from such different domains be interested in the same student? Because they are all trying to solve the same problem! They are all trying to translate massive amounts of data into meaningful information to solve a problem and then explain the solution to their boss or their client.

This set of requirements is almost ubiquitous—and it’s certainly multidisciplinary. I think it’s a point of evolution for our discipline and has become the definition of the 21st-century statistician—converting data into information to solve problems or discover patterns and then telling the story. More than any other academic discipline, statistics (applied statistics) is needed by every other discipline. To use a dated phrase (we are in midlife after all), “our dance card is full.”

Again, a brief example from my own university. We have an undergraduate minor (not a BS) in applied statistics. This minor requires students to take five elective courses in applied statistics. This minor is not required for any undergraduate on campus. And yet, in any given semester, we have well more than 100 undergraduate students who have declared a formal minor in applied statistics. These students come from all the colleges across campus and from dozens of departments. We have biology majors sitting next to sociology majors sitting next to finance majors all solving the same problems. It’s the most popular minor field of study in the history of the university. Again—multidisciplinary. All of a sudden, everyone wants to study applied statistics.

So what about data science? Who are these people, and how are they different from us?

The definitions of data science are converging around the intersection of mathematics, statistics, and computer science—with some area of application (e.g., finance, biology, political science). I have heard data scientists referred to equally as “the computer scientist who was the best of his peers in his statistics courses” and “the statistician who was the best of his peers in his computer science courses.”

I mentioned that I am an applied statistician running a PhD program in analytics and data science. While data scientists can do a great many things I can’t do—mainly in the areas of coding, API development, web scraping, and machine learning—they would be hard pressed to compete with a PhD student in statistics in supervised modeling techniques or variable reduction methods. Earlier this year, an article on the Simply Statistics blog, “Why Big Data Is in Trouble: They Forgot About Applied Statistics,” highlighted how the rush to the excitement of machine learning, text mining, and neural networks overlooked basic statistical concepts regarding the behavior of data—including variation, confidence, and distributions—which led to bad decisions.
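To make “variable reduction” concrete for readers outside statistics: one of its simplest classical forms is dropping one variable from each highly correlated pair before modeling. A minimal sketch, assuming hypothetical data, names, and a 0.9 threshold (this is an illustration, not a method from any article cited above):

```python
# Illustrative sketch of one basic variable reduction step: remove, from
# each pair of near-duplicate (highly correlated) columns, the later one.
import numpy as np

def drop_correlated(X, names, threshold=0.9):
    """Return the column names kept after greedily removing any column
    whose |correlation| with an already-kept column exceeds `threshold`."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return [names[j] for j in keep]

# Hypothetical data: x2 is nearly a copy of x1; x3 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])
kept = drop_correlated(X, ["x1", "x2", "x3"])
```

Real variable reduction goes much further (principal components, variable clustering, information criteria), but the goal is the same: a smaller, less redundant set of predictors before the modeling begins.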

So, where does this leave us statisticians? I believe data science is good for us. In fact, it’s great for us. People need us in new and exciting ways—to help them translate the data into information to tell a story. The “science of data” is becoming a nascent discipline that is lifting all boats. That nascent scientific discipline needs us.

Vincent Granville said: Read my article on a combinatorial, fast, efficient algorithm for feature selection using predictive power to jointly select variables: it is the data science approach to variable reduction and generation. Likewise, supervised modeling – which also belongs to machine learning – is not foreign to data scientists. Read about my automated indexation/tagging algorithm, used for taxonomy creation or cataloguing: it performs clustering on n data points in O(n) and can cluster billions of web pages in very little time. It is also used to turn unstructured data into structured data.

Andrew Ekstrom said: One of the biggest differences I know of between “data scientists” and statisticians is how they access their data. A statistician will use a program like R or SAS to load the data sets they want to use onto their personal computer and manipulate them there. A data scientist will use a server to do the same thing, then, maybe, pull the results onto their computer and run their analyses. While this sounds like a small difference, the difference in the size of the data sets each can handle is massive!

What a lot of statisticians don’t understand is that their stats software is terrible at joining data sets and manipulating data. Conceptually, the first step in any join is a Cartesian product (a cross join): every single tuple (row) from one table (data set) is matched with every tuple in the other table(s). Suppose you have a pair of data tables of 1,000,000 tuples each. When you join these tables, the first step is, conceptually, a table with (1,000,000)^2 tuples. The next step is to reduce this table to only the tuples of interest.

Because that intermediate table is so large, your desktop computer might have issues dealing with it. It might run slowly while performing the operations you requested. Meanwhile, a database program (like Oracle or SQL Server) will use parallel processing of your data. If the data scientist specifies that the 1,000,000-tuple table be broken into 10 tables of 100,000 tuples each, and joins those, they end up with 10 tables of (100,000)^2 tuples. 10*(100,000)^2 is 10 times less than (1,000,000)^2, so your desktop needs 10 times less RAM and processor power. (P.S. This is how Hadoop works… but better!) And because the data are broken into 10 tables, they can be processed on 10 different cores of your processor. So, by breaking the original data set into 10 equal-sized pieces and using good database software that allows parallel processing, you use about 10% of the RAM that SAS or R would, which speeds up processing by roughly 10 times, and running the 10 pieces on 10 different cores speeds it up by about 10 times more.

Now imagine how great it would be if you could get that 1,000,000-tuple join done 100 times faster! What would you do with your extra time? Now imagine breaking your table into 1,000 tables of 1,000 tuples each. By the same arithmetic, you’d get things done something like 10,000 times faster!
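The partitioning idea in the comment above can be sketched in a few lines. Below is a toy hash-partitioned join in Python (the tables, key names, and partition count are invented for illustration): rows with equal keys always hash to the same bucket, so each bucket pair can be joined independently, and, in principle, on its own core, which is essentially what parallel database engines and Hadoop-style systems do.

```python
# Toy hash-partitioned join: split both tables by hashing the join key,
# then join each bucket pair independently. No Cartesian product is ever
# materialized; only matching tuples are produced.
from collections import defaultdict

def hash_partition(rows, key, n_parts):
    """Split rows into n_parts buckets by hashing the join key."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[key]) % n_parts].append(row)
    return parts

def partitioned_join(left, right, key, n_parts=10):
    """Inner-join two lists of dict rows on `key`, one bucket at a time."""
    left_parts = hash_partition(left, key, n_parts)
    right_parts = hash_partition(right, key, n_parts)
    for lp, rp in zip(left_parts, right_parts):
        index = defaultdict(list)          # hash index on one side of the bucket
        for row in rp:
            index[row[key]].append(row)
        for lrow in lp:
            for rrow in index[lrow[key]]:  # probe: emit only matching tuples
                yield {**lrow, **rrow}

# Tiny illustration with hypothetical tables:
customers = [{"id": i, "name": f"c{i}"} for i in range(4)]
orders    = [{"id": i % 4, "amount": 10 * i} for i in range(8)]
joined = list(partitioned_join(customers, orders, "id", n_parts=2))
```

Each bucket's work touches only a fraction of the data, which is the memory saving the comment describes; running the buckets on separate cores gives the additional parallel speedup.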

When it comes to most stats programs, the algorithms the program uses have a lot to do with how fast the data will be analyzed. On a computer with an Intel i7 processor, the CPU has at least 4 cores with 4 hyperthreads, the equivalent of 8 cores, usually called “logical cores.” The software will use only one of those cores. So it’s like having a V8 engine and only using 1 piston at a time. (That’s just stupid.)

As a user of R, I ran a benchmark: I took a 10,000×10,000 matrix and raised it to the 100th power. In base R, it took about 2 hours on my laptop, because R uses only one of the 8 logical cores. When I upgraded the BLAS and LAPACK libraries to ones that allow parallel processing, the same computation was done in 6 minutes.
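The same effect is easy to see in Python: NumPy hands matrix products to whatever BLAS is installed, so linking a multithreaded BLAS (OpenBLAS or MKL) parallelizes the code below with no changes to it. Here is a hedged sketch of such a matrix-power benchmark, using repeated squaring, which also cuts the 100th power from 99 matrix products to roughly log2(100) of them:

```python
# Sketch: matrix power by repeated squaring. Each @ below is a single
# BLAS gemm call, so a multithreaded BLAS parallelizes it for free.
import numpy as np

def matrix_power(a, n):
    """Raise square matrix a to nonnegative integer power n by repeated
    squaring: O(log n) matrix products instead of n - 1."""
    result = np.eye(a.shape[0])
    base = a.astype(float)
    while n > 0:
        if n & 1:                  # low bit set: fold the current square in
            result = result @ base
        base = base @ base         # square for the next bit of n
        n >>= 1
    return result

# Tiny correctness check; a real benchmark would use a 10,000 x 10,000
# random matrix, as in the comment above.
a = np.array([[1, 1], [0, 1]])
p = matrix_power(a, 100)
```

The two speedups are independent: the better algorithm reduces the number of products, and the multithreaded BLAS makes each product run on all cores.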

A data scientist knows all these “hacks” for speeding up their data analysis. The typical statistician doesn’t. With the size of data sets growing faster than the power offered by a desktop computer, the typical statistician will soon become less and less relevant and could go extinct. They’ll seem like a cranky old person who says, “Back in my day we…,” unless that statement is followed by, “Thankfully, we do things a lot better now.” At minimum, the typical statistician will have to bend to the will of computer science, because they need to. It’s not a matter of will or want; it’s a necessity!

Slade said: Specialists in any one field of science are replaceable by expert-system technology and coming advances in cybernetics, neurocomputational science, and AI producing statistical analysis 24/7, nonstop number crunching, pushing the future of all education toward neurological studies related to a universalized science of mathematics and physics. AI systems will be able to format, unify, and universalize the sciences and their mathematics and physics into one epistemology machine, streamlining all disciplines of science. Of course it is blasphemy to say that everything you ever learned will become obsolete because machines can think at a higher-than-human level, but feelings aside, this is what intelligists are theorizing as we speak.

Steve Roemerman said: Wow.

This is spot on. I love the points, and the point of view.

Applied mathematics has been in crisis for a long time. Ronald Howard at Stanford personifies these “where do we fit” questions. At INFORMS, you see a lot of folks presenting papers with hardly any reference to computers, which is odd, since OR is done on computers. But to publish papers, academics love closed-form equations.

Same thing happens at AI conferences to a large degree.

Are they math department wannabes?

Perhaps, but some of this is the pace of change. Academia is scrambling to stay relevant in a world where applied math and statistics are being done outside the ivy-covered walls.

This is not new of course. The intelligence world has been quietly doing this for decades. But now thanks to the power of computing, the benefits can be deployed for profit, for all to see.

The great endowments of industrial-revolution wealth (Carnegie, Kettering, …) helped create American academia, not the other way around, and it looks like this pattern is about to repeat.

My advice to universities: just try to keep up, but don’t pretend you are leading unless someone else tells you it’s true.