## We Are Data Science

“I keep saying that the sexy job in the next 10 years will be statisticians.”

*Hal Varian
Chief Economist at Google*

“I believe statistics has many cultures.”

*Emanuel Parzen
Distinguished Professor, Texas A&M University*

I wish to advance the point that statisticians are already involved in, and must expand their involvement in, business analytics, Big Data, data mining, data science, machine learning, and predictive modeling (analytics). I also wish to clarify that any topic involving data analysis necessitates statistical thinking, statistical techniques, and statistical assumptions.

The current bizarre and restrictive pronouncements about what we do aim to limit us to a rigid set of ‘small data’ tools. This distasteful propaganda, in the form of straw-man characterizations of the established fields, is intended to differentiate some new vision of how to work with data without knowing statistics.

Hearing similarly discouraging statements from fellow ASA members has its own special flavor. I recently attended JSM 2013 and read articles in *Amstat News* (“Aren’t We Data Science?” and “The ASA and Big Data”) and in *Significance* (“Big Data and Big Business: Should Statisticians Join In?”). As a result, I have a new list of things I am not doing and cannot do: primarily data science, Big Data, and data mining.

These comments from ASA members limit our credibility in the eyes of employers and recruiters—providing aid and comfort to those who covet our role in the corporation. This comes at a time when ASA’s left hand (Ron Wasserstein, et al.) is offering the long-awaited PStat® and Conference on Statistical Practice, and when statistics departments are offering an MS in analytics. Of course, many ASA members do not think in self-limiting ways or confine themselves to classical techniques. As a group, we statisticians are not homogeneous. The unstated issue is, however, whether we can remain one profession.

#### The Business Analytics Role

Corporations house two data functions. Roughly put, the IT silo manages data and business quants analyze the data. I will use the less-tainted term, business quants, to denote those econometricians, industrial engineers, operations researchers, statisticians, etc., who apply the tools of complete and incomplete information. Our jobs are to help run a business. This involves making and supporting decisions, and this requires mastering the business and extracting information by any means possible.

We must augment our publication-centric education to master all three toolboxes: mathematics, statistics, and algorithms (logic, heuristics, optimization). Otherwise, we will be crowded out by more strident professionals. The mathematics we use consists of numerical tools for making deductions from complete information, as in E = MC².

Statistics consists of making inferences based upon incomplete information arising from incomplete or poorly measured data. The pride of our most powerful and indispensable statistical assumptions is the error term, as in E = MC² + ε.

No statistics means no error term, no inference, and no corresponding statistical assumptions for incomplete information.
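The contrast between deduction and inference can be made concrete with a minimal sketch. It assumes a hypothetical straight-line relationship y = a + b·x + ε with normally distributed errors; the function name `fit_line`, the simulated data, and the parameter values are illustrative choices, not anything from the article. The residuals stand in for the unobserved error term, and they are what make a standard error and an interval for the slope possible.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, se_b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    # The residuals estimate the error term; their spread drives the inference.
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    sigma2 = rss / (n - 2)
    se_b = math.sqrt(sigma2 / sxx)
    return a, b, se_b


# Hypothetical data: true intercept 1, true slope 2, error standard deviation 3.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 3.0) for x in xs]

a, b, se_b = fit_line(xs, ys)
# Approximate 95% interval for the slope, using the normal quantile 1.96.
low, high = b - 1.96 * se_b, b + 1.96 * se_b
print(f"slope = {b:.3f}, approx. 95% interval = ({low:.3f}, {high:.3f})")
```

With noise-free data the residuals vanish and the standard error collapses to zero, which is the "complete information" case; the interval only carries meaning once an error term is assumed.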

#### Data Science and Big Data Require Statistics

We reside in a global community possessing a low statistical literacy. As Deming said, “The nonstatistician cannot always recognize a statistical problem when he sees one.” We should expect depictions of data science and Big Data void of an understanding of statistics.

The business press and Big Data vendors are portraying Big Data as complete information. Instead, it is often excessive incomplete information enabling a paradigm shift in approach and methodology for certain applications, but not in statistical thinking or statistical assumptions. Non-quants are unfamiliar with our three old friends from the statistics toolbox: missing values, missing Xs (the wrong data), and measurement error.
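One of these three, measurement error, can be illustrated with a short simulation. This is a sketch under stated assumptions: a hypothetical predictor with variance 1, a true slope of 2, and classical (independent, additive) measurement noise with variance 1. Under those assumptions the fitted slope is expected to shrink by the factor var(x) / (var(x) + var(noise)) = 0.5, and collecting more rows does not remove that bias.

```python
import random


def slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


rng = random.Random(1)
n = 20_000  # a large sample: more data does not cure the bias

# Hypothetical truth: y depends on true_x with slope 2.
true_x = [rng.gauss(0, 1) for _ in range(n)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in true_x]

# What is actually recorded: true_x plus independent measurement noise.
noisy_x = [x + rng.gauss(0, 1) for x in true_x]

slope_true = slope(true_x, ys)    # close to 2
slope_noisy = slope(noisy_x, ys)  # attenuated toward 2 * 1 / (1 + 1) = 1
print(f"slope with true x:  {slope_true:.2f}")
print(f"slope with noisy x: {slope_noisy:.2f}")
```

The point of the sketch is only that volume and completeness are different properties; a correction would require knowing or estimating the noise variance, which is itself a statistical assumption.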

Also, we do not want unrefined Big Data! We want information, and this often requires us to reduce Big Data. eBay’s approach to Big Data is typical: keep buying more hardware storage. This allows for searching, reporting, counting/summarizing, and, at a slightly higher conceptual level, segmentation. However, this light analysis is merely descriptive in character; it will take the quants to deliver the promises of Big Data.

Next, we need statistical diagnostics to measure the accuracy and reliability of results.
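As a rough illustration of reducing data and attaching a diagnostic to the result, the sketch below draws a simple random sample from a large simulated data set and reports the sample mean with an approximate 95% interval. Everything here is hypothetical: the "order values" are simulated from an exponential distribution with mean 50, the helper `mean_with_interval` is an illustrative name, and the interval relies on the normal approximation and ignores the finite-population correction (the sample is 1% of the data).

```python
import math
import random
import statistics


def mean_with_interval(values, z=1.96):
    """Sample mean with an approximate interval based on the standard error."""
    n = len(values)
    mean = statistics.fmean(values)
    std_err = statistics.stdev(values) / math.sqrt(n)
    return mean, mean - z * std_err, mean + z * std_err


rng = random.Random(2)

# Hypothetical "big" data set: 200,000 simulated order values with mean 50.
population = [rng.expovariate(1 / 50) for _ in range(200_000)]

# Reduce it: a simple random sample of 2,000 rows, drawn without replacement.
sample = rng.sample(population, 2_000)

full_mean = statistics.fmean(population)
est, low, high = mean_with_interval(sample)
print(f"full-data mean: {full_mean:.2f}")
print(f"sample mean:    {est:.2f}  (approx. 95% interval {low:.2f} to {high:.2f})")
```

The interval width is the diagnostic: it states how far the sample estimate could plausibly sit from the full-data value, which a bare count or summary does not.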

#### Conclusion

I second Marie Davidian’s call to arms and the recommendations in her aforementioned *Amstat News* articles. ASA members, like everyone else, must embrace change. In private industry, government, and all other organizational settings in which we work, statisticians and other quants must be data science generalists and practice every type of data analysis, whether in business analytics, Big Data, data mining, data science, machine learning, or predictive modeling (analytics). To differentiate our value proposition, we must be involved.

Furthermore, an understanding of statistics is necessary to properly lead and organize resources, which can address our concerns about involving the most appropriate professionals. I discuss in greater detail the needed changes in *A Practitioner’s Guide to Business Analytics*.

*Randy Bartlett, PhD, PStat®
Author of* A Practitioner’s Guide to Business Analytics

Vincent Granville said: When you write that no statistics means no inference, I disagree. Google “Analyticbridge first theorem”; it illustrates how to build confidence intervals without models. Also search for recent articles that I published, such as statistical modeling without models. As a data scientist, neither a business quant nor a statistician, I’m familiar with statistics and have published in journals such as Royal Statistical Society Series B. AMSTAT is too focused on clinical trials and, to a lesser extent, on government stats and biostats. This is the cause of the issue here in the US with the term statistician. But it is not an issue in other countries.

Thomas Speidel said: @Vincent: It’s interesting that you brought up the clinical trials/biostats focus of Amstat. I think the reasons for this are obvious. Twenty years ago, businesses and private organizations were doing very little statistics. There was less of a need, collecting data was expensive and impractical, statistics and statisticians were seen exclusively as academic, and the internet barely existed. Statisticians and their peers naturally gravitated toward the fields where their added value would be understood: health research/pharma, econometrics, actuarial science, government, and a few other niches. Fast forward to today. Two very important things happened:

1) The internet, e-commerce, social websites, sensor monitoring, and cheap storage have spurred a sudden interest in collecting and exploiting data.

2) Leo Breiman’s work in the ’90s created a cultural shift.

I think what Randy is alluding to is the sudden, unfounded lack of interest in statistics in favour of algorithmic/ML approaches to solve everything. Some say our profession is being hijacked. The problem is that we still do not understand very well how these methods perform compared to classical methods (classical != dated). How does an algorithmic CI compare to a probabilistic one? Under what circumstances does it become narrower than a probabilistic one? What about the estimate itself? How is it affected by missing data, extremes, etc.? Yet, from a more classical perspective, we understand a lot more about sampling distributions, the effect of missing values, extremes, etc. And what to say about accumulating and storing huge amounts of data to justify the use of algorithmic methods? Has anyone wondered if the increased costs justify an unknown change in accuracy? Does it improve, say, prediction error compared to sampling?

Here’s where I think Amstat’s focus on biostat has helped: the strong focus of this area on evidence-based research, replicability, publication, and peer review (problematic as they are) ensures that we make an effort to really see what works where. This does not change the fact that Amstat and all of us need to embrace data science instead of downplaying it, or worse, snubbing it.

Randy Bartlett said: Vincent,

I appreciate your drive to push the profession forward and I want to as well. As you can see in my article, I employ the assumption-based definition of statistics. This provides a logical split as opposed to some other arbitrary definitions. Statistical assumptions complement statistical thinking and fluster the ubiquitous deductive-only thinking.

I see no reason for people with a statistics degree to stop calling themselves statisticians or business quants; or for you to proclaim the ‘death of the statistician’ (in your blog). ‘Data scientist’ does not have a support organization or certification, so anyone can claim to be one; and the definition is still up for grabs. Why abandon the mother ship when there is no escape pod? I think the situation calls for teamwork rather than infighting. We need to get organized.

Randy Bartlett said: Thomas, I agree. We need to embrace change and we need to be involved. We have a strong value proposition and we have to explain it. Those with light training in statistics struggle to recognize statistics applications. Others have bizarre ideas about the breadth of statistical tools and what statisticians do (see *The Black Swan*). I would not describe anything that my colleagues do as ‘classical’ in the sense that the term is employed by non-quants. There is a popular straw-man argument claiming that classical statistics (whatever that means) does not apply, so statistics does not apply. Statisticians and other quants employ a much larger and more modern toolbox than that of the 1930s! I am not sure of the origin of this ‘classical statistics straw man.’ Could it be from some off-topic professors teaching ‘classical’ statistics to students in their degree programs and then teaching modern statistics as a separate topic, labeled something like ‘data mining’?

The unremitting attacks mean to me that they want more of what we have.

Daniele Medri said: Interesting ideas that confirm some of my recent remarks.

Randy Bartlett said: Daniele, I think you are spot on. “Statisticians are data scientists by definition but data scientists do not necessarily have academic training. Statistical knowledge is the backbone of any analyst.” Statisticians in the field analyze whatever data contains the information, using whatever tools make sense. Some, who covet our role in the corporation, are trying to redefine statistics based upon the current publication activities of statistics professors and ignoring their past activities. Problems requiring statistical assumptions or statistical thinking fall into statistics.