## Statistics as a Science, Not an Art: The Way to Survive in Data Science

Mark van der Laan is the Jiann-Ping Hsu/Karl E. Peace Professor in Biostatistics and Statistics at UC Berkeley. He also is a recipient of many awards, including the 2004 Spiegelman Award and the 2005 Committee of Presidents of Statistical Societies (COPSS) Award.

My father told me the most important thing about solving a problem is to formulate it accurately, and one would think most of us statisticians would agree. Suppose we want to build a spaceship that can fly to Mars and return safely to Earth. It would be tragic folly to make simplifying assumptions that are known to be false, since that would mean the death of the astronauts and certainly the failure of their mission.

However, it is not true that one must start with simplifying assumptions, as the field of statistics has the theoretical foundation that allows the precise formulation of the statistical estimation problem. The foundation of statistics laid down by its founders, which is incorporating knowledge about the data-generating experiment through the statistical model and the formal definition of the question of interest through the definition of the estimand, could not have been to arbitrarily select a “convenient” statistical model. However, that is precisely what most statisticians blithely do, proudly referring to the quote, “All models are wrong, but some are useful.” Due to this, models that are so unrealistic that they are indexed by a finite dimensional parameter are still the status quo, even though everybody agrees they are known to be false.

The consequences of giving up on formulating the actual statistical model are dramatic for our field, making statistics an art instead of a science. Young and upcoming statisticians have approached me, asking: How is it possible that if one presents the same 1) data, 2) knowledge of the experiment, and 3) scientific question of interest to two different “top” statisticians, they will most likely come up with quite different answers? How is it possible that our field is loaded with a diverse set of methods that contradict each other, without guidance for how to choose among them? My response: These statisticians do not respect the definition of a statistical model. Of course, if one gives up on the scientific standard that a statistical model has to be true, then any statistician can do whatever they want. One statistician selects a logistic linear regression with main terms X and another statistician selects main terms X1, and of course, the coefficients in front of the treatment of interest for these two model choices represent different estimands and also lack any easy interpretation.
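The claim that two "reasonable" model choices target different estimands can be illustrated with a small calculation. The sketch below is not from the article: it assumes a hypothetical data-generating distribution with a binary treatment `a`, a binary covariate `w` independent of treatment, and made-up coefficients `BETA0`, `BETA_A`, `BETA_W`. A logistic regression with main terms for both `a` and `w` is correctly specified here, so its treatment coefficient converges to the conditional log odds ratio. A logistic regression with `a` alone is saturated for a binary treatment, so its coefficient converges to the marginal log odds ratio. The two numbers differ even though there is no confounding at all.

```python
import math


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    """Log odds of a probability p."""
    return math.log(p / (1.0 - p))


# Hypothetical data-generating distribution (assumed for illustration only).
BETA0, BETA_A, BETA_W = -1.0, 1.0, 2.0
P_W = 0.5  # P(W = 1); W is independent of treatment A (randomized treatment)


def p_outcome(a, w):
    """True P(Y = 1 | A = a, W = w)."""
    return expit(BETA0 + BETA_A * a + BETA_W * w)


def conditional_log_odds_ratio(w):
    """Large-sample treatment coefficient of the model with main terms A and W."""
    return logit(p_outcome(1, w)) - logit(p_outcome(0, w))


def marginal_log_odds_ratio():
    """Large-sample treatment coefficient of the model with main term A only."""
    def p_y_given_a(a):
        # Average over W, which has the same distribution in both arms.
        return P_W * p_outcome(a, 1) + (1.0 - P_W) * p_outcome(a, 0)
    return logit(p_y_given_a(1)) - logit(p_y_given_a(0))


if __name__ == "__main__":
    print("conditional (A and W in the model):", conditional_log_odds_ratio(0))
    print("marginal    (A only in the model): ", marginal_log_odds_ratio())
```

With these assumed values the conditional log odds ratio is 1 and the marginal one is roughly 0.8, so the two statisticians would report different numbers without either having made a computational mistake. Which of the two (if either) answers the scientific question is exactly the estimand discussion the article asks for.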

A closely related and large consequence of giving up on the scientific standard of estimation based on a true statistical model is that we create a disconnect between our scientific collaborators and us, the statisticians. If we do not care about the model being true, why would we spend a lot of time talking to the scientists who generated the data trying to determine true knowledge, and why would we bother trying to determine the estimand that best answers the actual scientific question of interest?

Instead, one typically asks a few questions about the data such as: Is the outcome a survival time? Is it case-control data? And then one quickly moves on to returning output from a Cox-Ph model or a logistic regression model with some “reasonable” set of covariates. Apparently, in this type of practice, it is not necessary to listen closely to our collaborators and understand as much as possible the underlying data-generating distribution.

As a consequence, we will not be an intrinsic part of the scientific team, and the scientists will naturally have their doubts about the methods used, thus only accepting the answers we generate for them if they make sense to them: “Fortunately,” we can try out many models until we get an answer that achieves consensus. Our collaborators will view us as technicians they can steer to get their scientific results published. Statistics is now an art, not a science: The results are unreliable. “Confidence” intervals are based on completely wrong assumptions and will have asymptotic coverage zero for the true scientific question of interest, and, because of bias, “*p*-values” will always be “significant” as long as the sample size is large enough.
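The statement about coverage going to zero can be checked in a toy simulation. The following is a minimal sketch under assumptions that are not in the article: a hypothetical experiment with a binary confounder `w` that drives both treatment and outcome, a true treatment effect of exactly zero, and an analyst who reports the unadjusted difference in means with a standard 95% Wald interval. Because the bias of that estimator does not shrink while its standard error does, the interval should cover the true value less and less often as `n` grows, and a test of "no effect" should reject almost always.

```python
import math
import random
import statistics

TRUE_EFFECT = 0.0  # in this assumed experiment, treatment does nothing


def naive_interval(n, rng):
    """95% Wald interval for the unadjusted difference in mean outcomes."""
    treated, control = [], []
    for _ in range(n):
        w = rng.random() < 0.5                    # binary confounder
        a = rng.random() < (0.8 if w else 0.2)    # treatment depends on w
        y = (1.0 if w else 0.0) + rng.gauss(0.0, 1.0)  # outcome ignores a
        (treated if a else control).append(y)
    diff = statistics.fmean(treated) - statistics.fmean(control)
    se = math.sqrt(
        statistics.variance(treated) / len(treated)
        + statistics.variance(control) / len(control)
    )
    return diff - 1.96 * se, diff + 1.96 * se


def coverage(n, reps=200, seed=0):
    """Fraction of simulated intervals that contain the true effect."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        lower, upper = naive_interval(n, rng)
        hits += lower <= TRUE_EFFECT <= upper
    return hits / reps


if __name__ == "__main__":
    for n in (50, 500, 2000):
        print(n, coverage(n))
```

In this setup the unadjusted estimator converges to 0.6 rather than 0, so the nominal "95%" interval excludes the truth in nearly every large sample. The same interval excluding zero is what a "significant" *p*-value would report.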

Some of you might say, “Oh, but we often do a sensitivity analysis.” This is like building a spaceship that can *only* do the job under unrealistic assumptions, and then determining how it would blow up under slightly less unrealistic assumptions. How useful is that?

Is this mess we have created really necessary? No! As a start, we need to take the field of statistics (i.e., the science of learning from data) seriously.

It is complete nonsense to state that all models are wrong, so let’s stop using that quote. For example, a statistical model that makes no assumptions is always true. But, often we can do much better than that: We might know the data are the result of n independent identical experiments; the treatment decision of a medical doctor is only based on a small subset of the measured variables; the conditional probability of death is always smaller than 0.04; the experiment involved two-stage sampling with known conditional sampling probabilities; and so on.

But to obtain this knowledge, we need to take the data, our identity as a statistician, and our scientific collaborators seriously. We need to learn as much as possible about how the data were generated. Once we have posed a realistic statistical model, we need to extract from our collaborators what estimand best represents the answer to their scientific question of interest. This is a lot of work. It is difficult. It requires a reasonable understanding of statistical theory. It is a worthy academic enterprise! We will open up a new world to our collaborators by actually being able to generate questions our collaborators had no idea they were even allowed to pose. Then, they will actually get excited instead of bored to death by another logistic regression model.

An estimand defined in an honest statistical model cannot be sensibly estimated based on parametric models, let alone based solely on idiosyncratic model selection. Estimating it will thus typically require the state of the art in machine learning/data-adaptive estimation, together with targeting the estimator toward the estimand so the resulting estimator is minimally biased and statistical inference is possible.

This was our motivation to define the field of targeted learning as the statistical subfield concerned with developing estimators and statistical inference under realistic assumptions for specified estimands of the data probability distribution. In response to having to solve these hard estimation problems the best we can, we developed a general statistical approach—targeted maximum likelihood learning, or, more generally, targeted minimum loss-based learning—which integrates the state of the art in machine learning/data-adaptive estimation, all the incredible advances in causal inference, censored data, efficiency and empirical process theory while still providing formal statistical inference. This field is open for all to contribute to, and the truth is that anybody who honestly formulates the estimation problem and cares about learning the answer to the scientific question of interest will end up having to learn about these approaches and can make important contributions to our field.

The amount of data generated in our world for the sake of moving science forward has increased exponentially so that we now live in the world of Big Data. A new field has arisen that is called data science. Historically, data analysis was the job of a statistician, but, due to the lack of rigor that has developed in our field, I fear our representation in data science is becoming marginalized: Companies hire computer scientists and Big Data institutes are run by computer scientists, or those scientists who generate the data. As we have abandoned theory, why not go to the people who make the data or can write exciting algorithms to explore it? How did this happen? We are the science of learning from data!

There is also a very serious concern that these leaders and funding agencies do not realize that algorithms in data science need to have been grounded within a formal statistical foundation so they actually answer the questions we want to answer with a specified level of uncertainty. That is, the statistical formulation and theory should define the algorithm. Despite some prejudices to the contrary, Big Data does not obviate the need for statistical theory. Data itself is useless and can only be interpreted in the context of the actual data-generating experiment.

The solution to this threat to our survival as a field is precisely that we should not just state we are the science of learning from data, but live it. Let us reinvigorate the science we are supposed to be and get away from the art. We have to be part of a scientific team solving a real-world problem. We have to formulate and solve the actual statistical estimation problem, educating our collaborators in the process about the unique and fundamental role of statistics. We have to start respecting, celebrating, and teaching important theoretical statistical contributions that precisely define the identity of our field. Stop working on toy problems, stop talking down theory, stop being attached to outdated statistical methods, stop worrying about the politics of our journals and our field. Be a true and proud statistician who is making an impact on the real world of Big Data. The world of data science needs us—let’s rise to the challenge.

Randy Bartlett said: According to Burtch Works, 31% of data scientists have degrees in mathematics or statistics, while only 19% have a degree in computer science.

Another point is that the ‘ASA big tent’ never has much room for applied statisticians and other quants. I have not seen the dialogue explaining why thousands of applied statisticians are not members of ASA when they so desperately need a professional association backing them. Where is the customer survey?

Also, why do some academic statisticians think that they can speak for applied statisticians or all statisticians? Is this based upon their copious communications with statisticians in the field?

Some of us started a new LinkedIn group: About Data Analysis, to help turn the tide.

Peter Chu said: Since the article’s main focus is implicitly on predictive models, perhaps one can explain the conundrum of different models on the same data in simpler terms. Modeling is essentially an optimization exercise dating back to the 18th century: find a functional form or forms, using a selected subset of the input variables, that optimizes some objective function (e.g., AUC, KS, Gini, SSE) among all possible functional forms and variable subsets. A very tall order indeed, especially when dealing with some of the typical modern data challenges, as Dr. van der Laan alluded to in the article: “…In response to having to solve these hard estimation problems the best we can, we developed a general statistical approach—targeted maximum likelihood learning, or, more generally, targeted minimum loss-based learning—which integrates the state of the art in machine learning/data-adaptive estimation,….” Except it is not just “we” as statisticians: mathematicians formulated the problem centuries ago and started tackling it centuries ago. Moreover, elementary theory of computation tells us (at least those of us who know some basic theory of computation) that this problem is NP-hard even if you fix the algorithm of choice and focus only on the variable selection problem, especially when dealing with large data sets with a large number of highly correlated inputs. The solution surface of NP-hard problems is typically rife with many local optima, and in most cases it is computationally infeasible to solve the problem, i.e., to find the global optimal solution. Hence, each modeler is using some (heuristic) modeling techniques, at best, to find adequate local optimal solutions in the vast solution space of this complex objective function. Some are better than others, depending on the modeler’s knowledge, experience, and skill.
It’s hardly a surprise that two modelers may come up with two different models while modeling a complex data set. However, I do think George Box’s truism needs a postfix: “depending on the data.” That is, no single algorithm works best for all data; the best model or ensemble of models is data dependent.

Finally, to address Dr. van der Laan’s assertion “—which integrates the state of the art in machine learning/data-adaptive estimation….”: Perhaps there is the desire for “integration” among some statisticians, but the sentiment is not reciprocated. Google “machine learning department”; here are just a few selected examples: Carnegie Mellon, Columbia, Duke, University of Toronto, etc. They are housed either in the Department of Computer Science or in a Machine Learning department. One wonders why. Perhaps folks in ML don’t think statistics is an art, nor find two different models on the same data set unacceptable (just pick the better one!), but some of them would definitely find Dr. van der Laan’s statement “….the statistical formulation and theory should define the algorithm….” utterly bizarre.