In Response to ‘Statistics as a Science, Not an Art: The Way to Survive in Data Science’ by Mark van der Laan
Dear Amstat News,
A February 2015 Amstat News article by Mark van der Laan expresses dismay at “giving up on the scientific standard of estimation based on a true statistical model” and urges us to define estimands honestly. According to van der Laan, “it is complete nonsense to state that all models are wrong” and “estimators of an estimand defined in an honest statistical model cannot be sensibly estimated based on parametric models.”
In my experience, there are indeed problems for which all models are wrong and for which parametric models are useful. For example, take a large data set of soil respiration (carbon flux from the soil into the atmosphere) and possible predictors, collected at the Harvard Forest. Investigators want to understand the drivers of soil respiration. Flux varies from place to place and from time to time. It depends on exactly where we draw the boundaries, in both space and time, of what we call the study region, which in fact has fuzzy boundaries. There was no random sampling within that fuzzy region. And even if we could precisely define a study region, we really want to understand the drivers of respiration in the wider world, not just in the study region.
In this problem, there is no true model and no true estimand, no matter how flexible and nonparametric the model. Yet there is a clear, nearly linear relationship between log(flux) and soil temperature. It is useful to point that out; to point out the ways in which a simple linear model can be improved by adding effects for type of forest, time of year, and other predictors; to point out where the nonlinearities are; to point out how residuals deviate from the ideal; and so forth, for all the things a good statistician would do with a regression problem.
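The nearly linear relationship between log(flux) and temperature can be sketched in a few lines. This is simulated stand-in data, not the Harvard Forest data, and the exponential parameters are hypothetical; the point is only that if flux grows roughly exponentially with soil temperature, ordinary least squares on log(flux) recovers a nearly linear fit.

```python
# Hedged sketch with SIMULATED data (not the Harvard Forest data set).
# Hypothetical model: flux = a * exp(b * temp) * noise, so that
# log(flux) = log(a) + b * temp + error, i.e., nearly linear in temperature.
import math
import random

random.seed(0)
a, b = 0.5, 0.08                          # hypothetical parameters
temps = [random.uniform(0, 25) for _ in range(200)]
logflux = [math.log(a) + b * t + random.gauss(0, 0.1) for t in temps]

# Ordinary least squares for the line log(flux) ~ temperature.
n = len(temps)
tbar = sum(temps) / n
ybar = sum(logflux) / n
slope = sum((t - tbar) * (y - ybar) for t, y in zip(temps, logflux)) / \
        sum((t - tbar) ** 2 for t in temps)
intercept = ybar - slope * tbar
print(round(slope, 3), round(intercept, 3))   # close to b and log(a)
```

A real analysis would follow with the diagnostics the letter describes: residual plots, checks for nonlinearity, and added effects for forest type and season.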
The soil respiration data set, and many others in my experience, require us to keep many models in mind, knowing that none are true, but understanding and quantifying their strengths and weaknesses. A call to find a true model and a true estimand does not accord with my understanding of this ecological problem and the inference it requires.
Sincerely,
Michael Lavine
Dear Amstat News,
Mark van der Laan worries that a “lack of rigor that has developed in our field” may result in the marginalization of the statistics profession in relation to the emerging field of data science. I emphatically agree with the following:

(1) Statisticians should seek to understand the scientific question, formulate the statistical objective accordingly, and follow an analysis strategy that is fit for purpose. We too often fail by reaching instead for a statistical model motivated by familiarity, mathematical convenience, or the availability of software.

(2) Estimation is often more scientifically meaningful than null hypothesis testing. Any difference can be made "statistically significant" with a large enough sample size, without regard to clinical or practical significance. As Tukey (1991) said, two groups will always be different at some decimal place.
(3) Statisticians should rely sparingly on fully parametric models and "idiosyncratic model selection," and should consider integrating the perspectives and algorithms of other fields, such as machine learning.
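Point (2) can be illustrated with a short calculation. The numbers below are hypothetical, not from the letters: a fixed mean difference of 0.01 standard deviations, negligible in practice, becomes "statistically significant" once the per-group sample size is large enough.

```python
# Illustration with HYPOTHETICAL numbers: a two-sample z-test with known
# standard deviation. The practical effect (0.01 sd) never changes; only
# the sample size does, yet the p-value can be driven as low as we like.
import math

def two_sample_p(diff, sd, n):
    """Two-sided p-value for a mean difference `diff` between two groups
    of size n each, with common known standard deviation `sd`."""
    se = sd * math.sqrt(2.0 / n)              # standard error of the difference
    z = diff / se
    return math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail probability

diff, sd = 0.01, 1.0                          # trivially small difference
for n in (100, 10_000, 1_000_000):
    print(n, two_sample_p(diff, sd, n))       # p-value shrinks as n grows
```

At n = 100 per group the p-value is far from significant; at n = 1,000,000 it is astronomically small, with no change whatsoever in practical importance.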
Box, George E. P. 1999. Statistics as a catalyst to learning by scientific method, Part II—A discussion. Journal of Quality Technology 31:16–29.
Freedman, David A. 2010. Statistical models and causal inference. New York, NY: Cambridge University Press.
Hughes, Peter C. 2004. Spacecraft attitude dynamics. Mineola, NY: Dover.
Lorenz, Edward N. 1963. Deterministic nonperiodic flow. Journal of the Atmospheric Sciences 20:130–141.
MacKay, R. J., and R. W. Oldford. 2000. Scientific method, statistical method, and the speed of light. Statistical Science 15:254–278.
Taleb, Nassim Nicholas. 2007. The black swan: The impact of the highly improbable. New York, NY: Random House.
Tukey, John W. 1965. The future of data analysis. Annals of Mathematical Statistics 33:1–67.
Tukey, John W. 1991. The philosophy of multiple comparisons. Statistical Science 6:100–116.
However, I cannot go along with the author on much else.
Van der Laan defines statistics as “the science of learning from data.” If so, our profession should study others who have been successful at learning from data. However, statisticians have shown little interest in the role of data in the process of discovery and invention in the history of science and engineering, though a few exceptions exist (e.g., Box, 1999, on the Wright brothers; MacKay and Oldford, 2000, on the speed of light; Freedman, 2010, on examples from medicine and epidemiology). Statisticians need to hear Hamlet’s message: “There are more things in heaven and earth than are dreamt of in your philosophy.”
In particular, van der Laan is upset with the classic George Box quote, “Essentially all models are wrong, but some are useful,” which he dismisses as “complete nonsense.” He says that this quote has been used to justify the use of statistical models based on unrealistic assumptions that are known to be wrong. Instead, he speaks often of a “true” or “actual” model, in the form of a probability distribution. He illustrates the alleged folly of using wrong models with the example of building a spacecraft, where simplifying assumptions “would mean the death of the astronauts and certainly the failure of their mission.”
The spacecraft example actually undermines van der Laan's own line of reasoning. Consider the spacecraft's attitude control system, which prevents the ship from tumbling (thus maintaining communications antennae and/or solar panels in alignment with the Earth or sun). In a standard text such as Hughes (2004), the theory of attitude dynamics is presented using Newtonian mechanics, without reference to quantum theory or special and general relativity. Classical Newtonian mechanics is known to be wrong with respect to these later theories of physics, but using "modern physics" to model attitude dynamics would be unnecessarily complex and mathematically and computationally intractable. The use of Newtonian physics here provides approximate answers to the right questions, to borrow another phrase from Tukey (1965).
The history of science and engineering is littered with other examples. In optics, we have a choice of at least three models of light, in order of decreasing simplicity (and increasing correctness): geometrical optics (a ray theory of light), physical optics (a wave theory of light), and quantum optics (a photon theory of light, including wave-particle duality). All three continue to be used today, depending on the problem at hand. In fluid dynamics, we often deliberately model fluids as continuous materials, though we know they really consist of discrete molecules. In meteorology, the Lorenz (1963) model, a deliberate oversimplification of the equations of motion for thermal convection in the atmospheric boundary layer, provided groundbreaking insights into the consequences of nonlinearity for weather prediction.
In these examples, subject matter knowledge determines which model is sufficiently useful and fit for purpose for a given problem or question. At some level, all these models are “wrong,” but they can all be useful, depending on the context. George Box was right.
With regard to Big Data, many large data sets are not the results of designed experiments or surveys, and a random data-generating process may not be a plausible assumption. In some cases, the use of any probability model, even as a surrogate for our ignorance (as in queuing theory or the kinetic theory of gases), may become questionable. Beware of the ludic fallacy (Taleb, 2007), wherein a probability model is neither right nor useful, though possibly harmful. Since a probability distribution remains central to van der Laan's notion of a "true" model, I wonder whether his concepts are too narrow to deal with the whole spectrum of Big Data problems in circulation today.
The views expressed here are mine alone, and do not necessarily reflect the policies, views, or opinions of my employer.
Sincerely,
Christopher Tong
Response to Letters by Michael Lavine and Christopher Tong
Mark van der Laan is the Jiann-Ping Hsu/Karl E. Peace Professor in Biostatistics and Statistics at UC Berkeley. He is also a recipient of many awards, including the 2004 Spiegelman Award and the 2005 Committee of Presidents of Statistical Societies (COPSS) Award.
Let me start by stating that I am highly appreciative of the letters to the editor by Michael Lavine and Christopher Tong. Our field badly needs discussions on these important points. I view these letters as a constructive start of such a discussion.
Both authors disagree with my criticism of the statement by George Box that all models are wrong, but some are useful—a statement made in a historical context in which parametric models were the norm. Given that different notions of the word model are used, both within our discipline and across disciplines, it is important to first clarify that I refer to so-called statistical models, defined as the set of possible probability distributions of the data.
The statistical estimation problem depends only on the statistical model, while possible additional nonstatistical assumptions are often used to define interesting underlying quantities of interest and corresponding identifiability results, which then define the statistical estimand of interest. As long as one is willing to assume the data were the result of an experiment, the data have a probability distribution, and one can always define the model as the set of all probability distributions, which is a true model (i.e., it contains the true probability distribution of the data). We should pursue real statistical knowledge that restricts the statistical model but does not exclude the true distribution.
If one observes a single gene expression profile and does not know much about the joint distribution, then one should state the true model and thus acknowledge that there is no point in fooling each other with statistical inference based on a model that assumes all gene expressions are independent.
Michael Lavine and Christopher Tong point out that there are applications in which models, even when not true, can be very useful, and they use, in particular, Newtonian models in physics as an example. These examples from physics involve a parametric model that is highly accurate in describing the observed data, a situation we simply do not encounter in our typical biostatistical applications. However, even in this setting, in which the data are the result of a very well-understood experiment, if in a particular application the observed data were to contradict this parametric model, then it is my view, and I presume the view of physicists, that one should carry out statistics in a statistical model that contains the true probability distribution of the data. This allows one to honestly learn from the observed data and thereby move science forward in potentially very exciting directions: There is no benefit in obstructing the view of reality. This does not mean that a particular working model (i.e., a submodel of the true statistical model) is not of interest; it may be of great interest when it represents a highly accurate description of the data-generating distribution. For example, it will be good to know the best fit of this working model to the actual data and to test for specific deviations from this working model, but all of this should be done in the context of a true statistical model.
The importance of developing theory within a true model, relative to developing theory in a wrong model, is already easily illustrated with the simplest of all examples. Suppose that one fits a univariate linear regression model using least squares regression. A typical textbook will teach you that using weighted least squares linear regression can improve the efficiency of the estimator, but does not change the estimand. Of course, the true relation between the two variables is not linear. As a consequence, a weighted least squares regression is fitting the projection of the true curve on the set of all lines, and the choice of weights defines the norm used to define the projection. Thus, a different choice of weights targets a different line, and, in fact, the choice of estimator of the desired weights will affect the variability of the estimator of the intercept and slope.
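This effect of the weights on the estimand can be shown numerically. The sketch below uses a hypothetical true curve (a quadratic on a grid, no noise, names of my own choosing): because the truth is not linear, weighted least squares fits the weighted projection of the curve onto the set of lines, and changing the weights changes the slope being targeted.

```python
# Sketch with HYPOTHETICAL inputs: the true relation is y = x^2, which is
# not linear, so the weighted least-squares line is the weighted projection
# of the curve onto the set of all lines, and the slope depends on the weights.
def wls_line(x, y, w):
    """Closed-form weighted least-squares line (slope, intercept)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return slope, ybar - slope * xbar

xs = [i / 100 for i in range(101)]        # grid on [0, 1]
ys = [xi ** 2 for xi in xs]               # the true, nonlinear relation
flat = [1.0] * len(xs)                    # uniform weights
tilted = [1.0 + 9.0 * xi for xi in xs]    # up-weight large x

s1, _ = wls_line(xs, ys, flat)
s2, _ = wls_line(xs, ys, tilted)
print(s1, s2)   # different weights target a different line, i.e., estimand
```

The uniform-weight slope is about 1, while the tilted weights give a visibly larger slope, because they emphasize the region where the curve is steeper. Under a correctly specified linear model, the two slopes would coincide; here they do not.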
The same story applies to generalized estimating equations for generalized linear models, in which the choice of model for the covariance matrix of the residuals affects both the definition of the estimand and the variance of the estimator, in a far different manner than predicted by theory that assumes the model is correct. That is, we are teaching nonsense to our students by presenting theory that holds only under a model we all know is wrong, instead of teaching them about the real world.
Let’s consider the example mentioned by Michael Lavine, in which he argues that log(flux) depends approximately linearly on soil temperature and that linear regression techniques are useful even though these models are wrong. If, indeed, we know that the mean of log(flux), conditional on temperature and other variables, follows a semiparametric regression model with a linear term and an unspecified function of the other variables, without interactions between temperature and these other variables, then that is the model; but if that cannot be defended, then we have to state an even more nonparametric model. Starting out with a wrong (e.g., fully linear) model, then testing whether adding terms makes a difference, and proceeding with the modified model as if it were true is bad practice, although it is taught to our students. Forcing the choice of model to be truthful also forces one to define the target parameter of interest, instead of letting it be defined by a coefficient in a misspecified model.
Dr. Tong does not support my description of our field as the science of learning from data since, according to him, statisticians have shown little interest in the role of data in the process of discovery and invention in the history of science and engineering. As I argued in my piece, we statisticians have the responsibility to be an intrinsic part of scientific teams, so that the statistical methods and theory we develop and employ actually target the answer to the scientific question and thereby fully serve science. So yes, in many ways we have failed, but a good start in improving our standing is to be very clear about our role and advance our methods accordingly. The targeted learning approach respects the true meaning of model and target parameter and defines a roadmap that can be employed in any scientific setting while fully respecting the science.
Dr. Tong also wonders if assuming the existence of a true probability distribution of the data is too narrow to deal with the whole spectrum of Big Data problems. Let me first clarify that the existence of a probability distribution of the observed data does not require a designed experiment; it just requires an experiment, possibly a natural experiment such as simply observing wildlife in a large area. It is true that we are moving away from experiments defined by taking a random sample from a target population and getting more into experiments generated by total populations of connected individuals. To interpret such data, it will be absolutely necessary that we understand, as much as possible, the experiment that generated the data, define the scientific questions, and establish whether they are identifiable from the data. That is, we need to define the statistical model and estimand. We might learn that we need to improve and target our designs and advance statistical theory and methodology, and all of that will be progress.
One example from my experience comes to mind that points out the need for clarity, rather than perfection, in modeling. In an EEO litigation, the expert witness for those claiming racial discrimination in hiring (the plaintiffs) used simple regression and two-dimensional graphs. The defendant’s expert devised an exquisite regression model with over 50 parameters. In rendering his final judgment, the judge said (in simple words), “You, I understand (the plaintiff’s expert). You, I don’t understand (the defendant’s expert). I find for the plaintiffs.” In other words, a result that is truthful and useful to the client is often preferable to one that would charm our major professor but leave the client confused.
A main virtue of mathematical/statistical models is that they are explicit. Their building blocks and assumptions are transparent. Therefore, they can be examined and critiqued. Models provide a common language for people to have reasoned discourse. For any particular problem, there will likely be a series of models simultaneously existing and evolving. Different models will address different issues. The various issues will evolve and become clearer. New issues will arise which require new models or combinations of models which previously had existed separately. It is unrealistic to expect a single model to be superior under all relevant conditions.
Modeling provides a language for scientific discourse. Models help clarify our process of reasoning. They sharpen our ken. Ideally, they make explicit the assumptions that are otherwise unstated or hidden in scientific discourse. As our understanding, assumptions, and goals change, so should the models. The dichotomy of models as right and wrong is a gross misrepresentation of the richness and dynamism of the modeling process.
Statistics shouldn’t be an art, but it is definitely a craft. Most of us spend our time reducing complexity to simple techniques for people, like the judge above, who value simplicity even when it might not be true. Yesterday I spent an hour with a clinical researcher creating a graphic that would show the difference between a 2 × 2 contingency table and a ranking technique for blocked data. Our simple description was called “too wordy.” Is either model true? No. Is either useful? That is our current problem. Should we look to improve it? Yes.
And having all the data might provide truth, but is that truth of value? I interviewed for a project at a big pharma company (unfortunately, it was killed the next week). Since it was in biologics manufacturing, a few big data companies had their hands in all the process data. But the engineer said to me, “You are so different from all of them. You think about the content. These analysts bring me graphs and say things like ‘feed amount is correlated with microbe count.’” That is true, but useless. And adding terms to such a model, which Mark van der Laan disparages, offers a chance to segment the data and dive deeper for real-world adjustments to explanations. And as Deming pointed out, and as the ASA sued (but lost), sometimes sampling is better than full enumeration.
Data science uses algorithms, rather than models, to make predictions. It also covers many applications never discussed in this forum; it goes far beyond government statistics and statistics related to epidemiology and health care, encompassing actuarial science, quality control, and operations research, to name a few.