## Statistics Ready for a Revolution

## Next Generation of Statisticians Must Build Tools for Massive Data Sets

*Mark van der Laan, Jiann-Ping Hsu/Karl E. Peace Professor in Biostatistics and Statistics at UC Berkeley, and Sherri Rose, PhD candidate at UC Berkeley*

The statistics profession has reached a tipping point. The need for valid statistical tools is greater than ever; data sets are massive, often recording hundreds of thousands of measurements for a single subject. The field is ready for a revolution, one driven by clear, objective benchmarks by which tools can be evaluated.

The new generation of statisticians must be ready to take on this challenge. They have to be dynamic and thoroughly trained in statistical concepts. They have to work effectively on an interdisciplinary team and understand the immense importance of objective benchmarks to evaluate statistical tools. They have to produce energetic leaders who stick to a roadmap, but who also break with current practice when necessary.

Why do we need a revolution? Sadly, 99.99% of all data analyses are based on the application of so-called parametric (or other restrictive) statistical models that assume the data-generating distributions have specific forms. Many agree that these models are wrong. That is, statisticians know that linear or logistic regression models and Cox proportional hazards models are specified incorrectly. But they still use them to draw conclusions and then hope these conclusions are not too wrong.

The original purpose of a statistical model was to develop a set of realistic assumptions about the probability distribution generating the data set (i.e., incorporating background knowledge). However, restrictive parametric models are almost always used because standard software is available. These models also allow the user to obtain *p*-values and confidence intervals for the target parameter of the probability distribution, which users want in order to make sense of the data.

Unfortunately, these measures of uncertainty about our estimates are even more susceptible to bias than the effect estimates. We know that for large enough sample sizes, every study—including ones in which the null hypothesis of no effect is true—will declare a statistically significant effect.
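The following is a minimal simulation sketch of one way this can happen, not an analysis from the article. It assumes a hypothetical data-generating process in which treatment `a` has no effect on outcome `y`, but both depend on a covariate `w` in a nonlinear way. The analyst adjusts for `w` with a main-terms linear model, which is misspecified. The function name, the constants, and the use of the standard OLS standard error are all illustrative assumptions.

```python
import numpy as np


def treatment_z_statistic(n, rng):
    """OLS z-statistic for treatment under a misspecified linear model.

    Hypothetical setup: the true effect of `a` on `y` is exactly zero.
    """
    w = rng.normal(size=n)
    # Treatment is slightly more likely when |w| is large.
    p_treat = 0.45 + 0.10 * (np.abs(w) > 1.0)
    a = rng.binomial(1, p_treat)
    # Outcome depends on w**2 only; `a` plays no role.
    y = w**2 + rng.normal(size=n)

    # Misspecified working model: y ~ intercept + a + w (no w**2 term).
    X = np.column_stack([np.ones(n), a, w])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1] / np.sqrt(cov[1, 1])


rng = np.random.default_rng(0)
z_by_n = {n: treatment_z_statistic(n, rng) for n in (100, 1_000, 10_000, 200_000)}
for n, z in z_by_n.items():
    print(f"n = {n:>7}: z-statistic for treatment = {z:.2f}")
```

Because the working model leaves out the `w**2` term, the treatment coefficient converges to a nonzero value, so the z-statistic grows roughly like the square root of `n` even though the null hypothesis is true.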

Some practitioners will tell you they have extensive training, are experts in applying these tools, and should be allowed to choose the models to use in response to the data. Be alarmed. It is no accident that the chess computer beats the world champion in chess. Humans are not as good at learning from data and are easily swayed by their beliefs about those data.

For example, an investigator may be convinced his or her data have a particular functional form, but if you bring in another expert, his or her belief about the functional form may differ. Or, many models may be run, dropping variables that are nonsignificant in each model. While this is common, it leaves us with faulty inference.

With high-dimensional data, not only is the correct specification of the parametric model an impossible challenge, but the complexity of the parametric model also may increase so that there are more unknown parameters than observations. The true function also might be described by a complex function not easily approximated by main terms.
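To make the "more unknown parameters than observations" problem concrete, here is a small sketch under assumed dimensions (20 observations, 50 parameters, all values simulated). It shows that least squares can then fit pure noise perfectly, and that a different coefficient vector fits equally well, so the data cannot identify the parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_params = 20, 50  # illustrative sizes, chosen so n_params > n_obs

X = rng.normal(size=(n_obs, n_params))
y = rng.normal(size=n_obs)  # pure noise: no real relationship to X

# The design matrix cannot have rank above the number of observations.
rank = np.linalg.matrix_rank(X)

# Minimum-norm least-squares solution; it reproduces y exactly.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
max_abs_residual = np.max(np.abs(y - X @ beta))

# Adding any null-space direction of X gives another perfect fit.
_, _, vt = np.linalg.svd(X)
null_direction = vt[-1]  # right singular vector with zero singular value
beta_alt = beta + 10.0 * null_direction
max_abs_residual_alt = np.max(np.abs(y - X @ beta_alt))

print("rank of X:", rank, "with", n_params, "parameters")
print("largest residual, first solution: ", max_abs_residual)
print("largest residual, second solution:", max_abs_residual_alt)
```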

For these reasons, allowing humans to include only their true, realistic knowledge (e.g., treatment is randomized, such as in a randomized controlled trial, and our data set represents independent and identically distributed observations of a random variable) is essential.

What about machine learning, which is concerned with the development of black-box algorithms that map data (and few assumptions) into desired objects? Indeed, this is in contrast to using misspecified parametric models, but the goal is often the whole prediction function, instead of particular effects of interest.

Even in machine learning, however, there is often unsupported devotion to beliefs, in this case the belief that certain algorithms are superior. No single algorithm (e.g., random forests, support vector machines, etc.) will always outperform all others on all data types, or even within specific data types (e.g., SNP data from genomewide association studies). One can’t know a priori which algorithm to choose. It’s like picking the student who gets the top grade in a course on the first day of class.
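One objective way to choose without prior beliefs is to let cross-validated risk decide. The sketch below is an illustration, not a method named in the article. It assumes three toy candidate learners (a constant mean, a straight line, and a 3-nearest-neighbor average), a squared-error loss, and two simulated data sets. All function and variable names are hypothetical.

```python
import numpy as np


def fit_mean(x, y):
    mu = y.mean()
    return lambda x_new: np.full(x_new.shape, mu)


def fit_linear(x, y):
    slope, intercept = np.polyfit(x, y, deg=1)
    return lambda x_new: intercept + slope * x_new


def fit_knn(x, y, k=3):
    def predict(x_new):
        dist = np.abs(x_new[:, None] - x[None, :])
        nearest = np.argpartition(dist, k - 1, axis=1)[:, :k]
        return y[nearest].mean(axis=1)
    return predict


CANDIDATES = {"mean": fit_mean, "linear": fit_linear, "knn_3": fit_knn}


def cv_risks(x, y, n_folds=5, seed=0):
    """Cross-validated mean squared error for every candidate learner."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), n_folds)
    risks = {}
    for name, fit in CANDIDATES.items():
        sq_err = 0.0
        for val_idx in folds:
            train = np.ones(len(x), dtype=bool)
            train[val_idx] = False
            predict = fit(x[train], y[train])
            sq_err += np.sum((y[val_idx] - predict(x[val_idx])) ** 2)
        risks[name] = sq_err / len(x)
    return risks


def select(x, y):
    """Name of the candidate with the smallest cross-validated risk."""
    risks = cv_risks(x, y)
    return min(risks, key=risks.get)


rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=500)
y_linear = 2 * x + rng.normal(scale=0.3, size=500)       # truth is a line
y_sine = np.sin(3 * x) + rng.normal(scale=0.3, size=500)  # truth is wiggly

winner_linear = select(x, y_linear)
winner_sine = select(x, y_sine)
print("linear truth ->", winner_linear)
print("sine truth   ->", winner_sine)
```

The same selection rule picks different learners on the two data sets, so the choice is driven by a benchmark computed from the data rather than by a prior preference for one algorithm.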

The concept of a model is also important. We need to be able to incorporate true knowledge in an effective way. In addition, we need such data-adaptive tools for all parameters of the data-generating distribution, including parameters targeting causal effects of interventions on the system underlying the data-generating experiment. The latter typically represents our real interest: We are not only trying to sensibly observe, but also to learn how the world operates.

The tools we develop must be grounded in theory, such as an optimality theory, that shows certain methods are more optimal than others. For example, one can compare methods based on mean squared error with respect to the truth. It is not enough to have tools that use the data to fit the truth well. We also require an assessment of uncertainty, the very backbone of statistical learning. That is, we cannot give up on reliable assessment of uncertainty in our estimates.
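As a small illustration of comparing methods by mean squared error with respect to the truth, the sketch below simulates data where the truth is known. The setup is assumed for the example: the target is the center of a symmetric distribution (true value 0), the two competing estimators are the sample mean and the sample median, and the sample size and number of replications are arbitrary.

```python
import numpy as np


def simulated_mse(sampler, estimator, truth, n=100, n_reps=2000, seed=0):
    """Monte Carlo estimate of E[(estimate - truth)^2]."""
    rng = np.random.default_rng(seed)
    estimates = np.array([estimator(sampler(rng, n)) for _ in range(n_reps)])
    return np.mean((estimates - truth) ** 2)


# Two hypothetical data-generating distributions, both centered at 0.
samplers = {
    "normal": lambda rng, n: rng.normal(loc=0.0, scale=1.0, size=n),
    "laplace": lambda rng, n: rng.laplace(loc=0.0, scale=1.0, size=n),
}
estimators = {"mean": np.mean, "median": np.median}

mse = {
    (dist, est): simulated_mse(samplers[dist], estimators[est], truth=0.0)
    for dist in samplers
    for est in estimators
}
for (dist, est), value in mse.items():
    print(f"{dist:>8} data, sample {est:<6}: MSE = {value:.4f}")
```

Under these assumptions the sample mean has the smaller error for normal data and the sample median has the smaller error for Laplace data, which is the kind of benchmark comparison the paragraph calls for: which estimator is better depends on the data-generating distribution, and the comparison is made against the truth rather than by convention.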

The new generation of statisticians cannot be afraid to go against standard practice. Remaining open to, interested in, and a developer of newer, sounder methodology is perhaps the one key act statistics students can perform. We must all continue learning, questioning, and adapting as new statistical challenges are presented.

The science of learning from data (i.e., statistics) is arguably the most beautiful and inspiring field—one in which we try to understand the very essence of human beings. However, we should stop fooling ourselves and actually design and develop powerful machines and statistical tools that can carry out specific learning tasks.

There is no better time to make a truly meaningful difference.

Galit Shmueli said: I completely agree that it is prime time for statisticians to step out of the “statistics bubble” that most of us have studied and operated in and evolve to the next level.

To develop effective methods and approaches for both scientific development and practical use, I have found that an excellent trick is collaborating with non-statisticians. I was quite surprised to see how my social scientist colleagues use statistical methods for building and testing theory – they use them mainly for testing causal hypotheses, mostly with regression-type methods. Another example is how epidemiologists and environmental scientists use statistical inference to infer predictive power.

Our role as “new generation statisticians” is therefore to step outside of our own community, stat departments, and stat research and look at how statistics is used by others in academia and in practice. We should then communicate clearly via non-statistics journals and conferences. This, of course, requires substantial learning of new writing standards and of a different culture, but it is the only way to actually make a difference.

I can attest from my own experience that collaborations with non-statistician colleagues in various disciplines have led me to substantial insights about our field, about its current directions, and about new directions that are critical for scientific development. You can only see the “big picture” by taking a step back. For instance, one insight that I was lucky to discover is the deep misunderstanding about the difference between modeling for purposes of prediction, description, and causal explanation (see http://www.rhsmith.umd.edu/faculty/gshmueli/web/html/explain-predict.html). Another is that prediction is considered non-academic in many disciplines. It has taken a serious effort to pass on this message to the information systems community (publishing in their top journals, receiving “best paper” awards in conferences), but the impact is likely to be high.

In short, “next generation statisticians”, in my opinion, should strive to be more like Renaissance Homo Universalis.

Jeffrey Monroe said: The article sums up nicely the challenges industry statisticians face. Restrictive models no longer sufficiently address the challenges of global businesses. Moore’s law described the trend in computing hardware, namely that the number of transistors that can be placed inexpensively on an integrated circuit has doubled approximately every two years. The trend in available data undoubtedly follows a similar, if not stronger, growth path (http://www.information-management.com/infodirect/20050930/1038403-1.html). Even if technology begins to level out around 2020 (http://news.cnet.com/New-life-for-Moores-Law/2009-1006_3-5672485.html), the amount of data would continue to build on history.

As databases continue to grow, it is important for the statistician to be increasingly involved in database development and design. The statistician should serve as a bridge between the I.T. professional and the business manager. The statistician is uniquely qualified to make sense of the ever-increasing data, and organizing variables in the most efficient way is critical. A song that I think sums up the statistician’s role follows (borrowed heavily from Ray Parker Jr.’s “Ghostbusters” lyrics):

A STATISTICIAN

If there’s some strange variable

in your database

Who ya gonna call?

A STATISTICIAN

If there’s some weird design

and it don’t look good

Who ya gonna call?

A STATISTICIAN

I ain’t afraid of no data

I ain’t afraid of no data

If you’re feeling bias

running through your head

Who can ya call?

A STATISTICIAN

This makes no sense

is what your boss said

Who ya gonna call?

A STATISTICIAN

I ain’t afraid of no data

I ain’t afraid of no data

Who ya gonna call?

A STATISTICIAN

If ya all alone

pick up the phone

and call

A STATISTICIAN

I ain’t afraid of no data

I hear it likes randomness

I ain’t afraid of no data

Yeah Yeah Yeah Yeah

Who ya gonna call?

A STATISTICIAN

If you’ve had a dose of a

freaky model baby

Ya better call

A STATISTICIAN

Lemme tell ya something

Analyzin’ makes me feel good!

I ain’t afraid of no data

I ain’t afraid of no data

Don’t get caught alone no no

A STATISTICIAN

When data comes through your door

Unless you just want some more

I think you better call

A STATISTICIAN

Who ya gonna call?

A STATISTICIAN

Who ya gonna call?

A STATISTICIAN

I think you better call

A STATISTICIAN

Who ya gonna call?

A STATISTICIAN

I can’t hear you

Who ya gonna call?

A STATISTICIAN

Louder

A STATISTICIAN

Who ya gonna call?

A STATISTICIAN

Who can ya call?

A STATISTICIAN

Who ya gonna call?

A STATISTICIAN