## Statistics Losing Ground to Computer Science

*Norman Matloff*

The American Statistical Association leadership and many in statistics academia have been undergoing a period of angst the last few years. They worry that the field of statistics is headed for a future of reduced national influence and importance, with the feeling that the field is to a large extent being eclipsed by other disciplines, notably computer science (CS).

Norman Matloff is a professor of computer science at the University of California, Davis, and was formerly in the statistics department at that institution. His research interests include parallel processing, statistical computing, and regression and classification methodology.

This is exemplified by the rise of a number of new terms, largely in CS, such as data science, Big Data, and analytics, with the popularity of the term machine learning growing rapidly. To many of us, this is “old wine in new bottles,” just statistics with new terminology.

I write this as both a computer scientist and statistician. I began my career in statistics, and though my departmental affiliation later changed to CS, much of my research in CS has been statistical in nature. And I submit that the problem goes beyond the ASA’s understandable concerns about the well-being of the statistics profession. The broader issue is not that CS people are doing statistics, but rather that they tend to do it poorly.

This is not a problem with the quality of the CS researchers themselves; indeed, many are highly talented. Instead, there are a number of systemic reasons for it:

- The CS research model is based on very rapid publication, with the venue of choice being conferences rather than slow journals. The work is refereed, but just on a one-time basis, not with the back-and-forth interaction of journals. There is also rapid change in fashionable research topics. Thus there is little time for deep, long-term contemplation about the problems at hand. As a result, the work is less thoroughly conducted and reviewed.
- Due in part to the pressure for rapid publication and the lack of long-term commitment to research topics, most CS researchers in statistical issues have little knowledge of the statistics literature, and they seldom cite it. There is much “reinventing the wheel,” and many missed opportunities.
- For instance, consider a well-known CS paper by a prominent author on the use of mixed labeled and unlabeled training data in classification. Sadly, the paper cites nothing in the extensive statistics literature on this topic, going back to 1977.
- The CS “engineering-style” research model often causes a cavalier attitude toward underlying models and assumptions. Consider, for example, a talk I attended by a machine learning specialist who had just earned her PhD at one of the very top CS departments in the world. She had taken a Bayesian approach, and I asked why she had chosen that specific prior distribution. She couldn’t answer—she had just blindly used what her thesis adviser had given her. Moreover, she was baffled as to why anyone would want to know why that prior was chosen.
- CS people tend to have grand—and sometimes starry-eyed—ambitions. On the one hand, this is a huge plus, leading to highly impressive feats such as recognizing faces in a large crowd. But this mentality leads to an oversimplified view, with everything being viewed as a paradigm shift.

Neural networks epitomize this problem. Enticing phrasing such as “neural networks work like the human brain” blinds many CS researchers to the fact that neural nets are not fundamentally different from other parametric and nonparametric methods for regression and classification. Among CS folks, there is often a failure to understand that the celebrated accomplishments of “machine learning” have come mainly from applying huge resources to a problem, rather than because fundamentally new technology has been invented.

None of this is to say that people in CS should stay out of statistics research. But the sad truth is that the process of CS overshadowing statistics researchers in their own field is causing precious resources—research funding, faculty slots, the best potential grad students, attention from government policymakers—to go quite disproportionately to CS, even though the statistics community is arguably better equipped to make use of them. Statistics is important to the nation and to the world, and if scarce resources aren’t being used well, it’s everyone’s loss.

What can be done? I offer the following as a start:

- There should be more joint faculty appointments between CS and statistics departments. Teaching a course in the “other” department forces one to think more carefully about the issues in that field and fosters interaction between fields.
- CS undergraduates should be encouraged to pursue a double major with statistics, and to go on for graduate work in statistics. There are excellent precedents for the latter, such as Hadley Wickham and Michael Kane, both of them winners of the John Chambers Statistical Software Award.
- Statistics researchers should be much more aggressive in working on complex, large-scale, “messy” problems, such as the face recognition example cited earlier.
- Though many statisticians have first-rate computing skills, stat should reach out to CS for collaboration in advanced areas, such as the R project is doing with CS compiler experts.
- Stat undergraduate and graduate curricula should be modernized (while retaining mathematical rigor). Even math stat courses should involve computation. Emphasis on significance testing, well known to be under-informative at best and misleading at worst, should be reduced. Modern tools, such as cross-validation and nonparametric density/regression estimation, should be brought in early in the curriculum.

The academic world is slow to change, but the stakes here are high. There is an urgent need for the fields of CS and statistics to re-examine their roles in research, both individually and in relation to each other.

**Editor’s Note:** This article also was published on the StatsLife website.

Daniel Normolle said: I would also add as a side note that neural nets use a model of neuronal function from the 1950s that neuroscientists now consider simplistic.

I am told that a fairly large internal RFA at my institution (the University of Pittsburgh) for the analysis of Big Data is being written by the office of the leadership of the Schools of Health Scientists to which Statisticians and Epidemiologists will not be allowed to apply. Apparently, we move too slow and are not sufficiently enthusiastic.

Trevor Butcher said: This comment by Daniel Normolle is, I believe, very important: ‘Apparently, we move too slow and are not sufficiently enthusiastic.’

As a former engineer working in a linguistics environment I cannot believe the lack of enthusiasm, the unwillingness to experiment, the conservatism. As a result I believe that we should be examining what kind of people we are accepting into traditional fields, and what is happening back in the classrooms of schools that whole fields of people end up being considered conservative (although, I might add, not all members of those fields are conservative).

william e winkler said: “Statistics researchers should be much more aggressive in working on complex, large-scale, “messy” problems, such as the face recognition example cited earlier.”

“Though many statisticians have first-rate computing skills, stat should reach out to CS for collaboration in advanced areas, such as the R project is doing with CS compiler experts.”

I give some complementary insight to Norman’s excellent comments.

One venue (there are many others now) is the Conference on the Interface on Computing Science and Statistics. Three keynote speakers at the Interface, Brad Efron, Jerry Friedman, and Leo Breiman, had all warned 10+ years ago that CS would overtake Statistics for a large number of problems, particularly the newer, far larger, problems.

If you read the papers in conferences such as Uncertainty in Artificial Intelligence, Neural Information Processing Systems, International Conference on Machine Learning, and a number of machine learning journals, you will notice that, by removing 20% or less of the papers, the papers are basically statistics. The difference is that the papers apply computational methods that are considerably beyond methods that many statisticians are capable of doing. When I first read the papers 12+ years ago, I immediately thought “Why should CS folks (particularly ML ones) even bother with statisticians?” and “Will mainstream statisticians ever be able to do more than a small fraction of what the ML folks are doing?”

On a more specific level, Michael Jordan showed how to do the theory associated with a hierarchical mixture of ‘experts’ (i.e., mixture of statistical mixture models). Jordan also showed how to do the computation using variational inequalities. Thomas Minka and John Lafferty extended both the theory and the speed of applications using an Expectation-Propagation Algorithm (related to EM-type techniques). We expect breakthroughs on theory. We should also expect breakthroughs on the computational algorithms that are properly tied in with the theory as with the above work.

One place where most CS folks are comparatively weak is where statisticians are even weaker: Working with exceptionally large files using conventional or unconventional statistical models. In some situations, we might want to clean up and do analysis on a set of files using the Fellegi-Holt Model of Statistical Data Editing (JASA 1976) or the Fellegi-Sunter Model of Record Linkage (JASA 1969). Unfortunately, the type of integer programming methods of the FH edit model and the search/retrieval/comparison and approximate string comparison methods of the FS model have almost entirely been taken over by the CS folks. The issue is that many of the computational algorithms needed for the FH model and the FS model are nearly the same as the algorithms needed for conventional statistical models.