Statistics Losing Ground to Computer Science
Norman Matloff is a professor of computer science at the University of California, Davis, and was formerly in the statistics department at that institution. His research interests include parallel processing, statistical computing, and regression and classification methodology.
The American Statistical Association leadership and many in statistics academia have been undergoing a period of angst over the last few years. They worry that the field of statistics is headed for a future of reduced national influence and importance, with the feeling that the field is to a large extent being eclipsed by other disciplines, notably computer science (CS).
This is exemplified by the rise of a number of new terms, largely in CS, such as data science, Big Data, and analytics, with the popularity of the term machine learning growing rapidly. To many of us, this is “old wine in new bottles,” just statistics with new terminology.
I write this as both a computer scientist and statistician. I began my career in statistics, and though my departmental affiliation later changed to CS, much of my research in CS has been statistical in nature. And I submit that the problem goes beyond the ASA’s understandable concerns about the well-being of the statistics profession. The broader issue is not that CS people are doing statistics, but rather that they tend to do it poorly.
This is not a problem of quality of the CS researchers themselves; indeed, many are highly talented. Instead, there are a number of systemic reasons for this:
- The CS research model is based on very rapid publication, with the venue of choice being conferences rather than slow journals. The work is refereed, but just on a one-time basis, not with the back-and-forth interaction of journals. There is also rapid change in fashionable research topics. Thus there is little time for deep, long-term contemplation about the problems at hand. As a result, the work is less thoroughly conducted and reviewed.
- Due in part to the pressure for rapid publication and the lack of long-term commitment to research topics, most CS researchers working on statistical problems have little knowledge of the statistics literature, and they seldom cite it. There is much “reinventing the wheel,” and many missed opportunities. For instance, consider a well-known CS paper by a prominent author on the use of mixed labeled and unlabeled training data in classification. Sadly, the paper cites nothing from the extensive statistics literature on this topic, which goes back to 1977.
- The CS “engineering-style” research model often causes a cavalier attitude toward underlying models and assumptions. Consider, for example, a talk I attended by a machine learning specialist who had just earned her PhD at one of the very top CS departments in the world. She had taken a Bayesian approach, and I asked why she had chosen that specific prior distribution. She couldn’t answer—she had just blindly used what her thesis adviser had given her. Moreover, she was baffled as to why anyone would want to know why that prior was chosen.
- CS people tend to have grand—and sometimes starry-eyed—ambitions. On the one hand, this is a huge plus, leading to highly impressive feats such as recognizing faces in a large crowd. But this mentality leads to an oversimplified view, with everything being viewed as a paradigm shift.
Neural networks epitomize this problem. Enticing phrasing such as “neural networks work like the human brain” blinds many CS researchers to the fact that neural nets are not fundamentally different from other parametric and nonparametric methods for regression and classification. Among CS folks, there is often a failure to understand that the celebrated accomplishments of “machine learning” have come mainly from applying huge resources to a problem, rather than because fundamentally new technology has been invented.
None of this is to say that people in CS should stay out of statistics research. But the sad truth is that the process of CS overshadowing statistics researchers in their own field is causing precious resources—research funding, faculty slots, the best potential grad students, attention from government policymakers—to go quite disproportionately to CS, even though the statistics community is arguably better equipped to make use of them. Statistics is important to the nation and to the world, and if scarce resources aren’t being used well, it’s everyone’s loss.
What can be done? I offer the following as a start:
- There should be more joint faculty appointments between CS and statistics departments. Teaching a course in the “other” department forces one to think more carefully about the issues in that field and fosters interaction between fields.
- CS undergraduates should be encouraged to pursue a double major with statistics, and to go on for graduate work in statistics. There are excellent precedents for the latter, such as Hadley Wickham and Michael Kane, both of them winners of the John Chambers Statistical Software Award.
- Statistics researchers should be much more aggressive in working on complex, large-scale, “messy” problems, such as the face recognition example cited earlier.
- Though many statisticians have first-rate computing skills, statistics should reach out to CS for collaboration in advanced areas, as the R project is doing with CS compiler experts.
- Stat undergraduate and graduate curricula should be modernized (while retaining mathematical rigor). Even math stat courses should involve computation. Emphasis on significance testing, well known to be under-informative at best and misleading at worst, should be reduced. Modern tools, such as cross-validation and nonparametric density/regression estimation, should be brought in early in the curriculum.
The academic world is slow to change, but the stakes here are high. There is an urgent need for the fields of CS and statistics to re-examine their roles in research, both individually and in relation to each other.
Editor’s Note: This article was also published on the StatsLife website.