Time to Embrace a New Identity?
There is no question in my mind that statisticians are navigating a sea change. As a profession, we have made high-quality contributions to many fields over the past decades, with our engagement being perfectly epitomized in the recent book Statistics in Action: A Canadian Outlook. However, one cannot help but notice the recent trends (and hype) in the closely aligned, and somewhat vaguely defined, fields of analytics, Big Data, data science, and machine learning and wonder if our current model will continue to do well.
As a statistician, I am concerned. As a professional who recently migrated from the cancer research “sandbox” to the energy industry “sandbox,” I am facing numerous challenges associated with poor statistical literacy and the burden of the image problem we suffer from, which was so well captured by Brian Everitt in his book, Chance Rules: An Informal Guide to Probability, Risk, and Statistics: “[Statistics] conjures either a near-sighted character amassing volumes of figures about cricketers’ bowling and batting averages […] or a government civil servant compiling massive tables of figures.” Hype aside, we are seeing a distortion of our field, the reinvention of many concepts, and the sad disregard of our contributions, often by those who do not analyze data professionally.
Because I happen to believe that many of our members and colleagues are insulated from what goes on outside of their fields, we may not fully understand the repercussions these events can have on our future. Hence, I feel compelled to list a few examples:
First, consider the consulting firm McKinsey & Company, which wrote a report on Big Data in 2011. On Page 28, they list the techniques to analyze Big Data as mostly coming from the field of machine learning. However, I count at least 11 techniques that were developed in the field of statistics. On Page 30, regression, predictive modeling, and statistics are separate entities. And that’s not all. On Page 47, the authors list the new R&D opportunity in health care as “analyzing clinical trials data.” Does this imply we have not been analyzing clinical trials in the past? Now, to put this into perspective, consider how influential and trusted McKinsey is. Add the low statistical literacy of most organizations and we have a problem: Treating this field as novel ignores nearly 300 years of statistical history, and most people looking into Big Data won’t realize that.
Second, the machine learning attitude toward statistics is worrisome. All too often, we observe bright computer scientists who can pick up computational aspects of our work, yet rarely possess the solid statistical foundations needed to properly tackle the problem—from poor research methods to ignoring uncertainty. In a guest blog post on FierceBigData, ASA Executive Director Ronald Wasserstein wrote, “Are the data collected in a way that introduces bias? Are there missing or incomplete data? Are there different kinds of data? Statisticians not only know how to ask the right questions, but may have practical solutions already available.” There are plenty of examples of this attitude, especially on the popular forum Cross Validated. In turn, this leads to provocative articles such as “The Death of the Statistician” and “Is Data Science the End of Statistics? A Discussion.”
Third, many things are being reinvented. Bradley Efron once said, “Statistics is the science of learning from experience. Those who ignore statistics are condemned to reinvent it.” According to Wikipedia, logistic regression is a classifier. More recently, Hadley Wickham noticed how nearly 50 years of statistical smoothing literature has had little effect on information visualization, which had to reinvent the wheel.
Fourth, as Randy Bartlett explains in A Practitioner’s Guide to Business Analytics, making data analysis software more user-friendly has opened the flood gates holding back statistical malfeasances. The desire to simplify tools, methods, and solutions for use by business users has led to what some people refer to as a culture of “buttonology.” Frank Harrell had this to say: “What I most fear is that statistics wasn’t respected enough before the machine learning field went viral, and things have just gone from bad to worse. The ready availability of software has hurt.”
Fifth, false novelty is feeding reinvention. Consider Terry Speed’s talk on Big Data, for instance, in which he gave a memorable example. A University of California alumni magazine article on Big Data showed an empty row for statistics. Economics, chemistry, marketing, computer science? All there. Statistics? Nope. And to add insult to injury, they have not forgotten it; it’s simply empty, as if statistics contributed nothing. I echo what Jeff Leek wrote on his Simply Statistics blog: This “shows a fundamental disrespect for applied statisticians who have developed serious expertise in a range of scientific disciplines.”
A recent report on the future of the statistical sciences says, “Statisticians, with some prominent exceptions, also have been unwilling or unable to communicate to the rest of the world the value (and excitement) of their work.” This sentence hints at the consequences we may face if we do not act quickly: We may never have existed in the eyes of many and our contributions may be reinvented and re-packaged in a different field.
Also, the report confirms challenges we have never faced in the past: “Undoubtedly the greatest challenge and opportunity that confronts today’s statisticians is the rise of Big Data.” While some think these trends will “eventually fade,” as the data mining movement of the ’90s did, I believe there is too much economic interest for them to simply fade away. If the number of analytics software packages and languages is any indication of things to come, this movement is hardly going to fade.
I am convinced that despite the misguided direction and pitfalls, the focus and attention on Big Data (or data science) is mostly a good thing. Sure, Big Data is not going to change those organizations and research institutions that have been doing this work for decades. However, it will inevitably bring a more evidence-based approach to the way companies do business and the government makes policies. This progress, however, may come at a price.
Statistical certification is largely unrecognized outside academic and research institutions. I suspect this was meant to protect us from the very improvised statisticians who contributed to the bad image. It might have worked had things stayed the same. I think we are falling victim to the complacency of our own culture. Perhaps ASA Past President Robert Rodriguez saw this coming when he suggested we use the big tent approach.
Doing nothing and hoping problems will fade away is not a good strategy.
First, this is going to hurt us because we cannot properly assert our knowledge and contributions against parallel fields with a much more rapid mechanism of spreading new ideas (e.g., conference proceedings are typical in CS/ML vs. peer-review in statistics).
Second, our lack of visibility in other fields may deprive our departments and professors of the needed funding and recognition.
Third, we have been unable or unwilling to prepare the next generation of applied statisticians for a workplace that might change substantially. At present, statistics departments are reluctant to incorporate feedback from applied statisticians in the field. Applied statisticians must finish their basic training after graduate school.
Fourth, a multitude of certifications are now being established to capitalize on the recent data movement. Should we not be at the forefront of this? Shouldn’t our certifications be the most highly regarded, owing to our nearly 300-year history? INFORMS (an operations research organization) is aggressively pushing its certification, CAP, which is establishing itself as the certification for analytics. A quick scan of its content reveals it covers a blend of data management and data analysis.
There are multiple ways we can become more engaged. At a minimum, acknowledging and talking about these issues is a first step. Here are a few ideas.
- Consider being active on social media. There are numerous venues to show the rest of the world the value and excitement of our work: Stack Exchange, LinkedIn, Twitter, Facebook, Quora, and the many fora specific to statistical software packages are some of the most obvious choices. I am part of a team founding About Data Analysis (ADA), a new LinkedIn discussion group specific to data analysis issues.
- Consider stepping outside of your comfort zone. For example, many of the methods we commonly use are now being used in other fields (e.g., survival analysis in marketing). Why not speak at conferences outside your sandbox to those who are starting to use the very methods we have mastered?
- Consider making some of your work openly available. Write a blog or an open-access paper. If a paper was not accepted at a journal, why not make it freely available?
- If you teach, consider approaching your department about making video tutorials. Look at the work of Jeff Leek and Roger Peng for examples.
- If you have videos of your conference presentations, make them available.
- As a profession, we should explore diversifying our certification programs or joining forces with similar and reputable professional organizations.
As a profession, we need to have the courage to look outside the wall that has so far protected us from unscrupulous intruders. As Randy Bartlett wrote in Amstat News, “[T]o differentiate our value proposition, we must be involved.” We need to involve ourselves with other parallel fields, learn about their problems, and share existing solutions. This does not mean lowering our standards for rigorous results. We cannot defend our profession and retain our current customers by building walls meant to keep the barbarians out. We need to empower our applied statisticians with certification and more applied training. Furthermore, we need to build bridges to support their entrance into other fields.
Editor’s Note: A version of this article was published in the August 2014 issue of Liaison, the newsletter of the Statistical Society of Canada.