Time to Embrace a New Identity?

1 October 2014
Thomas Speidel

    There is no question in my mind that statisticians are living through a sea change. As a profession, we have made high-quality contributions to many fields over the past decades, with our engagement being perfectly epitomized in the recent book Statistics in Action: A Canadian Outlook. However, one cannot help but notice the recent trends (and hype) in the closely aligned, and somewhat vaguely defined, fields of analytics, Big Data, data science, and machine learning and wonder whether our current model will continue to do well.

    As a statistician, I am concerned. As a professional who recently migrated from the cancer research “sandbox” to the energy industry “sandbox,” I am facing numerous challenges associated with poor statistical literacy and the burden of the image problem we suffer from, which was so well captured by Brian Everitt in his book, Chance Rules: An Informal Guide to Probability, Risk, and Statistics: “[Statistics] conjures either a near-sighted character amassing volumes of figures about cricketers’ bowling and batting averages […] or a government civil servant compiling massive tables of figures.” Hype aside, we are seeing a distortion of our field, the reinvention of many concepts, and the sad disregard of our contributions, often by those who do not analyze data professionally.

    I believe many of our members and colleagues are insulated from what goes on outside their fields, so we may not fully understand the repercussions these events can have on our future. Hence, I feel compelled to list a few examples:

    First, consider the consulting firm McKinsey & Company, which wrote a report on Big Data in 2011. On Page 28, they list the techniques to analyze Big Data as mostly coming from the field of machine learning. However, I count at least 11 techniques that were developed in the field of statistics. On Page 30, regression, predictive modeling, and statistics are separate entities. And that’s not all. On Page 47, the authors list the new R&D opportunity in health care as “analyzing clinical trials data.” Does this imply we have not been analyzing clinical trials in the past? Now, to put this into perspective, consider how influential and trusted McKinsey is. Add the low statistical literacy of most organizations and we have a problem: Treating this field as novel ignores nearly 300 years of statistical history, and most people looking into Big Data won’t realize that.

    Second, the machine learning attitude toward statistics is worrisome. All too often, we observe bright computer scientists who can pick up computational aspects of our work, yet rarely possess the solid statistical foundations needed to properly tackle the problem—from poor research methods to ignoring uncertainty. In a guest blog post on FierceBigData, ASA Executive Director Ronald Wasserstein wrote, “Are the data collected in a way that introduces bias? Are there missing or incomplete data? Are there different kinds of data? Statisticians not only know how to ask the right questions, but may have practical solutions already available.” There are plenty of examples of this attitude, especially on the popular forum Cross Validated. In turn, this leads to provocative articles such as “The Death of the Statistician” and “Is Data Science the End of Statistics? A Discussion.”

    Third, many things are being reinvented. Bradley Efron once said, “Statistics is the science of learning from experience. Those who ignore statistics are condemned to reinvent it.” According to Wikipedia, logistic regression is a classifier. More recently, Hadley Wickham noticed how nearly 50 years of statistical smoothing literature has had little effect on information visualization, which had to reinvent the wheel.

    Fourth, as Randy Bartlett explains in A Practitioner’s Guide to Business Analytics, making data analysis software more user-friendly has opened the floodgates that once held back statistical malfeasance. The desire to simplify tools, methods, and solutions for use by business users has led to what some people refer to as a culture of “buttonology.” Frank Harrell had this to say: “What I most fear is that statistics wasn’t respected enough before the machine learning field went viral, and things have just gone from bad to worse. The ready availability of software has hurt.”

    Fifth, false novelty is feeding reinvention. Consider Terry Speed’s talk on Big Data, for instance, in which he gave a memorable example. A University of California alumni magazine article on Big Data showed an empty row for statistics. Economics, chemistry, marketing, computer science? All there. Statistics? Nope. And to add insult to injury, the row was not forgotten; it is there but simply empty, as if statistics contributed nothing. I echo what Jeff Leek wrote on his Simply Statistics blog: This “shows a fundamental disrespect for applied statisticians who have developed serious expertise in a range of scientific disciplines.”

    A recent report on the future of the statistical sciences says, “Statisticians, with some prominent exceptions, also have been unwilling or unable to communicate to the rest of the world the value (and excitement) of their work.” This sentence hints at the consequences we may face if we do not act quickly: We may never have existed in the eyes of many and our contributions may be reinvented and re-packaged in a different field.

    Also, the report confirms challenges we have never faced in the past: “Undoubtedly the greatest challenge and opportunity that confronts today’s statisticians is the rise of Big Data.” While some think these trends will “eventually fade,” as the data mining movement of the ’90s did, I believe there is too much economic interest for them to simply fade away. If the number of analytics software packages and languages is any indication of things to come, this movement is hardly going to fade.

    I am convinced that despite the misguided direction and pitfalls, the focus and attention on Big Data (or data science) is mostly a good thing. Sure, Big Data is not going to change those organizations and research institutions that have been doing this work for decades. However, it will inevitably bring a more evidence-based approach to the way companies do business and the government makes policies. This progress, however, may come at a price.

    Statistical certification is largely unrecognized outside academic and research institutions. I suspect this was meant to protect us from the very improvised statisticians who contributed to the bad image. It might have worked had things stayed the same. I think we are falling victim to the complacency of our own culture. Perhaps ASA Past President Robert Rodriguez saw this coming when he suggested we use the big tent approach.

    Doing nothing and hoping problems will fade away is not a good strategy.

    First, this is going to hurt us because we cannot properly assert our knowledge and contributions against parallel fields with a much more rapid mechanism of spreading new ideas (e.g., conference proceedings are typical in CS/ML vs. peer-review in statistics).

    Second, our lack of visibility in other fields may deprive our departments and professors of needed funding and recognition.

    Third, we have been unable or unwilling to prepare the next generation of applied statisticians for a workplace that might change substantially. At present, statistics departments are reluctant to incorporate feedback from applied statisticians in the field. Applied statisticians must finish their basic training after graduate school.

    Fourth, a multitude of certifications are now being established to capitalize on the recent data movement. Should we not be at the forefront of this? Shouldn’t our certifications be the most highly regarded, owing to our nearly 300-year history? INFORMS (an operations research organization) is aggressively pushing its certification, CAP, which is establishing itself as the certification for analytics. A quick scan of its content reveals it covers a blend of data management and data analysis.

    There are multiple ways we can become more engaged. At a minimum, acknowledging and talking about these issues is a first step. Here are a few ideas.

    1. Consider being active on social media. There are numerous venues to show the rest of the world the value and excitement of our work: Stack Exchange, LinkedIn, Twitter, Facebook, Quora, and the many fora specific to statistical software packages are some of the most obvious choices. I am part of a team founding About Data Analysis (ADA), a new LinkedIn discussion group specific to data analysis issues.
    2. Consider stepping outside of your comfort zone. For example, many of the methods we commonly use are now being used in other fields (e.g., survival analysis in marketing). Why not speak at conferences outside your sandbox to those who are starting to use the very methods we have mastered?
    3. Consider making some of your work openly available. Write a blog or an open-access paper. If a paper was not accepted at a journal, why not make it freely available?
    4. If you teach, consider approaching your department about making video tutorials. Look at the work of Jeff Leek and Roger Peng for examples.
    5. If you have videos of your conference presentation, make them available.
    6. As a profession, we should explore diversifying our certification programs or joining forces with similar and reputable professional organizations.

    As a profession, we need to have the courage to look outside the wall that has so far protected us from unscrupulous intruders. As Randy Bartlett wrote in Amstat News, “[T]o differentiate our value proposition, we must be involved.” We need to involve ourselves with other parallel fields, learn about their problems, and share existing solutions. This does not mean lowering our standards for rigorous results. We cannot defend our profession and retain our current customers by building walls meant to keep the barbarians out. We need to empower our applied statisticians with certification and more applied training. Furthermore, we need to build bridges to support their entrance into other fields.

    Editor’s Note: A version of this article was published in the August 2014 issue of Liaison, the newsletter of the Statistical Society of Canada.


    5 Comments »

    • arnold goodman said:

      on right track !!! see article in wires computational statistics …

    • Randy Bartlett said:

      This is an excellent article, properly conveying the view from the field. This is about branding. It is harmful to applied statisticians in the field to have other statisticians behave as if data mining, data science, machine learning, and future rebrandings are somehow different fields from statistics. Our purview in the field is to analyze data, and we cannot keep data analysis out of these rebrandings.

      Mathematics consists of numerical tools for deducing from complete numbers. Built on top of that is statistics, which consists of numerical tools to infer from partial information. In the field, splitting statistics into machine learning statistics and statistics is arbitrary and should be seen for what it is, a transparent grab for who gets to analyze the data, cheered on by those who were previously excluded to some degree. This is becoming a greater issue now because software is so user friendly that everyone can run a complex algorithm, which is far beyond their comprehension.

      We need to train and certify people who are competent to analyze data; what we should avoid is certifying whether people are competent to publish papers.

    • Kel Z said:

      Perhaps more data scientists in the next generation.

      Now in UK:

      http://www.pri.org/stories/2014-09-25/reading-math-and-javascript-coding-now-mandatory-english-schools

      All children between the ages of 5 and 16 in English public schools are now learning computer science — not just how to use software, but how to create it, too.

      Teenagers will have to master at least two programming languages: Java and Python.

    • Statistics vs. Computer Science: A Draw? | Only Best News said:

      […] science lacking (statistical) theory, and the other one is from Thomas Speidel, who asks “Time to Embrace a New Identity?“, largely lamenting that statistics is not embracing new technologies and […]