Where Does a Statistician Fit in the Big Data Era?
This column is written for statisticians with master’s degrees and highlights areas of employment that will benefit statisticians at the master’s level. Comments and suggestions should be sent to Megan Murphy, Amstat News managing editor, at firstname.lastname@example.org.
Andy Hoegh is a graduate student at Virginia Tech, researching predictive fusion methods for big, messy data sets. He is also a statistical collaborator in Virginia Tech’s Laboratory for Interdisciplinary Statistical Analysis (LISA) and the network outreach coordinator for StatCom.
In an era with a plethora of cheap, vast data in which buzzwords such as Big Data, data analytics, and data mining have been integrated into the common vernacular, it is worth asking where the statistician fits in. Several years ago, I described statistics as the “science of data.” Now, with a more comprehensive understanding of the field and data science referring to something seemingly different from statistics taught in the traditional classroom, I’d like to revisit what statistics is. Specifically, I’ll address the ideal skillset necessary for a modern statistician to be an integral part of this Big Data era.
ASA President Marie Davidian’s article, “Doctoral Training in Statistics and Biostatistics: Where Are We Headed?” reflects on what statistics is through the lens of graduate school training. It’s worth reading. Considering this article and the master’s core curriculum in the statistics department at Virginia Tech provided me with the motivation to understand and describe the ideal tools for a 21st-century statistician.
Ralph O’Brien’s talk on the Completely Sufficient Statistician provides a useful framework for describing the skills and talents of a statistician. O’Brien describes the ideal statistical scientist as one who develops and maintains a broad range of technical skills and personal qualities in the following four domains:
1. Numeracy in mathematics and numerical computing
2. Articulacy and people skills
3. Literacy in technical writing and programming
4. Graphicacy in effective displays of data
Using the structure of O’Brien’s argument, I will describe essential skills residing in the four domains of the ideal modern statistician.
Traditional statistical training on methodology and computing resides in the numeracy component. O’Brien states that the ideal statistician must be sufficiently mature in using mathematics and numerical computing to define and solve real problems. Science and statistics, particularly designed experiments, often go hand in hand, and a thorough understanding of statistics as it relates to the scientific method is still essential.
However, a majority of the data necessary to solve novel problems doesn’t come from the realm of carefully designed experiments these days. Information often is multimodal, messy, and of a form that is not naturally characterized numerically, such as genetic information, images, or text. These challenges require development of methodology and computing tools to make inferences and predictions. Bayesian methods and the associated Markov chain Monte Carlo (MCMC) methods are fruitful for specifying and estimating complicated hierarchical models. Additionally, exposure to non-probabilistic machine learning techniques such as classification trees provides tools that are useful in solving real problems.
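To make the MCMC idea concrete, here is a minimal sketch of a random-walk Metropolis sampler in Python. It is an illustration only: the model (a normal mean with a normal prior and known noise), the data values, and all function and parameter names are assumptions made for this example, and the model is far simpler than the hierarchical models mentioned above.

```python
import math
import random

def log_posterior(mu, data, prior_sd=10.0, noise_sd=1.0):
    """Unnormalized log posterior for a normal mean with a normal(0, prior_sd) prior."""
    log_prior = -0.5 * (mu / prior_sd) ** 2
    log_lik = sum(-0.5 * ((y - mu) / noise_sd) ** 2 for y in data)
    return log_prior + log_lik

def metropolis(data, n_iter=5000, step=0.5, seed=0):
    """Random-walk Metropolis: propose a nearby value, accept or keep the current one."""
    rng = random.Random(seed)
    mu = 0.0  # arbitrary starting value (assumption)
    current = log_posterior(mu, data)
    draws = []
    for _ in range(n_iter):
        proposal = mu + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, data)
        # Accept with probability min(1, posterior ratio)
        accept_prob = math.exp(min(0.0, candidate - current))
        if rng.random() < accept_prob:
            mu, current = proposal, candidate
        draws.append(mu)
    return draws

# Hypothetical observations, invented for illustration
data = [4.8, 5.1, 5.3, 4.9, 5.2]
draws = metropolis(data)
kept = draws[1000:]  # discard early draws as burn-in
posterior_mean = sum(kept) / len(kept)
print(round(posterior_mean, 2))
```

For this simple model the posterior is available in closed form, so the sampler's average can be checked against it. The payoff of MCMC comes in models where no closed form exists.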
Articulacy is not a requirement unique to statisticians, as it is a valuable skill in all occupations and social situations. However, our profession is stereotyped as lacking aptitude in this area, so extra attention is warranted here. Requisite in the articulacy component is the ability to work efficiently in a team environment and communicate with statistical peers and non-technical audiences alike. Defining and solving real problems rarely allows a statistician to retreat and work in isolation. The problem-solving process begins with input from team members, who often include scientists from other disciplines. Understanding their discipline and the scientific question is imperative before constructing a statistical model.
Next, the statistician must be able to explain clearly the methods used to address the problem to both a technical and non-technical audience—often simultaneously.
The final step is communicating the findings with an engaging presentation. The skills of the articulate statistician may not come as naturally as those in the numeracy component or be as heavily stressed in a classical education, but they can be learned. Most graduate programs have some sort of statistical consulting component, such as Virginia Tech’s Laboratory for Interdisciplinary Statistical Analysis (LISA), that provides training and access to relevant problems. Pro bono statistical groups such as StatCom or Statistics Without Borders also are great ways to gain experience and hone articulacy skills.
O’Brien states that literacy includes technical writing and programming. In a basic sense, literacy is defined as the ability to read and write. So, within the context of the statistician, literacy refers to the ability to read and write technical documents and computer code. Hence, when faced with an unknown problem, an ideal statistician would be able to read academic literature to understand existing solutions.
After understanding existing methods, it may be necessary to use numeracy skills to develop improved techniques. Once a technique has been formulated and the model can be written down, computations need to be carried out. Literacy in programming implies the ability to implement calculations using existing software packages and develop new code when necessary.
Technical writing, sharing newly created knowledge via published papers, is the final task. Much like articulacy, literacy is not the main focus of graduate education, particularly for master’s students. Nonetheless, these skills can be learned, and advanced prowess proves extremely beneficial.
As an aside about programming: having worked with computer scientists on a few interdisciplinary projects, I find the difference in programming skills striking. Statisticians often start the process with a neatly ordered CSV file in which each row represents a single observation and the columns correspond to variables. While not a requirement in many statistical settings, the ability to efficiently scrape, sort, and manage data is a liberating skill that the ideal modern statistician possesses.
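As a small illustration of the data-management skill described above, the following Python sketch turns a few messy text records into the kind of tidy CSV a statistician usually starts from. The input lines, field names, and pattern are all hypothetical, invented for this example; real sources would need their own parsing rules.

```python
import csv
import io
import re

# Hypothetical messy input: inconsistent spacing, case, and order of arrival
raw_lines = [
    "id=3; Height: 172cm ; weight:68 kg",
    "ID=1;height:180 cm;Weight: 75kg",
    "garbage line with no fields",
    "id=2 ; height: 165cm; weight: 59 kg",
]

# Assumed record layout: id, then height in cm, then weight in kg
PATTERN = re.compile(
    r"id\s*=\s*(\d+)\s*;\s*height\s*:\s*(\d+)\s*cm\s*;\s*weight\s*:\s*(\d+)\s*kg",
    re.IGNORECASE,
)

def parse(lines):
    """Extract one observation per matching line and sort by id."""
    rows = []
    for line in lines:
        match = PATTERN.search(line)
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        subject_id, height_cm, weight_kg = (int(g) for g in match.groups())
        rows.append({"id": subject_id, "height_cm": height_cm, "weight_kg": weight_kg})
    return sorted(rows, key=lambda row: row["id"])

def to_csv(rows):
    """Write rows as CSV text: one observation per row, one variable per column."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["id", "height_cm", "weight_kg"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

tidy_rows = parse(raw_lines)
tidy_csv = to_csv(tidy_rows)
print(tidy_csv)
```

The unparseable line is silently dropped here for brevity. In practice it would be wiser to log or count skipped records so that data loss does not go unnoticed.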
The final component is graphicacy, which focuses on effective displays of data. Having taught courses on statistical graphics, I find it important to understand the distinction between traditional statistical graphics and data visualization. Statistical graphics are often exploratory and used to inform statistical models. Statistical graphics also can be the endgame, displaying important findings, but typically statistical machinery underlies these displays.
Data visualization, another of those in-vogue buzzwords, includes infographics and displays of data that are often more artistic than typical statistical graphics. Data visualization is focused on telling a story without the principles of statistical inference. The modern statistician has expertise in statistical graphics and competence in data visualization.
The most valuable contribution of a statistician is the ability to contribute to solving real problems. With vast data sources available and numerous complex, interesting, and relevant problems to focus statistical machinery on, it is an exciting time to be a statistician.
This article has explored the components of an ideal modern statistician through the lens of O’Brien’s “completely sufficient statistician.” Using the statistical definition of sufficient, this clever play on words suggests a modern statistician contains all the necessary information (knowledge) to successfully address and answer pressing questions of our era.