Leanna House and Scotland Leman, Virginia Tech
In the July issue of Amstat News, 2013 ASA President Marie Davidian put on paper issues, concerns, and questions many statisticians have about the emerging field called data science. In the context of Big Data, data science, just like statistics, is fundamentally a discipline devoted to researching methods for finding information in data. Yet, for the sake of data science, not statistics, institutes are being built, grants are being solicited, and industries are investing. In fact, statisticians are often excluded from current data science initiatives. Why? What needs to change so we are included? Davidian, in collaboration with Rachel Schutt of Johnson Research Labs, suggested specific actions that statisticians at universities can take to be included in future initiatives. In particular, they advocated that statistics departments embrace industry more (e.g., statistics faculty should attend and sponsor conferences that industries attend, pursue sabbaticals at Big Data industries, and invite researchers from industry to present at colloquiums). Although including industry in statistics departments is a worthwhile action, we think a fundamental shift in how statisticians think is also needed. We must be willing to change our analytical attire, so that we can play with data scientists in the playground of Big Data.
As statisticians, our analytical clothes are neat and tidy. We pride ourselves on slick mathematical theory (typically based in probability theory), “objective guarantees,” general wisdom, and our willingness to be included in collaborative projects. Yet, Big Data problems addressed by data science are, as Schutt pointed out, messy. In fact, they are downright dirty!
To play effectively with others and be invited to the playground of Big Data while maintaining our identity as statisticians, we need two sets of analytical clothes: those we don’t mind getting dirty and those we keep clean. Theorems and proofs are concise, precise, and, ultimately, clean, whereas big data sets are unstructured and messy; they seldom adhere to our mathematical assumptions. We cannot stay clean and get dirty at the same time. Just as relying on one simple dress code doesn’t work in daily life, it won’t continue to work in departments of statistics. Different clothing is appropriate for different occasions.
To extend our closets, we must 1) acknowledge and value differences between industry and academia and 2) change how we educate undergraduate and graduate students. While the goals of industry and academic departments are symbiotic, they are inherently different. Analysts in industry aim to solve specific problems that advance their companies, and they value practical skills for solving those problems. Also, the solutions developed do not need to stand the test of time because the problems in industry are constantly changing. New problems in industry require and result in new heuristic solutions.
In academia, statisticians work on advancing general knowledge by developing theory and methods that apply in diverse settings. It is important to us (the academics) to identify, assess, and articulate analytical assumptions to generalize methodology and quantify uncertainty. But doing so takes time that those in industry do not have. Time is money, so to speak. So, to extend our closets, statisticians must accept without prejudice the needs of industry and recognize when to take time for careful analyses and when to get dirty. In turn, departments must value the applied work of faculty when they choose to collaborate with industry.
For example, statisticians who work with industry will publish considerable work in applied journals, rather than top-tier theoretical journals, and their tenure and promotion applications will reflect this. To value industry means to value the work it produces and the statisticians who wear the right clothes to do it.
In regard to education, right now we mostly offer clean clothes to our undergraduate and graduate students in statistics. Yet, it is at universities where we sculpt the field of statistics and impress upon students the scope and relevance of statistics in problem solving. To preserve our place as the primary educators of statistics/data science, we must balance lessons from industry and “traditional” statistics curricula as depicted in Figure 1.
Within the domain of analytical methods, statistics professors at all levels tend to teach the theoretical foundations, whereas those in industry and applied programs may rely heavily on heuristic analytical algorithms without a thorough understanding of their utility or how they relate to other methods. We suggest a balanced, symmetric curriculum so that students wear both clean and dirty analytical clothes.
To demonstrate the balance, consider Data Analytics I, a graduate course taught at Virginia Tech. Data Analytics I is taught jointly by statistics and computer science and is often accompanied by an industrial partner. The class is populated by engineers and statisticians (nearly 50/50) with varying analytical backgrounds. One of the primary objectives is to highlight the skills of both statisticians and computer scientists.
For example, the course starts by surveying a simple, but messy, classification problem. A few teaser techniques are motivated and compared (specifically linear regression and K-nearest-neighbors). Subsequently, a statistical guiding light is shone on the problem in the form of a question: What is the optimal solution? Because questions in optimality are steeped in assumptions, clean statistical solutions are presented. Then, the classification problem is complicated by additional noise, outliers, and/or heterogeneity in the data so the set of assumptions becomes inapplicable and dirty solutions become desirable. In turn, students experience changing analytical clothes and choose the method that is best suited for their analyses. The students learn that the clean and dirty solutions are complementary. Given a comparison of the different sides of the clean/dirty solution spectrum, students are taught to identify the solutions that best meet the demands and goals of the problem at hand.
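As a rough illustration of the kind of exercise described above (a sketch of our own, not the actual course material), the following Python snippet compares a linear regression classifier with K-nearest-neighbors on a clean two-class problem and then on a contaminated version. Everything specific here is an assumption made for illustration: the one-dimensional Gaussian classes, the 10% of class 1 points moved far away to mimic heterogeneity, the choice of k = 15, and all function names.

```python
import random

def sample(n_per_class, outlier_frac, rng):
    """Draw (x, label) pairs: class 0 ~ N(-1, 1), class 1 ~ N(+1, 1).
    The first outlier_frac of class 1 points are drawn from N(50, 1) instead
    (a hypothetical form of contamination chosen for this sketch)."""
    n_out = round(outlier_frac * n_per_class)
    data = []
    for i in range(n_per_class):
        data.append((rng.gauss(-1.0, 1.0), 0))
        center = 50.0 if i < n_out else 1.0
        data.append((rng.gauss(center, 1.0), 1))
    return data

def fit_linear(train):
    """Least-squares line through the 0/1 labels; predict 1 when the fit exceeds 0.5."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxy = sum((x - mx) * (y - my) for x, y in train)
    sxx = sum((x - mx) ** 2 for x, _ in train)
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: 1 if intercept + slope * x > 0.5 else 0

def fit_knn(train, k=15):
    """Majority vote among the k training points closest to x."""
    def predict(x):
        nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
        votes = sum(y for _, y in nearest)
        return 1 if 2 * votes > k else 0
    return predict

def bayes_rule(x):
    """Optimal rule under the clean assumptions only (equal priors, equal variances)."""
    return 1 if x > 0 else 0

def error_rate(predict, test):
    """Fraction of test points whose predicted label differs from the true label."""
    return sum(predict(x) != y for x, y in test) / len(test)

def run(outlier_frac, seed=0):
    """Train on 200 points per class, test on 500 per class from the same distribution."""
    rng = random.Random(seed)
    train = sample(200, outlier_frac, rng)
    test = sample(500, outlier_frac, rng)
    return {
        "linear": error_rate(fit_linear(train), test),
        "knn": error_rate(fit_knn(train), test),
        "clean_bayes_rule": error_rate(bayes_rule, test),
    }

clean = run(outlier_frac=0.0)
dirty = run(outlier_frac=0.10)
print("clean:", clean)
print("dirty:", dirty)
```

On the clean data, the linear fit should land close to the rule that is optimal under the stated assumptions. Once the assumptions are broken, the distant points drag the fitted line away from the bulk of the data, while the local, heuristic method is largely unaffected. Neither outfit is right for every occasion, which is the point of the exercise.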
To conclude, we make one final point about education. Changing how we teach statistics has the added advantage of changing how we are perceived as statisticians. For many students, we have one opportunity—one introductory statistics course—to impress upon them the importance, breadth, and utility of statistics. Yesterday’s students are today’s doctors, lawyers, policymakers, entrepreneurs, and researchers in industry. By changing how we educate—demonstrating the versatility of statistics and providing opportunities to change analytical clothes—we will change the perception of statistics and the role statistics has in future data science endeavors. Let’s get dirty!