Aren’t We Data Science?
Last month, I shared this column with President-elect Nat Schenker and Past President Bob Rodriguez to announce an ASA strategic initiative to promote engagement of statisticians in Big Data. I’m following that announcement with an account of some of my recent experiences regarding data science, which inspire my enthusiasm for this effort. One in particular serves as a metaphor for the disconnect between statistics and data science we noted last month.
Around the time we were finalizing that column, Michelle Dunn, chair of the ASA Committee on Funded Research, forwarded an email to me. Michelle thought I would be interested in learning from the press release in the email that Eric Green would be speaking in Chapel Hill, North Carolina, 25 minutes from my office in Raleigh, on April 23. In January, the director of the National Institutes of Health (NIH), Francis Collins, announced the creation of a new NIH-wide position, the Associate Director for Data Science (ADDS), to “capitalize on the exponential growth of biomedical research data.” Collins named Green, current director of the National Human Genome Research Institute, as acting ADDS. Green is also co-chair of the search committee charged with nominating the permanent ADDS.
Indeed, I was very interested! But what was even more interesting was the organization that had invited Green to speak. The press release announced “a new collaboration called the National Consortium for Data Science (NCDS) [aiming] to make North Carolina a national hub for data-intensive business and data science research.” It went on to note that the NCDS had been launched at the Renaissance Computing Institute at The University of North Carolina at Chapel Hill (UNC-CH) and included among its founding members businesses, government organizations, and major research universities.
I highlight that last group because, upon locating the NCDS website, I was astonished to review the list of founding members and see that not only is my university (North Carolina State) a founding member, but so are Duke University and UNC-CH, along with SAS Institute; Research Triangle Institute International; NIH’s National Institute of Environmental Health Sciences; IBM; and several other institutions, businesses, and government agencies that employ numerous statisticians. The member representatives listed on the website from NC State, Duke, and UNC-CH are computer scientists/engineers, and among all 17 representatives, there is not one statistician.
Until I saw that email, I had no idea that the NCDS even existed. A quick check with my department head, others in my department, and statistician friends at the other institutions listed (including Bob at SAS) revealed that none of them had heard of it, either. I later learned that, of the 80 or so individuals participating in the invitation-only NCDS Leadership Summit on “Data to Discovery: Genomes to Health” for which Green was the keynote public speaker, only two were affiliated with an entity with the word “statistics” in its name (and are known to me to be trained as statisticians).
I tell this story not to take issue with the formation of the NCDS, but because it is reminiscent of stories and comments I have heard from many of you.
As we discussed in June, the field of data science has commanded considerable attention in the media and among business and science leaders. It is described as a blend of computer science, mathematics, data visualization, machine learning, distributed data management—and statistics. A New York Times article in April reported that centers and institutes devoted to data science and Big Data are being created and curricula and certificate and degree programs are being developed at a number of universities.
Many of you have expressed concern that these and other data-oriented initiatives have been or are being conceived on your campuses without involvement of or input from the department of statistics or similar unit. I’ve been told of university administrators who have stated their perceptions that statistics is relevant only to “small data” and “traditional” “tools” for their analysis, while data science is focused on Big Data, Big Questions, and innovative new methods. I’ve also heard about presentations on data science efforts by campus and agency leaders in which the word “statistics” was not mentioned. On the flip side, I have heard from statistics faculty frustrated at the failure of their departments to engage proactively in such efforts.
In fact, some of you have asked directly the question that forms the title of this column.
I decided to contact a statistician who is at the forefront of data science to get her thoughts about the challenges (and opportunities) these developments pose for our discipline and how we might confront them. Rachel Schutt, who is featured in the Times article cited above, earned her PhD from the department of statistics at Columbia University, where she is an adjunct faculty member. Upon graduation, Rachel took a position at Google, where she became acquainted with the scope, practice, and jargon of data science before moving to her current position at Johnson Research Labs. In fall 2012, she taught “Introduction to Data Science” for the Columbia statistics department and is co-author of a book, Doing Data Science, summarizing the course. I encourage you to visit the course website and read Rachel’s blog about the evolving course activities.
Rachel generously spent well over an hour sharing her perspectives with me; I summarize our discussion of only a few key topics here.
Data science is here to stay, Rachel says. There may be a lot of “hype,” but that might not be bad if it attracts talented people to work on data-driven problems. And to statistics. Statistics has enormous potential to contribute to data science. There are open research problems requiring that classical statistical methods in sampling, design, and causal inference be “scaled up” to be feasible with massive data sets. Few of the computer scientists and others who dominate the data science landscape are well-versed in these concepts, and many take an “algorithmic” view of data analysis. Data science needs statistical thinking and new foundational frameworks—for example, what is the “population” when one confronts the Big Data generated by Google?
In fact, many businesses are beginning to collect data prospectively for internal testing and validation, and there is little appreciation for the power of design principles. Statisticians could propel major advances through development of “experimental design for the 21st century”!
What skills does a statistician need to engage in data science activities, and how should we be preparing statistics students? In addition to a strong foundation in statistical theory, methods, and software, statistics students should develop deep proficiency in programming, Rachel says. Coding skills in R and in Python, including the use of Python as a scripting language, should be part of any modern statistics curriculum. And statisticians must appreciate issues and tools associated with parallel computing, combining data from disparate sources, and handling textual and streaming data.
Familiarity with data visualization techniques and popular tools like D3.js would be ideal and could enliven curricula and projects. Exploratory data analysis, which is generally not taught formally in many statistics programs, should be emphasized. Training in machine learning methods also is key. Not to mention communication skills.
Rachel stressed the importance of exposure to “real world” problems; the disconnect between curriculum and the “messiness” of the real world is greater than it has ever been. She advocates engaging local businesses and research organizations to present case studies to students, as she did in her course. Not only will this acquaint students with what they might confront, but such interactions can also forge connections that can inspire needed statistical research.
What can we do as individuals, a profession, and an association to address the concerns noted above? Rachel’s thinking? Sponsor and attend events that bridge disciplinary boundaries and afford opportunities to interact with scientists with massive data problems, such as the University of California at Davis 2013 Statistical Sciences Symposium: Analysis of Complex and Massive Data. The ASA could make a big impact by sponsoring or collaborating in a conference on statistics and data science featuring top data scientists and statisticians as speakers.
Participate in data science Meetup groups, or consider forming one. There are scores of these in San Francisco; Washington, DC; New York; Boston; and elsewhere. We statisticians should seek these out and attend and offer to speak, and we should encourage our students to do likewise. In fact, Rachel and several colleagues have started The NYC Data Skeptics Meetup, which focuses on all aspects of data from a “skeptical perspective” on the hype surrounding Big Data and data science.
Statisticians in academia interested in engaging in data science should seek sabbatical opportunities in industry, and departments should reach out to industry data scientists and invite them to present seminars, contribute to the curriculum, and serve as adjunct faculty. Departments can propose partnerships with computer science, operations research, and other disciplinary units on campus to develop and team-teach courses and to sponsor joint seminars and working groups. Such interactions will reveal areas in which statistical research is needed.
Rachel noted in closing that she fears academic departments of statistics could be viewed as obsolete and be phased out over the next decade if we do not evolve to embrace this challenge—data science is not going away. She suggests we ask ourselves, “How would you feel if there were no departments of statistics 50 years from now?” It is essential that we confront this head-on; otherwise, the many philosophical issues data science presents demanding deep statistical thinking will not be addressed.
I am grateful to Rachel for sharing her candid views with me. She has convinced me that the ASA Big Data initiative is an essential step toward addressing some of these challenges at the association level, laying the groundwork for curriculum enhancements, significant engagement with stakeholders, and professional development. We aren’t data science, but we have a critical role to play. I encourage you to consider steps you can take locally to raise awareness of the importance of statistics in data science.