ASA Statement on the Role of Statistics in Data Science
David van Dyk, Imperial College (chair)
Montse Fuentes, NCSU
Michael I. Jordan, UC Berkeley
Michael Newton, University of Wisconsin
Bonnie K. Ray, Pegged Software
Duncan Temple Lang, UC Davis
Hadley Wickham, RStudio
The rise of data science, including Big Data and data analytics, has recently attracted enormous attention in the popular press for its spectacular contributions in a wide range of scholarly disciplines and commercial endeavors. These successes are largely the fruit of the innovative and entrepreneurial spirit that characterizes this burgeoning field. Nonetheless, its interdisciplinary nature means that a substantial collaborative effort is needed for it to realize its full potential for productivity and innovation. While there is not yet a consensus on what precisely constitutes data science, three professional communities, all within computer science and/or statistics, are emerging as foundational to data science: (i) Database Management enables transformation, conglomeration, and organization of data resources, (ii) Statistics and Machine Learning convert data into knowledge, and (iii) Distributed and Parallel Systems provide the computational infrastructure to carry out data analysis.
Certainly, data science intersects with numerous other disciplines and areas of research. Indeed, it is difficult to think of an area of science, industry, commerce, or government that is not in some way involved in the data revolution. But it is databases, statistics, and distributed systems that provide the core pipeline. At its most fundamental level, we view data science as a mutually beneficial collaboration among these three professional communities, complemented with significant interactions with numerous related disciplines. Realizing the full potential of data science requires maximum and multifaceted collaboration among these groups.
Statistics and machine learning play a central role in data science. Framing questions statistically allows us to leverage data resources to extract knowledge and obtain better answers. The central dogma of statistical inference, that there is a component of randomness in data, enables researchers to formulate questions in terms of underlying processes and to quantify uncertainty in their answers. A statistical framework allows researchers to distinguish between causation and correlation and thus to identify interventions that will cause changes in outcomes. It also allows them to establish methods for prediction and estimation, to quantify their degree of certainty, and to do all of this using algorithms that exhibit predictable and reproducible behavior. In this way, statistical methods aim to focus attention on findings that can be reproduced by other researchers with different data resources. Simply put, statistical methods allow researchers to accumulate knowledge.
If statisticians are to help meet the considerable challenges faced by data scientists, they need a sustained and substantial collaborative effort with researchers who have expertise in data organization and in the flow and distribution of computation. Statisticians must engage them, learn from them, teach them, and work with them. Engagement must occur at all levels: with individuals, groups of researchers, academic departments, and the profession as a whole. New problem-solving strategies are needed to develop “soup to nuts” pipelines that start with managing raw data and end with user-friendly, efficient implementations of principled statistical methods and the communication of substantive results. Statistical education and training must continue to evolve: the next generation of statistical professionals needs a broader skill set and must be more able to engage with database and distributed systems experts. While capacity is increasing within existing and innovative new degree programs, more is needed to meet the massive expected demand. The next generation must include more researchers with skills that cross the traditional boundaries of statistics, databases, and distributed systems; there will be an ever-increasing demand for such “multi-lingual” experts.
Working with statisticians, departments of statistics, and other professional societies, the American Statistical Association (ASA) is well positioned to help formulate discussion around the role of statistics in data science, to navigate the way forward in this quickly evolving environment, and to provide forums for communication and collaboration among data scientists, including statisticians and nonstatisticians alike. The ASA aims to facilitate collaboration between statisticians and other data scientists and thus enable them to achieve more than they could on their own.
See Ron Wasserstein’s blog entry on this statement and the press release about the statement.
See Pam Baker’s commentary on this ASA statement in Fierce Big Data.
You get an A for writing, ‘statisticians and other data scientists,’ instead of ‘statisticians and data scientists’ or ‘statistics and machine learning.’ By the way, everything in (Statistical) ML for analyzing data is statistics.
It seems to me that Statistics has not really recognized the importance of computing and data management for a long time. In the mid-1970s my first full-time job was at a statistical consulting firm. I was hired because of my programming skills and statistical experience as a student programmer in the Physics department. At that time the people involved in the statistical side of the business were mostly Physicists and Mathematicians. Only the CEO and one other senior staffer had Statistics degrees (PhD and MS). We did not use packages and programmed everything in FORTRAN. We were much more like the Data Scientists of today than most Statisticians are. With packages, much of the programming work became trivial and that side of the equation was lost.
I ended up taking a Computer Science degree when I went back to school, but I also took all the upper-level Math and Statistics courses because they were relevant to the R&D work I was doing. I started an MS in Applied Statistics, which was interrupted and which I have restarted recently (after about 20 years). I opted then, and opt now, for a traditional Statistics program. I already have the relevant computing and data management skills. When I started I was at Villanova. Interestingly, in the early 1990s, they had a variation of the MS in Applied Statistics geared toward computation. Out of ten courses, the computational variation included two or three courses in Computer Science and one in Electrical Engineering (Computer Architecture). I noticed recently that the University of Chicago Statistics Department is adding a number of faculty positions and stressing computational aspects of statistics.
I really think that this is the way forward. In all other sciences computing has been important for a very long time, as has statistics. Emphasizing the computational aspects is critical at this time, as the current controversy shows.
Well said!
It is important for the statistical community to recognize its central role in this endeavor, and to help others realize how our understanding of stochastic processes and uncertainty allows us to not just give “best answers” but also measures of uncertainty about those best answers. Often, these measures of uncertainty are as important as the best answers themselves in helping end users make decisions based on what they learn from the data.
Louis,
RE: It seems to me that Statistics has not really recognized the importance of computing and data management
RESP: One statistics culture does not deal with data management. Applied statistics has valued computing and data management as the two have grown together in the field.
Amstat News is the monthly membership magazine of the American Statistical Association, bringing you news and notices of the ASA, its chapters, its sections, and its members. Other departments in the magazine include announcements and news of upcoming meetings, continuing education courses, and statistics awards.