The ASA and Big Data

1 June 2013
This month’s column is a team effort. President-elect Nat Schenker and Past President Bob Rodriguez join me in announcing a strategic initiative for the ASA.

Contributing Editors

    Nathaniel Schenker

    Marie Davidian

    Robert Rodriguez

    As Bob discussed in his June 2012 column, Big Data is a Big Topic. It is almost impossible to avoid the daily barrage of media accounts, conference announcements, and events such as the recent Big Data Week focused on Big Data. Last year, President Obama announced a major Big Data research and development initiative and, last month, the White House hosted a Big Data workshop. The National Institutes of Health created the position of associate director for data science, and a new book—Big Data: A Revolution That Will Transform How We Live, Work, and Think—which explores the explosion of digital information, has received extensive press coverage.

    Big Data are data on a massive scale in terms of volume, intensity, and complexity, and their promise for transforming business, health care, scientific discovery, public policy, and a host of other areas has been proclaimed widely. But, despite the enormous potential for contributions by statisticians, our profession and the ASA have not been very involved in Big Data activities. We are often missing from Big Data discussions in the media.

    There are three reasons for this disconnect. First, the media and public lack a general understanding of what statisticians contribute to society (the issue that motivated the International Year of Statistics). Second, few statisticians are engaged in Big Data projects or have the special skills necessary to handle Big Data challenges.

    Third, the statistical community is disconnected from the new (and vaguely defined) community of data scientists, who are completely identified with Big Data in the eyes of the media and policymakers. Data science is frequently described as an amalgam of computer science, mathematics, data visualization, machine learning, distributed data management—and statistics. Data scientists must be innovative modelers and programmers; they also must be exceptional communicators who have a deep understanding of the problem domain and can formulate key questions, uncover novel insights, and use this information to guide high-impact decisionmaking. Other disciplines have been quick to identify themselves with data science and are routinely featured in media accounts. Although statistics is mentioned in passing, statisticians are nearly invisible.

    Ideally, statistics and statisticians should be the leaders of the Big Data and data science movement. Realistically, we must take a different view. While our discipline is certainly central to any data analysis context, the scope of Big Data and data science goes far beyond our traditional activities. As Bob noted in his column, the sheer scale and velocity of the data being generated from multiple sources require new data management and computational paradigms. New techniques for analysis and visualization must be developed. And communication and leadership skills are critical.

    We believe we should focus on what we need to do as a profession and as individuals to become valued contributors whose unique skills and expertise make us essential members of the Big Data team. The ASA is already providing opportunities for statisticians to hone their communication and leadership skills. Through Bob’s career success factors initiative, discussed in his October 2012 column, a high-quality presentation skills course is now available. And Nat has proposed development of a leadership skills course in 2014. We likewise must take steps to enhance our profession’s role in Big Data practice. We know statistical thinking—our understanding of modeling, bias, confounding, false discovery, uncertainty, sampling, and design—brings much to the table. We also must be prepared to understand other ways of thinking that are critical in the Age of Big Data and to integrate these with our own expertise and knowledge.

    We have had many discussions—among ourselves and with ASA members who are familiar with Big Data—about strategies for achieving this preparation and integration. These discussions have led to our joint ASA presidential initiative to establish the statistical profession as a valued partner in Big Data activities and to position the ASA in a proactive and facilitating role. The goal is to prepare members of our profession to collaborate on Big Data problems. Ultimately, this preparation will bridge the disconnect between statistics and data science.

    We recognize we cannot tackle the breadth of this challenge all at once. Accordingly, we have launched three projects that focus on the knowledge base—beyond fundamental statistical training—that statisticians need to succeed in Big Data efforts.

    Curriculum Development

    A workgroup will be formed to identify issues, approaches, and models for curriculum development in statistics programs that equip students with the knowledge and experience needed to work in Big Data applications. A panel session will be developed for JSM 2014 that will discuss the findings and present recommendations. The workgroup and panel will include academic representatives involved in introducing Big Data into their curricula, together with government and business leaders who are hiring the Big Data work force. The workgroup will develop a report summarizing these discussions and disseminate it to the profession. The report will serve as a roadmap for integrating Big Data skills and knowledge into statistical training.

    Engagement with External Stakeholders

    The ASA will sponsor a series of one-day meetings, each involving leaders at the forefront of some aspect of Big Data in which statisticians and the ASA are not engaged, along with ASA representatives interested in pursuing Big Data initiatives. For example, a meeting could be held in Silicon Valley with Big Data leaders from the business and technology sectors; another could take place in Washington, DC, with Big Data stakeholders in government. A major goal is to develop networks that will both help the ASA better understand the Big Data knowledge that interested statisticians must gain and promote statistical thinking among Big Data leaders. The ASA participants will recommend next steps toward bridging the “disconnect.”

    Continuing Professional Development

    The ASA will offer short courses in text analytics for interested statisticians at the Conference on Statistical Practice and JSM in 2014. As Bob discussed in his June 2012 column, an understanding of how to acquire and analyze unstructured text data is critical for Big Data work because so much data arise from sources such as electronic health records and social network interactions. To develop these courses, it will be necessary to identify the specific training that would most benefit statisticians and to collaborate with outside experts in natural language processing and text analytics.

    Work on these activities has already begun. This initiative will form the foundation for a continuing strategy focused on Big Data beyond 2014 that will highlight the value statistics can bring to Big Data and engage statisticians in successful collaborations.

    9 Comments

    • RP said:

      Great news! Did you consider creating a new journal (just ASA or ASA + another association) entirely devoted to Big Data? Or at least a section in a major statistics journal? Or, maybe, a section in a popular magazine like Significance? These could help bring some attention to us.

    • Thomas Speidel said:

      Great article. I think it is safe to say that for the most part statisticians have been snubbing big data/data science the same way we snubbed data mining a decade or more ago. But while data mining remained low on people’s and companies’ radars, big data is now everywhere and it’s here to stay. It’s clear that this attitude is not constructive and will just reinforce the image problem statisticians have had in many fields.

      I am quite happy to see that efforts are being taken to involve us in big data because I totally agree with the authors in that we have a lot to contribute.

      I would add a fourth reason behind the disconnect between statisticians and big data: commercial interests. A lot of the discourse around big data is biased towards software solutions aimed at storing and managing massive amounts of data. Some of the commercial focus is also aimed at one-stop analysis packages that are supposed to make it easier for the end user to engage in data analysis. Statisticians who are used to sampling and a limited number of software packages naturally are not too interested in automated solutions.

      The fact that big data or data science is not well defined is not a reason for us not to be involved. ML folks have been much more involved than us in this field and are clearly contributing to the overall methodological direction of everything that is data science. Statisticians have just as much to contribute (think of how applicable statistical genetic methodologies could be towards big data): is there something we can learn from other fields? Can we contribute more work on developing methods that specifically address big data challenges? Can we write more papers that compare the advantages and disadvantages of, say, logistic regression vs. classification trees in the context of big data?

    • David Madigan said:

      ASA does have a big data journal! Statistical Analysis and Data Mining

    • Vincent Granville said:

      Much of machine learning is about clustering with training sets. Sure it’s about automating and big data, but I believe it is the case for statistics too. At least it was when I completed my PhD in 1993. My research was in the department of statistics, not operations research, computer science or engineering.

      So in my opinion, machine learning is a subset of statistics, but not the kind of statistics that ASA represents. As a data scientist, I do a lot of statistics. Indeed I am a statistician, just not one that does stuff similar to what ASA statisticians do. I also have expertise in sampling, DOE, cross-validation, model-free confidence intervals, etc. Anyone can learn these things nowadays; it’s public knowledge.

    • Alex Liu said:

      I am a statistician with my Statistics degrees from Stanford, and I am also a data scientist as I have worked as a chief data scientist for two companies already.

      In my opinion, statisticians need to be open minded and not afraid of new challenges. Then we will see great data science opportunities where statisticians are needed to lead, to help solve feature selection challenges, causality detection challenges …

    • Choi said:

      Glad ASA is taking a stance on Big Data. Undoubtedly, Big Data is empowering our world. Check out http://fop.good.is/ to see how community leaders are using data to inspire change!

    • Randy Bartlett said:

      I am another one of the statistically trained Vincent Granvilles and Thomas Speidels out here. I agree that taking action will bear fruit.

      RE: Second, few statisticians are engaged in Big Data projects or have the special skills necessary to handle Big Data challenges.
      RESP: What? Despite IT’s effort to annex statistics (and you should be careful not to play into their hands), a quant degree like the MS in Statistics remains the gold standard for analyzing data in the corporation (and that includes Big Data and data science). Please do not try to screw this up by listening to the lost media.

      Applied statisticians/business quants/data scientists (our definition) have the skill set to analyze any data; size does not matter. IT does not have these skills. In corporations, IT lies in its own silo separate from business operations, where data analysis occurs. The premise is that you have to understand the business to analyze the data. The IT version of a data scientist is rather limited when it comes to advanced analytics.

      RE: Third, the statistical community is disconnected from the new (and vaguely defined) community of data scientists, who are completely identified with Big Data in the eyes of the media and policymakers.
      RESP: The disconnect remains between many statisticians in the scientific community, whose interests are better represented by ASA, and statisticians in the corporate world. Do not fall for IT’s propaganda about their statistical capabilities. They are just reinventing data analysis for the first time.

      Those of us statisticians in the corporate world are like a colony. ASA can lose us forever by ceding any part of data analysis (predictive modeling, data mining, machine learning, etc.) to IT or by otherwise stepping on us.

    • Ron said:

      This is yet another wake-up call. Bob envisioned “the big tent” as an inclusive arena for applications and development of statistics. This blog is formulating some specifics that can help create it.

      My three cents are that: i) statisticians need to adopt a life cycle view, ii) the impact of statistical work needs to be assessed and iii) statistical work needs to be focused on generating high information quality. Traditionally, Statistics in academia has not been about this. Big data forces us to rethink our role.

      1. The life cycle view is developed in
      “Aspects of statistical consulting not taught by academia” (with P. Thyregod), Statistica Neerlandica, special issue on Industrial Statistics, 60, 3, pp. 396-412, August 2006

      2. The impact analysis is behind the concept of practical statistical efficiency (PSE):
      “Statistical Efficiency: The Practical Perspective” (with S. Coleman and D. Stewardson), Quality and Reliability Engineering International, 19, pp. 265-272, July-August 2003

      3. The topic of information quality (InfoQ) is explained in:
      “On Information Quality” (with G. Shmueli), Journal of the Royal Statistical Society, Series A (with discussion), 176(4), 2013.
      http://ssrn.com/abstract=1464444

      For a more ambitious proposal see: http://ssrn.com/abstract=2171179

    • Frédéric Lefebvre-Naré said:

      Excellent initiative, congratulations! As commenters put it, an academic degree in statistics remains a gold standard to work on big data… yet the tools and way of thinking of big data practitioners are very different from what I learnt as a student in statistics (econometrics, actually). Statistics brings much to data science, as the position puts it (“We know statistical thinking—our understanding of modeling, bias, confounding, false discovery, uncertainty, sampling, and design—brings much to the table.”), yet it has to reinvent / re-translate itself into the digital (and big data) world.

      Like management consulting has to reinvent itself in order to surf the Business Analysis wave, and so on.

      And the three actions you take look like very relevant steps in this direction.