
Stats4Good: Big Data Methods for Data for Good

1 June 2020

With a PhD in statistical astrophysics, David Corliss leads a data science team at Fiat Chrysler. He serves on the steering committee for the Conference on Statistical Practice and is the founder of Peace-Work, a volunteer cooperative of statisticians and data scientists providing analytic support for charitable groups and applying statistical methods in issue-driven advocacy.

As part of a continuing series on technology to support D4G projects, we’re going to look at the big data revolution and how it affects our approaches to problem-solving, as well as examples of how big data is being applied in Data for Good.

First, let me offer a few words on the language used. Three important terms—big data, machine learning (ML), and artificial intelligence (AI)—are often used in an imprecise, overlapping, and even conflicting manner. For the sake of clarity, I will offer definitions for the purposes of this column. As a physicist, I like operational definitions, so here goes.

For machine learning and artificial intelligence, I will offer the convention of using ML for the algorithms and AI for the decisions they make. For example, an ML algorithm estimates the risk associated with a loan application, while AI decides whether the application is approved. Here, ML refers to the mechanized statistical calculation and AI to the decision made from its result.
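To make the convention concrete, here is a minimal sketch in Python using scikit-learn. The fitted model that estimates risk is the ML; the rule that turns that estimate into an approve/deny action is the AI. The features, training data, and 30 percent cutoff are all invented for illustration, not taken from any real lending system.

```python
# Minimal sketch of the ML/AI distinction (illustrative only).
# ML step: a fitted model estimates the risk of default for an application.
# AI step: a decision rule turns that estimate into an approve/deny action.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical training data: two financial features vs. past default (0/1)
X_train = rng.normal(size=(500, 2))
y_train = (X_train[:, 1] - X_train[:, 0]
           + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = LogisticRegression().fit(X_train, y_train)    # ML: the algorithm

def decide(application, threshold=0.30):              # AI: the decision
    """Approve the loan when estimated default risk is below the threshold."""
    risk = model.predict_proba(application.reshape(1, -1))[0, 1]  # ML output
    return "approved" if risk < threshold else "denied"           # AI action

print(decide(np.array([1.2, -0.4])))   # a low-risk applicant -> approved
```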

Big data is often nebulously defined. What exactly constitutes big? For this column, I use small data to refer to anything that can be done on a desktop or laptop. Medium data requires a server, while big data is anything a conventional server can’t handle. The properties involved are called the “Vs” of big data: volume, the sheer size of the data; velocity, how fast it must move and/or be computed; and variety, usually the number and types of fields. Other Vs are sometimes suggested, but they aren’t distinctive qualities of big data because they don’t force a change in data architecture or computing environment once they pass a certain threshold.

ML algorithms often show their greatest power when applied to big data. This is because many use boosting, an ensemble technique that combines a large number of weak predictors into a single stronger one.
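As a sketch of the idea, the snippet below uses scikit-learn's AdaBoostClassifier to combine 200 one-split decision stumps on a synthetic dataset. The data and parameters are arbitrary, chosen only to show a weak learner being strengthened by boosting.

```python
# Sketch: boosting combines many weak predictors into a stronger one.
# Each weak learner here is a one-split decision stump; AdaBoost reweights
# the training data after each round so later stumps focus on past mistakes.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

stump = DecisionTreeClassifier(max_depth=1)     # a single weak predictor
boosted = AdaBoostClassifier(stump, n_estimators=200, random_state=0)

print("one stump:", cross_val_score(stump, X, y).mean())
print("boosted  :", cross_val_score(boosted, X, y).mean())
```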

A single decision tree can be used on small data, like the well-known Titanic example, but often has limited predictive power. Ensemble methods such as Leo Breiman and Adele Cutler’s Random Forests algorithm and gradient boosting, which has won many competitions but can sometimes over-fit the data, combine thousands of weak decision trees to strengthen prediction. These are often applied in Data for Good situations in which there is a large number of individuals, such as homeless veterans, or repeated observations of a smaller group (e.g., using gradient boosting to mine medical records to assess propensity for a severe case of COVID-19). The latter qualifies as big data for the sheer volume of the data, the huge variety of fields in medical records, and the large computing power required.
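Here is a rough, hedged comparison of the two ensembles named above, again on synthetic data standing in for a wide, noisy record set (not real medical records):

```python
# Sketch comparing the two tree ensembles on synthetic data.
# A random forest averages many deep trees grown on bootstrap samples;
# gradient boosting adds shallow trees sequentially, each correcting the last.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a wide, noisy record set (not real medical data)
X, y = make_classification(n_samples=5000, n_features=50, n_informative=10,
                           random_state=0)

models = [
    ("random forest    ", RandomForestClassifier(n_estimators=200, random_state=0)),
    # Gradient boosting can over-fit; in practice, tune n_estimators and
    # learning_rate against held-out data.
    ("gradient boosting", GradientBoostingClassifier(n_estimators=200, random_state=0)),
]
for name, model in models:
    print(name, cross_val_score(model, X, y).mean())
```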

One of the most common uses of big data methods is classification, producing a discrete outcome. These outcomes are often binary, such as the Kaggle competition identifying households in need of assistance, but the algorithms also work for ordinal (1, 2, 3, …) and categorical (A, B, or C) outcomes. Support vector machines (SVMs) classify by moving the problem to higher dimensions (more independent variables), where the distinction between groups becomes clearer. SVM often works well for complex problems when simpler ML classification algorithms don’t offer sufficient separation between the final classes. This makes SVM a popular choice for ensemble modeling, as in predicting the type of crime most likely to occur at a given time and place.
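The toy example below, a minimal sketch with scikit-learn, shows the effect: two concentric rings of points that no straight line can separate, where the RBF kernel's implicit mapping into a higher-dimensional space lets the SVM find a clean boundary.

```python
# Sketch: an RBF-kernel SVM separating classes that aren't linearly separable.
# The kernel implicitly maps the two input features into a much higher-
# dimensional space, where a linear boundary between the groups exists.
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Two concentric rings: no straight line in the original 2-D space works
X, y = make_circles(n_samples=1000, noise=0.1, factor=0.4, random_state=0)

print("linear kernel:", cross_val_score(SVC(kernel="linear"), X, y).mean())
print("RBF kernel   :", cross_val_score(SVC(kernel="rbf"), X, y).mean())
```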

Artificial neural networks are a class of ML algorithms that try to mimic how the human brain learns. The algorithms use multiple layers of nodes, each node performing a simple operation on its input data and sending the output to the next layer, with each node connected to many others. Deep learning uses many such layers to enhance prediction accuracy and detail. Neural nets can be a powerful tool in Data for Good, and researchers studying the COVID-19 pandemic immediately put them to use for image analysis of X-rays to improve detection of the disease.
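As a sketch of the layered-nodes idea, here is a small feed-forward network (scikit-learn's MLPClassifier) on the built-in 8x8 digit images. Real X-ray applications use far deeper convolutional networks, so this only illustrates the architecture.

```python
# Sketch of a small feed-forward neural network: layers of nodes, each
# applying a simple weighted transformation and passing the result onward.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)        # 8x8 grayscale digit images
X_train, X_test, y_train, y_test = train_test_split(X / 16.0, y, random_state=0)

# Two hidden layers of 64 and 32 nodes between the 64 inputs and 10 outputs
net = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
net.fit(X_train, y_train)
print("test accuracy:", net.score(X_test, y_test))
```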

Applications built on the social media firehose often require multiple big data capabilities at once, from large data and computational volumes to complexity and the high velocity needed to deliver real-time results. One application tracks hate speech on social media in real time. Another uses text analytics of social media content to fight human trafficking. Social media applications are one of the most promising and rapidly evolving areas in Data for Good.
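A hedged sketch of the kind of pipeline these applications need, assuming a recent version of scikit-learn: a hashing vectorizer requires no vocabulary pass over the stream, and an online classifier updates itself batch by batch as posts arrive. The handful of labeled posts below are invented placeholders, not a real hate-speech or trafficking dataset.

```python
# Sketch of a streaming text classifier suited to high-velocity feeds:
# HashingVectorizer needs no vocabulary pass (new posts can arrive forever),
# and SGDClassifier.partial_fit updates the model one mini-batch at a time.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18)
classifier = SGDClassifier(loss="log_loss", random_state=0)

# Invented placeholder batches, standing in for a labeled real-time stream
batches = [
    (["example of a flagged post", "another flagged example"], [1, 1]),
    (["ordinary post about sports", "photo of my lunch today"], [0, 0]),
]
for texts, labels in batches:                  # simulate the stream
    classifier.partial_fit(vectorizer.transform(texts), labels, classes=[0, 1])

print(classifier.predict(vectorizer.transform(["new incoming post"])))
```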

While big data methods offer great promise, they also have special challenges. Ethical concerns can be magnified since analytic methods can create new ways to identify and track people. Lack of transparency, where it is not possible to know with certainty how an AI decision is made, has caused serious problems in some areas, including criminal justice applications. As models have become more complex, people can be even more likely to rely on model results alone, instead of applying them in context and with a full understanding of their strengths and weaknesses. An increasingly important area of Data for Good is recognizing and intervening when analytics are misused with harmful results. The ethical scientist will always be on guard against bias, violations of data privacy, and the potential for harm and misuse.

The big data revolution isn’t just about how the numbers have gotten larger. It’s really a paradigm shift, a fundamental change in our thinking and how we approach problems. Whether through distributed data and edge computing, new algorithms to handle more volume and variety, real-time analytics, or other emerging methods, the focus is on how the actions we need to take have changed. The real story of big data is one told by the verbs—all the ways we are responding to today’s challenges that traditional server environments can’t support. It’s a story told by many new uses and applications as emerging technology continues to reshape the world of Data for Good.

JSM is going virtual this year, but the theme is still Everyone Counts: Data for the Public Good, making it the leading D4G event in 2020.

Another great Data for Good opportunity is the annual meeting of the American Association for the Advancement of Science. The 2021 conference will be February 11–14 and entirely online. The call for submissions is open through July 14.

