Data for Good Tech: Cool Science for Hot Projects

1 September 2020

David Corliss

With a PhD in statistical astrophysics, David Corliss leads a data science team at Fiat Chrysler. He serves on the steering committee for the Conference on Statistical Practice and is the founder of Peace-Work, a volunteer cooperative of statisticians and data scientists providing analytic support for charitable groups and applying statistical methods in issue-driven advocacy.

    Data for Good (D4G) is a place where high technology and deep problems meet. This often involves learning new statistical methods or even helping to develop new ones. One of the great aspects of D4G is that technology never stands still. Trying out new ideas and developing novel applications for established methods are both needed to make the greatest impact, as constant reinvention drives new solutions. For some ideas and inspiration, here are a few examples of where Data for Good tech is leading the way.

    Quantile regression could be a nominee for the most under-used statistical technique in all of analytics. This method fits a separate regression line for each specified percentile of the response (e.g., the 50th percentile line models the median). All the major statistical software packages have it, so you can find code examples in the language you prefer.

    You might hear quantile regression is useful when the assumptions of ordinary regression, such as normally distributed errors, are not met. That’s true, but definitely not the whole story. Even for a perfectly behaved data set with independent, normally distributed errors (pretty sure you’ll never see one outside of a homework problem), simple regression only gives the mean of the response variable. As a result, we can be left without any clear understanding of what happens outside the central tendency.

    In Data for Good, that’s often where the action happens: social concerns, rare events, extreme cases, and other questions where the answer is found far from the mean. We can’t very well model homelessness, toxic waste spills, deaths in police custody, economic one-percenters, children who die from COVID-19, or, it seems, almost anything else we really want to know by looking only at what happens to most folks. A mistake made by many is learning a few regression methods in school and not adding to the list throughout a career. With so many regression methods available, look around to find the few best suited to your particular problem. If quantile regression isn’t in your analytic arsenal now, maybe think about adding it.
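
    To make the idea concrete, here is a minimal sketch of quantile regression for a single predictor, using only the Python standard library. It is an illustration under stated assumptions, not a production method: the data are simulated, the function names are invented for this example, and the brute-force search only suits small samples. In real work you would more likely reach for your statistics package (in Python, for instance, statsmodels offers a `quantreg` model).

```python
import random
from itertools import combinations


def pinball_loss(residual, q):
    """Asymmetric 'check' loss that quantile regression minimizes."""
    return q * residual if residual >= 0 else (q - 1) * residual


def fit_quantile_line(xs, ys, q):
    """Return (intercept, slope) minimizing the total pinball loss.

    Brute force: with one predictor, some optimal line passes through two
    of the data points, so we try every pair. Small samples only.
    """
    best = None
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue  # a vertical line is not a valid regression fit
        slope = (ys[j] - ys[i]) / (xs[j] - xs[i])
        intercept = ys[i] - slope * xs[i]
        loss = sum(pinball_loss(y - (intercept + slope * x), q)
                   for x, y in zip(xs, ys))
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best[1], best[2]


# Simulated data (an assumption for illustration): the spread of y grows
# with x, so the tails behave differently from the middle.
random.seed(0)
xs = [random.uniform(0, 10) for _ in range(60)]
ys = [1.0 + 2.0 * x + random.gauss(0, 0.5 + 0.5 * x) for x in xs]

for q in (0.1, 0.5, 0.9):
    intercept, slope = fit_quantile_line(xs, ys, q)
    below = sum(y < intercept + slope * x for x, y in zip(xs, ys))
    print(f"q={q}: y = {intercept:.2f} + {slope:.2f}x, "
          f"{below} of {len(xs)} points fall below the line")
```

    Each fitted line splits the data so that roughly the requested share of points falls below it, which is what lets the 10th and 90th percentile lines describe the extremes that a single mean line averages away.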

    Another much-overlooked analytic method is survival analysis. The name is unfortunate, because it’s really for modeling the time until some event occurs, not just things that eventually fail. It can be used to find how long it takes a person to learn something, recover from a disease, even complete a PhD! Modeling the time to some event has been used to understand the evolution of substance abuse cases, assess the resiliency of a damaged ecosystem, evaluate the effectiveness of public policy decisions, project the results of program fundraising campaigns, and much more. Survival analysis is one of the basic, bread-and-butter D4G techniques used over and over again.
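
    As a small illustration of time-to-event modeling, the sketch below computes a Kaplan-Meier curve, one of the standard starting points in survival analysis, in plain Python. The data and the housing scenario are hypothetical, and the function name is invented for this example; dedicated libraries (lifelines in Python, for instance) provide the same estimator with confidence intervals and plotting.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each observed event time.

    durations: how long each subject was followed
    observed:  True if the event happened at that time, False if the
               subject was censored (follow-up ended before the event)
    """
    survival = 1.0
    curve = []
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    for t in event_times:
        # Everyone followed at least until t is still "at risk" at t.
        at_risk = sum(d >= t for d in durations)
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical data: months until program participants found stable housing.
# False marks people who left the study before finding housing (censored).
months = [1, 2, 2, 3, 4]
found_housing = [True, True, False, True, False]

for t, s in kaplan_meier(months, found_housing):
    print(f"month {t}: estimated share still waiting = {s:.2f}")
```

    The censored participants still count toward the at-risk group up to the time they leave, which is the key difference from simply averaging the observed waiting times.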

    Principal components analysis (PCA) is another method with more to it than often meets the eye. Usually thought of in terms of dimensionality reduction, simplifying many predictive variables into a few, it can provide important benefits in D4G analysis. Properly applied, PCA is a powerful diagnostic tool for understanding and explaining the key drivers of the phenomenon in question. The key extra comes from thoroughly investigating the factors comprising each principal component. Digging into what they have in common can often provide valuable insight into root causes.

    For example, a study of risk factors for victims of human trafficking found several related predictors associated with new homelessness, such as foreclosure rates, but not chronic homelessness. Identifying a common theme in the fields comprising a principal component led to the finding that youth programs for preventing homelessness also lowered the risk of young people becoming victims of human trafficking.

    Principal components can also be used to find “tracers”: easily observed features that are correlated with something difficult to observe, even when the two are not directly related. A common example is body temperature as a tracer for infection. A thermometer doesn’t measure serum levels of a pathogen, but its reading can be used as an indicator because the two are correlated.

    Tracers can be found by plotting the principal components along with all variables: a tracer is an easily measured variable that points in the same direction as a principal component that is difficult to measure. As an example, gentrification can be complex and difficult to measure. However, changes in chronic homelessness are much easier to measure and can trace changes in gentrification, which is a major driver of that metric.
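
    A rough sketch of this kind of inspection follows. It approximates the first principal component by power iteration in plain Python and then lists each variable's loading, so the variable most closely aligned with the component stands out as a candidate tracer. Everything here is an assumption for illustration: the values are simulated, the variable names other than the foreclosure rate are placeholders, and a real analysis would more likely use a library routine (scikit-learn's `PCA`, for example) and look at several components.

```python
import math
import random


def standardize(column):
    """Rescale a column to mean 0 and standard deviation 1."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (len(column) - 1))
    return [(v - mean) / sd for v in column]


def first_component_loadings(columns, iterations=200):
    """Approximate the first principal component by power iteration.

    columns: dict mapping variable name -> list of values
    Returns a dict mapping variable name -> loading on the first component.
    """
    names = list(columns)
    z = [standardize(columns[name]) for name in names]
    n = len(z[0])
    # Correlation matrix of the standardized variables.
    corr = [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z]
            for zi in z]
    vec = [1.0] * len(names)
    for _ in range(iterations):
        vec = [sum(c * v for c, v in zip(row, vec)) for row in corr]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return dict(zip(names, vec))


# Simulated data: one hidden driver moves three observable indicators,
# while a fourth variable is unrelated noise.
random.seed(1)
hidden = [random.gauss(0, 1) for _ in range(300)]
data = {
    "foreclosure_rate": [h + random.gauss(0, 0.3) for h in hidden],
    "indicator_b": [h + random.gauss(0, 0.3) for h in hidden],
    "indicator_c": [h + random.gauss(0, 0.3) for h in hidden],
    "unrelated": [random.gauss(0, 1) for _ in hidden],
}

loadings = first_component_loadings(data)
for name, value in sorted(loadings.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:>18}: loading {value:+.2f}")

# The variable most aligned with the component is a candidate tracer.
tracer = max(loadings, key=lambda name: abs(loadings[name]))
print("candidate tracer:", tracer)
```

    Variables with large loadings on the same component are the ones worth examining for a common theme, while a loading near zero suggests the variable has little to do with that component.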

    Capture-recapture (CRC) is a great example of a well-established D4G method finding a vitally important new application. This is one of the oldest techniques around, developed in the late 1800s to estimate and track the size of biological populations. It became one of the most important tools in ecological D4G … and pretty much stayed there for a century, until Patrick Ball came along.

    Ball, founder of the Human Rights Data Analysis Group (HRDAG), applied CRC to human rights abuses, starting with the Bosnian genocide. HRDAG continues to lead in the development and application of advanced statistical methods to some of the worst problems in the world today, and capture-recapture has become one of the most important analytic tools in Data for Good, tracking everything from deaths in police custody to the rise of hate speech sources on Twitter to government suppression of press reports.
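
    The simplest form of capture-recapture uses two lists and their overlap. The sketch below shows the classic Lincoln-Petersen estimate and Chapman's small-sample variant, applied to two hypothetical lists of record IDs. It assumes a closed population, independent lists, and perfect matching of records across lists, and the function names are invented for this example; applied work typically involves more than two lists and more elaborate models, so treat this only as the core idea.

```python
def lincoln_petersen(list_a, list_b):
    """Estimate total population size from two overlapping lists."""
    a, b = set(list_a), set(list_b)
    overlap = len(a & b)
    if overlap == 0:
        raise ValueError("no overlap between lists; estimate is undefined")
    return len(a) * len(b) / overlap


def chapman(list_a, list_b):
    """Chapman's variant, less biased when the overlap is small."""
    a, b = set(list_a), set(list_b)
    return (len(a) + 1) * (len(b) + 1) / (len(a & b) + 1) - 1


# Hypothetical example: two organizations each document some of the cases.
# IDs 150-199 appear on both lists.
source_a = range(0, 200)    # 200 cases recorded by the first source
source_b = range(150, 300)  # 150 cases recorded by the second source

print("documented by either source:", len(set(source_a) | set(source_b)))
print("Lincoln-Petersen estimate:  ", lincoln_petersen(source_a, source_b))
print("Chapman estimate:           ", round(chapman(source_a, source_b), 1))
```

    The smaller the overlap relative to the list sizes, the larger the estimated number of cases that neither source recorded.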

    Text mining is one area undergoing rapid technological advancement, leading to many important success stories in Data for Good. Mining social media text has been applied to many cases, from finding under-counted COVID-19 cases to tracking the rise of hate speech sources on Twitter. One particularly outstanding example is Tom Sabo’s work fighting human trafficking by mining social media, decoding the language to identify ads for people. Analysis of social media text often borrows methods and code from sentiment analysis. The open-source Python Natural Language Toolkit is a tremendously valuable resource: for text analytics in Python, it is a must.
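
    For a flavor of the lexicon-based scoring that sentiment analysis often starts from, here is a deliberately tiny sketch using only the Python standard library. The word list and the example posts are made up for illustration, and the tokenizer is a bare regular expression; a real project would swap in a proper tokenizer and a vetted lexicon, such as those available through the Natural Language Toolkit.

```python
import re

# Tiny hypothetical lexicon: +1 for positive words, -1 for negative words.
LEXICON = {
    "love": 1, "great": 1, "helpful": 1,
    "terrible": -1, "awful": -1, "hate": -1,
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_score(text):
    """Sum the lexicon values of the words in the text (0 if none match)."""
    return sum(LEXICON.get(token, 0) for token in tokenize(text))


# Made-up example posts.
posts = [
    "I love this program, it is great",
    "terrible service and awful delays",
    "the meeting is on Tuesday",
]

for post in posts:
    print(f"{sentiment_score(post):+d}  {post}")
```

    The same pattern, a domain-specific word list applied to tokenized posts, can be pointed at other targets by changing the lexicon, though building and validating that lexicon is where most of the real work lies.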

    These are just a few examples of how both new technology and new uses of established methods are a driving force in Data for Good. They also bring important benefits beyond it, as tech learned and skills developed for D4G projects can be applied to any analytic activity. Trying out new ideas and adapting existing technology to new applications makes D4G tech a powerful force in working for the greater good.
