Data Resources for Data for Good Researchers

1 May 2018
This column is written for those interested in learning about the world of Data for Good, where statistical analysis is dedicated to good causes that benefit our lives, our communities, and our world. If you would like to know more or have ideas for articles, contact David Corliss.

With a PhD in statistical astrophysics, David Corliss works in analytics architecture at Ford Motor Company while continuing astrophysics research on the side. He serves on the steering committee for the Conference on Statistical Practice and is president-elect of the Detroit Chapter. He is the founder of Peace-Work, a volunteer cooperative of statisticians and data scientists providing analytic support for charitable groups and applying statistical methods to issue-driven advocacy in poverty, education, and social justice.

Data for Good research, of course, starts with data. Many projects begin with an existing relationship between the researcher and an organization that has data but needs statistical support to obtain the greatest good. As one becomes more familiar with the data and needs of the organization or cause, a natural progression in the analytical process is to consider, evaluate, and integrate data from other sources. While the researcher often begins with a paucity of useful data in-house, this can turn to an embarrassment of riches when other data sources are considered.


JSM Presentations
Going to JSM? Visit the online program and search for Data for Good to find 95 listings of presentations and other activities. On August 2 at 10:30 a.m., Jake Porway of DataKind will lead an invited session on data science for social good, with instructions and examples for setting up your own project.

Student Instruction Packet
Peace-Work has developed a student instruction packet on gun violence research. Written for students in the first two years of college or in a high-school AP Statistics class, it describes a process for performing a local gun violence study. The packet is intended to support a number of local studies at the metro area, county, or state level, each investigating various dimensions of gun violence. Once a number of these studies are completed and published, meta-analysis can combine them to contribute to a national picture of gun violence in the United States. The gun violence student instruction packet is available for free download.

Data Sources

Many data sources are compiled by government agencies. While the federal government maintains a general-purpose search engine at USA.gov, the data-focused website Data.gov is a more valuable resource, with a search engine for publicly available government data. More than 200,000 data sets can be searched.

One of the most important data sources for Data for Good researchers is the US Census Bureau. The main data page links to the principal Census Bureau products. The decennial census, conducted every 10 years, collects a large amount of data of different kinds and tries to reach every person in the country. The American Community Survey (ACS) collects more detailed information than the decennial census from a sample of households, with estimates released every year and more detailed estimates based on five years of pooled data. The wealth of data in the ACS makes it an important source for Data for Good projects.

The American FactFinder page provides easy access to the ACS data sources. Both the decennial census and the ACS are published as a series of tables, each covering a particular subject area, such as demographics, economic data, or housing. The ACS also includes data on specific industries. Data can be downloaded free of charge, including whole tables, although some tables may be split into separate files by geographic level, such as state.
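When a table arrives as one file per state, the pieces usually need to be stacked back into a single table before analysis. The following is a minimal sketch of one way to do that in Python, assuming each file is a CSV with an identical header row. The function name `combine_state_files`, the column names, and the population values are all hypothetical and used only for illustration; real downloads will have their own layouts.

```python
import csv
import io
from typing import Dict, Iterable, List, TextIO


def combine_state_files(files: Iterable[TextIO]) -> List[Dict[str, str]]:
    """Stack rows from several CSV files that share the same header row."""
    combined: List[Dict[str, str]] = []
    expected_header = None
    for handle in files:
        reader = csv.DictReader(handle)
        if expected_header is None:
            # The first file defines the column layout.
            expected_header = reader.fieldnames
        elif reader.fieldnames != expected_header:
            raise ValueError("column layout differs between files")
        # Every value stays a string, so codes keep their leading zeros.
        combined.extend(reader)
    return combined


# Toy stand-ins for two per-state downloads (values are made up).
michigan = io.StringIO("GEOID,total_population\n26163,100\n")
ohio = io.StringIO("GEOID,total_population\n39035,200\n")

rows = combine_state_files([michigan, ohio])
for row in rows:
    print(row["GEOID"], row["total_population"])
```

Reading every field as text is a deliberate choice here: geographic codes are identifiers rather than numbers, and converting them to integers drops leading zeros.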

Census data is geocoded using FIPS codes for state, county, and census tract, so gaining familiarity with these codes will facilitate use of the data. The FIPS codes can appear as separate fields, combined into a single string as GEOID, or both. The Census Bureau strives to support many kinds of data consumers with different needs. This leads to some redundancy in the tables, so it is best to review the fields and keep only those a project needs.
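As an illustration of how the combined code relates to its parts, a tract-level GEOID is an 11-digit string: 2 digits for the state, 3 for the county, and 6 for the census tract. The sketch below splits and rebuilds such a string in Python. The function names are hypothetical, and while 26 and 163 are the codes for Michigan and Wayne County, the tract digits in the example are only illustrative.

```python
from typing import NamedTuple


class TractGeoid(NamedTuple):
    state: str   # 2-digit state FIPS code
    county: str  # 3-digit county FIPS code
    tract: str   # 6-digit census tract code


def split_tract_geoid(geoid) -> TractGeoid:
    """Split an 11-digit tract GEOID into state, county, and tract parts.

    Accepts a string or an integer. Values are left-padded with zeros to
    11 digits, which restores a leading zero lost when a GEOID was read
    as a number. Only pass tract-level GEOIDs; shorter codes (such as a
    5-digit county code) would be padded incorrectly.
    """
    text = str(geoid).strip().zfill(11)
    if len(text) != 11 or not text.isdigit():
        raise ValueError(f"not an 11-digit tract GEOID: {geoid!r}")
    return TractGeoid(text[:2], text[2:5], text[5:])


def build_tract_geoid(state, county, tract) -> str:
    """Combine separate FIPS fields into one zero-padded GEOID string."""
    return f"{int(state):02d}{int(county):03d}{int(tract):06d}"


parts = split_tract_geoid("26163500100")
print(parts.state, parts.county, parts.tract)
print(build_tract_geoid(26, 163, 500100))
```

Storing these codes as text rather than numbers avoids the most common problem when joining census tables: states with codes below 10 lose their leading zero when a spreadsheet or statistics package treats the field as an integer.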

Most people may not think of the Census Bureau often, or they may only think of it in conjunction with its foremost commercial uses. As people active in Data for Good projects, my colleagues and I have come to understand Census Bureau data as a cornerstone of this important work. Learning to find, access, and manage government resources, especially Census Bureau data, is a valuable skill well worth cultivating, sharing with colleagues, and teaching to students.

State and local government agency sites also provide valuable sources of local data. This is particularly true for crime data, as most localities have good data, not all of which is shared with the federal government. Data for Good studies requiring crime data will do well to look at state and local sources. The same is true for environmental data and for school- and district-level education data and outcomes. Data resources also exist for specific industries and areas of social concern. I am always surprised at what I can find, and data sources continue to grow in number and improve in quality.

Data for Good, of course, is more than the data. Software is also available on an open source or free basis to support your work. Many people will be familiar with the open source analytic platform R. The base software and many packages can be accessed through the Comprehensive R Archive Network. Because R is open source, it has been incorporated into many commercially available analytic packages. These offer the advantage of support from the maker of the commercial package and integration with other tools with which you may be familiar. However, proprietary platforms embedding R will only be able to support a limited number of packages; rarely used packages or those with specific uses may not be covered. RStudio is a popular IDE for R, but is not open source for commercial purposes.

SAS offers University Edition, a free platform for professors, students, and non-commercial researchers. Some Data for Good researchers (including me) prefer SAS University Edition for its SAS environment and its end-to-end solution for data, analytics, and visualizations.

Getting started in Data for Good research can seem confusing at first, but good resources are available to help. Even experienced researchers can benefit from the variety of free and government data sources. Because data is a type of property, public data should be considered public property. Data for Good uses the public data resources we own for the greater good of all.

