Toward a Vision: Official Statistics and Big Data
Cavan Capps and Tommy Wright, U.S. Census Bureau
Every day of our lives, each of us is bombarded with data. More recently, there has been an explosion of data, and it is nearly impossible to ignore the increasing volume of and potential use for Big Data.
For example, a paper published on the High-Level Group for the Modernisation of Statistical Production and Services’ website in March 2013 notes the following:
In the Netherlands, approximately 80 million traffic loop detection records are generated a day. These data can be used as a source of information for traffic and transport statistics and potentially also for statistics on other economic phenomena.
The New York Times reported in February 2012 that, “In economic forecasting, research has shown that trends in increasing or decreasing volumes of housing-related search queries in Google are a more accurate predictor of house sales in the next quarter than the forecasts of real estate economists.”
Federal statistical agencies produce official statistics. While Big Data are generally not official, we believe there are opportunities where they can enhance official statistics. Here, we share our thoughts about this topic to further a conversation, with attention given to official statistics provided by the U.S. Census Bureau.
With a great deal of our work integrally linked with the roots of our nation’s democracy, the Census Bureau does much more than the constitutionally required census every 10 years to redistribute the seats in the U.S. House of Representatives among the states. In addition to drawing boundaries for representation at all levels of government, our data assist in the allocation of hundreds of billions of dollars per year in state and federal funding to local, state, and tribal governments. Our data also are used to plan economic development and assess the need for schools, hospitals, job training, etc.; to plan communities and predict future needs; to plan the location of roads and public facilities; and to analyze social and economic activity and trends. In brief, our data products and those of other statistical agencies provide information about our people: who we are, how many we are, what we do, where we live, and how we live.
How the Census Bureau Collects and Provides Data
At a very high level, our data products result primarily from the implementation of censuses, sample surveys, administrative records, and statistical modeling. We conduct a census every 10 years that primarily provides counts of people (including limited demographics) and counts of all habitable dwellings in the United States. Every five years, we also conduct a census of the economy, measuring counts of different types of businesses and some characteristics of their activities, as well as a census of governments.
While censuses are our primary source of counts and some characteristics, sample surveys are conducted more frequently (e.g., monthly, quarterly, and annually) of people, housing, businesses, and governments to provide current estimates of their characteristics. Measures produced by our demographic sample surveys (many operated in partnership with other federal agencies) include unemployment and other labor force characteristics, income, government program participation and eligibility data, purchases of specific goods and services, time use, education characteristics, illnesses, disability, health, types and incidence of crime, science/engineering work force, housing, poverty, and health insurance coverage. Measures produced by our economic sample surveys include dollar value of retail sales, inventory, wholesale trade activities, receipts and revenues for service industries, manufacturing, construction, finances, state and local tax revenues, state and local government employment, imports, exports, transportation, communications, and utilities.
The use of administrative records is increasing. One example is our Population Estimates Program, which annually develops and disseminates estimated population counts and associated characteristics for the nation, states, counties, and functioning governmental units. Records from the Internal Revenue Service help with our estimates of internal migration. The program also develops and disseminates estimated counts of housing units at the state and county levels, and national and state-level population projections are produced as well. We also make increasing use of statistical methods (e.g., modeling) to provide estimates for small domains such as subpopulations and low levels of geography.
To conduct censuses and sample surveys, we send questionnaires via the postal service and Internet, administer questionnaires using the telephone, and knock on doors to conduct face-to-face interviews. Quality control and assurance efforts are in place to ensure high-quality data products. Providing important social and economic indicators for more than two centuries also has yielded major advances and improvements in data collection methodology (including many by others) that are noteworthy:
1. Probability sampling theory and methodology help us generalize from samples to populations
2. Assessment of nonsampling errors (e.g., inaccurate or inconsistent responses) is possible, and we can compensate for many of them, including nonresponse
3. Dissemination of data and access to data by users has improved and increased
4. Significant progress has been made in protecting privacy and confidentiality of our respondents
Official statistics in the United States are grounded in the scientific method and constantly subject to scientific review; they are understood, they are authoritative, and they are credible. However, they are not perfect, and they are not free.
Changes, Changes, Changes
As has been noted by many, our society is increasing in complexity. We are highly mobile, technological advances are rapidly changing how we live, data users want more data in more detail, and data collection costs are increasing while data collection and processing budgets are not. As a result, questions abound. What to measure? How to measure? What can we measure? How frequently should we measure? What resources do we have to measure?
And There Is Big Data …
Digital transaction data are ubiquitous and growing. Financial transactions previously done with paper checks or cash are now done with credit or debit cards, or increasingly with smartphones. Online news and blogs replace newspapers, search engines record the trends of people’s interests, and social media provides trends on what people are discussing. Smartphone GPS data provide traffic congestion data for Google Maps. E-commerce transactions provide signals as to what items cost and which demographics are buying them. Local governments are making data available via Internet APIs for public access. Internet use is growing; there are more devices connected to the Internet in the U.S. than people. Smartphone use continues to grow, and the trends are not expected to reverse.
Can we ignore this growing ocean of digital data? Avoiding a formal definition of Big Data, we present a few comparisons reflecting our impressions between official statistics that result from censuses, sample surveys, administrative records, and statistical modeling on the one hand and Big Data on the other:
1. The size of databases for official statistics tends to be no more than hundreds of millions (10^8) of records, while Big Data come in big volumes and make use of galaxy-type prefixes to describe their sizes (e.g., tera- for 10^12, peta- for 10^15, exa- for 10^18, zetta- for 10^21, and yotta- for 10^24).
2. Official statistics are disseminated every decade, every five years, annually, quarterly, and monthly, while Big Data can be disseminated practically instantaneously or just about every second as is the case, for example, with financial data.
3. Official statistics are obtained largely by asking, and response assumes permission to use, while Big Data come as byproducts of other primary activities and without asking explicitly.
4. Official statistics tend to be labor intensive, subject to human error, and costly, while Big Data are mainly captured digitally, readily available, and seemingly relatively cheap.
5. Official statistics are the result of careful data collection design with clearly defined uses, while Big Data come with unknowns (e.g., uses are less clear, data are less understood, data are of unknown quality, and representativeness is largely unknown).
Complementary Roles for Official Statistics and Big Data
Big Data come with great promise, as they can enhance and strengthen official statistics. Leading potential complementary roles for Big Data include the following:
1. Providing variables to help us stratify better for our sample surveys
2. Improving sample survey estimates provided by methods (weighting, ratio, regression estimation, unequal probability sampling, balanced sampling, adaptive methods, imputation, modeling, calibration, post-stratification, etc.) calling for auxiliary variables that do not need to be perfect; these auxiliary variables just need to be correlated with our primary variables of interest
3. Helping us compensate for nonresponse
4. Helping us check our estimates
5. Helping us improve the frequency and timeliness of our data releases
6. Helping us improve and provide more small-area estimates
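As an illustration of item 2 above, here is a minimal sketch, in Python with entirely synthetic data, of a ratio estimator: one of the simplest ways an auxiliary variable that is merely correlated with the variable of interest can sharpen a survey estimate. The variable names and distributions are invented for the example; in practice, y would come from a sample survey and x from an auxiliary (possibly Big Data) source whose population total is known.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic population: x is the auxiliary variable (known for everyone),
# y is the variable of interest (observed only for the sample).
N = 10_000
x_pop = rng.gamma(2.0, 50.0, N)
y_pop = 3.0 * x_pop + rng.normal(0.0, 30.0, N)

# Simple random sample of size n
n = 200
idx = rng.choice(N, n, replace=False)
x_s, y_s = x_pop[idx], y_pop[idx]

# Plain expansion estimator of the population total of y
t_srs = N * y_s.mean()

# Ratio estimator: exploits the correlation between y and x,
# using the known auxiliary total sum(x_pop)
t_x = x_pop.sum()
t_ratio = t_x * (y_s.sum() / x_s.sum())

print("expansion:", round(t_srs), "ratio:", round(t_ratio),
      "truth:", round(y_pop.sum()))
```

Because y and x are strongly correlated here, the ratio estimator is typically much closer to the true total than the plain expansion estimator; the auxiliary variable does not need to be perfect, only correlated.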
Sources of Big Data
- Administrative data that arise from the administration of a program, be it governmental or not (e.g., electronic medical records, hospital visits, insurance records, bank records, and food banks)
- Commercial or transactional digital data that arise from the transaction between two entities (e.g., credit card transactions and online transactions, including from mobile devices)
- Sensor data (e.g., satellite imaging, road sensors, and climate sensors)
- GPS tracking devices (e.g., tracking data from mobile telephones)
- Behavioral data (e.g., online searches about a product, service, or any other type of information and online page views)
- Opinion data (e.g., comments on social media)
Big Data are generally not the result of careful design to provide reliable measures. For example, sample surveys show that Twitter users are younger than the general public; hence, results from Twitter cannot represent the general population. When news reports cover a topic, Twitter feeds increase on that topic. Big Data are signals, or proxies, for economic and social behavior. To compensate for inherent biases, Big Data need to be evaluated and baselined against data of known quality.
Can Big Data reliably supply the social, demographic, health behavior, and business activity information required for a 21st-century society? Our current answer to this question is, “Not yet.” Given the growing concern over privacy and confidentiality related to Big Data, our nation may not ever want or trust Big Data to serve as a source for official statistics. However, statistical agency infrastructures are in place to critique and address the accuracy, consistency, and interpretability of the results produced from Big Data. With this infrastructure, the Census Bureau is in position to incorporate relevant Big Data sources while ensuring the consistency of official statistics, providing interpretation of them, and improving their relevance and timeliness.
Integrating official statistics and Big Data requires statistical and computational methods capable of producing unbiased and reliable estimates of social and economic indicators. Increased judicious blending of design-based and model-based sampling methods offers real options, especially if we view Big Data as a source of auxiliary variables. Such efforts may require new computational techniques and software/hardware architectures.
Two recent events are worth noting. In June 2011, the McKinsey Global Institute released the highly cited report Big Data: The Next Frontier for Innovation, Competition, and Productivity, which defines Big Data and provides seven key insights and opportunities. On March 29, 2012, the White House announced a $200 million Big Data research and development initiative to help the United States make the most of this opportunity.
Opportunities with Big Data for Small-Area Estimation
Nearly all monthly statistics and economic indicators are released only at the national level. However, many analysts need timely estimates at a sub-county level; because most cities are contained within counties, county-level figures can mask important differences. Property, crime, education, transportation, and other economic development data differ between the inner cities and outer suburbs of the same county. Some investment occurs at a metropolitan level, and other commercial or social investment occurs at a neighborhood level. Good business thrives on good data. Producing timely and reliable small-area data based only on sample surveys cannot be done in the current environment of limited budgets.
Small-area estimates rely on statistical models that require predictor variables, often either for low levels of geography or for small subpopulations. The supply of good predictor variables for use in small-area estimation is limited. Currently, sufficient official statistics do not exist for small areas, and Big Data might just be able to supply a variety of variables (within budget) that only need to be correlated with our primary variables of interest to improve estimates.
Another of the obstacles to releasing small-area estimates for demographic characteristics or business characteristics is the need to ensure the confidentiality of all released data. The U.S. Census Bureau is required to ensure that individual people or companies cannot be identified from Census Bureau data. Estimates modeled from sample survey data and Big Data integration may provide a way to make increasingly detailed small-area estimates available while decreasing the risk of identification.
There are many examples of statistics that users have requested at small geographic levels and that might be improved with Big Data. These include requests for small-area housing and construction data, including housing permits, housing sales, foreclosures, housing values, property taxes, construction starts, and commercial construction values. There are also requests for business activity data, including retail sales, durable goods sales, data on business clusters and supply chains, interest in small business start-ups, local government sales, and shipping activity (e.g., barge, shipping, rail, trucking, FedEx, and UPS). Finally, there are many requests for small-area estimates for health and other social data, including educational participation, crime, health behavior, and disease spread (e.g., flu, heart disease, ADHD, cancer).
Business analysts argue that data made available close to certain events (or the period of measurement) are valuable. In response to Hurricane Katrina and Hurricane Sandy, users requested timely information about economic activity impact and reconstruction costs. Additionally, timely data on small areas regarding occupational and business activity during the recent recession might have been useful in targeting stimulus funds.
Current Efforts in Producing Small-Area Estimates
The Census Bureau uses statistical models to integrate data from different sample surveys and administrative data sources to create modeled small-area estimates. Some of these include estimates produced by the Small Area Income and Poverty Estimates (SAIPE) Program, the Small Area Health Insurance Estimates (SAHIE) Program, and the Longitudinal Employer-Household Dynamics (LEHD) Program. These programs have released successful data products for some time, and they are accepted by both the scientific and data user communities.
Other Opportunities for Big Data
The Census Bureau traditionally processes massive data sets like the Decennial Census. It also continually updates the Topologically Integrated Geographic Encoding and Referencing (TIGER) geographic information system (GIS) files. Today, the Census Bureau processes and manages massive administrative files like these routinely.
Construction statistics have used commercial administrative data sources to establish the sampling frames for construction activities.
Using Big Data, the Census Bureau might be able to release preliminary data estimates much closer to the time of an event. These estimates would need revision after being base-lined with designed sample survey data. Some data estimates that might be released this way include potential preliminary estimates for county business patterns, minority or women-owned small business start-ups, housing sales and foreclosures, and transportation data such as mass transit and bicycle traffic that would reflect seasonal changes in transportation patterns.
Other estimates might include personal discretionary income, consumer confidence, Internet usage, health insurance participation and use, disability, child care, educational issues, and college or professional school enrollment.
Current Activities (Research, Testing, Experimentation …)
In addition to the recent move to use the Internet as one of several modes to collect data in the American Community Survey and the extensive testing and planning to do the same for the 2020 Census, the Census Bureau is sponsoring active research administered by the National Science Foundation to improve measurement of economic and demographic characteristics with statistics based on administrative data, social media streams, web-based data, and geospatial data. The research objective is to improve estimates while reducing costs and respondent burden.
The Census Bureau is continuing to experiment internally with web scraping. There may be useful Internet data available for residential housing permits and sales, crime incidence, state and local government sales, and property taxes. Corporate finance data also might be available. State and local government web application program interfaces (APIs) exist to download available state and local data.
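To make the API idea concrete, here is a minimal Python sketch of pulling and aggregating records from such an interface. The endpoint URL, the fetch_permits helper, and the issue_month field name are all hypothetical: real open-data APIs differ in URL structure and schema, so this illustrates the pattern rather than any particular government's interface.

```python
import json
from urllib.request import urlopen

def fetch_permits(url):
    """Download a JSON array of records from a (hypothetical) open-data API."""
    with urlopen(url) as resp:
        return json.load(resp)

def monthly_permit_counts(records, month_key="issue_month"):
    """Aggregate permit records into counts per month.
    'issue_month' is a placeholder field name, not a real schema."""
    counts = {}
    for rec in records:
        month = rec.get(month_key)
        if month is not None:
            counts[month] = counts.get(month, 0) + 1
    return counts

# Exercising the aggregation on a canned payload (no network needed):
sample = [
    {"issue_month": "2013-01"},
    {"issue_month": "2013-01"},
    {"issue_month": "2013-02"},
]
print(monthly_permit_counts(sample))
```

In a live setting, the canned payload would be replaced by a call such as fetch_permits("https://data.example.gov/permits.json") (a placeholder address), and the monthly counts could then be compared against, or used alongside, survey-based permit statistics.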
We are continuing to explore the quality of commercial e-transaction data to track aggregate retail sales and wholesale transactions, perhaps to provide lower-level geography estimates and housing foreclosure data. To reduce escalating survey costs, we are exploring more administrative and commercial data for accurate housing addressing and commercial cell phone numbers for more cost-effective telephone surveying.
Specifically, the Census Bureau is exploring how it can use local records to more seamlessly and continually update the 2020 Census address lists and maps, rather than waiting to receive such information as part of a one-time decennial update. Local records may be helpful in targeting decennial operations to hard-to-count groups or those in certain geographic areas.
Research for the 2020 Census also is exploring the use of paradata (additional data describing what happens as data are being collected), along with administrative and longitudinal data to improve the cost and effectiveness of “responsive/adaptive design” methods for census and sample survey data processing.
Anonymous cell phone GPS data may be integrated with other demographic data to provide more accurate and less costly estimates of the Census Bureau transportation data packages, but privacy and confidentiality concerns must be addressed.
Challenges and Issues
Arguably of utmost importance is the protection of privacy and confidentiality. There is growing public concern over privacy issues in the online data space, and the press has given increasing attention to Big Data intrusions on privacy. This concern has grown so important that, in February 2012, the White House developed a framework for data privacy titled Consumer Data Privacy in a Networked World: A Framework for Protecting Privacy and Promoting Innovation in the Global Digital Economy. The basic principles of this framework include the following:
(a) Individual Control
(b) Transparency
(c) Respect for Context
(d) Security
(e) Access and Accuracy
(f) Focused Collection
(g) Accountability
Knowing that the accuracy and validity of our data directly depend on our protection of privacy and confidentiality, the Census Bureau continues to improve procedures that take new data sources into account to prevent identification, and it actively addresses public concerns about privacy. The use of modeling strengthens the confidentiality of small-area estimates.
The Census Bureau needs to continue to educate the public about the processes in place to ensure confidentiality and privacy, as well as the laws that enforce them.
How will the scientific establishment react to official statistics, especially small-area estimates, that rely on Big Data? What will be the public perception of these estimates? Will the costs associated with Big Data prove to be unacceptable relative to the benefits? Can we understand and describe what Big Data represent; can we understand and quantify the quality of Big Data?
Throughout this process, the Census Bureau must maintain its tradition of keeping its complete estimation processes transparent and reproducible.
Looking to the Future
The Census Bureau has a long history of innovation. Herman Hollerith invented the punch card for the 1890 Census; the first civilian computer was used for the 1950 Census. The first official sample survey was used by the Census Bureau to measure unemployment in 1937. Some of the basic technology for GIS was developed in the Dual Independent Map Encoding/Graphic Base Files efforts for the 1970 Census and TIGER for the 1990 Census.
Each of these innovations was done to reduce escalating cost and to preserve official statistical integrity. For these same reasons, the Census Bureau will continue to explore the possibility of using the explosion of Big Data to reduce cost, reduce reporting burden, and increase the effectiveness of national statistical estimation.
These benefits will accrue only if the Census Bureau can continue to preserve individual and corporate confidentiality, working to earn and preserve the public’s trust.
Cavan Capps is the Census Bureau lead on Big Data, and Tommy Wright is chief of the Center for Statistical Research and Methodology. This article is released to encourage discussion. The views expressed are those of the authors, and not necessarily those of the U. S. Census Bureau.