Big Data: A Perspective from the BLS
I’m pleased to have Bureau of Labor Statistics Associate Commissioner Michael W. Horrigan as this month’s guest columnist. With the buzz about Big Data—as well as the private efforts to produce independent estimates for certain government statistics using Big Data—I was excited to see Horrigan address the topic of Big Data and official statistics at the Association of Public Data Users 2012 Annual Conference. In this column, Horrigan discusses the future of the use of Big Data for the U.S. statistical system.
~ Steve Pierson, ASA Director of Science Policy
Michael W. Horrigan is the associate commissioner in the Office of Prices and Living Conditions at the Bureau of Labor Statistics (BLS). With a doctorate in economics from Purdue University, he has held a variety of senior positions at BLS since he began his career there in 1986.
Big Data—a term that has an increasingly familiar ring, but also defies easy description. As may be the case for many readers of Amstat News, I first became aware of the term when I heard about the Billion Prices Project at MIT. As head of the Bureau of Labor Statistics (BLS) Office of Prices and Living Conditions, I was immediately intrigued by the idea that researchers at MIT were constructing daily price indexes for several countries using “web scraping” techniques to convert posted Internet prices into a digitized database.
I invited Roberto Rigobon (he and Alberto Cavallo head the MIT project) to give a talk on the subject at BLS. In addition to Rigobon being one of the most engaging and entertaining speakers ever to grace the halls of BLS, his message struck a chord and started me down the path of asking, “What are Big Data?” The answer to this question and the extent to which we use Big Data in our programs at BLS surprised me.
I begin, and probably at my peril, by attempting to define Big Data. I view Big Data as nonsampled data, characterized by the creation of databases from electronic sources whose primary purpose is something other than statistical inference.
The Billion Prices Project digitizes posted Internet prices to construct estimates of daily price change. Hal Varian, chief economist at Google, has done highly innovative work using Google searches to create proxies for current economic activity. For example, to predict, at time (t), the level of initial claims for unemployment insurance (UI) at time (t+1), he constructs a model of distributed lag values of prior weeks’ initial claims along with an index of searches made in the current week that are relevant to people looking for information about filing an initial claim. This is a clever combination of ‘official’ government-collected data with an indicator constructed from a ‘big’ data source.
Based on presentations I have seen in the last year, Varian also is exploring the use of Google’s enormous database of prices to construct price indexes for goods traded online. Matthew Shapiro, along with other researchers at the University of Michigan, has used data from Twitter accounts in a model that also predicts the level of initial claims for unemployment insurance, isolating tweets that reference job loss. Yet another example of Big Data is scanner data, such as point-of-sale retail databases and the household-based purchase data from A.C. Nielsen.
These innovative and exciting explorations of data would seem not to be the standard fare for an agency like the BLS. But are they? How do we fit into this picture of the use of Big Data?
On the nonsampled data front, consider the traditional and extensive use of administrative data to draw stratified probability samples and create weights for constructing estimates. The difference here is that this type of Big Data typically comprises the universe and, by definition, can represent (nearly) the entire population of establishments (the BLS Quarterly Census of Employment and Wages, drawn from the universe of establishments reporting to the UI system) or households (the 2010 Decennial Census of household addresses).
There are numerous other administrative databases such as those covering railroads, hospitals, medical claims, and auto sales that we use for our surveys. For example, our item sample of used cars and trucks in the Consumer Price Index Program (CPI) is drawn from the universe data collected by J.D. Power and Associates. We use universe data on hospitals from the American Hospital Association to draw our samples of hospitals and data from the Agency for Healthcare Research and Quality to select the diagnosis codes used for pricing diagnosis related groups (DRGs) in the Producer Price Index Program (PPI).
In addition to using nonsampled universe files to draw samples and create sampling weights, we use this type of administrative data for the direct construction of population estimates. For example, the International Price Program (IPP) uses Energy Information Administration data on crude petroleum for its import indexes; the PPI uses Department of Transportation administrative data on baggage fees in constructing airline price indexes. The PPI also uses a monthly census of all bid and ask prices and trading volume for all traded securities as of market close for three selected days of the month to construct price indexes for securities. The CPI uses SABRE data to construct airline price indexes. Both the PPI and CPI use the universe file for Medicare Part B reimbursements to doctors by procedure code in the construction of health care indexes.
In other cases, administrative data are used to fill in missing data as an alternative method of imputation or in making statistical adjustments to improve the efficacy of estimates. For example, the Current Employment Statistics (CES) Survey uses administrative data from the Quarterly Census of Employment and Wages (QCEW) to impute for key nonrespondents in the production of industry employment estimates by state. QCEW data also are used in the development of the CES net birth-death model to account for the creation and death of firms between updates to the universe file used in constructing monthly employment estimates.
But what about the use of more ‘traditional’ Big Data techniques? In fact, my not-so-random survey of programs in BLS uncovered some intriguing forays into Big Data exploration. For example, in the CPI, I knew we were using web-scraping techniques to collect input price information used to increase the sample of observations we use to populate some of our quality adjustment models. So far, we have used this technique with quality adjustment models for televisions, camcorders, cameras, and washing machines. What I also discovered is that we are web scraping Current Procedural Terminology (CPT) codes, descriptions, and reimbursements for Medicare Part B quotes used in index calculation. CPI also is researching the use of web scraping for the collection of prices for cable TV services.
This latter example raises an obvious question: Why not just use web scraping to produce the CPI? The principal reason is the requirement that we select a bundle of goods and services that is a statistically representative sample of what consumers purchase and reprice that same bundle month after month. Accomplishing this can be challenging, especially accounting for changes in the quality characteristics of goods and for goods that disappear from the shelves from one month to the next. The representative basket is updated on a regular basis to reflect changes in consumer preferences and the emergence of new products; however, the principle of constructing an inflation rate based on the rate of price increase for a known bundle of goods with statistically determined weights lies at the heart of what we do. While research may show the viability of using a web-scraped source of data for a particular item, it needs to be done within the framework of this methodology. The Billion Prices Project, with all of its advantages in terms of the timeliness of a daily price index and large sample sizes, does not price the same representative bundle on a daily basis, nor does it have a source of sampling weights derived from the websites from which it collects prices.
BLS, like many agencies, has been exploring the use of retail scanner data for many years. To date, our most extensive use of scanner data has been in the realm of research, including comparative research between CPI data and scanner data. For example, we are conducting research that compares, for specific expenditure classes of items (e.g., fats and oils), the distributions of items selected in the CPI selection process with the distributions of those same items in the A.C. Nielsen Homescan database.
And one final example of Big Data, unique even among the examples given above and with the potential to greatly affect our data collection systems, is the use of corporate data. In one of our surveys, a respondent has sought an arrangement in which we replace local, establishment-by-establishment data collection (our samples always include numerous establishments owned by the same parent company) with corporate data the company maintains on every item sold in every one of its establishments in the United States. I would venture the opinion that, compared with the types of Big Data cited above, these qualify as ‘really, really Big Data.’
In today’s fast-paced economy, there is often a single point of control and/or information gathering on the inventories, pricing schedules, and sales of every item and store under the aegis of multi-establishment companies—the same companies for which we often need to collect data on an establishment-by-establishment basis. With cooperative respondents—an essential ingredient—there is enormous potential for the use of corporate data. There are significant potential benefits of greatly increasing sample coverage and sample size. There also is the potential for reducing data collection travel costs and respondent burden (the reduction in travel costs has to be weighed against increased IT development and processing requirements). In addition, the quality of the data may improve. In the case of the respondent noted above, the data represent actual transactions at the cash register as opposed to data collected on the list prices of items on the store floor.
And so, what is the future of the use of Big Data for the U.S. statistical system? I see one immediate potential: the use of Big Data to improve the quality of our estimates within our current methodological frameworks. This may include studies of comparability between official and Big Data–derived estimates, the use of Big Data for modeling and imputation, and—in some cases—the use of Big Data for direct estimation.
One important caveat, and one that is as relevant to the U.S. statistical system as it is to practitioners of Big Data techniques such as Billion Prices and Google, is the need to create transparent methodological documentation (metadata) describing the ways in which Big Data are used in the construction of any kind of estimate. Given rising costs of data collection and tighter resources, there is a need to consider the creative use of Big Data, including corporate data. However, blending estimates drawn from traditional statistical methods with larger universe data requires clear statements of how those estimates are developed, along with a perspective on the potential sources of sampling and nonsampling error that can bias our estimates and threaten valid inference.