Crooked Roads: IRS Statistics of Income at 100
William Blake, Mick Jagger, and the next 100 Years for SOI and Other Federal Statistical Agencies
This month’s guest science policy columnist is Arthur Kennickell, who was the discussant of a JSM session commemorating the first 100 years of the IRS Statistics of Income. He makes suggestions for SOI to continue its success over the next century as it faces new challenges, demands, and opportunities. Kennickell’s suggestions also may help other federal statistical agencies.
~ Steve Pierson
Arthur B. Kennickell has been on the staff of the Federal Reserve Board since 1984. He has a PhD in economics from the University of Pennsylvania.
The most famous living economist, Mick Jagger (a former London School of Economics student), has offered much wisdom, but perhaps the most useful for those of us in statistics and related areas is what I call Jagger’s Theorem and Corollary:
Theorem: You can’t always get what you want.
Corollary: But if you try sometimes you might find you get what you need.
The proofs are left to the reader.
The wisdom I see in these constructs points to the heart of what we face in empirical research: What we want to know is often at best only remotely or statistically knowable to us, even in principle, and the available data and tests also are frequently limited.
Although the lack of internal punctuation in the corollary introduces some ambiguity, the words “try,” “sometimes,” and “need” seem to me the critical points to focus on. In research, what we actually need very often becomes clear only after a painful path of jettisoning wishes that we come to realize were unrealistic or unnecessary. Still, we know that sometimes even then the odds are against us and we may need to start over. But trying—being creative and persistent about data and estimation—is virtually always the sine qua non.
Creativity and persistence can take us to places we had not known before, and often we see openings to other places we need to visit. To paraphrase the poet William Blake, crooked roads can be roads of genius.
It was in this light that I approached the honor of serving as the discussant in a recent JSM session organized by Barry Johnson to commemorate the 100th anniversary of the Statistics of Income Division of the IRS.
Fritz Scheuren, the keynote speaker, is a former SOI director and retains passion for the organization and its mission as a statistical agency. To remain relevant, a high-order element of the mission of any statistical organization must be to explore and adapt to the resources and constraints of its time, even if this element is technically unspoken. Over time, the specifics of the mission at any point in its evolution may sometimes become mistakenly identified with the higher-order mission. For that reason, it is helpful periodically to have a view from above the landscape to see where we might need to stand back or tack. In that position, Fritz argued that, among other things, the SOI of today should do all it can to reach beyond the current specific mission and foster increased data linkage and greater collaboration with others. I return to these and related points below.
Each of the three technical presentations exemplified ways in which SOI resources can be used creatively, particularly in mixing data sources and applying new techniques where straightforward approaches are not workable. As indicated by its title, the presentation “Estimating Persistence in Employee Business Expense Correspondence Examinations Using Hidden Markov Models” by Anne Parker, Julie Buckel, and Sarah Shipley discussed an application of hidden Markov models to gauge the effects of a limited type of audit on reports of a specific deduction in subsequent years where those effects are only indirectly observable. The interesting data twist is that, for some years of the universe of individual tax returns they use for their analysis, the relevant deduction is aggregated with other information. They use a modeling technique to impute the partially obscured deduction information, exploiting the fact that edited SOI sample data for the same years contain highly detailed information for the sampled returns.
The next presentation was “Improving Techniques to Use Panel Data to Produce Cross-Sectional Estimates” by Yan Liu, Michael Strudler, Janette Wilson, and Young Lim. It reported on attempts to address needs for annual detailed cross-sectional estimates of sales of capital assets when only periodic cross-sectional edited samples of individual returns are available, but data are available from a panel of edited returns produced annually for a sample stratified by income in the base period of the panel. For estimation, the panel is supplemented by a relatively small “refreshment sample” to represent returns not in the panel universe. But some incomes are so variable that they appear to “jump strata” between panel waves, and realizations of capital asset sales can be sporadic. The effect is that the unadjusted panel data typically give highly inefficient or unrealistic estimates of cross-sectional values. The authors attempt to use calibration weighting based on universe data on return characteristics related to sales of capital assets to realign the panel data and evaluate the resulting estimates for a year in which independent cross-sectional estimates are also available.
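To make the idea of calibration weighting concrete, here is a minimal sketch under stated assumptions. It is not the authors' procedure, which the presentation summary does not spell out. It assumes each panel record carries a base weight, a cell label, and an auxiliary value whose universe total is known for each cell; the function name, field names, and numbers are all hypothetical. The weights within each cell are rescaled so the weighted auxiliary total matches the universe total.

```python
from collections import defaultdict


def calibrate_weights(records, universe_totals):
    """Rescale weights within each cell so the weighted sum of the
    auxiliary variable matches the known universe total for that cell.

    This is simple ratio calibration (post-stratification when aux == 1),
    used here only as an illustration.
    """
    weighted_aux = defaultdict(float)
    for r in records:
        weighted_aux[r["cell"]] += r["weight"] * r["aux"]

    calibrated = []
    for r in records:
        total = weighted_aux[r["cell"]]
        if total == 0:
            raise ValueError(f"cell {r['cell']!r} has no weighted auxiliary mass")
        factor = universe_totals[r["cell"]] / total
        calibrated.append({**r, "weight": r["weight"] * factor})
    return calibrated


# Hypothetical panel records: aux = 1 means we calibrate to return counts.
panel = [
    {"cell": "low", "weight": 100.0, "aux": 1.0, "gain": 0.0},
    {"cell": "low", "weight": 100.0, "aux": 1.0, "gain": 2.0},
    {"cell": "high", "weight": 5.0, "aux": 1.0, "gain": 50.0},
    {"cell": "high", "weight": 5.0, "aux": 1.0, "gain": 10.0},
]
# Hypothetical universe counts of returns in each cell for the target year.
universe_totals = {"low": 250.0, "high": 8.0}

calibrated = calibrate_weights(panel, universe_totals)
estimated_gains = sum(r["weight"] * r["gain"] for r in calibrated)
print(estimated_gains)
```

In practice the cells would be defined by universe characteristics related to sales of capital assets, and several margins would usually be matched at once, for example by raking.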
The final technical presentation, “Using Sample Data to Reduce Nonsampling Error in Unit-Level Tax Administrative Data” by Tracy Haines, Victoria Bryant, and Kimberly Henry, gave some results of an effort to provide estimates of key individual tax characteristics at the level of three-digit ZIP code areas. SOI invests considerable effort into capturing information not retained in the universe data, editing reporting errors, and rearranging data in conceptually more correct ways. Although the edited sample is large, it is not large enough to support reliable estimates at this geographic level on its own. The research aims to use small-area estimation techniques to combine the high-quality SOI edited sample files, whose sampling proportions vary at the ZIP code level, with the unedited IRS universe data.
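One common small-area device is a composite (shrinkage) estimator. The sketch below assumes that form and is not necessarily what the authors used. It blends a direct estimate from the edited sample with a synthetic estimate built from universe data, leaning on the direct estimate more when its sampling variance is small. The function name, arguments, and numbers are hypothetical.

```python
def composite_estimate(direct, sampling_var, synthetic, model_var):
    """Blend a direct sample estimate with a synthetic (model-based) one.

    gamma is the weight on the direct estimate: it approaches 1 when the
    sample is precise (small sampling_var) and 0 when the sample is noisy.
    """
    if sampling_var < 0 or model_var < 0:
        raise ValueError("variances must be non-negative")
    if sampling_var + model_var == 0:
        return direct  # both sources treated as exact; nothing to shrink
    gamma = model_var / (model_var + sampling_var)
    return gamma * direct + (1.0 - gamma) * synthetic


# Hypothetical ZIP code area: a noisy direct estimate from few sampled
# returns, and a synthetic estimate derived from unedited universe data.
area_estimate = composite_estimate(
    direct=120.0, sampling_var=300.0, synthetic=100.0, model_var=100.0
)
print(area_estimate)
```

Areas with few sampled returns are pulled toward the universe-based value, while well-sampled areas stay close to the edited-sample estimate.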
All the technical presentations reported running into substantial obstacles, and work remains to be done on each of them. But one thing they have in common is their exploitation of the rich data ecosystem and connections available at SOI. I see in the spirit of this exploration and creativity something of what I think should be in the mission of SOI going into its second century. I have a few recommendations for SOI, which I make in all humility and with freely acknowledged ignorance about institutional constraints that might make them difficult or impossible to implement. Some of these suggestions also may point to changes for other U.S. federal statistical agencies.
The IRS conducts two especially important types of “mandatory surveys” (tax returns): individual tax returns and corporate tax returns. SOI creates sample files from these (and other) universe collections. In creating those edited sample files over a long time, SOI has developed deep knowledge and skill in managing misclassifications and other errors that appear in the unedited returns. I fear this is an under-appreciated type of knowledge. It is, I believe, one of the crown jewels of SOI. Other jewels are the knowledge and experience of the staff more broadly and the institutional possibilities for linkages and connections highlighted by the keynote speaker for the session, Fritz Scheuren.
In various ways, a substantial part of the work behind each technical presentation was driven by a need to compensate for the universe data not being adequate, either because insufficient detail is captured or because the data are unedited in ways that would be likely to cloud the desired analysis. At the same time, the SOI edited files, while being highly detailed—containing many records and being highly representative for high-value returns—are also not sufficient for every question. Limitations in the SOI samples are especially severe in the case of panel data when only one period of income is used in the design of the samples.
Because income is variable, but often mean-reverting, the limitation of the stratifier to one period tends to build in a substantial expectation of change in subsequent periods. To the extent variables of interest over time are correlated with longer-run income or have a spiky time-series profile, very large panel samples might be required to support meaningful estimates under such a design.
Instead of choosing between the unedited universe data and the SOI edited samples or attempting to bridge differences between the two types of data through adjustments or approximations, an alternative might be to rethink the relationship between them in a way that would exploit the resources available at SOI.
One approach might be to propagate the structure of editing to the universe data. It would be a practical impossibility to edit the universe data in the same way as the SOI samples. But a close equivalent might be achieved through modeling, a deep learning exercise using the SOI edited data as “training data,” or a combination of the two approaches. The lower level of detail in the universe data might be addressed by capturing all details from electronically filed returns and simulating the remainder from the aggregated variables retained in the universe data, using models estimated on the edited data. If successful, this effort would make it possible to construct much more detailed analyses, or even create special-purpose retrospective panels, without the need to consider sampling, except perhaps indirectly via the uncertainty in the simulated editing. Admittedly, building sufficient infrastructure to implement longitudinally consistent simulated editing might take substantial time and research.
In the meantime, SOI also might think about changing its design for the existing panel samples. In principle, SOI cross-sectional samples can be selected to focus on particular characteristics as of the time of sampling. In contrast, traditional panel samples must be selected in hopes of subsequently presenting a useful longitudinal picture of the relevant population. Where oversampling is not a requirement or is undesirable for some reason, simple random sampling within relatively stable strata—such as geographic classifications—would be adequate when the sample size is calculated to support analysis, taking into account the time-series variability of the variables of interest.
For the more usual SOI oversampling on economic criteria, such as income, there is a risk of selecting cases whose current values are higher or lower than normal, and determination of the sample size to support analysis would need to take into account not just the variability of the variables of interest, but also the variability in the stratifier. To the extent that behavior of interest is more related to longer-run characteristics (for example, “permanent income”), there would be advantages to smoothing over multiple years of the values underlying the stratifier. The Survey of Consumer Finances uses such a smoothing technique in the design of its cross-sectional samples selected from SOI data to support wealth measurement.
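A small simulation can illustrate why smoothing the stratifier helps. This is a toy model under stated assumptions, not a description of the SOI or Survey of Consumer Finances designs: each unit's annual income is a fixed "permanent" component plus independent transitory noise of equal variance, and the top stratum is the top 10 percent. All names and parameter values are hypothetical.

```python
import random


def simulate_retention(n=20000, top_share=0.10, years_smoothed=3, seed=12345):
    """Compare how many units selected into the top stratum are still in
    the top stratum one year later, under two stratifiers:
    the latest single year of income versus a multi-year average."""
    rng = random.Random(seed)
    permanent = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # incomes[t][i] = permanent component plus transitory noise in year t
    incomes = [
        [p + rng.gauss(0.0, 1.0) for p in permanent]
        for _ in range(years_smoothed + 1)
    ]
    base_years, later_year = incomes[:years_smoothed], incomes[years_smoothed]
    k = int(n * top_share)

    def top_k(values):
        order = sorted(range(n), key=lambda i: values[i], reverse=True)
        return set(order[:k])

    single = top_k(base_years[-1])
    averaged = [sum(year[i] for year in base_years) / years_smoothed for i in range(n)]
    smoothed = top_k(averaged)
    later_top = top_k(later_year)
    return len(single & later_top) / k, len(smoothed & later_top) / k


single_rate, smoothed_rate = simulate_retention()
print(f"single-year stratifier retention: {single_rate:.2f}")
print(f"smoothed stratifier retention:    {smoothed_rate:.2f}")
```

Under these assumptions, units chosen on the multi-year average are more likely to remain in the top stratum in the following year, because the average is less contaminated by transitory spikes.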
In many statistical organizations beyond SOI, there is increasing pressure to use data beyond surveys and other traditional sources (often “Big Data”) that offer a prospect of cutting costs, improving representation, increasing timeliness, or avoiding technical complications from the use of sample-based data, particularly in light of nonresponse rates in traditional surveys. But many alternative data sources also have less certain provenance or less constancy of method, purpose, or availability. Especially where there is uncertainty about the true population effectively represented, as is also the case for surveys with low response rates, having reliable universe data as a point of reference is essential for crafting adjustments.
Direct linkage to universe data offers even more possibilities, aside from simply increasing the variables available for analysis. For many analytical objectives with an economic component, linkage to data derived from tax returns, which have highly elaborated provenance, would provide a firm anchor for calibration or similar techniques and a basis against which to evaluate selectivity issues that might plague data of less well-defined provenance. When there is a need to capture the upper reaches of the income distribution or at least its shape, for example, such connection may be even more important. Because the data from tax returns obviously refer only to the population of filers, however, additional work also would be needed to understand the nonfiler population more deeply.
Other statistical agencies also might do well to consider their hidden strengths. Just as SOI has skill in understanding the problems in the generating processes for tax return data, other agencies have often invested considerable time and resources to develop an understanding of the relevant total survey error or other characterization of error processes in their data. Rather than view such “emergency room” skills as a painful necessity, it may be helpful to elevate them in addressing the possibilities in alternative data sources. Aside from physical infrastructure and legal mandates, the statistical agencies have only the embodiment of their history in the accumulation of expertise in terms of subject matter and methodology. We have no meaningful alternative to fostering the evolution of those skills to sustain the high-order mission of remaining relevant, and especially so in the current time of such great extensions of what is being measured.
For SOI, strengthening the organization as a statistical agency should be a high priority for its next 100 years. Like the U.S. Census Bureau, SOI holds sensitive information. Both manage data of enormous potential social value, beyond the local uses of the agencies. The tax code allows a set of external uses to support tax administration as well as some use by other statistical agencies, such as the Census Bureau and Bureau of Economic Analysis.
Although SOI has made great progress in engaging with outside researchers more broadly, it has yet to match the scope of engagement the Census Bureau has achieved through its research data centers, where researchers may access and link confidential information under strictly controlled conditions. Finding a comparably well-controlled means of sharing information and facilitating its linkage with other sources would allow SOI to unlock the analytical energy in tax data. It also would build other supporters for an agency that should have a more central role in the nation’s statistical infrastructure.
Editor’s Note: The views expressed in this article are those of the author and do not necessarily reflect the view of the Federal Reserve Board or its staff.