Open Data: More Questions Than Answers
This column is written to inform ASA members about what the ASA is doing to promote the inclusion of statistics in policymaking and the funding of statistics research. To suggest science policy topics for the ASA to address, contact ASA Director of Science Policy Steve Pierson at firstname.lastname@example.org.
Jonathan Auerbach is a statistician/policy wonk hybrid with a master’s in statistics from Columbia University. During the day, he is an analyst at the finance division of New York City’s legislative body, the New York City Council.
What will public participation in government look like in the era of Big Data? Will data be free and abundant, allowing for responsible, evidence-based policy? Or will data be hoarded and distorted by power brokers, inciting public distrust of anything data-related? Only the data gods know for sure, but I believe the current open data movement in New York City gives insight into the future of data and its use in public participation and government. This movement suggests many challenges remain before the public can use open data responsibly.
The Open Data Movement
The open data movement has made tremendous progress in New York. On November 20, 2013, New York City’s legislature—the city council—held its first hearing on open data since the passage of its landmark open data law in 2012. The purpose of the hearing was to reiterate the council’s commitment to the law and review how well the city has been complying with the law’s twin mandates.
The first mandate specifies when data must be released, directing the city to keep its data open “by default” as opposed to “by request.” This means each agency must proactively identify all non-sensitive data and release them to the public under a series of deadlines ending in 2018. The second specifies how the data must be released: in a raw, consumable format on the city’s data website, the Open Data Portal.
The law—possibly the most comprehensive legislation of its kind in the country—is poised to radically increase the quantity and granularity of public information available on New York City—provided it is implemented as written. Noncompliance, however, is not penalized under the law, and there is no incentive other than public pressure for agencies to meet the law’s deadlines.
Therefore, the better part of the hearing focused on whether the city’s departments had identified the appropriate data sets and whether they are on schedule to meet the upcoming deadlines. Spoiler alert: They hadn’t, and they aren’t. In fact, in an ironic twist, the extent to which New York City has been noncompliant with the open data law is so great that city officials have been advised not to release data on their noncompliance.
The City’s Limitations
Releasing data is not an all-or-nothing affair, and to its credit, the city does release a great deal of public information about New York City on a website called NYCStat. Yet that sort of data is “cooked” (i.e., manipulated, no longer raw), and most of the raw, unstructured data released in compliance with the law have been relatively low-stakes and nonpolitical, and therefore not within the “open by default” spirit of the law.
In fairness, compliance with the law is no small task. Governments capture a tremendous amount of information, and, over the last decade, the automation of government services like payroll, emergency response, and fleet management has vastly accelerated the rate of collection. While New York City, like the private sector, has embraced the Big Data movement and begun leveraging these data to evaluate its policies, agencies are not equipped to organize, document, and contextualize their data for widespread consumption.
The reason is that monumental difficulties are associated with collecting and releasing unstructured data in a usable format. For starters, merely identifying the universe of data sets has been an obstacle for the city. Even if all the data were collected into a usable format, disseminating the information while protecting the integrity of the data and, as you might suspect, the reputation of the agencies is also a challenge. The question, then, is to what lengths the city should go to provide the public open data as specified by the law. Even more important: What responsibility does the city have for ensuring the data (and any conclusions drawn from them) accurately reflect the process that generated them? How should the city prioritize in the face of all its limitations?
The coalition of advocates for open data that came to the November hearing was diverse, consisting of representatives from technology companies, the nascent civic hacker movement, and the traditional transparency watchdogs. Each advocacy group had an entirely different reason for wanting the open data law to succeed, and open data meant something different to each of them. Some advocates stated clear interests in the benefits to businesses that could create jobs and increase the quality of life for residents (and incidentally increase city tax revenues). Others argued the government side, believing the data will promote transparency and accountability and “crowd source” difficult government problems by unleashing the power of creativity. Equally diverse were the advocates’ technical abilities to manipulate the data.
Perhaps it was due to this diversity that the coalition focused its testimony almost entirely on its mantra of receiving good data, something on which all the members could agree. The members did not specify the types of data and documentation that should be provided given the city’s limitations. This is understandable since, for the most part, these groups have never worked with the data they hope to make public or with the agencies that actually produce the data.
Nevertheless, hammering out these details is vital. City law already requires city agencies to release their data, and I expect they ultimately will, but unless the specifics are agreed upon, there is a danger the data will be released in a form desirable to no one.
An honest conversation about how best to release the data requires an understanding of how the city functions and experience working with city data. Ideally, a nonpartisan body with this background would lead the discussion on how data can be delivered to the public responsibly. Academia certainly fits the bill. A logical next step would be an engagement between active researchers, the coalition, and the NYC Council. They could hold public seminars, showcase the use of open data sets in research, or send representatives to hearings and public discussions. Despite academia having taken the initiative in defining the proper use of data in many other disciplines, this sort of leadership has largely been absent in the open data oversight process so far.