Data Sharing and the Scientific Community
I wish to reply to Professor Devore’s letter in the January 2010 issue of Amstat News. Some journals are now requesting that data sets be made available for use by others, but this is not possible for all data sets for several reasons.
First, there is the problem that Professor Devore has already encountered: The data set does not belong to the authors of the article, but belongs to a pharmaceutical or other company. If this is the case, then ask the company concerned; do not blame the author approached.
Second, especially in the case of human subjects, the privacy laws are very stringent, rightly so in my opinion. The laws are enforced and are becoming more restrictive with respect to publication as the years go on. No author would risk sending data that has not been completely de-identified. Further, they would have to approach their institutional review board for permission to send it. Professor Devore would have to say exactly what he wanted to do with it, how it would be protected, who would have access to it, etc., and prove that his purpose really requires that data set. The whole process would take months or more. Unless one owes Professor Devore a great debt, the time and effort would not be worth it to most professionals. This would also apply to data collected before any privacy laws were enforced.
Is there a risk to people’s privacy? Yes. Let me give an example based on a similar case, not the original example. A trial was done comparing two options with a two-year follow-up. In one of the papers written about the data, it was stated that there was only one case of a massive infection and that patient was HIV positive. There is no way that data can now be released with that case in it. If you remove the case, the data set will be different and, from the published results, it will be possible to estimate the age, sex, etc. of the missing case. You cannot assume that, with the correct resources and knowledge, no one can trace that patient and therefore their identity. Many of us can see how it can be done.
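The arithmetic behind this risk can be shown with a small sketch. It assumes, hypothetically, that a paper published the exact mean age and the number of men for all patients, and that the released file withholds exactly one patient. Every name and number below is invented for illustration; none comes from the trial described above.

```python
def recover_missing_case(published_n, published_mean_age,
                         published_male_count, released):
    """Infer the age and sex of the single record withheld from a release.

    released: list of (age, sex) tuples for the records that were shared.
    All inputs are hypothetical; the published summaries are assumed exact.
    """
    if published_n - len(released) != 1:
        raise ValueError("this sketch only handles exactly one withheld record")

    # Total age over all patients, reconstructed from the published mean.
    total_age = published_mean_age * published_n
    released_age = sum(age for age, _ in released)
    missing_age = total_age - released_age

    # The published count of men, minus the men still in the file,
    # shows whether the withheld patient was a man.
    released_males = sum(1 for _, sex in released if sex == "M")
    missing_sex = "M" if published_male_count - released_males == 1 else "F"
    return missing_age, missing_sex


# Invented example: five patients in the paper, four in the released file.
released_records = [(34, "F"), (51, "M"), (47, "F"), (62, "M")]
age, sex = recover_missing_case(
    published_n=5,
    published_mean_age=46.0,
    published_male_count=3,
    released=released_records,
)
print(age, sex)  # prints "36.0 M"
```

In practice published means are rounded, so the subtraction would give a narrow range rather than an exact age, but a narrow range combined with sex and a known trial site may still be enough to single out one person.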
The question of errors is, in most cases, minor. Errors abound. Reviewers should spot many of them, but often we do not—the extra zero changing 0.0002 to 0.00002, the wrong p-value quoted, sometimes the wrong subscript in an equation. If the main outcome is not affected, most people admit it at meetings, but otherwise accept that these things happen. There is a paper from the 1980s—a very important paper that I often refer people to—that has an attachment to its web version admitting an error in the logarithm base used. It is still an important paper.
The other problem is that many data sets are used many times over (reducing, of course, the value of successive analyses) and often used internally for graduate work. Releasing the data set for others wastes the time and resources invested in the data set, as well as requiring new institutional reviews. Even data without human subjects takes time and money: collecting the specimens, preparing them for study, and running tests, often requiring nightly trips to the laboratory. Why not join up with a research group [or] tutor technicians and students in return for the later use of the data? The best one can do is create dummy data sets for teaching. It is what I preferred to do.
M. G. E. Peterson
Response from Steve Pierson, ASA Director of Science Policy
January’s letter to the editor from Jay Devore reports on his difficulties obtaining data sets from journal article authors. Barring the privacy, proprietary, and other concerns offered by M.G.E. Peterson, few of us would disagree that sharing data should be part of the scientific community’s culture.
Affirming “data sharing is essential for expedited translation of research results into knowledge, products, and procedures to improve human health,” the National Institutes of Health (NIH) has a strong data-sharing policy and facilitates appropriate data sharing.
NIH officials in the Office of Extramural Research (OER) recommend the following process for requesting data, assuming the data are appropriate for sharing:
i) Approach the principal investigator (PI) of the NIH grant that funded the data you seek
ii) If unsuccessful, have your institution’s research/grant administration office approach the PI’s home institution research/grant administration office
iii) If unsuccessful, have your institution’s research/grant administration office approach the NIH extramural research office of the institute or center that funded the grant
iv) If unsuccessful, contact OER for additional guidance
Resistance to data sharing was brought to my attention on my first day with the ASA by Stan Young of the National Institute of Statistical Sciences (NISS). In 2008, Young and I selected a data set produced by two NIH grants and followed the above procedure, except that we contacted the PI directly because, at that time, the OER procedure did not specify that a research/grant administration office should carry out steps i and ii. After step ii, the PI emailed me, saying her group would not provide the data because it didn’t have to (since its grants were funded prior to NIH’s 2003 data-sharing policy).
Not finding step iii helpful, I contacted OER again and an officer was assigned to assist Stan and me. His approach was to facilitate communication between parties and provide them with guidance on NIH policies. He recommended that the contact be between the research/grant administration offices of the requestor and the PI since institutions, as the actual grant recipients, are responsible for administering the grants consistent with the terms and conditions of the grant award. He offered to provide guidance on NIH policies to both administration offices and to facilitate communication between the institutions if the discussions did not progress satisfactorily.
This process is involved and takes time and persistence. It is also not guaranteed to yield success; Stan’s and my request has yet to yield the data. (I did not track the process after the administration office to administration office contact was recommended.) Nevertheless, I see the NIH process as the most promising route in the short run if one is seeking data obtained through an NIH grant.
Devore also stated, “Authors of an article to be published in a nonproprietary journal have an obligation to make their data available to anyone requesting it.” While I’m not sure how one convinces other journals to adopt such a policy, the Journal of the American Statistical Association (JASA) does have a data policy along these lines: “Whenever a data set is used, its source should be fully documented. When it is not practical to include the whole of a data set in the paper, the paper should state how the complete data set can be obtained.” Unfortunately, I think JASA is more the exception than the rule.
Steve Pierson, ASA Director of Science Policy