The weekly science journal Nature just published an article on online data sharing that quotes me. My comments are from an e-mail exchange that I had with their Senior Reporter Declan Butler about the potential of new online data sharing sites such as Swivel and IBM’s Many Eyes. I’ve posted about Many Eyes before.
According to Declan’s e-mail to me, some scientists are already using these new tools to share sequence and microarray data. The potential value from scientists openly sharing their data is huge, possibly akin to the value provided by open-source software development. More people exploring data is always a good thing, and someone could discover meaningful information in data that the original owner/researcher missed. Or one's interests might differ from those of the original owner/researcher, and thus one could analyze the data in a different way that addresses questions the original researcher did not investigate. In a scientific publication, the author can't produce every possible permutation of the data that readers might want, so letting the "reader" explore the data themselves through online accessibility has value. As Edward Tufte says in his book Visual Explanations,
When assessing evidence, it is helpful to see a full data matrix, all observations for all variables, those private numbers from which the public displays are constructed. No telling what will turn up.
(Thanks to Squaring the Globe blog for providing this quote.)
Anyone who has tried to obtain the raw data behind published research, however, knows that it can be difficult to get for many reasons: researchers have difficulty retrieving the data from media that are no longer used, they do not have the time to search for and provide the data in an understandable format, or they simply do not want to lose any perceived advantage in pursuing future funding.
I’ve thought that a way around this is for the NIH (or whatever the funding organization is) to require that all data from NIH-funded research be submitted to the NIH and be made publicly available. There are many difficulties with this proposal, of course, not the least of which is ensuring that others know how to read and interpret the data. The potential for misinterpretation would be huge. One possible solution would be to make available only data associated with a publication that details the methods and procedures of the data collection. This could become a policy that the publishing journal mandates rather than the funding organization.
I’ve been told that a proposal was made within the NIH to do just this several years ago for a discipline that is data-heavy, but the scientists in that field shot down the idea for several reasons, one of which was that they didn’t want any errors in their own data analysis discovered. Whatever the reasons, published figures and tables have been the primary means of transmitting data for hundreds of years. With today’s electronic tools, there is no reason to limit our data sharing to techniques developed centuries ago.