Where to Share Your Data?

Thursday, July 16, 2009 at 06:29 PM EDT

There are many competing standards out there for how to publish datasets with due credit to the author and publisher. Rich, structured metadata and interoperable standards for data identification are rapidly developing, but it’s not clear which standard is going to win the day or which search engines will successfully organize all that data.

Here is my short wish list for data standards:

  • datasets behind peer-reviewed scientific publications should be openly available
  • datasets from publicly funded scientific research should be openly available
  • dataset publishing standards should include units of measure for quantitative datasets
  • dataset publishing standards should describe the categorical variables a dataset contains
  • dataset publishing standards should contain full bibliographic information
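To make the last three items on the wish list concrete, here is a minimal sketch of a dataset record and a check that flags missing metadata. It does not follow any existing standard: the class names, field names, and the example dataset (including its URL) are all hypothetical, invented only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Variable:
    """One column of a dataset (hypothetical schema, not a real standard)."""
    name: str
    kind: str  # "quantitative" or "categorical"
    unit: Optional[str] = None  # unit of measure, for quantitative variables
    categories: Dict[str, str] = field(default_factory=dict)  # code -> description


@dataclass
class DatasetRecord:
    """Bibliographic information plus variable descriptions."""
    title: str
    authors: List[str]
    publisher: str
    year: int
    url: str
    variables: List[Variable]


def missing_metadata(record: DatasetRecord) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    bibliographic = [
        ("title", record.title),
        ("authors", record.authors),
        ("publisher", record.publisher),
        ("url", record.url),
    ]
    for label, value in bibliographic:
        if not value:
            problems.append(f"missing bibliographic field: {label}")
    for var in record.variables:
        if var.kind == "quantitative" and not var.unit:
            problems.append(f"{var.name}: no unit of measure")
        if var.kind == "categorical" and not var.categories:
            problems.append(f"{var.name}: no category descriptions")
    return problems


# Made-up example record: the rainfall column lacks a unit.
example = DatasetRecord(
    title="Monthly Rainfall by Station",
    authors=["Doe, Jane"],
    publisher="Example Weather Office",
    year=2009,
    url="http://example.org/rainfall",
    variables=[
        Variable(name="rainfall", kind="quantitative"),
        Variable(name="station_type", kind="categorical",
                 categories={"A": "automated", "M": "manual"}),
    ],
)
print(missing_metadata(example))
```

An aggregator could run a check like this before indexing a dataset, and a publisher could run it before releasing one.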

Once structured information about datasets is available in an open format, aggregators will step into the breach and serve that information to researchers, along with well-maintained links to the data publishers’ sites. The main barrier to aggregating and indexing information about datasets is the lack of common standards, not a dearth of universities and companies willing to do the job.

Academics seem to have one set of standards, and government websites another. Upstart organizations (both for-profit and nonprofit) are piling into this space. Their websites’ grandiose claims to universal data cataloging are completely at odds with the slim pickings you’ll find if you bother to visit.

Every discipline has a relatively small number of expert publishers that aggregate information for researchers. Access to the most important datasets may be open or closed, but the links are gathered together by professional societies. Academic departments and courses also have resource portals for students and practitioners.

Until recently, the vast majority of datasets were housed on purpose-built websites. For books, this would be the equivalent of having a single library for each book, or at best for each publisher. We need the equivalent of a library for online datasets. Even though publishers ultimately retain authorship of (and control over) their data, it is senseless that libraries’ search engines for datasets lag so far behind search engines for books and journal articles.

Bibliographic software (EndNote, BibTeX, Zotero) needs to catch up too. It is now common practice in academia to cite the datasets used in your writing if you are using any publicly available data as correlates for the primary observations in your study.
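Until the tools catch up, one workaround is to record a dataset as a generic BibTeX `@misc` entry. The sketch below builds such an entry from basic bibliographic fields. The function name, the citation key, and the example dataset are hypothetical, and it assumes the values contain no characters that need escaping in LaTeX.

```python
def dataset_bibtex(key, authors, title, publisher, year, url):
    """Format a dataset citation as a BibTeX @misc entry (illustrative only)."""
    fields = [
        ("author", " and ".join(authors)),       # BibTeX joins authors with "and"
        ("title", "{" + title + "}"),            # extra braces preserve capitalization
        ("howpublished", publisher),
        ("year", str(year)),
        ("note", "Dataset, available at " + url),
    ]
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields)
    return f"@misc{{{key},\n{body}\n}}"


# Made-up example dataset.
entry = dataset_bibtex(
    key="doe2009rain",
    authors=["Doe, Jane", "Roe, Richard"],
    title="Monthly Rainfall by Station",
    publisher="Example Weather Office",
    year=2009,
    url="http://example.org/rainfall",
)
print(entry)
```

Using `@misc` is a compromise: it fits in any existing bibliography, but nothing in the entry marks it as a dataset except the free-text note.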