
Examining the uses of shared data
Author(s) - Heather Piwowar, Douglas B. Fridsma
Publication year - 2007
Language(s) - English
Resource type - Journals
ISSN - 1756-0357
DOI - 10.1038/npre.2007.425.1
Subject(s) - reuse, computer science, data science, microarray databases, microarray analysis techniques, microarray, data mining, biology, gene, genetics, gene expression, ecology
Background
Many initiatives and repositories exist to encourage the sharing of research data, and thousands of microarray gene expression datasets are publicly available. Many studies reuse this data, but it is not well understood which datasets are reused and for what purpose.
Materials and Methods
We trained a machine-learning algorithm to automatically classify full-text gene expression microarray studies into two classes: those that generated original microarray data (n=900) and those that only reused data (n=250). We then compared the Medical Subject Heading (MeSH) terms of the two classes to identify MeSH topics that were over- or under-represented among publications with reused data.
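The comparison of MeSH terms between the two classes can be illustrated with an odds ratio computed from a 2x2 table. The following is a minimal sketch, not the authors' actual procedure: the function name `mesh_odds_ratio`, the per-term counts, the 95% Wald confidence interval, and the Haldane-Anscombe zero-cell correction are all assumptions made for illustration. Only the class sizes (250 reuse studies, 900 original-data studies) come from the text above.

```python
import math


def mesh_odds_ratio(reuse_with, reuse_total, orig_with, orig_total):
    """Odds ratio (with a 95% Wald interval) for one MeSH term.

    An OR above 1 means the term is over-represented among data-reuse
    studies; an OR below 1 means it is under-represented.
    """
    a = reuse_with                # reuse studies tagged with the term
    b = reuse_total - reuse_with  # reuse studies without the term
    c = orig_with                 # original-data studies with the term
    d = orig_total - orig_with    # original-data studies without the term

    # Assumed Haldane-Anscombe correction: add 0.5 to every cell when
    # any cell is zero, so the ratio and its log stay finite.
    if 0 in (a, b, c, d):
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))

    odds_ratio = (a * d) / (b * c)
    std_err = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - 1.96 * std_err)
    upper = math.exp(math.log(odds_ratio) + 1.96 * std_err)
    return odds_ratio, lower, upper


# Hypothetical counts for a single MeSH term (not taken from the study):
# 50 of the 250 reuse studies and 100 of the 900 original-data studies.
ratio, lower, upper = mesh_odds_ratio(50, 250, 100, 900)
print(f"OR = {ratio:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

Repeating this calculation for every MeSH term, and keeping the terms whose interval excludes 1, would give one rough way to flag over- or under-represented topics. A real analysis would also need to account for multiple comparisons.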
Results
Studies on humans, mice, chordata, and invertebrates were equally likely to be conducted using original or shared microarray data, whereas shared data was used in a relatively high proportion of studies involving fungi (odds ratio (OR)=2.4), and a relatively low proportion involving rats, bacteria, viruses, plants, or genetically-altered or inbred animals (OR<0.05). Unsurprisingly, when we used Major MeSH terms to represent the primary purpose of each study, statistical and computational methods clearly dominated among data-reuse studies. The only biomedical Major MeSH topics with a relatively high proportion of data reuse were Promoter Regions, Evolution, and Protein Interaction Mapping.
Discussion
Identifying areas of particularly successful microarray data reuse, such as Saccharomyces cerevisiae datasets and studies of promoter regions and evolution, can highlight best practices to apply when developing research agendas, tools, standards, repositories, and communities in areas that have yet to receive major benefits from shared data.