Large-scale digitization of herbarium specimens: Development and usage of an automated, high-throughput conveyor system
Author(s) -
Sweeney Patrick W.,
Starly Binil,
Morris Paul J.,
Xu Yiming,
Jones Aimee,
Radhakrishnan Sridhar,
Grassa Christopher J.,
Davis Charles C.
Publication year - 2018
Publication title -
taxon
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.819
H-Index - 81
eISSN - 1996-8175
pISSN - 0040-0262
DOI - 10.12705/671.10
Subject(s) - digitization, herbarium, workflow, container (type theory), computer science, metadata, throughput, task (project management), database, information retrieval, world wide web, engineering, operating system, ecology, telecommunications, biology, mechanical engineering, systems engineering, wireless
Abstract The billions of specimens housed in natural science collections provide a tremendous source of under-utilized data that are useful for scientific research, conservation, commerce, and education. Digitization and mobilization of specimen data and images promise to greatly accelerate their utilization. While digitization of natural science collection specimens has been occurring for decades, the vast majority of specimens remain un-digitized. If the digitization task is to be completed in the near future, innovative, high-throughput approaches are needed. To create a dataset for the study of global change in New England, we designed and implemented an industrial-scale, conveyor-based digitization workflow for herbarium specimen sheets. The workflow is a variation of an object-to-image-to-data workflow that prioritizes imaging and the capture of storage container-level data. The workflow utilizes a novel conveyor system developed specifically for the task of imaging flattened herbarium specimens. Using our workflow, we imaged and transcribed specimen-level data for almost 350,000 specimens over a 131-week period; an additional 56 weeks was required for storage container-level data capture. Our project has demonstrated that it is possible to capture both an image of a specimen and a core database record in 35 seconds per herbarium sheet (for intervals between images of 30 minutes or less) plus some additional overhead for container-level data capture. This rate was in line with the pre-project expectations for our approach. Our throughput rates are comparable with those of some other similar, high-throughput approaches focused on digitizing herbarium sheets and are as much as three times faster than rates achieved with more conventional non-automated approaches used during the project. We report on challenges encountered during development and use of our system and discuss ways in which our workflow could be improved.
The conveyor apparatus software, database schema, configuration files, hardware list, and conveyor schematics are available for download on GitHub.