ParaText : scalable solutions for processing and searching very large document collections : final LDRD report.
Author(s) - Patricia J. Crossno, Daniel Dunlavy, Eric Stanton, Timothy M. Shead
Publication year - 2010
Language(s) - English
Resource type - Reports
DOI - 10.2172/1007321
Subject(s) - computer science, scalability, paratext, deliverable, information retrieval, visualization, software, scale (ratio), world wide web, data science, data mining, database, programming language, engineering, systems engineering, art, physics, literature, quantum mechanics
This report summarizes the accomplishments of the 'Scalable Solutions for Processing and Searching Very Large Document Collections' LDRD, which ran from FY08 through FY10. Our goal was to investigate scalable text analysis; specifically, methods for information retrieval and visualization that could scale to extremely large document collections. Toward that end, we designed, implemented, and demonstrated a scalable framework for text analysis, ParaText, as a major project deliverable. Further, we demonstrated the benefits of using visual analysis in text analysis algorithm development, improved performance of heterogeneous ensemble models in data classification problems, and the advantages of information-theoretic methods in user analysis and interpretation in cross-language information retrieval. The project involved 5 members of the technical staff and 3 summer interns (including one who worked two summers). It resulted in a total of 14 publications, 3 new software libraries (2 open source and 1 internal to Sandia), several new end-user software applications, and over 20 presentations. Several follow-on projects have already begun or will start in FY11, with additional projects currently in proposal.