Dictionary production for Census Form Conference | Zendy

Open Access

Dictionary production for Census Form Conference

Author(s) - National Institute of Standards and Technology

Publication year - 1993

Language(s) - English

Resource type - Reports

DOI - 10.6028/nist.ir.5180

Subject(s) - census, production (economics), geography, computer science, data science, demography, sociology, economics, population, macroeconomics

There are two categories of data from which dictionaries can be produced. One uses old data, that is, data from a previous collection, and the other uses new data from a current collection. Old data yields dictionaries that can be used to provide examples of possible answers, to assist optical character recognition (OCR) systems, and to train recognition systems. New data is most useful for testing and scoring system results. For each of these categories there are two types of dictionaries, which may be useful for work with the Second Census OCR Conference. The first contains all words that have occurred in the data set being used; for this experimental work, the data set is from the 1980 Census. These words can be misspellings, abbreviations, or correctly spelled words. This first, or essential, dictionary is easier to create and will not increase the errors that exist in the original data. It contains all the misspellings, abbreviations, and other errors that occurred when the original data was keyed from the original paper questionnaires, which makes it useful for describing the potential contents of a form set. The second dictionary can be built from the essential dictionary. It has the misspellings corrected, the abbreviations expanded, and all the words stemmed into logical minimal stems. A mapping from the essential dictionary to this second, or exploratory, dictionary is required. The exploratory dictionary is harder to create and may produce more errors than it corrects. It needs human assistance to be created, since most of the steps cannot be fully automated. It is also useful in showing a comprehensive list of the possible entries in a form set from previous data, in our case the 1980 Census.
Short and long dictionaries can be produced from both of these dictionaries. The long dictionary contains all of the data from the original Census data, while the short dictionary contains only the phrases, and possibly words, that occur more than once. The short phrase dictionaries are approximately 16% of the length of the long phrase dictionaries. They also contain 60% to 70% of the phrases in the original sample of 132247 phrases. The short word dictionaries are approximately 45% of the size of the long word dictionaries. About 95% of the words found in the long phrase dictionaries are contained in these short …
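
The distinction between long and short dictionaries described above can be illustrated with a minimal Python sketch. It is not the report's implementation: the function name `build_dictionaries`, the whitespace-only normalization, and the sample phrases are assumptions made for illustration. Entries are kept exactly as keyed, misspellings included, in the spirit of the essential dictionary.

```python
from collections import Counter


def build_dictionaries(phrases):
    """Build long and short phrase/word dictionaries from keyed phrases.

    Hypothetical helper for illustration only. Phrases are kept as keyed
    (misspellings and abbreviations included); only runs of whitespace
    are collapsed, which is an assumption of this sketch.
    """
    cleaned = [" ".join(p.split()) for p in phrases if p.strip()]

    phrase_counts = Counter(cleaned)
    word_counts = Counter(w for p in cleaned for w in p.split())

    # Long dictionaries: every distinct entry seen in the data.
    long_phrases = sorted(phrase_counts)
    long_words = sorted(word_counts)

    # Short dictionaries: only entries that occur more than once.
    short_phrases = sorted(p for p, n in phrase_counts.items() if n > 1)
    short_words = sorted(w for w, n in word_counts.items() if n > 1)

    # Fraction of the sample's phrases that appear in the short phrase dictionary.
    covered = sum(n for n in phrase_counts.values() if n > 1)
    coverage = covered / len(cleaned) if cleaned else 0.0

    return {
        "long_phrases": long_phrases,
        "short_phrases": short_phrases,
        "long_words": long_words,
        "short_words": short_words,
        "phrase_coverage": coverage,
    }


if __name__ == "__main__":
    # Made-up sample, not Census data.
    sample = ["SALES CLERK", "SALES CLERK", "TEACHER", "SALES  MANAGER", "TEACHR"]
    result = build_dictionaries(sample)
    for name, value in result.items():
        print(name, value)
```

On real data, comparing `len(short_phrases)` with `len(long_phrases)` and inspecting `phrase_coverage` would give figures analogous to the size and coverage percentages quoted above, although the exact values depend on the data and on the normalization chosen.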
