Dictionary production for Census Form Conference
Author(s) - Ross Wilkinson
Publication year - 1993
Language(s) - English
Resource type - Reports
DOI - 10.6028/nist.ir.5180
Subject(s) - census, production (economics), geography, computer science, data science, demography, sociology, economics, population, macroeconomics
There are two categories of data from which dictionaries can be produced. One uses old data, that is, data from a previous collection; the other uses new data from a current collection. Old data yields dictionaries that can be used to provide examples of possible answers, to assist optical character recognition (OCR) systems, and to train recognition systems. New data is most useful for testing and for scoring system results. For each of these categories there are two types of dictionaries, both of which may be useful for work with the Second Census OCR Conference. The first contains all words that have occurred in the data set being used. For this experimental work, the data set is from the 1980 Census. These words can be misspellings, abbreviations, or correctly spelled words. This first, or essential, dictionary is easier to create and will not increase the errors that exist in the original data. It contains all the misspellings, abbreviations, and other errors that occurred when the original data was keyed from the original paper questionnaires. This makes the dictionary useful in describing the potential contents of a form set. The second dictionary can be built from the essential dictionary. It is one in which the misspellings are corrected, the abbreviations expanded, and all the words stemmed into logical minimal stems. A mapping from the essential dictionary to this second, or exploratory, dictionary is required. The exploratory dictionary is harder to create and may produce more errors than it corrects. It needs human assistance to be created, since most of the steps cannot be fully automated. It is also useful in showing a comprehensive list of the possible entries in a form set from previous data, in our case the 1980 Census. Short and long dictionaries can be produced from both of these dictionaries.
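The relationship between the essential and exploratory dictionaries described above can be sketched in Python. This is only an illustrative sketch, not the procedure used in the report: the function names, the sample phrases, the correction and abbreviation tables, and the crude suffix stemmer are all hypothetical, and in practice the tables would have to be built with human assistance.

```python
from collections import Counter


def build_essential_dictionary(phrases):
    """Count every word exactly as keyed, misspellings and abbreviations included."""
    counts = Counter()
    for phrase in phrases:
        counts.update(phrase.upper().split())
    return counts


def naive_stem(word):
    """Hypothetical placeholder stemmer: strips a trailing ING or S only."""
    for suffix in ("ING", "S"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_exploratory_mapping(essential, corrections, abbreviations, stem):
    """Map each essential word to its corrected, expanded, and stemmed form."""
    mapping = {}
    for word in essential:
        fixed = corrections.get(word, word)          # fix known misspellings
        expanded = abbreviations.get(fixed, fixed)   # expand known abbreviations
        mapping[word] = stem(expanded)               # reduce to a minimal stem
    return mapping


# Made-up sample phrases, not taken from the 1980 Census data.
sample = ["ASST MANAGER", "ASSISTANT MANGER", "SALES MANAGER"]

essential = build_essential_dictionary(sample)
mapping = build_exploratory_mapping(
    essential,
    corrections={"MANGER": "MANAGER"},    # hand-built table (assumption)
    abbreviations={"ASST": "ASSISTANT"},  # hand-built table (assumption)
    stem=naive_stem,
)
exploratory = sorted(set(mapping.values()))
```

Keeping the mapping as an explicit table, rather than rewriting the essential dictionary in place, preserves the original keyed forms, so any error introduced by a correction or by stemming can be traced back and undone.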
The long dictionary contains all of the data from the original Census data, while the short dictionary contains only the phrases, and possibly words, that occur more than once. The short phrase dictionaries are approximately 16% of the length of the long phrase dictionaries. They also contain 60% to 70% of the phrases in the original sample of 132,247 phrases. The short word dictionaries are approximately 45% of the size of the long word dictionaries. About 95% of the words found in the long phrase dictionaries are contained in these short …
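The long/short split and the two kinds of percentage quoted above (dictionary size and share of the sample covered) can be illustrated with a small Python sketch. It assumes that the long dictionary holds one entry per distinct phrase and that coverage is measured over phrase occurrences in the sample; both readings are assumptions, and the function names and sample phrases are hypothetical.

```python
from collections import Counter


def long_and_short(entries):
    """Return occurrence counts, the long dictionary, and the short dictionary."""
    counts = Counter(entries)
    long_dict = sorted(counts)                                  # every distinct entry
    short_dict = sorted(e for e, n in counts.items() if n > 1)  # seen more than once
    return counts, long_dict, short_dict


def coverage(counts, dictionary):
    """Fraction of all occurrences in the sample that the dictionary accounts for."""
    total = sum(counts.values())
    covered = sum(counts[entry] for entry in dictionary)
    return covered / total


# Made-up sample phrases, not taken from the Census data.
phrases = [
    "SALES CLERK", "SALES CLERK", "SALES CLERK",
    "NURSE", "NURSE",
    "NURSE AIDE", "FARM HAND", "TRUCK DRVR",
]

counts, long_dict, short_dict = long_and_short(phrases)
size_ratio = len(short_dict) / len(long_dict)   # 2 of 5 distinct phrases
sample_coverage = coverage(counts, short_dict)  # 5 of 8 occurrences
```

In this toy sample the short dictionary is 40% of the size of the long one yet covers 62.5% of the phrase occurrences, which mirrors the pattern reported above: a small dictionary of repeated entries accounts for a disproportionate share of the sample.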