The identification of variable-length, equifrequent character strings in a natural language data base
Author(s) - Angela Clare
Publication year - 1972
Publication title - The Computer Journal
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.319
H-Index - 64
eISSN - 1460-2067
pISSN - 0010-4620
DOI - 10.1093/comjnl/15.3.259
Subject(s) - character (mathematics) , computer science , word (group theory) , inverted index , zipf's law , rank (graph theory) , variable (mathematics) , base (topology) , poisson distribution , identification (biology) , index (typography) , natural language processing , artificial intelligence , linguistics , statistics , search engine indexing , mathematics , combinatorics , world wide web , mathematical analysis , philosophy , geometry , botany , biology
s Service. This is issued biweekly, and includes the titles, authors' names, and bibliographic references of currently published articles of chemical interest. The issue used was No. 1, 1971, dated 11 January. A typical entry from the issue is shown in Fig. 1; the bibliographic reference is given as the ASTM Coden. The titles are recorded in upper-case characters. An occasional artefact arises through the insertion of additional space symbols; the printed publication includes a KWIC (Key Word In Context) index, and the spaces ensure that certain chemical word stems such as QUINONE in Fig. 1 (the word is normally written as PHYLLOQUINONE) are indexed.
A set of simple programs (written in PLAN, the ICL 1900 series assembly language) was devised to produce counts of n-grams (i.e., strings of 1, 2, 3 and 5 characters), including the space character, for values of n between 1 and 5. The program to count single character occurrences used the binary value of the character code to address a position in a 62-word array. The digrams were counted by using a two-dimensional array (62 x 62 = 3844). Longer n-grams (n = 3 and 5) were created by taking a window equal to that number of characters and moving it along the title record, creating a new record at each position (a space was inserted as the initial character of each title). The records were written to tape, and subsequently sorted, counted and printed.
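The three counting strategies described above can be illustrated with a minimal Python sketch. It is not the original PLAN code. The 62-symbol `ALPHABET` is a hypothetical stand-in, since the source gives only the alphabet's size and not its contents. The function names are likewise illustrative, and an in-memory sort stands in for the tape sort.

```python
from itertools import groupby

# Hypothetical 62-symbol alphabet: space, 26 letters, 10 digits, 25 other marks.
# The source states only that there were 62 symbols, not which ones.
ALPHABET = " ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789" + '.,;:-+/()*=&%$#@!?<>[]_"^'
CODE = {ch: i for i, ch in enumerate(ALPHABET)}  # character -> array position
SIZE = len(ALPHABET)


def count_unigrams(titles):
    """Use each character's code to address a slot in a flat array."""
    counts = [0] * SIZE
    for title in titles:
        for ch in title:
            counts[CODE[ch]] += 1
    return counts


def count_digrams(titles):
    """Use the codes of two adjacent characters to address a SIZE x SIZE array."""
    counts = [[0] * SIZE for _ in range(SIZE)]
    for title in titles:
        for first, second in zip(title, title[1:]):
            counts[CODE[first]][CODE[second]] += 1
    return counts


def count_ngrams(titles, n):
    """Slide an n-character window along each title, then sort and count."""
    records = []
    for title in titles:
        text = " " + title  # a space is inserted as the initial character
        for start in range(len(text) - n + 1):
            records.append(text[start:start + n])  # one record per position
    records.sort()  # stands in for the tape sort
    # After sorting, equal records are adjacent, so one pass counts each run.
    return [(gram, sum(1 for _ in run)) for gram, run in groupby(records)]
```

Direct addressing suits n = 1 and n = 2 because the arrays stay small (62 and 3844 cells). For longer strings the number of possible n-grams grows as 62 to the power n, which is presumably why the records were sorted and counted instead of being tallied in an array.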