
Calculation of a confidence interval of semantic distance estimates obtained using a large diachronic corpus
Author(s) -
Vladimir V. Bochkarev,
Anna V. Shevlyakova
Publication year - 2021
Publication title -
Journal of Physics: Conference Series
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.21
H-Index - 85
eISSN - 1742-6596
pISSN - 1742-6588
DOI - 10.1088/1742-6596/1730/1/012031
Subject(s) - statistic, semantics (computer science), word lists by frequency, word (group theory), computer science, natural language processing, artificial intelligence, distributional semantics, semantic similarity, mathematics, statistics, geometry, sentence, programming language
Several methods have been proposed for detecting changes in word semantics and the emergence of new word meanings. These methods rely on different techniques for estimating the semantic distance between words, based either on neural network vector models or on simpler vector representations built from the frequencies of n-grams that include the studied words. This paper proposes a method for calculating the confidence interval of semantic distance estimates obtained from the frequency data of n-grams extracted from a large diachronic corpus. The task is complicated by the fact that, despite a number of studies, the distribution law of frequency fluctuations of words and n-grams remains an open question. The confidence intervals are calculated by statistical modeling using random permutations of n-gram frequencies. The proposed method is tested on the example of estimating the semantic distance between two Russian synonyms.
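
The abstract does not spell out the resampling procedure, so the following is only a minimal sketch of one way a permutation-based interval could be computed, not the authors' implementation. It assumes that each word is described by a table of context n-gram frequencies over several adjacent years, that the semantic distance is the cosine distance between frequency vectors, and that the frequency of each context is shuffled independently across the years of the window. The function names, the choice of cosine distance, and the toy numbers are all hypothetical.

```python
import math
import random


def cosine_distance(u, v):
    """Cosine distance between two frequency vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # convention chosen here for an empty vector
    return 1.0 - dot / (norm_u * norm_v)


def permuted_vector(freqs_by_year, rng):
    """Build one surrogate vector from a years-by-contexts frequency table.

    For every context n-gram, its frequencies are shuffled across the
    years of the window and the first shuffled value is kept.
    """
    n_contexts = len(freqs_by_year[0])
    vector = []
    for j in range(n_contexts):
        column = [row[j] for row in freqs_by_year]
        rng.shuffle(column)
        vector.append(column[0])
    return vector


def distance_confidence_interval(freqs_a, freqs_b, n_iter=2000,
                                 level=0.95, seed=0):
    """Percentile interval of the distance over random permutations."""
    rng = random.Random(seed)
    samples = sorted(
        cosine_distance(permuted_vector(freqs_a, rng),
                        permuted_vector(freqs_b, rng))
        for _ in range(n_iter)
    )
    alpha = (1.0 - level) / 2.0
    lower = samples[math.floor(alpha * (n_iter - 1))]
    upper = samples[math.ceil((1.0 - alpha) * (n_iter - 1))]
    return lower, upper


if __name__ == "__main__":
    # Hypothetical toy data: 3 years (rows) x 4 context n-grams (columns).
    word_a = [[10, 5, 0, 2], [12, 4, 1, 2], [9, 6, 0, 3]]
    word_b = [[8, 6, 1, 1], [11, 3, 0, 2], [10, 5, 2, 2]]
    print(distance_confidence_interval(word_a, word_b))
```

The percentile interval makes no assumption about the distribution law of the frequency fluctuations, which fits the abstract's remark that this law is still an open question. The window width and the number of iterations are free parameters of this sketch.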