UNSUPERVISED INFERENCE OF DATA FORMATS IN HUMAN-READABLE NOTATION
Author(s) - Christopher Scaffidi
Publication year - 2007
Language(s) - English
Resource type - Conference proceedings
DOI - 10.5220/0002347902360241
Subject(s) - computer science, notation, inference, information retrieval, phone, outlier, data format, data mining, database, artificial intelligence, arithmetic, mathematics, linguistics, philosophy, computer hardware
One common approach to validating data such as email addresses and phone numbers is to check whether values conform to a desired data format. Unfortunately, users may need to learn a specialized notation such as regular expressions to specify the format, and even after learning the notation, specifying formats may take substantial time. To address these problems, this paper introduces Topei, a system that infers a format from an unlabeled collection of examples (which may contain errors). The inferred format is presented in understandable English, so users can review and customize it. In addition, the format can be used to check data automatically and to find outliers that do not match. On test data, Topei shows substantially higher precision and recall than an alternative algorithm (Lapis). Topei’s usefulness is demonstrated by integrating it with spreadsheet, database, and web services systems.
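The general idea of inferring a format from unlabeled examples and then flagging non-matching values can be illustrated with a minimal Python sketch. This is not Topei's algorithm, which the abstract does not describe; it assumes a much cruder approach in which each value is reduced to a "shape" of digit runs, letter runs, and literal punctuation, and the most frequent shape is taken as the format. The function names `signature`, `infer_format`, and `find_outliers`, and the sample phone numbers, are hypothetical.

```python
import re
from collections import Counter


def signature(value: str) -> str:
    """Reduce a string to a coarse shape (an assumption of this sketch).

    Each run of digits becomes 'D<length>', each run of ASCII letters
    becomes 'A<length>', and every other character is kept literally.
    """
    def replace_run(match: re.Match) -> str:
        run = match.group(0)
        kind = "D" if run[0].isdigit() else "A"
        return f"{kind}{len(run)}"

    # A single pass, so the 'D'/'A' markers are never re-matched as letters.
    return re.sub(r"[0-9]+|[A-Za-z]+", replace_run, value)


def infer_format(examples: list[str]) -> str:
    """Pick the most common shape; rare shapes are treated as likely errors."""
    counts = Counter(signature(example) for example in examples)
    return counts.most_common(1)[0][0]


def find_outliers(examples: list[str], fmt: str) -> list[str]:
    """Return the examples whose shape differs from the inferred format."""
    return [example for example in examples if signature(example) != fmt]


# Hypothetical, unlabeled input containing two malformed values.
phones = ["412-555-0143", "724-555-0199", "412-555-0101", "555-0123", "412-555-01x3"]
fmt = infer_format(phones)
outliers = find_outliers(phones, fmt)
```

With this input, `fmt` is `"D3-D3-D4"` (three digits, a hyphen, three digits, a hyphen, four digits) and `outliers` holds the two malformed values. A majority vote over exact shapes is brittle when valid data legitimately comes in several formats, which is one reason a real system would need a richer format model than this sketch.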