Identifying long tail term from large‐scale candidate pairs for big data‐oriented patent analysis
Author(s) - Qu Peng, Zhang Junsheng, Yao Changqing, Zeng Wen
Publication year - 2016
Publication title - Concurrency and Computation: Practice and Experience
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.309
H-Index - 67
eISSN - 1532-0634
pISSN - 1532-0626
DOI - 10.1002/cpe.3792
Subject(s) - term (time) , computer science , string (physics) , information retrieval , data mining , set (abstract data type) , scale (ratio) , rank (graph theory) , perspective (graphical) , tf–idf , inverse , artificial intelligence , mathematics , geography , combinatorics , physics , geometry , cartography , quantum mechanics , mathematical physics , programming language
Summary Patents are an important and valuable type of scientific and technical big data. This paper presents how to mine massive patent texts to obtain valuable information and knowledge from the large‐scale candidates extracted from them. First, we propose a patent term extraction method using co‐occurrence in the abstract and first‐claim sections of patent records. There are three steps: (1) we extract candidate strings according to our definition of a term; (2) we propose an assumption to verify whether a candidate string is a qualified term by using the co‐occurrence of terms in the abstract and first claim; and (3) we use term frequency–inverse document frequency (TF‐IDF) or mutual information to rank and select candidate terms. Second, we propose a new method to obtain valuable long tail terms from patents. To this end, (1) we first build long tail term–common term pairs as the candidate set; (2) we then evaluate each candidate pair's value; and (3) we give an example from our results to demonstrate the method. This study provides a new perspective on extracting terms from the free text of patent records and also proposes a new method to obtain valuable long tail terms to aid information analysis with massive patent texts. Copyright © 2016 John Wiley & Sons, Ltd.
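
The term extraction steps in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the summary does not give the term definition, the exact co‐occurrence assumption, or the TF‐IDF variant, so the sketch assumes that candidate strings are already extracted, that a candidate qualifies when it appears in both the abstract and the first claim of the same record, and that a plain tf × log(N/df) score summed over records is used for ranking. The function names `cooccurring_candidates` and `tfidf_rank` are hypothetical.

```python
import math
from collections import Counter


def cooccurring_candidates(abstract_terms, claim_terms):
    """Assumed filter: keep candidates found in both the abstract and the first claim."""
    return set(abstract_terms) & set(claim_terms)


def tfidf_rank(records):
    """Rank qualified candidates by TF-IDF summed over all records.

    records: list of (abstract_terms, claim_terms) pairs, one per patent,
    where each element is a list of already-extracted candidate strings.
    Returns (term, score) pairs sorted by descending score, then by term.
    """
    kept = [cooccurring_candidates(a, c) for a, c in records]
    n_docs = len(records)
    # Document frequency: number of records in which a term qualifies.
    df = Counter(term for terms in kept for term in terms)
    scores = Counter()
    for (abstract_terms, claim_terms), terms in zip(records, kept):
        counts = Counter(abstract_terms) + Counter(claim_terms)
        total = sum(counts.values())
        for term in terms:
            tf = counts[term] / total
            idf = math.log(n_docs / df[term])
            scores[term] += tf * idf
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy records (invented for illustration, not data from the paper).
records = [
    (["solar cell", "substrate", "electrode"], ["solar cell", "electrode", "layer"]),
    (["solar cell", "battery"], ["solar cell", "battery", "housing"]),
]
ranked = tfidf_rank(records)
```

In this toy input, "solar cell" qualifies in every record, so its inverse document frequency is zero and it ranks last; terms that qualify in fewer records rank higher. Mutual information, which the summary names as an alternative ranking score, could replace the TF‐IDF computation in the same structure.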