
An Identification Method of Question Subjects Based on Word Embedding and LSTM
Author(s) -
Mingxia Gao,
Zihao Fu
Publication year - 2020
Publication title -
Journal of Physics: Conference Series
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.21
H-Index - 85
eISSN - 1742-6596
pISSN - 1742-6588
DOI - 10.1088/1742-6596/1631/1/012120
Subject(s) - word2vec , computer science , tf–idf , subject (documents) , word (group theory) , word embedding , artificial intelligence , identification (biology) , natural language processing , feature (linguistics) , information retrieval , scope (computer science) , set (abstract data type) , feature extraction , embedding , pattern recognition (psychology) , mathematics , world wide web , term (time) , linguistics , philosophy , physics , geometry , botany , quantum mechanics , biology , programming language
Identifying the subject of a question helps locate the question's domain, narrow the scope of the query, and provide users with better answers. Question text is usually short, so its features are sparse and its structure is irregular. To address this, this paper proposes an identification method of question subjects based on word embedding and LSTM (IQS-WE-L) and evaluates it on a question set from the MadSci website that covers three subjects. We first train Word2vec on a Wikipedia corpus to generate a word-vector dictionary. Then, based on the word vectors, we propose four feature extraction methods, W2V, W2V-TFIDF, W2V-c-TFIDF and W2V-c, which formalize text features as vectors using word embeddings and other features. Finally, we build an LSTM network and train it as a classifier to identify the subject of a question, and we quantitatively evaluate the effect of the four proposed feature extraction methods. Experimental results show that the proposed method can effectively identify the subject of a question, with the F1 value reaching a maximum of 0.9339.
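The abstract does not spell out how the W2V-TFIDF features are built, so the following is only a minimal sketch of one plausible reading: each word vector in a question is scaled by that word's TF-IDF weight, and the resulting sequence is what a sequence model such as an LSTM would consume. The toy vectors, the smoothed IDF formula, and the helper names `inverse_document_frequency` and `tfidf_weighted_sequence` are all assumptions made for illustration; they are not taken from the paper, and real vectors would come from a Word2vec model trained on Wikipedia.

```python
import math
from collections import Counter

# Hypothetical 3-dimensional vectors standing in for trained Word2vec output.
TOY_VECTORS = {
    "why": [0.1, 0.0, 0.2],
    "is": [0.0, 0.1, 0.0],
    "the": [0.0, 0.0, 0.1],
    "sky": [0.9, 0.1, 0.3],
    "blue": [0.8, 0.2, 0.4],
    "how": [0.1, 0.1, 0.1],
    "do": [0.0, 0.2, 0.0],
    "cells": [0.2, 0.9, 0.1],
    "divide": [0.3, 0.8, 0.2],
}


def inverse_document_frequency(tokenized_questions):
    """Smoothed IDF per word: log((1 + N) / (1 + df)) + 1 (an assumed variant)."""
    n_questions = len(tokenized_questions)
    doc_freq = Counter()
    for tokens in tokenized_questions:
        doc_freq.update(set(tokens))  # count each word once per question
    return {
        word: math.log((1 + n_questions) / (1 + count)) + 1.0
        for word, count in doc_freq.items()
    }


def tfidf_weighted_sequence(tokens, idf, vectors):
    """Return one TF-IDF-scaled vector per in-vocabulary token, in order."""
    term_freq = Counter(tokens)
    sequence = []
    for token in tokens:
        if token not in vectors:
            continue  # skip out-of-vocabulary words
        weight = (term_freq[token] / len(tokens)) * idf.get(token, 1.0)
        sequence.append([weight * x for x in vectors[token]])
    return sequence


# Three made-up short questions, already tokenized.
questions = [
    "why is the sky blue".split(),
    "how do cells divide".split(),
    "why do cells divide".split(),
]
idf = inverse_document_frequency(questions)
features = tfidf_weighted_sequence(questions[0], idf, TOY_VECTORS)
print(len(features), len(features[0]))  # sequence length, vector dimension
```

In this sketch, words that appear in fewer questions (such as "sky") receive a larger weight than words shared across questions (such as "why"), which is the usual motivation for combining TF-IDF with word embeddings on short, sparse text.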