A framework for automated corpus compilation for KeyXtract: Twitter model


The corpus is a limiting factor for any keyword extraction process that includes a word matching stage. This paper proposes a framework to automate the corpus generation stage required for the Twitter Model of KeyXtract, an algorithm for extracting essential keywords from tweets. The initial algorithm relied on two manually compiled corpora, which limited the adaptability of the system. The automated framework proposed here extends the keyword extraction process of KeyXtract to address this limitation. The framework takes the open-class words of the source text and matches them against a bag of words compiled by analyzing the tweets. The automated corpus contained 138 words, of which 74 were also found in the handpicked corpus (206 words in total). However, when the automated corpus was used with the keyword extraction system, the average F1 score of the system decreased by 0.07, indicating that the automated corpus does not yet match the performance of the human-made corpus. This was because the human-made corpus was compiled using syntactic, semantic and pragmatic features, while the automated framework considered only syntactic features. Nevertheless, the F1 score increased for some individual tweets, which makes this a promising first step in the corpus automation process. The automatic corpus generation framework could be made more accurate by including semantic analysis of the lexical items. Overall, the present framework goes a substantial way toward addressing the corpus compilation limitation present in the Twitter Model of KeyXtract.
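
The matching step and the F1 evaluation described above can be illustrated with a minimal Python sketch. It is not the KeyXtract implementation: the function names, the tokenizer, and the small closed-class word list are all assumptions made for illustration, and a real system would more plausibly identify open-class words with a part-of-speech tagger.

```python
import re
from collections import Counter

# ASSUMPTION: a tiny hand-written closed-class (function word) list stands in
# for proper part-of-speech tagging, purely to keep the sketch self-contained.
CLOSED_CLASS = {
    "the", "a", "an", "of", "in", "on", "to", "and", "or", "is", "are",
    "was", "for", "with", "at", "by", "it", "this", "that",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def open_class_words(text):
    """Return the set of tokens that are not in the closed-class list."""
    return {w for w in tokenize(text) if w not in CLOSED_CLASS}


def bag_of_words(tweets):
    """Count every token across a collection of tweets."""
    bag = Counter()
    for tweet in tweets:
        bag.update(tokenize(tweet))
    return bag


def compile_corpus(source_text, tweets):
    """Keep the open-class words of the source text that also occur in the tweets."""
    bag = bag_of_words(tweets)
    return {w for w in open_class_words(source_text) if w in bag}


def f1_score(predicted, gold):
    """F1 of a predicted keyword set against a gold keyword set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the paper.
    source = "The flood in Colombo damaged roads and bridges"
    tweets = ["Roads closed after the flood", "Colombo traffic is heavy"]
    print(sorted(compile_corpus(source, tweets)))
    print(f1_score({"flood", "roads"}, {"flood", "colombo"}))
```

For scale, the reported counts imply that 74 of the 138 automatically selected words (about 54%) also appear in the handpicked corpus, and that these 74 words cover about 36% of its 206 entries.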

17th International Conference on Advances in ICT for Emerging Regions, ICTer 2017 - Proceedings