Volume 113, Number 2, January 2016
Number of page(s): 6
Section: Interdisciplinary Physics and Related Areas of Science and Technology
Published online: 19 February 2016
Using complex networks for text classification: Discriminating informative and imaginative documents
1 Institute of Mathematics and Computer Science, University of São Paulo - São Carlos, São Paulo, Brazil
2 São Carlos Institute of Physics, University of São Paulo - São Carlos, São Paulo, Brazil
Received: 25 August 2015
Accepted: 1 February 2016
Statistical methods have been widely employed in recent years to capture many properties of language. The application of such techniques has improved several linguistic applications, such as machine translation and document classification. In the latter, many approaches have emphasised the semantic content of texts, as is the case of bag-of-words language models. These approaches have certainly yielded reasonable performance. However, some potentially useful features, such as the structural organization of texts, have been used in only a few studies. In this context, we probe how features derived from textual structure analysis can be effectively employed in a classification task. More specifically, we performed a supervised classification aimed at discriminating informative from imaginative documents. Using a networked model that describes the local topological/dynamical properties of function words, we achieved an accuracy rate of up to 95%, which is much higher than that of similar networked approaches. A systematic analysis of feature relevance revealed that symmetry and accessibility measurements are among the most prominent network measurements. Our results suggest that these measurements could be used in related language applications, as they play a complementary role in characterising texts.
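To make the idea of describing function words by their local network properties more concrete, the following Python sketch builds a word adjacency network from a text and collects two simple local measurements for each function word. It is an illustration under stated assumptions, not the model used in the paper: the function-word list, the helper names (`build_adjacency_network`, `clustering`, `feature_vector`), and the choice of degree and clustering coefficient are all hypothetical. In particular, the symmetry and accessibility measurements highlighted in the abstract are not implemented here.

```python
from collections import defaultdict

# Hypothetical, tiny function-word list chosen for illustration only.
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "a")


def build_adjacency_network(text):
    """Link every pair of adjacent words with an undirected edge."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w]
    graph = defaultdict(set)
    for left, right in zip(words, words[1:]):
        if left != right:  # ignore self-loops from repeated words
            graph[left].add(right)
            graph[right].add(left)
    return graph


def clustering(graph, node):
    """Local clustering coefficient: fraction of neighbour pairs that are linked."""
    neighbours = list(graph.get(node, ()))
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbours[j] in graph[neighbours[i]]
    )
    return 2.0 * links / (k * (k - 1))


def feature_vector(text, function_words=FUNCTION_WORDS):
    """Concatenate (degree, clustering) for each function word, in order."""
    graph = build_adjacency_network(text)
    features = []
    for word in function_words:
        features.append(float(len(graph.get(word, ()))))
        features.append(clustering(graph, word))
    return features
```

A vector of this kind, computed for each document, could then be passed to any standard supervised classifier together with the informative/imaginative labels. Function words that do not occur in a text simply contribute zeros.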
PACS: 89.75.Fb – Structures and organization in complex systems / 89.75.Hc – Networks and genealogical trees
© EPLA, 2016