In an article recently published in the journal PLOS ONE, researchers performed a comparative analysis of lexicon-based and supervised machine learning (ML) approaches for the automated classification of short texts obtained from social media.
Background
The field of social media studies is expanding rapidly with the growing importance of social networks in people’s lives. Social media platforms are increasingly serving as a crucial mode of communication for political institutions, citizens residing in distant places, business representatives, and civil society organizations.
The activities of social media users provide a large amount of freely available textual data to social scientists, which they can use to advance theories in different areas, including public opinion, media studies, comparative politics, and international relations.
For instance, the microblogging site Twitter is primarily used to package information in the form of short messages. When studying such short texts, researchers label or categorize them based on their stance, relevance, or topic before performing further analysis.
However, text classification is extremely challenging: manually classifying a substantial number of texts is not feasible, which necessitates automated methods that have not been adequately evaluated in the social science research context. Additionally, social science concepts are complex and ambiguous, which makes it harder for automated classifiers to label or categorize texts based on those concepts.
Automated text classification approaches
Lexicon-based methods are used extensively by social scientists for automated text classification. These methods rely on expert-crafted dictionaries/lexicons of meaningful words on a specific topic.
However, such lexicons are not always available to the research community, which is a major disadvantage of lexicon-based methods and has driven the adoption of machine learning (ML) methods, including both unsupervised and supervised learning. Supervised learning methods are more effective than unsupervised methods for classifying texts using labels defined in advance.
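The core idea of a lexicon-based classifier can be shown in a few lines: a text is labeled according to whether it contains any word from a topical lexicon. The sketch below is a minimal illustration, assuming a hypothetical hand-picked set of climate keywords, not the lexicon used in the study.

```python
# Minimal sketch of a lexicon-based classifier. The keyword set below is
# an illustrative assumption, not the study's actual lexicon.
CLIMATE_LEXICON = {"climate", "warming", "emissions", "carbon", "renewable"}

def lexicon_classify(text: str, lexicon: set = CLIMATE_LEXICON) -> str:
    """Label a text as climate-related if any lexicon word occurs in it."""
    tokens = {token.strip(".,!?#").lower() for token in text.split()}
    return "climate" if tokens & lexicon else "not-climate"

print(lexicon_classify("Cutting carbon emissions by 2030"))  # climate
print(lexicon_classify("New office opening in Geneva"))      # not-climate
```

The strength of this approach is its transparency and zero training cost; the weakness, as the article notes, is that it depends entirely on an expert-crafted word list existing for the topic of interest.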
In recent years, deep learning (DL) has attained human-level performance while executing several tasks, such as text translation and image recognition. Although DL methods can efficiently handle nonlinear complex relationships within data, they require a large number of advanced computations to perform the task.
The study
In this study, researchers identified the most extensively used supervised ML and lexicon-based text classification methods from the literature on text classification in the social sciences and compared their performance for the automated categorization of short texts in a small, labeled, and imbalanced dataset to identify the best-performing algorithm among them.
Nine text classification methods were selected from the literature for comparative analysis in this study: a lexicon-based classifier; five traditional ML classifiers, namely support vector machines (SVM), random forests (RF), k-nearest neighbors (KNN), Naïve Bayes (NB), and logistic regression (LR); and three DL methods, namely fully connected neural networks (FCNNs), convolutional neural networks (CNNs), and long short-term memory (LSTM) neural networks.
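To give a flavor of the traditional ML family, the following is a from-scratch sketch of one of the listed classifiers, a multinomial Naïve Bayes model with Laplace smoothing. The training tweets are invented toy examples, not data from the study, and real analyses would use a library implementation rather than this minimal version.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (text, label) pairs. Returns counts needed for prediction."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict_nb(model, text):
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # Log prior for the class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Laplace (add-one) smoothing avoids zero probabilities
            # for words unseen in a class during training.
            score += math.log((word_counts[label][word] + 1) /
                              (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training set (invented examples, not the study's data).
docs = [
    ("climate change threatens crops", "climate"),
    ("rising emissions and global warming", "climate"),
    ("new vaccine campaign launched", "other"),
    ("emergency food aid delivered", "other"),
]
model = train_nb(docs)
print(predict_nb(model, "global warming and climate"))  # climate
```

Models like this train in milliseconds even on modest hardware, which is the computational-cost contrast with the DL methods that the study's findings turn on.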
In social science research, automated categorization is used frequently by researchers to analyze short texts such as article abstracts, tweets, paragraphs, or sentences. Researchers investigated the performance of the selected text classification methods in a common social science research setting that involved a small labeled dataset with rare-event data.
Specifically, they used a novel dataset on Twitter communication about climate change by eight United Nations (UN) organizations in various policy areas. The initial dataset contained 222,191 tweets related to both climate change and non-climate change issues posted by eight Twitter accounts, including @Refugees, @UNDRR, @UNDP, @UNICEF, @WHO, @FAO, and @UNOCHA, from the beginning of their tweeting history to the end of 2019. Every tweet was considered an observation in the dataset.
Subsequently, 5,750 tweets were selected randomly from the entire dataset and manually labeled as either “not climate change-related” or “climate change-related”. This dataset was then split into a training set and a test set for model training and performance assessment, respectively. To prevent data leakage, the test set was not used until the final performance evaluation step.
Additionally, a part of the training dataset was used as a validation set for hyperparameter tuning. The F1 score, which is calculated as the harmonic mean of the recall and precision of a model, was used as the key metric for the performance evaluation of the text classifiers considered in this study.
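The F1 metric the study relies on can be computed directly from confusion-matrix counts. The sketch below uses hypothetical counts for the rare climate-related class, chosen only to illustrate the arithmetic.

```python
# F1 score as the harmonic mean of precision and recall, computed from
# true positives (tp), false positives (fp), and false negatives (fn).
def f1_score(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp)  # fraction of predicted positives that are correct
    recall = tp / (tp + fn)     # fraction of actual positives that are found
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for the minority "climate change-related" class:
# precision = 60/80 = 0.75, recall = 60/105 ≈ 0.571.
print(round(f1_score(tp=60, fp=20, fn=45), 4))  # 0.6486
```

Because the harmonic mean punishes whichever of precision or recall is lower, F1 is a more informative metric than plain accuracy on imbalanced data, where a classifier can score high accuracy by always predicting the majority class.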
Significance of the findings
Among all classifiers, including the sophisticated neural classifiers, the RF and LR classifiers displayed the best performance on the original small, labeled, and imbalanced dataset, at significantly lower computational cost. However, the F1 scores of all classifiers were low on the original dataset due to its small size and imbalanced nature. The F1 scores of the best LR and RF classifiers were 0.647619 and 0.649573, respectively, while the F1 score of the third-best, lexicon-based classifier was 0.618557.
Researchers investigated the impact of the imbalanced dataset on the classification performance of classifiers by preparing a balanced dataset. They supplemented the original dataset with additional tweets obtained from Kaggle’s “Twitter Climate Change Sentiment Dataset”, which contained 43,943 climate change-related tweets. Researchers collected sufficient samples from the Kaggle dataset to balance the original dataset and used the new, balanced dataset for performance evaluation of the text classifiers again.
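The balancing step amounts to sampling enough minority-class examples from an external pool to equalize the classes. The sketch below illustrates this with placeholder texts and an assumed 500/5,250 class split; the actual class proportions in the study's 5,750-tweet sample are not stated in this article.

```python
import random

# Sketch of balancing by supplementation: draw extra minority-class
# (climate-related) examples from an external pool until classes match.
# Texts and the 500/5,250 split are illustrative assumptions.
random.seed(0)
original = [("placeholder climate tweet", 1)] * 500 + \
           [("placeholder other tweet", 0)] * 5250
extra_climate_pool = [("placeholder external climate tweet", 1)] * 43943

n_majority = sum(1 for _, y in original if y == 0)
n_minority = sum(1 for _, y in original if y == 1)
needed = n_majority - n_minority

balanced = original + random.sample(extra_climate_pool, needed)

counts = {0: 0, 1: 0}
for _, y in balanced:
    counts[y] += 1
print(counts)  # {0: 5250, 1: 5250}
```

The Kaggle pool of 43,943 climate-related tweets is far larger than any plausible deficit here, which is why the researchers could fully balance the dataset this way.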
The classification performance of all classifiers improved significantly on the new, balanced dataset: the CNN, RF, LR, FCNN, SVM, LSTM, and NB classifiers displayed very high F1 scores of more than 96%, while the lexicon-based and KNN classifiers displayed F1 scores above 85%. The lexicon-based classifier, which had the third-best performance on the original small, imbalanced dataset, demonstrated the worst classification performance on the new balanced dataset.
The CNN demonstrated the best performance on the balanced dataset with an F1 score above 97%, which indicated that the CNN is the most suitable alternative for text classification on balanced datasets when attaining the best possible performance is the primary objective. RF and LR showed the second- and third-best classification performance on the balanced dataset, with F1 scores of 0.969891 and 0.968928, respectively.
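For intuition about why a CNN suits text, its core operation can be sketched without any deep learning library: a filter slides over consecutive word embeddings to detect a local pattern, and max-pooling keeps the strongest response regardless of where it occurs. Everything below (embedding values, filter weights, a single filter of width two) is a toy assumption, not the study's architecture.

```python
# Minimal sketch of a text CNN's building block: one convolutional filter
# slid over a token sequence, followed by max-pooling over positions.
def conv1d_maxpool(embeddings, kernel):
    width = len(kernel)  # number of consecutive tokens the filter spans
    activations = []
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        # Dot product of the window's embeddings with the filter weights.
        act = sum(w * x for row, krow in zip(window, kernel)
                  for x, w in zip(row, krow))
        activations.append(act)
    return max(activations)  # max-pooling: keep the strongest match

# Four tokens, each a 2-dimensional embedding (toy, hand-set values).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernel = [[0.5, 0.5], [0.5, 0.5]]  # one filter spanning two tokens
print(conv1d_maxpool(sentence, kernel))  # 1.5
```

A trained text CNN learns many such filters of several widths, which is what drives both its strong performance on the balanced dataset and its much higher training cost relative to RF and LR.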
Overall, all classifiers except the KNN and the lexicon-based classifier showed similar classification performance, which indicated that the choice of classification algorithm matters less when the dataset is balanced.
To summarize, the findings of this study demonstrated that traditional ML algorithms are more suitable for short text classification than computationally demanding deep neural architectures, as the advantages of such architectures for classification are marginal while their training requires substantially more time. Deep neural architectures are only suitable when datasets are large and balanced and computational resources are abundant.