Comparing AI and Lexicon-Based Approaches for Short Text Classification in Social Sciences

In an article recently published in the journal PLOS ONE, researchers performed a comparative analysis of lexicon-based and supervised machine learning (ML) approaches for the automated classification of short texts obtained from social media. 

Study: Comparing AI and Lexicon-Based Approaches for Short Text Classification in Social Sciences. Image credit: wavebreakmedia/Shutterstock

Background

The field of social media studies is expanding rapidly with the growing importance of social networks in people’s lives. Social media platforms are increasingly serving as a crucial mode of communication for political institutions, citizens residing in distant places, business representatives, and civil society organizations.

The activities of social media users generate a large amount of freely available textual data, which social scientists can use to advance theories in areas such as public opinion, media studies, comparative politics, and international relations.

For instance, the microblogging site Twitter is primarily used to package information into short messages. When studying such short texts, researchers typically label or categorize them by stance, relevance, or topic before performing further analysis.

However, text classification is challenging: manually classifying a substantial number of texts is not feasible, which necessitates automated methods that have not been adequately evaluated in the social science research context. Additionally, social science concepts are often complex and ambiguous, making it harder for automated classifiers to label or categorize texts according to those concepts.

Automated text classification approaches

Lexicon-based methods are used extensively by social scientists for automated text classification. These methods rely on expert-crafted dictionaries, or lexicons, of meaningful words on a specific topic that researchers can draw on.

However, such lexicons are not always available to the research community. This limitation is a major disadvantage of lexicon-based methods and has driven the adoption of machine learning (ML) methods, both unsupervised and supervised. Supervised learning methods are generally more effective than unsupervised methods for classifying texts into predefined labels.
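To make the lexicon-based approach concrete, the following is a minimal sketch of a dictionary classifier. The keyword set is purely illustrative and is not the lexicon used in the study; real lexicons are far larger and expert-curated.

```python
# Minimal sketch of a lexicon-based classifier for climate-related tweets.
# CLIMATE_LEXICON is an illustrative stand-in, not the study's lexicon.
CLIMATE_LEXICON = {"climate", "warming", "emissions", "carbon", "greenhouse"}

def lexicon_classify(text: str) -> str:
    """Label a text 'climate' if any lexicon term appears, else 'other'."""
    tokens = set(text.lower().split())
    return "climate" if tokens & CLIMATE_LEXICON else "other"

print(lexicon_classify("Rising emissions drive global warming"))   # climate
print(lexicon_classify("New vaccination campaign launched today"))  # other
```

The simplicity is the appeal: no training data is needed, and every decision is transparent, but performance hinges entirely on how well the lexicon covers the concept.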

In recent years, deep learning (DL) has attained human-level performance on several tasks, such as text translation and image recognition. Although DL methods can efficiently capture complex nonlinear relationships within data, they are computationally demanding to train and run.

The study

In this study, researchers identified the most widely used supervised ML and lexicon-based text classification methods from the social science text classification literature. They then compared the methods' performance on the automated categorization of short texts in a small, labeled, and imbalanced dataset to identify the best-performing algorithm.

Nine text classification methods were selected from the literature for comparative analysis in this study: a lexicon-based classifier; five traditional ML classifiers, namely support vector machines (SVM), random forests (RF), k-nearest neighbors (KNN), Naïve Bayes (NB), and logistic regression (LR); and three DL methods, namely fully connected neural networks (FCNNs), convolutional neural networks (CNNs), and long short-term memory (LSTM) networks.

In social science research, automated categorization is used frequently by researchers to analyze short texts such as article abstracts, tweets, paragraphs, or sentences. Researchers investigated the performance of the selected text classification methods in a common social science research setting that involved a small labeled dataset with rare-event data.

Specifically, they used a novel dataset on Twitter communication about climate change by eight United Nations (UN) organizations in various policy areas. The initial dataset contained 222,191 tweets related to both climate change and non-climate change issues posted by eight Twitter accounts, including @Refugees, @UNDRR, @UNDP, @UNICEF, @WHO, @FAO, and @UNOCHA, from the beginning of their tweeting history to the end of 2019. Every tweet was treated as an observation in the dataset.

Subsequently, 5,750 tweets were selected randomly from the entire dataset and manually labeled as either “not climate change-related” or “climate change-related”. This dataset was then split into a training set and a test set for model training and performance assessment, respectively. To prevent data leakage, the test set was not used until the final performance evaluation.
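One common way to perform such a split while preserving each label's proportion in both halves (a stratified split) can be sketched in pure Python. The class sizes below are illustrative of an imbalanced dataset, not the study's exact figures.

```python
import random

def stratified_split(items, labels, test_frac=0.2, seed=42):
    """Split (item, label) pairs into train/test sets, preserving label ratios."""
    rng = random.Random(seed)
    by_label = {}
    for item, lab in zip(items, labels):
        by_label.setdefault(lab, []).append(item)
    train, test = [], []
    for lab, group in by_label.items():
        rng.shuffle(group)                       # randomize within each class
        k = int(len(group) * test_frac)          # per-class test count
        test += [(x, lab) for x in group[:k]]
        train += [(x, lab) for x in group[k:]]
    return train, test

tweets = [f"tweet {i}" for i in range(100)]
labels = ["climate"] * 10 + ["other"] * 90       # imbalanced, as in the study
train, test = stratified_split(tweets, labels)
print(len(train), len(test))                     # 80 20
```

Stratifying matters with rare-event data: a purely random split could leave the test set with almost no minority-class examples.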

Additionally, a part of the training dataset was used as a validation set for hyperparameter tuning. The F1 score, which is calculated as the harmonic mean of the recall and precision of a model, was used as the key metric for the performance evaluation of the text classifiers considered in this study.
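The F1 score can be computed directly from true positive, false positive, and false negative counts; the counts in the example are illustrative, not values from the study.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that were found
    return 2 * precision * recall / (precision + recall)

# Illustrative counts: 60 true positives, 20 false positives, 45 false negatives.
print(round(f1_score(60, 20, 45), 3))  # 0.649
```

Because F1 ignores true negatives, it is a more informative metric than accuracy on imbalanced data, where a classifier can score high accuracy by always predicting the majority class.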

Significance of the findings

RF and LR classifiers displayed the best performance among all classifiers, including the sophisticated neural classifiers, at significantly lower computational cost on the original small, labeled, and imbalanced dataset. However, the F1 scores of all classifiers were low on this dataset because of its small size and class imbalance. The best RF and LR classifiers achieved F1 scores of 0.649573 and 0.647619, respectively, while the third-best performer, the lexicon-based classifier, scored 0.618557.

Researchers investigated the impact of class imbalance on classification performance by preparing a balanced dataset. They supplemented the original dataset with additional tweets from Kaggle’s “Twitter Climate Change Sentiment Dataset”, which contained 43,943 climate change-related tweets, drawing enough samples to balance the two classes, and then re-evaluated the text classifiers on the new, balanced dataset.
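Balancing by topping up the minority class from an external pool can be sketched as follows. The class sizes and names here are illustrative, not the study's exact counts.

```python
import random

def balance_by_supplementing(minority, majority, pool, seed=0):
    """Top up the minority class with samples drawn from an external pool
    until both classes are the same size."""
    rng = random.Random(seed)
    needed = len(majority) - len(minority)
    extra = rng.sample(pool, needed)        # draw without replacement
    return minority + extra, majority

climate = [f"climate tweet {i}" for i in range(500)]          # minority class
other = [f"other tweet {i}" for i in range(5000)]             # majority class
kaggle_pool = [f"external climate tweet {i}" for i in range(43943)]

climate_bal, other_bal = balance_by_supplementing(climate, other, kaggle_pool)
print(len(climate_bal), len(other_bal))  # 5000 5000
```

Supplementing with genuinely new examples, as done here, avoids the duplicated samples that simple oversampling would introduce, though it assumes the external pool matches the original data's distribution.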

The classification performance of all classifiers improved significantly on the new, balanced dataset: the CNN, RF, LR, FCNN, SVM, LSTM, and NB classifiers displayed very high F1 scores of more than 96%, while the lexicon-based and KNN classifiers displayed F1 scores above 85%. The lexicon-based classifier, which was the third-best performer on the original small, imbalanced dataset, demonstrated the worst performance on the balanced dataset.

The CNN demonstrated the best performance on the balanced dataset, with an F1 score above 97%, indicating that it is the most suitable choice for text classification on balanced datasets when attaining the best possible performance is the primary objective. RF and LR showed the second- and third-best classification performance on the balanced dataset, with F1 scores of 0.969891 and 0.968928, respectively.

Overall, all classifiers except the KNN and the lexicon-based classifier showed similar classification performance, indicating that the choice of classification algorithm matters less when the dataset is balanced.

To summarize, the findings of this study demonstrate that traditional ML algorithms are more suitable for short text classification than computationally demanding deep neural architectures: the advantages of such architectures are marginal, while their training requires substantially more time. Deep neural architectures are suitable only when datasets are large and balanced and extensive computational resources are available.

Journal reference:

Written by

Samudrapom Dam

Samudrapom Dam is a freelance scientific and business writer based in Kolkata, India. He has been writing articles related to business and scientific topics for more than one and a half years. He has extensive experience in writing about advanced technologies, information technology, machinery, metals and metal products, clean technologies, finance and banking, automotive, household products, and the aerospace industry. He is passionate about the latest developments in advanced technologies, the ways these developments can be implemented in a real-world situation, and how these developments can positively impact common people.

Citations

Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Dam, Samudrapom. (2023, October 03). Comparing AI and Lexicon-Based Approaches for Short Text Classification in Social Sciences. AZoAi. Retrieved on December 22, 2024 from https://www.azoai.com/news/20231003/Comparing-AI-and-Lexicon-Based-Approaches-for-Short-Text-Classification-in-Social-Sciences.aspx.

  • MLA

    Dam, Samudrapom. "Comparing AI and Lexicon-Based Approaches for Short Text Classification in Social Sciences". AZoAi. 22 December 2024. <https://www.azoai.com/news/20231003/Comparing-AI-and-Lexicon-Based-Approaches-for-Short-Text-Classification-in-Social-Sciences.aspx>.

  • Chicago

    Dam, Samudrapom. "Comparing AI and Lexicon-Based Approaches for Short Text Classification in Social Sciences". AZoAi. https://www.azoai.com/news/20231003/Comparing-AI-and-Lexicon-Based-Approaches-for-Short-Text-Classification-in-Social-Sciences.aspx. (accessed December 22, 2024).

  • Harvard

    Dam, Samudrapom. 2023. Comparing AI and Lexicon-Based Approaches for Short Text Classification in Social Sciences. AZoAi, viewed 22 December 2024, https://www.azoai.com/news/20231003/Comparing-AI-and-Lexicon-Based-Approaches-for-Short-Text-Classification-in-Social-Sciences.aspx.
