In an article published in the journal Data in Brief, researchers presented a detailed dataset for sentiment analysis in the public security domain, addressing the complexities of sarcasm and bilingual code-mixed content.
Social media data were systematically collected and annotated by experts for sentiment, sarcasm, and language. The dataset was intended to support natural language processing and machine learning research in multilingual regions of Southeast Asia.
Background
In the digital era, social media has become a critical barometer of public sentiment, especially within the public security domain. Analyzing social media content allows for the real-time assessment of public emotions, opinions, and attitudes toward security-related events and crises. However, effective sentiment analysis faces significant challenges due to the complexities introduced by sarcasm and bilingual code-mixed content, which can obscure the intended sentiment.
Previous research has highlighted the scarcity of domain-specific datasets for public security sentiment analysis, a gap that hinders understanding of, and response to, public sentiment during crises. Sarcasm further complicates sentiment interpretation because it can invert the literal meaning of a message, while bilingual code-mixed content makes accurate sentiment detection harder still.
This paper introduced a novel dataset specifically designed to address these challenges. The dataset included systematically collected and annotated social media data from platforms such as Twitter and TikTok, focusing on public security-related content. Expert annotators meticulously labeled the data for sentiment, sarcasm, and language, ensuring high-quality annotations that reflected the complexities of real-world communication.
By offering a dataset that encompassed bilingual, code-mixed content and accounted for sarcasm, this research provided a valuable resource for advancing natural language processing, computational linguistics, and machine learning. It facilitated more accurate sentiment analysis and sarcasm detection, particularly in multilingual contexts, thus bridging the gaps in previous studies and enhancing the capabilities of public security threat detection systems.
Dataset Overview and Annotation Details
The dataset included 10,000 rows of comments and tweets from Twitter and TikTok, annotated for sentiment and sarcasm. Each entry was labeled as 'positive,' 'negative,' or 'neutral' for sentiment and 'sarcastic' or 'not sarcastic' for sarcasm. The data collection and annotation process involved three expert annotators and an additional expert for language identification, categorizing content as English, Malay, or code-mixed.
Sentiment labels showed 3,072 positive, 4,197 negative, and 2,569 neutral entries, while sarcasm labels included 2,355 sarcastic and 7,645 non-sarcastic comments. Language distribution revealed that Malay had the largest presence, followed by code-mixed and English content. The dataset was valuable for advancing sentiment analysis and sarcasm detection in multilingual contexts, offering extensive data for natural language processing, machine learning, and public security threat detection applications. Figures and tables in the paper provide detailed breakdowns of the sentiment, sarcasm, and language identification labels.
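The class balance implied by these counts can be worked out directly. The short Python sketch below uses only the figures quoted above; the variable and function names are illustrative and do not come from the paper.

```python
# Label counts exactly as quoted in this article.
sentiment_counts = {"positive": 3072, "negative": 4197, "neutral": 2569}
sarcasm_counts = {"sarcastic": 2355, "not sarcastic": 7645}


def shares(counts):
    """Return each label's fraction of the summed counts."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


sentiment_total = sum(sentiment_counts.values())
sarcasm_total = sum(sarcasm_counts.values())
sentiment_shares = shares(sentiment_counts)
sarcasm_shares = shares(sarcasm_counts)

for name, total, result in (
    ("sentiment", sentiment_total, sentiment_shares),
    ("sarcasm", sarcasm_total, sarcasm_shares),
):
    print(f"{name} (n={total})")
    for label, fraction in result.items():
        print(f"  {label}: {fraction:.1%}")
```

The sarcasm counts sum to 10,000, matching the stated dataset size, whereas the three sentiment counts as quoted here sum to 9,838. The sentiment shares above are therefore relative to that smaller total, and the tables in the original paper remain the authoritative source for the per-class figures.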
Experimental Design, Materials, and Methods
The researchers outlined the systematic process of constructing and annotating a comprehensive dataset for sentiment analysis and sarcasm detection from social media platforms, specifically Twitter and TikTok. The dataset construction involved two main phases: data acquisition and data annotation.
Data acquisition began with keyword searches using a predefined set of terms related to natural and non-natural disasters in the public security domain. Using these keywords, content was scraped from each platform (the Twitter API for tweets and a custom scraper for TikTok comments), yielding a substantial volume of raw content. This phase ensured a diverse collection covering various disaster scenarios and public responses.
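As a rough illustration of keyword-based selection, the Python sketch below matches a small list of terms against text that has already been collected. The keyword list is hypothetical (the article does not list the actual terms), and the code does not call the Twitter API or any TikTok scraper; it only shows whole-word, case-insensitive matching over English and Malay terms.

```python
import re

# Hypothetical keywords for illustration only; the paper's real list is not given here.
KEYWORDS = ["flood", "banjir", "landslide", "tanah runtuh"]


def build_pattern(keywords):
    """Compile one case-insensitive pattern that matches any keyword as a whole word."""
    # Longer phrases first, so multi-word terms are preferred over shorter overlaps.
    escaped = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


PATTERN = build_pattern(KEYWORDS)


def matched_keywords(text):
    """Return the sorted, lower-cased set of keywords found in the text."""
    return sorted({m.group(0).lower() for m in PATTERN.finditer(text)})
```

Whole-word matching keeps a term such as "flood" from firing on unrelated words like "floodlight", which matters when short keywords are applied to noisy social media text.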
Data annotation followed. The acquired data were first selected and refined to ensure relevance and quality: attributes common to both platforms were retained, and items were filtered for meaningfulness. A total of 10,000 meaningful items were selected and merged using OpenRefine, balancing representation across disaster types and platforms.
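The idea of keeping only attributes shared by both platforms can be sketched in plain Python. This is a stand-in for illustration, not the authors' OpenRefine workflow: the input field names (`full_text`, `comment`) and the output schema are assumptions, and dropping empty text is only a crude proxy for the manual "meaningfulness" filter described above.

```python
def to_common_schema(record, platform, text_field):
    """Reduce a platform-specific record to the attributes shared by both platforms."""
    text = (record.get(text_field) or "").strip()
    return {"platform": platform, "text": text}


def merge_sources(tweets, tiktok_comments):
    """Merge both sources into one list of rows, skipping rows with empty text."""
    sources = (
        ("twitter", tweets, "full_text"),        # hypothetical field name
        ("tiktok", tiktok_comments, "comment"),  # hypothetical field name
    )
    merged = []
    for platform, records, text_field in sources:
        for record in records:
            row = to_common_schema(record, platform, text_field)
            if row["text"]:
                merged.append(row)
    return merged
```

Platform-specific fields such as like counts or language hints are deliberately discarded here, so that every merged row has the same shape regardless of where it came from.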
Annotation tasks were carried out by three expert annotators from diverse backgrounds. Using the Doccano annotation tool, they labeled each item for sentiment (positive, negative, neutral) and sarcasm (sarcastic, not sarcastic), with majority voting resolving disagreements.
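Majority voting over three annotators can be sketched in a few lines of Python. The function name is illustrative, and returning `None` when no label has a strict majority is an assumption: the article does not say how a three-way split on sentiment would be handled.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(votes: Sequence[str]) -> Optional[str]:
    """Return the label chosen by more than half of the annotators, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


# Two of three annotators agree, so the disagreement is resolved.
resolved = majority_label(["negative", "negative", "neutral"])

# All three disagree: no strict majority exists (handling here is an assumption).
unresolved = majority_label(["positive", "negative", "neutral"])
```

With three annotators, the binary sarcasm label always has a strict majority, while the three-class sentiment label can end in a three-way split that needs some further rule or adjudication.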
This structured approach ensured the dataset's quality and richness, facilitating advanced research in natural language processing, computational linguistics, and machine learning. The annotated dataset not only supported academic studies but also enhanced practical applications in public security and crisis management, offering insights into public sentiment and communication dynamics during disasters.
Conclusion
In conclusion, the researchers developed a detailed dataset for sentiment analysis in the public security domain, addressing the complexities of sarcasm and bilingual code-mixed content. Social media data from Twitter and TikTok were systematically collected and annotated by experts for sentiment, sarcasm, and language.
This dataset enhanced natural language processing, computational linguistics, and machine learning, particularly for multilingual contexts in Southeast Asia. By offering high-quality, annotated data, the research supported advanced sentiment analysis, sarcasm detection, and practical applications in public security and crisis management, providing valuable insights into public sentiment and communication during disasters.
Journal reference:
- Mohd Suhairi Md Suhaimin, Mohd Hanafi Ahmad Hijazi, Ervin Gubin Moung, Annotated dataset for sentiment analysis and sarcasm detection: Bilingual code-mixed English-Malay social media data in the public security domain, Data in Brief, 2024, 110663, ISSN 2352-3409, DOI: 10.1016/j.dib.2024.110663, https://www.sciencedirect.com/science/article/pii/S2352340924006309