Emojis Aid Social Media Sentiment Analysis: Stop Cleaning Them Out!
Leverage emojis in social media sentiment analysis to improve accuracy.
Including emojis in social media sentiment analysis robustly improves sentiment classification accuracy, regardless of which model you use or how you incorporate emojis into the pipeline:
More than half of the popular BERT-based encoders don't support emojis.
The Twitter-RoBERTa encoder performs best in sentiment analysis and works well with emojis.
Instead of cleaning emojis out, converting them to their textual descriptions can boost sentiment classification accuracy and handle the out-of-vocabulary issue.
As social media has become an essential part of people's lives, the content people share on the Internet is highly valuable to many parties. Many modern natural language processing (NLP) techniques have been deployed to understand the general public's social media posts. Sentiment analysis is one of the most popular and critical NLP topics; it focuses on computationally analyzing opinions, sentiments, emotions, or attitudes toward entities in written text [1]. Social media sentiment analysis (SMSA) is thus the field of understanding and learning representations for the sentiments expressed in short social media posts.
Another important feature of this project is the cute little in-text graphics: emojis 😄. These graphical symbols have increasingly gained ground in social media communication. According to 2021 statistics from Emojipedia, a well-known emoji reference site, over one-fifth of tweets now contain emojis (21.54%), while over half of the comments on Instagram include emojis. Emojis are a handy and concise way to express emotions and convey meaning, which may explain their great popularity.
However ubiquitous emojis are in online communication, they are not favored in NLP and SMSA. During data preprocessing, emojis are usually removed alongside other unstructured information such as URLs, stop words, special characters, and pictures [2]. While some researchers have started to study the potential of including emojis in SMSA in recent years, it remains a niche approach and awaits further research. This project aims to examine the emoji compatibility of trending BERT encoders and to explore different methods of incorporating emojis in SMSA to improve accuracy.
1 Background Knowledge
Here is some background knowledge about SMSA that you might want to know before looking into the actual experiments. No technical or mathematical background is required so far. Let me first explain the intuition behind the most typical SMSA task:
As the picture above shows, given a social media post, the model (represented by the gray robot) will output the prediction of its sentiment label. In this example, the model responds that this post is 57.60% likely to express positive sentiment, 12.38% likely to be negative, and 30.02% likely to be neutral. Some studies classify posts in a binary way, i.e. positive/negative, but others consider “neutral” as an option as well. This project follows the latter.
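Probability-style outputs like these typically come from a softmax over the model's raw output scores (logits), which squashes them into positive numbers that sum to 1. A minimal sketch (the logits below are made-up toy values, not the ones behind the figure):

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for (positive, negative, neutral)
probs = softmax([1.2, -0.3, 0.5])
print(probs)  # three probabilities summing to 1, largest for "positive"
```

Whatever the raw scores are, the largest logit always wins the largest probability, which is why the model above "responds" with the positive label.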
Development of Sentiment Analysis Methodologies
To the best of my knowledge, the first quantitative approach to studying social media sentiment was the lexicon-based method. The model has a predefined lexicon that maps each token to a sentiment score. So, given a sentence, the model consults the lexicon, aggregates the sentiment scores of the individual words, and outputs an overall sentiment score. Very intuitive, right?
SentiWordNet and VADER are two paradigms of this kind that have been favored by both industry and academia.
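A minimal lexicon-based scorer might look like the sketch below. The tiny lexicon and the scoring rule are illustrative only, not taken from SentiWordNet or VADER:

```python
# Illustrative sentiment lexicon: word -> score in [-1, 1]
LEXICON = {"love": 0.9, "great": 0.8, "good": 0.5,
           "bad": -0.6, "hate": -0.9, "terrible": -0.8}

def lexicon_sentiment(text):
    """Sum the scores of known words; the sign gives the overall label."""
    tokens = text.lower().split()
    score = sum(LEXICON.get(tok, 0.0) for tok in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_sentiment("i love this great phone"))  # positive
print(lexicon_sentiment("what a terrible day"))      # negative
```

Real lexicon methods add refinements (negation handling, intensifiers, punctuation cues), but the consult-and-aggregate core is exactly this.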
With the development of machine learning, classifiers like SVM, Random Forests, Multi-layer Perceptron, etc., gained ground in sentiment analysis. However, those classifiers cannot take raw text as input, so they are combined with word embedding models to perform sentiment analysis tasks. Word embedding models convert words into numerical vectors that machines can work with. Google's word2vec embedding model was a great breakthrough in representation learning for textual data, followed by GloVe by Pennington et al. and fastText by Facebook.
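One common simple recipe for combining embeddings with such classifiers is to average a post's word vectors into one fixed-size vector and feed that to the classifier. A sketch with made-up 3-dimensional vectors (real word2vec/GloVe embeddings have hundreds of dimensions):

```python
# Toy 3-dimensional "embeddings" standing in for word2vec/GloVe vectors
EMB = {"love": [0.9, 0.1, 0.0], "hate": [-0.8, 0.2, 0.1],
       "this": [0.0, 0.0, 0.1], "movie": [0.1, 0.0, 0.2]}

def post_vector(text):
    """Average the embeddings of known words into one fixed-size vector."""
    vecs = [EMB[t] for t in text.lower().split() if t in EMB]
    if not vecs:
        return [0.0] * 3
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

v = post_vector("love this movie")
print(v)  # one 3-d vector an SVM, Random Forest, or MLP can consume
```

Averaging throws away word order, which is precisely the weakness the RNN approach below addresses.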
Due to the sequential nature of natural language and the immense popularity of deep learning, the Recurrent Neural Network (RNN) then became "the popular kid." An RNN decodes, or "reads", the sequence of word embeddings in order, preserving the sequential structure, which lexicon-based models and traditional machine learning models couldn't achieve. The evolved workflow is explained in the diagram above. Word embeddings are passed into an RNN model that outputs the last hidden state(s). (If you don't know what the last hidden state is, it is intuitively the "summary" composed by the RNN after "reading" all the text.) Lastly, we use a feed-forward fully connected neural network to map the high-dimensional hidden state to a sentiment label.
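The "last hidden state as a summary" idea can be sketched with a bare-bones recurrent update. The scalar weights below are arbitrary toy values; real RNNs use learned weight matrices and high-dimensional hidden states:

```python
import math

def rnn_last_hidden(embeddings, w_in=0.5, w_rec=0.8):
    """Run a toy one-unit RNN over a sequence, return the last hidden state."""
    h = 0.0
    for x in embeddings:
        # Each step folds one token into the running "summary"
        h = math.tanh(w_in * x + w_rec * h)
    return h

summary = rnn_last_hidden([0.2, -0.1, 0.7])
print(summary)  # a single number summarizing the whole sequence, in (-1, 1)
```

Because each update depends on the previous hidden state, reordering the inputs changes the result, which is exactly the order sensitivity that averaged embeddings lack.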
We are almost there! The last piece of the puzzle is the Transformer model. Even if you haven't studied NLP, you still might have heard of "Attention Is All You Need" [3]. In this paper, the authors proposed the self-attention technique and developed the Transformer model. These models are so powerful that they surpass the previous models in almost every subtask of NLP. If you are not familiar with Transformer models, I strongly recommend reading this introductory article by Giuliano Giacaglia.
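The core of that paper is scaled dot-product attention: each token's output is a weighted average of all tokens' value vectors, with weights derived from query-key similarity. A minimal single-head sketch with toy vectors and no learned projections:

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        exps = [math.exp(s) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]  # softmax over positions
        # Weighted average of the value vectors
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

x = [[1.0, 0.0], [0.0, 1.0]]   # two toy token vectors
ctx = attention(x, x, x)        # self-attention: Q = K = V = x
print(ctx)  # each row mixes information from both tokens
```

Unlike an RNN, every token attends to every other token in one step, which is what lets Transformers model long-range context so well.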
Both industry and academia have started to use pretrained Transformer models on a large scale due to their unbeatable performance. Thanks to the Hugging Face transformers package, developers can now easily import and deploy those large pretrained models. BERT, i.e. Bidirectional Encoder Representations from Transformers, is the most famous Transformer-based encoder model and learns excellent representations for text. Later on, RoBERTa, BERTweet, DeBERTa, etc., were developed based on BERT.
2 Experiments
With all that background knowledge, we can now dive into the experiments and the programming part! If you don't feel confident about the mechanics of NLP, I recommend skipping the technical details or reading some introductory blogs about NLP on Towards Data Science. Let's clarify our experiment objectives first. We want to know:
how compatible the currently trending pretrained BERT-based models are with emoji data.
how the performance would be influenced if we incorporate emojis in the SMSA process.
what exactly we should do in the data processing stage to include the emojis.
Model Design
Our model follows the aforementioned neural network paradigm that consists of a pretrained BERT-based encoder, a Bi-LSTM layer, and a feedforward fully connected network. The diagram is shown below:
To be clear, a preprocessed tweet is first passed through the pretrained encoder and becomes a sequence of representational vectors. Then, the representational vectors are passed through the Bi-LSTM layer. The two last hidden states of the two directions of LSTM will be processed by the feedforward layer to output the final prediction of the tweet’s sentiment.
We alter the encoder models and emoji preprocessing methods to observe the varying performance. The Bi-LSTM and feedforward layers are configured in the same way for all experiments in order to control variables. In the training process, we only train the Bi-LSTM and feed-forward layers. The parameters of pretrained encoder models are frozen. The PyTorch implementation of this model and other technical details can be found in my GitHub Repo.
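Under these assumptions (frozen encoder, trainable Bi-LSTM plus feed-forward head, three sentiment classes), the trainable head could be sketched in PyTorch roughly as below. The hidden sizes are placeholders, not the repo's actual configuration, and the encoder itself is stubbed out by random tensors:

```python
import torch
import torch.nn as nn

class SentimentHead(nn.Module):
    """Bi-LSTM + feed-forward head on top of frozen encoder outputs."""
    def __init__(self, enc_dim=768, hidden=128, n_classes=3):
        super().__init__()
        self.bilstm = nn.LSTM(enc_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.ff = nn.Linear(2 * hidden, n_classes)

    def forward(self, enc_out):              # enc_out: (batch, seq, enc_dim)
        _, (h_n, _) = self.bilstm(enc_out)   # h_n: (2, batch, hidden)
        # Concatenate the last hidden states of the two LSTM directions
        h = torch.cat([h_n[0], h_n[1]], dim=-1)   # (batch, 2 * hidden)
        return self.ff(h)                          # (batch, n_classes)

# The pretrained encoder would be frozen before training, e.g.:
#   for p in encoder.parameters(): p.requires_grad = False
head = SentimentHead()
dummy = torch.randn(2, 10, 768)   # stands in for encoder representations
logits = head(dummy)
print(logits.shape)  # torch.Size([2, 3])
```

Only `head.parameters()` would go to the optimizer, matching the setup where the encoder stays fixed across all experiments.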
So, whenever you want to conduct Twitter sentiment analysis, first check whether the dataset stores tweets by their Tweet IDs, which requires extra effort to retrieve the original text. Tweets can easily perish if the dataset is from years ago. Also, don't expect too much when applying for the Twitter API. My mentor, who is an assistant professor at a prestigious American university, couldn't even meet their requirements (for some unknown reason). Lastly, to preserve the emojis, don't ever save them in csv or tsv format. Pickle, xlsx, or json are good choices.
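A quick round-trip test with the standard library shows why these formats are safe choices: json (with `ensure_ascii=False`) and pickle both preserve emoji characters exactly.

```python
import json
import pickle

tweet = {"text": "I love this 😄🔥", "label": "positive"}

# JSON round-trip: ensure_ascii=False keeps emojis as real characters
s = json.dumps(tweet, ensure_ascii=False)
restored = json.loads(s)

# Pickle round-trip preserves the dict, emojis included, byte-for-byte
restored2 = pickle.loads(pickle.dumps(tweet))

print("😄" in s)                         # True: emoji survives serialization
print(restored == tweet, restored2 == tweet)
```

Running a check like this on a sample of your data before committing to a storage format can save a lot of silent emoji loss later.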
Anyway, finding a dataset that retains emojis, has sentiment labels, and is of a desirable size was extremely hard for me. Eventually, I found that Novak et al.'s dataset satisfies all three criteria.
Emoji-compatibility Test of the BERT family
Before implementing the BERT-based encoders, we need to know whether they are compatible with emojis, i.e. whether they can produce unique representations for emoji tokens. More specifically, before a tweet is passed into an encoder, it is first tokenized by a model-specific tokenizer (e.g. RoBERTa-base uses the RoBERTa-base tokenizer, while BERT-base uses the BERT-base tokenizer). What the tokenizer does is split the long string of textual input into individual word tokens that are in the vocabulary.
In our case, if emojis are not in the tokenizer's vocabulary, they will all be tokenized into an unknown token (e.g. "<UNK>"). Encoder models will thus produce the same vector representation for all those unknown tokens, in which case cleaning or not cleaning out the emojis makes no practical difference in model performance.
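The UNK-collapse problem, and the description-replacement fix, can be illustrated with a toy tokenizer. The vocabulary and the emoji-to-text mapping below are made up for illustration; a real pipeline would use the encoder's own tokenizer and a conversion library such as the emoji package:

```python
VOCAB = {"i", "love", "this", ":smiling_face:", ":fire:", "<UNK>"}
EMOJI2TEXT = {"😄": ":smiling_face:", "🔥": ":fire:"}  # toy demojize mapping

def tokenize(text):
    """Map every out-of-vocabulary token to the same <UNK> token."""
    return [tok if tok in VOCAB else "<UNK>" for tok in text.lower().split()]

raw = tokenize("i love this 😄 🔥")
print(raw)  # both emojis collapse into identical '<UNK>' tokens

# Converting emojis to textual descriptions first keeps them distinguishable
converted = " ".join(EMOJI2TEXT.get(t, t) for t in "i love this 😄 🔥".split())
demojized = tokenize(converted)
print(demojized)  # distinct in-vocabulary tokens survive
```

Once two different emojis collapse into the same `<UNK>`, the encoder's output vectors for them are identical, so no downstream layer can recover the lost sentiment signal; the textual-description route avoids the collapse entirely.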