The Covid-19 pandemic has sparked significant discussions on social media, especially on Twitter. This project aims to classify the sentiment of tweets related to Covid-19 using various machine learning models. The analysis involves several steps, including data preprocessing, model construction, and evaluation.
Link to Notebook.
Dataset
The dataset consists of tweets about Covid-19, collected using the Twitter API. Key data fields include tweet text, user information, timestamps, and additional metadata. This dataset is used to train and evaluate sentiment classification models.
The training set spans from March 16 to March 30 and contains over 41,000 rows. The test set covers the period from March 12 to March 16, comprising over 3,000 rows.
Initially, the dataset included five sentiment classes: extreme positive, positive, neutral, negative, and extreme negative. However, due to significant class imbalance (with the extreme negative class being almost half the size of the positive class), the classes were reduced to three: positive, neutral, and negative. This adjustment helps in achieving a more balanced and effective training process for the sentiment classification models.
The dataset is available here.
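For reference, a minimal sketch of the class relabelling described above, assuming a pandas DataFrame loaded from a CSV with a `Sentiment` column holding the original five labels; the file name, encoding, and exact label strings are assumptions about the dataset layout, not taken from the notebook:

```python
import pandas as pd

# File name, encoding, and label strings are assumptions about the dataset layout.
train_df = pd.read_csv("Corona_NLP_train.csv", encoding="latin-1")

# Collapse the five original sentiment classes into three broader classes.
label_map = {
    "Extremely Positive": "Positive",
    "Positive": "Positive",
    "Neutral": "Neutral",
    "Negative": "Negative",
    "Extremely Negative": "Negative",
}
train_df["Sentiment"] = train_df["Sentiment"].map(label_map)
print(train_df["Sentiment"].value_counts())
```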
Data Preprocessing
Preprocessing is crucial for preparing the text data for model training. The steps taken include:
- Cleaning Text: Numbers are replaced with ‘@’ symbols, URLs with ‘[WEBSITE]’, and mentions, hashtags, and non-alphanumeric characters are removed.
- Vectorization: TensorFlow’s `TextVectorization` layer transforms the raw text into numerical representations that machine learning models can consume. The maximum number of tokens (words) was set to 56,000 (approximately the 99th percentile) and the output sequence length to 50 (approximately the 95th percentile) to capture most of the patterns.
- Embedding: Token embeddings are created with TensorFlow’s `Embedding` layer, using an input dimension equal to the vocabulary size of the vectorizer above.
- Label Encoding: The neural networks expect integer labels, so the sentiment labels are encoded into integers with scikit-learn’s `LabelEncoder` (see the sketch after this list).
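A condensed sketch of these steps, continuing the DataFrame from the relabelling sketch above; the column name, regex patterns, and embedding width are assumptions rather than the notebook's exact choices:

```python
import re
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder

def clean_text(text: str) -> str:
    """Cleaning rules described above (regex patterns are illustrative)."""
    text = re.sub(r"https?://\S+", "[WEBSITE]", text)  # URLs -> [WEBSITE]
    text = re.sub(r"[@#]\w+", "", text)                # drop mentions and hashtags
    text = re.sub(r"\d+", "@", text)                   # numbers -> '@'
    text = re.sub(r"[^A-Za-z@\[\] ]+", " ", text)      # drop other non-alphanumerics
    return text.strip()

train_texts = [clean_text(t) for t in train_df["OriginalTweet"]]  # column name is an assumption
train_labels = train_df["Sentiment"].to_numpy()

# Vectorizer: vocabulary capped near the 99th percentile, sequence length near the 95th.
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=56000,
    output_sequence_length=50,
)
vectorizer.adapt(tf.constant(train_texts))

# Embedding layer sized to the learned vocabulary (output width is an assumption).
embedding = tf.keras.layers.Embedding(
    input_dim=len(vectorizer.get_vocabulary()),
    output_dim=128,
)

# The networks expect integer labels, so encode the sentiment strings.
label_encoder = LabelEncoder()
train_labels_enc = label_encoder.fit_transform(train_labels)
```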
The data is then converted into TensorFlow datasets and batched for training to improve the efficiency, scalability, and performance of the machine learning pipeline.
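Continuing the sketch, the cleaned texts and encoded labels can be wrapped in a tf.data pipeline; the batch size and shuffle buffer are assumptions:

```python
BATCH_SIZE = 32  # assumed batch size

train_ds = (
    tf.data.Dataset.from_tensor_slices((train_texts, train_labels_enc))
    .shuffle(buffer_size=10_000)
    .batch(BATCH_SIZE)
    .prefetch(tf.data.AUTOTUNE)  # overlap preprocessing with training
)
```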
Model Experiments
Various machine learning models were initially trained to determine the performance of convolutional, recurrent and pretrained embedding layers in classifying tweet sentiments:
- 1D Convolutional Neural Network (Conv1D): A 1D CNN model was created to process text sequences, capturing local patterns and features within the text using convolutional filters.
- Long Short-Term Memory Network (LSTM): LSTM layers were used to capture long-term dependencies and sequential patterns in the text data, making them suitable for understanding context in sentences.
- Gated Recurrent Unit (GRU): GRU layers, similar to LSTM but with a simpler structure, were implemented to learn efficiently from sequential data while reducing computational complexity.
- Bidirectional LSTM: This model combines forward and backward LSTM layers to capture context from both directions in the text, enhancing the understanding of the sequence (a sketch of this architecture follows the list).
- Pretrained Embeddings (Universal Sentence Encoder): The Universal Sentence Encoder was used to leverage pre-trained embeddings for capturing semantic meaning, followed by dense layers for classification.
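For illustration, a minimal Bidirectional LSTM classifier in the spirit of Model_4, reusing the vectorizer and embedding defined earlier; the layer sizes, optimizer, and epoch count are assumptions rather than the notebook's exact configuration:

```python
inputs = tf.keras.Input(shape=(), dtype=tf.string)            # one raw tweet string per example
x = vectorizer(inputs)                                        # text -> integer token ids
x = embedding(x)                                              # token ids -> dense vectors
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))(x)
outputs = tf.keras.layers.Dense(3, activation="softmax")(x)   # positive / neutral / negative

model_4 = tf.keras.Model(inputs, outputs)
model_4.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",  # integer labels from LabelEncoder
    metrics=["accuracy"],
)
model_4.fit(train_ds, epochs=5)  # epoch count is an assumption
```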
The results are shown in the table below:
| Model | Architecture | Train Accuracy | Test Accuracy |
|---|---|---|---|
| Model_1 | 1D Convolutional Neural Network (Conv1D) | 0.906699 | 0.819905 |
| Model_2 | Long Short-Term Memory Network (LSTM) | 0.943485 | 0.827804 |
| Model_3 | Gated Recurrent Unit (GRU) | 0.944116 | 0.841759 |
| Model_4 | Bidirectional LSTM | 0.942513 | 0.844392 |
| Model_5 | Pretrained Embeddings (Universal Sentence Encoder + Dense) | 0.746240 | 0.664560 |
Overall, Models 2, 3, and 4 demonstrated strong performance with good generalization capabilities, while Model 5 underperformed, indicating that solely relying on pre-trained embeddings without adequate fine-tuning may not capture the specific contextual details required for this sentiment analysis task.
Sample Predictions
A few sample predictions made by the sentiment analysis model are presented to illustrate its behaviour. The table below shows the original tweets, the true sentiment labels, and the predicted sentiment labels.
| Tweet | True Sentiment | Predicted Sentiment |
|---|---|---|
| I have summarized the most important points from the paper in this thread [WEBSITE] | Positive | Negative |
| Breaking New Jersey officials urge residents to stock up for a two week coronavirus quarantine Just in case COVID [WEBSITE] | Neutral | Negative |
| Another day another supermarket shelf cleared out of toilet paper Covid Coles Wendouree [WEBSITE] | Positive | Negative |
| So started this weight loss challenge work however with every grocery store ransacked I don’t think I’ll have a problem losing weight thanks Covid | Negative | Negative |
| Coronavirus be like the end of the world Especially at the grocery store [omy] TwitterOfTime COVID KeepSafeEveryone pleaselike | Positive | Negative |
| Just going to leave this one here Full story at [WEBSITE] [WEBSITE] | Negative | Negative |
The sample predictions reveal that the model accurately identifies negative sentiments but often misclassifies positive and neutral sentiments as negative. This could be due to certain phrases or tones that the model interprets incorrectly. Additionally, some tweets may have been mislabeled, such as the first one, which appears to be more neutral than positive. These ambiguities highlight the model’s limitations in distinguishing between subtle sentiment variations, suggesting a need for further refinement in context understanding and sentiment differentiation.
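Predictions like those in the table can be reproduced by passing raw tweets through the trained model and mapping the predicted class indices back to label names; a short sketch reusing the `model_4` and `label_encoder` names assumed earlier:

```python
import numpy as np

sample_tweets = tf.constant([
    "Another day another supermarket shelf cleared out of toilet paper Covid [WEBSITE]",
])

probs = model_4.predict(sample_tweets)                              # shape: (n_samples, 3)
pred_labels = label_encoder.inverse_transform(np.argmax(probs, axis=1))
print(pred_labels)  # e.g. ['Negative']
```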
Suggested Improvements
To enhance the performance of the sentiment analysis model and address the identified limitations, the following three improvements are recommended:
- Fine-Tuning Pretrained Models: The pretrained Universal Sentence Encoder underperformed due to the lack of fine-tuning. By fine-tuning the encoder on the specific Covid-19 tweet dataset, the model can better adapt to the nuances and context of these tweets, improving its ability to classify sentiments correctly (see the sketch after this list).
- Advanced Preprocessing: Enhance text preprocessing by incorporating more sophisticated techniques, including handling negations, correcting spelling errors, and considering the semantic context of words. These improvements can help the model better understand the true sentiment of the text, reducing misclassifications.
- Transformer Models: Explore advanced transformer-based models such as BERT, RoBERTa, or GPT-3, which have demonstrated state-of-the-art performance in various NLP tasks. These models can capture the context and sentiment of complex sentences more effectively, leading to improved classification accuracy.
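As a hedged sketch of the first suggestion, the Universal Sentence Encoder can be loaded from TensorFlow Hub with `trainable=True` so that its weights are updated on the Covid-19 tweets; the TF Hub handle below is the public one for USE v4, while the classification head, learning rate, and epoch count are assumptions:

```python
import tensorflow as tf
import tensorflow_hub as hub

# Load the Universal Sentence Encoder with trainable weights so fine-tuning updates them.
use_layer = hub.KerasLayer(
    "https://tfhub.dev/google/universal-sentence-encoder/4",
    trainable=True,
    input_shape=[],        # each example is a single raw string
    dtype=tf.string,
)

model = tf.keras.Sequential([
    use_layer,
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),  # positive / neutral / negative
])

# A lower learning rate is typical when fine-tuning pretrained weights.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.fit(train_ds, epochs=3)  # epoch count is an assumption
```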