Twitter Sentiment Analysis using spaCy
Hi everyone, thank you to everyone reading and sharing my articles; I really appreciate it.
I think this is my best article about spaCy so far.
I'm going to tackle an actual project using spaCy.
Yeahhhhhhhhh…
ABOUT THIS PROJECT:
This project was on Kaggle 2 years ago.
Tweet Sentiment Extraction → Extract support phrases for sentiment labels
The prize for this competition was US$15,000, a lot of money.
The link to this project: https://www.kaggle.com/competitions/tweet-sentiment-extraction
Summarizing this project:
More than 30 thousand tweets were collected, and each tweet was labeled with a sentiment (positive, negative, or neutral).
The objective of this competition was to construct a model that can do the same — look at the labeled sentiment for a given tweet and figure out what word or phrase best supports it.
Very interesting project.
PROJECT
There are 3 files in this challenge: the train data, the test data, and the sample_submission.
Let’s take a look at the train and test data.
I'm not going to do a full EDA on this dataset; I just want to take a look at the label distribution.
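A quick way to get these counts with pandas, assuming the CSVs live in an assets/ folder as in the loading code further down:

import pandas as pd

df_train = pd.read_csv("assets/train.csv")
df_test = pd.read_csv("assets/test.csv")

# Count how many tweets carry each sentiment label
print(df_train['sentiment'].value_counts())
print(df_test['sentiment'].value_counts())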
Train data:
neutral 11118
positive 8582
negative 7781
Test data:
neutral 1430
positive 1103
negative 1001
Ok, the classes are reasonably balanced, so we don't need to do any rebalancing on these datasets.
df_train.head()
There are 4 columns, but for this analysis, let's focus only on the text and sentiment columns.
Let's understand a little more about training models in spaCy.
You can read the spaCy documentation for the details, but with spaCy you can train the following model types:
There are at least 17 trainable pipeline components (of course, there are many particular cases for each of these models, but there is a huge opportunity in using spaCy for NLP).
In this specific case, I will use TextCategorizer.
But, what’s Text Categorizer?
The text categorizer predicts categories over a whole document and comes in two flavors: textcat and textcat_multilabel. Predictions will be saved to doc.cats as a dictionary, where the key is the name of the category and the value is a score between 0 and 1 (inclusive).
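For example, with our three sentiment labels, doc.cats for a processed document would hold something like this (the scores here are purely illustrative):
{'positive': 0.91, 'negative': 0.03, 'neutral': 0.06}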
There are many more features and attributes to study in this pipeline, but for now, let's start.
Let’s Code….
Actually, we can't start just yet; there is one more important thing to cover about the Text Categorizer.
It's very important to know the correct data format for training in spaCy, so we need some preprocessing functions to prepare the data.
We can't simply load CSV files into spaCy and train the pipeline; there is a specific format for that.
In the case of the Text Categorizer, we have to preprocess the dataset to look like this:
{'text': text, 'label': {'label1': score_label1, 'label2': score_label2}}
So we have to transform the dataset into this format.
Now, the code part:
import pandas as pd

# Load data
df_train = pd.read_csv("assets/train.csv")
df_test = pd.read_csv("assets/test.csv")

# Prepare the data: keep only the columns we need
df_train = df_train[['text', 'sentiment']]

# Drop NA rows
df_train = df_train.dropna()
df_test = df_test.dropna()

# Preprocess the data to the correct format
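The conversion itself is a simple transformation; here is a minimal sketch, assuming one-hot scores per label and a simple 80/20 train/dev split (the annot_data and dev_data names are reused further down):

LABELS = ['positive', 'negative', 'neutral']

def preprocess(df):
    # Turn each row into {'text': ..., 'label': {label_name: 0.0 or 1.0}}
    data = []
    for _, row in df.iterrows():
        labels = {label: float(label == row['sentiment']) for label in LABELS}
        data.append({'text': row['text'], 'label': labels})
    return data

# Hold out 20% of the training data for evaluation
# (a real run might shuffle before splitting)
records = preprocess(df_train)
split = int(len(records) * 0.8)
annot_data, dev_data = records[:split], records[split:]

# Peek at the first few converted records
print(annot_data[:3])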
# output ->
[{'text': 'Last session of the day http://twitpic.com/67ezh', 'label': {'positive': 0.0, 'negative': 0.0, 'neutral': 1.0}},
 {'text': ' Shanghai is also really exciting (precisely -- skyscrapers galore). Good tweeps in China: (SH) (BJ).', 'label': {'positive': 1.0, 'negative': 0.0, 'neutral': 0.0}},
 {'text': 'Recession hit Veronique Branquinho, she has to quit her company, such a shame!', 'label': {'positive': 0.0, 'negative': 1.0, 'neutral': 0.0}},
 ...]
Ok, now the data is in the correct format to start our training pipeline.
import spacy

# First, we have to start our spaCy pipeline
# We use a blank pipeline with the English language
nlp = spacy.blank("en")
from spacy.tokens import DocBin

# We have to convert the dictionaries into DocBin() format
def convert(data, output_path):
    db = DocBin()
    for line in data:
        doc = nlp.make_doc(line['text'])
        doc.cats = line['label']
        db.add(doc)
    db.to_disk(output_path)
# Call the function and save the train and dev data
convert(annot_data, "corpus/train.spacy")
convert(dev_data, "corpus/dev.spacy")
Okkkkkkkkk, the code part is over.
Yeah, you read that right, the code part is over.
To train and evaluate the model you don't need to write any more code, but there are some tricks ahead, so pay attention.
QUICKSTART
Now you have to configure spaCy, and for that you can use the Quickstart (https://spacy.io/usage/training#quickstart).
This is the spaCy quickstart widget: you choose the language, the pipeline components (multiple selections are allowed), the hardware, and whether to optimize for efficiency or accuracy (that is, whether or not to use pretrained word vectors).
This will generate a base_config.cfg file.
All that you have to do is copy or download this file and put it in a folder inside your project.
CONFIG FILE
For the next steps, we will use the terminal.
After you've saved the starter config to a file base_config.cfg, you can use the init fill-config command to fill in the remaining defaults. Training configs should always be complete and without hidden defaults, to keep your experiments reproducible.
$ python -m spacy init fill-config base_config.cfg config.cfg
Now, there is a new file, the config.cfg.
You only have to change one thing in this file: the paths to the train and dev files.
This is what config.cfg looks like.
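For example, the [paths] section at the top of the file can point straight at our corpus files (a sketch; your generated file may differ slightly):

[paths]
train = "corpus/train.spacy"
dev = "corpus/dev.spacy"
vectors = null
init_tok2vec = null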
There are so many settings and resources to talk about, but I will save that for another article. For now, know that config.cfg is considered the "single source of truth" for the training pipeline.
Today I want to show how easy it is to train your model using spaCy.
TRAIN
The most exciting part of this article is the training part.
It’s really easy.
In this article, we are following the spaCy documentation steps for training.
Training config files include all settings and hyperparameters for training your pipeline. Instead of providing lots of arguments on the command line, you only need to pass your config.cfg file to spacy train.
Important:
Training data for NLP projects comes in many different formats. For some common formats such as CoNLL, spaCy provides converters you can use from the command line. In other cases, you'll have to prepare the training data yourself.
To train your pipeline, you will use the command line.
# Training command:
#   config/config.cfg                  the path to the config file
#   --output ./training                the path for the output files
#   --paths.train corpus/train.spacy   the path to your train file
#   --paths.dev corpus/dev.spacy       the path to your dev file
$ python -m spacy train config/config.cfg --output ./training --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy
Note:
To train a spaCy pipeline, your data must be in .spacy format.
When you execute the command, the pipeline starts initializing.
And after it initializes, the training begins.
We want to make several passes over the data during training; each pass over the data is called an 'epoch'.
Within each epoch, spaCy outputs the accuracy score every 200 examples. These are the steps shown in the second column. It's possible to change this frequency in the config, as shown below.
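The frequency is controlled by the eval_frequency setting in the [training] block of config.cfg (200 is the usual default generated by the quickstart):

[training]
eval_frequency = 200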
Each line shows the loss and calculated accuracy score at this point during training.
The most interesting score to keep an eye on is the combined score in the last column. It reflects how accurately your model predicted the correct answers in the evaluation data.
The training runs until the model stops improving and exits automatically.
So, our model has a score of 67%.
That’s awesome.
But there is one more thing that spaCy can show us.
EVALUATE
Evaluate a trained pipeline.
Let’s run.
# Evaluate command:
#   training/model-best             the path to the best trained model
#   corpus/dev.spacy                the location of the evaluation data
#   --output metrics/scores2.json   the output JSON file for metrics
$ python -m spacy evaluate training/model-best corpus/dev.spacy --output metrics/scores2.json
That’s the result.
Explaining all of this could take a whole article of its own.
But here are the important things:
Results: 100% of words tokenized / Overall F1 score: 67% / Speed: the number of words per second the model can process.
TextCat F: the precision, recall, and F1 score for each label.
TextCat ROC AUC: the area under the ROC curve for each label.
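One last sketch: once training is done, the best model can be loaded like any other spaCy pipeline (the scores in the comment are purely illustrative):

import spacy

# Load the best model produced by the training run
nlp = spacy.load("training/model-best")

doc = nlp("I love this new phone, it's amazing!")
print(doc.cats)  # e.g. {'positive': 0.91, 'negative': 0.03, 'neutral': 0.06}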
Wowwwwwwwwww
That was a lot to cover about spaCy. This is just the first practical project, but as you can see, it's very easy to use spaCy to train your models.
In upcoming articles, I will cover more about the training pipeline and go into detail about each of the steps to train a model.
That's all folks, if you liked it, share this article.
Thank you.