Improving Text Classification Models using spaCy

Conrado Bio
3 min read · May 27, 2022


Hello, here I am again.

In the last post, I wrote a case study using spaCy to predict sentiment in Twitter data.

There is a lot more to say about the fundamental principles of the spaCy training process.

But today, the topic is how to improve the pipeline for Text Classification.

Summarizing the last post:

  1. Extract the data (the case study was a Kaggle competition, so the dataset was ready to go)
  2. Key concepts about spaCy Text Classification
  3. Preprocess the data to fit the spaCy pattern (we didn't do any text cleaning)
  4. Convert the data to DocBin format (see the sketch after this list)
  5. Set the config.cfg
  6. Train
  7. Evaluate
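
For steps 3 and 4, the conversion boils down to something like this (a minimal sketch: train_texts, train_labels, dev_texts, dev_labels, and the positive/negative label names are placeholders from the binary sentiment setup, so adjust them to your data):

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # blank English pipeline, as in the original model

def to_docbin(texts, labels, path):
    # textcat reads the gold label from doc.cats as {label: score}
    db = DocBin()
    for text, label in zip(texts, labels):
        doc = nlp.make_doc(text)
        doc.cats = {"positive": float(label == 1), "negative": float(label == 0)}
        db.add(doc)
    db.to_disk(path)

to_docbin(train_texts, train_labels, "train.spacy")
to_docbin(dev_texts, dev_labels, "dev.spacy")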

Ok… Let’s start with the optimization part.

. Original Model:

We trained our model using a blank English spaCy pipeline (spacy.blank("en")).

And the results weren't so good.

ROC AUC: 0.80

F1 score: 0.63
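
(A quick note on how scores like these can be computed: the sketch below uses scikit-learn on a held-out set; nlp_trained, dev_texts, dev_labels, and the "positive" label name are placeholders, not the exact code from the original run.)

from sklearn.metrics import roc_auc_score, f1_score

# doc.cats holds the per-label scores produced by the textcat component
probs = [nlp_trained(text).cats["positive"] for text in dev_texts]
preds = [int(p >= 0.5) for p in probs]

print("ROC AUC:", roc_auc_score(dev_labels, probs))
print("F1 score:", f1_score(dev_labels, preds))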

. First Optimization:

Let’s train using spaCy’s small model. (I used the small model because of my machine’s limited capacity; to train the large model, I will use Google Colab.)

nlp = spacy.load("en_core_web_sm")

To change this, we have to generate a new config.cfg.

See: https://spacy.io/usage/training
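
For example, the two commands below are the usual spaCy v3 workflow (a sketch; other flags exist, and reusing en_core_web_sm's weights is done inside config.cfg by sourcing components, as the docs page above describes):

# generate a base config for an English text classification pipeline
python -m spacy init config config.cfg --lang en --pipeline textcat

# train it on the DocBin files from the earlier steps
python -m spacy train config.cfg --output ./output --paths.train train.spacy --paths.dev dev.spacy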

. Second Optimization:

Let’s clean the text.

For this, I use three functions (of course, you can improve on them with other functions and cleaning pipelines):

import re
import string

def remove_emoji(text):
    # strip emoji and pictograph ranges from the text
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"  # enclosed characters
        "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

def remove_url(text):
    # strip http/https URLs
    url_pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    return url_pattern.sub(r'', text)

def clean_text(text):
    # drop punctuation, then remove digit-only tokens and words shorter
    # than 4 characters, and lowercase the result
    table = str.maketrans({ch: '' for ch in string.punctuation})
    words = text.translate(table).split()
    return ' '.join(w for w in words if not w.isdigit() and len(w) > 3).lower()
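
Chained together, the cleaning runs before the DocBin conversion (a sketch; df and the "text" column name are placeholders for however your data is stored):

def preprocess(text):
    # order matters: strip emojis and URLs first, then punctuation and short tokens
    return clean_text(remove_url(remove_emoji(text)))

df["text"] = df["text"].apply(preprocess)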

There are a lot of similar projects, and I found these functions in one of them.

The results are much better.
After 20 epochs and almost 1 hour of training, here is the outcome:

Wowww, the model is much better: the F1 score is now 85%.

Future Optimizations:

. Use a bigger pretrained model: en_core_web_lg (large) or en_core_web_trf (transformer)

. Improve the text cleaning pipeline

. Test other approaches: balance the data (see the sketch below), evaluate the outliers
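
For the balancing idea, a minimal undersampling sketch with pandas (df and the "label" column are placeholders; which class is the minority depends on your data):

import pandas as pd

# keep all minority-class rows and sample the majority class down to match
minority = df[df["label"] == 1]
majority = df[df["label"] == 0].sample(len(minority), random_state=42)
balanced = pd.concat([minority, majority]).sample(frac=1, random_state=42)  # shuffle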

And there is much more to say about optimization.

Thank you so much for reading this article.

If you have some suggestions for the next topics, please tell me.

That’s all, folks. If you liked it, share this article.

Thank you.
