Using spaCy to extract NER

Conrado Bio
4 min read · Jun 9, 2022



Hello everyone! I'm continuing my article series about spaCy and NLP.
I've already written about:

1. NLP, first steps using spaCy

2. Rule-based Matching with spaCy

3. Twitter Sentiment Analysis using spaCy

4. Improving Text Classification Models using spaCy

And now let's explore one of spaCy's essential features: NER, or Named Entity Recognition.

spaCy features an extremely fast statistical entity recognition system, that assigns labels to contiguous spans of tokens. The default trained pipelines can identify a variety of named and numeric entities, including companies, locations, organizations and products. You can add arbitrary classes to the entity recognition system, and update the model with new examples.

You can find much more information by accessing the spaCy documentation.

What’s Named Entity Recognition?

A named entity is a “real-world object” that’s assigned a name — for example, a person, a country, a product, or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction.

Let’s go to the first example:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.label_)

# OUTPUT
Apple ORG
U.K. GPE
$1 billion MONEY

Amazing!

The model identified the named entities: in this case, an ORG (companies, agencies, institutions), a GPE (geopolitical entity), and MONEY (monetary values, including units).
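If you're ever unsure what one of these labels means, spaCy can describe it for you:

```python
import spacy

# spacy.explain returns a human-readable description of a label
print(spacy.explain("ORG"))    # companies, agencies, institutions
print(spacy.explain("GPE"))    # countries, cities, states
print(spacy.explain("MONEY"))  # monetary values, including units
```

This works for any entity label, part-of-speech tag, or dependency label in spaCy's glossary.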

We can also visualize the entities using the displaCy visualizer:

from spacy import displacy

displacy.serve(doc, style="ent")
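If you're working in a script rather than a notebook, displacy.render returns the markup as a string instead of starting a server. It can even render from a plain dict (manual mode), so here's a small sketch using the example sentence from above with a hand-written entity span:

```python
from spacy import displacy

# Manual mode: render highlighted entities from a plain dict,
# no model or Doc object required
example = {
    "text": "Apple is looking at buying U.K. startup for $1 billion",
    "ents": [{"start": 0, "end": 5, "label": "ORG"}],
}
html = displacy.render(example, style="ent", manual=True)
# `html` now holds the markup; save it or embed it in a page
```

This is handy for saving visualizations to an HTML file or embedding them in a report.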

NER Project

Okay, now that I've covered the basics of Named Entity Recognition, let's talk about starting a NER project.
There are two ways to start a NER project. The first and easiest is to use spaCy's pre-trained models, such as en_core_web_lg or en_core_web_trf.
With these two pre-trained models, it's fast to get started.

But almost all the time you'll want to recognize specific labels, such as brand names found in Twitter texts, drug names, or particular patterns of company names.
In cases like these, you will need to train your own model on your own annotated data.

Yeah, you read that right!

You have to annotate your own data to fit your project target.

Annotation

This is one of the most important tasks when building an NLP project.
You need to annotate the data correctly, because the model learns to make predictions from the examples you give it.
And the quality of the annotation has a huge impact on the final model.
There are many tools that do this annotation process, such as Prodigy, Stanford CoreNLP, TagEditor, and doccano.

doccano

doccano is an open-source text annotation tool for humans. It provides annotation features for text classification, sequence labeling, and sequence-to-sequence tasks.

So, you can create labeled data for sentiment analysis, named entity recognition, text summarization, and so on.

You can find more information on the doccano GitHub:

https://github.com/doccano/doccano
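Once you've annotated your data in doccano, you can export it as JSONL. The exact export format depends on your doccano version and task type, but assuming each line looks like {"text": ..., "label": [[start, end, label], ...]}, a small converter to spaCy-style training tuples might look like this:

```python
import json

def doccano_to_spacy(jsonl_lines):
    """Convert doccano sequence-labeling JSONL lines into
    (text, {"entities": [(start, end, label), ...]}) tuples."""
    examples = []
    for line in jsonl_lines:
        record = json.loads(line)
        entities = [(start, end, label) for start, end, label in record["label"]]
        examples.append((record["text"], {"entities": entities}))
    return examples

# One made-up annotated line, with character offsets for "aspirin"
lines = ['{"text": "Take two aspirin daily", "label": [[9, 16, "DRUG"]]}']
train_data = doccano_to_spacy(lines)
```

Check your own export against this assumed schema before converting; doccano also ships a spaCy-compatible export in some versions.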

Annotating your texts

doccano is one of the tools you can use to annotate your own data, build your NLP pipeline, and train your own models.

Tools like this are very easy to use, and models trained on our own annotated data give very good results.

Tips for building your own NLP model:

1. Have a clear goal for your project

The main question is: what's the objective of my project? Will it be a classification project (e.g. sentiment analysis)? Will it be a Named Entity Recognition project?

2. Where is the data?

Will the data come from Twitter or another social network? Will I extract articles from an API?

3. Which annotation tool will I use for this project?

4. What are the important labels to annotate?

If it's sentiment analysis, I'll mark the texts as positive, negative, or any other label.

5. What are the guidelines to follow when annotating?

What are the text features that classify a text as negative? (The presence of negative words is obvious, but if the text contains complaints, what patterns should I follow?)

I know the annotation pipeline is much more complex than these questions, but they are a good starting point for a first project.

Now, with your own dataset, you can build your spaCy pipeline and train a NER model (or any other component).
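As a preview, here is a minimal sketch of what training a custom NER model looks like in spaCy v3. The DRUG label, the two example sentences, and their character offsets are all made up for illustration; a real project would use your annotated dataset, with shuffling and batching:

```python
import spacy
from spacy.training import Example

# Tiny, made-up annotated dataset: (text, {"entities": [(start, end, label)]})
TRAIN_DATA = [
    ("Take two aspirin daily", {"entities": [(9, 16, "DRUG")]}),
    ("Patients received ibuprofen twice", {"entities": [(18, 27, "DRUG")]}),
]

nlp = spacy.blank("en")        # start from an empty English pipeline
ner = nlp.add_pipe("ner")      # add a fresh NER component
ner.add_label("DRUG")          # register our custom label

optimizer = nlp.initialize()   # initialize the model weights
for _ in range(20):            # a few passes over the tiny dataset
    losses = {}
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer, losses=losses)
```

In practice you'd train with hundreds of examples and spaCy's config-driven `spacy train` command rather than a hand-rolled loop, but the Example objects and entity offsets work the same way.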

In the next article, I'll build a project from scratch to show how easy and fast it is to train a new NLP model.
