NLP, first steps using spaCy


Hi, everyone! Let’s take one more step and look at how to create and use NLP pipelines.

So, let’s start by seeing what spaCy can do for us.

To understand the spaCy architecture, this image is a great starting point.

https://spacy.io/architecture-415624fc7d149ec03f2736c4aa8b8f3c.svg

This image is from the spaCy documentation.

So let’s summarize (the following text is from the spaCy documentation):

The central data structures in spaCy are the Language class, the Vocab, and the Doc object. The Language class is used to process a text and turn it into a Doc object. It’s typically stored as a variable called nlp.

The Doc object owns the sequence of tokens and all their annotations. By centralizing strings, word vectors, and lexical attributes in the Vocab, we avoid storing multiple copies of this data. This saves memory and ensures there’s a single source of truth.

Text annotations are also designed to allow a single source of truth: the Doc object owns the data, and Span and Token are views that point into it.

The Doc object is constructed by the Tokenizer and then modified in place by the components of the pipeline. The Language object coordinates these components. It takes the raw text and sends it through the pipeline, returning an annotated document.

It also orchestrates training and serialization.
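To make these relationships concrete, here is a minimal sketch (it assumes the small English model is installed, via python -m spacy download en_core_web_sm):

import spacy

# The Language object (conventionally named nlp) holds the pipeline and the shared Vocab
nlp = spacy.load("en_core_web_sm")

# Processing text returns a Doc: the single source of truth for tokens and annotations
doc = nlp("I love coffee")

# Token and Span are lightweight views into the Doc, not copies
token = doc[2]    # the Token "coffee"
span = doc[1:3]   # the Span "love coffee"

# Strings are stored once in the shared Vocab's string store and referenced by hash
print(token.text, span.text, nlp.vocab.strings["coffee"])

The last line prints the token text, the span text, and the hash under which "coffee" is stored in the vocabulary.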

Predicting Part-of-speech Tags

spaCy can parse and tag a given Doc. This is where the trained pipeline and its statistical models come in, which enable spaCy to make predictions of which tag or label most likely applies in this context.

import spacy
import pandas as pd
from spacy import displacy

# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Collect each token's linguistic attributes into a DataFrame
text_attributes = [
    dict(Text=token.text, Lemma=token.lemma_, Pos=token.pos_, Tag=token.tag_,
         Dep=token.dep_, Shape=token.shape_, Is_Alpha=token.is_alpha, Is_stop=token.is_stop)
    for token in doc
]
text_df = pd.DataFrame(text_attributes)
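
A quick way to inspect the result:

# Show the first tokens and their predicted attributes
print(text_df.head())

For this sentence, Apple typically comes out as a proper noun (Pos PROPN, Tag NNP, Dep nsubj, Shape Xxxxx), while is is tagged AUX and flagged as a stop word. Exact predictions can vary slightly between model versions.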

There is a lot to unpack here.

The meaning of each attribute:

  • Text: The original word text.
  • Lemma: The base form of the word.
  • POS: The simple UPOS part-of-speech tag.
  • Tag: The detailed part-of-speech tag.
  • Dep: Syntactic dependency, i.e. the relation between tokens.
  • Shape: The word shape — capitalization, punctuation, digits.
  • is alpha: Is the token an alpha character?
  • is stop: Is the token part of a stop list, i.e. the most common words of the language?

We can visualize the dependency parse with displaCy:

displacy.serve(doc, style="dep")

The visualization is interactive and can be larger than what fits on screen; this is just part of it.
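Note that displacy.serve starts a small local web server and blocks until you stop it. If you are working in a Jupyter notebook, displacy.render(doc, style="dep") draws the visualization inline instead.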

To highlight named entities instead, switch the style to "ent":

displacy.serve(doc, style="ent")
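With en_core_web_sm, this sentence is typically rendered with Apple highlighted as ORG, U.K. as GPE, and $1 billion as MONEY, though the exact predictions depend on the model version.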

Now let’s write some code to better understand this data structure.

# Create an nlp object
from spacy.lang.en import English
nlp = English()
# Import the Doc class
from spacy.tokens import Doc
# The words and spaces to create the doc from
words = ['Hi', ',', 'my', 'name', 'is', 'Conrado', '.']
spaces = [False, True, True, True, True, False, False]
# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)
output: 'Hi, my name is Conrado.'
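
Because Token and Span are just views into the Doc, you can also index and slice it directly; we’ll use the same indices below to build a Span by hand:

# Index a single token, or slice a range of tokens into a Span
print(doc[5].text)
output: 'Conrado'
print(doc[2:4].text)
output: 'my name'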

Now, let’s play around with this a little and create a Span.

#Import the Doc and Span classes
from spacy.tokens import Doc, Span
# The words and spaces to create the doc from
words = ['Hi', ',', 'my', 'name', 'is', 'Conrado', '.']
spaces = [False, True, True, True, True, False, False]
# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)
# Create a span manually
span = Span(doc, 5, 6)
# Print Span
print(span)
output: 'Conrado'

# Create a span with a label
span_with_label = Span(doc, 5, 6, label="NAME")
# Add span to the doc.ents
doc.ents = [span_with_label]
# Print Entities
print(doc.ents)
output: '(Conrado,)'
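
Because our labeled span is now registered in doc.ents, its label can be read back just like a model-predicted entity:

# Each entity span carries its text and label
for ent in doc.ents:
    print(ent.text, ent.label_)
output: 'Conrado NAME'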

Spans like this are very powerful building blocks when creating NLP analysis pipelines.

I covered many topics in this post; understanding these core concepts is essential before going further into the analysis.

Next week, I’ll post another one.
Thanks.
