NLP: first steps with spaCy
Hi, everyone! Let’s take one more step toward creating and using NLP pipelines, and see what spaCy can do for us.
To understand more about spaCy’s architecture, the diagram in the spaCy documentation is a great starting point.
Let’s summarize it (the text below is from the spaCy documentation):
The central data structures in spaCy are the Language class, the Vocab, and the Doc object. The Language class is used to process a text and turn it into a Doc object. It’s typically stored as a variable called nlp.

The Doc object owns the sequence of tokens and all their annotations. By centralizing strings, word vectors, and lexical attributes in the Vocab, we avoid storing multiple copies of this data. This saves memory and ensures there’s a single source of truth.

Text annotations are also designed to allow a single source of truth: the Doc object owns the data, and Span and Token are views that point into it.

The Doc object is constructed by the Tokenizer and then modified in place by the components of the pipeline. The Language object coordinates these components. It takes the raw text and sends it through the pipeline, returning an annotated document. It also orchestrates training and serialization.
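To make that orchestration concrete, here is a minimal sketch (assuming spaCy v3): we start from a blank English pipeline, add a built-in rule-based component, and let the Language object send the text through it. No trained model download is needed for this.

```python
import spacy

# A blank English pipeline: just the tokenizer, no trained components
nlp = spacy.blank("en")

# Add a built-in rule-based sentence splitter to the pipeline
nlp.add_pipe("sentencizer")
print(nlp.pipe_names)  # ['sentencizer']

# Calling nlp runs the raw text through the tokenizer and every pipeline
# component, returning an annotated Doc
doc = nlp("Hello world. This is spaCy.")
print([sent.text for sent in doc.sents])  # ['Hello world.', 'This is spaCy.']
```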
Predicting Part-of-speech Tags
spaCy can parse and tag a given Doc. This is where the trained pipeline and its statistical models come in, enabling spaCy to predict which tag or label most likely applies in this context.
import spacy
import pandas as pd
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

text_attributes = [
    dict(Text=token.text, Lemma=token.lemma_, Pos=token.pos_, Tag=token.tag_,
         Dep=token.dep_, Shape=token.shape_, Is_Alpha=token.is_alpha,
         Is_stop=token.is_stop)
    for token in doc
]
text_df = pd.DataFrame(text_attributes)
There is a lot to unpack here. Here is what each attribute means:
- Text: The original word text.
- Lemma: The base form of the word.
- POS: The simple UPOS part-of-speech tag.
- Tag: The detailed part-of-speech tag.
- Dep: Syntactic dependency, i.e. the relation between tokens.
- Shape: The word shape — capitalization, punctuation, digits.
- Is_Alpha: Is the token an alphabetic token?
- Is_stop: Is the token part of a stop list, i.e. the most common words of the language?
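Note that some of these attributes (Text, Shape, Is_Alpha, Is_stop) are lexical: they come straight from the vocabulary and need no trained model. Others (Lemma, POS, Tag, Dep) are statistical predictions and require a trained pipeline such as en_core_web_sm. A quick sketch with only the blank English tokenizer shows the lexical ones working on their own:

```python
from spacy.lang.en import English

nlp = English()  # tokenizer only, no trained components
doc = nlp("Apple is looking at buying")

for token in doc:
    # Lexical attributes are available without a model;
    # pos_, lemma_, and dep_ would be empty here
    print(token.text, token.shape_, token.is_alpha, token.is_stop)
```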
displacy.serve(doc, style="dep")
This is just part of the rendered output.
displacy.serve(doc, style="ent")
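If you are working in a Jupyter notebook, you don’t need displacy.serve: displacy.render displays the visualization inline, and with jupyter=False it simply returns the HTML markup as a string. A small sketch (using a blank pipeline so no model download is required):

```python
import spacy
from spacy import displacy

nlp = spacy.blank("en")
doc = nlp("Hello world")

# With jupyter=False, render returns the markup instead of displaying it
html = displacy.render(doc, style="ent", jupyter=False)
print(html[:60])
```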
Let’s code now to better understand these data structures:
# Create an nlp object
from spacy.lang.en import English
nlp = English()

# Import the Doc class
from spacy.tokens import Doc

# The words and spaces to create the doc from
words = ['Hi', ',', 'my', 'name', 'is', 'Conrado', '.']
spaces = [False, True, True, True, True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)
output: 'Hi, my name is Conrado.'
Now, let’s play around with this a little and create some Span objects.
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

# The words and spaces to create the doc from
words = ['Hi', ',', 'my', 'name', 'is', 'Conrado', '.']
spaces = [False, True, True, True, True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# Create a span manually (tokens 5 and 6: 'Conrado' and '.')
span = Span(doc, 5, 7)

# Print the span
print(span)
output: 'Conrado.'

# Create a span with a label
span_with_label = Span(doc, 5, 6, label="NAME")

# Add the span to doc.ents
doc.ents = [span_with_label]

# Print the entities
print(doc.ents)
output: '(Conrado,)'
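Once a span is registered as an entity, you can read its attributes back. A short self-contained sketch, rebuilding the same doc, shows the views a Span exposes:

```python
from spacy.lang.en import English
from spacy.tokens import Doc, Span

nlp = English()
words = ['Hi', ',', 'my', 'name', 'is', 'Conrado', '.']
spaces = [False, True, True, True, True, False, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)

span = Span(doc, 5, 6, label="NAME")
doc.ents = [span]

# A Span knows its token boundaries, its text, and its label
print(span.start, span.end)  # 5 6
print(span.text)             # Conrado
print(span.label_)           # NAME

# Entities on the Doc expose the same views
for ent in doc.ents:
    print(ent.text, ent.label_)  # Conrado NAME
```

Remember that a Span is only a view into the Doc: it stores no text of its own, which is the single-source-of-truth design described earlier.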
This is very powerful when building pipelines for NLP analysis.
I covered many topics in this post; it’s important to understand these core concepts before going further in the analysis.
Next week, I’ll post another one.
Thanks.