Rule-based Matching with spaCy
So, let’s move on to the third chapter of demystifying NLP with spaCy.
I’ve already written about the basics of NLP.
The second article delved into spaCy and what you can do with it.
And in this article, the topic is how to use rule-based matching in spaCy.
Let’s start…
Statistical Predictions vs. Rule-Based Systems
Rule-based matching
Compared to using regular expressions on raw text, spaCy’s rule-based matcher engines and components not only let you find the words and phrases you’re looking for — they also give you access to the tokens within the document and their relationships.
This means you can easily access and analyze the surrounding tokens, merge spans into single tokens or add entries to the named entities in doc.ents.
Statistical Predictions
Use cases: application needs to generalize based on examples
Real-world examples: product names, person names, subject/object relationships
spaCy features: entity recognizer, dependency parser, part-of-speech tagger
Rule-Based Systems
Use cases: dictionary with a finite number of examples
Real-world examples: countries of the world, cities, drug names
spaCy features: tokenizer, Matcher, PhraseMatcher
But where should I use one or the other?
(There is a lot of good information in the spaCy documentation, so if you want to learn more about spaCy, you should read the documentation in depth.)
For complex tasks, it’s usually better to train a statistical entity recognition model.
However, statistical models require training data, so for many situations, rule-based approaches are more practical.
This is especially true at the start of a project: you can use a rule-based approach as part of a data collection process, to help you “bootstrap” a statistical model.
Rule-based systems are a good choice if there’s a more or less finite number of examples that you want to find in the data, or if there’s a very clear, structured pattern you can express with token rules or regular expressions. For instance, country names, IP addresses, or URLs are things you might be able to handle well with a purely rule-based approach.
Matcher vs. PhraseMatcher
The Matcher
allows you to write very abstract representations of the tokens you’re looking for, using lexical attributes, linguistic features predicted by the model, operators, set membership, and rich comparison.
For example, you can find a noun, followed by a verb with the lemma “love” or “like”, followed by an optional determiner and another token that’s at least 10 characters long.
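Here is a minimal sketch of that exact pattern. The sentence and the “EXAMPLE” label are my own illustration, and since lemmas and part-of-speech tags are model predictions, it assumes a trained pipeline such as en_core_web_sm is installed:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

pattern = [
    {"POS": "NOUN"},                                      # a noun
    {"LEMMA": {"IN": ["love", "like"]}, "POS": "VERB"},   # a verb with the lemma "love" or "like"
    {"POS": "DET", "OP": "?"},                            # an optional determiner
    {"LENGTH": {">=": 10}},                               # a token at least 10 characters long
]
matcher.add("EXAMPLE", [pattern])

doc = nlp("My dog loves skateboarding videos")
for match_id, start, end in matcher(doc):
    print(doc[start:end])  # e.g. "dog loves skateboarding"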
The PhraseMatcher
is useful if you already have a large terminology list or gazetteer consisting of single or multi-token phrases that you want to find exact instances of in your data.
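For example, here is a small sketch with a tiny terminology list (the terms and the “PERSON_LIST” label are illustrative; a blank pipeline is enough because the PhraseMatcher only needs the tokenizer):
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)

terms = ["Barack Obama", "Angela Merkel"]
# nlp.make_doc only runs the tokenizer, which keeps pattern creation fast
patterns = [nlp.make_doc(term) for term in terms]
matcher.add("PERSON_LIST", patterns)

doc = nlp("German Chancellor Angela Merkel met with Barack Obama")
for match_id, start, end in matcher(doc):
    print(doc[start:end])  # Angela Merkel, Barack Obama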
Now, let’s see a fuller Matcher example for better understanding.
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span, DocBin

TEXTS = [
    'How to preorder the iPhone X', 'iPhone X is coming', 'Should I pay $1,000 for the iPhone X?',
    'The iPhone 8 reviews are here', "iPhone 11 vs iPhone 8: What's the difference?",
    'I need a new phone! Any tips?'
]

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Two tokens whose lowercase forms match "iphone" and "x"
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]
# Token whose lowercase form matches "iphone", followed by a digit
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True}]

# Add patterns to the matcher and create docs with matched entities
matcher.add("GADGET", [pattern1, pattern2])
docs = []
for doc in nlp.pipe(TEXTS):
    matches = matcher(doc)
    spans = [Span(doc, start, end, label=match_id) for match_id, start, end in matches]
    print(spans)
    doc.ents = spans
    docs.append(doc)

doc_bin = DocBin(docs=docs)
doc_bin.to_disk("./train.spacy")
This example covers several important topics in spaCy usage.
First, I create a blank pipeline for a given language class.
nlp = spacy.blank("en")
The vocabulary object must be shared with the documents the matcher will operate on.
matcher = Matcher(nlp.vocab)
The two patterns match ‘iphone’ followed by ‘x’ (case-insensitively, via the LOWER attribute), and ‘iphone’ followed by any digit.
You can find more pattern examples in spaCy documentation.
# Two tokens whose lowercase forms match "iphone" and "x"
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]
# Token whose lowercase form matches "iphone", followed by a digit
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True}]
Add a rule to the matcher, consisting of an ID key (’GADGET’) and one or more patterns.
# Add patterns to the matcher and create docs with matched entities
matcher.add("GADGET", [pattern1, pattern2])
Then we pass the ‘TEXTS’ into nlp.pipe to transform each text into a Doc object.
docs = []
for doc in nlp.pipe(TEXTS):
Next, we find all token sequences matching the supplied patterns in the Doc.
matches = matcher(doc)
Each match is a tuple of (match_id, start, end).
So, with the start and end, it’s possible to build a Span object.
spans = [Span(doc, start, end, label=match_id) for match_id, start, end in matches]
We assign the spans to doc.ents.
doc.ents = spans
If we print the spans, the result will be:
(... look at the complete code...)
spans = [Span(doc, start, end, label=match_id) for match_id, start, end in matches]
print(spans)
Output:
[iPhone X]
[iPhone X]
[iPhone X]
[iPhone 8]
[iPhone 11, iPhone 8]
[]
So, in the last text (‘I need a new phone! Any tips?’), we didn’t find any match.
The patterns look for ‘iphone’ followed by an ‘x’ or a digit, and this text contains neither, so nothing matched.
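Since the docs were serialized with DocBin, here is a minimal sketch of how they could be loaded back (assuming the ./train.spacy file written above; DocBin stores the strings it needs, so a blank vocab works):
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin().from_disk("./train.spacy")
for doc in doc_bin.get_docs(nlp.vocab):
    print([(ent.text, ent.label_) for ent in doc.ents])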
You can use the matcher in several ways, as in the examples below:
# Matches "love cats" or "likes flowers"
pattern1 = [{"LEMMA": {"IN": ["like", "love"]}},
{"POS": "NOUN"}]
# Matches tokens of length >= 10
pattern2 = [{"LENGTH": {">=": 10}}]
# Match based on morph attributes
pattern3 = [{"MORPH": {"IS_SUBSET": ["Number=Sing", "Gender=Neut"]}}]
# "", "Number=Sing" and "Number=Sing|Gender=Neut" will match as subsets
# "Number=Plur|Gender=Neut" will not match
# "Number=Sing|Gender=Neut|Polite=Infm" will not match because it's a superset
You can also use regular expressions:
import re
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The United States of America (USA) are commonly known as the United States (U.S. or US) or America.")

expression = r"[Uu](nited|\.?) ?[Ss](tates|\.?)"
for match in re.finditer(expression, doc.text):
    start, end = match.span()
    # char_span returns None if the offsets don't align with token boundaries
    span = doc.char_span(start, end)
    if span is not None:
        print('Found Match:', span.text)
Output:
Found Match: United States
Found Match: United States
Found Match: U.S.
Found Match: US
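Regular expressions can also be applied per token, directly inside Matcher patterns, with the “REGEX” operator. Here is a sketch for the ‘United States’ spellings (the pattern and the “USA” label are my own illustration):
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

pattern = [
    {"TEXT": {"REGEX": r"^[Uu](nited|\.?)$"}},
    {"TEXT": {"REGEX": r"^[Ss](tates|\.?)$"}},
]
matcher.add("USA", [pattern])

doc = nlp("The United States is often just called America")
for match_id, start, end in matcher(doc):
    print(doc[start:end])  # United States
Note that these regexes apply to one token at a time, so an abbreviation the tokenizer keeps as a single token (such as ‘U.S.’) would need its own single-token pattern.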
And even more complex patterns, like finding ‘Google I/O’ with a callback that runs on each match:
from spacy.lang.en import English
from spacy.matcher import Matcher
from spacy.tokens import Span

nlp = English()
matcher = Matcher(nlp.vocab)

def add_event_ent(matcher, doc, i, matches):
    # Get the current match and create a tuple of entity label, start and end.
    _, start, end = matches[i]
    entity = Span(doc, start, end, label="EVENT")
    doc.ents += (entity,)
    print(entity.text)

pattern = [
    {"ORTH": "Google"}, {"ORTH": "I"}, {"ORTH": "/"}, {"ORTH": "O"}
]
matcher.add("GoogleIO", [pattern], on_match=add_event_ent)
doc = nlp("This is a text about Google I/O")
matches = matcher(doc)

from spacy import displacy
html = displacy.render(doc, style="ent", page=True,
                       options={"ents": ["EVENT"]})
There is a lot more you can do with spaCy, but as I said at the beginning, you can use a rule-based approach as part of a data collection process, to help you “bootstrap” a statistical model.
There is a lot more to say about NLP, so stay tuned for the next publication.