NLP, from the basics.
Hi everyone, this is my first article on Data Science, and I’m so excited about it.
So, let’s start with a subject that I really enjoy talking about.
NLP — Natural Language Processing.
And what is that?
Natural language processing (NLP) refers to the branch of computer science — and more specifically, the branch of artificial intelligence or AI — concerned with giving computers the ability to understand the text and spoken words in much the same way human beings can. (IBM definition)
NLP has several fields of application, including:
. Speech Recognition (speech-to-text): the task of converting voice data into text data;
. Part-of-Speech Tagging: labels each word with its grammatical role (noun, verb, adjective, and so on) based on its use and context;
. Named Entity Recognition (NER): identifies words or phrases as useful entities, one of the most exciting NLP fields;
. Sentiment Analysis: attempts to extract subjective qualities, attitudes, emotions, and confusion from the text;
. Natural Language Generation: the task of turning structured information into human language.
OK, now that you have a clear view of the fields that NLP can help you with, let’s dive into the code.
SPACY
It’s not possible to talk about NLP without talking about spaCy.
spaCy is an open-source software library for advanced natural language processing.
spaCy also supports deep learning workflows that allow connecting statistical models trained by popular machine learning libraries like TensorFlow or PyTorch.
The nlp object contains the processing pipeline and includes language-specific rules for tokenization etc.
# Import the English language class
from spacy.lang.en import English

# Create the nlp object
nlp = English()
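To make the "language-specific rules" part more concrete, here is a minimal sketch. It assumes a recent spaCy version, and the example sentence is my own. A bare English() object has no pipeline components yet, but its tokenizer already knows English conventions such as splitting contractions.

```python
from spacy.lang.en import English

# A blank English pipeline: tokenizer only, no trained components
nlp = English()

# No pipeline components have been added yet
print(nlp.pipe_names)  # []

# Example sentence chosen for illustration (not from the article)
text = "I don't like Mondays."

# The English tokenizer rules split the contraction "don't" into "do" and "n't"
tokens = [token.text for token in nlp(text)]
print(tokens)  # ['I', 'do', "n't", 'like', 'Mondays', '.']
```

A simple whitespace split would have kept "don't" and "Mondays." as single pieces, which is why the language-specific rules matter.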
When you process text with the nlp object, spaCy creates a Doc object (short for ‘document’).
The Doc allows you to access information about the text in a structured way, and no information is lost.
# Created by processing a string of text with the nlp object
doc = nlp("NLP using spaCy!")

# Iterate over tokens in a Doc
for token in doc:
    print(token.text)

# Output:
# NLP
# using
# spaCy
# !
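The claim that "no information is lost" can be checked directly. The sketch below is one possible way to do it, not the only one. It uses the doc.text, token.text_with_ws and token.idx attributes, which the article has not introduced, to rebuild the original string from the tokens.

```python
from spacy.lang.en import English

nlp = English()
text = "NLP using spaCy!"
doc = nlp(text)

# The Doc keeps the original text unchanged
print(doc.text == text)  # True

# Each token remembers its trailing whitespace, so joining them restores the input
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)  # True

# Each token also stores its character offset in the original string
offsets = [(token.text, token.idx) for token in doc]
print(offsets)  # [('NLP', 0), ('using', 4), ('spaCy', 10), ('!', 15)]
```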
Token objects represent the tokens in a document, for example a word or a punctuation character.
To get a token at a specific position, you can index into the Doc.
Token objects also provide various attributes that let you access more information about the tokens, such as token.text and token.lemma_.
doc = nlp("NLP using spaCy!")

# Index into the Doc to get a single Token (indexing starts at 0)
token = doc[1]

# Get the token text via the .text attribute
print(token.text)

# Output:
# using
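Besides .text, tokens expose a few lexical attributes that work even without a trained model. Here is a small sketch using token.i, token.is_alpha and token.is_punct, plus a slice of the Doc. I picked these attributes for illustration; the article itself only mentions token.text and token.lemma_.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("NLP using spaCy!")

# token.i is the token's position in the Doc,
# is_alpha and is_punct are simple lexical flags
for token in doc:
    print(token.i, token.text, token.is_alpha, token.is_punct)

alpha_flags = [token.is_alpha for token in doc]
punct_flags = [token.is_punct for token in doc]

# Slicing a Doc returns a Span covering several tokens (end index is exclusive)
span = doc[1:3]
print(span.text)  # using spaCy
```

Only the last token, "!", is flagged as punctuation. The three words are flagged as alphabetic.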
This is only the beginning of the conversation about NLP using spaCy.
There is a lot more to say about NLP, so stay tuned for the next publication.