Basics of Language Processing
Language analysis can be divided into the categories below, whatever the language. I will walk through some basic processing using spaCy and NLTK.
| Category | Meaning |
| --- | --- |
| Lexical | Segmenting text into words |
| Morphology | Shape and structure of words |
| Syntax | Rules of words in a sentence |
| Semantic | Meaning of words in a sentence |
| Acoustics | Representation of sound |
| Phonetics | Mapping sound to speech |
| Phonemics | Mapping speech to language |
Lexical Analysis
Lexical analysis is the task of segmenting text into its lexical expressions, i.e., words/tokens.
Tokenization
Converting a sentence into tokens/words is called tokenization. There are many ways to do this, and I will discuss some of them below. I am also creating three sentences to use as examples:
demo_sent1 = "@uday can't wait for the #nlp notes YAAAAAAY!!! #deeplearning https://udai.gitbook.io/practical-ml/"
demo_sent2 = "That U.S.A. poster-print costs $12.40..."
demo_sent3 = "I am writing NLP basics."
all_sents = [demo_sent1, demo_sent2, demo_sent3]
print(all_sents)
White Space Tokenizer
We can tokenize text by splitting it on whitespace.
This works well for some tokens, such as U.S.A. and poster-print, but we also get @uday, basics., $12.40..., and #nlp as tokens, with characters like #, @, and . still attached. This tokenizer may therefore give poor results on text like this.
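A minimal sketch of whitespace tokenization on the three demo sentences, using only Python's built-in `str.split`. The variable name `whitespace_tokens` is just an illustrative choice.

```python
demo_sent1 = "@uday can't wait for the #nlp notes YAAAAAAY!!! #deeplearning https://udai.gitbook.io/practical-ml/"
demo_sent2 = "That U.S.A. poster-print costs $12.40..."
demo_sent3 = "I am writing NLP basics."
all_sents = [demo_sent1, demo_sent2, demo_sent3]

# str.split() with no argument splits on any run of whitespace,
# so punctuation stays glued to the neighbouring word.
whitespace_tokens = [sent.split() for sent in all_sents]

for tokens in whitespace_tokens:
    print(tokens)
```

Tokens such as `basics.` and `$12.40...` come out with their punctuation still attached, which is the weakness described above.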
NLTK Word Tokenizer
It follows the conventions of the Penn Treebank.
It gives better results than the whitespace tokenizer, but some inputs, such as can't and web addresses, are still not handled well.
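A rough sketch of Penn Treebank-style tokenization with NLTK. Note the assumption here: `nltk.word_tokenize` also needs the Punkt sentence-tokenizer data to be downloaded, so this example calls NLTK's `TreebankWordTokenizer` class directly, which needs no extra data. Its output may differ slightly from what `word_tokenize` returns.

```python
from nltk.tokenize import TreebankWordTokenizer

demo_sent1 = "@uday can't wait for the #nlp notes YAAAAAAY!!! #deeplearning https://udai.gitbook.io/practical-ml/"
demo_sent2 = "That U.S.A. poster-print costs $12.40..."
demo_sent3 = "I am writing NLP basics."
all_sents = [demo_sent1, demo_sent2, demo_sent3]

# Assumption: TreebankWordTokenizer stands in for word_tokenize here
# so that the example runs without downloading any NLTK data.
tokenizer = TreebankWordTokenizer()
treebank_tokens = [tokenizer.tokenize(sent) for sent in all_sents]

for tokens in treebank_tokens:
    print(tokens)
```

The final period of the third sentence becomes its own token, but `can't` is split into `ca` and `n't`, which may or may not be what your task needs.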
NLTK Regex Tokenizer
We can write our own regex to split the sentence into tokens/words.
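Here is a small sketch using NLTK's `RegexpTokenizer`. The pattern is a hypothetical one written for the demo sentences (keep URLs, @mentions, #hashtags, and contractions whole), not a general-purpose rule set.

```python
from nltk.tokenize import RegexpTokenizer

demo_sent1 = "@uday can't wait for the #nlp notes YAAAAAAY!!! #deeplearning https://udai.gitbook.io/practical-ml/"
demo_sent3 = "I am writing NLP basics."

# Illustrative pattern (an assumption, tune it for your own data).
# Alternatives are tried left to right:
#   1. URLs               2. @mentions and #hashtags
#   3. words, optionally with an apostrophe part (can't)
#   4. runs of punctuation
pattern = r"https?://\S+|[@#]\w+|\w+(?:'\w+)?|[^\w\s]+"
regex_tokenizer = RegexpTokenizer(pattern)

regex_tokens1 = regex_tokenizer.tokenize(demo_sent1)
regex_tokens3 = regex_tokenizer.tokenize(demo_sent3)

print(regex_tokens1)
print(regex_tokens3)
```

This pattern keeps `can't` and the URL intact, but it would break `U.S.A.` into single letters and periods, so every custom pattern needs checking against your own data.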
spaCy Tokenizer
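spaCy's tokenizer is rule-based: it applies tokenizer exceptions (special cases) together with predefined regular-expression rules for prefix_search, suffix_search, infix_finditer, and token_match. Sentence boundaries are found separately, by the dependency parser in spaCy's trained pipelines.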
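A minimal sketch of spaCy tokenization. It assumes a tokenizer-only pipeline created with `spacy.blank("en")`, which needs no model download. A blank pipeline has no parser, so it does not set sentence boundaries.

```python
import spacy

demo_sent1 = "@uday can't wait for the #nlp notes YAAAAAAY!!! #deeplearning https://udai.gitbook.io/practical-ml/"
demo_sent2 = "That U.S.A. poster-print costs $12.40..."
demo_sent3 = "I am writing NLP basics."
all_sents = [demo_sent1, demo_sent2, demo_sent3]

# Blank English pipeline: only the rule-based tokenizer is present.
nlp = spacy.blank("en")

spacy_tokens = [[token.text for token in nlp(sent)] for sent in all_sents]

for tokens in spacy_tokens:
    print(tokens)
```

Each token also carries lexical attributes such as `token.is_punct`, `token.is_stop`, and `token.like_url`, which are handy for later filtering.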
There are also tokenizers, like SentencePiece, that learn how to tokenize from a corpus of data. I will discuss this in another blog.
Our analysis and model performance also depend on the tokenization algorithm, so choose it carefully. If possible, try two or more algorithms, or write custom rules based on your dataset/task.
Morphological Analysis
In linguistics, morphology is the study of the internal structure of words. I will try to explain some of them below.
Lemmatization
Lemmatization uses morphological analysis to return the dictionary form of a word, i.e., the entry under which you would find all of its forms in a dictionary. This dictionary form is called the lemma.
Stemming
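Stemming is the process of reducing the morphological variants of a word to a common root/base form (the stem), usually by stripping suffixes with heuristic rules. Unlike a lemma, the stem does not have to be a dictionary word.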
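A short sketch of stemming with NLTK's `PorterStemmer`, which needs no extra data download. The word list is just an illustrative choice.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words: variants of "run" and "study".
words = ["running", "runs", "studies", "studying"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

The variants of "study" collapse to `studi`, which is not a dictionary word. A lemmatizer would be expected to return `study` instead.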
I will cover other pre-processing techniques, such as POS tagging and dependency parsing, when we get to deep learning.
The complete code for this post is available on GitHub.
Stop Words
Stop words usually refers to the most common words in a language. There is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. We can remove stop words if we don't need the exact meaning of a sentence. For text classification we usually don't need them, but we do need them for question-answering systems. The word not is also a stop word in NLTK, and it can be useful when classifying sentences as positive or negative, so be careful when removing stop words. NLTK provides its stop-word lists through its stopwords corpus.
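A minimal sketch of loading and removing stop words. The helper names `load_nltk_stop_words` and `remove_stop_words` and the `keep` parameter are hypothetical, introduced here only to show how a word like not can be protected. The loader assumes network access the first time, since NLTK has to download the stopwords corpus.

```python
def load_nltk_stop_words(language="english"):
    """Return NLTK's stop-word list as a set (downloads the corpus if needed)."""
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)
    return set(stopwords.words(language))


def remove_stop_words(tokens, stop_words, keep=frozenset()):
    """Drop stop words, except those listed in `keep` (e.g. negations)."""
    return [
        tok for tok in tokens
        if tok.lower() not in stop_words or tok.lower() in keep
    ]
```

For sentiment-style tasks you could call `remove_stop_words(tokens, load_nltk_stop_words(), keep={"not"})` so that negations survive the filtering.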
NLP pipeline

Text Preprocessing
You may get data from PDF files, speech, OCR, documents, or the web, so you have to preprocess it to extract clean raw text. I recommend reading this blog.

Once you are done with basic cleaning, I suggest doing everything else with spaCy, which makes it very easy to write the whole pipeline. I took the IMDB dataset and wrote a pipeline to clean the data and extract the tokens/words from it. Before going to that, please check the notebook that explains spaCy.
I have written a class, TextPreprocess, which takes raw text and returns the tokens that will be given to the ML/DL algorithm. A clear pipeline like this is very useful when deploying the algorithm in production. Writing it may take several days of analysis on real-life text data. Once you are done with the analysis, try to write a structured function/class that takes raw data and returns data that can be fed to the algorithm or to another preprocessing pipeline.
You can use it as follows:
You can get that notebook from the link below.
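The original TextPreprocess class is not reproduced in this text. As a rough idea of what such a class could look like, here is a minimal sketch. Everything in it (the constructor arguments `remove_stop_words` and `keep_words`, the `clean` method, the HTML-tag regex) is an assumption for illustration and not the original implementation. It uses `spacy.blank("en")`, a tokenizer-only pipeline that needs no model download.

```python
import re

import spacy


class TextPreprocess:
    """Hypothetical sketch: raw text in, list of lowercase tokens out."""

    def __init__(self, remove_stop_words=False, keep_words=("not", "no")):
        self.nlp = spacy.blank("en")  # tokenizer only, no trained model
        self.remove_stop_words = remove_stop_words
        self.keep_words = set(keep_words)  # stop words we never drop

    def clean(self, text):
        # Replace HTML tags such as <br /> with a space, then collapse whitespace.
        text = re.sub(r"<[^>]+>", " ", text)
        return re.sub(r"\s+", " ", text).strip()

    def __call__(self, text):
        tokens = []
        for token in self.nlp(self.clean(text)):
            # Skip punctuation, whitespace, and URL-like tokens.
            if token.is_punct or token.is_space or token.like_url:
                continue
            word = token.lower_
            if (self.remove_stop_words and token.is_stop
                    and word not in self.keep_words):
                continue
            tokens.append(word)
        return tokens


preprocess = TextPreprocess()
print(preprocess("Great movie!<br />Loved it."))
```

Keeping all the cleaning rules inside one callable class means the same object can be reused unchanged at training time and in production.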
Feature Extraction
You can use BoW, TF-IDF, or Word2Vec-based models to get structured features from the cleaned text data.
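To make the first two options concrete, here is a small pure-Python sketch of bag-of-words and TF-IDF vectors. The two toy documents are made up for illustration, and the IDF used is the plain `log(N / df)` variant, which is an assumption: libraries often use smoothed versions, so their numbers will differ.

```python
import math
from collections import Counter

# Toy, already-tokenized documents (illustrative data only).
docs = [["i", "love", "nlp"], ["i", "love", "pizza"]]

# Fixed vocabulary: one vector position per distinct word.
vocab = sorted({tok for doc in docs for tok in doc})

# Document frequency: in how many documents each word appears.
doc_freq = Counter(tok for doc in docs for tok in set(doc))


def bow_vector(tokens):
    """Bag of words: raw count of each vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


def tfidf_vector(tokens):
    """TF-IDF with tf = count / length and idf = log(N / df)."""
    counts = Counter(tokens)
    n_docs = len(docs)
    return [
        (counts[word] / len(tokens)) * math.log(n_docs / doc_freq[word])
        for word in vocab
    ]


bow_vectors = [bow_vector(doc) for doc in docs]
tfidf_vectors = [tfidf_vector(doc) for doc in docs]

print(vocab)
print(bow_vectors)
print(tfidf_vectors)
```

Words that occur in every document, like `i` and `love` here, get a TF-IDF weight of zero, while words that distinguish a document keep a positive weight.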
Modeling
You can do modeling using Machine Learning/Deep Learning algorithms.