Below are some common language analysis categories that apply to any language. I will walk through basic processing for each of them using spaCy and NLTK.
Lexical Analysis
Lexical analysis is the task of segmenting text into its lexical units, i.e. words/tokens.
Tokenization
Converting a sentence into tokens/words is called tokenization. There are many ways to do this; I will discuss some of them below. I am also creating 3 example sentences:
demo_sent1 = "@uday can't wait for the #nlp notes YAAAAAAY!!! #deeplearning https://udai.gitbook.io/practical-ml/"
demo_sent2 = "That U.S.A. poster-print costs $12.40..."
demo_sent3 = "I am writing NLP basics."
all_sents = [demo_sent1, demo_sent2, demo_sent3]
print(all_sents)
output:
["@uday can't wait for the #nlp notes YAAAAAAY!!! #deeplearning https://udai.gitbook.io/practical-ml/", 'That U.S.A. poster-print costs $12.40...', 'I am writing NLP basics.']
White Space Tokenizer
We can tokenize the text by simply splitting it at whitespace. Check the code below.
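A minimal sketch of whitespace tokenization, assuming the all_sents list defined above:

# Whitespace tokenization: split each sentence on spaces
for sent in all_sents:
    print(sent.split())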
For some of the words it works perfectly, like U.S.A. and poster-print, but we also get @uday, basics., $12.40..., #nlp as single tokens; the #, @, trailing dots, etc. are not separated out. So this tokenizer may give bad results when the text contains tokens like these.
NLTK Word Tokenizer
It follows the conventions of the Penn Treebank.
from nltk.tokenize import word_tokenize

for sent in all_sents:
    print(word_tokenize(sent))
It gives better results than the whitespace tokenizer, but some tokens such as can't and web addresses are still not handled well.
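For example, a quick check on a contraction (exact output may vary slightly by NLTK version):

from nltk.tokenize import word_tokenize

# the Penn Treebank conventions split contractions into two tokens
print(word_tokenize("can't"))  # typically ['ca', "n't"]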
NLTK Regex Tokenizer
We can write our own regex to split the sentence into tokens/words.
import nltk

pattern = r'''(?x)          # set flag to allow verbose regexps
      (?:[A-Z]\.)+          # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*          # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?    # currency and percentages, e.g. $12.40, 82%
    | \.\.\.                # ellipsis
    | [][.,;"'?():-_`]      # these are separate tokens; includes ], [
'''

for sent in all_sents:
    print(nltk.regexp_tokenize(sent, pattern))
There are many more NLTK tokenizers; you can refer to all of them in this link.
spaCy Tokenizer
The spaCy tokenizer works on predefined regular-expression rules for prefix_search, suffix_search, infix_finditer and token_match, and spaCy also uses dependency parsing to find sentence boundaries.
import spacy

## loading the spaCy English model
nlp = spacy.load("en_core_web_lg")

# printing the tokens for each sentence
for sent in all_sents:
    print([token.text for token in nlp(sent)])
We can customize the spaCy tokenization pipeline. Please go to this notebook to know more about this; you can also read the spaCy documentation.
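As a small illustration (a hypothetical rule, not from the original notebook), spaCy lets you add special-case rules to its tokenizer to control how specific strings are split:

import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_lg")
# special-case rule: treat "gimme" as two tokens, "gim" + "me"
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
print([token.text for token in nlp("gimme the #nlp notes")])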
There are also tokenizers like SentencePiece that can learn how to tokenize from a corpus of data. I will discuss these in another blog.
Our analysis and model performance also depend on the tokenization algorithm, so choose it carefully. If possible, try two or more algorithms, or write custom rules based on your dataset/task.
Morphological Analysis
In linguistics, morphology is the study of the internal structure of words. I will try to explain some of these techniques below.
Lemmatization
Lemmatization uses morphological analysis to return the dictionary form of a word, i.e. the entry under which you would find all of its forms in a dictionary. In lemmatization, this root form is called the lemma.
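A minimal sketch with NLTK's WordNet lemmatizer (you may need to run nltk.download('wordnet') first); passing a part-of-speech tag changes the result:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('running', pos='v'))  # run (treated as a verb)
print(lemmatizer.lemmatize('runners'))           # runner (default POS is noun)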
Stemming
Stemming reduces a word to its root/base form (the stem) by chopping off affixes with heuristic rules; unlike a lemma, the stem need not be a valid dictionary word.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem('running'))
print(stemmer.stem('runner'))
print(stemmer.stem('runners'))
run
runner
runner
I will explain some other pre-processing techniques like POS tagging and dependency parsing when we get to deep learning.
You can get the complete code written in this blog from the GitHub link below.
Stop Words
Stop words usually refers to the most common words in a language. There is no single universal list of stop words used by all natural language processing tools, and not all tools even use such a list. We can remove stop words when we don't need the exact meaning of a sentence: for text classification we usually don't need them, but for question answering systems we do. The word not is also a stop word in NLTK, and it can matter when classifying positive/negative sentences, so be careful while removing stop words. You can get the stop words from NLTK as below.
from nltk.corpus import stopwords

stopwords.words('english')
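As a small illustration (not from the original post, and assuming the NLTK stopwords/punkt data are downloaded), filtering stop words from a tokenized sentence shows why dropping not can hurt sentiment tasks:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_set = set(stopwords.words('english'))
tokens = word_tokenize("This movie is not good")
print([t for t in tokens if t.lower() not in stop_set])
# 'not' is in NLTK's stop word list, so the negation is lost here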
NLP pipeline
Text Preprocessing
You may get data from PDF files, speech, OCR, documents, or the web, so you have to preprocess it to get better raw text. I would recommend reading this blog.
Once you are done with basic cleaning, I would suggest doing everything else with spaCy; it makes it very easy to write the whole pipeline. I took the IMDB dataset and wrote a pipeline to clean the data and get the tokens/words from it. Before going to that, please check the notebook below that explains spaCy.
I have written a class TextPreprocess which takes raw text and returns tokens that can be fed to an ML/DL algorithm. A clear pipeline like the one below is very useful when deploying the algorithm in production. Writing it may take several days of analysis on real-life text data. Once you have finished that analysis, try to write a structured function/class that takes raw data and produces data ready for the algorithm or for another preprocessing pipeline.
## Check the link below to know more about the pipeline
import re
import spacy
from bs4 import BeautifulSoup

class TextPreprocess():
    def __init__(self):
        ## loading the spaCy nlp object
        self.nlp = spacy.load("en_core_web_lg", disable=["tagger", "parser"])
        # adding the merge_entities pipe to the nlp object
        self.merge_entities_ = self.nlp.create_pipe("merge_entities")
        self.nlp.add_pipe(self.merge_entities_)
        ## removing not, neither, never from the stop words
        ## you can check all the spaCy stop words at https://github.com/explosion/spaCy/blob/master/spacy/lang/en/stop_words.py
        self.nlp.vocab["not"].is_stop = False
        self.nlp.vocab['neither'].is_stop = False
        self.nlp.vocab['never'].is_stop = False

    def clean_raw_text(self, text, remove_html=True, clean_dots=True, clean_quotes=True,
                       clean_whitespace=True, convert_lowercase=True):
        """
        Clean the text data.
        text: input raw text data
        remove_html: if True, removes the HTML tags and keeps only the text
        clean_dots: normalizes all types of dots/ellipses to a fixed one
        clean_quotes: normalizes all types of quotes to a fixed type like "
        clean_whitespace: collapses 2 or more whitespace characters
        convert_lowercase: converts text to lower case
        """
        if remove_html:
            # remove HTML
            ## separator=' ' replaces tags with a space; otherwise we get unwanted joins like
            ## "come alive.<br /><br />We wish" --> "come alive.We wish" (no space between sentences)
            text = BeautifulSoup(text, 'html.parser').get_text(separator=' ')
        # https://github.com/blendle/research-summarization/blob/master/enrichers/cleaner.py#L29
        if clean_dots:
            text = re.sub(r'…', '...', text)
        if clean_quotes:
            text = re.sub(r'[`‘’‛⸂⸃⸌⸍⸜⸝]', "'", text)
            text = re.sub(r'[„“]|(\'\')|(,,)', '"', text)
            text = re.sub(r'[-_]', " ", text)
        if clean_whitespace:
            text = re.sub(r'\s+', ' ', text).strip()
        if convert_lowercase:
            text = text.lower()
        return text

    def get_token_list(self, text, get_spacy_tokens=False):
        '''
        gives the list of spaCy tokens/word strings
        text: cleaned text
        get_spacy_tokens: if True, returns a list of spaCy token objects;
                          else returns tokens in string format
        '''
        ## nlp object
        doc = self.nlp(text)
        out_tokens = []
        for token in doc:
            if token.ent_type_ == "":
                if not (token.is_punct or token.is_stop):
                    if get_spacy_tokens:
                        out_tokens.append(token)
                    else:
                        out_tokens.append(token.norm_)
        return out_tokens

    def get_preprocessed_tokens(self, text, remove_html=True, clean_dots=True, clean_quotes=True,
                                clean_whitespace=True, convert_lowercase=True, get_tokens=True,
                                get_spacy_tokens=False, get_string=False):
        """
        Returns the cleaned text/tokens.
        text: input raw text data
        remove_html: if True, removes the HTML tags and keeps only the text
        clean_dots: normalizes all types of dots/ellipses to a fixed one
        clean_quotes: normalizes all types of quotes to a fixed type like "
        clean_whitespace: collapses 2 or more whitespace characters
        convert_lowercase: converts text to lower case
        get_tokens: if True, returns output after tokenization; else after cleaning only
        get_spacy_tokens: if True, returns a list of spaCy token objects; else string tokens
        get_string: returns a string output (tokens joined by spaces); only if get_spacy_tokens=False
        """
        text = self.clean_raw_text(text, remove_html, clean_dots, clean_quotes,
                                   clean_whitespace, convert_lowercase)
        if get_tokens:
            text = self.get_token_list(text, get_spacy_tokens)
        if get_string and (not get_spacy_tokens):
            text = " ".join(text)
        return text
You can use it as below:
preprocessor = TextPreprocess()

### getting tokens in string format
print("RAW Text:")
print()
print(data.review[4])
print('-'*100)
print("Preprocess List of Tokens(string format)")
print()
out = preprocessor.get_preprocessed_tokens(data.review[4])
print(out)
print()
print("Type of each object in above list")
print(type(out[0]))
RAW Text:

Petter Mattei's "Love in the Time of Money" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what money, power and success do to people in the different situations we encounter. <br /><br />This being a variation on the Arthur Schnitzler's play about the same theme, the director transfers the action to the present time New York where all these different characters meet and connect. Each one is connected in one way, or another to the next person, but no one seems to know the previous point of contact. Stylishly, the film has a sophisticated luxurious look. We are taken to see how these people live and the world they live in their own habitat.<br /><br />The only thing one gets out of all these souls in the picture is the different stages of loneliness each one inhabits. A big city is not exactly the best place in which human relations find sincere fulfillment, as one discerns is the case with most of the people we encounter.<br /><br />The acting is good under Mr. Mattei's direction. Steve Buscemi, Rosario Dawson, Carol Kane, Michael Imperioli, Adrian Grenier, and the rest of the talented cast, make these characters come alive.<br /><br />We wish Mr. Mattei good luck and await anxiously for his next work.
----------------------------------------------------------------------------------------------------
Preprocess List of Tokens(string format)

['visually', 'stunning', 'film', 'watch', 'mr', 'offers', 'vivid', 'portrait', 'human', 'relations', 'movie', 'telling', 'money', 'power', 'success', 'people', 'different', 'situations', 'encounter', 'variation', 'play', 'theme', 'director', 'transfers', 'action', 'present', 'time', 'different', 'characters', 'meet', 'connect', 'connected', 'way', 'person', 'know', 'previous', 'point', 'contact', 'stylishly', 'film', 'sophisticated', 'luxurious', 'look', 'taken', 'people', 'live', 'world', 'live', 'habitat', 'thing', 'gets', 'souls', 'picture', 'different', 'stages', 'loneliness', 'inhabits', 'big', 'city', 'not', 'exactly', 'best', 'place', 'human', 'relations', 'find', 'sincere', 'fulfillment', 'discerns', 'case', 'people', 'encounter', 'acting', 'good', 'mr', 'direction', 'rest', 'talented', 'cast', 'characters', 'come', 'alive', 'wish', 'mr', 'good', 'luck', 'await', 'anxiously', 'work']
Type of each object in above list
<class 'str'>
You can get that notebook from the link below.
Feature Extraction
You can use BOW, TF-IDF, or Word2Vec-based models to get structured features from the cleaned text data.
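A minimal sketch (not from the original post, with a toy docs list for illustration) of BOW and TF-IDF features using scikit-learn:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["visually stunning film", "good movie", "not good movie"]
bow = CountVectorizer().fit_transform(docs)      # bag-of-words count matrix
tfidf = TfidfVectorizer().fit_transform(docs)    # TF-IDF weighted matrix
print(bow.shape, tfidf.shape)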
Modeling
You can do modeling using Machine Learning/Deep Learning algorithms.
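For example, a simple sketch (hypothetical toy data and labels, not the IMDB pipeline above) of fitting a classifier on TF-IDF features:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["visually stunning film", "good movie", "boring and slow", "not good movie"]
labels = [1, 1, 0, 0]  # hypothetical positive/negative labels

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
clf = LogisticRegression().fit(X, labels)
print(clf.predict(vectorizer.transform(["stunning movie"])))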