Basic Feature Extraction Methods

Document Term Matrix:

It is a matrix whose rows correspond to the documents and whose columns correspond to the unique words/tokens in the corpus. Let's take some sample documents and store them in sample_documents.

sample_documents = ['This is the NLP notebook', 
                    'This is basic NLP. NLP is easy',
                    'NLP is awesome']

In the above sample_documents, we have 3 documents and 8 unique words, so the Document-Term Matrix has 3 rows and 8 columns, as below.

| DTM | awesome | basic | easy | is | NLP | notebook | the | this |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Document-1 |  |  |  |  |  |  |  |  |
| Document-2 |  |  |  |  |  |  |  |  |
| Document-3 |  |  |  |  |  |  |  |  |

There are many ways to determine the values in the above matrix. I will discuss some of them below.

Bag of Words

In this approach, each cell holds the number of times the word occurs in that document.

| BOW | awesome | basic | easy | is | NLP | notebook | the | this |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Document-1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
| Document-2 | 0 | 1 | 1 | 2 | 2 | 0 | 0 | 1 |
| Document-3 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |

If you check the above matrix, "NLP" occurs twice in Document-2, so the corresponding value is 2; in general, if a word occurs n times in a document, the cell value is n. We can do the same using CountVectorizer in sklearn.


We can also include n-gram tokens in the vocabulary, for example unigrams and bigrams.

N-grams are simply all sequences of n adjacent words that you can find in your source text.

TF-IDF

In this approach, each cell holds the product TF × IDF.

Term Frequency:

$$TF_K = \frac{\text{Number of times word } K \text{ occurs in the document}}{\text{Total number of words in the document}}$$

TF of a word depends only on the particular document, not on the rest of the corpus, so its value changes from document to document.

Inverse Document Frequency:

$$IDF_K = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing word } K}\right)$$

IDF of a word depends on the whole corpus of documents, so its value is constant across the corpus.

You can think of IDF as the information content of the word.

$$\text{Information Content} = -\log(\text{Probability of word})$$

$$\text{Probability of word } K = \frac{\text{Number of documents containing word } K}{\text{Total number of documents}}$$
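To make the formula concrete, here is a small worked example for the sample documents (it uses a simple substring check for document membership, which is enough for this corpus but is only an illustration):

```python
import math

sample_documents = ['This is the NLP notebook',
                    'This is basic NLP. NLP is easy',
                    'NLP is awesome']
n_docs = len(sample_documents)

def idf(word):
    # count documents containing the word (substring check, for illustration)
    docs_with_word = sum(word in doc.lower() for doc in sample_documents)
    return math.log(n_docs / docs_with_word)

print(idf('nlp'))      # in all 3 documents -> log(3/3) = 0.0
print(idf('awesome'))  # in 1 of 3 documents -> log(3/1) ≈ 1.0986
```

A word that appears in every document carries no information (IDF = 0), while a rare word gets a high IDF.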

We can calculate the TFIDF vectors using TfidfVectorizer in sklearn.

With TfidfVectorizer too, we can get n-grams, and we can supply our own tokenization function.

What if our corpus has a very large vocabulary?

If we have many unique words, our BOW/TFIDF vectors will be very high-dimensional, which may cause the curse-of-dimensionality problem. We can address this with the methods below.

Limiting the number of vocab in BOW/TFIDF:

In CountVectorizer, we can do this using the max_features, min_df, and max_df parameters. You can use the vocabulary parameter to keep specific words only. Try to read the CountVectorizer documentation to learn more about these.

You can do a similar thing with TfidfVectorizer using the same parameters. Please read the documentation.

Some of the problems with the CountVectorizer and TfidfVectorizer

  • If we have a large corpus, the vocabulary will also be large, and the fit function needs to load all documents into RAM. This may be impossible if you don't have sufficient RAM.

  • Building the vocabulary requires a full pass over the dataset, hence it is not possible to fit text classifiers in a strictly online manner.

  • After the fit, we have to store the vocabulary dict, which takes a lot of memory. If we want to deploy in memory-constrained environments like AWS Lambda, IoT devices, mobile devices, etc., this may not be workable.

I have written sample code to do that for the same data: iterate over the data, build the vocabulary, and then use that vocabulary to create the BOW. A much more optimized version is possible; this is just an illustration.
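A rough, unoptimized sketch of that idea, building the vocabulary incrementally and then filling a scipy sparse matrix (the tokenize helper is an assumption that mimics CountVectorizer's default token pattern):

```python
import re
from scipy.sparse import lil_matrix

sample_documents = ['This is the NLP notebook',
                    'This is basic NLP. NLP is easy',
                    'NLP is awesome']

def tokenize(text):
    # words of 2+ characters, lowercased (like CountVectorizer's default)
    return re.findall(r'\b\w\w+\b', text.lower())

# pass 1: build the vocabulary one document at a time
vocab = {}
for doc in sample_documents:
    for token in tokenize(doc):
        if token not in vocab:
            vocab[token] = len(vocab)

# pass 2: fill the sparse BOW matrix using that vocabulary
bow = lil_matrix((len(sample_documents), len(vocab)), dtype=int)
for i, doc in enumerate(sample_documents):
    for token in tokenize(doc):
        bow[i, vocab[token]] += 1

print(sorted(vocab))
print(bow.toarray())
```

Unlike CountVectorizer, this assigns column indices in order of first appearance rather than alphabetically, so the columns are permuted relative to the earlier table.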

The result is similar to the one we printed while doing the BOW; you can check that.

Using the above BOW sparse matrix and TfidfTransformer, we can create the TF-IDF vectors.
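A sketch of that second step: feed the BOW counts into TfidfTransformer (CountVectorizer is used here for brevity in place of the hand-built matrix; any non-negative count matrix works the same way):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

sample_documents = ['This is the NLP notebook',
                    'This is basic NLP. NLP is easy',
                    'NLP is awesome']

bow = CountVectorizer().fit_transform(sample_documents)

# TfidfTransformer turns a count matrix into TF-IDF vectors
tfidf = TfidfTransformer().fit_transform(bow)
print(tfidf.toarray().round(2))
```

With default settings this produces the same matrix as TfidfVectorizer applied directly to the documents.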

The result is similar to the one we printed while doing the TFIDF; you can check that.

Other than our own iterator/generator, if the data lives in one or more files, we can set the input parameter to 'file' or 'filename' and pass file objects or file paths to the fit function. Please read the documentation.

Another way to solve all the above problems is hashing. We can convert a word into a fixed index number using a hash function, so there is no training process to build the vocabulary and no need to save it. This is implemented in sklearn as HashingVectorizer. With HashingVectorizer, you have to specify the number of features you need; by default it uses $2^{20}$.
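A minimal sketch of HashingVectorizer (2**8 features are used here only to keep the example small; the sklearn default is 2**20):

```python
from sklearn.feature_extraction.text import HashingVectorizer

sample_documents = ['This is the NLP notebook',
                    'This is basic NLP. NLP is easy',
                    'NLP is awesome']

# n_features fixes the size of the hash space; norm='l2' L2-normalizes each row
hv = HashingVectorizer(n_features=2**8, norm='l2')
X = hv.transform(sample_documents)  # no fit needed: hashing is stateless

print(X.shape)  # (3, 256)
```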

You can normalize your vectors using the norm parameter.

Since the hash function might cause collisions between (unrelated) features, a signed hash function is used and the sign of the hash value determines the sign of the value stored in the output matrix for a feature. This way, collisions are likely to cancel out rather than accumulate error, and the expected mean of any output feature’s value is zero. This mechanism is enabled by default with alternate_sign=True and is particularly useful for small hash table sizes (n_features < 10000).

We can convert the above vectors to TF-IDF using TfidfTransformer.
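A sketch of that pipeline; alternate_sign=False keeps the hashed counts non-negative, and norm=None defers normalization to the TfidfTransformer (these parameter choices are an assumption, chosen so the hashed matrix behaves like a count matrix):

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

sample_documents = ['This is the NLP notebook',
                    'This is basic NLP. NLP is easy',
                    'NLP is awesome']

counts = HashingVectorizer(n_features=2**8, norm=None,
                           alternate_sign=False).transform(sample_documents)
tfidf = TfidfTransformer().fit_transform(counts)

print(tfidf.shape)  # (3, 256)
```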

This vectorizer is memory-efficient, but it has some downsides as well, some of which are:

  • There is no way to compute the inverse transform of the hashing, so you cannot recover which word a feature index corresponds to, and interpretability of the model is lost.

  • There can be collisions in the hashing: two different words may map to the same feature index.

You can get the complete code written in this blog from the GitHub link below.

