# Basic Feature Extraction Methods


## Document Term Matrix:

It is a matrix whose rows correspond to the documents and whose columns correspond to the unique words/tokens in the corpus. Let's take some sample documents and store them in `sample_documents`.

```python
sample_documents = ['This is the NLP notebook', 
                    'This is basic NLP. NLP is easy',
                    'NLP is awesome']
```

In the above `sample_documents`, we have 3 documents and 8 unique words. The Document Term Matrix therefore has 3 rows and 8 columns, as shown below.

| DTM        | awesome | basic | easy | is  | NLP | notebook | the | this |
| ---------- | :-----: | :---: | :--: | :-: | :-: | :------: | :-: | :--: |
| Document-1 |         |       |      |     |     |          |     |      |
| Document-2 |         |       |      |     |     |          |     |      |
| Document-3 |         |       |      |     |     |          |     |      |
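
Before filling in the values, here is a minimal sketch (not from the original post) of how the 8-word vocabulary above can be collected by hand with a simple lowercasing word split:

```python
import re

# collect the unique lowercased words from the sample documents (illustration only)
vocab = sorted({word.lower()
                for doc in sample_documents
                for word in re.findall(r"[a-zA-Z]+", doc)})
print(vocab)       # ['awesome', 'basic', 'easy', 'is', 'nlp', 'notebook', 'the', 'this']
print(len(vocab))  # 8 unique words -> 8 columns in the Document Term Matrix
```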

There are many ways to fill in the values of this matrix. I will discuss some of them below.

### Bag of Words

In the Bag of Words (BOW) representation, each cell is filled with the number of times that word occurs in that document.

| BOW        | awesome | basic | easy | is  | NLP | notebook | the | this |
| ---------- | :-----: | :---: | :--: | :-: | :-: | :------: | :-: | :--: |
| Document-1 |    0    |   0   |   0  |  1  |  1  |     1    |  1  |   1  |
| Document-2 |    0    |   1   |   1  |  2  |  2  |     0    |  0  |   1  |
| Document-3 |    1    |   0   |   0  |  1  |  1  |     0    |  0  |   0  |

If you check the above matrix, "`nlp`" occurs two times in Document-2, so the corresponding value is 2. In general, if a word occurs `n` times in a document, the corresponding value is `n`. We can do the same using [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) in `sklearn`.

```python
##import count vectorizer
from sklearn.feature_extraction.text import CountVectorizer
#creating CountVectorizer instance
bow_vec = CountVectorizer(lowercase=True, ngram_range=(1,1), analyzer='word')
#fitting with our data
bow_vec.fit(sample_documents)
#transforming the data to the vector
sample_bow_metrix = bow_vec.transform(sample_documents)
#printing (note: in scikit-learn >= 1.2, use get_feature_names_out() instead)
print("Unique words -->", bow_vec.get_feature_names())
print("BOW Matrix -->",sample_bow_metrix.toarray())
print("vocab to index dict -->", bow_vec.vocabulary_)
```

*output:*

```
Unique words --> ['awesome', 'basic', 'easy', 'is', 'nlp', 'notebook', 'the', 'this']
BOW Matrix --> [[0 0 0 1 1 1 1 1]
 [0 1 1 2 2 0 0 1]
 [1 0 0 1 1 0 0 0]]
vocab to index dict --> {'this': 7, 'is': 3, 'the': 6, 'nlp': 4, 'notebook': 5, 'basic': 1, 'easy': 2, 'awesome': 0}
```

{% hint style="warning" %}
**How does `CountVectorizer` get the unique words?**

It first splits each document into words and then collects the unique ones. `CountVectorizer` uses `token_pattern` (a regex) or `tokenizer` (a callable) for this, so we can supply our own tokenization algorithm to extract words from a sentence. Please read the `sklearn` documentation to know more about it.
{% endhint %}
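
For example, here is a minimal sketch (an assumption on my part, not from the original post) of passing NLTK's `word_tokenize` as a custom `tokenizer`:

```python
import nltk
from sklearn.feature_extraction.text import CountVectorizer

# nltk.download('punkt') may be needed once before word_tokenize works
# custom tokenizer: NLTK's word_tokenize replaces the default token_pattern regex
custom_vec = CountVectorizer(lowercase=True, tokenizer=nltk.tokenize.word_tokenize)
custom_vec.fit(sample_documents)
print("vocab with custom tokenizer -->", custom_vec.vocabulary_)
```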

We can also use n-grams as the vocabulary. Please check the code below, which is written for uni-grams and bi-grams.

{% hint style="info" %}
N-grams are simply all contiguous sequences of n adjacent words that you can find in your source text.
{% endhint %}

```python
#creating CountVectorizer instance with ngram_range = (1,2) i.e uni-gram and bi-gram
bow_vec = CountVectorizer(lowercase=True, ngram_range=(1,2), analyzer='word')
#fitting with our data
bow_vec.fit(sample_documents)
#transforming the data to the vector
sample_bow_metrix = bow_vec.transform(sample_documents)
#printing
print("Unique words -->", bow_vec.get_feature_names())
print("BOW Matrix -->",sample_bow_metrix.toarray())
print("vocab to index dict -->", bow_vec.vocabulary_)
```

```python
Unique words --> ['awesome', 'basic', 'basic nlp', 'easy', 'is', 'is awesome', 'is basic', 'is easy', 'is the', 'nlp', 'nlp is', 'nlp nlp', 'nlp notebook', 'notebook', 'the', 'the nlp', 'this', 'this is']
BOW Matrix --> [[0 0 0 0 1 0 0 0 1 1 0 0 1 1 1 1 1 1]
 [0 1 1 1 2 0 1 1 0 2 1 1 0 0 0 0 1 1]
 [1 0 0 0 1 1 0 0 0 1 1 0 0 0 0 0 0 0]]
vocab to index dict --> {'this': 16, 'is': 4, 'the': 14, 'nlp': 9, 'notebook': 13, 'this is': 17, 'is the': 8, 'the nlp': 15, 'nlp notebook': 12, 'basic': 1, 'easy': 3, 'is basic': 6, 'basic nlp': 2, 'nlp nlp': 11, 'nlp is': 10, 'is easy': 7, 'awesome': 0, 'is awesome': 5}
```

### TF-IDF

In this representation, each cell is filled with the product `TF * IDF` for that word and document.

#### Term Frequency:

$$
TF_K = \frac{\text{Number of times word K occurs in that document}}{\text{Total number of words in that document}}
$$

{% hint style="info" %}
`TF` of a word depends only on the particular document, not on the total corpus of documents. The `TF` value of a word therefore changes from document to document.
{% endhint %}
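
As a hand-computed example with the plain formula above: Document-2, "This is basic NLP. NLP is easy", has 7 words and the word `nlp` occurs twice, so its TF is 2/7 ≈ 0.29. A minimal sketch:

```python
# hand computation of TF for 'nlp' in Document-2 (illustration only)
doc2_words = ['this', 'is', 'basic', 'nlp', 'nlp', 'is', 'easy']
tf_nlp = doc2_words.count('nlp') / len(doc2_words)
print(tf_nlp)  # 2 / 7 ≈ 0.2857
```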

#### Inverse Document Frequency:

$$
IDF_K = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents with word K}}\right)
$$

{% hint style="info" %}
`IDF` of a word depends on the total corpus of documents. The `IDF` value of a word is constant across the corpus.
{% endhint %}

You can think of `IDF` as the **information content of the word**.

$$
\text{Information Content} = -\log(\text{Probability of Word})
$$

$$
\text{Probability of Word K} = \frac{\text{Number of documents with word K}}{\text{Total number of documents}}
$$
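
Continuing the hand computation with the plain formula: `nlp` appears in all 3 documents, so its IDF is log(3/3) = 0, while `notebook` appears in only one, so its IDF is log(3/1) ≈ 1.10 (natural log). A minimal sketch:

```python
import math

n_docs = 3  # total number of documents in the corpus
idf_nlp = math.log(n_docs / 3)       # 'nlp' occurs in all 3 documents      -> 0.0
idf_notebook = math.log(n_docs / 1)  # 'notebook' occurs in only 1 document -> ~1.0986
print(idf_nlp, idf_notebook)
```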

We can calculate the `TFIDF` vectors using `TfidfVectorizer` in `sklearn`. Note that `sklearn` uses a smoothed variant of `IDF` and L2-normalizes each row by default, so the values below differ slightly from the plain formulas above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
#creating TfidfVectorizer instance
tfidf_vec = TfidfVectorizer()
#fitting with our data
tfidf_vec.fit(sample_documents)
#transforming the data to the vector
sample_tfidf_metrix = tfidf_vec.transform(sample_documents)
#printing
print("Unique words -->", tfidf_vec.get_feature_names())
print("TFIDF Matrix -->", '\n',sample_tfidf_metrix.toarray())
print("vocab to index dict -->", tfidf_vec.vocabulary_)
```

```python
Unique words --> ['awesome', 'basic', 'easy', 'is', 'nlp', 'notebook', 'the', 'this']
TFIDF Matrix --> 
 [[0.         0.         0.         0.32630952 0.32630952 0.55249005
  0.55249005 0.42018292]
 [0.         0.43157129 0.43157129 0.50978591 0.50978591 0.
  0.         0.32822109]
 [0.76749457 0.         0.         0.45329466 0.45329466 0.
  0.         0.        ]]
vocab to index dict --> {'this': 7, 'is': 3, 'the': 6, 'nlp': 4, 'notebook': 5, 'basic': 1, 'easy': 2, 'awesome': 0}
```

With `TfidfVectorizer` we can also get n-grams and supply our own tokenization algorithm.
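
For instance, a minimal sketch (an illustration, not from the original post) combining uni-grams and bi-grams in `TfidfVectorizer`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# same interface as CountVectorizer: uni-grams and bi-grams
tfidf_ngram_vec = TfidfVectorizer(lowercase=True, ngram_range=(1, 2), analyzer='word')
tfidf_ngram_matrix = tfidf_ngram_vec.fit_transform(sample_documents)
print("ngram vocab -->", tfidf_ngram_vec.vocabulary_)
print("matrix shape -->", tfidf_ngram_matrix.shape)  # 3 documents x uni/bi-gram vocabulary
```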

### What if we have a very large `vocab` in our corpus?

If we have many unique words, our `BOW/TFIDF` vectors will be very high dimensional, which may cause the [`curse of dimensionality`](https://www.visiondummy.com/2014/04/curse-dimensionality-affect-classification/) problem. We can address this with the methods below.

#### Limiting the vocabulary size in BOW/TFIDF:

In `CountVectorizer`, we can do this using `max_features`, `min_df`, and `max_df`. You can use the `vocabulary` parameter to keep specific words only. Please read the `CountVectorizer` documentation to learn more about these. You can check the sample code below.

```python
#creating CountVectorizer instance, limited to 4 features only
bow_vec = CountVectorizer(lowercase=True, ngram_range=(1,1),
                                analyzer='word', max_features=4)
#fitting with our data
bow_vec.fit(sample_documents)
#transforming the data to the vector
sample_bow_metrix = bow_vec.transform(sample_documents)
#printing
print("Unique words -->", bow_vec.get_feature_names())
print("BOW Matrix -->",sample_bow_metrix.toarray())
print("vocab to index dict -->", bow_vec.vocabulary_)
```

```python
Unique words --> ['awesome', 'is', 'nlp', 'this']
BOW Matrix --> [[0 1 1 1]
 [0 2 2 1]
 [1 1 1 0]]
vocab to index dict --> {'this': 3, 'is': 1, 'nlp': 2, 'awesome': 0}
```

You can do a similar thing with `TfidfVectorizer` using the same parameters; please read the documentation.
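
As a small sketch (not from the original post), the same limiting parameters applied to `TfidfVectorizer`:

```python
# keep only terms that appear in at least 2 documents, capped at 4 features
tfidf_limited = TfidfVectorizer(lowercase=True, min_df=2, max_features=4)
limited_tfidf_matrix = tfidf_limited.fit_transform(sample_documents)
print("limited vocab -->", tfidf_limited.vocabulary_)
print("limited TFIDF matrix -->", limited_tfidf_matrix.toarray())
```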

### Some problems with `CountVectorizer` and `TfidfVectorizer`

* If we have a large corpus, the vocabulary will also be large, and the `fit` function requires all documents to be loaded into RAM. This may be impossible if you don't have sufficient RAM.
* Building the `vocab` requires a **full pass** over the dataset, hence it is not possible to fit text classifiers in a strictly online manner.
* After the `fit`, we have to store the `vocab` dict, which takes a lot of memory. If we want to deploy in **memory-constrained** environments like AWS Lambda, IoT devices, mobile devices, etc., these vectorizers may not be usable.

{% hint style="success" %}
We can solve the first problem by iterating over the data to build the `vocab`; then, using that vocab, we can create the BOW matrix in sparse format and derive the `TFIDF` vectors from it with `TfidfTransformer`. The sparse matrix doesn't take much space, so we can keep the BOW sparse matrix in RAM while creating the `TFIDF` sparse matrix.
{% endhint %}

I have written sample code to do that for the same data: it iterates over the data, builds the vocab, and uses that vocab to create the BOW matrix. A much more optimized version is possible; this is just a sample to show the idea.

```python
##for tokenization
import nltk
#vertical stack of sparse matrix
from scipy.sparse import vstack
#vocab set
vocab_set = set()
#looping through the data points (for huge data, you would stream these from disk/a table)
for data_point in sample_documents:
    #getting words
    for word in nltk.tokenize.word_tokenize(data_point):
        if word.isalpha():
            vocab_set.add(word.lower())

vectorizer_bow = CountVectorizer(vocabulary=vocab_set)

bow_data = [] 
for data_point in sample_documents: # for huge data, use a generator here
    ##since we supply the vocab, there is no data leakage with fit_transform, so we can call it per document
    bow_data.append(vectorizer_bow.fit_transform([data_point]))

final_bow = vstack(bow_data)

print("Unique words -->", vectorizer_bow.get_feature_names())
print("BOW Matrix -->",final_bow.toarray())
print("vocab to index dict -->", vectorizer_bow.vocabulary_)
```

```python
Unique words --> ['awesome', 'basic', 'easy', 'is', 'nlp', 'notebook', 'the', 'this']
BOW Matrix --> [[0 0 0 1 1 1 1 1]
 [0 1 1 2 2 0 0 1]
 [1 0 0 1 1 0 0 0]]
vocab to index dict --> {'awesome': 0, 'basic': 1, 'easy': 2, 'is': 3, 'nlp': 4, 'notebook': 5, 'the': 6, 'this': 7}
```

The above result matches the one we printed earlier for BOW; you can check that.

Using the above `BOW` sparse matrix and `TfidfTransformer`, we can create the `TFIDF` vectors. You can check the code below.

```python
#importing
from sklearn.feature_extraction.text import TfidfTransformer
#instantiate the class
vec_tfidftransformer = TfidfTransformer()
#fit with the BOW sparse data 
vec_tfidftransformer.fit(final_bow)
vec_tfidf = vec_tfidftransformer.transform(final_bow)
print(vec_tfidf.toarray())
```

```python
[[0.         0.         0.         0.32630952 0.32630952 0.55249005
  0.55249005 0.42018292]
 [0.         0.43157129 0.43157129 0.50978591 0.50978591 0.
  0.         0.32822109]
 [0.76749457 0.         0.         0.45329466 0.45329466 0.
  0.         0.        ]]
```

The above result matches the one we printed earlier for `TFIDF`; you can check that.

{% hint style="info" %}
**Instead of our own iterator/generator, if the data lives in one or more files, we can set the `input` parameter to `'file'` or `'filename'` and then pass file objects or file paths to `fit`. Please read the documentation.**
{% endhint %}
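
A minimal sketch of that idea, assuming three hypothetical plain-text files that each hold one document:

```python
# hypothetical paths -- each file is assumed to contain one document as plain text
file_paths = ['data/doc1.txt', 'data/doc2.txt', 'data/doc3.txt']

# input='filename' makes fit/transform read each path from disk
# instead of expecting the raw strings in memory
file_vec = CountVectorizer(input='filename', lowercase=True)
file_bow = file_vec.fit_transform(file_paths)
print("vocab from files -->", file_vec.vocabulary_)
```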

Another way to solve all the above problems is **hashing**. We can convert a word into a fixed index number using a hash function, so there is no training process to build the vocabulary and no need to save the vocab. It is implemented in `sklearn` as `HashingVectorizer`. In `HashingVectorizer`, you have to specify the number of features you need; by default it uses $$2^{20}$$. Below you can see some code using `HashingVectorizer`.

```python
#importing the HashingVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
#instantiating the HashingVectorizer
hash_vectorizer = HashingVectorizer(n_features=5, norm=None, alternate_sign=False)
#transforming the data; no need to fit because the vectorizer is stateless
hash_vector = hash_vectorizer.transform(sample_documents)
#printing the output
print("Hash vectors -->",hash_vector.toarray())
```

```python
Hash vectors --> [[0. 1. 3. 1. 0.]
 [0. 1. 5. 1. 0.]
 [0. 0. 3. 0. 0.]]
```

{% hint style="info" %}
You can normalize your vectors using the `norm` parameter.

Since the hash function might cause collisions between (unrelated) features, a signed hash function is used and the sign of the hash value determines the sign of the value stored in the output matrix for a feature. This way, collisions are likely to cancel out rather than accumulate error, and the expected mean of any output feature’s value is zero. This mechanism is enabled by default with `alternate_sign=True` and is particularly useful for small hash table sizes (`n_features < 10000`).
{% endhint %}
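
For comparison, a minimal sketch (not from the original post) with the defaults, i.e. signed hashing and L2 normalization:

```python
# default settings: alternate_sign=True, norm='l2'
hash_vec_default = HashingVectorizer(n_features=5)
hash_default = hash_vec_default.transform(sample_documents)
print("default hash vectors -->", hash_default.toarray())
```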

We can convert the above vectors to `TFIDF` using `TfidfTransformer`. Check the code below.

```python
#instantiate the class
vec_idftrans = TfidfTransformer()
#fit with the hash BOW sparse data 
vec_idftrans.fit(hash_vector)
##transforming the data
vec_tfidf2 = vec_idftrans.transform(hash_vector)
print("tfidf using hash BOW -->",vec_tfidf2.toarray())
```

```python
tfidf using hash BOW --> [[0.         0.36691832 0.85483442 0.36691832 0.        ]
 [0.         0.2419863  0.93961974 0.2419863  0.        ]
 [0.         0.         1.         0.         0.        ]]
```

This vectorizer is memory efficient, but it has some cons as well, some of which are:

* There is no way to compute the inverse transform of the hashing, so there is no `interpretability` of the model (you cannot map a feature index back to a word).
* There can be collisions in the hashing (see the sketch below).
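
A rough way to see the collision risk with such a small hash space (a sketch, not from the original post): hash each vocabulary word on its own and check which bucket it falls into. With 8 words and only 5 buckets, at least two words must share a bucket.

```python
# hash each word separately and record the bucket (column index) it lands in
words = ['awesome', 'basic', 'easy', 'is', 'nlp', 'notebook', 'the', 'this']
buckets = {word: int(hash_vectorizer.transform([word]).nonzero()[1][0]) for word in words}
print(buckets)  # 8 words into 5 buckets -> some words necessarily collide
```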

You can get all the code written in this blog from the GitHub link below.

{% embed url="https://github.com/UdiBhaskar/Natural-Language-Processing" %}

References:

1. <https://scikit-learn.org/stable/>

