NLP is short for Natural Language Processing. Examples of NLP include sentiment analysis (classifying words as having positive or negative connotations) and making predictions with classification models. This demo starts with some basic techniques for pre-processing text data in order to extract better features. Then we will apply two machine learning models to find the topics of recently published articles from AJS.
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
#nltk.download('stopwords') # the first time you use nltk, you should download these
#nltk.download('punkt')
#nltk.download('wordnet')
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
df=pd.read_csv("title10_19.csv")
print(df.shape)
df.head()
df['title'] = df['title'].apply(lambda x: " ".join( x.lower() for x in x.split()))
df.head()
# [^\w\s] is a regular expression that matches anything that is not a word character or whitespace
df['title'] = df['title'].str.replace(r'[^\w\s]', '', regex=True)
df.head()
Stop words include ‘i’, ‘me’, ‘my’, ‘myself’, ‘we’, ‘you’, ‘he’, ‘his’, etc. They are not helpful for making meaningful predictions. We import a list of the most frequently used words from the Natural Language Toolkit (NLTK). Since we are examining academic articles, these titles are unlikely to contain many stop words, but it is still safe to remove them. Moreover, this is a standard step in most text-cleaning pipelines.
stop_words=stopwords.words("english")
print(stop_words) ## a quick check of all stopwords
df['title']=df['title'].apply(lambda x: ' '.join (x for x in x.split() if x not in stop_words))
df.head()
Tokenization refers to dividing the text into a sequence of words or sentences. Here is an example:
from textblob import TextBlob ## use the textblob library
TextBlob(df['title'][0]).words
word_tokenize(df['title'][0]) ## this also works
Stemming refers to the removal of suffixes like “ing”, “ly”, “s”, etc. It cuts off prefixes and/or endings of words. It is not always helpful, because oftentimes the stemmed word loses its actual meaning.
st=PorterStemmer()
df['title'][:5].apply(lambda x: ' '.join([st.stem(w) for w in x.split()]))
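To see why stemming can distort meaning, here is a quick check on a few arbitrary example words (illustration only, not part of the cleaning pipeline):
## the Porter stemmer often produces non-words
for w in ['studies', 'families', 'organization']:
    print(w, '->', st.stem(w))
## e.g. 'studies' becomes 'studi', which is no longer a dictionary word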
Lemmatization is a more effective option: it converts the word into its root form. Lemmatization is preferred.
lem=WordNetLemmatizer()
df['title'] = df['title'].apply(lambda x: ' '.join([lem.lemmatize(w) for w in x.split()]))
df.head()
## using textblob gives the same result
from textblob import Word
df['title'] = df['title'].apply(lambda x: " ".join([Word(w).lemmatize() for w in x.split()]))
df[:100]
First, we convert all titles into one long string.
text=' '.join(df['title'].tolist())
N-grams are combinations of multiple words used together. N-grams with N=1 are called unigrams; similarly, bigrams (N=2), trigrams (N=3), etc. They capture language structure, such as which letter or word is likely to follow a given one. This might not be very useful here, since article titles are normally short, but it is handy for finding the most frequently occurring word pairs.
TextBlob(text).ngrams(3)[0:10]
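As noted above, n-grams are handy for finding frequent word pairs. Here is a minimal sketch that counts the most common bigrams in the titles using the standard library's Counter:
from collections import Counter
bigrams = [' '.join(g) for g in TextBlob(text).ngrams(2)]  ## all word pairs in the titles
Counter(bigrams).most_common(10)  ## ten most frequent pairs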
Finding the most frequently used words in the titles. We first tokenize the long text into single words and use the built-in function to display the 20 most frequent words.
from nltk.probability import FreqDist
test2=TextBlob(text).words ## tokenize all words
fdist = FreqDist(test2)
print(fdist)
fdist.most_common(20)
fdist.plot(30, cumulative=True)
Let's make a simple word cloud.
from wordcloud import WordCloud, ImageColorGenerator
from PIL import Image
wordcloud=WordCloud(max_font_size=50, max_words=200, background_color="white", width=600, height=300).generate(text)
plt.figure(figsize=(10,10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()
Latent Dirichlet Allocation (LDA) is an algorithm used to discover the topics that are present in a corpus.
Non-Negative Matrix Factorization (NMF) is a dimension reduction technique that factors an input matrix into smaller matrices.
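As a toy illustration (not part of the analysis below), NMF factors a non-negative matrix V with shape (n_samples, n_features) into two smaller matrices W (n_samples, k) and H (k, n_features) such that V ≈ WH:
## toy example: factor a small random non-negative matrix into two smaller ones
import numpy as np
from sklearn.decomposition import NMF
V = np.random.rand(6, 4)                 # 6 "documents" x 4 "words", all non-negative
toy = NMF(n_components=2, init='nndsvd', random_state=0)
W = toy.fit_transform(V)                 # shape (6, 2): document-topic weights
H = toy.components_                      # shape (2, 4): topic-word weights
print(W.shape, H.shape)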
First, we prepare the input bag-of-words matrix. We convert the "title" column into a long list of strings, then vectorize it.
ltext= df['title'].values.tolist()
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
features = 500
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=features, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(ltext)
tfidf_feature_names = tfidf_vectorizer.get_feature_names() # on newer scikit-learn, use get_feature_names_out()
# LDA can only use raw term counts because it is a probabilistic graphical model
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=features, stop_words='english')
LDA = tf_vectorizer.fit_transform(ltext)
tf_feature_names = tf_vectorizer.get_feature_names()
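A quick, purely illustrative check of the difference between the two representations: the count matrix holds integer term frequencies, while the tf-idf matrix holds real-valued weights.
print(LDA[:1].toarray()[0][:10])      ## raw term counts for the first title
print(tfidf[:1].toarray()[0][:10])    ## tf-idf weights for the first title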
from sklearn.decomposition import NMF, LatentDirichletAllocation
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
topic_words = 5
no_topics = 5
# Run NMF
nmf = NMF(n_components=no_topics, random_state=1,
          alpha=.1, l1_ratio=.5, init='nndsvd').fit(tfidf)
# Run LDA
lda = LatentDirichletAllocation(n_components=no_topics,
                                max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0).fit(LDA)
display_topics(nmf, tfidf_feature_names, topic_words)
display_topics(lda, tf_feature_names, topic_words)
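As a quick follow-up sketch (not from the original posts), each title's most likely topic can be read off from the fitted LDA model's document-topic distribution:
doc_topic = lda.transform(LDA)            ## one row of topic proportions per title
for i in range(5):
    print(ltext[i], '-> topic', doc_topic[i].argmax())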
References:
https://www.analyticsvidhya.com/blog/2018/02/the-different-methods-deal-text-data-predictive-python/
https://towardsdatascience.com/nlp-for-beginners-cleaning-preprocessing-text-data-ae8e306bef0f
https://feng.li/files/statcase/L9.2-Probabilistic-Topic-Models.html
Code is adapted from the above posts.