Text mining using nltk and scikit-learn in Python

NLP is short for Natural Language Processing. Examples of NLP include sentiment analysis (classifying text as carrying positive or negative connotations) and making predictions with classification models. This demo starts with some basic techniques for pre-processing text data in order to extract better features. Then, we apply two machine learning models to find topics in recently published article titles from the American Journal of Sociology (AJS).

In [1]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
#nltk.download('stopwords') # first time use nltk, you should download this 
#nltk.download('punkt')# first time use nltk, you should download this 
#nltk.download('wordnet')# first time use nltk, you should download this 
import pandas as pd 
import numpy as np
import matplotlib 
import matplotlib.pyplot as plt

Let's first take a look at the data

In [2]:
df=pd.read_csv("title10_19.csv")
print(df.shape)
df.head()
(1800, 3)
Out[2]:
id title journal
0 1 Decoupling: Marital Violence and the Struggle ... AMERICAN JOURNAL OF SOCIOLOGY
1 2 How Organizational Minorities Form and Use Soc... AMERICAN JOURNAL OF SOCIOLOGY
2 3 More Than a Sorting Machine: Ethnic Boundary M... AMERICAN JOURNAL OF SOCIOLOGY
3 4 How Do Criminal Courts Respond in Times of Cri... AMERICAN JOURNAL OF SOCIOLOGY
4 5 Gender Pay Gaps in US Federal Science Agencies... AMERICAN JOURNAL OF SOCIOLOGY

1. Apply lower case to all titles

In [3]:
df['title'] = df['title'].apply(lambda x: " ".join(w.lower() for w in x.split()))
df.head()
Out[3]:
id title journal
0 1 decoupling: marital violence and the struggle ... AMERICAN JOURNAL OF SOCIOLOGY
1 2 how organizational minorities form and use soc... AMERICAN JOURNAL OF SOCIOLOGY
2 3 more than a sorting machine: ethnic boundary m... AMERICAN JOURNAL OF SOCIOLOGY
3 4 how do criminal courts respond in times of cri... AMERICAN JOURNAL OF SOCIOLOGY
4 5 gender pay gaps in us federal science agencies... AMERICAN JOURNAL OF SOCIOLOGY

2. Remove Punctuation

In [4]:
# [^\w\s] is a regular expression that matches anything that is not a word character or whitespace
df['title'] = df['title'].str.replace(r'[^\w\s]', '', regex=True)
df.head()
Out[4]:
id title journal
0 1 decoupling marital violence and the struggle t... AMERICAN JOURNAL OF SOCIOLOGY
1 2 how organizational minorities form and use soc... AMERICAN JOURNAL OF SOCIOLOGY
2 3 more than a sorting machine ethnic boundary ma... AMERICAN JOURNAL OF SOCIOLOGY
3 4 how do criminal courts respond in times of cri... AMERICAN JOURNAL OF SOCIOLOGY
4 5 gender pay gaps in us federal science agencies... AMERICAN JOURNAL OF SOCIOLOGY

3. Remove stop words

Stop words include 'i', 'me', 'my', 'myself', 'we', 'you', 'he', 'his', etc. They are not helpful for making meaningful predictions. We import a list of the most frequently used words from the Natural Language Toolkit (NLTK). Since we are examining academic article titles, they are unlikely to contain many stop words, but it is still safe to remove them. Moreover, this is a standard step in most text-cleaning pipelines.

In [5]:
stop_words=stopwords.words("english")
print(stop_words) ## a quick check of all stopwords
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
In [6]:
df['title'] = df['title'].apply(lambda x: ' '.join(w for w in x.split() if w not in stop_words))
df.head()
Out[6]:
id title journal
0 1 decoupling marital violence struggle divorce c... AMERICAN JOURNAL OF SOCIOLOGY
1 2 organizational minorities form use social ties... AMERICAN JOURNAL OF SOCIOLOGY
2 3 sorting machine ethnic boundary making stratif... AMERICAN JOURNAL OF SOCIOLOGY
3 4 criminal courts respond times crisis evidence 911 AMERICAN JOURNAL OF SOCIOLOGY
4 5 gender pay gaps us federal science agencies or... AMERICAN JOURNAL OF SOCIOLOGY

4. Tokenization

Tokenization refers to dividing text into a sequence of words or sentences. Here is an example at the word level (a short sentence-level sketch follows the output):

In [7]:
from textblob import TextBlob ## use the TextBlob library
TextBlob(df['title'][0]).words
word_tokenize(df['title'][0]) ## this also works; only the last expression is displayed
Out[7]:
['decoupling', 'marital', 'violence', 'struggle', 'divorce', 'china']
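
Since our titles are single phrases, sentence-level tokenization is not shown in the original output. As a minimal sketch on a made-up string (not part of the article data), nltk's sent_tokenize splits longer text into sentences in the same way:

In [ ]:
from nltk.tokenize import sent_tokenize
# sentence-level tokenization on a small illustrative example
sample = "Titles are short. Abstracts usually contain several sentences."
sent_tokenize(sample)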

5. Stemming

Stemming refers to the removal of suffixes such as "ing", "ly", and "s". It cuts off the prefixes and/or endings of words. It is not always helpful, because the stemmed word often loses its actual meaning.

In [8]:
st=PorterStemmer()
df['title'][:5].apply(lambda x: ' '.join([st.stem(w) for w in x.split()]))
Out[8]:
0           decoupl marit violenc struggl divorc china
1    organiz minor form use social tie evid teacher...
2    sort machin ethnic boundari make stratifi scho...
3             crimin court respond time crisi evid 911
4    gender pay gap us feder scienc agenc organiz a...
Name: title, dtype: object

Obviously, this is not ideal: most words have lost their original meaning.

6. Lemmatization

Lemmatization is a more effective option: it converts each word into its dictionary root (lemma), so it is usually preferred over stemming.

In [9]:
lem = WordNetLemmatizer()
df['title'] = df['title'].apply(lambda x: ' '.join([lem.lemmatize(w) for w in x.split()]))
df.head()
Out[9]:
id title journal
0 1 decoupling marital violence struggle divorce c... AMERICAN JOURNAL OF SOCIOLOGY
1 2 organizational minority form use social tie ev... AMERICAN JOURNAL OF SOCIOLOGY
2 3 sorting machine ethnic boundary making stratif... AMERICAN JOURNAL OF SOCIOLOGY
3 4 criminal court respond time crisis evidence 911 AMERICAN JOURNAL OF SOCIOLOGY
4 5 gender pay gap u federal science agency organi... AMERICAN JOURNAL OF SOCIOLOGY
In [10]:
## using TextBlob gives the same result
from textblob import Word
df['title'] = df['title'].apply(lambda x: " ".join([Word(w).lemmatize() for w in x.split()]))
df[:100]
Out[10]:
id title journal
0 1 decoupling marital violence struggle divorce c... AMERICAN JOURNAL OF SOCIOLOGY
1 2 organizational minority form use social tie ev... AMERICAN JOURNAL OF SOCIOLOGY
2 3 sorting machine ethnic boundary making stratif... AMERICAN JOURNAL OF SOCIOLOGY
3 4 criminal court respond time crisis evidence 911 AMERICAN JOURNAL OF SOCIOLOGY
4 5 gender pay gap u federal science agency organi... AMERICAN JOURNAL OF SOCIOLOGY
... ... ... ...
95 96 navigating conflict youth handle trouble highp... AMERICAN JOURNAL OF SOCIOLOGY
96 97 black migrant athlete medium race diaspora sport AMERICAN JOURNAL OF SOCIOLOGY
97 98 way woman age using refusing cosmetic interven... AMERICAN JOURNAL OF SOCIOLOGY
98 99 protect serve deport rise policing immigration... AMERICAN JOURNAL OF SOCIOLOGY
99 100 cost girl working teen origin gender wage gap AMERICAN JOURNAL OF SOCIOLOGY

100 rows × 3 columns

After cleaning the text, we can begin our analysis.

First, we convert all titles into one long string.

In [11]:
text = ' '.join(df['title'].tolist())  ## one long string containing all cleaned titles

a. N-grams

N-grams are combinations of multiple consecutive words. N-grams with N=1 are called unigrams; similarly, bigrams (N=2), trigrams (N=3), and so on. They capture language structure, such as which word is likely to follow a given one. This may be less useful here, since article titles are normally short, but it is handy for finding the most frequently occurring word pairs (see the short sketch after the trigram output below).

In [12]:
TextBlob(text).ngrams(3)[0:10]
Out[12]:
[WordList(['decoupling', 'marital', 'violence']),
 WordList(['marital', 'violence', 'struggle']),
 WordList(['violence', 'struggle', 'divorce']),
 WordList(['struggle', 'divorce', 'china']),
 WordList(['divorce', 'china', 'organizational']),
 WordList(['china', 'organizational', 'minority']),
 WordList(['organizational', 'minority', 'form']),
 WordList(['minority', 'form', 'use']),
 WordList(['form', 'use', 'social']),
 WordList(['use', 'social', 'tie'])]
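
As a quick illustration of that last point, here is a minimal sketch (not part of the original output) that counts the most common bigrams with collections.Counter. Because all titles were joined into one string, a few bigrams will span the boundary between two titles:

In [ ]:
from collections import Counter
# convert each bigram WordList to a tuple so it can be hashed, then count
bigram_counts = Counter(tuple(bg) for bg in TextBlob(text).ngrams(2))
bigram_counts.most_common(10)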

b. Frequency distribution

Next, we find the most frequently used words in the titles. We first tokenize the long text into single words and then use the built-in function to display the 20 most frequent ones.

In [13]:
from nltk.probability import FreqDist
tokens = TextBlob(text).words ## tokenize all words
fdist = FreqDist(tokens)
print(fdist)
fdist.most_common(20)
<FreqDist with 3914 samples and 12521 outcomes>
Out[13]:
[('social', 166),
 ('american', 142),
 ('politics', 128),
 ('state', 125),
 ('race', 99),
 ('america', 89),
 ('new', 87),
 ('life', 74),
 ('inequality', 71),
 ('gender', 68),
 ('family', 67),
 ('work', 65),
 ('movement', 63),
 ('city', 62),
 ('woman', 60),
 ('united', 60),
 ('making', 58),
 ('right', 57),
 ('global', 57),
 ('culture', 56)]
In [14]:
fdist.plot(30, cumulative=True)  ## cumulative frequency plot of the 30 most common words

c. Word cloud

Let's make a simple word cloud.

In [15]:
from wordcloud import WordCloud, ImageColorGenerator
from PIL import Image
In [16]:
wordcloud=WordCloud(max_font_size=50, max_words=200, background_color="white", width=600, height=300).generate(text)
plt.figure(figsize=(10,10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

Applying text mining with scikit-learn -- a machine learning library

Topic modeling

Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) is a probabilistic algorithm used to discover the topics present in a corpus: each document is modeled as a mixture of topics, and each topic as a distribution over words.

Non-negative Matrix Factorization (NMF)

Non-negative Matrix Factorization (NMF) is a dimensionality-reduction technique that factors an input matrix into two smaller non-negative matrices; for topic modeling, one factor maps documents to topics and the other maps topics to words.
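
To make the idea concrete, here is a minimal sketch on a small random non-negative matrix (not our data): NMF factors a "document-term" matrix V into a document-topic matrix W and a topic-word matrix H such that V is approximately W times H.

In [ ]:
import numpy as np
from sklearn.decomposition import NMF

# toy "document-term" matrix: 6 documents x 8 terms with non-negative counts
V = np.random.RandomState(0).poisson(2, size=(6, 8)).astype(float)

model = NMF(n_components=2, init='nndsvd', random_state=0, max_iter=500)
W = model.fit_transform(V)   # 6 x 2: how strongly each document loads on each topic
H = model.components_        # 2 x 8: how strongly each term belongs to each topic
print(W.shape, H.shape)
print(np.round(W @ H, 2))    # approximately reconstructs V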

Creation of the bag-of-words matrix

First, we prepare the bag-of-words input matrix: we convert the "title" column into a list of strings, then vectorize it.

In [17]:
ltext = df['title'].values.tolist()  ## each cleaned title as a separate document
In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
features = 500
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=features, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(ltext)
tfidf_feature_names = tfidf_vectorizer.get_feature_names()  # get_feature_names_out() on newer scikit-learn

# LDA can only use raw term counts because it is a probabilistic graphical model of word counts
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=features, stop_words='english')
tf = tf_vectorizer.fit_transform(ltext)  # raw count matrix for LDA
tf_feature_names = tf_vectorizer.get_feature_names()  # get_feature_names_out() on newer scikit-learn

NMF and LDA with scikit-learn

In [23]:
from sklearn.decomposition import NMF, LatentDirichletAllocation

def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print ("Topic %d:" % (topic_idx))
        print (" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
topic_words = 5
no_topics = 5

# Run NMF (on newer scikit-learn the alpha argument is split into alpha_W and alpha_H)
nmf = NMF(n_components=no_topics, random_state=1,
          alpha=.1, l1_ratio=.5, init='nndsvd').fit(tfidf)

# Run LDA on the raw count matrix
lda = LatentDirichletAllocation(n_components=no_topics,
                                max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0).fit(tf)
In [24]:
display_topics(nmf, tfidf_feature_names, topic_words)
Topic 0:
race politics gender class city
Topic 1:
social movement network theory change
Topic 2:
state united france welfare labor
Topic 3:
american city racial culture asian
Topic 4:
america community religion immigrant immigration
In [25]:
display_topics(lda, tf_feature_names, topic_words)
Topic 0:
social culture inequality movement class
Topic 1:
state united family life school
Topic 2:
race woman america black community
Topic 3:
global china labor science politics
Topic 4:
american new welfare immigrant society
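
Finally, as a minimal sketch of how the fitted models can be used further (this step and the two illustrative column names are not part of the original output), transform() gives the topic weights for every title, and argmax picks the dominant topic:

In [ ]:
# per-title topic weights from each model
nmf_doc_topics = nmf.transform(tfidf)  # shape: (n_titles, no_topics)
lda_doc_topics = lda.transform(tf)     # shape: (n_titles, no_topics)

# dominant topic per title
df['nmf_topic'] = nmf_doc_topics.argmax(axis=1)
df['lda_topic'] = lda_doc_topics.argmax(axis=1)
df[['title', 'nmf_topic', 'lda_topic']].head()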