NLP is short for Natural Language Processing. Examples of NLP include sentiment analysis (classifying words as having positive or negative connotations) and making predictions with classification models. This demo starts with some basic techniques for pre-processing text data in order to extract better features. Then we will apply two machine learning models to find the topics of recently published articles from AJS.
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
#nltk.download('stopwords') # the first time you use nltk, you should download these
#nltk.download('punkt')
#nltk.download('wordnet')
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
df=pd.read_csv("title10_19.csv")
print(df.shape)
df.head()
df['title'] = df['title'].apply(lambda x: " ".join( x.lower() for x in x.split()))
df.head()
# [^\w\s] is a regular expression that matches anything that is not a word character or whitespace
df['title'] = df['title'].str.replace(r'[^\w\s]', '', regex=True)
df.head()
Stop words include ‘i’, ‘me’, ‘my’, ‘myself’, ‘we’, ‘you’, ‘he’, ‘his’, etc. They are not helpful for making meaningful predictions. We import a list of the most frequently used words from the Natural Language Toolkit (NLTK). Since we are examining academic articles, these titles are unlikely to contain many stop words, but it is still safe to remove them. Moreover, this is a standard step in most text-cleaning pipelines.
stop_words=stopwords.words("english")
print(stop_words) ## a quick check of all stopwords
df['title']=df['title'].apply(lambda x: ' '.join (x for x in x.split() if x not in stop_words))
df.head()
Tokenization refers to dividing the text into a sequence of words or sentences. Here is an example:
from textblob import TextBlob ## use the textblob library
TextBlob(df['title'][0]).words
word_tokenize(df['title'][0]) ## this also works
Stemming refers to the removal of suffixes like “ing”, “ly”, “s”, etc. It cuts off prefixes and/or endings of words. It is not always helpful, because oftentimes the stemmed word loses its actual meaning.
st=PorterStemmer()
df['title'][:5].apply(lambda x: ' '.join([st.stem(w) for w in x.split()]))
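To see why stemming can distort meaning, here is a quick check on a few arbitrary example words (illustration only, not part of the cleaning pipeline):
## the Porter stemmer often produces non-words
for w in ['studies', 'families', 'organization']:
    print(w, '->', st.stem(w))
## e.g. 'studies' becomes 'studi', which is no longer a dictionary word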
Lemmatization is a more effective option: it converts the word into its root form. Lemmatization is preferred.
lem=WordNetLemmatizer()
df['title'] = df['title'].apply(lambda x: ' '.join([lem.lemmatize(w) for w in x.split()]))
df.head()
## using textblob gives the same result
from textblob import Word
df['title'] = df['title'].apply(lambda x: " ".join([Word(w).lemmatize() for w in x.split()]))
df[:100]
First, we convert all titles into one long string.
text=' '.join(df['title'].tolist())
N-grams are combinations of multiple words used together. N-grams with N=1 are called unigrams; similarly, bigrams (N=2), trigrams (N=3), etc. They capture language structure, such as which letter or word is likely to follow a given one. This might not be very useful here, since article titles are normally short, but it is handy for finding the most frequently occurring word pairs.
TextBlob(text).ngrams(3)[0:10]
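As noted above, n-grams are handy for finding frequent word pairs. Here is a minimal sketch that counts the most common bigrams in the titles using the standard library's Counter:
from collections import Counter
bigrams = [' '.join(g) for g in TextBlob(text).ngrams(2)]  ## all word pairs in the titles
Counter(bigrams).most_common(10)  ## ten most frequent pairs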
Finding the most frequently used words in the titles. We first tokenize the long text into single words and use the built-in function to display the 20 most frequent words.
from nltk.probability import FreqDist
test2=TextBlob(text).words ## tokenize all words
fdist = FreqDist(test2)
print(fdist)
fdist.most_common(20)
fdist.plot(30, cumulative=True)
Let's make a simple word cloud.
from wordcloud import WordCloud, ImageColorGenerator
from PIL import Image
wordcloud=WordCloud(max_font_size=50, max_words=200, background_color="white", width=600, height=300).generate(text)
plt.figure(figsize=(10,10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()
Latent Dirichlet Allocation (LDA) is an algorithm used to discover the topics that are present in a corpus.
Non-Negative Matrix Factorization (NMF) is a dimension reduction technique that factors an input matrix into smaller matrices.
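As a toy illustration (not part of the analysis below), NMF factors a non-negative matrix V with shape (n_samples, n_features) into two smaller matrices W (n_samples, k) and H (k, n_features) such that V ≈ WH:
## toy example: factor a small random non-negative matrix into two smaller ones
import numpy as np
from sklearn.decomposition import NMF
V = np.random.rand(6, 4)                 # 6 "documents" x 4 "words", all non-negative
toy = NMF(n_components=2, init='nndsvd', random_state=0)
W = toy.fit_transform(V)                 # shape (6, 2): document-topic weights
H = toy.components_                      # shape (2, 4): topic-word weights
print(W.shape, H.shape)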
First, we prepare the input bag-of-words matrix. We convert the "title" column into a long list of strings, then vectorize it.
ltext= df['title'].values.tolist()
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
features = 500
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=features, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(ltext)
tfidf_feature_names = tfidf_vectorizer.get_feature_names() # on newer scikit-learn, use get_feature_names_out()
# LDA can only use raw term counts because it is a probabilistic graphical model
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=features, stop_words='english')
LDA = tf_vectorizer.fit_transform(ltext)
tf_feature_names = tf_vectorizer.get_feature_names()
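A quick, purely illustrative check of the difference between the two representations: the count matrix holds integer term frequencies, while the tf-idf matrix holds real-valued weights.
print(LDA[:1].toarray()[0][:10])      ## raw term counts for the first title
print(tfidf[:1].toarray()[0][:10])    ## tf-idf weights for the first title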
from sklearn.decomposition import NMF, LatentDirichletAllocation
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
topic_words = 5
no_topics = 5
# Run NMF
nmf = NMF(n_components=no_topics, random_state=1,
          alpha=.1, l1_ratio=.5, init='nndsvd').fit(tfidf)
# Run LDA
lda = LatentDirichletAllocation(n_components=no_topics,
                                max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0).fit(LDA)
display_topics(nmf, tfidf_feature_names, topic_words)
display_topics(lda, tf_feature_names, topic_words)
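As a quick follow-up sketch (not from the original posts), each title's most likely topic can be read off from the fitted LDA model's document-topic distribution:
doc_topic = lda.transform(LDA)            ## one row of topic proportions per title
for i in range(5):
    print(ltext[i], '-> topic', doc_topic[i].argmax())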
References:
https://www.analyticsvidhya.com/blog/2018/02/the-different-methods-deal-text-data-predictive-python/
https://towardsdatascience.com/nlp-for-beginners-cleaning-preprocessing-text-data-ae8e306bef0f
https://feng.li/files/statcase/L9.2-Probabilistic-Topic-Models.html
Code is adapted from the above posts.