
Wednesday, 4 March 2015

Text Analytics : Preprocessing using python - Post 4

In the previous post, we went through the concepts behind the different text pre-processing steps. In this post, we will learn how to implement them using python. As an example, we again take a small paragraph, "my_para", which needs some of the pre-processing steps mentioned in the previous post. Let's have a look at the python code.
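Below is a minimal sketch of the setup; the exact contents of my_para are an assumption, written to exhibit the issues discussed in the following steps (extra whitespace, repeated full stops, a smiley, and mixed case):

```python
import re
from nltk.corpus import stopwords   # needs a one-time nltk.download('stopwords')

# Sample paragraph (assumed text). It contains extra whitespace,
# trailing full stops, a smiley, and 'Phone' vs 'phone'.
my_para = "This Phone  is  really   awesome....\tI have never used a phone like this :) \n"
```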
As we can see, there are some extra white spaces in the given paragraph, so we remove them first using regular expressions. The regular expression shown below removes all extra whitespace, including tab spaces and newline characters. The next step is to remove the extra full stops (periods) present at the end of the first sentence.
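A sketch of these first two steps (standard re idioms, not necessarily the exact patterns from the original code):

```python
# Step 1: collapse all whitespace (spaces, tabs, newlines) into single spaces
my_para = re.sub(r'\s+', ' ', my_para).strip()

# Step 2: squeeze runs of full stops down to a single full stop
my_para = re.sub(r'\.+', '.', my_para)
print(my_para)
# 'This Phone is really awesome. I have never used a phone like this :)'
```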

Then, in the third step, we create a list of special characters and remove any of them that appear in the given paragraph. In our example, the smiley gets removed as a result, since it is made up of special characters. In the fourth step, we convert the entire paragraph to lower case, so 'Phone' in the first sentence matches 'phone' in the second.
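Steps three and four might look like this; the particular special-character list is an assumption, chosen so that it covers the characters that make up the smiley:

```python
# Step 3: remove special characters; the smiley ':)' is built from these,
# so it disappears here
special_chars = [':', ')', '(', ';', '*', '@', '#']
for char in special_chars:
    my_para = my_para.replace(char, '')

# Step 4: convert everything to lower case so 'Phone' matches 'phone'
my_para = my_para.lower().strip()
print(my_para)
# 'this phone is really awesome. i have never used a phone like this'
```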

The last step is the removal of stop words. The NLTK module itself has an in-built list of stopwords for the English language; we have used it in our code here to remove the stopwords. But we can also build our own stopword list and use that instead, as in the second half of the sketch below.
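A sketch of this last step, along with the custom-list variant (the custom list here is purely hypothetical):

```python
# Step 5: remove stopwords using NLTK's in-built English list
stop_words = set(stopwords.words('english'))
words = [word for word in my_para.split() if word not in stop_words]
print(words)   # stopwords like 'this', 'is', 'i', 'have' and 'a' are gone

# Alternatively, filter against our own stopword list instead
my_stopwords = ['this', 'is', 'i', 'a', 'really']   # hypothetical custom list
words = [word for word in my_para.split() if word not in my_stopwords]
```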

This is the pythonic way of pre-processing text before analysis. In the next post, we will see how to do all of this pre-processing using R.

Monday, 2 March 2015

Text Analytics : Intro and Tokenization - Post 1

Text analytics / mining is one of the niche fields in analytics, with numerous applications in various domains, primarily due to the explosion of the world wide web. However, it is quite different from other kinds of analysis in that it requires a blend of both programming and analytical skills, since the data is unstructured. There are a bunch of commercial and open source tools available in the market to do text analysis, in case you are interested in a ready-made solution!

This series of blog posts will be focussed on learning the basics of text analytics, primarily using Python and R. Both languages have good modules for doing text analysis, and we will mostly be using the tm package in R and the nltk module in python for our posts.

We feel that it is always better to learn concepts through hands-on experimentation, which makes the learning more fun. So we will be using this movie review data for learning the concepts. The movie review data is obtained from the Cornell University website. The dataset consists of two folders, one for positive and one for negative reviews, with 1000 files each. Each file contains a single movie review, and these reviews are taken from the IMDB website.

Before we dive into processing this dataset, let us review some basic concepts with simpler examples. Text data generally comes in the form of paragraphs, so the first step in our analysis is to break the paragraphs down into sentences and then into words. How do we do that using python?! We can do it easy peasy using the NLTK python module. Let us have a look at the python code to do that.
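Here is a minimal sketch of that code, matching the description that follows; the text assigned to my_para is an assumption:

```python
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize   # both need a one-time nltk.download('punkt')

# a sample paragraph (assumed text)
my_para = "Text analytics is fun. It helps us mine insights from raw text. Let us begin!"

# split the paragraph into sentences
sentences = sent_tokenize(my_para)
print(sentences)
# ['Text analytics is fun.', 'It helps us mine insights from raw text.', 'Let us begin!']

# split each sentence into a list of words
words = [word_tokenize(sentence) for sentence in sentences]
print(words)
```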
The first two lines of code import the necessary in-built functions from the nltk module. Then we create a paragraph and assign it to the variable 'my_para'. The sent_tokenize function then splits the paragraph into sentences. Our next step is to split the sentences into words, which is done by the word_tokenize function.

Hurray!! We are done with our first experiment of extracting words from a paragraph using the python nltk module. Can we do the same in R? Or, in the absence of the nltk module, how else can we do it in python? Stay tuned!