Monday 2 March 2015

Text Analytics : Intro and Tokenization - Post 1

Text analytics / mining is one of the niche field in analytics which has numerous applications in various domains primarily due to the explosion of world wide web. However it is quite different from other analysis that it requires a blend of both programming and analytical skills since the data is unstructured. There are a bunch of commercial and open source tools available in the market to do text analysis in case if you are interested in a ready made solution!

This series of blogs will be focussed on learning the basics of text analytics  primarily using Python and R. Python and R have good modules for doing text analysis and we will be mostly using tm package in R and nltk package in python for our blog posts.

We feel that it is always better to learn the concepts through hands-on experimentation which will make the learning more fun. So we will be using this movie review data for learning the concepts. This movie review data is obtained from Cornell university website. This dataset consists of two folders, one for positive and another one for negative reviews with 1000 files each. Each file consists of a movie review and these reviews are obtained from IMDB website.

Before we dive into processing this dataset, let us review some basics concepts with simpler examples. Text data is generally present in the form of paragraphs. So first step in our analysis is to break down the paragraphs into sentences and then into words. How do we do that using python?! We can do it easy peasy using NLTK python module. Let us have a look at the python code to do that.
First two lines of code import the necessary in-built functions from nltk module. Then we create a paragraph and assign it to a variable 'my_para'. Function sent_tokenize then splits the paragraph into sentences. Our next step is to split the sentences into words which is done by word_tokenize function. 

Hurray!! We are done with our first experimentation of extracting words from a paragraph using python nltk module. Can we do the same in R? Or in the absence of nltk module, how can we do it in python? Stay thinking..!