Monday, 2 March 2015

Text Analytics : Intro and Tokenization - Post 1

Text analytics (or text mining) is a niche field within analytics with applications across many domains, driven largely by the explosion of text on the World Wide Web. It differs from other kinds of analysis in that the data is unstructured, so it requires a blend of programming and analytical skills. If you would rather use a ready-made solution, there are plenty of commercial and open-source tools for text analysis on the market!

This series of blog posts focuses on learning the basics of text analytics, primarily using Python and R. Both have good modules for text analysis; we will mostly use the tm package in R and the nltk package in Python.

We feel that it is always better to learn concepts through hands-on experimentation, which makes the learning more fun. So we will be using this movie review data, obtained from the Cornell University website. The dataset consists of two folders, one for positive and one for negative reviews, with 1000 files each. Each file contains a single movie review, and the reviews come from the IMDb website.

Before we dive into processing this dataset, let us review some basic concepts with simpler examples. Text data is generally present in the form of paragraphs, so the first step in our analysis is to break the paragraphs down into sentences and then into words. How do we do that in Python? Easy peasy, using the NLTK module. Let us have a look at the Python code to do that.
The first two lines of code import the necessary built-in functions from the nltk module. Then we create a paragraph and assign it to the variable 'my_para'. The sent_tokenize function splits the paragraph into sentences. Our next step is to split the sentences into words, which is done by the word_tokenize function.

Hurray! We are done with our first experiment: extracting words from a paragraph using the Python nltk module. Can we do the same in R? Or, in the absence of the nltk module, how can we do it in Python? Keep thinking!
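
For readers who want a head start on the last question, here is one possible approach using only Python's standard re module. It is a rough sketch, not necessarily what a follow-up post would use: the function names split_sentences and split_words are hypothetical, and the rules are deliberately naive (an abbreviation such as "Dr." would wrongly end a sentence).

```python
import re

def split_sentences(paragraph):
    """Naively split on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [part for part in parts if part]

def split_words(sentence):
    """Return runs of letters, digits and apostrophes, plus each
    remaining non-space character (punctuation) as its own token."""
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", sentence)

# Sample paragraph (assumed text, chosen only for the demo)
my_para = ("Text analytics is fun. It needs programming and "
           "analytical skills! Shall we begin?")

my_sentences = split_sentences(my_para)
my_words = [split_words(sentence) for sentence in my_sentences]

for sentence, words in zip(my_sentences, my_words):
    print(sentence)
    print(words)
```

The trained Punkt model behind sent_tokenize handles abbreviations and other edge cases far better, which is the main reason to prefer nltk when it is available.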
