Monday 2 March 2015

Text Analytics: Tokenization - Post 2

In the first post, we saw how to split paragraphs into sentences and then into words using the Python NLTK module. Now we will see how basic built-in functions in Python and R can be used to split paragraphs into words.

Logically, we know that paragraphs are made of sentences, and usually a punctuation mark (full stop) separates two sentences. We also know that words in a sentence are separated by white spaces. So we can make use of this to create a tokenizer of our own. This process of splitting paragraphs into words is known as tokenization in text analytics terminology.

First, let us see how to do this in R.

We make use of the strsplit function available in R, which makes the tokenization quite simple. Now let's take a look at the code in Python.
Here we use the split function to split the string on punctuation and then on white spaces.
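Since the original code is not reproduced here, a minimal sketch of the Python approach just described (the sample paragraph is made up for illustration):

```python
# Split a paragraph into sentences on the full stop,
# then split each sentence into words on white space.
paragraph = "Tokenization splits text into words. It is a basic step in text analytics."

sentences = paragraph.split(".")
words = [word for sentence in sentences for word in sentence.split()]

print(words)
```

Calling `split()` with no argument splits on any run of white space, which is exactly the second step we need here.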

But wait.. not every paragraph will be as clean as this. In many places people use more than one punctuation mark at the end of a sentence, leave more than one space between words, or include other special characters in the paragraphs. How do we tackle them?! Stay tuned..
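To see the problem concretely, here is a small sketch (the messy sample text is invented) of what the naive split does with repeated punctuation and extra spaces:

```python
# A naive split on "." leaves empty strings when punctuation repeats,
# and marks like "?" or "!" are not handled at all.
messy = "Is this clean?? No...  it has   extra spaces!"

pieces = messy.split(".")
print(pieces)  # empty strings appear between consecutive full stops

words = [word for piece in pieces for word in piece.split()]
print(words)   # tokens like 'clean??' and 'spaces!' still carry punctuation
```

The extra spaces themselves are absorbed by the no-argument `split()`, but the empty sentence strings and the punctuation stuck to words are exactly the issues the next post will address.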

1 comment:

  1. # Tokenization in Python with NLTK and TextBlob

    >>> import nltk
    >>> import textblob
    >>> from textblob import TextBlob
    >>> s = TextBlob("he is a good boy, he is a poor boy")
    >>> s.tags
    [('he', 'PRP'), ('is', 'VBZ'), ('a', 'DT'), ('good', 'JJ'), ('boy', 'NN'), ('he', 'PRP'), ('is', 'VBZ'), ('a', 'DT'), ('poor', 'JJ'), ('boy', 'NN')]
    >>> s.words
    WordList(['he', 'is', 'a', 'good', 'boy', 'he', 'is', 'a', 'poor', 'boy'])
    >>> s.sentences[0]
    Sentence("he is a good boy, he is a poor boy")
    >>> s.sentences
    [Sentence("he is a good boy, he is a poor boy")]
    >>> s.sentences[0].sentiment
    Sentiment(polarity=0.14999999999999997, subjectivity=0.6000000000000001)