In the first post, we saw how to split paragraphs into sentences and then into words using the Python NLTK module. Now we will see how basic built-in functions in Python and R can be used to split paragraphs into words.
Logically, we know that paragraphs are made of sentences, and a punctuation mark (full stop) is usually present between two sentences. We also know that words in a sentence are separated by white spaces. So we can make use of this to build a tokenizer of our own. This process of splitting paragraphs into words is known as tokenization in text-analytics terminology.
First, let us see how to do this in R.
""" Input paragraph """ | |
my_para <- "This is first sentence. And this is second." | |
""" splitting the paragraph into sentences using strsplit funtion """ | |
sentencesList <- strsplit(my_para, "\\.") | |
""" Getting the first sentence from the output and printing it """ | |
firstSentence <- unlist(sentencesList)[1] | |
firstSentence | |
# "This is first sentence" # | |
""" Splitting the first sentence into words using strsplit again """ | |
wordsList <- strsplit(firstSentence, " ") | |
wordsList | |
# "This" "is" "first" "sentence" # |
Here we make use of the strsplit function available in R, which makes tokenization quite simple. Now let's take a look at the code in Python.
""" Input paragraph """ | |
my_para = "This is first sentence. And this is second." | |
""" splitting the paragraph into sentences using split funtion """ | |
sentences_list = my_para.strip().split(".") | |
""" Getting the first sentence from the output and printing it """ | |
first_sentence = sentences_list[0] | |
print first_sentence | |
# This is first sentence # | |
""" Splitting the first sentence into words using split function again """ | |
words_list = first_sentence.strip().split(" ") | |
print words_list | |
# ['This', 'is', 'first', 'sentence'] # |
Here we use the split function to split the string first on punctuation and then on white spaces.
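If we want the words from every sentence, not just the first one, the same two splits can be combined in a list comprehension. This is a minimal sketch extending the code above; it is not from the original post:

# Tokenize every sentence in the paragraph (hypothetical extension)
my_para = "This is first sentence. And this is second."
words_per_sentence = [
    s.strip().split(" ")                  # split each sentence on spaces
    for s in my_para.strip().split(".")   # split the paragraph on full stops
    if s.strip()                          # drop the empty string after the final "."
]
print(words_per_sentence)
# [['This', 'is', 'first', 'sentence'], ['And', 'this', 'is', 'second']]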
But wait.. We can't expect all paragraphs to be as clean as this one. There will be many places where people use more than one punctuation mark at the end of a sentence, have more than one space between words, or have other special characters in the paragraph. How do we tackle those?! Stay tuned..
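As a small taste of what is coming, Python's re module can already handle runs of punctuation and extra whitespace. This is a minimal sketch of one possible approach, not the solution from the follow-up post:

import re

messy_para = "Is this clean??  Not   really...   it has extra spaces!"

# Split on runs of sentence-ending punctuation (., !, ?) and drop empty pieces
sentences = [s.strip() for s in re.split(r"[.!?]+", messy_para) if s.strip()]
print(sentences)
# ['Is this clean', 'Not   really', 'it has extra spaces']

# Split each sentence on runs of whitespace instead of a single space
words = [re.split(r"\s+", s) for s in sentences]
print(words)
# [['Is', 'this', 'clean'], ['Not', 'really'], ['it', 'has', 'extra', 'spaces']]

The pattern r"[.!?]+" treats any run of sentence-ending punctuation as a single boundary, and r"\s+" does the same for runs of whitespace, so repeated marks and multiple spaces no longer produce empty tokens.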
Tokenization in Python with NLTK can also be done using TextBlob:
>>> import nltk
>>> nltk.download("all")
>>> import textblob
>>> from textblob import TextBlob
>>> s = TextBlob("he is a good boy, he is a poor boy")
>>> s.tags
[('he', 'PRP'), ('is', 'VBZ'), ('a', 'DT'), ('good', 'JJ'), ('boy', 'NN'), ('he', 'PRP'), ('is', 'VBZ'), ('a', 'DT'), ('poor', 'JJ'), ('boy', 'NN')]
>>> s.words
WordList(['he', 'is', 'a', 'good', 'boy', 'he', 'is', 'a', 'poor', 'boy'])
>>> s.sentences[0]
Sentence("he is a good boy, he is a poor boy")
>>> s.sentences
[Sentence("he is a good boy, he is a poor boy")]
>>> s.sentences[0].sentiment
Sentiment(polarity=0.14999999999999997, subjectivity=0.6000000000000001)