Wednesday, 4 March 2015

Text Analytics: Preprocessing using Python - Post 4

In the previous post, we went through the concepts behind the different text pre-processing steps. In this post, we will see how to implement them in Python. As an example, we take a small paragraph, "my_para", which needs several of the pre-processing steps mentioned in the previous post. Let's have a look at the Python code.
""" Input paragraph for our pre-processing """
my_para = "I bought a Phone today... The phone is very nice :) "
""" Removing extra white spaces using regular expressions """
import re
my_para = re.sub(r'\s+', ' ', my_para)  ## raw string so that \s reaches the regex engine unchanged
print(my_para)
# I bought a Phone today... The phone is very nice :) #
""" Removing the extra periods using regular expressions """
my_para = re.sub(r'\.+', '.', my_para)  ## any run of periods becomes a single period
print(my_para)
# I bought a Phone today. The phone is very nice :) #
""" Removing the special characters using string replace """
special_char_list = [':', ';', '?', '}', ')', '{', '(']
for special_char in special_char_list:
    my_para = my_para.replace(special_char, '')
print(my_para)
# I bought a Phone today. The phone is very nice #
""" Standardizing the text by converting them to lower case """
my_para = my_para.strip().lower()
print(my_para)
# i bought a phone today. the phone is very nice #
""" Importing the necessary modules for stopwords removal """
## The NLTK tokenizer data and the 'stopwords' corpus must be downloaded once with nltk.download()
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
eng_stopwords = stopwords.words('english') ## eng_stopwords is the list of English stopwords
""" Tokenizing the paragraph first and then removing the stop words """
wordList = word_tokenize(my_para) ## Tokenizing the paragraph
wordList = [word for word in wordList if word not in eng_stopwords] ## Removing the stopwords
print(wordList)
# ['bought', 'phone', 'today', '.', 'phone', 'nice'] #
As we can see, the given paragraph contains some extra white space, so the first step cleans it up using a regular expression. The pattern '\s+' matches any run of whitespace characters, including tabs and newline characters, and each run is replaced with a single space (the white space is collapsed, not removed entirely). The next step replaces the extra full stops (periods) at the end of the first sentence with a single period.
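To make the effect of these two regular expressions easier to see, here is a minimal sketch that runs them on a string with a tab, a newline and repeated periods. The sample string and the helper name normalize_spacing are made up for illustration and are not part of the example above.

```python
import re

def normalize_spacing(text):
    # Collapse any run of whitespace (spaces, tabs, newlines) into one space
    text = re.sub(r'\s+', ' ', text)
    # Collapse any run of periods into a single period
    text = re.sub(r'\.+', '.', text)
    return text

# Hypothetical input with a tab, several spaces and a newline
sample = "I bought a\tPhone   today...\nThe phone is very nice :) "
print(normalize_spacing(sample))
# I bought a Phone today. The phone is very nice :) #
```

Note that the trailing space is kept as a single space; it is only dropped later by strip().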

In the third step, we created a list of special characters and removed each of them wherever it occurs in the paragraph. In the given example, the smiley got removed as a result, since it is made up of special characters. In the next step, we stripped the leading and trailing white space and converted the entire paragraph to lower case. So 'Phone' in the first sentence got converted to 'phone' and hence matches the 'phone' in the second sentence.
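The loop over the special characters can also be written as a single regular expression with a character class. The sketch below is one possible alternative under that assumption, not the code used in this post; the helper name clean_and_lower is hypothetical.

```python
import re

special_chars = [':', ';', '?', '}', ')', '{', '(']
# Character class that matches any one of the listed characters;
# re.escape protects the regex metacharacters in the list
pattern = re.compile('[' + re.escape(''.join(special_chars)) + ']')

def clean_and_lower(text):
    # Drop the special characters, then trim the ends and lower-case
    text = pattern.sub('', text)
    return text.strip().lower()

print(clean_and_lower("I bought a Phone today. The phone is very nice :) "))
# i bought a phone today. the phone is very nice #
```

For a handful of characters the plain loop is just as readable; the single pattern mainly saves repeated passes over long texts.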

The last step is the removal of stop words. The NLTK module ships with a built-in list of stopwords for the English language. We used that list in our code to filter the tokens. Note that the '.' token remains in the output, because punctuation is not part of the stopword list. We can also define our own stopword list and use that instead.
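As a rough sketch of the "own stopword list" idea, the snippet below filters a token list against a small hand-written set. The set contents, the token list and the helper name remove_stopwords are assumptions chosen for illustration; in practice the tokens would come from a tokenizer such as word_tokenize.

```python
# Hypothetical custom stopword set; extend it to suit your own corpus
custom_stopwords = {'i', 'a', 'the', 'is', 'very'}

def remove_stopwords(tokens, stopword_set):
    # Keep only the tokens that are not in the stopword set
    return [tok for tok in tokens if tok not in stopword_set]

# Hand-written token list standing in for tokenizer output
tokens = ['i', 'bought', 'a', 'phone', 'today', '.',
          'the', 'phone', 'is', 'very', 'nice']
print(remove_stopwords(tokens, custom_stopwords))
# ['bought', 'phone', 'today', '.', 'phone', 'nice'] #
```

A set is used here instead of a list because membership checks on a set do not slow down as the stopword list grows.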

This is how the text can be pre-processed in Python before analysis. In the next post, we will see how to do the same pre-processing steps using R.
