In the previous post, we went through the concepts behind the different text pre-processing steps. In this post, we will learn how to implement them in Python. As an example, we again take a small paragraph, "my_para", which needs some of the pre-processing steps mentioned in the previous post. Let's have a look at the Python code.
""" Input paragraph for our pre-processing """ | |
my_para = "I bought a Phone today... The phone is very nice :) " | |
""" Removing extra white spaces using regular expressions """ | |
import re | |
my_para = re.sub('\s+', ' ', my_para) | |
print my_para | |
# I bought a Phone today... The phone is very nice :) # | |
""" Removing the extra periods using regular expressions """ | |
my_para = re.sub('\.+', '.', my_para) | |
print my_para | |
# I bought a Phone today. The phone is very nice :) # | |
""" Removing the special characters using string replace """ | |
special_char_list = [':', ';', '?', '}', ')', '{', '('] | |
for special_char in special_char_list: | |
my_para = my_para.replace(special_char, '') | |
print my_para | |
# I bought a Phone today. The phone is very nice # | |
""" Standardizing the text by converting them to lower case """ | |
my_para = my_para.strip().lower() | |
print my_para | |
# i bought a phone today. the phone is very nice # | |
""" Importing the necessary modules for stopwords removal""" | |
from nltk.tokenize import word_tokenize | |
from nltk.corpus import stopwords | |
eng_stopwords = stopwords.words('english') ## eng_stopwords is the list of english stopwords | |
""" Tokenizing the paragraph first and then removing the stop words """ | |
wordList = word_tokenize(my_para) ## Tokenizing the paragraph | |
wordList = [word for word in wordList if word not in eng_stopwords] ## Removing the stopwords | |
print wordList | |
# ['bought', 'phone', 'today', '.', 'phone', 'nice'] # |
In the first two steps, we removed the extra white spaces and collapsed the repeated periods into a single one using regular expressions. Then in the third step, we created a list of special characters and removed them wherever they were present in the given paragraph. In the given example, the smiley got removed as a result, since it is made up of special characters. In the next step, we converted the entire paragraph into lower case, so 'Phone' in the first sentence got converted to 'phone' and hence matched with the 'phone' in the second sentence.
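As a side note, the same cleanup can be done in one pass with a regular expression character class instead of a loop of string replaces. This is just a minimal alternative sketch, not part of the original code:

""" Alternative: removing the special characters with a single regular expression """
import re
my_para = re.sub(r'[:;?{}()]', '', my_para)   ## same character set as special_char_list above

Inside a character class, characters like '?' and '(' lose their special regex meaning, so no escaping is needed here.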
The last step is the removal of stop words. The NLTK module has an in-built list of stopwords for the English language. We have used this list in our code and removed the stopwords. But we can also build our own stopword list and use that instead, as sketched below.
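As a minimal sketch of that idea (the list below is purely illustrative, not a recommended stopword set), a custom list can simply replace eng_stopwords in the filtering step:

""" Using our own stopword list instead of the NLTK one """
my_stopwords = ['i', 'a', 'the', 'is', 'very']   ## illustrative hand-picked list
wordList = word_tokenize(my_para)   ## my_para is the lower-cased paragraph from above
wordList = [word for word in wordList if word not in my_stopwords]
print(wordList)
# ['bought', 'phone', 'today', '.', 'phone', 'nice'] #

Note that word_tokenize keeps the period as its own token, which is why '.' still appears in the output; punctuation tokens can be filtered out separately if needed.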
This is the pythonic way of pre-processing the text before analysis. In the next post, we will see how to do all this pre-processing using R.