In the previous post, we learnt how to pre-process text data using Python. In this post, we will see how to pre-process the same data in R using custom functions, without external libraries. This is helpful when we need more flexibility in text pre-processing.
Let us take the same example we used in the previous post. The first step is to remove the extra white spaces present in the paragraph. The 'gsub' function, together with a regular expression, is used to remove them. The code snippet is as follows:
### Input paragraph for our pre-processing ###
my_para = "I bought a Phone today... The phone is very nice :) "

### Removing extra white spaces using gsub and regular expressions ###
my_para = gsub('\\s+', ' ', my_para)
my_para
# "I bought a Phone today... The phone is very nice :) " #
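Our example paragraph happens to have no extra spaces, so the output above looks unchanged. To see the pattern in action, here is a made-up messy string (not part of the original example): '\\s+' matches a run of one or more whitespace characters, including tabs and newlines, and each run collapses to a single space.

```r
# '\\s+' matches one or more whitespace characters (space, tab, newline),
# so every run of whitespace collapses to a single space
messy = "I   bought a\tPhone\ntoday"
gsub('\\s+', ' ', messy)
# "I bought a Phone today"
```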
Using the 'gsub' function again, we can remove the extra periods as well. The code is:
### Removing the extra periods using regular expressions ###
my_para = gsub('\\.+', '.', my_para)
my_para
# "I bought a Phone today. The phone is very nice :) " #
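Note the escaping in the pattern: an unescaped '.' is a regex metacharacter that matches any character, so '\\.' is needed to match a literal period. A small illustration (using a made-up string) shows the difference:

```r
# Escaped: '\\.+' matches one or more literal periods
gsub('\\.+', '.', "Wow... Nice!!")
# "Wow. Nice!!"

# Unescaped: '.+' greedily matches the entire string,
# so everything is replaced by a single period
gsub('.+', '.', "Wow... Nice!!")
# "."
```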
The next step is the removal of special characters. We create a vector of special characters and then use a 'for' loop to remove each character present in the list.
### Removing the special characters using gsub ###
special_char_vec = c('!', '/', ')', '\\(', ';', ':', '\\{', '}')
for (special_char in special_char_vec){
  my_para = gsub(special_char, '', my_para)
}
my_para
# "I bought a Phone today. The phone is very nice " #
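As a side note, the same removal can be sketched in a single 'gsub' call using a character class. Inside square brackets, characters like '(' and '{' are treated literally, so no per-character escaping or loop is needed:

```r
# One-pass alternative (a sketch): inside a character class,
# the listed characters are matched literally
para = "I bought a Phone today. The phone is very nice :) "
para = gsub('[!/();:{}]', '', para)
```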
To convert the text to lowercase, we can use the 'tolower()' function in R:
### Standardizing the text by converting it to lower case ###
my_para = tolower(my_para)
my_para
# "i bought a phone today. the phone is very nice " #
Finally, to remove stopwords, we need a list of stopwords. One is readily available in the 'tm' library of R, so we can make use of it. We then tokenize the paragraph and use a 'for' loop to drop the stopwords.
### Importing the tm library and getting English stopwords ###
library(tm)
eng_stopwords = stopwords('english')

### Removing the stop words the hard way, using an inefficient for loop!! ###
wordsList = strsplit(my_para, ' ')
newWordsList = character(length=0)
for (word in unlist(wordsList)){
  if (!(word %in% eng_stopwords)){
    newWordsList = c(newWordsList, word)
  }
}
newWordsList
# "bought" "phone" "today." "phone" "nice" "" #
Ah! This is a long process and so much code! Are there any in-built functions that do all of this with a single-line command instead of 'for' loops? Yes, there are. Let us wait till our next post to see them.