Thursday, 5 March 2015

Text Analytics : Preprocessing using R - Post 5

In the previous post, we learnt how to pre-process text data using Python. In this post, we will see how to pre-process the data using custom functions in R without external libraries. This is helpful when we need more flexibility in text pre-processing.

Let us take the same example that we used in the previous post. The first step is to remove the extra white spaces present in the paragraph. The 'gsub' function, which replaces every match of a pattern, is used to remove them.
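A minimal sketch of this step, assuming that "extra white spaces" means runs of spaces, tabs or newlines as well as leading and trailing spaces. The function name `remove_extra_spaces` and the sample string are hypothetical and only for illustration.

```r
# Hypothetical helper: collapse repeated whitespace and trim both ends.
remove_extra_spaces <- function(x) {
  # Replace every run of whitespace (spaces, tabs, newlines) with one space.
  x <- gsub("\\s+", " ", x)
  # Remove the single space that may remain at the start or the end.
  gsub("^ | $", "", x)
}

# Illustrative input, not the paragraph from the previous post.
sample_text <- "  Text   analytics is  fun.\n It has   many  steps.  "
print(remove_extra_spaces(sample_text))
```

Doing the trimming in a second 'gsub' call keeps each pattern simple and easy to adjust.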

Using the 'gsub' function again, we can remove the extra periods as well.
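A possible version of this step, assuming that "extra periods" means two or more periods in a row (for example "..." or ".."), each of which is reduced to a single period. The name `remove_extra_periods` is hypothetical.

```r
# Hypothetical helper: reduce runs of periods to a single period.
remove_extra_periods <- function(x) {
  # "\\." matches a literal period; "{2,}" means two or more in a row.
  gsub("\\.{2,}", ".", x)
}

print(remove_extra_periods("Wait... it works.. really."))
```

The period has to be escaped because an unescaped "." in a regular expression matches any character.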

The next step is the removal of special characters. We create a list of special characters and then use a 'for' loop to remove every character in that list from the text.
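The loop described above could look roughly like this. The character list below is only an assumed example and should be adapted to the data; `remove_special_chars` is a hypothetical name.

```r
# Assumed list of special characters; extend or shorten it as needed.
special_chars <- c("!", "@", "#", "$", "%", "^", "&", "*", "(", ")", "?", ",", ";", ":")

# Hypothetical helper: strip each listed character from the text.
remove_special_chars <- function(x, chars = special_chars) {
  for (ch in chars) {
    # fixed = TRUE treats ch as a literal string, so characters such as
    # "$" or "(" do not need to be escaped as regex metacharacters.
    x <- gsub(ch, "", x, fixed = TRUE)
  }
  x
}

print(remove_special_chars("Hi! #R @home (100%)?"))
```

Keeping the characters in a plain vector is what gives the flexibility: adding or dropping a character does not require touching a regular expression.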

To convert the text to lowercase, we can use the 'tolower()' function in R.
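For example, with an illustrative string (not the one from the previous post):

```r
# tolower() is part of base R and works on whole character vectors.
sample_text <- "Text Analytics Using R"
lower_text <- tolower(sample_text)
print(lower_text)
```

Lowercasing before stopword removal matters, because stopword lists are usually stored in lowercase and the comparison is case-sensitive.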

Finally, to remove stopwords, we need a list of stopwords. One is readily available in the 'tm' library of R, so we can make use of it. We then tokenize the text and use a 'for' loop to drop the tokens that appear in that list.
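A sketch of the tokenize-and-filter loop, assuming the text has already been lowercased and that tokens are separated by whitespace. To keep the example runnable without installing anything, it uses a tiny hand-written stopword list as a stand-in; with 'tm' installed, `tm::stopwords("english")` can be passed instead. The name `remove_stopwords` is hypothetical.

```r
# Small stand-in list; replace with tm::stopwords("english") if 'tm' is installed.
stop_list <- c("the", "is", "a", "of", "and", "in", "to")

# Hypothetical helper: drop every token that appears in stop_list.
remove_stopwords <- function(x, stop_list) {
  # Tokenize on whitespace; strsplit returns a list, so take its first element.
  tokens <- strsplit(x, "\\s+")[[1]]
  kept <- character(0)
  for (tok in tokens) {
    # Keep non-empty tokens that are not in the stopword list.
    if (nzchar(tok) && !(tok %in% stop_list)) {
      kept <- c(kept, tok)
    }
  }
  # Join the remaining tokens back into one string.
  paste(kept, collapse = " ")
}

print(remove_stopwords("the removal of stopwords is a common step", stop_list))
```

Comparing whole tokens rather than calling 'gsub' on each stopword avoids deleting letters inside longer words, such as the "is" in "this".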

Ah! This is a long process with a lot of code. Are there any built-in functions that do this with a single-line command instead of 'for' loops? Yes, there are! Let us wait until the next post to see them.

2 comments:

  1. Nice lessons, very simple and clear. It would be nice to continue
