A for Analytics : Text Pre-processing


Thursday, 5 March 2015

Text Analytics : Preprocessing using R - Post 5

In the previous post, we learnt how to pre-process text data using Python. In this post, we will see how to pre-process the data using custom functions in R rather than ready-made pre-processing functions from external libraries. This is helpful when we need some flexibility in text pre-processing.

Let us take the same example that we used in the previous post. The first step is to remove the extra white spaces present in the paragraph. The 'gsub' function is used to remove these extra white spaces.

Using the 'gsub' function again, we can remove the extra periods as well.

The next step is the removal of special characters. We create a list of special characters and then use a 'for' loop to remove every character in that list from the text.

To convert the text to lowercase, we can use the 'tolower()' function in R.

Finally, to remove stopwords, we need a list of stopwords. One is readily available in the 'tm' library of R, so we can make use of it. We then tokenize the text and use a 'for' loop to remove those stopwords.

Ah! This is a long process with a lot of code! Are there any built-in functions that do this in a single line instead of 'for' loops? Yes, there are! Let us wait till our next post to see them.

Wednesday, 4 March 2015

Text Analytics : Preprocessing using python - Post 4

In the previous post, we went through the concepts behind the different text pre-processing steps. In this post, we will learn how to implement them using Python. Here again, as an example, we take a small paragraph "my_para" which needs some of the pre-processing steps mentioned in the previous post. Let's have a look at the Python code.

As we can see, there are some extra white spaces in the given paragraph, so we removed them using regular expressions. The regular expression in the code matches every kind of white space, including tabs and newline characters, so each run of them can be replaced with a single space. The next step is to remove the extra full stops (periods) present at the end of the first sentence.
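As a minimal sketch of these first two steps, the snippet below uses Python's standard `re` module. The sample paragraph and the variable names other than `my_para` are made up for illustration and are not the ones from the original post.

```python
import re

# Hypothetical sample paragraph (not the original post's text).
my_para = "I bought a   Phone today...  \t The phone\nis very   good :)"

# Step 1: collapse every run of white space (spaces, tabs, newlines)
# into a single space and trim both ends.
no_extra_spaces = re.sub(r"\s+", " ", my_para).strip()

# Step 2: collapse runs of two or more periods into a single period.
no_extra_periods = re.sub(r"\.{2,}", ".", no_extra_spaces)

print(no_extra_periods)
```

The pattern `\s+` is what lets one substitution handle spaces, tabs and newlines together, while `\.{2,}` leaves a lone period untouched.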

Then, in the third step, we created a list of special characters and removed those characters wherever they are present in the given paragraph. In the given example, the smiley got removed as a result, since it is made of special characters. In the next step, we converted the entire paragraph to lower case. So 'Phone' in the first line got converted to 'phone' and hence matches the 'phone' in the second sentence.
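The third and fourth steps could look roughly like this. The input string and the contents of `special_chars` are assumptions chosen for illustration; a real list would depend on the data at hand.

```python
# Hypothetical input: a paragraph after white-space and period cleanup.
text = "I bought a Phone today. The phone is very good :)"

# Illustrative list of special characters to strip (an assumption;
# extend or shorten it to suit the data).
special_chars = [":", ")", "(", "@", "#", "$", "%", "&", "*", "!"]

# Step 3: remove every listed character from the text.
for ch in special_chars:
    text = text.replace(ch, "")

# Removing the smiley leaves a trailing space, so trim it.
text = text.strip()

# Step 4: convert the whole paragraph to lower case.
lowered = text.lower()

print(lowered)
```

After lowering, both occurrences of the word are spelled 'phone', so a later word count treats them as the same token.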

The last step is the removal of stop words. The NLTK module has a built-in list of stopwords for the English language. We have used this list in our code here to remove the stopwords, but we can also create our own stopword list and use that instead.
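The sketch below shows the "own stopword list" variant, so that it runs without any downloads. The list `my_stopwords` and the simple regex tokenizer are assumptions for illustration; with NLTK installed and its stopwords corpus downloaded, `nltk.corpus.stopwords.words("english")` could supply the list instead.

```python
import re

# Hypothetical input: a lower-cased, cleaned paragraph.
text = "i bought a phone today. the phone is very good"

# Small hand-made stop-word list (an assumption for this sketch).
my_stopwords = {"i", "a", "the", "is", "very"}

# Simple tokenization: each run of word characters becomes one token.
tokens = re.findall(r"\w+", text)

# Keep only the tokens that are not stop words.
filtered = [t for t in tokens if t not in my_stopwords]

print(filtered)
```

Storing the stop words in a set keeps each membership check fast even when the list grows to a few hundred words.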

This is the Pythonic way of pre-processing text before analysis. In the next post, we will see how to do all of this pre-processing using R.

Tuesday, 3 March 2015

Text Analytics : Text Preprocessing - Post 3

We started off our text analytics activity with tokenization in the first two posts. We also took a simple example and created tokens using Python and R. But in most real-life cases, we won't get data as clean as what we saw in our previous post.

Most text data is entered by humans, and each person has a unique style of writing, such as using emoticons in between sentences, alternating between upper and lower case, using more punctuation than needed and so on. In such cases, we need to do some text pre-processing before we proceed with the analysis. Some of the pre-processing techniques in text analytics are:

1. Removal of extra spaces and periods:

Some text data might have more than one white space or tab between words, or at the beginning or end of a sentence. So one step in our pre-processing is to remove these extra white spaces / tabs from the given text. Also, some people tend to use more than one period at the end of a sentence. For example, "This is text analytics... Extra spaces need to be removed... " has more than one period at the end of its sentences. We need to remove these extra periods (full stops) to ensure that our sentence parser works accurately.

2. Removal of special characters:

At times, text data might contain special characters as well. For example, consider the sentence "I am happy :)". This sentence contains some special characters (an emoticon) that carry no meaning as words in our analysis. So it is better to remove all such special characters before we proceed with our analysis.

3. Removal of stop-words:

Another key concept in text pre-processing is the removal of stop-words. Stop-words are words that occur commonly in most sentences, so including them might distort our analysis. For example, if we plan to get the most frequent words from a document, we might end up getting 'a', 'in' and 'can' as the most frequent words. But these tell us nothing, because the point of finding the frequent words is to get an idea of what the document is about. So these stop-words defeat that purpose. Therefore, it is common in text analysis to remove the stop-words beforehand and then do the analysis.
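To make the effect concrete, here is a small sketch that counts word frequencies before and after removing stop-words. The sentence and the stop-word set are invented for illustration.

```python
from collections import Counter

# Hypothetical, already lower-cased document.
doc = "a text can be a tweet a review or a mail and a text can be in a file"
tokens = doc.split()

# Illustrative stop-word set (an assumption for this sketch).
stop_words = {"a", "in", "can", "be", "or", "and"}

# Frequencies with stop-words still present.
raw_counts = Counter(tokens)

# Frequencies after dropping the stop-words.
filtered_counts = Counter(t for t in tokens if t not in stop_words)

print(raw_counts.most_common(1))       # the top word is the stop-word 'a'
print(filtered_counts.most_common(1))  # the top word is now 'text'
```

Before filtering, the most frequent word says nothing about the topic; after filtering, the top word is the one the document is actually about.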

4. Text standardization:

Another common text pre-processing step is to standardize the text by converting it to lower case. Again, if we plan to get the most frequent words, we should not count 'Text' and 'text' as two different words just because the character case differs. So it is good practice to standardize the text by converting it to lower case before we start our analysis.
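A tiny sketch of why case matters when counting words; the token list is made up for illustration.

```python
from collections import Counter

# Hypothetical tokens with mixed character case.
tokens = ["Text", "analytics", "starts", "with", "text", "cleaning"]

# Counting as-is treats 'Text' and 'text' as two different words.
case_sensitive = Counter(tokens)

# Lower-casing first merges them into a single entry.
case_folded = Counter(t.lower() for t in tokens)

print(case_sensitive["Text"], case_sensitive["text"])
print(case_folded["text"])
```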

5. Correction of spelling and chat-words:

Sometimes, text data might have misspelled words, or chat words such as 'ppl' instead of 'people'. So another pre-processing step that we could include in our text analysis is spell correction. Spell correction is usually done by checking whether a given word exists in a dictionary of words and, if it does not, replacing it with the dictionary word that is closest to it. Chat words can be handled in a similar way: if we have a lexicon that maps chat words to their corrected forms, we can use it to replace the chat words present in the data.
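One possible way to sketch this idea is with `difflib.get_close_matches` from Python's standard library, which picks the most similar word from a list. The tiny `dictionary`, the `chat_words` lexicon, the helper name `correct_token` and the 0.8 similarity cutoff are all assumptions for illustration; a real system would use a far larger word list and possibly a different similarity measure.

```python
import difflib

# Toy dictionary and chat-word lexicon (assumptions for this sketch).
dictionary = ["people", "text", "analytics", "spelling", "happy"]
chat_words = {"ppl": "people", "u": "you", "gr8": "great"}


def correct_token(token):
    """Return a corrected form of token, or token itself if no fix is found."""
    # Chat words are replaced through a direct lexicon lookup.
    if token in chat_words:
        return chat_words[token]
    # Words already in the dictionary are left untouched.
    if token in dictionary:
        return token
    # Otherwise take the closest dictionary word, if it is similar enough.
    matches = difflib.get_close_matches(token, dictionary, n=1, cutoff=0.8)
    return matches[0] if matches else token


tokens = ["ppl", "text", "analitics", "speling"]
print([correct_token(t) for t in tokens])
```

The cutoff keeps unrelated words from being "corrected" into something wrong: a token with no sufficiently close dictionary entry is returned unchanged.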

In the following posts, we will see how we can use Python and R to carry out the above-mentioned pre-processing steps. Happy learning!!