Thursday, 5 March 2015

Text Analytics : Preprocessing using R - Post 5

In the previous post, we learnt how to pre-process text data using Python. In this post, we will see how to pre-process the data using custom functions in R, without external libraries. This may be helpful when we need some flexibility in text pre-processing.

Let us take the same example which we used in the previous post. The first step is to remove the extra white spaces present in the paragraph. The 'gsub' function is used to remove these extra white spaces. The code snippet is as follows:
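A minimal sketch of what that snippet might look like (the paragraph here is only a stand-in for the example from the previous post):

# a stand-in paragraph with extra spaces, extra periods and a smiley
my_para <- "I bought a new Phone today...  It is a   good phone :)"
# collapse runs of white space (spaces, tabs, new lines) into a single space
my_para <- gsub("\\s+", " ", my_para)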

Using the 'gsub' function again, we can remove the extra periods as well:
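A sketch of that step:

# replace runs of two or more periods with a single period
my_para <- gsub("\\.{2,}", ".", my_para)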

The next step is the removal of special characters. We create a list of special characters and then use a 'for' loop to remove any of them present in the text.
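One way to write this, as a sketch (the list of characters is just an illustrative one):

# a hypothetical list of special characters to strip out
special_chars <- c(":", ")", "(", "!", "@", "#", "$", "%", "&", "*")
for (ch in special_chars) {
  my_para <- gsub(ch, "", my_para, fixed = TRUE)   # fixed = TRUE treats ch literally
}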

To convert the text to lowercase, we can use the 'tolower()' function in R:
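For example:

# convert the whole paragraph to lower case
my_para <- tolower(my_para)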

Finally, to remove stopwords, we need a list of stopwords. One is readily available in the 'tm' library of R, so we can make use of it. We then tokenize the text and use a 'for' loop to drop the stopwords.
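A sketch of that last step, assuming the 'tm' package is installed:

library(tm)                                             # provides stopwords()
stop_words <- stopwords("english")                      # built-in English stopword list
tokens <- unlist(strsplit(my_para, " ", fixed = TRUE))  # tokenize on spaces
clean_tokens <- c()
for (tok in tokens) {
  if (!(tok %in% stop_words)) {
    clean_tokens <- c(clean_tokens, tok)
  }
}
print(clean_tokens)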

Ah..! This is a long process and so much code! Are there any in-built functions which do all this with a single-line command instead of 'for' loops? Yes!! There are. Let us wait till our next post to see them.

Wednesday, 4 March 2015

Text Analytics : Preprocessing using Python - Post 4

In the previous post, we went through the concepts behind the different text pre-processing steps. In this post, we will learn how to implement them using Python. Here again, as an example, we take a small paragraph "my_para" which needs some of the pre-processing steps mentioned in the previous post. Let's have a look at the Python code.
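A sketch along those lines might look like this (the paragraph below is only a stand-in for the one used in the post):

import re
from nltk.corpus import stopwords

# a stand-in paragraph with extra spaces, extra periods and a smiley
my_para = "I bought a new Phone today...  It is a   good phone :)"

# 1. collapse runs of white space (including tabs and new lines) into a single space
my_para = re.sub(r"\s+", " ", my_para)

# 2. replace runs of two or more periods with a single one
my_para = re.sub(r"\.{2,}", ".", my_para)

# 3. remove special characters from a hypothetical list
special_chars = [":", ")", "(", "!", "@", "#", "$", "%", "&", "*"]
for ch in special_chars:
    my_para = my_para.replace(ch, "")

# 4. convert the whole paragraph to lower case
my_para = my_para.lower()

# 5. remove stopwords using NLTK's built-in English list
stop_words = set(stopwords.words("english"))
tokens = [w for w in my_para.split() if w not in stop_words]
print(tokens)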
As we can see, there are some extra white spaces in the given paragraph, so we remove them using regular expressions. The regular expression in the code can be used to remove all the extra white space, including tab and new line characters. The next step is to remove the extra full stops (periods) present at the end of the first sentence.

Then, in the third step, we create a list of special characters and remove those characters wherever they are present in the given paragraph. In the given example, the smiley gets removed as a result, since it is made of special characters. In the next step, we convert the entire paragraph into lower case. So 'Phone' in the first line gets converted to 'phone' and hence matches the 'phone' in the second sentence.

The last step is the removal of stop words. The NLTK module itself has an in-built list of stopwords for the English language. We have used this list in our code here to remove the stopwords. But we can also have our own stopword list and use that instead.

This is the pythonic way of pre-processing the text before analysis. In the next post, we will see how to do all this pre-processing using R.

Tuesday, 3 March 2015

Text Analytics : Text Preprocessing - Post 3

We started off our text analytics activity with tokenization in the first two posts. We also took a simple example and created tokens using Python and R. But in most real-life cases, we won't get data as clean as what we saw in our previous posts.

Most text data is entered by humans, and each person has his / her own unique style of writing, such as using emoticons in between sentences, switching between upper and lower case, using more punctuation than needed, and so on. In such cases, we need to do some text pre-processing before we proceed with the analysis. Some of the pre-processing techniques in text analytics are:

1. Removal of extra spaces and periods:
Some of the text data might have more than one white space or tab between words, or at the beginning or end of a sentence. So one step in our pre-processing is to remove these extra white spaces / tabs from the given text. Also, some people tend to use more than one period at the end of a sentence. For example, "This is text analytics... Extra spaces need to be removed... " has more than one period at the end of each sentence. So we need to remove these extra periods (full stops) to ensure that our sentence parser works accurately.

2. Removal of special characters:
At times, text data might contain special characters as well. For example, consider the sentence "I am happy :)". This sentence contains special characters (an emoticon), which do not make sense in our analysis. So it is better to remove all such special characters before we proceed with our analysis.

3. Removal of stop-words:
Another key concept in text pre-processing is the removal of stop-words. Stop-words are words which occur commonly in most sentences, so including them might affect our analysis. For example, if we plan to get the most frequent words from a document, we might end up getting 'a', 'in', 'can' as the frequent words. But they make no sense there, as we actually want the frequent words to give us an idea about the document, and these stop-words defeat that purpose. Therefore, it is common in text analysis to remove the stop-words beforehand and then do the analysis.

4. Text standardization:
Another common text pre-processing step is to standardize the text by converting it to lower case. Again, for example, if we plan to get the most frequent words, we should not count 'Text' and 'text' as two different words just because of a difference in character case. So it is always good to standardize the text by converting it to lower case before we start our analysis.

5. Correction of spelling and chat-words:
Sometimes, text data might have mistyped spellings or chat words such as 'ppl' instead of 'people', and so on. So another pre-processing step that we could include in our text analysis is spell correction. Spell correction is usually done by comparing a given word against a dictionary of words to see if the word exists, and if not, replacing it with the word from the dictionary that is closest to the given one. Also, if we have a lexicon mapping chat words to their corrected forms, it can be used to correct the chat words present in the data.
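As a rough illustration of the chat-word idea (the lexicon and sentence below are made up, not from any standard library):

# hypothetical lexicon mapping chat words to their corrected forms
chat_words = {"ppl": "people", "u": "you", "gr8": "great"}

sentence = "gr8 phone , u should tell ppl about it"
corrected = " ".join(chat_words.get(word, word) for word in sentence.split())
print(corrected)   # great phone , you should tell people about it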

In the following posts, we will see how we could make use of Python and R to do the above mentioned pre-processing steps. Happy learning !!

Monday, 2 March 2015

Text Analytics : Tokenization - Post 2

In the first post, we saw how to split paragraphs into sentences and then into words using the Python nltk module. Now we will see how basic in-built functions in Python and R can be used to split paragraphs into words.

Logically, we know that paragraphs are made of sentences, and usually a punctuation mark (full stop) is present between two sentences. We also know that words in a sentence are separated by white spaces. So we can make use of this to create a tokenizer of our own. This process of splitting paragraphs into words is known as Tokenization in text analytics terminology.

First, let us see how to do this in R.
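A minimal sketch of that R code (the paragraph is only a stand-in):

# a stand-in paragraph of two sentences
my_para <- "This is text analytics. We are learning tokenization."
# split into sentences on the full stop, then into words on white space
sentences <- unlist(strsplit(my_para, ".", fixed = TRUE))
words <- unlist(strsplit(sentences, " "))
words <- words[words != ""]   # drop empty strings left over from the splits
print(words)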

Here we make use of the strsplit function available in R, which keeps the tokenization quite simple. Now let's take a look at the code in Python.
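A sketch of the Python version, with the same stand-in paragraph:

# a stand-in paragraph of two sentences
my_para = "This is text analytics. We are learning tokenization."

# split on the full stop to get sentences, then on white space to get words
words = []
for sentence in my_para.split("."):
    words.extend(sentence.split())
print(words)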
Here we use the split function to split the string on punctuation and then on white spaces.

But wait.. we won't get all paragraphs as proper as this. There will be many places where people use more than one punctuation mark towards the end, have more than one space in between, or have some other special characters in their paragraphs. How do we tackle them?! Stay tuned..

Text Analytics : Intro and Tokenization - Post 1

Text analytics / mining is one of the niche fields in analytics, with numerous applications in various domains, primarily due to the explosion of the world wide web. However, it is quite different from other kinds of analysis in that it requires a blend of both programming and analytical skills, since the data is unstructured. There are a bunch of commercial and open-source tools available in the market to do text analysis, in case you are interested in a ready-made solution!

This series of blog posts will be focussed on learning the basics of text analytics, primarily using Python and R. Both have good modules for doing text analysis, and we will mostly be using the tm package in R and the nltk package in Python for our posts.

We feel that it is always better to learn concepts through hands-on experimentation, which makes the learning more fun. So we will be using this movie review data for learning the concepts. The movie review data is obtained from the Cornell University website. The dataset consists of two folders, one for positive and another for negative reviews, with 1000 files each. Each file contains a single movie review, and these reviews are obtained from the IMDB website.

Before we dive into processing this dataset, let us review some basic concepts with simpler examples. Text data is generally present in the form of paragraphs. So the first step in our analysis is to break down the paragraphs into sentences and then into words. How do we do that using Python?! We can do it easy peasy using the NLTK Python module. Let us have a look at the Python code to do that.
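A sketch of that code (the paragraph is only a stand-in):

from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

# a stand-in paragraph
my_para = "This is text analytics. We are learning how to tokenize paragraphs."
sentences = sent_tokenize(my_para)              # paragraph -> sentences
words = [word_tokenize(s) for s in sentences]   # each sentence -> words
print(sentences)
print(words)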
The first two lines of code import the necessary in-built functions from the nltk module. Then we create a paragraph and assign it to the variable 'my_para'. The sent_tokenize function then splits the paragraph into sentences. Our next step is to split the sentences into words, which is done by the word_tokenize function.

Hurray!! We are done with our first experiment of extracting words from a paragraph using the Python nltk module. Can we do the same in R? Or, in the absence of the nltk module, how can we do it in Python? Stay thinking..!