Thursday, 5 March 2015

Text Analytics : Preprocessing using R - Post 5

In the previous post, we learnt how to pre-process text data using Python. In this post, we will see how to pre-process the data using custom functions in R, without external libraries. This can be helpful when we need some extra flexibility in text pre-processing.

Let us take the same example we used in the previous post. The first step is to remove the extra white spaces present in the paragraph. The 'gsub' function is used to remove these extra white spaces. The code snippet is as follows:
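The original snippet is not reproduced here; a minimal sketch of what it might look like, with a made-up example paragraph, is:

```r
# made-up example paragraph with extra spaces, extra periods, mixed case and a smiley
my_para <- "My Phone is the best  phone ever...   I am so happy  with this phone :)"

# collapse runs of white space (spaces, tabs, new lines) into a single space
my_para <- gsub("[[:space:]]+", " ", my_para)
```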

Again using the 'gsub' function, we can remove the extra periods as well. The code is:
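A sketch of that step, continuing with the same made-up my_para:

```r
# replace runs of two or more full stops with a single one
my_para <- gsub("\\.{2,}", ".", my_para)
```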

The next step is the removal of special characters. We create a list of special characters and then use a 'for' loop to remove any of them present in the paragraph.
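A sketch of that loop; the special-character list here is made up for illustration:

```r
special_chars <- c(":", ")", "(", ";", "@", "&", "*")
for (ch in special_chars) {
  # fixed = TRUE treats ch as a literal string, not a regular expression
  my_para <- gsub(ch, "", my_para, fixed = TRUE)
}
```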

To convert the text to lower case, we can use the 'tolower()' function in R.
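Continuing the same sketch:

```r
# convert the whole paragraph to lower case
my_para <- tolower(my_para)
```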

Finally, to remove stopwords, we need a list of stopwords. One is readily available in the 'tm' library of R, so we can make use of it. We then tokenize the text and use a 'for' loop to drop those stopwords.
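A sketch of that last step, assuming the 'tm' package is installed (its stopword list is used below):

```r
library(tm)                               # provides stopwords("english")
stop_words <- stopwords("english")

tokens <- unlist(strsplit(my_para, " ", fixed = TRUE))
tokens <- tokens[tokens != ""]            # drop empty tokens left by the split
kept <- character(0)
for (w in tokens) {
  if (!(w %in% stop_words)) {
    kept <- c(kept, w)
  }
}
kept
```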

Ah..! This is a long process and so much code! Are there any in-built functions that do all of this in a single line instead of 'for' loops? Yes, there are! Let us wait till our next post to see them.

Wednesday, 4 March 2015

Text Analytics : Preprocessing using python - Post 4

In the previous post, we went through the concepts of different text pre-processing steps. In this post, we will learn how to implement them using Python. Here again, as an example, we take a small paragraph "my_para" which needs some of the pre-processing steps mentioned in the previous post. Let's have a look at the Python code.
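The original code is not reproduced here, so below is a minimal sketch of what such a script might look like; the example paragraph, the special-character list and the variable names are made up for illustration:

```python
import re
from nltk.corpus import stopwords   # needs a one-time nltk.download('stopwords')

# a made-up paragraph with extra spaces, extra periods, mixed case and a smiley
my_para = "My Phone is the best  phone ever...   I am so happy  with this phone :)"

# 1. collapse extra white space (spaces, tabs, new lines) into single spaces
my_para = re.sub(r"\s+", " ", my_para)

# 2. replace runs of two or more full stops with a single one
my_para = re.sub(r"\.{2,}", ".", my_para)

# 3. remove special characters from a custom list (this removes the smiley too)
special_chars = [":", ")", "(", ";", "@", "&", "*"]
for ch in special_chars:
    my_para = my_para.replace(ch, "")

# 4. convert the entire paragraph to lower case
my_para = my_para.lower()

# 5. remove English stop words after splitting into tokens
stop_words = set(stopwords.words("english"))
tokens = [w for w in my_para.split() if w not in stop_words]
print(tokens)
```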
As we can see, there are some extra white spaces in the given paragraph, so we removed them using regular expressions. The regular expression in the code removes the extra white space, including tab and new line characters. The next step is to remove the extra full stops (periods) present at the end of the first sentence.

Then, in the third step, we created a list of special characters and removed them wherever they appear in the paragraph. In the given example, the smiley got removed as a result, since it is made up of special characters. In the next step, we converted the entire paragraph to lower case, so 'Phone' in the first sentence became 'phone' and now matches the 'phone' in the second sentence.

The last step is the removal of stop words. The NLTK module has an in-built list of stopwords for the English language; we used it in our code here to remove the stopwords. But we could also maintain our own stopword list and use that instead.

This is the pythonic way of pre-processing text before analysis. In the next post, we will see how to do all of this pre-processing using R.

Tuesday, 3 March 2015

Text Analytics : Text Preprocessing - Post 3

We started off our text analytics activity with tokenization in the first two posts. We also took a simple example and created tokens using Python and R. But in most real-life cases, we won't get data as clean as what we saw in the previous post.

Most text data is entered by humans, and each person has his / her own style of writing, such as using emoticons in between sentences, mixing upper and lower case, using more punctuation than needed, and so on. In such cases, we need to do some text pre-processing before we proceed with the analysis. Some of the pre-processing techniques in text analytics are:

1. Removal of extra spaces and periods:
Some text data might have more than one white space or tab between words, or at the beginning or end of a sentence. So one step in our pre-processing is to remove these extra white spaces / tabs from the given text. Also, some people tend to use more than one period at the end of a sentence. For example, "This is text analytics... Extra spaces need to be removed... " has more than one period at the end of each sentence. We need to remove these extra periods (full stops) to ensure that our sentence parser works accurately.

2. Removal of special characters:
At times, text data might contain special characters as well. For example, consider the sentence "I am happy :)". This sentence contains some special characters (an emoticon) which do not make sense in our analysis. So it is better to remove all such special characters before we proceed with our analysis.

3. Removal of stop-words:
Another key concept in text pre-processing is the removal of stop-words. Stop-words are words which occur commonly in most sentences, so including them might affect our analysis. For example, if we plan to get the most frequent words from a document, we might end up with 'a', 'in', 'can' as the frequent words. But these tell us nothing, since we actually want the frequent words to give us an idea about the document; stop-words defeat that purpose. Therefore, it is common in text analysis to remove the stop-words beforehand and then do the analysis.

4. Text standardization:
Another common text pre-processing step is to standardize the text by converting it to lower case. Again, if we plan to get the most frequent words, we should not count 'Text' and 'text' as two different words just because of a change in character case. So it is always good to standardize the text by converting it to lower case before we start our analysis.

5. Correction of spelling and chat-words:
Sometimes text data contains mistyped spellings or chat words, such as 'ppl' instead of 'people'. So another pre-processing step we could include is spell correction. Spell correction is usually done by comparing a given word against a dictionary of words to see whether it exists, and if not, replacing it with a word from the dictionary that is very close to it. Text data might also contain chat words; if we have a lexicon mapping chat words to their corrected forms, it can be used to fix them, as in the sketch below.
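As a rough illustration (not a production spell checker), here is a minimal sketch using Python's difflib to pick the closest dictionary word; the tiny dictionary and chat-word lexicon are made up:

```python
import difflib

# tiny made-up word list and chat-word lexicon; real ones would be much larger
dictionary = ["people", "person", "phone", "happy", "text"]
chat_words = {"ppl": "people", "u": "you", "gr8": "great"}

def correct_word(word):
    # chat-word lookup first, then the nearest dictionary word by similarity
    if word in chat_words:
        return chat_words[word]
    if word in dictionary:
        return word
    close = difflib.get_close_matches(word, dictionary, n=1, cutoff=0.8)
    return close[0] if close else word

print([correct_word(w) for w in ["ppl", "happpy", "text"]])
# ['people', 'happy', 'text']
```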

In the following posts, we will see how we could make use of Python and R to do the above mentioned pre-processing steps. Happy learning !!

Monday, 2 March 2015

Text Analytics : Tokenization - Post 2

In the first post, we saw how to split paragraphs into sentences and then into words using the Python nltk module. Now we will see how basic built-in functions in Python and R can be used to split paragraphs into words.

Logically, we know that paragraphs are made of sentences, and usually a punctuation mark (full stop) sits between two sentences. We also know that words in a sentence are separated by white spaces. So we can make use of this to create a tokenizer of our own. This process of splitting paragraphs into words is known as tokenization in text analytics terminology.

First, let us see how to do this in R.
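The original snippet is not reproduced here; a minimal sketch with a made-up paragraph would be:

```r
my_para <- "This is the first sentence. This is the second sentence."

# split the paragraph into sentences on the full stop,
# then split everything into words on white space
sentences <- unlist(strsplit(my_para, ".", fixed = TRUE))
words <- unlist(strsplit(sentences, " ", fixed = TRUE))
words <- words[words != ""]   # drop empty tokens left behind by the splits
words
```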

We are making use of the strsplit function available in R, which makes the tokenization quite simple. Now let's take a look at the code in Python.
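Again, the original snippet is not shown; a minimal Python sketch of the same idea, with the same made-up paragraph, is:

```python
my_para = "This is the first sentence. This is the second sentence."

# split the paragraph into sentences on the full stop,
# then split each sentence into words on white space
sentences = [s for s in my_para.split(".") if s.strip()]
words = [w for s in sentences for w in s.split(" ") if w]
print(words)
```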
Here we use the split function to split the string on the punctuation mark and then on white spaces.

But wait.. we can't expect all paragraphs to be as neat as this. In many places people use more than one punctuation mark at the end, leave more than one space in between, or sprinkle other special characters through the paragraph. How do we tackle them?! Stay tuned..

Text Analytics : Intro and Tokenization - Post 1

Text analytics / mining is one of the niche fields in analytics, with numerous applications in various domains, primarily due to the explosion of the world wide web. However, it is quite different from other kinds of analysis in that it requires a blend of both programming and analytical skills, since the data is unstructured. There are a bunch of commercial and open source tools available in the market to do text analysis, in case you are interested in a ready-made solution!

This series of blogs will be focussed on learning the basics of text analytics, primarily using Python and R. Python and R have good modules for doing text analysis, and we will mostly be using the tm package in R and the nltk package in Python for our blog posts.

We feel that it is always better to learn concepts through hands-on experimentation, which makes the learning more fun. So we will be using this movie review data for learning the concepts. The movie review data is obtained from the Cornell University website. The dataset consists of two folders, one for positive and another for negative reviews, with 1000 files each. Each file contains a movie review, and these reviews are obtained from the IMDB website.

Before we dive into processing this dataset, let us review some basic concepts with simpler examples. Text data is generally present in the form of paragraphs. So the first step in our analysis is to break down the paragraphs into sentences and then into words. How do we do that using Python?! We can do it easy peasy using the NLTK Python module. Let us have a look at the Python code to do that.
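The original snippet is not reproduced here; a minimal sketch, assuming the 'punkt' tokenizer models have already been downloaded via nltk.download('punkt') and using a made-up paragraph, would be:

```python
from nltk.tokenize import sent_tokenize   # both functions rely on the 'punkt'
from nltk.tokenize import word_tokenize   # models from nltk.download('punkt')

# a made-up example paragraph
my_para = "This is the first sentence. This is the second sentence."

sentences = sent_tokenize(my_para)              # paragraph -> sentences
words = [word_tokenize(s) for s in sentences]   # each sentence -> words
print(sentences)
print(words)
```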
The first two lines of code import the necessary in-built functions from the nltk module. Then we create a paragraph and assign it to a variable 'my_para'. The sent_tokenize function then splits the paragraph into sentences. Our next step is to split the sentences into words, which is done by the word_tokenize function.

Hurray!! We are done with our first experiment of extracting words from a paragraph using the Python nltk module. Can we do the same in R? Or, in the absence of the nltk module, how can we do it in Python? Stay thinking..!

Thursday, 11 December 2014

An Extended Discussion on Statistical Experimentation

I would like to add some points to the experimentation blog post written by @thauckzulily on Zulily's engineering website - Link. The experimental approach had an interesting exploration-vs-exploitation battle for website optimization, and they took a simulation-based approach to experimentation. I would like to add some points on the power discussion, which was the prime factor of that post. As per the power calculation (and as the graphic depicts), there are two ways to achieve high power:
  • The larger the difference in conversion rates, the smaller the chance that it goes undetected (Type II error)
  • The larger the sample size, the smaller the chance that the difference goes undetected (Type II error)

[Figure: power vs. mean difference in conversion rate at α = 0.05]

The need for an experiment starts with detecting an improvement through a hypothesis check on KPIs. Most of the time, the experiment is a check for a statistically significant increase in the KPIs under consideration. Hence, the sample size gets bigger in order to reduce the chance of a significant effect going undetected.

Consider this scenario: if there actually is a significant shift, the probability that it is detected on the first sample is 1 - β,
         on the second sample is β(1 - β),
         and on the r-th sample is β^(r-1)(1 - β).
             Hence the expected number of samples needed to detect the shift is 1 / (1 - β). For example, with β = 0.8 we would need, on average, 1 / (1 - 0.8) = 5 samples before detecting it.

                              Nature: In Control      Nature: Out of Control
We conclude: In Control       Confidence, 1 - α       Type II error, β
We conclude: Out of Control   Type I error, α         Power, 1 - β

The above table shows how confidence, power, and the two error types line up with the hypothesis test.

Type I error: concluding that the conversion difference is significant when in reality there is no significant difference

Type II error: concluding that there is no significant conversion difference when in reality the difference is significant

i.e., P{Type I error} = P{reject H0 | H0 is true}
        = P{conclude the conversion is significant | although there is no significant conversion}
Type II error (consumer's risk): P{Type II error} = P{fail to reject H0 | H0 is false}
        = P{conclude no significant conversion | although the conversion is significant}

Power of the test: Power = 1 - β = P{reject H0 | H0 is false}

The ultimate aim of an experiment is to find out whether there is a statistically significant difference between the two treatments.

Confidence:-
Now consider the opposite situation: the experiment is running but there is no real difference, and we would like to know after how many samples, on average, we will see a (false) out-of-control signal. By the same argument as above, that expected number of samples is N = 1 / α. The best analogy for confidence and power is the simulation result in the referred blog.

Though power and confidence are defined under different hypotheses, increasing the required confidence increases the sample size needed, and the larger sample in turn increases the power for a given effect size.
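As a rough illustration of that relationship, here is a small sketch using the standard two-proportion sample-size approximation (the baseline and treatment conversion rates are made up); it shows the required sample size per group growing as we demand more power at a fixed confidence level:

```python
from scipy.stats import norm

def sample_size_per_group(p1, p2, alpha=0.05, power=0.8):
    """Approximate n per group to detect p1 vs p2 with a two-sided z-test."""
    z_alpha = norm.ppf(1 - alpha / 2)   # confidence side (1 - alpha)
    z_beta = norm.ppf(power)            # power side (1 - beta)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

# made-up conversion rates: 5% baseline vs 6% treatment
for power in (0.8, 0.9, 0.95):
    print(power, round(sample_size_per_group(0.05, 0.06, power=power)))
```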

Logistic Regression cheat sheet:-
Consider a logit regression result. A small trick for reading the confidence interval: if the CI of the treatment coefficient, C(treatment)[T.B], ranges from negative to positive (i.e. contains zero), that ambivalence will also be reflected in the odds ratio's CI, which will then straddle 1.

It conveys that the treatment term is statistically insignificant, which means either there is no significant difference between the Control and Treatment methods, or there is no statistical evidence that the Treatment is better than the Control exposure.
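A minimal sketch of reading such a result with statsmodels; the data frame below is simulated, not the data from the referred post:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# simulated A/B data: 'treatment' is the exposure, 'converted' the binary outcome
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "treatment": np.repeat(["A", "B"], 1000),
    "converted": np.concatenate([rng.binomial(1, 0.050, 1000),
                                 rng.binomial(1, 0.055, 1000)]),
})

result = smf.logit("converted ~ C(treatment)", data=df).fit()
print(result.summary())           # look at the C(treatment)[T.B] row and its CI
print(np.exp(result.conf_int()))  # odds-ratio CI: it straddles 1 when the coef CI spans 0
```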




Contingency Table Calculation:-
A contingency table tests the hypothesis of dependency between rows and columns. Consider this example (the numbers are made up for illustration): we can test whether the advertising medium is independent of the landing page variable. The observed frequency table shows, for each advertising medium, how many users went through Landing Page A or Landing Page B. The expected frequency table shows the values we would expect if the two variables were perfectly independent.

Example:

Observed Frequency (Advertisement Medium vs. Landing Page)

                Medium I   Medium II   Medium III   Total
Landing A          160        140          40        340
Landing B           40         60          60        160
Total              200        200         100        500

Expected Frequency (Advertisement Medium vs. Landing Page)

                Medium I   Medium II   Medium III   Total
Landing A          136        136          68        340
Landing B           64         64          32        160
Total              200        200         100        500

Consider the hypothesis: the Landing Page is independent of the Advertising Medium, at α = 0.05.
We compute the chi-square statistic as the sum of (observed - expected)² / expected over all cells, with (2 - 1) × (3 - 1) = 2 degrees of freedom, and compare it against the critical value (about 5.99 at α = 0.05) to conclude on the hypothesis we are testing.
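For reference, a small sketch of that calculation with scipy, using the made-up numbers from the tables above:

```python
import numpy as np
from scipy.stats import chi2_contingency

# observed frequencies: rows = Landing A / Landing B, columns = Medium I / II / III
observed = np.array([[160, 140, 40],
                     [40, 60, 60]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, dof, p_value)   # dof = (2 - 1) * (3 - 1) = 2
print(expected)             # matches the expected frequency table above
print("reject independence at alpha = 0.05:", p_value < 0.05)
```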

"Fewer the facts, stronger the opinions." – Arnold Glasow

Thanks for your time in reading through the blog. Please feel free to comment on any of the terminology or calculations above.

Thursday, 28 November 2013

Road accidents in India - A Visualization

As a kid I thought the best novelist was one who churned out nerve-wracking stories, with every other page replete with cliff-hanging moments and plot twists. But as maturity dawned, I realised the best ones were the good storytellers: the ones who could paint a vivid picture in the reader's mind, conveying precisely the matters they intend to convey.

Drawing a parallel to the analytics field: as data scientists mature, they nurture storytelling abilities and evolve into good storytellers. The most frequent mistake a newbie makes is spending a lot of time and energy on the analysis part, only to haphazardly put together a result without much thought for the receptiveness of the audience. In the current scenario, as the tentacles of analytics proliferate into every single department of a business, more and more people who are novices in the art of data interpretation are suddenly entrusted with the duty of making decisions based on data. The onus is now on the data scientist to bring out their storytelling skills and convey the story in an easily understandable manner to a wide range of audiences.
This case study shows how one can leverage data visualization tools to convey the right story to any range of audience.

Data collected from Government Ministries and Departments is made available at Data Portal India, a platform supporting the Open Data initiative of the Government of India. The dataset about road accidents in India is obtained from this data portal, and Tableau Public is used to visualize the data.



You can find here an interactive presentation of the data which we will use going forward.

Now, for an analyst, the foremost issue while presenting any data is: should it be at a granular level, presenting every break-up of the data the way a veteran end user would prefer, or at a macroscopic level for the benefit of novice audiences? This is where data visualization tools come to the rescue. For instance, Tableau enables us to show the data at a macroscopic view, conveying the story at a glance, or, if needed, turns interactive and provides the data at the required granularity. Feel free to play with the report by clicking on states in the map and see if they tell a different story from the one at first glance.


The first map shows the distribution of the number of accidents in each state. Now what story does it tell you? OK, northern and eastern states are relatively safer, probably due to the unhurried pace of their lives. Larger states have the worst traffic records, and you start wondering what the correlation could be. Then you want to dig deeper and start looking at state-wise data. 'Oh oh, wait there' you say, when the realization hits that the numbers here are absolute and merely reflect the population distribution of these states. So it would be fair to assume that this view is of less importance and might even be misleading.





The second view shows the percentage increase in the number of accidents in each state over the years (2003 to 2011). Maharashtra, which was painted dark red in the previous view, now sports a blemish-less white canvas. The state had a mere 4% rise in accidents over these 9 years. But when you look at granular state-level data, even that 4% translates to a mind-boggling 3K accidents. To give a perspective, this is half the number of accidents in the state of J&K by year 2013. So let's just blame the base effect (higher denominator) for the flat rate and not read too much into those white states. Still, this view combined with the previous one conveys an important story - Uttar Pradesh, in spite of being a huge state, has a high rise in the number of accidents.



Finally, since the absolute numbers did not make much sense, let's look at the ratio of the number of accidents to the population of each state (4th tab in the Tableau). It should hardly take 3 seconds for anyone to scream 'Oh see, there is GOA'. Voila!! We have a clear outlier, and you can guess the reason :)



Now, catching the same trends from multiple traditional data tables requires a lot of skill that comes only from years of practice. Looking at the data laid out on a map rather than in a spreadsheet gives a perspective that is simply unparalleled. Also, the time taken to discern those patterns, validate the assumptions, and compare multiple data sets reduces drastically. This is where data visualization can add big value, and it is an irreplaceable piece in the armor of a data scientist.