We began our text analytics series with tokenization in the first two posts, where we took a simple example and created tokens using Python and R. In most real-life cases, however, we won't get data as clean as what we saw in our previous post.
Most text data is entered by humans, and each person has a unique style of writing: using emoticons between sentences, switching between upper and lower case, using more punctuation than needed, and so on. In such cases, we need to do some text pre-processing before we proceed with the analysis. Some of the pre-processing techniques in text analytics are:
1. Removal of extra spaces and periods:
Some text data might have more than one white space or tab between words, or at the beginning or end of a sentence. So one step in our pre-processing is to remove these extra white spaces and tabs from the given text. Also, some people tend to use more than one period at the end of a sentence. For example, "This is text analytics... Extra spaces need to be removed... " has more than one period at the end of each sentence. We need to remove these extra periods (full stops) so that our sentence parser works accurately.
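As a rough illustration of this step, here is a minimal Python sketch using the standard `re` module. The function name `normalize_spaces_and_periods` is made up for this example, and the sketch assumes that any run of two or more periods should become a single period, which may not suit text where an ellipsis is meaningful.

```python
import re

def normalize_spaces_and_periods(text):
    """Hypothetical helper: collapse repeated periods and extra whitespace."""
    # Replace any run of two or more periods with a single period.
    text = re.sub(r"\.{2,}", ".", text)
    # Replace any run of whitespace (spaces, tabs, newlines) with one space.
    text = re.sub(r"\s+", " ", text)
    # Remove whitespace at the beginning and end of the text.
    return text.strip()

sample = "This is   text analytics...\tExtra spaces need to be removed... "
print(normalize_spaces_and_periods(sample))
# This is text analytics. Extra spaces need to be removed.
```

The periods are collapsed first so that the whitespace pass sees the final punctuation.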
2. Removal of special characters:
At times, text data might contain special characters as well. For example, consider the sentence "I am happy :)". This sentence contains some special characters (an emoticon) that do not contribute to our analysis. So it is better to remove such special characters before we proceed with our analysis.
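One possible way to sketch this in Python is with a regular expression that keeps only an allowed set of characters. The allowed set below (ASCII letters, digits, whitespace, and a few basic punctuation marks) is an assumption made for this example, and `remove_special_characters` is a hypothetical name. Text in other languages or scripts would need a different rule.

```python
import re

def remove_special_characters(text):
    """Hypothetical helper: drop characters outside a small allowed set."""
    # Assumption: keep ASCII letters, digits, whitespace, and . , ! ? '
    text = re.sub(r"[^A-Za-z0-9\s.,!?']", "", text)
    # Removing characters can leave stray spaces behind, so tidy them up.
    return re.sub(r"\s+", " ", text).strip()

print(remove_special_characters("I am happy :)"))
# I am happy
```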
3. Removal of stop-words:
Another key concept in text pre-processing is the removal of stop-words. Stop-words are words that occur commonly in most sentences, so including them might distort our analysis. For example, if we plan to get the most frequent words from a document, we might end up with 'a', 'in', and 'can' as the most frequent words. But these tell us nothing, since the point of finding the frequent words is to get an idea of what the document is about. Stop-words defeat that purpose. Therefore, it is common in text analysis to remove the stop-words beforehand and then do the analysis.
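The effect on word frequencies can be sketched in a few lines of Python. The `STOP_WORDS` set below is a tiny hand-picked list for illustration only, not a standard stop-word list, and `most_frequent_words` is a hypothetical helper. It uses a naive whitespace split rather than the tokenizers from the earlier posts.

```python
from collections import Counter

# Illustrative stop-word list only; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "in", "can", "is", "of", "to", "and"}

def most_frequent_words(text, n=3):
    """Hypothetical helper: top-n words after dropping stop-words."""
    # Naive tokenization: lower-case the text and split on whitespace.
    tokens = text.lower().split()
    # Keep only the tokens that are not stop-words.
    kept = [token for token in tokens if token not in STOP_WORDS]
    return Counter(kept).most_common(n)

text = "a cat can sit in a box and a cat can nap in a box"
print(most_frequent_words(text, n=2))
```

Without the filter, 'a' would be the most frequent word in this sample. With it, 'cat' and 'box' come out on top.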
4. Text standardization:
Another common text pre-processing step is to standardize the text by converting it to lower case. For example, if we plan to get the most frequent words, we should not count 'Text' and 'text' as two different words just because they differ in character case. So it is good practice to standardize the text by converting it to lower case before we start our analysis.
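A small Python sketch shows the difference that lower-casing makes to word counts. The helper `count_words` and its `lowercase` flag are made up for this example, and the tokenization is again a plain whitespace split.

```python
from collections import Counter

def count_words(text, lowercase=True):
    """Hypothetical helper: count words, optionally lower-casing first."""
    tokens = text.split()
    if lowercase:
        # Treat 'Text' and 'text' as the same word.
        tokens = [token.lower() for token in tokens]
    return Counter(tokens)

sentence = "Text analytics needs text"
print(count_words(sentence, lowercase=False)["text"])  # 1
print(count_words(sentence)["text"])                   # 2
```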
5. Correction of spelling and chat-words:
Sometimes, text data might have misspelled words or chat words, such as 'ppl' instead of 'people'. So another pre-processing step that we could include in our text analysis is spell correction. Spell correction is usually done by checking a given word against a dictionary of words: if the word does not exist there, it is replaced with the dictionary word closest to it. Chat words can be handled in a similar way. If we have a lexicon that maps chat words to their corrected forms, it can be used to correct the chat words present in the data.
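Both ideas can be sketched with Python's standard `difflib` module. The `VOCABULARY` list and `CHAT_WORDS` lexicon below are toy data invented for this example, `correct_word` is a hypothetical helper, and the similarity cutoff of 0.8 is an arbitrary choice. A real spell checker would use a full dictionary and usually a more careful ranking of candidates.

```python
import difflib

# Toy data for illustration only.
VOCABULARY = ["people", "text", "analysis", "happy", "great"]
CHAT_WORDS = {"ppl": "people", "gr8": "great"}

def correct_word(word, cutoff=0.8):
    """Hypothetical helper: fix chat words, then try the closest dictionary word."""
    lower = word.lower()
    # Chat words are looked up directly in the lexicon.
    if lower in CHAT_WORDS:
        return CHAT_WORDS[lower]
    # Words already in the dictionary are left alone.
    if lower in VOCABULARY:
        return lower
    # Otherwise take the closest dictionary word, if it is similar enough.
    matches = difflib.get_close_matches(lower, VOCABULARY, n=1, cutoff=cutoff)
    # Unknown words with no close match are returned unchanged.
    return matches[0] if matches else word

print(correct_word("ppl"))     # people
print(correct_word("happpy"))  # happy
print(correct_word("xyz"))     # xyz
```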
In the following posts, we will see how to use Python and R to carry out the pre-processing steps described above. Happy learning!