Basic Text Mining: 3 Simple Paths


Text mining, in which you search alphanumeric data for meaningful patterns, is more complex than crunching plain numeric variables. The reason is that the human eye can scan only a few hundred rows of data before getting tired, and analytics software algorithms need to be properly programmed or else they miss the relevant text. An example: how many Punjabis live in Delhi (the statistic needed)? Suppose you have a dataset with all the names in Delhi and want to send an SMS contest (the marketing decision) on Lohri (a Punjabi festival).

Text manipulation can be done with the TRIM and LOWER functions in Excel and the corresponding functions in SAS. For mining, use the following options:

1) SAS Basic Text Mining - Using Only Base SAS

In SAS you can use the INDEXW function for text mining.

As per the SAS Online Doc:

INDEXW(source, excerpt)

source: specifies the character expression to search.
excerpt: specifies the string of characters to search for in the character expression. SAS removes the leading and trailing blanks from excerpt.

The INDEXW function searches source, from left to right, for the first occurrence of excerpt and returns the position in source of the substring's first character. If the substring is not found in source, INDEXW returns a value of 0. If there are multiple occurrences of the string, INDEXW returns only the position of the first occurrence.
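To make the behaviour described above concrete, here is a rough Python sketch that mimics it. This is not SAS code; the function name indexw and the sample names are made up for illustration, and it assumes the excerpt must match a whole word delimited by blanks or by the ends of the string.

```python
def indexw(source: str, excerpt: str) -> int:
    """Approximate emulation of INDEXW: 1-based position of the first
    whole-word match of excerpt in source, or 0 if there is none."""
    word = excerpt.strip()  # leading/trailing blanks of excerpt are ignored
    if not word:
        return 0
    start = 0
    while True:
        pos = source.find(word, start)
        if pos == -1:
            return 0  # not found
        end = pos + len(word)
        # Assumption: a "word" is bounded by a blank or by the string edge.
        before_ok = pos == 0 or source[pos - 1] == " "
        after_ok = end == len(source) or source[end] == " "
        if before_ok and after_ok:
            return pos + 1  # convert 0-based index to 1-based position
        start = pos + 1  # keep scanning to the right


# Hypothetical names, only for illustration.
names = ["Harpreet Singh", "Singhal Kumar", "Amit Kumar"]
flagged = [n for n in names if indexw(n, "Singh") > 0]
```

Because the match is on whole words, "Singhal Kumar" is not flagged when searching for "Singh", which is what makes this kind of search useful on a column of names.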



2) MS Excel

You can use MS Excel for text mining too. I recommend Office 2007 simply because it can handle more rows (about a million per sheet, versus 65,536 in earlier versions).

The function to use in Excel is SEARCH.
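For readers who want to see the logic outside a spreadsheet, SEARCH(find_text, within_text) returns the 1-based position of one string inside another, ignoring case. Below is a minimal Python sketch of that behaviour. The function name excel_search, the keyword list and the sample names are assumptions for illustration; Excel's wildcard support is left out, and where Excel would show a #VALUE! error this sketch returns None.

```python
from typing import Optional


def excel_search(find_text: str, within_text: str, start_num: int = 1) -> Optional[int]:
    """Case-insensitive substring search returning a 1-based position,
    or None when nothing is found (Excel would show #VALUE!)."""
    pos = within_text.lower().find(find_text.lower(), start_num - 1)
    return pos + 1 if pos != -1 else None


# Placeholder keyword list and names, not real data.
keywords = ["singh", "kaur"]
names = ["Harpreet SINGH", "Amit Kumar", "Simran Kaur"]

# Keep every name that contains at least one keyword.
matches = [n for n in names if any(excel_search(k, n) for k in keywords)]
```

Unlike the whole-word search in SAS, this is a plain substring match, so a keyword such as "singh" would also hit a name like "Singhal".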



3) MS Access

In MS Access you can use LIKE queries to create a different table or append a value to certain columns.


Some problems can't be solved with exact comparisons, e.g. "His name begins with Mc or Mac." In a case like this, wildcards are required; in SQL they are represented with the % sign and used with the LIKE keyword. (Access's default query syntax uses * in place of %.)


SELECT au_lname, city
FROM authors
WHERE au_lname LIKE 'Mc%' OR au_lname LIKE 'Mac%'
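If you do not have Access at hand, the same kind of LIKE query can be tried out with Python's built-in sqlite3 module. This is only a sketch: the helper name find_mc_names and the rows in the authors table are invented for illustration, and SQLite is standing in for Access here.

```python
import sqlite3


def find_mc_names(rows):
    """Load (au_lname, city) rows into an in-memory table and return
    those whose last name starts with Mc or Mac."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE authors (au_lname TEXT, city TEXT)")
        conn.executemany("INSERT INTO authors VALUES (?, ?)", rows)
        cur = conn.execute(
            "SELECT au_lname, city FROM authors "
            "WHERE au_lname LIKE 'Mc%' OR au_lname LIKE 'Mac%' "
            "ORDER BY au_lname"
        )
        return cur.fetchall()
    finally:
        conn.close()


# Made-up sample rows, not real data.
sample_rows = [
    ("McDonald", "Mumbai"),
    ("MacGregor", "Delhi"),
    ("Singh", "Delhi"),
    ("Machado", "Pune"),
]
result = find_mc_names(sample_rows)
```

Note that a prefix pattern like 'Mac%' also picks up names such as Machado, so the output of a wildcard query usually needs a manual check before it drives a marketing decision.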

UPDATED: The above post is now obsolete. There are easier and better ways to do text mining, including Weka and R.
