R for Business Analytics- Book by Ajay Ohri

So the cover art is ready, and if you are a reviewer, you can reserve online copies of the book I have been writing for past 2 years. Special thanks to my mentors, detractors, readers and students- I owe you a beer!

You can also go here-

http://www.springer.com/statistics/book/978-1-4614-4342-1

 

R for Business Analytics

R for Business Analytics

Ohri, Ajay

2012, 2012, XVI, 300 p. 208 illus., 162 in color.

Hardcover
Information

ISBN 978-1-4614-4342-1

Due: September 30, 2012

(net)

approx. 44,95 €
  • Covers full spectrum of R packages related to business analytics
  • Step-by-step instruction on the use of R packages, in addition to exercises, references, interviews and useful links
  • Background information and exercises are all applied to practical business analysis topics, such as code examples on web and social media analytics, data mining, clustering and regression models

R for Business Analytics looks at some of the most common tasks performed by business analysts and helps the user navigate the wealth of information in R and its 4000 packages.  With this information the reader can select the packages that can help process the analytical tasks with minimum effort and maximum usefulness. The use of Graphical User Interfaces (GUI) is emphasized in this book to further cut down and bend the famous learning curve in learning R. This book is aimed to help you kick-start with analytics including chapters on data visualization, code examples on web analytics and social media analytics, clustering, regression models, text mining, data mining models and forecasting. The book tries to expose the reader to a breadth of business analytics topics without burying the user in needless depth. The included references and links allow the reader to pursue business analytics topics.

 

This book is aimed at business analysts with basic programming skills for using R for Business Analytics. Note the scope of the book is neither statistical theory nor graduate level research for statistics, but rather it is for business analytics practitioners. Business analytics (BA) refers to the field of exploration and investigation of data generated by businesses. Business Intelligence (BI) is the seamless dissemination of information through the organization, which primarily involves business metrics both past and current for the use of decision support in businesses. Data Mining (DM) is the process of discovering new patterns from large data using algorithms and statistical methods. To differentiate between the three, BI is mostly current reports, BA is models to predict and strategize and DM matches patterns in big data. The R statistical software is the fastest growing analytics platform in the world, and is established in both academia and corporations for robustness, reliability and accuracy.

Content Level » Professional/practitioner

Keywords » Business Analytics – Data Mining – Data Visualization – Forecasting – GUI – Graphical User Interface – R software – Text Mining

Related subjects » Business, Economics & Finance – Computational Statistics – Statistics

TABLE OF CONTENTS

Why R.- R Infrastructure.- R Interfaces.- Manipulating Data.- Exploring Data.- Building Regression Models.- Data Mining using R.- Clustering and Data Segmentation.- Forecasting and Time-Series Models.- Data Export and Output.- Optimizing your R Coding.- Additional Training Literature.- Appendix

Data Quality in R #rstats

Many Data Quality Formats give problems when importing in your statistical software.A statistical software is quite unable to distingush between $1,000, 1000% and 1,000 and 1000 and will treat the former three as character variables while the third as a numeric variable by default. This issue is further compounded by the numerous ways we can represent date-time variables.

The good thing is for specific domains like finance and web analytics, even these weird data input formats are fixed, so we can fix up a list of handy data quality conversion functions in R for reference.

 

After much muddling about with coverting internet formats (or data used in web analytics) (mostly time formats without date like 00:35:23)  into data frame numeric formats, I found that the way to handle Date-Time conversions in R is

Dataset$Var2= strptime(as.character(Dataset$Var1),”%M:%S”)

The problem with this approach is you will get the value as a Date Time format (02/31/2012 04:00:45-  By default R will add today’s date to it.)  while you are interested in only Time Durations (4:00:45 or actually just the equivalent in seconds).

this can be handled using the as.difftime function

dataset$Var2=as.difftime(paste(dataset$Var1))

or to get purely numeric values so we can do numeric analysis (like summary)

dataset$Var2=as.numeric(as.difftime(paste(dataset$Var1)))

(#Maybe there is  a more elegant way here- but I dont know)

The kind of data is usually one we get in web analytics for average time on site , etc.

 

 

 

 

 

and

for factor variables

Dataset$Var2= as.numeric(as.character(Dataset$Var1))

 

or

Dataset$Var2= as.numeric(paste(Dataset$Var1))

 

Slight problem is suppose there is data like 1,504 – it will be converted to NA instead of 1504

The way to solve this is use the nice gsub function ONLy on that variable. Since the comma is also the most commonly used delimiter , you dont want to replace all the commas, just only the one in that variable.

 

dataset$Variable2=as.numeric(paste(gsub(“,”,””,dataset$Variable)))

 

Now lets assume we have data in the form of % like 0.00% , 1.23%, 3.5%

again we use the gsub function to replace the % value in the string with  (nothing).

 

dataset$Variable2=as.numeric(paste(gsub(“%”,””,dataset$Variable)))

 

 

If you simply do the following for a factor variable, it will show you the level not the value. This can create an error when you are reading in CSV data which may be read as character or factor data type.

Dataset$Var2= as.numeric(Dataset$Var1)

An additional way is to use substr (using substr( and concatenate (using paste) for manipulating string /character variables.

 

iris$sp=substr(iris$Species,1,3) –will reduce the famous Iris species into three digits , without losing any analytical value.

The other issue is with missing values, and na.rm=T helps with getting summaries of numeric variables with missing values, we need to further investigate how suitable, na.omit functions are for domains which have large amounts of missing data and need to be treated.

 

 

Text Mining Barack Obama using R #rstats

  • We copy and paste President Barack Obama’s “Yes We Can” speech in a text document and read it in. For a word cloud we need a dataframe with two columns, one with words and the the other with frequency.We read in the transcript from http://www.nytimes.com/2008/01/08/us/politics/08text-obama.html?pagewanted=all&_r=0  and paste in the file located in the local directory- /home/ajay/Desktop/new. Note tm is a powerful package and will read ALL the text documents within the particular folder

library(tm)

library(wordcloud)

txt2=”/home/ajay/Desktop/new”

b=Corpus(DirSource(txt2), readerControl = list(language = “eng”))

> b b b tdm m1 v1 d1 wordcloud(d1$word,d1$freq)

Now it seems we need to remove some of the very commonly occuring words like “the” and “and”. We are not using the standard stopwords in english (the tm package provides that see Chapter 13 Text Mining case studies), as the words “we” and “can” are also included .

> b tdm m1 v1 d1 wordcloud(d1$word,d1$freq)

But let’s see how the wordcloud changes if we remove all English Stopwords.

> b tdm m1 v1 d1 wordcloud(d1$word,d1$freq)

and you can draw your own conclusions from the content of this famous speech based on your political preferences.

Politicians can give interesting speeches but they may be full of simple sounding words…..

Citation-

1. Ingo Feinerer (2012). tm: Text Mining Package. R package version0.5-7.1.

Ingo Feinerer, Kurt Hornik, and David Meyer (2008). Text Mining
Infrastructure in R. Journal of Statistical Software 25/5. URL:
http://www.jstatsoft.org/v25/i05/

2. Ian Fellows (2012). wordcloud: Word Clouds. R package version 2.0.

http://CRAN.R-project.org/package=wordcloud

3. You can see more than 100 of Obama’s speeches at http://obamaspeeches.com/

Quote- numbers dont lie, people do.

.

Moving data between Windows and Ubuntu VMWare partition

I use Windows 7 on my laptop (it came pre-installed) and Ubuntu using the VMWare Player. What are the advantages of using VM Player instead of creating a dual-boot system? Well I can quickly shift from Ubuntu to Windows and bakc again without restarting my computer everytime. Using this approach allows me to utilize software that run only on Windows and run software like Rattle, the R data mining GUI, that are much easier installed on Linux.

However if your statistical software is on your Virtual Disk , and your data is on your Windows disk, you need a way to move data from Windows to Ubuntu.

The solution to this as per Ubuntu forums is –http://communities.vmware.com/thread/55242

Open My Computer, browse to the folder you want to share.  Right-click on the folder, select Properties.  Sharing tab.  Select the radio button to “Share this Folder”.  Change the default generated name if you wish; add a description if you wish.  Click the Permissions button to modify the security settings of what users can read/write to the share.

On the Linux side, it depends on the distro, the shell, and the window manager.

Well Ubuntu makes it really easy to configure the Linux steps to move data within Windows and Linux partitions.

 

NEW UPDATE-

VMmare makes it easy to share between your Windows (host) and Linux (guest) OS

 

Step 1

and step 2

Do this

 

and

Start the Wizard

when you finish the wizard and share a drive or folder- hey where do I see my shared ones-

 

see this folder in Linux- /mnt/hgfs (bingo!)

Hacker HW – Make this folder //mnt/hgfs a shortcut in Places your Ubuntu startup

Hacker Hw 2-

Upload using an anon email your VM dark data to Ubuntu one

Delete VM

Purge using software XX

Reinstall VM and bring back backup

 

Note time to do this

 

 

 

-General Sharing in Windows

 

 

Just open the Network tab in Ubuntu- see screenshots below-

Windows will now ask your Ubuntu user for login-

Once Logged in Windows from within Ubuntu Vmware, this is what happens

You see a tab called “users on “windows username”- pc appear on your Ubuntu Desktop  (see top right of the screenshot)

If you double click it- you see your windows path

You can now just click and drag data between your windows and linux partitions , just the way you do it in Windows .

So based on this- if you want to build  decision trees, artifical neural networks, regression models, and even time series models for zero capital expenditure- you can use both Ubuntu/R without compromising on your IT policy of Windows only in your organization (there is a shortage of Ubuntu trained IT administrators in the enterprise world)

Revised Installation Procedure for utilizing both Ubuntu /R/Rattle data mining on your Windows PC.

Using VMWare to build a free data mining system in R, as well as isolate your analytics system (thus using both Linux and Windows without overburdening your machine)

First Time

  1. http://downloads.vmware.com/d/info/desktop_end_user_computing/vmware_player/4_0Download and Install
  2. http://www.ubuntu.com/download/ubuntu/downloadDownload Only
  3. Create New Virtual Image in VM Ware Player
  4. Applications—–Terminal——sudo apt get-install R (to download and install)
  5.                                          sudo R (to open R)
  6. Once R is opened type this  —-install.packages(rattle)—– This will install rattle
  7. library(rattle) will load Rattle—–
  8. rattle() will open the GUI—-
Getting Data from Host to Guest VM
Next Time
  1. Go to VM Player
  2. Open the VM
  3. sudo R in terminal to bring up R
  4. library(rattle) within R
  5. rattle()
At this point even if you dont know any Linux and dont know any R, you can create data mining models using the Rattle GUI (and time series model using E pack in the R Commander GUI) – What can Rattle do in data mining? See this slideshow-http://www.decisionstats.com/data-mining-with-r-gui-rattle-rstats/
If Google Docs is banned as per your enterprise organizational IT policy of having Windows Explorer only- well you can see these screenshots http://rattle.togaware.com/rattle-screenshots.html

Building a Regression Model in R – Use #Rstats

One of the most commonly used uses of Statistical Software is building models, and that too logistic regression models for propensity in marketing of goods and services.

 

If building a model is what you do-here is a brief easy essay on  how to build a model in R.

1) Packages to be used-

For smaller datasets

use these

  1. CAR Package http://cran.r-project.org/web/packages/car/index.html
  2. GVLMA Package http://cran.r-project.org/web/packages/gvlma/index.html
  3. ROCR Package http://rocr.bioinf.mpi-sb.mpg.de/
  4. Relaimpo Package
  5. DAAG package
  6. MASS package
  7. Bootstrap package
  8. Leaps package

Also see

http://cran.r-project.org/web/packages/rms/index.html or RMS package

rms works with almost any regression model, but it was especially written to work with binary or ordinal logistic regression, Cox regression, accelerated failure time models, ordinary linear models, the Buckley-James model, generalized least squares for serially or spatially correlated observations, generalized linear models, and quantile regression.

For bigger datasets also see Biglm http://cran.r-project.org/web/packages/biglm/index.html and RevoScaleR packages.

http://www.revolutionanalytics.com/products/enterprise-big-data.php

2) Syntax

  1. outp=lm(y~x1+x2+xn,data=dataset) Model Eq
  2. summary(outp) Model Summary
  3. par(mfrow=c(2,2)) + plot(outp) Model Graphs
  4. vif(outp) MultiCollinearity
  5. gvlma(outp) Heteroscedasticity using GVLMA package
  6. outlierTest (outp) for Outliers
  7. predicted(outp) Scoring dataset with scores
  8. anova(outp)
  9. > predict(lm.result,data.frame(conc = newconc), level = 0.9, interval = “confidence”)

 

For a Reference Card -Cheat Sheet see

http://cran.r-project.org/doc/contrib/Ricci-refcard-regression.pdf

3) Also read-

http://cran.r-project.org/web/views/Econometrics.html

http://cran.r-project.org/web/views/Robust.html

 

Use R for Business- Competition worth $ 20,000 #rstats

All you contest junkies, R lovers and general change the world people, here’s a new contest to use R in a business application

http://www.revolutionanalytics.com/news-events/news-room/2011/revolution-analytics-launches-applications-of-r-in-business-contest.php

REVOLUTION ANALYTICS LAUNCHES “APPLICATIONS OF R IN BUSINESS” CONTEST

$20,000 in Prizes for Users Solving Business Problems with R

 

PALO ALTO, Calif. – September 1, 2011 – Revolution Analytics, the leading commercial provider of R software, services and support, today announced the launch of its “Applications of R in Business” contest to demonstrate real-world uses of applying R to business problems. The competition is open to all R users worldwide and submissions will be accepted through October 31. The Grand Prize winner for the best application using R or Revolution R will receive $10,000.

The bonus-prize winner for the best application using features unique to Revolution R Enterprise – such as itsbig-data analytics capabilities or its Web Services API for R – will receive $5,000. A panel of independent judges drawn from the R and business community will select the grand and bonus prize winners. Revolution Analytics will present five honorable mention prize winners each with $1,000.

“We’ve designed this contest to highlight the most interesting use cases of applying R and Revolution R to solving key business problems, such as Big Data,” said Jeff Erhardt, COO of Revolution Analytics. “The ability to process higher-volume datasets will continue to be a critical need and we encourage the submission of applications using large datasets. Our goal is to grow the collection of online materials describing how to use R for business applications so our customers can better leverage Big Analytics to meet their analytical and organizational needs.”

To enter Revolution Analytics’ “Applications of R in Business” competition Continue reading “Use R for Business- Competition worth $ 20,000 #rstats”

Interview Eberhard Miethke and Dr. Mamdouh Refaat, Angoss Software

Here is an interview with Eberhard Miethke and Dr. Mamdouh Refaat, of Angoss Software. Angoss is a global leader in delivering business intelligence software and predictive analytics solutions that help businesses capitalize on their data by uncovering new opportunities to increase sales and profitability and to reduce risk.

Ajay-  Describe your personal journey in software. How can we guide young students to pursue more useful software development than just gaming applications.

 Mamdouh- I started using computers long time ago when they were programmed using punched cards! First in Fortran, then C, later C++, and then the rest. Computers and software were viewed as technical/engineering tools, and that’s why we can still see the heavy technical orientation of command languages such as Unix shells and even in the windows Command shell. However, with the introduction of database systems and Microsoft office apps, it was clear that business will be the primary user and field of application for software. My personal trip in software started with scientific applications, then business and database systems, and finally statistical software – which you can think of it as returning to the more scientific orientation. However, with the wide acceptance of businesses of the application of statistical methods in different fields such as marketing and risk management, it is a fast growing field that in need of a lot of innovation.

Ajay – Angoss makes multiple data mining and analytics products. could you please introduce us to your product portfolio and what specific data analytics need they serve.

a- Attached please find our main product flyers for KnowledgeSTUDIO and KnowledgeSEEKER. We have a 3rd product called “strategy builder” which is an add-on to the decision tree modules. This is also described in the flyer.

(see- Angoss Knowledge Studio Product Guide April2011  and http://www.scribd.com/doc/63176430/Angoss-Knowledge-Seeker-Product-Guide-April2011  )

Ajay-  The trend in analytics is for big data and cloud computing- with hadoop enabling processing of massive data sets on scalable infrastructure. What are your plans for cloud computing, tablet based as well as mobile based computing.

a- This is an area where the plan is still being figured out in all organizations. The current explosion of data collected from mobile phones, text messages, and social websites will need radically new applications that can utilize the data from these sources. Current applications are based on the relational database paradigm designed in the 70’s through the 90’s of the 20th century.

But data sources are generating data in volumes and formats that are challenging this paradigm and will need a set of new tools and possibly programming languages to fit these needs. The cloud computing, tablet based and mobile computing (which are the same thing in my opinion, just different sizes of the device) are also two technologies that have not been explored in analytics yet.

The approach taken so far by most companies, including Angoss, is to rely on new xml-based standards to represent data structures for the particular models. In this case, it is the PMML (predictive modelling mark-up language) standard, in order to allow the interoperability between analytics applications. Standardizing on the representation of models is viewed as the first step in order to allow the implementation of these models to emerging platforms, being that the cloud or mobile, or social networking websites.

The second challenge cited above is the rapidly increasing size of the data to be analyzed. Angoss has already identified this challenge early on and is currently offering in-database analytics drivers for several database engines: Netezza, Teradata and SQL Server.

These drivers allow our analytics products to translate their routines into efficient SQL-based scripts that run in the database engine to exploit its performance as well as the powerful hardware on which it runs. Thus, instead of copying the data to a staging format for analytics, these drivers allow the data to be analyzed “in-place” within the database without moving it.

Thus offering performance, security and integrity. The performance is improved because of the use of the well tuned database engines running on powerful hardware.

Extra security is achieved by not copying the data to other platforms, which could be less secure. And finally, the integrity of the results are vastly improved by making sure that the results are always obtained by analyzing the up-to-date data residing in the database rather than an older copy of the data which could be obsolete by the time the analysis is concluded.

Ajay- What are the principal competing products to your offerings, and what makes your products special or differentiated in value to them (for each customer segment).

a- There are two major players in today’s market that we usually encounter as competitors, they are: SAS and IBM.

SAS offers a data mining workbench in the form of SAS Enterprise Miner, which is closely tied to SAS data mining methodology known as SEMMA.

On the other hand, IBM has recently acquired SPSS, which offered its Clementine data mining software. IBM has now rebranded Clementine as IBM SPSS Modeller.

In comparison to these products, our KnowledgeSTUDIO and KnowledgeSEEKER offer three main advantages: ease of use; affordability; and ease of integration into existing BI environments.

Angoss products were designed to look-and-feel-like popular Microsoft office applications. This makes the learning curve indeed very steep. Typically, an intermediate level analyst needs only 2-3 days of training to become proficient in the use of the software with all its advanced features.

Another important feature of Angoss software products is their integration with SAS/base product, and SQL-based database engines. All predictive models generated by Angoss can be automatically translated to SAS and SQL scripts. This allows the generation of scoring code for these common platforms. While the software interface simplifies all the tasks to allow business users to take advantage of the value added by predictive models, the software includes advanced options to allow experienced statisticians to fine-tune their models by adjusting all model parameters as needed.

In addition, Angoss offers a unique product called StrategyBuilder, which allows the analyst to add key performance indicators (KPI’s) to predictive models. KPI’s such as profitability, market share, and loyalty are usually required to be calculated in conjunction with any sales and marketing campaign. Therefore, StrategyBuilder was designed to integrate such KPI’s with the results of a predictive model in order to render the appropriate treatment for each customer segment. These results are all integrated into a deployment strategy that can also be translated into an execution code in SQL or SAS.

The above competitive features offered by the software products of Angoss is behind its success in serving over 4000 users from over 500 clients worldwide.

Ajay -Describe a major case study where using Angoss software helped save a big amount of revenue/costs by innovative data mining.

a-Rogers Telecommunications Inc. is one of the largest Canadian telecommunications providers, serving over 8.5 million customers and a revenue of 11.1 Billion Canadian Dollars (2009). In 2008, Rogers engaged Angoss in order to help with the problem of ballooning accounts receivable for a period of 18 months.

The problem was approached by improving the efficiency of the call centre serving the collections process by a set of predictive models. The first set of models were designed to find accounts likely to default ahead of time in order to take preventative measures. A second set of models were designed to optimize the call centre resources to focus on delinquent accounts likely to pay back most of the outstanding balance. Accounts that were identified as not likely to pack quickly were good candidates for “Early-out” treatment, by forwarding them directly to collection agencies. Angoss hosted Rogers’ data and provided on a regular interval the lists of accounts for each treatment to be deployed by the call centre dialler. As a result of this Rogers estimated an improvement of 10% of the collected sums.

Biography-

Mamdouh has been active in consulting, research, and training in various areas of information technology and software development for the last 20 years. He has worked on numerous projects with major organizations in North America and Europe in the areas of data mining, business analytics, business analysis, and engineering analysis. He has held several consulting positions for solution providers including Predict AG in Basel, Switzerland, and as ANGOSS Corp. Mamdouh is the Director of Professional services for EMEA region of ANGOSS Software. Mamdouh received his PhD in engineering from the University of Toronto and his MBA from the University of Leeds, UK.

Mamdouh is the author of:

"Credit Risk Scorecards: Development and Implmentation using SAS"
 "Data Preparation for Data Mining Using SAS",
 (The Morgan Kaufmann Series in Data Management Systems) (Paperback)
 and co-author of
 "Data Mining: Know it all",Morgan Kaufmann



Eberhard Miethke  works as a senior sales executive for Angoss

 

About Angoss-

Angoss is a global leader in delivering business intelligence software and predictive analytics to businesses looking to improve performance across sales, marketing and risk. With a suite of desktop, client-server and in-database software products and Software-as-a-Service solutions, Angoss delivers powerful approaches to turn information into actionable business decisions and competitive advantage.

Angoss software products and solutions are user-friendly and agile, making predictive analytics accessible and easy to use.