Product Review – Revolution R 5.0

So I got the email from Revolution R. Version 5.0 is ready for download, and unlike the half-hearted attempts by many software companies, they make it easy for academics and researchers to get their free copy. Free as in speech and free as in beer.

Some thoughts-

1) R's memory problem is now an issue of marketing and branding. Revolution Analytics has bridged this gap technically, and beautifully, and I quote from their documentation-

The primary advantage 64-bit architectures bring to R is an increase in the amount of memory available to a given R process.
The first benefit of that increase is an increase in the size of data objects you can create. For example, on most 32-bit versions of R, the largest data object you can create is roughly 3GB; attempts to create 4GB objects result in errors with the message “cannot allocate vector of length xxxx.”
On 64-bit versions of R, you can generally create larger data objects, up to R's current hard limit of 2^31 − 1 elements in a vector (about 2 billion elements). The functions memory.size and memory.limit help you manage the memory used by Windows versions of R.
In 64-bit Revolution R Enterprise, R sets the memory limit by default to the amount of physical RAM minus half a gigabyte, so that, for example, on a machine with 8GB of RAM, the default memory limit is 7.5GB:
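For reference, here is a quick sketch of those functions at the Windows R console (the size value is just an illustration; results vary by machine)-

memory.size() # memory currently in use by this R session, in MB (Windows only)
memory.size(max = TRUE) # maximum memory obtained from the OS so far
memory.limit() # the current memory limit, in MB
memory.limit(size = 7500) # raise the limit, e.g. towards 7.5 GB on an 8 GB machine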

2) The User Interface is best shown in the slideshow at https://docs.google.com/presentation/pub?id=1V_G7r0aBR3I5SktSOenhnhuqkHThne6fMxly_-4i8Ag&start=false&loop=false&delayms=3000

(but I am still hoping for the GUI Revolution Analytics promised us for Christmas)

3) The partnership with Microsoft HPC is quite awesome given Microsoft's track record in enterprise software penetration, but I am also interested in knowing more about the Oracle version of R and what it will do there.

App to App Porting

I often wonder why bright, intelligent software programmers go out of their way to write turgid and lengthy documentation, do not make step-by-step screenshots/slides for tutorials, and practically force everyone to reinvent the wheel every time they create a new platform.

Top of my wish list for 2012-

1) Better GUI for APP CREATION-

example- a GUI utility to create Chrome apps, something similar to the Android App Inventor at http://www.appinventorbeta.com/about/

2)  Automated Porting or Translation-

An automated appspot app for reading in an iOS app (or iPhone app) and churning out the necessary Android app code. This is similar to translating blogs from one blogging platform to another using Python at http://code.google.com/p/google-blog-converters-appengine/

But the woefully underpowered http://wordpress2blogger.appspot.com/ currently allows only files of less than 1 MB, while WordPress itself allows 15 MB export files.

3) Better interaction between cloud and desktop apps

example- (Google Docs and LibreOffice), or webcams to (Google Hangouts and Google Voice/YouTube)

Are we there yet? Not appy enough!

Preview- Google Cloud SQL

From- http://code.google.com/apis/sql/

What is Google Cloud SQL?

Google Cloud SQL is a web service that allows you to create, configure, and use relational databases with your App Engine applications. It is a fully managed service that maintains, manages, and administers your databases, allowing you to focus on your applications and services.

By offering the capabilities of a MySQL database, the service enables you to easily move your data, applications, and services into and out of the cloud. This allows for high data portability and helps in faster time-to-market because you can quickly leverage your existing database (using JDBC and/or DB-API) in your App Engine application.

Here is where you can get an invite to the beta-only Google Cloud SQL.

Sign up for Limited Preview

Google Cloud SQL is available to a limited number of users. To sign up for the service:

  1. Visit the Google APIs Console. The console opens the All services pane.
  2. Find the SQL Service line in the Services table and click Request access…
  3. Fill out the enrollment form.
  4. Our team will review your enrollment information and respond by email to the address associated with your Google Account.
  5. Follow the link in the email to view the Terms of Service. Please read these carefully before accepting.
  6. Sign up for the google-cloud-sql-announce group to receive important announcements and product news. (NOTE- Members: 384)
and after all that violence and double talk, a walk in the clouds with SQL.
1. There are three kinds of instances in the beta view
2. Wait for the instance to be created. Note- the design of the interface up to this point is much better than Amazon's.
Note you need to have an appspot application from Google Apps, and you can choose between the Python and Java versions. Quite clearly there is a play for other languages too. I think Go is also supported.
3. You can import your data from your Google Storage bucket
4. I am not that hot at coding, or maybe the interface was too pretty. Anyway- the log tells me that the import of the text file from Google Storage to Google Cloud SQL has failed.
5. Incidentally, the Google Cloud Storage interface is also much better than the Amazon GUI for transferring data. Note I was using the classic statistical dataset, the Boston Housing Data, as the test case.
6. The SQL prompt is the weakest part of the design of the interface. There is no query builder, and the SELECT FROM WHERE prompt is slightly amusing/insulting. I mean, guys, throw in a fully fledged GUI query builder similar to MySQL Workbench rather than a pretty white command prompt.
7. You can also export your data back to your Google Storage bucket 
These are early days, and I am trying to see if there is a play for some cloud kind of ODBC action between R, the Prediction API, and Cloud SQL… so try it out yourself at http://code.google.com/apis/sql/ and see if there is any juice you can build here.
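For what it is worth, here is the kind of connection I have in mind, as a minimal untested sketch- it assumes the instance is reachable over the MySQL wire protocol via the RMySQL package (the beta officially exposes JDBC and DB-API from App Engine, so this may not work yet), and the host, user, and table names below are hypothetical.

library(RMySQL) # assumes a MySQL-compatible endpoint is reachable from R
con <- dbConnect(MySQL(), host = "your-instance-host", user = "your-user", password = "your-password", dbname = "test")
dbGetQuery(con, "SELECT * FROM boston LIMIT 5") # e.g. the Boston Housing test table
dbDisconnect(con)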

Moving data between Windows and Ubuntu VMWare partition

I use Windows 7 on my laptop (it came pre-installed) and Ubuntu using the VMware Player. What are the advantages of using VM Player instead of creating a dual-boot system? Well, I can quickly shift from Ubuntu to Windows and back again without restarting my computer every time. Using this approach allows me to utilize software that runs only on Windows, and to run software like Rattle, the R data mining GUI, that is much easier to install on Linux.

However, if your statistical software is on your virtual disk and your data is on your Windows disk, you need a way to move data from Windows to Ubuntu.

The solution to this, as per the VMware forums (http://communities.vmware.com/thread/55242), is-

Open My Computer, browse to the folder you want to share.  Right-click on the folder, select Properties.  Sharing tab.  Select the radio button to “Share this Folder”.  Change the default generated name if you wish; add a description if you wish.  Click the Permissions button to modify the security settings of what users can read/write to the share.

On the Linux side, it depends on the distro, the shell, and the window manager.

Well, Ubuntu makes it really easy to configure the Linux steps to move data between the Windows and Linux partitions.

NEW UPDATE-

VMware makes it easy to share between your Windows (host) and Linux (guest) OS.

(Screenshots: Step 1 and Step 2- starting the sharing wizard in VMware Player.)

When you finish the wizard and share a drive or folder- hey, where do I see my shared ones?

See this folder in Linux- /mnt/hgfs (bingo!)

Hacker HW – Make this folder /mnt/hgfs a shortcut in Places in your Ubuntu startup

Hacker HW 2-

Upload your VM dark data, using an anon email, to Ubuntu One. Delete the VM. Purge it using software XX. Reinstall the VM and bring back the backup. Note the time taken to do this.

-General Sharing in Windows

Just open the Network tab in Ubuntu- see the screenshots below-

Windows will now ask your Ubuntu user for a login-

Once logged in to Windows from within Ubuntu VMware, this is what happens-

You see a tab called "users on 'windows username'-pc" appear on your Ubuntu desktop (see the top right of the screenshot).

If you double-click it, you see your Windows path.

You can now just click and drag data between your Windows and Linux partitions, just the way you do it in Windows.

So based on this- if you want to build decision trees, artificial neural networks, regression models, and even time series models for zero capital expenditure- you can use Ubuntu/R without compromising on your organization's Windows-only IT policy (there is a shortage of Ubuntu-trained IT administrators in the enterprise world).

Revised Installation Procedure for utilizing Ubuntu/R/Rattle data mining on your Windows PC.

Using VMware to build a free data mining system in R, as well as isolate your analytics system (thus using both Linux and Windows without overburdening your machine)

First Time

  1. Download and install VMware Player from http://downloads.vmware.com/d/info/desktop_end_user_computing/vmware_player/4_0
  2. Download (only) Ubuntu from http://www.ubuntu.com/download/ubuntu/download
  3. Create a new virtual image in VMware Player
  4. Applications—–Terminal—–sudo apt-get install r-base (to download and install R)
  5. sudo R (to open R)
  6. Once R is opened, type install.packages("rattle") - this will install rattle
  7. library(rattle) will load Rattle
  8. rattle() will open the GUI (see the consolidated sketch after this list)
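(Steps 4-8 consolidated, as a sketch at the R prompt; note that rattle depends on RGtk2, so the GTK libraries need to be present on Ubuntu.)

install.packages("rattle", dependencies = TRUE) # install rattle plus its R dependencies
library(rattle) # load the package
rattle() # launch the GUI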
Getting Data from Host to Guest VM
Next Time
  1. Go to VM Player
  2. Open the VM
  3. sudo R in terminal to bring up R
  4. library(rattle) within R
  5. rattle()
At this point, even if you don't know any Linux and don't know any R, you can create data mining models using the Rattle GUI (and time series models using the epack plugin in the R Commander GUI). What can Rattle do in data mining? See this slideshow- http://www.decisionstats.com/data-mining-with-r-gui-rattle-rstats/
If Google Docs is banned as per your enterprise organizational IT policy, well, you can see these screenshots instead- http://rattle.togaware.com/rattle-screenshots.html

Data Mining with R GUI -Rattle #Rstats

Why is Rattle my favorite R package?
Because it allows data mining in a very nice interface.
Complicated software need not have complicated interfaces.
Have a look-

(Note- download rattle from http://rattle.togaware.com)


#rstats -Basic Data Manipulation using R

Continuing my series on basic data manipulation using R, for people who know analytics and are new to R.
1 Keeping only some variables

Using subset we can keep only the variables we want-

Sitka89 <- subset(Sitka89, select=c(size,Time,treat))

This will keep only the variables we have selected (size, Time, treat).

2 Dropping some variables

Harman23.cor$cov.arm.span <- NULL
This deletes the variable named cov.arm.span in the dataset Harman23.cor

3 Keeping records based on character condition

Titanic.df <- as.data.frame(Titanic) # Titanic is a table, so convert it to a data frame first

Titanic.sub1 <- subset(Titanic.df, Sex=="Male")

Note the double equal-to sign.
4 Keeping records based on date/time condition

subset(DF, as.Date(Date) >= '2009-09-02' & as.Date(Date) <= '2009-09-04')

5 Converting Date Time Formats into other formats

If the variable dob is "01/04/1977", then the following will convert it into a date object-

z=strptime(dob,"%d/%m/%Y")

and if the same date is written as "01Apr1977"-

z=strptime(dob,"%d%b%Y")

6 Difference in Date Time Values and Using Current Time

The difftime function helps in creating differences in two date time variables.

difftime(time1, time2, units='secs')

or

difftime(time1, time2, tz = "", units = c("auto", "secs", "mins", "hours", "days", "weeks"))

For current system date time values you can use

Sys.time()

Sys.Date()

This value can be put in the difftime function shown above to calculate age or time elapsed.
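For example, combining sections 5 and 6 to compute an age in days-

dob <- strptime("01/04/1977", "%d/%m/%Y") # the date of birth from section 5
difftime(Sys.time(), dob, units = "days") # time elapsed since then, in days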

7 Keeping records based on numerical condition

Titanic.sub1 <- subset(as.data.frame(Titanic), Freq > 37) # again converting the Titanic table to a data frame first

For enhanced usage-
you can also use the R Commander GUI with the submenu Data > Active Dataset

8 Sorting Data

Sorting a data frame in ascending order by a variable (base R uses order(), since sort() does not work directly on data frames)-

AggregatedData <- AggregatedData[order(AggregatedData$Package), ]

Sorting a data frame in descending order by a (numeric) variable-

AggregatedData <- AggregatedData[order(-AggregatedData$Installed), ]

(The doBy package mentioned below also provides orderBy(~Package, data = AggregatedData) for the same purpose.)

9 Transforming a Dataset Structure around a single variable

Using the reshape2 package, we can use the melt and acast functions-

library("reshape2")

tDat.m<- melt(tDat)

tDatCast<- acast(tDat.m,Subject~Item)

If we choose not to use the reshape2 package, we can use the default reshape function in base R. Please do note this takes longer to process for bigger datasets.

df.wide <- reshape(df, idvar="Subject", timevar="Item", direction="wide")

10 Type in Data

Using the scan() function we can type in data at the console; by default it returns a numeric vector.
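For example, at the R console (a blank line ends the input)-

x <- scan() # type numbers separated by spaces or newlines, then a blank line
x # the typed values are now stored in the numeric vector x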

11 Using diff for Lags and cumsum for Cumulative Sums

We can use the diff function to calculate the difference between two successive values of a variable, and the cumsum function to give cumulative sums. Note that R is case-sensitive, so these must be lowercase-

diff(Dataset$X)

cumsum(Dataset$X)

> x=rnorm(10,20) # This gives 10 normally distributed random numbers with mean 20 (and standard deviation 1)

> x

[1] 20.76078 19.21374 18.28483 20.18920 21.65696 19.54178 18.90592 20.67585

[9] 20.02222 18.99311

> diff(x)

[1] -1.5470415 -0.9289122 1.9043664 1.4677589 -2.1151783 -0.6358585 1.7699296

[8] -0.6536232 -1.0291181

> cumsum(x)

[1] 20.76078 39.97453 58.25936 78.44855 100.10551 119.64728 138.55320

[8] 159.22905 179.25128 198.24438

> diff(x,2) # Here 2 is the lag. The full signature is diff(x, lag = 1, differences = 1, ...), where differences is the order of differencing

[1] -2.4759536 0.9754542 3.3721252 -0.6474195 -2.7510368 1.1340711 1.1163064

[8] -1.6827413


12 Merging Data

Deducer GUI makes it much simpler to merge datasets. The simplest syntax for a merge statement is

totalDataframeZ <- merge(dataframeX,dataframeY,by=c("AccountId","Region"))

13 Aggregating and group processing of a variable

We can use multiple methods for aggregating and by-group processing of variables.
Two functions we explore here are aggregate and tapply.

Referring to the R Online Manual at
[http://stat.ethz.ch/R-manual/R-patched/library/stats/html/aggregate.html]-

## Compute the averages for the variables in 'state.x77', grouped
## according to the region (Northeast, South, North Central, West) that
## each state belongs to
aggregate(state.x77, list(Region = state.region), mean)

Using tapply-

## tapply(Summary Variable, Group Variable, Function)
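For example, computing the mean Income from state.x77 by region-

tapply(state.x77[, "Income"], state.region, mean) # mean income grouped by region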

Reference

[http://www.ats.ucla.edu/stat/r/library/advanced_function_r.htm#tapply]

We can also use specialized packages for data manipulation.

For additional by-group processing you can see the doBy package as well as the plyr package. doBy contains a variety of utilities including:
1) facilities for groupwise computation of summary statistics and other facilities for working with grouped data,
2) general linear contrasts and LSMEANS (least-squares means, also known as population means),
3) HTML report for automatic generation of an HTML file from an R script with a minimum of markup,
4) various other utilities. It is available at [http://cran.r-project.org/web/packages/doBy/index.html].

Available at [http://cran.r-project.org/web/packages/plyr/index.html], plyr is a set of tools that solves a common set of problems: you need to break a big problem down into manageable pieces, operate on each piece, and then put all the pieces back together. For example, you might want to fit a model to each spatial location or time point in your study, summarise data by panels, or collapse high-dimensional arrays to simpler summary statistics.
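A minimal sketch of that split-apply-combine idea, using plyr's ddply on the built-in iris data-

library(plyr)
# split iris by Species, summarise each piece, and recombine into one data frame
ddply(iris, .(Species), summarise, mean.sepal.length = mean(Sepal.Length))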

Using Two Operating Systems for RATTLE, #Rstats Data Mining GUI

Using a virtual partition is slightly better than using a dual-boot system. That is because you can keep the specialized operating system (usually Linux) within the main operating system (usually Windows), browse and alternate between the two operating systems using a simple command, and utilize the advantages of both.

Also, you can create project-specific discs for enhanced security.

In my (limited) Mac experience, the comparisons of each operating system are-

1) Mac- Both robust and aesthetically designed, but the higher price and hardware lock-in for the Mac remain disadvantages. Also, many stats and analytical software packages just won't work on the Mac.

2) Windows- Cheaper than the Mac and easier to use than Linux. Also has the most compatibility with applications (usually, when not crashing).

3) Linux- The lightest and most customizable OS in the class, free to use, and with many lite versions for newbies. Not compatible with mainstream corporate IT infrastructure as of 2011.

I personally use VMware Player for creating the virtual disk (much more convenient than the wubi.exe method), from http://www.vmware.com/support/product-support/player/ (and downloadable from http://downloads.vmware.com/d/info/desktop_downloads/vmware_player/3_0).

That enables me to use Ubuntu as the alternative OS, keeping my Windows 7 for some Windows-specific applications. For software like Rattle, the R data mining GUI, it helps to use two operating systems, in view of the difficulties with GTK+.

Installing Rattle on Windows 7 is a major pain thanks to backward compatibility and version issues with GTK, but it installs on Ubuntu like a breeze, and it is very convenient to switch between the two operating systems.

Download Rattle from http://rattle.togaware.com/ and test it on the dual-OS arrangement to see for yourself.