Top ten RRReasons R is bad for you?

 


 

R is a programming language for statistical computing, based out of www.r-project.org

R is bad for you because –

1) It is slower with bigger datasets than the SPSS and SAS languages. If you use bigger datasets, you should either consider more hardware, or wait for some of the ODBC-connect packages.

2) It needs more time to learn than the SAS language. Much more time, to learn how to do much more.

3) R programmers are paid less than SAS programmers. They prefer it that way. It equates the satisfaction of creating a package, developed with a worldwide community, with the satisfaction of using a package and earning much more money per hour.

4) It forces you to learn the exact details of what you are doing, due to its object-oriented structure. Thus you either get no answer or an exact answer. Your customer pays you by the hour, not by the correct answers.

5) You cannot push a couple of buttons, or refer to a list of the ten most commonly used commands, to finish the project.

6) It is free. And open for all. It is socialism expressed in code. Some of the packages are built by university professors. It is free. Free is bad. Who pays for the mortgages of the software programmers if all software were free? Who pays for the Friday picnics? Who pays for the Good Night cruises?

7) It is free. Your organization will not commend you for saving them money: they will question why you did not recommend this before. And why you approved all those packages that expire in 2011. R is fReeeeee. Customers feel good while spending money. The more software budgets you approve, the more your salary is. R thReatens all that.

8) It is impossible to install a package you do not need or want. There is no one calling you on the phone to consider one more package or solution. R can make you lonely.

9) R mostly uses the command line. The command line is from the Seventies. Or the Eighties. The GUIs Rcmdr and Rattle are there, but still…

10) R forces you to learn new stuff by the month. You prefer to only earn by the month. Till the day your job gets offshored…

Written by an R user, in the English language

( which fortunately was not copyrighted otherwise we would be paying Britain for each word)

The above post was reprinted by request.

R – Refcards and Basic I/O Operations

While working with a large number of files for data processing, I used the following R commands. Given that everyone needs to split, merge, and append data, here is some code for splitting data based on a parameter, appending data, and merging data.

Splitting Data Based on a Parameter.

The following divides the data into two subsets: rows that contain "Male", and everything else, in different datasets.

Input and Subset

Note that the read.table command assigns the dataset the name x in the R environment from the file reference (path denoted by ….)

x <- read.table(....)                      # read the file into x (path elided)
rowIndx <- grep("Male", x$col)             # row numbers where col contains "Male"
write.table(x[rowIndx,], file="match")     # the rows that matched
write.table(x[-rowIndx,], file="nomatch")  # everything else


Suppose we need to divide the dataset into multiple data sets.

X17 <- subset(X, REGION == 17)

This is preferred to the technique -

attach(X)
X17 = X[REGION == 17,]
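When you need one subset per value rather than a single region, base R's split() does this in one call. A minimal sketch; the small X data frame here is invented for illustration, since the real one comes from read.table above:

```r
# Stand-in data frame; the real X would be read from disk with read.table
X <- data.frame(REGION = c(17, 17, 23), SALES = c(10, 20, 30))

# split() returns a list with one data frame per distinct REGION value
by_region <- split(X, X$REGION)

X17 <- by_region[["17"]]   # the same rows as subset(X, REGION == 17)
nrow(X17)                  # 2
```

This avoids attach() entirely, which is worth doing anyway, since attach() can silently mask variables already in the workspace.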

Output

To write the files back out to the Windows environment, you can use -

write.table(x,file="",row.names=TRUE,col.names=TRUE,sep=" ")

Append

Let's say you have a large number of data files (say, csv files) that you need to append (assuming the files have the same structure) after performing basic operations on them.

>setwd("C:\\Documents and Settings\\admin\\My Documents\\Data")

Note this changes the working folder to the folder you want it to be; note the double backslashes, which are needed to define the path.

>list.files(path = ".", pattern = NULL, all.files = FALSE, full.names = FALSE,recursive = FALSE, ignore.case = FALSE)

The R output would be something like below

[1] "calk.csv" "call.csv"
[3] "calm.csv" "caln.csv"
[5] "calo.csv" "calp.csv"

For appending one file to another repeatedly (say, ten times) you can use the command

file.append("A", rep("B", 10))   # appends the contents of file B to file A, ten times
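file.append() glues files together byte-for-byte, headers and all. If you want the append to happen at the data-frame level, so that each csv's header is read once and you can run your basic operations in between, a common idiom is lapply() plus do.call(rbind, ...). A sketch, with throwaway files created in a temp folder, since the cal*.csv files above are not included here:

```r
# Create three small CSV files of identical structure, as stand-ins
dir <- file.path(tempdir(), "caldata")
dir.create(dir, showWarnings = FALSE)
for (f in c("calk.csv", "call.csv", "calm.csv")) {
  write.csv(data.frame(id = 1:2, src = f), file.path(dir, f), row.names = FALSE)
}

# Read every csv in the folder and append them row-wise
files <- list.files(path = dir, pattern = "\\.csv$", full.names = TRUE)
combined <- do.call(rbind, lapply(files, read.csv))
nrow(combined)   # 6 rows: 2 from each of the 3 files
```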

For refcards on learning R, the best ones are –

http://cran.r-project.org/doc/contrib/Shortrefcard.pdf

and

http://disinterested.googlepages.com/RQuickReference.pdf

R Graphics

A great book on R graphics is here. It's especially useful for people who are new to R, or who mainly use graphical functions.

http://www.stat.auckland.ac.nz/~paul/RGraphics/chapter2.html


This is a good textbook until the November 2008 release of the new edition of Bob Muenchen's R for SAS and SPSS Users (http://www.springer.com/statistics/computational/book/978-0-387-09417-5 )

The existing free copy is at http://oit.utk.edu/scc/RforSAS&SPSSusers.pdf

 


Modified Ohri Framework

 

Some time back, I had created a framework for data mining through on-demand cloud computing. This is the next version. It is free to use for all, with only authorship credit back to me.
 
It tries to do away with fixed server and desktop costs, and with the fixed software costs of the data mining, stats, and analytics software that carries huge per-CPU annual license fees.

 

The modified Ohri Framework tries to mash up the following

 

0) HTTPS rather than HTTP

1) Encryption and Compression Software for data transfer (like PGP)

2) Open source stats package like R on a cloud computer (like Amazon EC2 or RightScale with Hadoop)

3) GUI to make it easy to use (like Rattle GUI and PMML Package)

4) A Data Mining Open Source Package (like Rapid Miner or Splunk)

5) RIA Graphics (like Silverlight )

6) Secure Output to cloud computing devices (like Google Docs)

7) Billing priced at simple cost plus X% (where simple cost can be, say, 0.85 cents per instance-hour or more depending on usage, and X should not be more than 15%)

8) Open source sharing of all code to ensure community sandboxing

 

The intention is to move from the fixed computing costs of servers and desktops to normal PCs (Ubuntu Linux) with browser (Firefox or Internet Explorer) access to secure data mining on demand.

On-tap data mining for anyone in the world, without the big license purchases/renewals (software expenses) or big hardware purchases (which become obsolete in 2-3 years).

 

 

Parsing XML files easily

To parse an XML (or KML or PMML) file easily without any complicated software, here is a piece of code that fits right into your Excel sheet.

Just import this file into Excel, and then use the function getElement, after pasting the XML code into one cell.

It simply reads the xml/kml code as a text string. Paste all the xml code into one cell, supply the start and end markers (for example start=<constraints> and end=</constraints>) to get the value of constraints in the xml code, and read the value into another cell using the getElement function.

Here's the code if you ever need it. Just paste it into the VBA editor of Excel to create the getElement function (if it is not there already), or simply import the file in the link above.

Attribute VB_Name = "Module1"
Public Function getElement(xml As String, start As String, finish As String)
    For i = 1 To Len(xml)
        If Mid(xml, i, Len(start)) = start Then
            For j = i + Len(start) To Len(xml)
                If Mid(xml, j, Len(finish)) = finish Then
                    getElement = Mid(xml, i + Len(start), j - i - Len(start))
                    Exit Function
                End If
            Next j
        End If
    Next i
End Function
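For readers who would rather stay in R, the same between-two-markers extraction can be sketched with regexpr() and substring(). This getElement is my analogue of the VBA function above, not part of any package:

```r
# R analogue of the VBA getElement(): text between two fixed markers
getElement <- function(xml, start, finish) {
  i <- regexpr(start, xml, fixed = TRUE)    # position of the start marker
  if (i == -1) return("")                   # start marker not found
  rest <- substring(xml, i + nchar(start))  # everything after the start marker
  j <- regexpr(finish, rest, fixed = TRUE)  # position of the end marker
  if (j == -1) return("")                   # end marker not found
  substring(rest, 1, j - 1)
}

getElement("<a><constraints>42</constraints></a>", "<constraints>", "</constraints>")
# [1] "42"
```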

For using the R package XML for parsing XML, reference this site –

http://www.omegahat.org/RSXML/Overview.html

or this thread from R -Help

> Lines <- '
+ <root>
+   <data loc="1">
+     <val i="t1"> 22 </val>
+     <val i="t2"> 45 </val>
+   </data>
+   <data loc="2">
+     <val i="t1"> 44 </val>
+     <val i="t2"> 11 </val>
+   </data>
+ </root>
+ '
>
> library(XML)
> doc <- xmlTreeParse(Lines, asText = TRUE, trim = TRUE, useInternalNodes = TRUE)
> root <- xmlRoot(doc)
>
> data1 <- getNodeSet(root, "//data")[[1]]
> xmlValue(getNodeSet(data1, "//val")[[1]])
[1] " 22 "
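The same XML package can pull every <val> out in one step with its xpathSApply() helper, which applies a function (here xmlValue) to each node matched by an XPath expression:

```r
library(XML)

Lines <- '<root>
  <data loc="1"> <val i="t1"> 22 </val> <val i="t2"> 45 </val> </data>
  <data loc="2"> <val i="t1"> 44 </val> <val i="t2"> 11 </val> </data>
</root>'

doc <- xmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)

# One numeric vector with all four values, in document order
vals <- as.numeric(xpathSApply(doc, "//val", xmlValue))
vals   # 22 45 44 11
```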

Upcoming Book

The great Bob Muenchen is coming out with an updated version of his R for SAS and SPSS Users book in September 2008, to help people learn R if they have only used SAS or SPSS before. We first covered the earlier edition of the book here. The new edition adds sections on R Commander, Rattle, and JGR, as well as two chapters on graphics and one on basic stats. The author runs the examples and walks through them, explaining each step, especially where the results differ from SAS and SPSS. Check the new book here –

http://www.amazon.com/SAS-SPSS-Users-Statistics-Computing/dp/0387094172/ref=pd_bbs_sr_1?ie=UTF8&s=books&qid=1217456813&sr=8-1

The Ohri Framework – Data Mining on Demand

The Ohri Framework tries to create an economic alternative to proprietary data mining software by giving more value to the customer, using the open source statistical package R with the GUI Rattle, hosted in a cloud computing environment.

It is based on the following assumptions-

1) R is relatively inefficient at processing bigger files on the same desktop configuration as other software like SAS.

2) R has a steep learning curve, hence the need for the GUI Rattle.

3) R's enhanced need for computing resources is best met by an on-demand cloud computing environment. This lets R scale up to whatever processing power it needs. Mainstream data mining software charges by CPU count for servers, and is much more expensive due to software costs alone.
