
Blog Update

Some changes at Decisionstats-

1) We are back at Decisionstats.com and Decisionstats.wordpress.com will point to that as well. The SEO effects would be interesting and so would be the Instant Pagerank or LinkRank or whatever Coffee/Percolator they use in Cali to index the site.

2) AsterData is no longer a sponsor- but Predictive Analytics Conference is. Welcome PAWS! I have been a blog partner to PAWS ever since it began- and it’s a great marketing fit. Expect to see a lot of exclusive content and interviews from great speakers at PAWS.

3) The Feedblitz newsletter (now at 404 subscribers) is now a weekly subscription that sends one big email rather than lots of emails through the week. This is because my blogging frequency is moving up as I collect material for a new book on business analytics that I would probably release in 2011 (if all goes well, touchwood). The LinkedIn group will get a weekly update announcement. If you are connected to Decisionstats on AnalyticBridge, I will soon try to find a way to update the whole post automatically using RSS and Ning.com. Or not. Depends.

4) R continues to be a bigger focus. So will SPSS and maybe JMP. Newer software, or older software that changes more rapidly, will get more coverage. Generally a particular software package is covered if it has newer features, or an interesting techie conference, or it gets sued.

5) I will occasionally write a poem or post a video once a week randomly to prove geeks and nerds and analysts can have fun (much more fun actually, don't we?)

Thanks for reading this. Sept 2010 was the best ever for Decisionstats.com – we crossed 15,000 + visitors and thanks for that again! I promise to bore you less and less as we grow old together on the blog 😉

Economic: Indian Caste System -Simplification

I am often asked by Western and other non-Indian people about the caste system. It trips me up a lot trying to explain its complexity, necessity and current scenario given the history.

Here is an effort. The Indian/Hindu caste system was primarily an economic system to divide labor. In the original Manusmriti, named after King Manu, it was flexible.

The son of a blue collar worker could become a warrior if he was brave, etc.

A couple of centuries later – the top castes primarily the priests decided to make it rigid. No more social intermingling or marriage between castes, and no more migration of occupation regardless of merit.

This led to a lot of lower caste people leaving Hinduism to join religions like Islam (post 1000 AD, Muslim invasions and Mughal rule) and Christianity (post the arrival of the English).

Post 1947, many of the “lower castes” preferred to remain within Hinduism but adopted Buddhism as their primary worship mechanism. Also, India’s leaders in the 1940s, many of whom were educated in the UK as lawyers (including Mahatma Gandhi, Subhash Chandra Bose, Jawahar Lal Nehru), decided this system had weakened the nation state and divided the energies of India, besides being obviously inhumane and degrading.

The Constitution of India was shepherded in 1950 by an assembly led by Dr. B R Ambedkar, one of the very first educated lower castes (also called Harijan, after Mahatma Gandhi’s name for them, literally Hari-Jan, people of the Lord). That Constitution endures as India remains the finest example of a Democracy in the non Western world.

The Indian constitution established 15 % reservation in Government jobs and in educational institutes at the college and masters level for the lowest and most educationally backward castes (hence called scheduled castes), and 7.5 % reservation in Government jobs for tribal people (hence called scheduled tribes). The provision is renewed every 10 years. Think of it as constitutionally bound affirmative action.

In 1990, another 27 % of jobs and educational seats were reserved for castes that were socially okay but educationally backward. This caused some riots, delays and political actions, but was finally implemented by 2007.

Opponents of the new affirmative action say that this is like doing two wrongs to make a right. Supporters say data proves that reservation has led to social advancement (especially in the State of Tamil Nadu). Rollback of the new system is a political impossibility thanks to unity among hitherto repressed classes.

As an upper caste Hindu (embarrassingly enough my caste is both a warrior and a kingly royal caste, which gives me zero benefit in 2010 AD)……..

In God we Trust..All others must bring Data.

Unfortunately, when it comes to politics the same data is either hidden, partially hidden, or interpreted in different ways especially with regards to projecting sampling error or decisions.

Phew…!! That was an analytical layman definition of the Indian Caste System over 2000 years.

Note- The Indian soldier caste is Kshatriyas not Kshatritas..

Using JMP 9 and R together

An interesting blog post at http://blogs.sas.com/jmp/index.php?/archives/298-JMP-Into-R!.html on using the new JMP 9 with R, and quite possibly using SAS as well.

Example Code-

Here’s the R integration JSL code used to run the bootstrap

rconn = R Connect();
rconn << Submit("\[
# Load Boot package
library(boot)

# Statistic to bootstrap: the mean of the resampled observations
RStatFctn <- function(x,d) {return(mean(x[d]))}

# One row of (lower, upper) limits per simulated sample
b.basic = matrix(data=NA, nrow=1000, ncol=2)
b.normal = matrix(data=NA, nrow=1000, ncol=2)
b.percent = matrix(data=NA, nrow=1000, ncol=2)
b.bca = matrix(data=NA, nrow=1000, ncol=2)

for(i in 1:1000){
rnormdat = rnorm(30,0,1)
b <- boot(rnormdat, RStatFctn, R = 1000)
b.ci = boot.ci(b, conf = 0.95, type=c("basic","norm","perc","bca"))
b.basic[i,] = b.ci$basic[,4:5]
b.normal[i,] = b.ci$normal[,2:3]
b.percent[i,] = b.ci$percent[,4:5]
b.bca[i,] = b.ci$bca[,4:5]
}
]\");
b_basic = rconn << Get(b.basic);
b_normal = rconn << Get(b.normal);
b_percent = rconn << Get(b.percent);
b_bca = rconn << Get(b.bca);
rconn << Disconnect();

Using the R Connect() JSL command and assigning it to the object “rconn”, the code sends messages to the JSL scriptable object “rconn” to submit R code via the Submit() command and to retrieve R matrices containing the bootstrap confidence intervals back via the Get() commands.
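To see what those returned matrices are good for, here is a minimal sketch in plain R. It is my own illustration, not the JMP blog's code: it skips the boot package and builds a simple percentile interval by hand, and the names percentile_ci, n_boot and n_sim are made up for this example. The idea is the same as above: simulate many samples of 30 standard normal values, compute a 95% interval for the mean of each, and count how often the interval contains the true mean of 0.

```r
set.seed(123)

# Percentile bootstrap interval for the mean (hand-rolled, not boot::boot.ci)
percentile_ci <- function(x, n_boot = 500, conf = 0.95) {
  boot_means <- replicate(n_boot, mean(sample(x, replace = TRUE)))
  alpha <- (1 - conf) / 2
  unname(quantile(boot_means, c(alpha, 1 - alpha)))
}

# Repeat on many simulated samples; one row of (lower, upper) per sample
n_sim <- 200
ci <- t(replicate(n_sim, percentile_ci(rnorm(30, 0, 1))))

# Share of intervals that contain the true mean (0)
coverage <- mean(ci[, 1] <= 0 & ci[, 2] >= 0)
cat("Estimated coverage of the true mean:", coverage, "\n")
```

The simulation counts are kept small here so it runs quickly; the JSL example uses 1000 samples with 1000 resamples each.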

and I also found interesting what the writer has to say about using JMP (for visual analysis), SAS (for handling bigger datasets) and R (for advanced statistics) together

Other standard JMP tools such as the Data Filter can help to explore these results in ways that cannot easily and quickly be done in R

and

With a little JSL and the statistical and graphics platforms of JMP coupled with the breadth and variety of packages and functions in R, one can build complete easy-to-use applications for statistical analysis.

JMP can also integrate with SAS, which adds the ability to work with large-scale data through the file-based system as well as the depth and advanced capabilities of SAS procedures. With these seamless integrations, JMP can become a hub that enables you to connect with both SAS and R, as well as provide unique statistical features such as the JMP Profiler and interactive graphic features such as Graph Builder

and in the meanwhile here is a data visualization of a frequency analysis of various words bundled together from xkcd.com

Running an R GUI, and parallel programming on Amazon EC2

Ok here is an update to the post on running R on an Amazon EC2.

https://decisionstats.wordpress.com/2010/09/25/running-r-on-amazon-ec2/

1) Login to Amazon Console using instructions in earlier post

2) Select AMI-Platform Ubuntu-i-5575773f

Basically select the latest 64 bit instance from Ubuntu

3) Proceed as in the earlier post to launch the AMI and instance. Here I chose large with 4 cores

3.1) Before connecting to your session

search Synaptic Package Manager for x11-

I installed the X11 VNC server package –

and now interactive sessions will work (read GUIs)

3.2) Modify the line

ssh -i decisionstats2.pem root@ec2-75-101-182-203.compute-1.amazonaws.com

to

ssh -i decisionstats2.pem -X ubuntu@ec2-75-101-182-203.compute-1.amazonaws.com

This will connect you.

4) INSTALL R – Cran R is a standard Ubuntu Package

using

sudo apt-get install r-base

then type R

and install.packages("Rcmdr")

Note – you should be able to see the grey colored Tcl/Tk dialog showing CRAN mirror locations in a separate window if X11 is working
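If that window does not show up, a quick check from inside R can help tell whether the problem is X11 forwarding or the R build itself. This is only a sketch of one way to check, and the gui_ready name is my own; capabilities() and Sys.getenv() are base R.

```r
# Does this R build support X11 and Tcl/Tk?
caps <- capabilities(c("X11", "tcltk"))
print(caps)

# ssh -X sets DISPLAY on the remote side when forwarding is active
display <- Sys.getenv("DISPLAY")

gui_ready <- isTRUE(all(caps)) && nzchar(display)
if (gui_ready) {
  message("X11 and Tcl/Tk look available; library(Rcmdr) should open a window.")
} else {
  message("No usable display found; check the -X flag and the X11 packages.")
}
```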

5) The doSNOW package works on the 64-bit Ubuntu instance. The results are below for

> check <- function(n) {

+ for(i in 1:1000)

+ {

+ sme <- matrix(rnorm(100), 10,10)

+ solve(sme)

+ }

+ }

> times <- 100

> system.time(x <- foreach(j=1:times ) %dopar% check(j))
user system elapsed
0.150 0.080 7.303
> system.time(for(j in 1:times ) x <- check(j))
user system elapsed
27.460 2.300 29.757

The time of 7.3 is almost 5.5 times faster than running it locally on a dual core, and still 3 times faster than running foreach locally. Note I used 4 cores this time in snow.
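The transcript above leaves out the cluster setup that %dopar% needs. Here is a minimal sketch of the whole run, assuming the doSNOW package (which pulls in foreach and snow) is installed. The worker count of 2 and the smaller value of times are my choices to keep the sketch light, and check() returns its argument here only so the results can be inspected.

```r
library(doSNOW)  # attaches foreach and snow as dependencies

check <- function(n) {
  for (i in 1:1000) {
    sme <- matrix(rnorm(100), 10, 10)
    solve(sme)
  }
  n  # return the task index so the results are easy to inspect
}

# Start a socket cluster and register it as the %dopar% backend
cl <- makeCluster(2, type = "SOCK")  # use 4 to match the 4-core instance
registerDoSNOW(cl)

times <- 20
par_time <- system.time(x <- foreach(j = 1:times) %dopar% check(j))
seq_time <- system.time(for (j in 1:times) y <- check(j))

stopCluster(cl)

cat("parallel elapsed:  ", par_time[["elapsed"]], "\n")
cat("sequential elapsed:", seq_time[["elapsed"]], "\n")
```

Without the registerDoSNOW() call, foreach falls back to running sequentially and prints a warning, so the speedup disappears.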

6) The Tcl/Tk interface of R Cmdr takes longer to load on EC2 than locally. It may be due to the fact I was running Ubuntu using a VM Player (http://www.vmware.com/go/downloadplayer/ ). However there seems to be a general slowing down when viewing graphics.

Alternatively, R Commander can simply be installed in step 4 as an Ubuntu package:

sudo apt-get install r-cran-rcmdr

Kill R? Wait a sec

1) Is R efficient? (scripting wise, and performance wise) It depends on how you code it. Some packages like foreach can help, but basic efficiency comes from the programmer. The XDF format from RevoScaleR, the non-open R package, further improves programming efficiency

2) Should R be written from scratch?

You got to be kidding- It depends on how you define scratch after 2 million users

This has been done with S, then S Plus and now R.

3) What should be the license of R (if it was made a new)?

GPL license is fine. You need to do a better job of executing the license. Currently interfaces to R exist from SPSS, SAS, KXEN and other companies as well. To my knowledge neither royalty payments nor formal code sharing exist.

R core needs to do a better job of protecting the work of 2500 package-creators rather than settling for a few snacks at events, sponsorships, Corporate Board Membership for Prof Gentleman, and 4-5 packages donated to it. The only way R developers can currently support their research is to write a book (by Springer mostly)

E.g. ggplot and Hmisc are likely to be used more by the average corporate user. Do their creators deserve royalties if the creators of RevoScaleR are getting them?

If some of 2 million users gave 1 $ to R core (compared to 9 million in last round of funding in Revolution Analytics)- you would have enough money to create a 64 bit optimized R for Linux (missing in Enterprise R), Amazon R APIs (like Karim Chine’s efforts), R GUIs (like Rattle’s commercial version) etc etc

The developments are not surprising given that Microsoft and Intel are funding Revolution Analytics http://www.dudeofdata.com/?p=1967

R controversies come and go (this has happened before including the NYT article and shakeup at Revo)
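On the first question (efficiency depends on how you code it), here is a small illustration of my own in base R. Both functions compute the same squares; the first grows a vector inside a loop, the second uses vectorized arithmetic. The function names and the vector length are arbitrary, and the timings will vary by machine.

```r
n <- 1e4
x <- runif(n)

# Slow style: grow the result one element at a time (copies on every step)
slow_squares <- function(x) {
  out <- c()
  for (v in x) out <- c(out, v^2)
  out
}

# Idiomatic style: one vectorized operation
fast_squares <- function(x) x^2

t_slow <- system.time(a <- slow_squares(x))[["elapsed"]]
t_fast <- system.time(b <- fast_squares(x))[["elapsed"]]

stopifnot(isTRUE(all.equal(a, b)))  # same answer either way
cat("loop with growing vector:", t_slow, "sec\n")
cat("vectorized:              ", t_fast, "sec\n")
```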

An interesting debate on whether R should be killed to make an upgrade to a more efficient language.

From Tal (creator of R-bloggers) and on the R help list-

There is currently a (very !) lively discussions happening around the web, surrounding the following topics:
1) Is R efficient? (scripting wise, and performance wise)
2) Should R be written from scratch?
3) What should be the license of R (if it was made a new)?

Very serious people have taken part in the debates so far. I hope to let you know of the places I came by, so you might be able to follow/participate
in these (IMHO) important discussions.

The discussions started in the response for the following blog post on
Xi’An’s blog:
http://xianblog.wordpress.com/2010/09/06/insane/

Followed by the (short) response post by Ross Ihaka:
http://xianblog.wordpress.com/2010/09/13/simply-start-over-and-build-something-better/

Other discussions started to appear on Andrew Gelman’s blog:
http://www.stat.columbia.edu/~cook/movabletype/archives/2010/09/ross_ihaka_to_r.html

And (many) more responses started to appear in the hackers news website:
http://news.ycombinator.com/item?id=1687054

I hope these discussions will have fruitful results for our community,
Tal

---------------- Contact Details: ----------------
Contact me: Tal.Galili@gmail.com | 972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
www.r-statistics.com (English)

My 0 cents (see, it would be 2 cents but it's free)

WordPress.com Tweeting

Just a single click on a check mark to enable tweeting from your every blog post (similar to a Tweetmeme button)

Q&A with David Smith, Revolution Analytics.

Here’s a group of questions and answers that David Smith of Revolution Analytics was kind enough to answer post the launch of the new R Package which integrates Hadoop and R- RevoScaleR

Ajay- How does RevoScaleR work from a technical viewpoint in terms of Hadoop integration?

David- The point isn’t that there’s a deep technical integration between Revolution R and Hadoop, rather that we see them as complementary (not competing) technologies. Hadoop is amazing at reliably (if slowly) processing huge volumes of distributed data; the RevoScaleR package complements Hadoop by providing statistical algorithms to analyze the data processed by Hadoop. The analogy I use is to compare a freight train with a race car: use Hadoop to slog through a distributed data set and use Map/Reduce to output an aggregated, rectangular data file; then use RevoScaleR to perform statistical analysis on the processed data (and use the speed of RevoScaleR to iterate through many model options to find the best one).

Ajay- How is it different from MapReduce and Rhipe, the existing R Hadoop packages?

David- They’re complementary. In fact, we’ll be publishing a white paper soon by Saptarshi Guha, author of the Rhipe R/Hadoop integration, showing how he uses Hadoop to process vast volumes of packet-level VOIP data to identify call time/duration from the packets, and then do a regression on the table of calls using RevoScaleR. There’s a little more detail in this blog post: http://blog.revolutionanalytics.com/2010/08/announcing-big-data-for-revolution-r.html
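To make the aggregate-then-model workflow concrete, here is a toy sketch in base R. It is not Hadoop, Rhipe or RevoScaleR code: the packet log is simulated, the column names are invented, and tapply() stands in for the reduce step that collapses packets into one row per call before a regression is run on the call table.

```r
set.seed(42)

# Simulated packet log: one row per packet (entirely made-up data)
n_packets <- 5000
packets <- data.frame(
  call_id = sample(1:200, n_packets, replace = TRUE),
  time    = runif(n_packets, 0, 3600),
  bytes   = rpois(n_packets, 160)
)

# "Reduce" step: collapse packets into one rectangular row per call
calls <- data.frame(
  call_id  = sort(unique(packets$call_id)),
  duration = as.numeric(tapply(packets$time, packets$call_id,
                               function(t) max(t) - min(t))),
  n_pkts   = as.numeric(tapply(packets$time, packets$call_id, length)),
  bytes    = as.numeric(tapply(packets$bytes, packets$call_id, sum))
)

# Statistical step: regression on the much smaller table of calls
fit <- lm(bytes ~ n_pkts + duration, data = calls)
print(coef(fit))
```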

Ajay- Is it going to be proprietary, free or licensable (open source)?

David- RevoScaleR is a proprietary package, available to paid subscribers (or free to academics) with Revolution R Enterprise. (If you haven’t seen it, you might be interested in this Q&A I did with Matt Shotwell: http://biostatmatt.com/archives/533 )

Ajay- Any existing client case studies for Terabyte level analysis using R.

David- The VOIP example above gets close, but most of the case studies we’ve seen in beta testing have been in the 10’s to 100’s of Gb range. We’ve tested RevoScaleR on larger data sets internally, but we’re eager to hear about real-life use cases in the terabyte range.

Ajay- How can I use RevoScaleR on my dual chip Win Intel laptop for say 5 gb of data.

David- One of the great things about RevoScaleR is that it’s designed to work on commodity hardware like a dual-core laptop. You won’t be constrained by the limited RAM available, and the parallel processing algorithms will make use of all cores available to speed up the analysis even further. There’s an example in this white paper (http://info.revolutionanalytics.com/bigdata.html) of doing linear regression on 13Gb of data on a simple dual-core laptop in less than 5 seconds.
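How can a regression run on data bigger than RAM? One general approach, sketched below in base R, is to read the data in chunks and accumulate the cross-products X'X and X'y, which stay small no matter how many rows there are. To be clear, this is my illustration of the idea and not a description of how RevoScaleR is implemented; the data is simulated and held in memory only so the answer can be compared against lm().

```r
set.seed(1)
n <- 10000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 3 * dat$x2 + rnorm(n)

chunk_size <- 1000           # rows processed at a time (arbitrary choice)
xtx <- matrix(0, 3, 3)       # running sum of X'X
xty <- matrix(0, 3, 1)       # running sum of X'y

for (start in seq(1, n, by = chunk_size)) {
  rows <- start:min(start + chunk_size - 1, n)
  X <- cbind(1, dat$x1[rows], dat$x2[rows])   # design matrix for this chunk
  xtx <- xtx + crossprod(X)
  xty <- xty + crossprod(X, dat$y[rows])
}

# Solve the normal equations once all chunks have been seen
beta_chunked <- drop(solve(xtx, xty))
beta_lm <- unname(coef(lm(y ~ x1 + x2, data = dat)))

print(rbind(chunked = beta_chunked, lm = beta_lm))
```

In a real out-of-core setting each chunk would come from disk (for example a block of rows from a file) instead of from a data frame already in memory.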

AJ- Thanks to David Smith for this fast response, and congratulations to him, Saptarshi Guha, Dr Norman Nie and the rest of the guys at Revolution Analytics on this new product launch.
