Cloud say hello to R. R say hello to Cloud.

Here is a terrific project from Biocep, which I covered in January at http://www.decisionstats.com/2009/01/r-and-cloud-computing/

It now has some exciting steps ahead; see http://biocep-distrib.r-forge.r-project.org/

The idea: take open source R, add a user-friendly GUI, and host it on a cloud computer to crunch data better and save on hardware costs. In short: upload, crunch data, download.

Save hardware and software costs in a recession, before your boss decides to save on staffing costs.

    Biocep combines the capabilities of R and the flexibility of a Java based distributed system to create a tool of considerable power and utility. A Biocep based R virtualization infrastructure has been successfully deployed on the British National Grid Service, demonstrating its usability and usefulness for researchers. 

    The virtual workbench enhances the user experience and the productivity of anyone working with R.

A lovely presentation on it is here

and I am taking an extract

What is missing now

•High Level Java API for Accessing R

•Stateful, Reusable, Remotable R Components

•Scalable, Distributed, R Based Infrastructure

•Safe multiple clients framework for components usage as a pool of indistinguishable Remote Resources

•User friendly Interface for the remote resources creation, tracking and debugging

    Citation: Karim Chine, "Biocep, Towards a Federative, Collaborative, User-Centric, Grid-Enabled and Cloud-Ready Computational Open Platform,"
    2008 Fourth IEEE International Conference on eScience, pp. 321-322, 2008

Ajay- With thanks to Bob Marcus for pointing this out from an older post of mine. I did write on this in August on the Ohri framework, but that was before the recession moved me from cloud computing to blog computing.

Interview – Jon Peck, SPSS

I was in the middle of interviewing people, as well as helping the good people at Smart Data Collective in my new role as community evangelist, when I got a LinkedIn request from Jon Peck to join the SDC group.

SPSS Inc. is a leading worldwide provider of predictive analytics software and solutions. Founded in 1968, today SPSS has more than 250,000 customers worldwide, served by more than 1,200 employees in 60 countries. Jon is a legendary SPSS figure and a great teacher in this field. I asked him for an interview and he readily agreed.

Jon Peck is a Principal Software Engineer and Technical Advisor at SPSS. He has been working with SPSS since 1983, and in the interview he draws on the breadth of his perspective and experience to talk about analytics and about SPSS.

Ajay – Describe your career journey from college to today. What advice would you give to young students seeking to be hedge fund managers rather than scientists? What are the basic things that a science education can help students with, in your opinion?

Jon– After graduating from college with a B.A. in math, I earned a Ph. D in Economics, specializing in econometrics, and taught at a top American university for 13 years in the Economics and Statistics Departments and the School of Organization and Management.  Working in an academic environment all that time was a great opportunity to grow intellectually.  I was increasingly drawn to computing and eventually decided to join a statistical software company.  There were only two substantial ones at the time.  After a lot of thought, I joined SPSS as it seemed to be the more interesting place and one where I would be able to work in a wider variety of areas.  That was over 25 years ago!  Now I have some opportunities to teach and speak again as well as working in development, which I enjoy a lot.

I still believe in getting a broad liberal arts education along with as much quantitative training as possible.  Being able to work in very different areas has been a big asset for me.  Most people will have multiple careers, so preparing broadly is the most important career thing you can do.  As for hedge fund jobs – if there are any left, I’d say not to be starry-eyed about the money.  If you don’t choose a career that really interests you, you won’t be very successful anyway. Do what you love – subject to earning a living.

Math and scientific reasoning skills are preparation for working in many areas as well as being helpful in making the many decisions with quantitative aspects in life.  Math, especially, provides a foundation useful in many areas.  The recently announced program in the UK to improve general understanding of probability illustrates some practical value.

Ajay- What are SPSS’s contributions to open source software? What, if you can disclose them, are the plans for further increasing that involvement?

Jon-  I wish I could talk about SPSS future plans, but I can’t.  However, the company is committed to continuing its efforts in Python and R.  By opening up the SPSS technology with these open source technologies, we are able to expand what we and our users can do.  At the same time, we can make R more attractive through nicer output and simpler syntax and taking away much of the pain.  One of the things I love about this approach is how quickly and easily new things can be produced and distributed this way compared to the traditional development cycle.  I wrote about productivity and Python recently on my blog at insideout.spss.com.

Ajay – How happy is the SPSS developer community with Python? Are there any other languages that you are considering in the future?

Jon- Many in the SPSS user community were more used to packaged procedures than to programming (except in the area of data transformations).  So Python, first, and then R were a shock.  But the benefits are so large that we have had an excellent response to both the Python and R technologies.  Some have mastered the technology and have been very successful and have made contributions back to the SPSS community.  Others are consumers of this technology, especially through our custom dialogs and extension commands that eliminate the need to learn Python or R in order to use programs in these languages.  Python is an outstanding language.  It is easy to get started with it, but it has very sophisticated features.  It has fewer dark corners than any other language I know.  While there are a few other more popular languages, Python popularity has been steadily growing, especially in the scientific and statistical communities.  But we already have support for three high-level languages, and if there is enough demand, we’ll do more.

Some of our partners prefer to use the lower-level C language interfaces we offer.  That’s fine, too.  We’re not Python zealots (well, maybe, I am).  Python, as a scripting language, isn’t as fast as a compiled language.  For many purposes this does not matter, and Python itself is written in C.  I recently wrote a Python module for TURF analysis.  The computations are simple but computationally explosive, so I was worried that it would be too slow to be useful.   It turned out to be pretty fast because of the way I could use some of Python’s built-in data structures and algorithms.  And the popular NumPy and SciPy scientific and numerical libraries are written in C.

Users who would not think of themselves as developers sometimes find that a small Python effort can automate manual work with big time and accuracy improvements.  I got a note recently from a user who said, "I got it to work, and this is FANTASTIC! It will save me a lot of time in my survey analysis work."

Ajay- What are the areas where SPSS is not a good fit? What areas suit SPSS software the most compared to other solutions?

Jon- SPSS Statistics, the product,  is not a database.  Our strength is in applying analytical methods to data for model building, prediction, and insight.  Although SPSS Statistics is used in a wide variety of areas, we focus first on people data and think of that first when planning and designing new features.  SPSS Statistics and other SPSS products all work well with databases, and we have solutions for deploying analytics into production systems, but we’re not going to do your payroll.  One thing that was a surprise to me a few years ago is that we have a significant number of users who use SPSS Statistics as a basic reporting product but don’t do any inferential statistics.  They find that they can do customized reporting – often using the Custom Tables module – very quickly.  With Version 17, they can also do fancier and dynamic output formatting without resorting to script writing or manual editing, which is proving very attractive.

Ajay- Are there any plans for SPSS to use a Software as a Service model? Any plans to use advances in remote and cloud computing for SPSS?

Jon- We are certainly looking at cloud computing.  The biggest challenge is being able to put things in the cloud that will be robust and reliable.

Ajay- What are SPSS’s Asia plans? Which country has the maximum penetration of SPSS in terms of usage?

Jon- SPSS, the company, has long been strong in Japan and Taiwan, and Korea is also strong.  China is increasingly important, of course.  We have a large data center in Singapore.  Although India has a strong, long history in statistical methodology, it is a much less well-developed market for us.  We have a presence there, but I don’t know the numbers. (Ajay – SPSS was one of my first experiences with statistical software, when I came across it at my business school in 2001. In India SPSS has been very active with academic licensing, and it introduced us to its nice and easy menu-driven features.)

Biography – Jon earned his Ph. D. from Yale University and taught econometrics and statistics there for 13 years before joining SPSS.

Jon joined the SPSS company in 1983 and worked on many aspects of the very first SPSS DOS product, including writing the first C code that SPSS ever shipped. Among the features he has designed are OMS (the Output Management System), the Visual Bander, Define Variable Properties, ALTER TYPE, Unicode support, and the Date and Time Wizard. Jon is the author of many of the modules on Developer Central. He is an active cyclist and hiker.

Jon Peck blogs on  SPSS Inside-Out.

Modeling Visualization Macros

Here is a nice SAS Macro from Wensui’s blog at http://statcompute.spaces.live.com/blog/

It’s particularly useful for modelling chaps. I have seen a version of this macro some time back which also plotted curves, but this one is quite nice too.

SAS MACRO TO CALCULATE GAINS CHART WITH KS

%macro ks(data = , score = , y = );

options nocenter mprint nodate;

data _tmp1;
  set &data;
  where &score ~= . and &y in (1, 0);
  random = ranuni(1);
  keep &score &y random;
run;

proc sort data = _tmp1 sortsize = max;
  by descending &score random;
run;

data _tmp2;
  set _tmp1;
  by descending &score random;
  i + 1;
run;

proc rank data = _tmp2 out = _tmp3 groups = 10;
  var i;
run;

proc sql noprint;
create table
  _tmp4 as
select
  i + 1       as decile,
  count(*)    as cnt,
  sum(&y)     as bad_cnt,
  min(&score) as min_scr format = 8.2,
  max(&score) as max_scr format = 8.2
from
  _tmp3
group by
  i;

select
  sum(cnt) into :cnt
from
  _tmp4;

select
  sum(bad_cnt) into :bad_cnt
from
  _tmp4;    
quit;

data _tmp5;
  set _tmp4;
  retain cum_cnt cum_bcnt cum_gcnt;
  cum_cnt  + cnt;
  cum_bcnt + bad_cnt;
  cum_gcnt + (cnt - bad_cnt);
  cum_pct  = cum_cnt  / &cnt;
  cum_bpct = cum_bcnt / &bad_cnt;
  cum_gpct = cum_gcnt / (&cnt - &bad_cnt);
  ks       = (max(cum_bpct, cum_gpct) - min(cum_bpct, cum_gpct)) * 100;

  format cum_bpct percent9.2 cum_gpct percent9.2
         ks       6.2;
  
  label decile    = 'DECILE'
        cnt       = '#FREQ'
        bad_cnt   = '#BAD'
        min_scr   = 'MIN SCORE'
        max_scr   = 'MAX SCORE'
        cum_gpct  = 'CUM GOOD%'
        cum_bpct  = 'CUM BAD%'
        ks        = 'KS';
run;

title "%upcase(&score) KS";
proc print data  = _tmp5 label noobs;
  var decile cnt bad_cnt min_scr max_scr cum_bpct cum_gpct ks;
run;    
title;

proc datasets library = work nolist;
  delete _: / memtype = data;
run;
quit;

%mend ks;    

data test;
  do i = 1 to 1000;
    score = ranuni(1);
    if score * 2 + rannor(1) * 0.3 > 1.5 then y = 1;
    else y = 0;
    output;
  end;
run;

%ks(data = test, score = score, y = y);

/*
SCORE KS              
                                MIN         MAX
DECILE    #FREQ    #BAD       SCORE       SCORE     CUM BAD%    CUM GOOD%        KS
   1       100      87         0.91        1.00      34.25%        1.74%      32.51
   2       100      78         0.80        0.91      64.96%        4.69%      60.27
   3       100      49         0.69        0.80      84.25%       11.53%      72.72
   4       100      25         0.61        0.69      94.09%       21.58%      72.51
   5       100      11         0.51        0.60      98.43%       33.51%      64.91
   6       100       3         0.40        0.51      99.61%       46.51%      53.09
   7       100       1         0.32        0.40     100.00%       59.79%      40.21
   8       100       0         0.20        0.31     100.00%       73.19%      26.81
   9       100       0         0.11        0.19     100.00%       86.60%      13.40
  10       100       0         0.00        0.10     100.00%      100.00%       0.00
*/
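
For readers without SAS at hand, here is a rough Python sketch of the same gains-table calculation. It is not Wensui's code: the function name ks_gains_table and the dictionary keys are my own, ties in score keep their input order instead of being broken randomly, and the groups are cut by simple slicing rather than PROC RANK. Because Python's random numbers differ from ranuni and rannor, the demo will not reproduce the exact figures in the listing above.

```python
import random


def ks_gains_table(scores, labels, groups=10):
    """Build a gains table with a KS column, one row per score-ranked group.

    labels are assumed to be 1 (bad) or 0 (good), as in the SAS macro.
    """
    if len(scores) != len(labels) or len(scores) < groups:
        raise ValueError("need equally long inputs with at least one row per group")
    total_bad = sum(labels)
    total_good = len(labels) - total_bad
    if total_bad == 0 or total_good == 0:
        raise ValueError("need both bad (1) and good (0) outcomes")

    # Highest scores first, like the macro's descending PROC SORT.
    # Ties keep their input order here; the macro breaks them randomly.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n = len(ranked)

    rows = []
    cum_bad = cum_good = 0
    for g in range(groups):
        # Slice the ranked list into roughly equal groups.
        chunk = ranked[g * n // groups:(g + 1) * n // groups]
        bad = sum(y for _, y in chunk)
        cum_bad += bad
        cum_good += len(chunk) - bad
        cum_bpct = cum_bad / total_bad
        cum_gpct = cum_good / total_good
        rows.append({
            "decile": g + 1,
            "cnt": len(chunk),
            "bad_cnt": bad,
            "min_scr": min(s for s, _ in chunk),
            "max_scr": max(s for s, _ in chunk),
            "cum_bpct": cum_bpct,
            "cum_gpct": cum_gpct,
            # KS for the row: gap between cumulative bad% and good%.
            "ks": abs(cum_bpct - cum_gpct) * 100,
        })
    return rows


if __name__ == "__main__":
    # Simulated data in the spirit of the SAS test data set.
    rng = random.Random(1)
    scores = [rng.random() for _ in range(1000)]
    labels = [1 if s * 2 + rng.gauss(0, 1) * 0.3 > 1.5 else 0 for s in scores]
    for row in ks_gains_table(scores, labels):
        print(f"{row['decile']:>6} {row['cnt']:>6} {row['bad_cnt']:>6} "
              f"{row['cum_bpct']:>9.2%} {row['cum_gpct']:>9.2%} {row['ks']:>7.2f}")
```

The model's overall KS statistic is the largest value in the ks column.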

Here is another example of a SAS macro, this one for an ROC curve, from http://www2.sas.com/proceedings/sugi22/POSTERS/PAPER219.PDF

APPENDIX A
Macro
/***************************************************************/;
/* MACRO PURPOSE: CREATE AN ROC DATASET AND PLOT */;
/* */;
/* VARIABLES  INTERPRETATION */;
/* */;
/* DATAIN     INPUT SAS DATA SET */;
/* LOWLIM     MACRO VARIABLE LOWER LIMIT FOR CUTOFF */;
/* UPLIM      MACRO VARIABLE UPPER LIMIT FOR CUTOFF */;
/* NINC       MACRO VARIABLE NUMBER OF INCREMENTS */;
/* I          LOOP INDEX */;
/* OD         OPTICAL DENSITY */;
/* CUTOFF     CUTOFF FOR TEST */;
/* STATE      STATE OF NATURE */;
/* TEST       QUALITATIVE RESULT WITH CUTOFF */;
/* */;
/* DATE       WRITTEN BY */;
/* */;
/* 09-25-96   A. STEAD */;
/***************************************************************/;
%MACRO ROC(DATAIN,LOWLIM,UPLIM,NINC=20);
OPTIONS MTRACE MPRINT;
DATA ROC;
SET &DATAIN;
LOWLIM = &LOWLIM; UPLIM = &UPLIM; NINC = &NINC;
DO I = 1 TO NINC+1;
CUTOFF = LOWLIM + (I-1)*((UPLIM-LOWLIM)/NINC);
IF OD > CUTOFF THEN TEST="R"; ELSE TEST="N";
OUTPUT;
END;
DROP I;
RUN;
PROC PRINT;
RUN;
PROC SORT; BY CUTOFF;
RUN;
PROC FREQ; BY CUTOFF;
TABLE TEST*STATE / OUT=PCTS1 OUTPCT NOPRINT;
RUN;
DATA TRUEPOS; SET PCTS1; IF STATE="P" AND TEST="R";
TP_RATE = PCT_COL; DROP PCT_COL;
RUN;
DATA FALSEPOS; SET PCTS1; IF STATE="N" AND TEST="R";
FP_RATE = PCT_COL; DROP PCT_COL;
RUN;
DATA ROC; MERGE TRUEPOS FALSEPOS; BY CUTOFF;
IF TP_RATE = . THEN TP_RATE=0.0;
IF FP_RATE = . THEN FP_RATE=0.0;
RUN;
PROC PRINT;
RUN;
PROC GPLOT DATA=ROC;
PLOT TP_RATE*FP_RATE=CUTOFF;
RUN;
%MEND;
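
The cutoff sweep at the heart of this macro can also be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name roc_points is hypothetical, the inputs are plain lists rather than a SAS data set, and the states are coded "P" and "N" as in the macro. Rates are returned as percentages, which is what PCT_COL holds in the PROC FREQ output.

```python
def roc_points(od, state, lowlim, uplim, ninc=20):
    """Sweep ninc + 1 evenly spaced cutoffs between lowlim and uplim.

    od    : list of test readings (optical density in the macro)
    state : list of true states, "P" (positive) or "N" (negative)
    Returns a list of (cutoff, tp_rate, fp_rate) with rates in percent.
    """
    if ninc < 1:
        raise ValueError("ninc must be at least 1")
    positives = sum(1 for s in state if s == "P")
    negatives = sum(1 for s in state if s == "N")
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative case")

    points = []
    for i in range(ninc + 1):
        cutoff = lowlim + i * (uplim - lowlim) / ninc
        # A reading above the cutoff is called reactive ("R" in the macro).
        tp = sum(1 for x, s in zip(od, state) if s == "P" and x > cutoff)
        fp = sum(1 for x, s in zip(od, state) if s == "N" and x > cutoff)
        points.append((cutoff, 100.0 * tp / positives, 100.0 * fp / negatives))
    return points


if __name__ == "__main__":
    # Tiny made-up example, not data from the paper.
    readings = [0.1, 0.4, 0.6, 0.9]
    truth = ["N", "N", "P", "P"]
    for cutoff, tp_rate, fp_rate in roc_points(readings, truth, 0, 1, ninc=4):
        print(f"cutoff={cutoff:.2f}  TP%={tp_rate:6.1f}  FP%={fp_rate:6.1f}")
```

Plotting tp_rate against fp_rate for these points gives the same curve that PROC GPLOT draws at the end of the macro.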

Version 9.2 of SAS has a macro called %ROCPLOT: http://support.sas.com/kb/25/018.html

SPSS also supports ROC curves, and there is a nice document on that here:

http://www.childrensmercy.org/stats/ask/roc.asp

Here are some examples from R with the package ROCR from

http://rocr.bioinf.mpi-sb.mpg.de/

 

Using ROCR’s 3 commands to produce a simple ROC plot:
library(ROCR)
# predictions: classifier scores; labels: the true class labels
pred <- prediction(predictions, labels)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, col=rainbow(10))

The graphics are outstanding in the R package and here is an example

Citation:

Tobias Sing, Oliver Sander, Niko Beerenwinkel, Thomas Lengauer.
ROCR: visualizing classifier performance in R.
Bioinformatics 21(20):3940-3941 (2005).

 

Interview with Anne Milley, SAS II

Anne Milley is director of product marketing at SAS Institute. In part 2 of the interview Anne talks about immigration in technology areas, open source networks, how she misses coding, and software as a service, especially SAS Institute’s offering. She also gives a preview of SAS’s involvement with R and mentions cloud computing.

Ajay – Labor arbitrage outsourcing versus virtual teams located globally: what is SAS Institute’s position, and your opinion, on this? What do you feel about the recent debate on H-1B visas and job cuts? How many jobs, if any, is SAS planning to cut in 2009-2010?

Anne – SAS is a global company, with customers in more than 100 countries around the world.  We hire employees in these countries to help us better serve our global customers.  Our workforce decisions are based on our business needs.  We also employ virtual teams–the feedback and insights from our global workforce help us improve and develop new products to meet the evolving needs of our customers.  (As someone who works from her home office in Connecticut, I am a fan of virtual teaming!)  We see these approaches as complementary.

The issue of the H-1B visa is a different discussion entirely.  H-1B visas, although capped, permit US employers to bring foreign employees in “specialty occupations” into this country.   The better question, though, is what is necessitating the need for H-1B visas.  We would submit that the reason the U.S. has to look outside its borders for highly qualified technical workers is because we are not producing a sufficient number of workers with the right skill sets to meet U.S. demand.  In turn, that means that our educational system is not producing students interested or qualified to pursue the STEM (science, technology, engineering or mathematics) professions (either at a K-12 or post-secondary level), or developing the workforce improvement programs that may allow workers to pursue these “specialty occupations.”  Further, any discussion about H-1B visas (or any other type of visa) should include a more comprehensive review of our nation’s immigration policies—are they working, are they not working, how or why are they, are we able to limit illegal immigration and if not, why not, etc.

I am not aware of any planned job cuts at SAS.  In fact, I am aware of a few groups which are actively hiring.

Ajay- Which open source software has SAS Institute worked with in the past, and which does it continue to support financially as well as technologically? Any exciting product releases in 2009-2010 that you can tell us about?

Anne- Open source software provides many options and benefits.  We see many (SAS included) embracing open source for different things.  Our software runs on Linux and we use some open-source tools in development. There are different aspects of open source software in developing SAS software:

-Development with open source tools such as Eclipse, Ant, NAnt, JUnit, etc. to build, test, and package our software

-Using open source software in our products; examples include Apache/Jakarta products such as the Apache Web Server.

-Developing open source software, making changes to an open source codebase, and optionally contributing that source back to the open source project, to adapt an open source project for use in a SAS product or for internal use. Example: Eclipse.

And we plan to do more with open source in the future.  The first step of SAS integrating with R will be shown at SAS Global Forum coming up in DC later this month.  Other announcements for new offerings are also planned at this event. 

Ajay- What do you feel about adopting Software as a service for any of  SAS Institute’s products. Any new initiatives from SAS on the cloud computing front especially in terms of helping customers cut down on hardware costs.

Anne- SAS Solutions OnDemand, the division which oversees the infrastructure and support of all our hosted offerings, is expanding in this rapidly growing market.  SAS Solutions OnDemand Drug Development was our first SaaS offering announced in January.  Additional news on new hosted offerings will be announced at SAS Global Forum later this month.  SAS doesn’t currently offer any external cloud computing options, but we’re actively looking at this area.

Ajay- Which software do you personally find best to write code in, and why? Do you miss writing code, and if so, why?

Anne- In my current role, I have limited opportunity to write code.  At times, I do miss the logical thought process coding forces you to adopt (to do the job as elegantly as possible).  I had the opportunity to do a long-term assignment at a major financial services company in the UK last year and did get to use some SAS and JMP, including a little JSL (JMP scripting language).  There’s nothing like real-world, noisy, messy data to make you thankful for the power of writing code!  Even though I don’t write code on a regular basis, I am happy to see continued investment in the languages SAS provides—among the most recent, the addition of an algebraic optimization modeling language in our SAS/OR module contained within the SAS language as “PROC OPTMODEL.”

I have great respect for people who invest in learning (or even getting exposure to) more than one language and who appreciate the strengths of different languages for certain tasks and applications.

Ajay- It is great to see passionate people at work on both the open source and the packaged software teams, and even better to see them collaborate once in a while. Most of our work is based on scientists who came before us (especially in math theory).

Ultimately we are all just students of science anyway.

SAS Global Forum – http://support.sas.com/events/sasglobalforum/2009/

Annual event of SAS language practitioners. The SAS language consists of data steps and proc steps for input and output, thus simplifying syntax for users.

SAS Institute – The leader in analytics software since the 1970s, it grew out of North Carolina State University and provides jobs to thousands of people. The world’s largest privately held software company, admired for its huge investments in research and development and criticized for the premium price of its packaged software solutions. A recent entrant among corporate users who are willing to support the R language.

Dataset too big for R ?

In case you have a dataset too big to fit in memory for R, there is a package called biglm.

You install it like this-

install.packages("biglm", dependencies = TRUE)

 

 

  Information on package ‘biglm’

Description:

Package:       biglm
Type:          Package
Title:         bounded memory linear and generalized linear models
Version:       0.6
Date:          2005-09-24
Author:        Thomas Lumley
Maintainer:    Thomas Lumley <tlumley@u.washington.edu>
Description:   Regression for data too large to fit in memory
License:       GPL
Suggests:      RSQLite, RODBC
Enhances:      leaps
Packaged:      Tue Feb 24 10:47:44 2009; tlumley
Built:         R 2.8.1; i386-pc-mingw32; 2009-02-24 21:35:12; windows

Index:

bigglm                  Bounded memory linear regression
biglm                   Bounded memory linear regression
predict.bigglm          Predictions from a biglm/bigglm

and in case you are the statistical kind of chap who wants to know what’s IN the code for these functions

function (formula, data, family = gaussian(), ...)
UseMethod("bigglm", data)
<environment: namespace:biglm>
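
To give a feel for how a regression can be fitted without holding all the data in memory, here is a minimal Python sketch of the bounded-memory idea. It is not biglm's code: as far as I know biglm updates a QR-style decomposition incrementally, while this sketch swaps in the simpler normal equations, accumulating X'X and X'y chunk by chunk. That is easier to read but numerically less robust. The function names accumulate, solve and chunked_ols are my own.

```python
def accumulate(chunks, p):
    """Accumulate X'X (p x p) and X'y (length p), one chunk at a time.

    chunks yields lists of (x, y) pairs, where x is a list of p predictor
    values (include a leading 1.0 for the intercept). Only p * p + p numbers
    are kept, however many rows pass through.
    """
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    for chunk in chunks:
        for x, y in chunk:
            for i in range(p):
                xty[i] += x[i] * y
                for j in range(p):
                    xtx[i][j] += x[i] * x[j]
    return xtx, xty


def solve(a, b):
    """Solve a * beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular or nearly so")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(m[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (m[i][n] - rest) / m[i][i]
    return beta


def chunked_ols(chunks, p):
    """Least-squares coefficients from data delivered in chunks."""
    xtx, xty = accumulate(chunks, p)
    return solve(xtx, xty)


if __name__ == "__main__":
    # Two chunks of made-up data following y = 2 + 3 * x exactly.
    first = [([1.0, float(x)], 2.0 + 3.0 * x) for x in range(0, 5)]
    second = [([1.0, float(x)], 2.0 + 3.0 * x) for x in range(5, 10)]
    print(chunked_ols(iter([first, second]), p=2))
```

In practice the chunks would come from reading a large file or database table a few thousand rows at a time, so memory use depends on the number of predictors rather than the number of rows.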

 

R tip of the day – If you want to know what an R function, say procmeans, does, all you need to do is type procmeans (without parentheses) at the command prompt, and R will print the code inside it.

If it gives an error most probably you need to

1) Install

and 2) Load the package containing the function

Which are conveniently here

image

credit source –http://www.nabble.com/R-f13819.html

An R Package only for SAS Users

Dear All,

I am doing some research into creating an R package for SAS language users.

The name of the beta package is “Anne”, but I am open to suggestions for the name.

The basic idea is to let SAS language users (especially Windows SAS language users) try out R and get a feel for it without being overwhelmed by its powerful matrix-level capabilities or its command line interface.

Creating new functions is quite easy as the following code shows.

The first R code for the “Anne 1.0” Package is

procunivariate <- function(x) summary(x)

procimportcsv <- function(x) read.table(x, header = TRUE,
                           sep = ",", row.names = "id", na.strings = "   ")

libname <- function(x) setwd(x)

 

Note: I am tweaking the code as we speak and will try to add one proc per week.

But how do you put functions in an R package?

This is how to create an R package – (To be continued)

Note- SAS here refers to SAS Language.