R and SAS in Twitter Land

A tale of two languages (set in Twitterland)

Every time I post to the R help list, if the email contains the three letters S-A-S, I get plenty of e-spanking from senior professors and distinguished Linux people. On the other hand, when I mentioned W-P-S, I got dunked by the Don of SAS Global himself. We geeks are so passionate.

Here is some new stuff on Twitter for the R/open source community.

1) I manually made a list of

  1. the best R blogs,
  2. R help lists (on Nabble, since Google Groups banned the R help archive),
  3. the Twitter search for #rstats (the general search term for R).

I then copied the RSS feeds of each of the above.

2) I then went to www.twitterfeed.com (which uses OpenID) and linked a new Twitter account to these RSS feeds

[Screenshot: twitterfeed.com - feed your blog to Twitter]

3) I then tweaked the layout and added #rstats before each post to create the new R resource, http://twitter.com/Rarchive


If you are a tweeter, you can follow it at http://twitter.com/Rarchive and never miss any R news going forward.

PS: I also did the same for SAS at http://twitter.com/sascommunity

UPDATE

#rstats helps with SEO on Google, since Google uses Twitter search as well. The best existing R search engine is http://rseek.com
In any case, it is too late to change now, since this is more of an automated firehose. You can now use #rstats along with additional keywords to find more useful, searchable material.

NOTE

http://twitter.com/sas belongs to a guy who is wondering who is trying to hack his Twitter account; you can check the screenshot below.

[Screenshot: Sky Sutton (SAS) on Twitter]

Interview Karim Chine BIOCEP (Cloud Computing with R)

Here is an interview with Karim Chine of http://www.biocep.net/

Working with R or Scilab engines on clusters/grids/clouds becomes as simple as working with them locally.

- Karim Chine, Biocep.

Ajay- Please describe your career in the field of science. What advice would you give to young science graduates in this recession?

Karim- My original background is in theoretical physics; I did my Master's thesis at the Ecole Normale's Statistical Physics Laboratory, where I worked on phase separation in two-dimensional additive mixtures with Dr Werner Krauth. I came to computer science after graduating from the Ecole Polytechnique, and I spent two years at TELECOM ParisTech studying software architecture and distributed systems design. I then worked for the IBM Paris Laboratory (the VisualAge Pacbase applications generator), Schlumberger (Over-the-Air Platform and web platform for smartcard personalization services), Air France (SSO deployment) and ILOG (OPL-CPLEX-ODM Development System). This gave me intense exposure to real-world large-scale software design. I crossed the borders of cultural, technical and organizational domains several times, and I worked with a broad palette of technologies alongside some of the best and most innovative engineers. I moved to Cambridge in 2006 and worked for the European Bioinformatics Institute. That is where I started dealing with the integration of R into various types of applications. I left the EBI in November 2007. I was looking for institutional support to help me bring into reality a vision that was becoming clearer and clearer: a universal platform for scientific and statistical computing. I failed to get that support, and I have been working on BIOCEP full time for most of the last 18 months without being funded. A few days of consultancy given here and there allowed me to keep going. I spent several weeks at Imperial College, at the National Centre for e-Social Science and at Berkeley's department of statistics during that period. Those visits were extremely useful in refining the use cases of my platform. I am still looking for a partner to back the project. You asked me to give advice. The only advice I would give is to be creative and to try again and again to do what you really want to do. Crises come and go, they always will, and extreme situations are part of life. I believe hard work and sincerity can overcome anything.

Ajay- Describe BIOCEP's scope and ambition. What current operational analytics can users with data perform?

Karim- My first ambition with BIOCEP is to deliver a universal platform for scientific and statistical computing, and to create an open, federative and collaborative environment for the production, sharing and reuse of all the artifacts of computing. My second ambition is to dramatically enhance the accessibility of mathematical and statistical computing, to make HPC commonplace, and to put new analytical, numerical and processing capabilities in the hands of everyone (open science).

The open source software conquest has gone very far. Environments like R and Scilab, technologies like Java, operating systems like Ubuntu Linux, and tools like OpenOffice are being used by millions of people. Very little doubt remains about OSS's final victory in some domains. The cloud is already a reality, and it will take computing to a whole new realm. What is currently missing is the software that, by making the cloud's usage seamless, will create new ecosystems and provide room for creativity, innovation and knowledge discovery on an unprecedented scale.

BIOCEP is one more building block toward this. BIOCEP is built on top of R and Scilab, and anything that you can do within those environments is accessible through BIOCEP. Here is what you get uniquely with this new R/Scilab-based e-platform:

High productivity via the most advanced cross-platform workbench available for the R environment.

Advanced graphics: with BIOCEP, a graphic transducer renders on the client side the graphics produced on the server side, and enables advanced capabilities like zooming/unzooming/scrolling for R graphics. A client-side mouse tracker dynamically displays information related to the graphic, depending on the coordinates. Several virtual R devices showing different data can be coupled in zooming/scrolling, which helps in visually comparing complex graphics.

Extensibility with plug-ins: new views (IDE-like views, analytical interfaces…) can be created very easily either programmatically or via drag-and-drop GUI designers.

Extensibility with server-side extensions: any Java code can be packaged and used on the server side. The code can interact seamlessly with R and Scilab or provide generic bridges to other software. For example, I provide an extension that allows you to use OpenOffice as a universal converter between various file formats on the server side.

Seamless high performance computing: working with R or Scilab engines on clusters/grids/clouds becomes as simple as working with them locally. Distributed computing becomes seamless: creating a large number of remote R and Scilab engines and using them to solve large-scale problems becomes easier than ever. From the R console, the user can create logical links to existing R engines or to newly created ones, and use those logical links to pilot the remote workers from within his R session. R functions use the logical links to import/export variables between the R session and the different workers. R commands/scripts can be executed by the R workers synchronously or asynchronously. Many logical R links can be aggregated into one logical cluster variable that can be used to pilot the R workers in a coordinated way. A cluster.apply function uses the logical cluster to apply a function to a big data structure by slicing it and sending elementary execution commands to the workers. The workers apply the user's function to the slices in parallel. The elementary results are aggregated to compose the final result, which becomes available within the R session.
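BIOCEP's own client API is not reproduced here, but the slice-apply-aggregate pattern Karim describes can be sketched with the open snow package, which offers the same idiom for a cluster of R workers; treat this as a minimal illustration under that substitution, not as BIOCEP code.

library(snow)

# Create a logical cluster of 4 local R workers. In BIOCEP the logical
# links would point at remote engines on a grid or cloud instead.
cl <- makeCluster(4, type = "SOCK")

# Export a variable from the local session to every worker.
big.matrix <- matrix(rnorm(1e6), ncol = 100)
clusterExport(cl, "big.matrix")

# Apply a function to slices in parallel; the elementary results are
# aggregated back into the local session, as with cluster.apply.
col.means <- parApply(cl, big.matrix, 2, mean)

stopCluster(cl)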

Collaboration: your R/Scilab server running in the cloud can be accessed simultaneously by you and your collaborators. Everything gets broadcast, including graphics. A spreadsheet enables viewing and editing data collaboratively. Anyone can write plug-ins to take advantage of the collaborative capabilities of the frameworks. If your IP address is public, you can provide a URL to anyone and let them connect to your locally running R.

Powerful frameworks for Java developers: BIOCEP provides frameworks and tools to use R as if it were an object-oriented Java toolkit, or as a web toolkit for R-based dynamic applications.

Web services for C#, Perl and Python users/developers: most of the capabilities of BIOCEP, including piloting R/Scilab engines on the cloud for distributed computing or for building scalable analytical web applications, are accessible from most programming languages thanks to the SOAP front-end.

RESTful API: simple URLs can perform computations using R/Scilab engines and return the result as XML or as graphics in any format. This works like Google Charts, but with all the power of R, since the graphic is described by an R script provided as a parameter of the URL. The same API can be exposed on demand by the workbench. This allows, for example, integrating a cloud R with Excel or OpenOffice; the workbench works as a bridge between the cloud and those applications.
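To make the idea concrete, here is a hypothetical sketch in R: the host, path and parameter names below are invented for illustration (the post does not give BIOCEP's actual URL scheme), but the pattern of passing an R script as a URL parameter is the one described above.

# Hypothetical endpoint and parameter names, for illustration only.
script <- "plot(rnorm(100), type = 'l')"  # R script describing the graphic

url <- paste(
  "http://my-cloud-r.example.com/api/graphic",  # invented host/path
  "?format=png",
  "&script=", URLencode(script, reserved = TRUE),
  sep = "")

# Fetch the rendered graphic, much as you would a Google Charts URL.
download.file(url, destfile = "graphic.png", mode = "wb")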

Advanced pooling framework for distributed resources: useful for deploying pools of R/Scilab engines on multi-node systems and having them used simultaneously by several distributed client processes in a scalable, optimal way. A supervision GUI is provided for user-friendly management of the pools/nodes/engines.

Simultaneous use of R and Scilab: using Java scripting, data can be transferred from R to Scilab and vice versa.

Ajay- Could you tell us about a successful BIOCEP installation and what it led to? Can BIOCEP be used by the rest of the R community for other packages? What would be an ideal BIOCEP user/customer for whom cloud-based analytics makes the most sense?

Karim- BIOCEP is still in pre-beta stage. However, it is a robust and polished pre-beta that several organizations are already using. Janssen Pharmaceutica is using it to create and deliver statistical applications for drug discovery that use R engines running on their back-end servers. The platform is foreseen there as the way to go for the ultimate optimization of some of their data analysis pipelines. Janssen's head of statistics is said to be very interested in the capability BIOCEP gives statisticians to create their own analytical user interfaces and deliver them with their models without needing specific software development skills. Shell is creating BIOCEP-based application prototypes to explore the feasibility and advantages of migrating some of Shell's applications to the cloud. One group from Shell Global Solutions is planning to use BIOCEP to run Scilab in the cloud for corrosion simulation modeling. Dr Ivo Dinov's team at UCLA is studying the migration of some of the SOCR applications to the BIOCEP platform as plug-ins and extensions. Dr Dinov has also applied for an important grant to build DISCb (Distributed Infrastructure for Statistical Computing in Biomedicine); if the grant application is successful, BIOCEP will be the backbone of that new infrastructure at the software architecture level. In cooperation with the Institute of Biostatistics, Leibniz University of Hannover, Bernd Bischl and Kornelius Rohmeyer have developed a framework for building user-friendly R GUIs of varying complexity. The toolkit has used BIOCEP as its R back-end since release 2.0. Several small projects have been implemented using this framework, and some are in production, such as an application for education in biostatistics at the University of Hannover. The ESNATS project is also planning to use the BIOCEP frameworks. Some development is being done at the EBI to customize the workbench and use it to give end users the ability to run R and Bioconductor on the EBI's LSF cluster.

I've been in touch with Phil Butcher, Sanger's head of IT, and he is considering the deployment of BIOCEP on Sanger's systems simultaneously with Eucalyptus. The same type of deployment has been discussed with the director of OMII-UK, Neil Chue Hong; BIOCEP's deployment will probably follow the deployment of the Eucalyptus system on the NGS. Tena Sakai has deployed BIOCEP at the Ernest Gallo Clinic and Research Center and is currently exploring the use of R on the cloud via BIOCEP (Eucalyptus/AWS). The platform has been deployed by a small consultancy company specializing in R on several London-based investment banks' systems. I have had a go-ahead from Nancy Wilkins-Diehr (Director for Science Gateways, SDSC) for a deployment on TeraGrid, and a deployment on EGEE has been discussed with Dr Steven Newhouse (EGEE Technical Director). Both deployments are on standby at the moment.

Quest Diagnostics is planning to use BIOCEP extensively. Sudeep Talati (University of Manchester) is doing his Master's project on BIOCEP. Supervised by Professor Andy Brass, he is exploring the use of a BIOCEP-based infrastructure to deliver microarray analysis workflows in a simple and intuitive way to biologists, with and without the cloud. In Manchester, Robin Pinning (e-Science team leader, Research Computing Services) has the deployment of BIOCEP on Manchester's research cluster on his agenda…

As I have said, anything that you can do with R, including installing, loading and using any R package, is accessible through BIOCEP. The platform aims to be universal and to become a tool for productivity and collaboration used by everyone dealing with computing and analytics, with or without the cloud.

The cloud, whether public or private, will become generalized, and everyone will become a cloud user in one way or another.

Ajay- What motivated you to build BIOCEP and mash up cloud computing and R? What scope do you see for cloud computing in developing countries in Asia and Africa?

Karim- When I was at the EBI, I worked on the integration of R within scalable web applications. I explored and tested the available frameworks and tools, and all of them were either too low-level or too simple to answer the problem. I decided to build new frameworks. I had the opportunity to stand on the shoulders of giants.

Simon Urbanek's packages already bridged the C API of R with Java reliably. Martin Morgan's RWebServices package defined class mappings between R types, including S4 classes, and Java.

Progressively, R became usable as a Java object-oriented toolkit, then as a Java server. Then I built a pooling framework for distributed resources that made it possible for multiple clients to use multiple R engines optimally.

I started building a GUI to validate the server's increasingly sophisticated API. That GUI progressively became the workbench.

When I was at Imperial, I worked with the National Grid Service team at the Oxford e-Research Centre to deploy my platform on Oxford’s core cluster. That deployment led to many changes in the architecture to meet all the security requirements.

It was obvious that the next step was to make BIOCEP available on Amazon's cloud. Academic grids are for researchers; the cloud is for everyone. Making the platform work seamlessly on EC2 took a few months. With the cloud came the focus on collaborative features (collaborative views, graphics, spreadsheets…).

I can only talk about the example of a country I know, Tunisia, and I guess some of this applies to Asian countries. Even though broadband is everywhere today and is becoming accessible and affordable for a majority of Tunisians, I am not sure that adoption of the cloud will happen soon.

Simple considerations, like the obligation to pay for compute cycles in dollars (and not in dinars), are a barrier to adoption. Spending foreign currency is subject to several restrictions in general, for companies and for individuals; few Tunisians have credit cards that can be used to pay Amazon. Companies would prefer to buy and administer their own machines, because the cost of operation and maintenance is lower in Tunisia than it is in Europe or the US.

Even if the cloud would help give Tunisian researchers access to affordable computing cycles on demand, it seems that most of them have learned to live without HPC resources, and that their research is more theoretical and less computational than it could be. Others collaborate with research groups in Europe (France) and use those European groups' infrastructures.

Ajay- How does BIOCEP address the problems of data hygiene, data security and privacy? Are encrypted and compressed data transfers supported or planned?

Karim- With BIOCEP, a computational engine is exposed as a distributed component via a single mono-directional HTTP port. When you run such an engine on an EC2 instance, you have two options:

  • Option 1: totally sandbox the machine (via the security group) and leave only the SSH port open. Private-key authentication is required to access the machine. In this case you use an SSH tunnel (created with a tool like PuTTY, for example), which lets you see the engine as if it were running on your local machine on a port of your choice, the one specified when creating the tunnel. When you start the virtual workbench and connect in HTTP mode to localhost via the specified port, you are effectively connecting to the EC2 R engine. 100% of the information exchanged between your workbench and the engine, including your data, is encrypted thanks to the SSH tunnel (a minimal tunnel sketch follows after this list). The virtual workbench embeds JSch and can create the tunnel for you automatically. This mode doesn't allow collaboration, since it requires the private key to let the workbench talk to the EC2 R/Scilab engine.
  • Option 2: tell the EC2 machine at startup (via the “user data”) to require specific credentials from the user. When the machine starts running, the user needs to provide those credentials to get a session ID and to be able to pilot a virtual EC2 R/Scilab engine. This mode enables collaboration. The client (workbench/scripts) connects to the EC2 machine instance via HTTP (it will be HTTPS in the near future).
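As a minimal sketch of option 1, such a tunnel can even be opened from R itself; the ports, key file and host below are placeholders borrowed from the EC2 walkthrough later in this post, not BIOCEP defaults.

# Placeholders only: adjust the key file, ports and host to your setup.
local.port  <- 6666L
remote.port <- 80L   # assumed port of the engine's HTTP front-end
host <- "ec2-67-202-44-197.compute-1.amazonaws.com"

# Build and launch the forwarding command: the remote engine then
# appears on http://localhost:6666 for the workbench to connect to.
tunnel <- sprintf("ssh -i testkp.pem -N -L %d:localhost:%d root@%s",
                  local.port, remote.port, host)
system(tunnel, wait = FALSE)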

Ajay- Suppose I have 20 GB per month of data and my organization has decided to cut back on the number of expensive annual software licenses. How can the current version of BIOCEP help me do the following?

Karim- Here are ways BIOCEP can help you right now:

1) Data aggregation and reporting in terms of spreadsheets, presentations and graphs

  • BIOCEP provides a highly programmable server-side spreadsheet.
  • It can be used interactively as a view of the workbench, and simple clicks allow the transfer of data from cells to R variables and vice versa. It can be created and populated from R (console/scripts).
  • Any R function can be used within dynamically computed cells. The evaluation of those dynamic cells is done on the server side and can use high-performance computing functions. Macros allow adding reactivity to the spreadsheets.
  • A macro allows the user to execute any R code in response to a change in the value of an R variable or in the content of a range within a spreadsheet. Variable-docking macros allow the mirroring of R variables of any type (vectors, matrices, data frames…) with ranges within the spreadsheet in read/write mode.

Several ready-to-use user interface components can be created and docked anywhere within the spreadsheet. Those components include:

  • an R graphics viewer (PDF viewer) showing graphics produced by a user-defined R script, reactive to changes in user-defined variables and cell ranges,
  • customizable sliders mirroring R variables,
  • buttons executing user-defined R code when pressed,
  • combo boxes mirroring factor variables…

The spreadsheet-based analytical user interface can pilot an R engine running at any location (local R, grid R, cloud R…). It can be created in minutes just by pointing, clicking and copy-pasting.

Cell contents, macros and reactive docked components can be saved in a zip file and become a workbench plug-in. Like all BIOCEP plug-ins, the spreadsheet-based GUI can be delivered to the end user via a simple URL. It can use a cloud R or a local R created transparently on the user's machine.

2) Building time series models and regression models

BIOCEP's workbench is extensible, and I hope that contributors will soon start writing plug-ins, or converting available GUIs to BIOCEP plug-ins, in order to make the creation of those models as easy as possible.

Biography-

Karim Chine
Karim Chine graduated from the French Ecole Polytechnique and TELECOM ParisTech. He worked at the Ecole Normale Supérieure-LPS (phase separation in two-dimensional additive mixtures), IBM (VisualAge Pacbase), Schlumberger (Over-the-Air Platform and web platform for smartcard personalization services), Air France (SSO deployment), ILOG (OPL-CPLEX-ODM Development System), the European Bioinformatics Institute (Expression Profiler, Biocep) and the Imperial College London Internet Centre (Biocep). He has contributed to open source software (AdaBroker) and is the author of the Biocep platform. He currently works on the seamless integration of the platform within utility computing infrastructures (Amazon EC2), its deployment on grids (NGS) and its usage as a tool for education, and he is trying to build collaborations with academic and industrial partners.

You can view his resume here http://www.biocep.net/scan/CV_Karim_Chine_June_2009.pdf

Interview David Smith REvolution Computing

Here is an interview with REvolution Computing's Director of Community, David Smith.

“Our development team spent more than six months making R work on 64-bit Windows (and optimizing it for speed), which we released as REvolution R Enterprise bundled with ParallelR.” - David Smith

Ajay- Tell us about your journey in science. In particular, tell us what attracted you to R and the open source movement.

David- I got my start in science in 1990 working with CSIRO (the government science organization in Australia) after I completed my degree in mathematics and computer science. Seeing the diversity of projects the statisticians there worked on really opened my eyes to statistics as the way of objectively answering questions about science.

That’s also when I was first introduced to the S language, the forerunner of R. I was hooked immediately; it was just so natural for doing the work I had to do. I also had the benefit of a wonderful mentor, Professor Bill Venables, who at the time was teaching S to CSIRO scientists at remote stations around Australia. He brought me along on his travels as an assistant. I learned a lot about the practice of statistical computing helping those scientists solve their problems (and got to visit some great parts of Australia, too).

Ajay- How do you think we should help bring more students to the fields of mathematics and science?

David- For me, statistics is the practical application of mathematics to the real world of messy data, complex problems and difficult conclusions. And in recent years, lots of statistical problems have broken out of geeky science applications to become truly mainstream, even sexy. In our new information society, graduating statisticians have a bright future ahead of them which I think will inevitably draw more students to the field.

Ajay- Your blog at REvolution Computing is one of the best technical corporate blogs, in particular the monthly round-up of new packages, R events and product launches, all written in a lucid style. Are there any plans for a REvolution Computing community or network as well, instead of just the blog?

David- Yes, definitely. We recently hired Danese Cooper as our Open Source Diva to help us in this area. Danese has a wealth of experience building open-source communities, such as for Java at Sun. We’ll be announcing some new community initiatives this summer. In the meantime, of course, we’ll continue with the Revolutions blog, which has proven to be a great vehicle for getting the word out about R to a community that hasn’t heard about it before. Thanks for the kind words about the blog, by the way — it’s been a lot of fun to write. It will be a continuing part of our community strategy, and I even plan to expand the roster of authors in the future, too. (If you’re an aspiring R blogger, please get in touch!)

Ajay- I kind of get confused about what exactly 32-bit or 64-bit computing is in terms of hardware and software. What is the deal there? How do Enterprise solutions from REvolution take care of 64-bit computing? How exactly do parallel computing and optimized math libraries in REvolution R help, as compared to other flavors of R?

David- Fundamentally, 64-bit systems allow you to process larger data sets with R, as long as you have a version of R compiled to take advantage of the increased memory available. (I wrote about some of the technical details behind this recently on the blog.) One of the really exciting trends I've noticed over the past six months is that R is being applied to larger and more complex problems in areas like predictive analytics and social networking data, so being able to process the largest data sets is key.
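A rough sketch of the ceiling David describes: a single numeric vector of 200 million elements needs about 1.6 GB of contiguous memory, which a 32-bit R process generally cannot allocate but a 64-bit build can.

# Roughly 1.6 GB for one vector: fails on most 32-bit builds of R.
x <- numeric(2e8)
print(object.size(x), units = "Mb")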

One common misperception is that 64-bit systems are inherently faster than their 32-bit equivalents, but this isn't generally the case. To speed up large problems, the best approach is to break the problem down into smaller components and run them in parallel on multiple machines. We created the ParallelR suite of packages to make it easy to break down such problems in R and run them on a multiprocessor workstation, a local cluster or grid, or even cloud computing systems like Amazon's EC2.
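The open foreach package, associated with ParallelR, captures this break-it-down idiom; here is a minimal sketch (foreach, doSNOW and snow assumed installed) in which the loop body stays the same whether the registered backend is a local workstation, a cluster, or EC2 nodes.

library(foreach)
library(doSNOW)

# Register a backend of 4 local workers; swapping the backend is the
# only change needed to move the same loop to a cluster or the cloud.
cl <- snow::makeCluster(4, type = "SOCK")
registerDoSNOW(cl)

# Fit the same model on 100 bootstrap resamples, in parallel.
fits <- foreach(i = 1:100) %dopar% {
  idx <- sample(nrow(mtcars), replace = TRUE)
  coef(lm(mpg ~ wt, data = mtcars[idx, ]))
}

snow::stopCluster(cl)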

“While the core R team produces versions of R for 64-bit Linux systems, they don't make one for Windows. Our development team spent more than six months making R work on 64-bit Windows (and optimizing it for speed), which we released as REvolution R Enterprise bundled with ParallelR. We're excited by the scale of the applications our subscribers are already tackling with a combination of 64-bit and parallel computing.”

Ajay- The command line is oh so commanding. Please describe any plans to support or help an R GUI like Rattle or R Commander. Do you think REvolution R can get more users if it does support a GUI?

David- Right now we’re focusing on making R easier to use for programmers by creating a new GUI for programming and debugging R code. We heard feedback from some clients who were concerned about training their programmers in R without a modern development environment available. So we’re addressing that by improving R to make the “standard” features programmers expect (like step debugging and variable inspection) work in R and integrating it with the standard environment for programmers on Windows, Visual Studio.

In my opinion, R's strength lies in its combination of high-quality statistical algorithms with a language ideal for applying them, so "hiding" the language behind a general-purpose GUI negates that strength a bit, I think. On the other hand, it would be nice to have an open-source, user-friendly tool for desktop statistical analysis, so I'm glad others are working to extend R in that area.

Ajay- Companies like SAS are investing in SaaS and cloud computing. Zementis offers scored models on the cloud through PMML. Any views on building the model or the analytics on the cloud itself?

David- To me, cloud computing is a cost-effective way of dynamically scaling hardware to the problem at hand. Not everyone has access to a 20-machine cluster for high-performance computing, and even those that do can't instantly convert it to a cluster of 100 or 1000 machines to satisfy a sudden spike in demand. REvolution R Enterprise with ParallelR is unique in that it provides a platform for creating sophisticated data analysis applications distributed in the cloud, quickly and easily.

Using clouds for building models is a no-brainer for parallel-computing problems: I recently wrote about how parallel backtesting for financial trading can easily be deployed on Amazon EC2, for example. PMML is a great way of deploying static models, but one of the big advantages of cloud computing is that it makes it possible to update your model much more frequently, to keep your predictions in tune with the latest source data.

Ajay- What are the major alliances that REvolution has in the industry?

David- We have a number of industry partners. Microsoft and Intel, in particular, provide financial and technical support allowing us to really strengthen and optimize R on Windows, a platform that has been somewhat underserved by the open-source community. With Sybase, we've been working on combining REvolution R and Sybase RAP to produce some exciting advances in financial risk analytics. Similarly, we've been doing work with Vhayu's Velocity database to provide high-performance data extraction. On the life sciences front, Pfizer is not only a valued client but in many ways a partner who has helped us "road-test" commercial-grade R deployment with great success.

Ajay- What are the major R packages that REvolution supports and optimizes, and how exactly do they work and help?

David- REvolution R works with all the R packages: in fact, we provide a mirror of CRAN so our subscribers have access to the truly amazing breadth and depth of analytic and graphical methods available in third-party R packages. Those packages that perform intensive mathematical calculations automatically benefit from the optimized math libraries that we incorporate in REvolution R Enterprise. In the future, we plan to work with the authors of some key packages to provide further improvements, in particular to make packages work with ParallelR to reduce computation times in multiprocessor or cloud computing environments.
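A crude way to see why packages benefit automatically, sketched in base R: heavy linear algebra calls down into BLAS/LAPACK, so a tuned math library speeds these operations up with no change to package code.

# Both operations spend nearly all their time inside BLAS/LAPACK.
a <- matrix(rnorm(2000 * 2000), nrow = 2000)
system.time(solve(a))   # LAPACK-bound matrix inversion
system.time(a %*% a)    # BLAS-bound matrix multiplication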

Ajay- Are you planning to lay off people during the recession? Does REvolution Computing offer internships to college graduates? What do people at REvolution Computing do to have fun?

David- On the contrary, we’ve been hiring recently. We don’t have an intern program in place just yet, though. For me, it’s been a really fun place to work. Working for an open-source company has a different vibe than the commercial software companies I’ve worked for before. The most fun for me has been meeting with R users around the country and sharing stories about how R is really making a difference in so many different venues — over a few beers of course!


David Smith
Director of Community

David has a long history with the statistical community. After graduating with a degree in Statistics from the University of Adelaide, South Australia, David spent four years researching statistical methodology at Lancaster University (United Kingdom), where he also developed a number of packages for the S-PLUS statistical modeling environment. David continued his association with S-PLUS at Insightful (now TIBCO Spotfire), where for more than eight years he oversaw the product management of S-PLUS and other statistical and data mining products. David is the co-author (with Bill Venables) of the tutorial manual An Introduction to R, and one of the originating developers of ESS: Emacs Speaks Statistics. Prior to joining REvolution, David was Vice President, Product Management at Zynchros, Inc.

Ajay- To know more about David Smith and REvolution Computing, do visit http://www.revolution-computing.com and http://www.blog.revolution-computing.com

Also see the interview with Richard Schultz, CEO of REvolution Computing, here:

http://www.decisionstats.com/2009/01/31/interviewrichard-schultz-ceo-revolution-computing/

More R please

Some R news

0) The R Foundation website. I guess the http://www.r-project.org team is busy prettifying before the annual R users conference kicks in. The website of www.r-project.org, I was told, has the aesthetic visual appeal of a dead cat splattered on the autobahn, a very HTML 4.0 kind of retro look.

I can't believe the R site and the R core honchos find the following image the prettiest one to represent the graphical abilities of R.

The R core site has tremendous functionality and demand, though I wonder if they could just put up some ads and get some funding or a two-way research tie-up with Google. Google uses R extensively, can help with online methods as well, and is listed as a supporting organization at http://www.r-project.org/foundation/memberlist.html

The R archives are a collection of emails, and that's not documentation at all.

1) The REvolution Computing website, and particularly David Smith's blog, is a great way to stay updated on R news at http://blog.revolution-computing.com/

I have covered REvolution R before, and they are truly impressive.

http://www.decisionstats.com/2009/01/31/interviewrichard-schultz-ceo-revolution-computing/

It seems the domain name revolutioncomputing.com was squatted (by NC?), so that's why the hyphenated web name. It is a very lucid website, though I do request them to put up more video/podcasts, and a Tweet This button would be great :))

And another, more techie, post here:

http://blog.revolution-computing.com/2009/05/verifying-zipfs-powerdistribution-law-for-cities.html

Another great source is Twitter: it seems that R users on Twitter use the hashtag #rstats to search for R news and code. That should help R bloggers, and at a later date, users.

Click here to check it out:

http://search.twitter.com/search?q=%23rstats

2) Some more R forums and sites

Forum for R Enterprise Users http://www.revolution-computing.com/forum

An R tips site http://onertipaday.blogspot.com/

The R Journal (yes, there is a journal for all hard-working R fans) http://journal.r-project.org/

R on Linkedin http://www.linkedin.com/groups?about=&gid=77616

and the Analytic Bridge community group for R

http://www.analyticbridge.com/group/rprojectandotherfreesoftwaretools

3) Here is a terrific post by Robert Grossman

at http://blog.rgrossman.com/2009/05/17/running-r-on-amazons-ec2/

I liked the way he built the case for using R on Amazon EC2 (a business case, not a use case) and then proceeded to a step-by-step tutorial: a simple and powerful blog post. I hope R comes out with a standardized online R doc, a single-point searchable archive for code, something like the SAS OnlineDoc (which remains free for WPS users 😉 ), but the way the web is evolving, it seems the present mish-mash method will continue.

Here are the main steps to use R on a pre-configured AMI.

Set-up.
The set-up needs to be done just once.

1. Set up an Amazon Web Services (AWS) account by going to:

aws.amazon.com.

If you already have an Amazon account for buying books and other items from Amazon, then you can use this account also for AWS.
2. Log in to the AWS console.
3. Create a “key-pair” by clicking on the link “Key Pairs” in the Configuration section of the navigation menu on the left-hand side of the AWS console page.
4. Click on the “Create Key Pair” button, about a quarter of the way down the page.
5. Name the key pair and save it to a working directory, say /home/rlg/work.

Launching the AMI. These steps are done whenever you want to launch a new AMI.

1. Login to the AWS console. Click on the Amazon EC2 tab.
2. Click the “AMIs” button under the “Images and Instances” section of the left navigation menu of the AWS console.
3. Enter “opendatagroup” in the search box and select the AMI labeled “opendatagroup/r-timeseries.manifest.xml”, which is AMI instance “ami-ea846283”.
4. Enter the number of instances to launch (1), the name of the key pair that you have previously created, and select “web server” for the security group. Click the launch button to launch the AMI. Be sure to terminate the AMI when you are done.
5. Wait until the status of the AMI is “running.” This usually takes about 5 minutes.

Accessing the AMI.

1. Get the public IP address of the new AMI. The easiest way to do this is to select the AMI by checking the box; this provides some additional information about the AMI at the bottom of the window. You can copy the IP address there.
2. Open a console window and cd to your working directory which contains the key-pair that you previously downloaded.
3. Type the command:
ssh -i testkp.pem -X root@ec2-67-202-44-197.compute-1.amazonaws.com

Here we assume that the name of the key-pair you created is “testkp.pem”. The flag “-X” starts a session that supports X11. If you don't have X11 on your machine, you can still log in and use R, but the graphics in the example below won't be displayed on your computer.

Using R on the AMI.

1. Change your directory and start R

#cd examples
#R
2. Test R by entering an R expression, such as:

> mean(1:100)
[1] 50.5
>
3. From within R, you can also source one of the example scripts to see some time series computations:

> source('NYSE.r')
4. After a minute or so, you should see a graph on your screen. After the graph is finished being drawn, you should see a prompt:

CR to continue

Enter a carriage return and you should see another graph. You will need to enter a carriage return 8 times to complete the script (you can also choose to break out of the script if you get bored with all the graphs).
5. When you are done, exit your R session with a Control-D. Exit your ssh session with an “exit” and terminate your AMI from the Amazon AWS console. You can also choose to leave your AMI running (it is only a few dollars a day).

Acknowledgements: Steve Vejcik from Open Data Group wrote the R scripts and configured the AMI.

Ajay- Terrific R companies, blogs, tweets, research and sites, but do let me know your feedback. Just un-other R day.

saP or saS or sasR or saaS

Some pending news and posts: it appears that the company SAP is moving closer to major acquisitions. This includes launching more and more applications that are analytical in nature, as well as coming together in an alliance with hardware major Teradata. Teradata, of course, is a very close partner of the SAS Institute. So could SAP, SAS and/or Teradata be moving closer to a major announcement on BI and BA merging?

The open source database movement around Hadoop is the one that can be the real game changer in the managed database industry, and AsterData is the company to watch here.

However, R with its modular extensions is a different paradigm in language development, and SAS no longer has the nimbleness or flexibility to create such apps. At the same time, it has lost a fair deal of credibility in young academia (due to R) as well as among cost-sensitive consumers (due to WPS).

The succession issue of Jim Goodnight continues to be the biggest problem for SAS Institute. Jim is not getting younger, and his second line is not expected to be of the same class as the Sall/Goodnight partnership. Of all the major companies in software, Jim Goodnight stood alone in remaining private, and thus managed to escape the distractions of share prices while building up the franchise. Surviving oil shocks, cold wars and three recessions, Mr Goodnight has cared for his local community as well, despite being active in SAS and fending off sustained attempts by open source languages.

An automatic partner for Mr Goodnight would have been Google, or even Google Labs, the Brin/Page duo being the top (commercial) data miners of this generation, as Sall/Goodnight were 30 years ago.

SAP may spend a lot of its cash, but the supply chain paradigm is best served by SaaS, as exemplified by Salesforce.com and Force.com developers.

As the ancient Chinese said: may you live in interesting times.

Mergers and Acquisitions: Analyzing Them

Valuation of future cash flows is an inexact science; too often it relies on flat historical numbers (we grew by 5% last year, so next year we will grow by 10%).

To add to the fun, there is the agency conflict: managers' priorities (in terms of cashing in stock options) differ from owners' priorities.

Here are some ways you can track companies for analysis:

1) Make a Google Alert on Company Name

2) Track whether there is a sudden and sustained spike in activity; the company may be on a road show seeking like-minded partners, investors or mergers.

3) Watch for a sudden drop in news alerts; it may mean radio silence, or the company may be in negotiations.

4) Watch how the company starts behaving with traditional antagonists…

The easiest words thrown into the melee are ethics, copyright violations or delayed payments.

I am pasting an extract featuring a noted and renowned analyst in the business intelligence field:

Curt Monash

His professional opinion on SAP:

SAP’s NetWeaver Business Warehouse software will soon run natively on Teradata’s database for high-end data warehousing and BI (business intelligence), the vendors announced Monday.

SAP and its BusinessObjects BI subsidiary already had partnerships and product integrations with Teradata. But the vendors’ many joint customers have been clamoring for more, and native Business Warehouse support is the answer, said Tim Lang, vice president of product management for Business Objects.

SAP expects the new capability to enter beta testing in the fourth quarter of this year, with general availability in the first quarter of 2010, according to a spokesman.

Under the partnership, SAP will be handling first-line support, according to Lang. Pricing was not available.

The announcement drew a skeptical response from analyst Curt Monash of Monash Research, who questioned how deeply SAP will be committed to selling its customers on Teradata versus rival platforms.

“Business Objects has long been an extremely important partner for Teradata. But SAP's most important DBMS partner is and will long be IBM, simply because [IBM] DB2 is not Oracle,” Monash said.

Credit-

http://www.infoworld.com/d/data-management/sap-and-teradata-deepen-data-warehousing-ties-088

And here are some of Curt Monash's personal views on SAP:

Typical nonsense from SAP

Below, essentially in its entirety, is an e-mail I just received from SAP, today, January 3. (Emphasis mine.)

Thank you for attending SAP's 4th Annual Analyst Summit in Las Vegas. We hope you found the time to be valuable. To ensure that we continue meeting your informational needs, please take a few moments to complete our online survey by using the link below. We ask that you please complete the survey before December 20. We look forward to receiving your feedback.

What makes this typical piece of SAP over-organization particularly amusing is that I didn't actually attend the event. I was planning to, but after considerable effort I think I finally made it clear to VP of Analyst Relations Don Bulmer that I was fed up with being lied to* by him and his colleagues. In connection with that, we came to a mutual agreement, as it were, that I wouldn't go.

*and lied about

Obviously, administrative ineptitude and dishonesty are two very different matters, united only by the fact that they both are characteristics of SAP, particularly its analyst relations group. Having said that, I should hasten to add that there are plenty of people at SAP I still trust. If Peter Zencke or Lothar Schubert tells me something, I expect it to be true. And it's not just Germans; I feel the same way about Dan Rosenberg or Andrew Cabanski-Dunning, to name just a couple of non-German SAP guys.

But I have to say this: both SAP's ethics and its internal business processes are sufficiently screwed up as to cast doubt on SAP's qualifications to run the world's best-run businesses.

Source:

http://www.monashreport.com/2007/01/03/sap-nonsense-ethics/

Journalism ethics, of course, makes sure that journalists don't get remuneration, or must compulsorily declare benefits openly. This is not true for online journalism, as it is still evolving.

Curt Monash is the granddaddy of all business intelligence journalists; he has been doing this, and has seen it all, since 1981 (I was 4 years old then).

Almost incorruptible, and therefore much respected, his Monash Report remains closely watched.

Some techniques to thwart business intelligence journalists are, of course, the tactics of

1) Fear

2) Uncertainty

3) Doubt

by planting false leaks, or by favoring more pliable journalists over the ones who ask difficult questions.

Another way is to use search engine optimization so that Google search is rendered ineffective for difficult journalists, making it harder for people to read them.

Why did I start this thread?

Well, it seems the business intelligence world is coming to a round of consolidations and mergers. So will the trend of mega vendors, first mentioned by M. Fauscette here, lead to a trend of mega journalist agencies as well, like a Fox News for all business intelligence journalists to report to and get a share of the booty?

The Business Intelligence companies have long viewed analyst relationships as an unnecessary and uncontrollable marketing channel which they would like to see evolve.

Television ratings can be manipulated for advertising; similarly, views, page views and clicks on a website can be manipulated for website advertising. The catch is that Google Trends may just give the actual picture away, but you can lie low by choosing not to submit or ping Google during the initial days and then, once the website is big enough in terms of viewers or contributing bloggers, safely ping Google, as the momentum of getting bigger and bigger would be inertial by then.

http://www.mfauscette.com/software_technology_partn/2009/05/the-emergence-of-the-mega-tech-vendor-economy.html

Here are some facts, company by company:

1) For SAS Institute

a) WPS is launching its desktop software, which enables SAS language users to migrate seamlessly at 1/10th of the cost of SAS Base and SAS Stat. It will include Proc Reg and Proc Logistic, and will have huge documentation.

b) R, the open source software, is increasingly powerful for manipulating data. SAS/IML has tried offering a hand of peace, but they would need to reconcile with the GPL conditions for R: if it is a plugin, the source code is open, and so on.

c) Inference for R may be acquired by SAS to get a limited-liability stake in an R-based user platform.

d) Traditional rival SPSS (the two have duked it out in analytics for 40 years) has a much better GUI and has launched a revamped brand, PASW. They are no longer distracted by a lawsuit, which curiously accused them of stock manipulation; they were found innocent.

e) Jim Goodnight has been dominating the industry since 1975 and has managed to stay private despite three recessions and huge inducements (a wise move given the mess in the markets in 2008). After Jim, who will lead SAS with as much wisdom is an open question. Jim refused Microsoft some years back and is still very much in command; despite being isolated in terms of industry alliances, he remains respected. Pressure on him to rush into a merger may just backfire.

f) The politics of envy: SAS is hated by many analytics people, just as in some corners people hate America, because it is number 1 and has been there too long. Did you mention anti-trust investigations? Well, WPS is based out of the UK, and the European Union takes competition much more seriously.

g) Long-time grudges: SAS is disliked despite its substantial R&D investments and the care it takes of its employees and local community. Naturally, people who are excluded, or were excluded at some point in time, have resentments.

h) SAS's ambitions in business intelligence, where curiously it is not that expensive and is actually more efficient than other players. The recent salvo fired by Jim Davis, declaring business analytics better than business intelligence, was a remark much resented by the cricket-loving British journalist Peter J. Thomas.

http://peterthomas.wordpress.com/category/business-intelligence/sas-bi-ba-controversy/

Intellectuals can carry huge grudges for decades (Newton and Leibniz), or, in my case, with people who delay my interviews.

Teradata

1) Teradata has been a big partner of both SAS and SAP. It has also been losing ground recently, in the same scenario SAS will shortly face.

It was also spun off in 2007-08 by its parent company, NCR.

http://it.toolbox.com/blogs/infosphere/against-the-flow-ncr-unacquires-teradata-13842

So will SAS buy Teradata?

Will SAP buy Teradata?

Will SAS merge with Teradata and be acquired by SAP, while reaching a compromise with both WPS and the R Project?

Or will SAS call the bluff: make sincere efforts to reconcile with the GPL and academic community, give away multiple SAS Base and SAS Stat licenses to colleges and universities (in Asia, India, China) by expanding their academic program globally, start offering more coverage to JMP at a reduced price, and make a trust for succession?

I don't know. All I know is I like writing code and poetry. Any code that gets the job done.

Any poem that I want to write (see the Scribd books on the right).

R or SAS? Or R and SAS?

http://support.sas.com/rnd/app/studio/Rinterface2.html

R Interface Coming to SAS/IML Studio

While readers of the New York Times may have learned about R in recent weeks, it’s not news to many at SAS.

“R is a leading language for developing new statistical methods,” said Bob Rodriguez, Senior Director of Statistical Development at SAS. “Our new PhD developers learned R in their graduate programs and are quite versed in it.”

R is a matrix-based programming language that allows you to program statistical methods reasonably quickly. It’s open source software, and many add-on packages for R have emerged, providing statisticians with convenient access to new research. Many new statistical methods are first programmed in R.
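As a small illustration of what "matrix-based" means in practice (my sketch, not from the SAS announcement), here is ordinary least squares written directly from the normal equations in a few lines of base R:

# Design matrix with an intercept column, using a built-in data set.
X <- cbind(1, as.matrix(mtcars[, c("wt", "hp")]))
y <- mtcars$mpg

# Solve the normal equations (X'X) beta = X'y directly.
beta <- solve(t(X) %*% X, t(X) %*% y)
beta

# The coefficients agree with the built-in fitting function:
coef(lm(mpg ~ wt + hp, data = mtcars))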

While SAS is committed to providing the new statistical methodologies that the marketplace demands, and will deliver new work more quickly with the recent decoupling of the analytical product releases from Base SAS, a commercial software vendor can only put out new work so fast. And never as fast as a professor and a grad student writing an academic implementation of brand-new methodology.

Both R and SAS are here to stay, and finding ways to make them work better with each other is in the best interests of our customers.

“We know a lot of our users have both R and SAS in their tool kit, and we decided to make it easier for them to access R by making it available in the SAS environment,” said Rodriguez. “Our first interface to R will be in an upcoming version of SAS/IML Studio (currently known as SAS Stat Studio), scheduled for this summer.”

The SAS/IML Studio interface allows you to integrate R functionality with IML or SAS programs. You can also exchange data between SAS and R as data sets or matrices.

“This is just the first step,” said Radhika Kulkarni, Vice President of Advanced Analytics. “We are busy working on an R interface that can be surfaced in the SAS server or via other SAS clients. For example, users will be able to interface with R through the IML procedure, possibly as soon as the first part of 2010.”

SAS/IML Studio is distributed with SAS/IML software. Stay tuned for details on availability.

 

This is not to be conflated with the recent announcement by Mr Gentleman, one of the inventors of the R language, that if needed they will take legal action if the terms of Creative Commons licensing are not enforced.

It is a sad day for science when Gentleman professors are issuing mild legal threats just to make sure some pseudo-science people are satisfied in their intellectual hubris, even though they themselves innovated R from the S language. REvolution Computing does not want to be like the commercial maker of S-Plus, so they are supporting this legal position. It is a sad day when lawyers have to enforce code sharing. Maybe the R Project should start updating their website, which looks like a wreck across the autobahn. Maybe Jim should visit the R users conference so the R Core team can see his horns.

Newton sued Leibniz and, in the last days of his life, was tasked with enforcing a paper currency, which he did rigorously. Good for the world's currency, bad for science.