Interview- Top Data Mining Blogger on Earth, Sandro Saitta

If you do a Google search for “data mining blog”, one blog has come out on top for the past several years (data mining blog – Google Search: http://bit.ly/kEdPlE).

To honor 5 years of Sandro Saitta’s blog (yes, that’s 5 years!), we present an exclusive interview with him in which he reveals his unique sauce for cool techie blogging.

Ajay- Describe your journey as a scientist and data miner, from early experiences, to schooling to your work/research/blogging.

Sandro- My first experience with data mining was my master’s project. I used a decision tree to predict pollen concentration for the following week using input data such as wind, temperature and rain. The fact that an algorithm can make a computer learn from experience was really amazing to me. I found it so interesting that I started a PhD in data mining. This time, the field of application was civil engineering. Civil engineers put a lot of sensors on their structures in order to understand how they behave. With all these sensors they generate a lot of data. To interpret these data, I used data mining techniques such as feature selection and clustering. I started my blog, Data Mining Research, during my PhD, to share with other researchers.

I then started applying data mining to the stock market in my first job in industry. I realized the difference between image recognition, where a 99% correct classification rate is state of the art, and the stock market, where you’re happy with 55%. However, the company ambiance was not as good as I thought, so I moved to consulting. There, I applied data mining in behavioral targeting to increase click-through rates. When you compare the number of customers who click with the ones who don’t, then you really understand what class imbalance means. A few months ago, I accepted a very good opportunity at SICPA. I’m looking forward to resolving new challenges there.

Ajay- Your blog is the top-ranked blog for “data mining blog”. Could you share some tips on better blogging for analytics and technical people?

Sandro- It’s always difficult to start a blog, since at the beginning you have no reader. Writing for nobody may seem stupid, but it is not. By writing my first posts during my PhD I was reorganizing my ideas. I was expressing concepts which were not always clear to me. I thus learned a lot and also improved my English level. Of course, it’s still not perfect, but I hope most people can understand me.

Next come the readers. A few dozen each week at first. To increase this number, I then started to learn SEO (Search Engine Optimization) by reading books and blogs. I tested many techniques that increased the visibility of Data Mining Research in the blogosphere. I think SEO is interesting when you already have some content published (which means not at the very beginning of your blog). After a while, once your blog is nicely ranked, the main task is to work on the content of the blog. To be of interest, your content must be distinctive: original, informative or provocative, for example. I was also fortunate to gain good visibility thanks to well-known people in the field like Kevin Hillstrom, Gregory Piatetsky-Shapiro, Will Dwinnell / Dean Abbott, Vincent Granville, Matthew Hurst and many others.

Ajay- What’s your favorite statistical software, and what are the various software packages that you have worked with?
Could you compare and contrast them as well?

Sandro- My favorite software at this point is SAS. I worked with it for two years. Once you know the language, you can perform ETL and data mining so easily. It’s also very fast compared to others. There are a lot of tools for data mining, but I cannot think of a tool that is as powerful as SAS and, at the same time, has a high-level programming language behind it.

I also worked with R and Matlab. R is very nice since you have all the up-to-date data mining algorithms implemented. However, working in memory is not always a good choice, especially for ETL. Matlab is an excellent tool for prototyping. It’s not so fast and certainly not made for ETL, but the price is low considering all the possibilities for data mining. In my opinion, SAS is the best choice for ETL and a good choice for data mining. Of course, there is the price.

Ajay- What are your favorite techniques and training resources for teaching the basics of data mining to, say, statisticians or business management graduates?

Sandro- I’m the kind of guy who likes to read books. I read data mining books one after the other. The fact that the same concepts are explained differently (and by different people) helps a lot in learning a topic like data mining. Of course, nothing replaces experience in the field. You can read hundreds of books, but you will still not be a good practitioner until you really apply data mining in specific fields. My second choice after books is blogs. By reading data mining blogs, you will really see the issues and challenges in the field. It’s still not experience, but we are closer. Finally, there are web resources and networks such as KDnuggets, of course, but also AnalyticBridge and LinkedIn.

Ajay- Describe your hobbies and how they help you, if at all, in your professional life.

Sandro- One of my hobbies is reading. I read a lot of books about data mining, SEO, Google as well as Sci-Fi and Fantasy. I’m a big fan of Asimov by the way. My other hobby is playing tennis. I think I simply use my hobbies as a way to find equilibrium in my life. I always try to find the best balance between work, family, friends and sport.

Ajay- What are your plans for your website for 2011-2012?

Sandro- I will continue to publish guest posts and interviews. I think it is important to let other people express themselves about data mining topics. I will not write about my current applications due to the policies of my current employer. But don’t worry, I still have a lot to write, whether it is technical or not. I will also put more emphasis on my experience with data mining, advice for data miners, tips and tricks, and of course book reviews!

Standard Disclosure of Blogging- Sandro awarded me the People’s Choice award for his blog for 2010 and carried out my interview. There is a lot of love between our respective WordPress blogs, but to reassure our puritan American readers- it is platonic and intellectual.

About Sandro S-



Sandro Saitta is a Data Mining Research Engineer at SICPA Security Solutions. He is also a blogger at Data Mining Research (www.dataminingblog.com). His interests include data mining, machine learning, search engine optimization and website marketing.

You can contact Mr Saitta at his Twitter address- 

https://twitter.com/#!/dataminingblog

High Performance Analytics

Marry Big Data Analytics to High Performance Computing, and you get the buzzword of this season- High Performance Analytics.

It basically consists of parallelized code running on custom hardware, in-database analytics for speed, and cloud computing/high performance computing environments. On an operational level, it consists of software (as in analytics) partnering with software (as in databases, MapReduce, Hadoop) plus some hardware (HP or IBM mostly). It is considered a high-margin, highly profitable business with a small number of deals compared to, say, desktop licenses.

As per HPC Wire, which is a great tool/newsletter for keeping up to date on HPC, SAS Institute has been busy on this front, partnering with EMC Greenplum and Teradata (which also acquired SAS partner Aster Data to gain a much-needed foothold in the MR/SQL space).

Interview Anne Milley JMP

Here is an interview with Anne Milley, a notable thought leader in the world of analytics. Anne is now Senior Director, Analytical Strategy in Product Marketing for JMP, the leading data visualization software from the SAS Institute.

Ajay- What do you think are the top 5 unique selling points of JMP compared to other statistical software in its category?

Anne-

JMP combines incredible analytic depth and breadth with interactive data visualization, creating a unique environment optimized for discovery and data-driven innovation.

With an extensible framework using JSL (JMP Scripting Language), and integration with SAS, R, and Excel, JMP becomes your analytic hub.

JMP is accessible to all kinds of users. A novice analyst can dig into an interactive report delivered by a custom JMP application. An engineer looking at his own data can use built-in JMP capabilities to discover patterns, and a developer can write code to extend JMP for herself or others.

State-of-the-art DOE capabilities make it easy for anyone to design and analyze efficient experiments to determine which adjustments will yield the greatest gains in quality or process improvement – before costly changes are made.

Not to mention, JMP products are exceptionally well designed and easy to use. See for yourself and check out the free trial at www.jmp.com.

Download a free 30-day trial of JMP.

Ajay- What are the challenges and opportunities of expanding JMP’s market share? Do you see JMP expanding its conferences globally to engage global audiences?

Anne-

We realized solid global growth in 2010. The release of JMP Pro and JMP Clinical last year along with continuing enhancements to the rest of the JMP family of products (JMP and JMP Genomics) should position us well for another good year.

With the growing interest in analytics as a means to sustained value creation, we have the opportunity to help people along their analytic journey – to get started, take the next step, or adopt new paradigms speeding their time to value. The challenge is doing that as fast as we would like.

We are hiring internationally to offer even more events, training and academic programs globally.

Ajay- What are the current and proposed educational and global academic initiatives of JMP? How can we see more JMP in universities across the world (say India- China etc)?

Anne-

We view colleges and universities both as critical incubators of future JMP users and as places where attitudes about data analysis and statistics are formed. We believe that a positive experience in learning statistics makes a person more likely to eventually want and need a product like JMP.

For most students – and particularly for those in applied disciplines of business, engineering and the sciences – the ability to make a statistics course relevant to their primary area of study fosters a positive experience. Fortunately, there is a trend in statistical education toward a more applied, data-driven approach, and JMP provides a very natural environment for both students and researchers.

Its user-friendly navigation, emphasis on data visualization and easy access to the analytics behind the graphics make JMP a compelling alternative to some of our more traditional competitors.

We’ve seen strong growth in the education markets in the last few years, and JMP is now used in nearly half of the top 200 universities in the US.

Internationally, we are at an earlier stage of market development, but we are currently working with both JMP and SAS country offices and their local academic programs to promote JMP. For example, we are working with members of the JMP China office and faculty at several universities in China to support the use of JMP in the development of a master’s curriculum in Applied Statistics there, touched on in this AMSTAT News article.

Ajay- What future trends do you see for 2011 in this market (say top 5)?

Anne-

Growing complexity of data (text, image, audio…) drives the need for more and better visualization and analysis capabilities to make sense of it all.

More “chief analytics officers” are making better use of analytic talent – people are the most important ingredient for success!

JMP has been on the vanguard of 64-bit development, and users are now catching up with us as 64-bit machines become more common.

Users should demand easy-to-use, exploratory and predictive modeling tools as well as robust tools to experiment and learn to help them make the best decisions on an ongoing basis.

All these factors and more fuel the need for the integration of flexible, extensible tools with popular analytic platforms.

Ajay- You enjoy organic gardening as a hobby. How do you think hobbies and unwind time help people be better professionals?

Anne-

I am lucky to work with so many people who view their work as a hobby. They have other interests too, though, some of which are work-related (statistics is relevant everywhere!). Organic gardening helps me put things in perspective and be present in the moment. More than work defines who you are. You can be passionate about your work as well as passionate about other things. I think it’s important to spend some leisure time in ways that bring you joy and contribute to your overall wellbeing and outlook.

Btw, nice interviews over the past several months—I hadn’t kept up, but will check it out more often!

Biography–  Source- http://www.sas.com/knowledge-exchange/business-analytics/biographies.html

  • Anne Milley

    Anne Milley is Senior Director of Analytics Strategy at JMP Product Marketing at SAS. Her ties to SAS began with bank failure prediction at Federal Home Loan Bank Dallas and continued at 7-Eleven Inc. She has authored papers and served on committees for F2006, KDD, SIAM, A2010 and several years of SAS’ annual data mining conference. Milley is a contributing faculty member for the International Institute of Analytics. anne.milley@jmp.com

The Latest GUI for R- Bio7

Once more a spanking new shiny software –

Bio7 is an integrated development environment for ecological modelling based on the Rich-Client-Platform concept of the Java IDE Eclipse. The Bio7 platform contains several perspectives which arrange several views for a special purpose useful for the development and analysis of ecological models. One special perspective bundles a feature-rich GUI (Graphical User Interface) for the statistical software R.
For the bidirectional communication between Java and R the Rserve application is used (as a backend to evaluate R code and transfer data from and to Java).
The Bio7 R perspective (see figure below) is divided into an R-Shell view on the left side (conceptually the R side) and a Table view on the right side (conceptually the Java side).
Data can be imported to a spreadsheet, edited and then transferred to the R workspace. Vice versa, data from R can be transferred to a sheet of the Table view and then exported e.g. to an Excel or OpenOffice file.

and

General:

Built upon Eclipse 3.6.1.

Now works with the latest Java version! (Windows version bundled with the latest JRE release).

Removed the Soil perspective (now soils can be modeled with ImageJ (float precision)). Active images can be displayed in the 3D discrete view (new example available).

Removed the database perspective and the plant layer. You can now build any discrete models without any plant layer.

Removed several controls in the Control view. Added the “Custom Controls” view. In addition ported the Swing component of the Time panel to Swt.

Deleted the avi to swf converter in the ImageJ menu.

Now patterns can be saved with opened Java editor source. If this file is reopened and dragged on Bio7 the pattern is loaded, the source is compiled and the setup method (if available) is executed. In this way model files can be used for presentations -> drag, setup and run. The save actions are located in the Spreadsheet view toolbar.

More options available to disable panel painting and recording of values (if not needed for speed!).

New Setup button in the toolbar of Bio7 to trigger a compiled setup method if available.

Removed the load and save pattern buttons from the toolbar of Bio7. Discrete patterns can now be stored with the available action in the spreadsheet view menu.

New P2 Update Manager available in Bio7.

Updated the Janino Compiler.

New HTML perspective added with a view which embeds the TinyMCE editor.

New options to disable painting operations for the discrete panels.

New option to explicitly enable scripts at startup (for a faster startup).

Quadgrid (Hexgrid)

Only states are now available which can be created in the “Spreadsheet” view menu easily. Patterns can be stored and restored as usual but are now stored in an *.exml file.

New method to transfer the quadgrid pattern as a matrix to R.

New method to transfer the population data of all quadgrid states to R.

ImageJ:

Update to the latest version (with additional fixes).

Fixed a bug to rename the image.

Thumbnail browser can now open images recursively (limited to 1000 pics); the magnifying glass can be disabled, too.

Plugins can be installed dynamically with a drag and drop operation on the ImageJ view or toolbar (as known from ImageJ).

Installed plugins now extend the plugin menu as submenus or subsubmenus (not finished yet!).

Plugins can now be created with the Java editor. New Bio7 Wizard available to create a plugin template.

Compiled Java files can be added to a *.jar file with a new available action in the Navigator view (if you rightclick on the files in the Navigator). In this way ImageJ plugins can be packaged in a *.jar.

Floweditor:

Fixed a repaint bug in the debug mode of a flow (now draws correctly the active shape in the flow).

Resize with Ctrl+Scrollwheel works again.

Comments with more than one line work again.

New Test action to verify connections in a flow.

Debug mode now shows all executed Shapes.

Integrated more default tests (for the verification of a regular flow).

A mouse-click now deletes colored shapes in a flow (e.g. in debug mode).

Points panel:

Integrated (dynamic) Voronoi, Delaunay visualization (with area and clip to rectangle action).

Points coordinates can now be set in double precision.

Transfer of point coordinates to R now in double precision.

Bio7 Table:

New import and export of Excel 2007 OOXML.

Row headers can now be resized with the mouse device.

R:

Updated R (2.12.1) and Rserve (0.6.3) to the latest version.

New help action in the R-Shell view.

New action to display help for R specific commands in the embedded Bio7 browser (which opens automatically).

New Key actions to copy the selected variable names to the expression dialog (c=concatenate (+), a=add (,)).

New action to transfer character or numeric vectors horizontally or vertically in an opened spread (Table view) at selection coordinates.

Empty spaces in the file path are now allowed under Windows if Rserve is started with a system shell or the RGUI is started (for the tempfile, select a writeable location in the Preferences dialog). This also works for the RGUI action.

Improved the search for the “Install packages” action (option “Case Sensitive” added).

API:

New API methods available!

And:

Many fixes since the last version!

 

Installation

Important information:

A certain firewall software can corrupt the Bio7 *.zip file (as well as other files).
Please ensure that you have downloaded a functioning Bio7 1.5 version. In addition it is also reported that a certain antivirus software detects the bundled R software (on Windows) as malware. Often the R specific “open.exe” is detected as malware. Please use a different scanner to make sure that the software is not infected if you have any doubts. For more details see:

http://r.789695.n4.nabble.com/trojan-at-current-development-version-td3244348.html

 

Multi State Models

A special issue of the Journal of Statistical Software has come out devoted to Multi State Models and Competing Risks. It is a must-read for anyone with an interest in Pharma Analytics or Survival Analysis- even if you don’t know much R.

Here is an extract from “mstate: An R Package for the Analysis of Competing Risks and Multi-State Models”

Multi-state models are a very useful tool to answer a wide range of questions in survival analysis that cannot, or only in a more complicated way, be answered by classical models. They are suitable for both biomedical and other applications in which time-to-event variables are analyzed. However, they are still not frequently applied. So far, an important reason for this has been the lack of available software. To overcome this problem, we have developed the mstate package in R for the analysis of multi-state models. The package covers all steps of the analysis of multi-state models, from model building and data preparation to estimation and graphical representation of the results. It can be applied to non- and semi-parametric (Cox) models. The package is also suitable for competing risks models, as they are a special category of multi-state models.
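
To make the idea of competing risks as a small multi-state model more concrete, here is a minimal Python sketch of the standard nonparametric cumulative incidence estimator. It is not the mstate API (mstate is an R package); the function name `cumulative_incidence`, the toy data and the convention that cause 0 means "censored" are all assumptions made for illustration.

```python
from collections import Counter

def cumulative_incidence(times, causes):
    """Nonparametric cumulative incidence per cause.

    times  -- observed time for each subject
    causes -- 0 if censored (assumed convention), otherwise a cause label
    Returns {cause: [(time, cumulative incidence), ...]}.
    """
    event_causes = sorted(set(c for c in causes if c != 0))
    cif = {c: 0.0 for c in event_causes}
    curves = {c: [] for c in event_causes}
    surv = 1.0               # probability of no event of any cause so far
    at_risk = len(times)
    for t in sorted(set(times)):
        here = Counter(c for ti, c in zip(times, causes) if ti == t)
        n_events = sum(n for c, n in here.items() if c != 0)
        if n_events > 0:
            for c in event_causes:
                # Chance of surviving event-free up to t, then failing from cause c.
                cif[c] += surv * here.get(c, 0) / at_risk
                curves[c].append((t, cif[c]))
            surv *= 1.0 - n_events / at_risk
        # Events and censored subjects both leave the risk set.
        at_risk -= sum(here.values())
    return curves

# Hypothetical toy data: six subjects, two competing causes, one censored.
times = [1, 2, 3, 4, 5, 6]
causes = [1, 2, 0, 1, 2, 1]
curves = cumulative_incidence(times, causes)
final = {c: points[-1][1] for c, points in curves.items()}
print(final)
```

Each cause is one absorbing state reachable from a single initial state, which is why competing risks fit into the multi-state framework. For real analyses, the R packages listed in the special issue are the tools to use.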

 

—————————–

 

Issues for JSS Special Volume 38: Competing Risks and Multi-State Models

Special Issue about Competing Risks and Multi-State Models

Hein Putter
Vol. 38, Issue 1, Jan 2011
Submitted 2011-01-03, Accepted 2011-01-03

Analyzing Competing Risk Data Using the R timereg Package

Thomas H. Scheike, Mei-Jie Zhang
Vol. 38, Issue 2, Jan 2011
Submitted 2009-05-25, Accepted 2010-06-22

p3state.msm: Analyzing Survival Data from an Illness-Death Model

Luís Filipe Meira Machado, Javier Roca-Pardiñas
Vol. 38, Issue 3, Jan 2011
Submitted 2009-06-30, Accepted 2010-03-02

Empirical Transition Matrix of Multi-State Models: The etm Package

Arthur Allignol, Martin Schumacher, Jan Beyersmann
Vol. 38, Issue 4, Jan 2011
Submitted 2009-01-08, Accepted 2010-03-11

Lexis: An R Class for Epidemiological Studies with Long-Term Follow-Up

Martyn Plummer, Bendix Carstensen
Vol. 38, Issue 5, Jan 2011
Submitted 2010-02-09, Accepted 2010-09-16

Using Lexis Objects for Multi-State Models in R

Bendix Carstensen, Martyn Plummer
Vol. 38, Issue 6, Jan 2011
Submitted 2010-02-09, Accepted 2010-09-16

mstate: An R Package for the Analysis of Competing Risks and Multi-State Models

Liesbeth C. de Wreede, Marta Fiocco, Hein Putter
Vol. 38, Issue 7, Jan 2011
Submitted 2010-01-17, Accepted 2010-08-20

Multi-State Models for Panel Data: The msm Package for R

Christopher Jackson
Vol. 38, Issue 8, Jan 2011
Submitted 2009-07-21, Accepted 2010-08-18


 

Choosing R for business – What to consider?


Additional features in R over other analytical packages-

1) Source code is provided, which allows completely custom solutions and embedding for a particular application. Open source code has the advantage that it is extensively peer-reviewed in journals and scientific literature. This means bugs will be found, shared and corrected transparently.

2) A wide range of training material in the form of books is available for the R analytical platform.

3) Arguably the best data visualization tools in analytical software (apart from Tableau Software’s latest version). The extensive data visualization available in R takes the form of a variety of customizable graphs, as well as animation. The principal reason third-party software initially started creating interfaces to R is that the graphical library of packages in R is more advanced and is rapidly gaining more features by the day.

4) Free in upfront license cost for academics and thus budget friendly for small and large analytical teams.

5) Flexible programming for your data environment. This includes having packages that ensure compatibility with Java, Python and C++.

 

6) Easy migration from other analytical platforms to R Platform. It is relatively easy for a non R platform user to migrate to R platform and there is no danger of vendor lock-in due to the GPL nature of source code and open community.

Statistics are numbers that tell (descriptive), advise (prescriptive) or forecast (predictive). Analytics is a decision-making aid. Analytics on which no decision is to be made or is being considered can be classified as purely statistical and non-analytical. Thus the ease of making a correct decision separates a good analytical platform from a not-so-good one. The distinction is likely to be disputed by people of either background, but business analysis requires more emphasis on how practical or actionable the results are and less emphasis on the statistical metrics in a particular data analysis task. I believe one clear reason business analytics differs from statistical analysis is the cost of perfect information (data costs in the real world) and the opportunity cost of delayed and distorted decision-making.
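
The cost of perfect information has a standard decision-theory counterpart, the expected value of perfect information (EVPI): the gap between the expected payoff when the best action is chosen knowing the true state and the expected payoff of the best action chosen without that knowledge. The short Python sketch below works through it; the payoff table, probabilities and names are hypothetical and only illustrate the arithmetic.

```python
def expected_value_of_perfect_information(payoffs, probs):
    """payoffs[action][state] -> payoff; probs[state] -> probability."""
    # Deciding now: pick the single action with the highest expected payoff.
    best_without = max(
        sum(probs[s] * row[s] for s in probs) for row in payoffs.values()
    )
    # With perfect information: pick the best action separately in each state.
    best_with = sum(
        probs[s] * max(row[s] for row in payoffs.values()) for s in probs
    )
    return best_with - best_without

# Hypothetical business decision with two actions and two market states.
payoffs = {
    "launch campaign": {"high demand": 100.0, "low demand": -40.0},
    "do nothing": {"high demand": 0.0, "low demand": 0.0},
}
probs = {"high demand": 0.5, "low demand": 0.5}
evpi = expected_value_of_perfect_information(payoffs, probs)
print(evpi)
```

In this toy case, extra data collection is only worth paying for if it costs less than the EVPI, and any payoff lost while waiting for the data counts against it as well.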

Specific to the following domains R has the following costs and benefits

  • Business Analytics
    • R is free per license and for download
    • It is one of the few analytical platforms that work on Mac OS
    • Its results are credibly established both in journals like the Journal of Statistical Software and in the work of the analytical teams at LinkedIn, Google and Facebook.
    • It has open source code for customization as per the GPL
    • It also has flexible options from commercial vendors like Revolution Analytics (who support 64-bit Windows as well as bigger datasets)
    • It has interfaces from almost all other analytical software, including SAS, SPSS, JMP, Oracle Data Mining and RapidMiner. Existing license holders can thus invoke and use R from within these software packages
    • Huge library of packages for regression, time series, finance and modeling
    • High quality data visualization packages
  • Data Mining
    • R as a computing platform is better suited to the needs of data mining as it has a vast array of packages covering standard regression, decision trees, association rules, cluster analysis, machine learning, neural networks as well as exotic specialized algorithms like those based on chaos models.
    • Flexibility in tweaking a standard algorithm by seeing the source code
    • The RATTLE GUI remains the standard GUI for data miners using R. It was created and developed in Australia.
  • Business Dashboards and Reporting
    • Business dashboards and reporting are an essential piece of business intelligence and decision-making systems in organizations. R offers data visualization through ggplot2, and GUIs like Deducer and Red-R can help even non-R users create a metrics dashboard
    • For online dashboards, R has packages like Rweb, Rserve and rApache, which in combination with data visualization packages offer powerful dashboard capabilities.
    • R can be combined with MS Excel using the RExcel package, which makes R capabilities available within Excel. Thus an MS Excel user with no knowledge of R can use the GUI within the RExcel plug-in to access powerful graphical and statistical capabilities.

Additional factors to consider in your R installation-

There are some more choices awaiting you now-
1) Licensing Choices- Academic Version or Free Version or Enterprise Version of R

2) Operating System Choices- Which operating system to choose? Unix, Windows or Mac OS.

3) Operating system sub choice- 32 bit or 64 bit.

4) Hardware choices- Cost-benefit trade-offs for additional hardware for R. Choices between local, cluster and cloud computing.

5) Interface choices- Command line versus GUI? Which GUI to choose as the default start-up option?

6) Software component choice- Which packages to install? There are almost 3000 packages; some of them are complementary, some are dependent on each other, and almost all are free.

7) Additional Software choices- Which additional software do you need to achieve maximum accuracy, robustness and speed of computing, and how to use existing legacy software and hardware for best complementary results with R.

1) Licensing Choices-
You can choose between two kinds of R installations. One is free and open source, from http://r-project.org. The other is commercial and is offered by many vendors, including Revolution Analytics, although there are other commercial vendors too.

Commercial Vendors of R Language Products-
1) Revolution Analytics http://www.revolutionanalytics.com/
2) XL Solutions- http://www.experience-rplus.com/
3) Information Builders – Webfocus RStat -Rattle GUI http://www.informationbuilders.com/products/webfocus/PredictiveModeling.html
4) Blue Reference- Inference for R http://inferenceforr.com/default.aspx

  2. Choosing Operating System
      1. Windows

 

Windows remains the most widely used operating system on this planet. If you are experienced in Windows-based computing and are active on analytical projects, it would not make sense for you to move to other operating systems. This is also based on the fact that compatibility problems are minimal for Microsoft Windows and the help is extensively documented. However, there may be some R packages that do not function well under Windows; if that happens, a multiple operating system setup is your next option.

        1. Enterprise R from Revolution Analytics- Enterprise R from Revolution Analytics has a complete R development environment for Windows, including the use of code snippets to make programming faster. Revolution is also expected to make a GUI available by 2011. Revolution Analytics claims several enhancements for its version of R, including the use of optimized libraries for faster performance.
      2. MacOS

 

The main reason for choosing MacOS remains its considerable appeal in aesthetically designed software, but MacOS is not a standard operating system for enterprise systems or for statistical computing. However, open source R claims to be quite optimized for it, and it can be used by existing Mac users. There seem to be no commercial versions of R available as of now for this operating system.

      3. Linux

 

        1. Ubuntu
        2. Red Hat Enterprise Linux
        3. Other versions of Linux

 

Linux is considered a preferred operating system by R users because it has the same open source credentials, is a much better fit for all R packages, and is customizable for big data analytics.

Ubuntu Linux is recommended for people making the transition to Linux for the first time. Ubuntu Linux had a marketing agreement with Revolution Analytics for an earlier version of Ubuntu, and many R packages can be installed in a straightforward way because Ubuntu/Debian packages are available. Red Hat Enterprise Linux is officially supported by Revolution Analytics for its enterprise module. Other popular versions of Linux include openSUSE.

      4. Multiple operating systems-
        1. Virtualization vs Dual Boot-

 

You can also choose between running VMware Player to host a virtual machine on your computer that is dedicated to R-based computing, and choosing the operating system at the startup or booting of your computer. A software program called Wubi helps with the dual installation of Linux and Windows.

  3. 64 bit vs 32 bit – Given a choice between 32 bit and 64 bit versions of the same operating system, like Ubuntu Linux, the 64 bit version can speed up processing (by an approximate factor of 2 for some workloads) and, more importantly for R, lifts the roughly 4 GB address-space limit of 32 bit systems. However, you need to check whether your current hardware can support 64 bit operating systems, and if so, you may want to ask your Information Technology manager to upgrade at least some operating systems in your analytics work environment to 64 bit.

 

  1. Hardware choices- At the time of writing this book, the dominant computing paradigm is workstation computing followed by server-client computing. However with the introduction of cloud computing, netbooks, tablet PCs, hardware choices are much more flexible in 2011 than just a couple of years back.

Hardware is a significant cost in an analytics environment, and it also depreciates remarkably quickly. You may thus examine your legacy hardware and your future analytical computing needs, and accordingly decide between the various hardware options available for R.
Unlike other analytical software, which may charge by the number of processors, price server licenses above workstation licenses, and price grid computing extremely high where it is available, R is well suited to all kinds of hardware environments with flexible costs. Because R is memory intensive (it limits the size of the data analyzed to the RAM size of the machine unless special formats and/or chunking are used), the hardware you need depends on the size of the datasets used and the number of concurrent users analyzing them. Thus the defining issue is not R but the size of the data being analyzed.
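
As a rough illustration of how dataset size maps to RAM, the sketch below estimates the memory needed for a purely numeric table at 8 bytes per value and then checks the rule of thumb against a small real object. The function name estimate_gb and the dataset dimensions are hypothetical; real data frames with character or factor columns will differ, and R often needs extra working memory for copies made during analysis.

```r
# Rough size of a purely numeric table: rows x columns x 8 bytes per double.
# (estimate_gb is a hypothetical helper, not part of any package.)
estimate_gb <- function(n_rows, n_cols, bytes_per_value = 8) {
  n_rows * n_cols * bytes_per_value / 2^30
}

# Hypothetical dataset: 10 million rows, 20 numeric columns.
needed_gb <- estimate_gb(1e7, 20)
cat(sprintf("Approximate size in RAM: %.2f GB\n", needed_gb))

# Compare the rule of thumb with what R reports for a small real matrix.
x <- matrix(0, nrow = 1000, ncol = 20)
print(object.size(x), units = "Kb")
```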

    1. Local Computing- This denotes software that is installed locally. For big data, the data to be analyzed would be stored in databases.
      1. Server version- Revolution Analytics has differential pricing for server-client versions, but the open source version is free and the same for Server or Workstation versions.
      2. Workstation
    2. Cloud Computing- Cloud computing is defined as the delivery of data, processing and systems via remote computers. It is similar to server-client computing, but the remote server (also called the cloud) has flexible computing in terms of number of processors, memory, and data storage. Cloud computing in the form of a public cloud enables people to do analytical tasks on massive datasets without investing in permanent hardware or software, as most public clouds are priced on pay per usage. The biggest cloud computing provider is Amazon, and many other vendors provide services on top of it. Google also offers cloud data storage (Google Storage) as well as machine learning in the form of an API (Google Prediction API).
      1. Amazon
      2. Google
      3. Cluster-Grid Computing/Parallel processing- In order to build a cluster, you would need the Rmpi and the snow packages, among other packages that help with parallel processing.
    3. How many resources
      1. RAM-Hard Disk-Processors- for workstation computing
      2. Instances or API calls for cloud computing
  1. Interface Choices
    1. Command Line
    2. GUI
    3. Web Interfaces
  2. Software Component Choices
    1. R dependencies
    2. Packages to install
    3. Recommended Packages
  3. Additional software choices
    1. Additional legacy software
    2. Optimizing your R based computing
    3. Code Editors
      1. Code Analyzers
      2. Libraries to speed up R
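
The Cluster-Grid Computing item in the outline above can be sketched with a small local example. This sketch assumes the parallel package, which has shipped with R since version 2.14.0 and offers snow-style functions, rather than Rmpi or snow themselves; a real cluster across several machines would need additional setup that is not shown here.

```r
library(parallel)  # ships with R since version 2.14.0

# Start a small socket cluster of two local worker processes.
cl <- makeCluster(2)

# Apply a function to each element in parallel; results come back in order.
squares <- parLapply(cl, 1:8, function(i) i^2)

# Always release the workers when finished.
stopCluster(cl)

print(unlist(squares))
```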

citation-  R Development Core Team (2010). R: A language and environment for statistical computing. R Foundation for Statistical Computing,Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

(Note- this is a draft in progress)

Cloud Computing with R

Here is a short list of resources and material I put together as starting points for R and Cloud Computing. It’s a bit messy but overall should serve quite comprehensively.

Cloud computing is a commonly used expression to imply a generational change in computing from desktops and servers to remote and massive computing connections and shared computers, enabled by high bandwidth across the internet.

As per the National Institute of Standards and Technology Definition,
Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.

(Citation: The NIST Definition of Cloud Computing

Authors: Peter Mell and Tim Grance
Version 15, 10-7-09
National Institute of Standards and Technology, Information Technology Laboratory
http://csrc.nist.gov/groups/SNS/cloud-computing/cloud-def-v15.doc)

R is an integrated suite of software facilities for data manipulation, calculation and graphical display.

From http://cran.r-project.org/doc/FAQ/R-FAQ.html#R-Web-Interfaces

R Web Interfaces

Rweb is developed and maintained by Jeff Banfield. The Rweb Home Page provides access to all three versions of Rweb—a simple text entry form that returns output and graphs, a more sophisticated JavaScript version that provides a multiple window environment, and a set of point and click modules that are useful for introductory statistics courses and require no knowledge of the R language. All of the Rweb versions can analyze Web accessible datasets if a URL is provided.
The paper “Rweb: Web-based Statistical Analysis”, providing a detailed explanation of the different versions of Rweb and an overview of how Rweb works, was published in the Journal of Statistical Software (http://www.jstatsoft.org/v04/i01/).

Ulf Bartel has developed R-Online, a simple on-line programming environment for R which intends to make the first steps in statistical programming with R (especially with time series) as easy as possible. There is no need for a local installation since the only requirement for the user is a JavaScript capable browser. See http://osvisions.com/r-online/ for more information.

Rcgi is a CGI WWW interface to R by MJ Ray. It had the ability to use “embedded code”: you could mix user input and code, allowing the HTML author to do anything from load in data sets to enter most of the commands for users without writing CGI scripts. Graphical output was possible in PostScript or GIF formats and the executed code was presented to the user for revision. However, it is not clear if the project is still active.

Currently, a modified version of Rcgi by Mai Zhou (actually, two versions: one with (bitmap) graphics and one without) as well as the original code are available from http://www.ms.uky.edu/~statweb/.

CGI-based web access to R is also provided at http://hermes.sdu.dk/cgi-bin/go/. There are many additional examples of web interfaces to R which basically allow to submit R code to a remote server, see for example the collection of links available from http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/StatCompCourse.

David Firth has written CGIwithR, an R add-on package available from CRAN. It provides some simple extensions to R to facilitate running R scripts through the CGI interface to a web server, and allows submission of data using both GET and POST methods. It is easily installed using Apache under Linux and in principle should run on any platform that supports R and a web server provided that the installer has the necessary security permissions. David’s paper “CGIwithR: Facilities for Processing Web Forms Using R” was published in the Journal of Statistical Software (http://www.jstatsoft.org/v08/i10/). The package is now maintained by Duncan Temple Lang and has a web page at http://www.omegahat.org/CGIwithR/.

Rpad, developed and actively maintained by Tom Short, provides a sophisticated environment which combines some of the features of the previous approaches with quite a bit of JavaScript, allowing for a GUI-like behavior (with sortable tables, clickable graphics, editable output), etc.
Jeff Horner is working on the R/Apache Integration Project which embeds the R interpreter inside Apache 2 (and beyond). A tutorial and presentation are available from the project web page at http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/RApacheProject.

Rserve is a project actively developed by Simon Urbanek. It implements a TCP/IP server which allows other programs to use facilities of R. Clients are available from the web site for Java and C++ (and could be written for other languages that support TCP/IP sockets).

OpenStatServer is being developed by a team led by Greg Warnes; it aims “to provide clean access to computational modules defined in a variety of computational environments (R, SAS, Matlab, etc) via a single well-defined client interface” and to turn computational services into web services.

Two projects use PHP to provide a web interface to R. R_PHP_Online by Steve Chen (though it is unclear if this project is still active) is somewhat similar to the above Rcgi and Rweb. R-php is actively developed by Alfredo Pontillo and Angelo Mineo and provides both a web interface to R and a set of pre-specified analyses that need no R code input.

webbioc is “an integrated web interface for doing microarray analysis using several of the Bioconductor packages” and is designed to be installed at local sites as a shared computing resource.

Rwui is a web application to create user-friendly web interfaces for R scripts. All code for the web interface is created automatically. There is no need for the user to do any extra scripting or learn any new scripting techniques. Rwui can also be found at http://rwui.cryst.bbk.ac.uk.

Finally, the R.rsp package by Henrik Bengtsson introduces “R Server Pages”. Analogous to Java Server Pages, an R server page is typically HTML with embedded R code that gets evaluated when the page is requested. The package includes an internal cross-platform HTTP server implemented in Tcl, so provides a good framework for including web-based user interfaces in packages. The approach is similar to the use of the brew package with Rapache with the advantage of cross-platform support and easy installation.
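
To make the idea of HTML with embedded R code concrete, here is a minimal base R sketch. It is not the R.rsp or brew API: the function render_template and the `<%= ... %>` tag handling are written from scratch for illustration, assume each tag holds a single expression returning one value, and skip the error handling and sandboxing a real server would need.

```r
# Minimal illustration of "HTML with embedded R code".
# Each <%= expression %> tag is replaced by the value of the R expression.
# (render_template is a hypothetical helper, not part of R.rsp or brew.)
render_template <- function(template, env = parent.frame()) {
  pattern <- "<%=(.*?)%>"
  matches <- gregexpr(pattern, template, perl = TRUE)
  tags <- regmatches(template, matches)[[1]]
  values <- vapply(tags, function(tag) {
    code <- sub(pattern, "\\1", tag, perl = TRUE)   # strip the tag delimiters
    as.character(eval(parse(text = code), envir = env))
  }, character(1))
  regmatches(template, matches) <- list(unname(values))
  template
}

page <- "<p>The mean of 1 to 10 is <%= mean(1:10) %>.</p>"
cat(render_template(page), "\n")
```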

Also additional R Cloud Computing Use Cases
http://wwwdev.ebi.ac.uk/Tools/rcloud/

ArrayExpress R/Bioconductor Workbench

Remote access to R/Bioconductor on EBI’s 64-bit Linux Cluster

Start the workbench by downloading the package for your operating system (Macintosh or Windows), or via Java Web Start, and you will get access to an instance of R running on one of EBI’s powerful machines. You can install additional packages, upload your own data, work with graphics and collaborate with colleagues, all as if you are running R locally, but unlimited by your machine’s memory, processor or data storage capacity.

  • Most up-to-date R version built for multicore CPUs
  • Access to all Bioconductor packages
  • Access to our computing infrastructure
  • Fast access to data stored in EBI’s repositories (e.g., public microarray data in ArrayExpress)

Using R with Google Docs
http://www.omegahat.org/RGoogleDocs/run.pdf
It uses the XML and RCurl packages and illustrates that it is relatively quick and easy
to use their primitives to interact with Web services.

Using R with Amazon
Citation
http://rgrossman.com/2009/05/17/running-r-on-amazons-ec2/

Amazon’s EC2 is a type of cloud that provides on-demand computing infrastructure through Amazon Machine Images, or AMIs. In general, these types of cloud provide several benefits:

  • Simple and convenient to use. An AMI contains your applications, libraries, data and all associated configuration settings. You simply access it. You don’t need to configure it. This applies not only to applications like R, but also can include any third-party data that you require.
  • On-demand availability. AMIs are available over the Internet whenever you need them. You can configure the AMIs yourself without involving the service provider. You don’t need to order any hardware and set it up.
  • Elastic access. With elastic access, you can rapidly provision and access the additional resources you need. Again, no human intervention from the service provider is required. This type of elastic capacity can be used to handle surge requirements when you might need many machines for a short time in order to complete a computation.
  • Pay per use. The cost of 1 AMI for 100 hours and 100 AMIs for 1 hour is the same. With pay per use pricing, which is sometimes called utility pricing, you simply pay for the resources that you use.
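
The pay per use point in the list above comes down to simple arithmetic, sketched here in R. The function usage_cost and the hourly rate are hypothetical placeholders, not actual Amazon prices; the point is only that cost scales with instances times hours.

```r
# Pay-per-use cost: instances x hours x hourly rate.
# The default rate is a hypothetical placeholder, not an actual Amazon price.
usage_cost <- function(instances, hours, rate_per_hour = 0.10) {
  instances * hours * rate_per_hour
}

one_for_100_hours <- usage_cost(instances = 1, hours = 100)
hundred_for_1_hour <- usage_cost(instances = 100, hours = 1)

cat("1 instance for 100 hours: ", one_for_100_hours, "\n")
cat("100 instances for 1 hour: ", hundred_for_1_hour, "\n")
```

This is why surge workloads suit the cloud: finishing in one hour on many machines costs the same as running one machine for many hours.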

Connecting to R on Amazon EC2- Detailed tutorials
Ubuntu Linux version
https://decisionstats.com/2010/09/25/running-r-on-amazon-ec2/
and Windows R version
https://decisionstats.com/2010/10/02/running-r-on-amazon-ec2-windows/

Connecting R to Data on Google Storage and Computing on Google Prediction API
https://github.com/onertipaday/predictionapirwrapper
R wrapper for working with Google Prediction API

This package consists of a set of functions that allow the user to test the Google Prediction API from R.
It requires the user to have access to both Google Storage for Developers and Google Prediction API:
see
http://code.google.com/apis/storage/ and http://code.google.com/apis/predict/ for details.

Example usage:

# This example requires that you have previously created a bucket named data_language on your Google Storage and uploaded a CSV file named language_id.txt (your data) into this bucket – see for details
library(predictionapirwrapper)

and Elastic R for Cloud Computing
http://user2010.org/tutorials/Chine.html

Abstract

Elastic-R is a new portal built using the Biocep-R platform. It enables statisticians, computational scientists, financial analysts, educators and students to use cloud resources seamlessly; to work with R engines and use their full capabilities from within simple browsers; to collaborate, share and reuse functions, algorithms, user interfaces, R sessions, servers; and to perform elastic distributed computing with any number of virtual machines to solve computationally intensive problems.
Also see Karim Chine’s http://biocep-distrib.r-forge.r-project.org/

R for Salesforce.com

At the point of writing this, there seem to be zero R based apps on Salesforce.com. This could be a big opportunity for developers, as both Apex and R have similar structures. Developers could write free code in R and charge for their translated version in Apex on Salesforce.com.

Force.com and Salesforce have many (1009) apps at
http://sites.force.com/appexchange/home for cloud computing for
businesses, but very few forecasting and statistical simulation apps.

An example of a Monte Carlo based app is here
http://sites.force.com/appexchange/listingDetail?listingId=a0N300000016cT9EAI#

These are like iPhone apps except meant for business purposes. (I am
unaware of any university offering Salesforce.com integration,
though research related to Google Apps and Amazon seems to be under way.)

Force.com uses a language called Apex, and you can see
http://wiki.developerforce.com/index.php/App_Logic and
http://wiki.developerforce.com/index.php/An_Introduction_to_Formulas
Apex is similar to R in that it is object-oriented.

SAS Institute has an existing product for taking in Salesforce.com data.

A new SAS data surveyor is
available to access data from the Customer Relationship Management
(CRM) software vendor Salesforce.com, at
http://support.sas.com/documentation/cdl/en/whatsnew/62580/HTML/default/viewer.htm#datasurveyorwhatsnew902.htm

Personal Note- Mentioning SAS in an email to an R list is a big no-no in terms of getting a response and love. The same goes for being careless about which R help list to email (like R devel or R packages or R help).

For python based cloud see http://pi-cloud.com
