Home » Posts tagged 'Distributed computing'

Tag Archives: Distributed computing

Latest R Journal

Including juicy stuff on using a cluster of Apple Machines for grid computing , seasonality forecasting (Yet Another Package For Time Series )

But I kind of liked Sumo too-

https://code.google.com/p/sumo/

Sumo is a fully-functional web application template that exposes an authenticated user’s R session within java server pages.

Sumo: An Authenticating Web Application with an Embedded R Session by Timothy T. Bergsma and Michael S. Smith Abstract Sumo is a web application intended as a template for developers. It is distributed as a Java ‘war’ file that deploys automatically when placed in a Servlet container’s ‘webapps’
directory. If a user supplies proper credentials, Sumo creates a session-specific Secure Shell connection to the host and a user-specific R session over that connection. Developers may write dynamic server pages that make use of the persistent R session and user-specific file space.

and for Apple fanboys-

We created the xgrid package (Horton and Anoke, 2012) to provide a simple interface to this distributed computing system. The package facilitates use of an Apple Xgrid for distributed processing of a simulation with many independent repetitions, by simplifying job submission (or grid stuffing) and collation of results. It provides a relatively thin but useful layer between R and Apple’s ‘xgrid’ shell command, where the user constructs input scripts to be run remotely. A similar set of routines, optimized for parallel estimation of JAGS (just another Gibbs sampler) models is available within the runjags package (Denwood, 2010). However, with the exception of runjags, none of the previously mentioned packages support parallel computation over an Apple Xgrid.

Hmm I guess parallel computing enabled by Wifi on mobile phones would be awesome too ! So would be anything using iOS . See the rest of the R Journal at http://journal.r-project.org/current.html

RJournal_2012-1

Interview Rapid-I -Ingo Mierswa and Simon Fischer

Here is an interview with Dr Ingo Mierswa , CEO of Rapid -I and Dr Simon Fischer, Head R&D. Rapid-I makes the very popular software Rapid Miner – perhaps one of the earliest leading open source software in business analytics and business intelligence. It is quite easy to use, deploy and with it’s extensions and innovations (including compatibility with R )has continued to grow tremendously through the years.

In an extensive interview Ingo and Simon talk about algorithms marketplace, extensions , big data analytics, hadoop, mobile computing and use of the graphical user interface in analytics.

Special Thanks to Nadja from Rapid I communication team for helping coordinate this interview.( Statuary Blogging Disclosure- Rapid I is a marketing partner with Decisionstats as per the terms in http://decisionstats.com/privacy-3/)

Ajay- Describe your background in science. What are the key lessons that you have learnt while as scientific researcher and what advice would you give to new students today.

Ingo: My time as researcher really was a great experience which has influenced me a lot. I have worked at the AI lab of Prof. Dr. Katharina Morik, one of the persons who brought machine learning and data mining to Europe. Katharina always believed in what we are doing, encouraged us and gave us the space for trying out new things. Funnily enough, I never managed to use my own scientific results in any real-life project so far but I consider this as a quite common gap between science and the “real world”. At Rapid-I, however, we are still heavily connected to the scientific world and try to combine the best of both worlds: solving existing problems with leading-edge technologies.

Simon: In fact, during my academic career I have not worked in the field of data mining at all. I worked on a field some of my colleagues would probably even consider boring, and that is theoretical computer science. To be precise, my research was in the intersection of game theory and network theory. During that time, I have learnt a lot of exciting things, none of which had any business use. Still, I consider that a very valuable experience. When we at Rapid-I hire people coming to us right after graduating, I don’t care whether they know the latest technology with a fancy three-letter acronym – that will be forgotten more quickly than it came. What matters is the way you approach new problems and challenges. And that is also my recommendation to new students: work on whatever you like, as long as you are passionate about it and it brings you forward.

Ajay-  How is the Rapid Miner Extensions marketplace moving along. Do you think there is a scope for people to say create algorithms in a platform like R , and then offer that algorithm as an app for sale just like iTunes or Android apps.

 Simon: Well, of course it is not going to be exactly like iTunes or Android apps are, because of the more business-orientated character. But in fact there is a scope for that, yes. We have talked to several developers, e.g., at our user conference RCOMM, and several people would be interested in such an opportunity. Companies using data mining software need supported software packages, not just something they downloaded from some anonymous server, and that is only possible through a platform like the new Marketplace. Besides that, the marketplace will not only host commercial extensions. It is also meant to be a platform for all the developers that want to publish their extensions to a broader community and make them accessible in a comfortable way. Of course they could just place them on their personal Web pages, but who would find them there? From the Marketplace, they are installable with a single click.

Ingo: What I like most about the new Rapid-I Marketplace is the fact that people can now get something back for their efforts. Developing a new algorithm is a lot of work, in some cases even more that developing a nice app for your mobile phone. It is completely accepted that people buy apps from a store for a couple of Dollars and I foresee the same for sharing and selling algorithms instead of apps. Right now, people can already share algorithms and extensions for free, one of the next versions will also support selling of those contributions. Let’s see what’s happening next, maybe we will add the option to sell complete RapidMiner workflows or even some data pools…

Ajay- What are the recent features in Rapid Miner that support cloud computing, mobile computing and tablets. How do you think the landscape for Big Data (over 1 Tb ) is changing and how is Rapid Miner adapting to it.

Simon: These are areas we are very active in. For instance, we have an In-Database-Mining Extension that allows the user to run their modelling algorithms directly inside the database, without ever loading the data into memory. Using analytic databases like Vectorwise or Infobright, this technology can really boost performance. Our data mining server, RapidAnalytics, already offers functionality to send analysis processes into the cloud. In addition to that, we are currently preparing a research project dealing with data mining in the cloud. A second project is targeted towards the other aspect you mention: the use of mobile devices. This is certainly a growing market, of course not for designing and running analyses, but for inspecting reports and results. But even that is tricky: When you have a large screen you can display fancy and comprehensive interactive dashboards with drill downs and the like. On a mobile device, that does not work, so you must bring your reports and visualizations very much to the point. And this is precisely what data mining can do – and what is hard to do for classical BI.

Ingo: Then there is Radoop, which you may have heard of. It uses the Apache Hadoop framework for large-scale distributed computing to execute RapidMiner processes in the cloud. Radoop has been presented at this year’s RCOMM and people are really excited about the combination of RapidMiner with Hadoop and the scalability this brings.

 Ajay- Describe the Rapid Miner analytics certification program and what steps are you taking to partner with academic universities.

Ingo: The Rapid-I Certification Program was created to recognize professional users of RapidMiner or RapidAnalytics. The idea is that certified users have demonstrated a deep understanding of the data analysis software solutions provided by Rapid-I and how they are used in data analysis projects. Taking part in the Rapid-I Certification Program offers a lot of benefits for IT professionals as well as for employers: professionals can demonstrate their skills and employers can make sure that they hire qualified professionals. We started our certification program only about 6 months ago and until now about 100 professionals have been certified so far.

Simon: During our annual user conference, the RCOMM, we have plenty of opportunities to talk to people from academia. We’re also present at other conferences, e.g. at ECML/PKDD, and we are sponsoring data mining challenges and grants. We maintain strong ties with several universities all over Europe and the world, which is something that I would not want to miss. We are also cooperating with institutes like the ITB in Dublin during their training programmes, e.g. by giving lectures, etc. Also, we are leading or participating in several national or EU-funded research projects, so we are still close to academia. And we offer an academic discount on all our products :-)

Ajay- Describe the global efforts in making Rapid Miner a truly international software including spread of developers, clients and employees.

Simon: Our clients already are very international. We have a partner network in America, Asia, and Australia, and, while I am responding to these questions, we have a training course in the US. Developers working on the core of RapidMiner and RapidAnalytics, however, are likely to stay in Germany for the foreseeable future. We need specialists for that, and it would be pointless to spread the development team over the globe. That is also owed to the agile philosophy that we are following.

Ingo: Simon is right, Rapid-I already is acting on an international level. Rapid-I now has more than 300 customers from 39 countries in the world which is a great result for a young company like ours. We are of course very strong in Germany and also the rest of Europe, but also concentrate on more countries by means of our very successful partner network. Rapid-I continues to build this partner network and to recruit dynamic and knowledgeable partners and in the future. However, extending and acting globally is definitely part of our strategic roadmap.

Biography-

Dr. Ingo Mierswa is working as Chief Executive Officer (CEO) of Rapid-I. He has several years of experience in project management, human resources management, consulting, and leadership including eight years of coordinating and leading the multi-national RapidMiner developer team with about 30 developers and contributors world-wide. He wrote his Phd titled “Non-Convex and Multi-Objective Optimization for Numerical Feature Engineering and Data Mining” at the University of Dortmund under the supervision of Prof. Morik.

Dr. Simon Fischer is heading the research & development at Rapid-I. His interests include game theory and networks, the theory of evolutionary algorithms (e.g. on the Ising model), and theoretical and practical aspects of data mining. He wrote his PhD in Aachen where he worked in the project “Design and Analysis of Self-Regulating Protocols for Spectrum Assignment” within the excellence cluster UMIC. Before, he was working on the vtraffic project within the DFG Programme 1126 “Algorithms for large and complex networks”.

http://rapid-i.com/content/view/181/190/ tells you more on the various types of Rapid Miner licensing for enterprise, individual and developer versions.

(Note from Ajay- to receive an early edition invite to Radoop, click here http://radoop.eu/z1sxe)

 

AsterData still alive;/launches SQL-MapReduce Developer Portal

so apparantly ole client AsterData continues to thrive under gentle touch of Terrific Data

———————————————————————————————————————————————————

Aster Data today launched the SQL-MapReduce Developer Portal, a new online community for data scientists and analytic developers. For your convenience, I copied the release below and it can also be found here. Please let me know if you have any questions or if there is anything else I can help you with.

Sara Korolevich

Point Communications Group for Aster Data

sarak@pointcgroup.com

Office: 602.279.1137

Mobile: 623.326.0881

Teradata Accelerates Big Data Analytics with First Collaborative Community for SQL-MapReduce®

New online community for data scientists and analytic developers enables development and sharing of powerful MapReduce analytics


San Carlos, California - Teradata Corporation (NYSE:TDC) today announced the launch of the Aster Data SQL-MapReduce® Developer Portal. This portal is the first collaborative online developer community for SQL-MapReduce analytics, an emerging framework for processing non-relational data and ultra-fast analytics.

“Aster Data continues to deliver on its unique vision for powerful analytics with a rich set of tools to make development of those analytics quick and easy,” said Tasso Argyros, vice president of Aster Data Marketing and Product Management, Teradata Corporation. “This new developer portal builds on Aster Data’s continuing SQL-MapReduce innovation, leveraging the flexibility and power of SQL-MapReduce for analytics that were previously impossible or impractical.”

The developer portal showcases the power and flexibility of Aster Data’s SQL-MapReduce – which uniquely combines standard SQL with the popular MapReduce distributed computing technology for processing big data – by providing a collaborative community for sharing SQL-MapReduce expert insights in addition to sharing SQL-MapReduce analytic functions and sample code. Data scientists, quantitative analysts, and developers can now leverage the experience, knowledge, and best practices of a community of experts to easily harness the power of SQL-MapReduce for big data analytics.

A recent report from IDC Research, “Taking Care of Your Quants: Focusing Data Warehousing Resources on Quantitative Analysts Matters,” has shown that by enabling data scientists with the tools to harness emerging types and sources of data, companies create significant competitive advantage and become leaders in their respective industry.

“The biggest positive differences among leaders and the rest come from the introduction of new types of data,” says Dan Vesset, program vice president, Business Analytics Solutions, IDC Research. “This may include either new transactional data sources or new external data feeds of transactional or multi-structured interactional data — the latter may include click stream or other data that is a by-product of social networking.”

Vesset goes on to say, “Aster Data provides a comprehensive platform for analytics and their SQL-MapReduce Developer Portal provides a community for sharing best practices and functions which can have an even greater impact to an organization’s business.”

With this announcement Aster Data extends its industry leadership in delivering the most comprehensive analytic platform for big data analytics — not only capable of processing massive volumes of multi-structured data, but also providing an extensive set of tools and capabilities that make it simple to leverage the power of MapReduce analytics. The Aster Data

SQL-MapReduce Developer Portal brings the power of SQL-MapReduce accessible to data scientists, quantitative analysis, and analytic developers by making it easy to share and collaborate with experts in developing SQL-MapReduce analytics. This portal builds on Aster Data’s history of SQL-MapReduce innovations, including:

  • The first deep integration of SQL with MapReduce
  • The first MapReduce support for .NET
  • The first integrated development environment, Aster Data
    Developer Express
  • A comprehensive suite of analytic functions, Aster Data
    Analytic Foundation

Aster Data’s patent-pending SQL-MapReduce enables analytic applications and functions that can deliver faster, deeper insights on terabytes to petabytes of data. These applications are implemented using MapReduce but delivered through standard SQL and business intelligence (BI) tools.

SQL-MapReduce makes it possible for data scientists and developers to empower business analysts with the ability to make informed decisions, incorporating vast amounts of data, regardless of query complexity or data type. Aster Data customers are using SQL-MapReduce for rich analytics including analytic applications for social network analysis, digital marketing optimization, and on-the-fly fraud detection and prevention.

“Collaboration is at the core of our success as one of the leading providers, and pioneers of social software,” said Navdeep Alam, director of Data Architecture at Mzinga. “We are pleased to be one of the early members of The Aster Data SQL-MapReduce Developer Portal, which will allow us the ability to share and leverage insights with others in using big data analytics to attain a deeper understanding of customers’ behavior and create competitive advantage for our business.”

SQL-MapReduce is one of the core capabilities within Aster Data’s flagship product. Aster DatanCluster™ 4.6, the industry’s first massively parallel processing (MPP) analytic platform has an integrated analytics engine that stores and processes both relational and non-relational data at scale. With Aster Data’s unique analytics framework that supports both SQL and
SQL-MapReduce™, customers benefit from rich, new analytics on large data volumes with complex data types. Aster Data analytic functions are embedded within the analytic platform and processed locally with data, which allows for faster data exploration. The SQL-MapReduce framework provides scalable fault-tolerance for new analytics, providing users with superior reliability, regardless of number of users, query size, or data types.


About Aster Data
Aster Data is a market leader in big data analytics, enabling the powerful combination of cost-effective storage and ultra-fast analysis of new sources and types of data. The Aster Data nCluster analytic platform is a massively parallel software solution that embeds MapReduce analytic processing with data stores for deeper insights on new data sources and types to deliver new analytic capabilities with breakthrough performance and scalability. Aster Data’s solution utilizes Aster Data’s patent-pending SQL-MapReduce to parallelize processing of data and applications and deliver rich analytic insights at scale. Companies including Barnes & Noble, Intuit, LinkedIn, Akamai, and MySpace use Aster Data to deliver applications such as digital marketing optimization, social network and relationship analysis, and fraud detection and prevention.


About Teradata
Teradata is the world’s leader in data warehousing and integrated marketing management through itsdatabase softwaredata warehouse appliances, and enterprise analytics. For more information, visitteradata.com.

# # #

Teradata is a trademark or registered trademark of Teradata Corporation in the United States and other countries.

Cloud Computing with R

Illusion of Depth and Space (4/22) - Rotating ...

Image by Dominic's pics via Flickr

Here is a short list of resources and material I put together as starting points for R and Cloud Computing It’s a bit messy but overall should serve quite comprehensively.

Cloud computing is a commonly used expression to imply a generational change in computing from desktop-servers to remote and massive computing connections,shared computers, enabled by high bandwidth across the internet.

As per the National Institute of Standards and Technology Definition,
Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.

(Citation: The NIST Definition of Cloud Computing

Authors: Peter Mell and Tim Grance
Version 15, 10-7-09
National Institute of Standards and Technology, Information Technology Laboratory
http://csrc.nist.gov/groups/SNS/cloud-computing/cloud-def-v15.doc)

R is an integrated suite of software facilities for data manipulation, calculation and graphical display.

From http://cran.r-project.org/doc/FAQ/R-FAQ.html#R-Web-Interfaces

R Web Interfaces

Rweb is developed and maintained by Jeff Banfield. The Rweb Home Page provides access to all three versions of Rweb—a simple text entry form that returns output and graphs, a more sophisticated JavaScript version that provides a multiple window environment, and a set of point and click modules that are useful for introductory statistics courses and require no knowledge of the R language. All of the Rweb versions can analyze Web accessible datasets if a URL is provided.
The paper “Rweb: Web-based Statistical Analysis”, providing a detailed explanation of the different versions of Rweb and an overview of how Rweb works, was published in the Journal of Statistical Software (http://www.jstatsoft.org/v04/i01/).

Ulf Bartel has developed R-Online, a simple on-line programming environment for R which intends to make the first steps in statistical programming with R (especially with time series) as easy as possible. There is no need for a local installation since the only requirement for the user is a JavaScript capable browser. See http://osvisions.com/r-online/ for more information.

Rcgi is a CGI WWW interface to R by MJ Ray. It had the ability to use “embedded code”: you could mix user input and code, allowing the HTMLauthor to do anything from load in data sets to enter most of the commands for users without writing CGI scripts. Graphical output was possible in PostScript or GIF formats and the executed code was presented to the user for revision. However, it is not clear if the project is still active.

Currently, a modified version of Rcgi by Mai Zhou (actually, two versions: one with (bitmap) graphics and one without) as well as the original code are available from http://www.ms.uky.edu/~statweb/.

CGI-based web access to R is also provided at http://hermes.sdu.dk/cgi-bin/go/. There are many additional examples of web interfaces to R which basically allow to submit R code to a remote server, see for example the collection of links available from http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/StatCompCourse.

David Firth has written CGIwithR, an R add-on package available from CRAN. It provides some simple extensions to R to facilitate running R scripts through the CGI interface to a web server, and allows submission of data using both GET and POST methods. It is easily installed using Apache under Linux and in principle should run on any platform that supports R and a web server provided that the installer has the necessary security permissions. David’s paper “CGIwithR: Facilities for Processing Web Forms Using R” was published in the Journal of Statistical Software (http://www.jstatsoft.org/v08/i10/). The package is now maintained by Duncan Temple Lang and has a web page athttp://www.omegahat.org/CGIwithR/.

Rpad, developed and actively maintained by Tom Short, provides a sophisticated environment which combines some of the features of the previous approaches with quite a bit of JavaScript, allowing for a GUI-like behavior (with sortable tables, clickable graphics, editable output), etc.
Jeff Horner is working on the R/Apache Integration Project which embeds the R interpreter inside Apache 2 (and beyond). A tutorial and presentation are available from the project web page at http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/RApacheProject.

Rserve is a project actively developed by Simon Urbanek. It implements a TCP/IP server which allows other programs to use facilities of R. Clients are available from the web site for Java and C++ (and could be written for other languages that support TCP/IP sockets).

OpenStatServer is being developed by a team lead by Greg Warnes; it aims “to provide clean access to computational modules defined in a variety of computational environments (R, SAS, Matlab, etc) via a single well-defined client interface” and to turn computational services into web services.

Two projects use PHP to provide a web interface to R. R_PHP_Online by Steve Chen (though it is unclear if this project is still active) is somewhat similar to the above Rcgi and Rweb. R-php is actively developed by Alfredo Pontillo and Angelo Mineo and provides both a web interface to R and a set of pre-specified analyses that need no R code input.

webbioc is “an integrated web interface for doing microarray analysis using several of the Bioconductor packages” and is designed to be installed at local sites as a shared computing resource.

Rwui is a web application to create user-friendly web interfaces for R scripts. All code for the web interface is created automatically. There is no need for the user to do any extra scripting or learn any new scripting techniques. Rwui can also be found at http://rwui.cryst.bbk.ac.uk.

Finally, the R.rsp package by Henrik Bengtsson introduces “R Server Pages”. Analogous to Java Server Pages, an R server page is typically HTMLwith embedded R code that gets evaluated when the page is requested. The package includes an internal cross-platform HTTP server implemented in Tcl, so provides a good framework for including web-based user interfaces in packages. The approach is similar to the use of the brew package withRapache with the advantage of cross-platform support and easy installation.

Also additional R Cloud Computing Use Cases
http://wwwdev.ebi.ac.uk/Tools/rcloud/

ArrayExpress R/Bioconductor Workbench

Remote access to R/Bioconductor on EBI’s 64-bit Linux Cluster

Start the workbench by downloading the package for your operating system (Macintosh or Windows), or via Java Web Start, and you will get access to an instance of R running on one of EBI’s powerful machines. You can install additional packages, upload your own data, work with graphics and collaborate with colleagues, all as if you are running R locally, but unlimited by your machine’s memory, processor or data storage capacity.

  • Most up-to-date R version built for multicore CPUs
  • Access to all Bioconductor packages
  • Access to our computing infrastructure
  • Fast access to data stored in EBI’s repositories (e.g., public microarray data in ArrayExpress)

Using R Google Docs
http://www.omegahat.org/RGoogleDocs/run.pdf
It uses the XML and RCurl packages and illustrates that it is relatively quick and easy
to use their primitives to interact with Web services.

Using R with Amazon
Citation
http://rgrossman.com/2009/05/17/running-r-on-amazons-ec2/

Amazon’s EC2 is a type of cloud that provides on demand computing infrastructures called an Amazon Machine Images or AMIs. In general, these types of cloud provide several benefits:

  • Simple and convenient to use. An AMI contains your applications, libraries, data and all associated configuration settings. You simply access it. You don’t need to configure it. This applies not only to applications like R, but also can include any third-party data that you require.
  • On-demand availability. AMIs are available over the Internet whenever you need them. You can configure the AMIs yourself without involving the service provider. You don’t need to order any hardware and set it up.
  • Elastic access. With elastic access, you can rapidly provision and access the additional resources you need. Again, no human intervention from the service provider is required. This type of elastic capacity can be used to handle surge requirements when you might need many machines for a short time in order to complete a computation.
  • Pay per use. The cost of 1 AMI for 100 hours and 100 AMI for 1 hour is the same. With pay per use pricing, which is sometimes called utility pricing, you simply pay for the resources that you use.

Connecting to R on Amazon EC2- Detailed tutorials
Ubuntu Linux version
http://decisionstats.com/2010/09/25/running-r-on-amazon-ec2/
and Windows R version
http://decisionstats.com/2010/10/02/running-r-on-amazon-ec2-windows/

Connecting R to Data on Google Storage and Computing on Google Prediction API
https://github.com/onertipaday/predictionapirwrapper
R wrapper for working with Google Prediction API

This package consists in a bunch of functions allowing the user to test Google Prediction API from R.
It requires the user to have access to both Google Storage for Developers and Google Prediction API:
see
http://code.google.com/apis/storage/ and http://code.google.com/apis/predict/ for details.

Example usage:

#This example requires you had previously created a bucket named data_language on your Google Storage and you had uploaded a CSV file named language_id.txt (your data) into this bucket – see for details
library(predictionapirwrapper)

and Elastic R for Cloud Computing
http://user2010.org/tutorials/Chine.html

Abstract

Elastic-R is a new portal built using the Biocep-R platform. It enables statisticians, computational scientists, financial analysts, educators and students to use cloud resources seamlessly; to work with R engines and use their full capabilities from within simple browsers; to collaborate, share and reuse functions, algorithms, user interfaces, R sessions, servers; and to perform elastic distributed computing with any number of virtual machines to solve computationally intensive problems.
Also see Karim Chine’s http://biocep-distrib.r-forge.r-project.org/

R for Salesforce.com

At the point of writing this, there seem to be zero R based apps on Salesforce.com This could be a big opportunity for developers as both Apex and R have similar structures Developers could write free code in R and charge for their translated version in Apex on Salesforce.com

Force.com and Salesforce have many (1009) apps at
http://sites.force.com/appexchange/home for cloud computing for
businesses, but very few forecasting and statistical simulation apps.

Example of Monte Carlo based app is here
http://sites.force.com/appexchange/listingDetail?listingId=a0N300000016cT9EAI#

These are like iPhone apps except meant for business purposes (I am
unaware if any university is offering salesforce.com integration
though google apps and amazon related research seems to be on)

Force.com uses a language called Apex  and you can see
http://wiki.developerforce.com/index.php/App_Logic and
http://wiki.developerforce.com/index.php/An_Introduction_to_Formulas
Apex is similar to R in that is OOPs

SAS Institute has an existing product for taking in Salesforce.com data.

A new SAS data surveyor is
available to access data from the Customer Relationship Management
(CRM) software vendor Salesforce.com. at
http://support.sas.com/documentation/cdl/en/whatsnew/62580/HTML/default/viewer.htm#datasurveyorwhatsnew902.htm)

Personal Note-Mentioning SAS in an email to a R list is a big no-no in terms of getting a response and love. Same for being careless about which R help list to email (like R devel or R packages or R help)

For python based cloud see http://pi-cloud.com

Matlab-Mathematica-R and GPU Computing

Matlab announced they have a parallel computing toolbox- specially to enable GPU computing as well

http://www.mathworks.com/products/parallel-computing/

Parallel Computing Toolbox™ lets you solve computationally and data-intensive problems using multicore processors, GPUs, and computer clusters. High-level constructs—parallel for-loops, special array types, and parallelized numerical algorithms—let you parallelize MATLAB® applications without CUDA or MPI programming. You can use the toolbox with Simulink® to run multiple simulations of a model in parallel.

MATLAB GPU Support

The toolbox provides eight workers (MATLAB computational engines) to execute applications locally on a multicore desktop. Without changing the code, you can run the same application on a computer cluster or a grid computing service (using MATLAB Distributed Computing Server™). You can run parallel applications interactively or in batch.

Parallel Computing with MATLAB on Amazon Elastic Compute Cloud (EC2)

Also a video of using Mathematica and GPU

Also R has many packages for GPU computing

Parallel computing: GPUs

from http://cran.r-project.org/web/views/HighPerformanceComputing.html

  • The gputools package by Buckner provides several common data-mining algorithms which are implemented using a mixture of nVidia‘s CUDA langauge and cublas library. Given a computer with an nVidia GPU these functions may be substantially more efficient than native R routines. The rpud package provides an optimised distance metric for NVidia-based GPUs.
  • The cudaBayesreg package by da Silva implements the rhierLinearModel from the bayesm package using nVidia’s CUDA langauge and tools to provide high-performance statistical analysis of fMRI voxels.
  • The rgpu package (see below for link) aims to speed up bioinformatics analysis by using the GPU.
  • The magma package provides an interface to the hybrid GPU/CPU library Magma (see below for link).
  • The gcbd package implements a benchmarking framework for BLAS and GPUs (using gputools).

I tried to search for SAS and GPU and SPSS and GPU but got nothing. Maybe they would do well to atleast test these alternative hardwares-

Also see Matlab on GPU comparison for the product Jacket vs Parallel Computing Toolbox

http://www.accelereyes.com/products/compare

Dryad- Microsoft's answer to MR

While reading across the internet I came across Microsoft’s version to MapReduce called Dryad- which has been around for some time, but has not generated quite the buzz that Hadoop or MapReduce are doing.

http://research.microsoft.com/en-us/projects/dryadlinq/

DryadLINQ

DryadLINQ is a simple, powerful, and elegant programming environment for writing large-scale data parallel applications running on large PC clusters.

Overview

New! An academic release of Dryad/DryadLINQ is now available for public download.

The goal of DryadLINQ is to make distributed computing on large compute cluster simple enough for every programmers. DryadLINQ combines two important pieces of Microsoft technology: the Dryad distributed execution engine and the .NET Language Integrated Query (LINQ).

Dryad provides reliable, distributed computing on thousands of servers for large-scale data parallel applications. LINQ enables developers to write and debug their applications in a SQL-like query language, relying on the entire .NET library and using Visual Studio.

DryadLINQ translates LINQ programs into distributed Dryad computations:

  • C# and LINQ data objects become distributed partitioned files.
  • LINQ queries become distributed Dryad jobs.
  • C# methods become code running on the vertices of a Dryad job.

DryadLINQ has the following features:

  • Declarative programming: computations are expressed in a high-level language similar to SQL
  • Automatic parallelization: from sequential declarative code the DryadLINQ compiler generates highly parallel query plans spanning large computer clusters. For exploiting multi-core parallelism on each machine DryadLINQ relies on the PLINQ parallelization framework.
  • Integration with Visual Studio: programmers in DryadLINQ take advantage of the comprehensive VS set of tools: Intellisense, code refactoring, integrated debugging, build, source code management.
  • Integration with .Net: all .Net libraries, including Visual Basic, and dynamic languages are available.
  • and
  • Conciseness: the following line of code is a complete implementation of the Map-Reduce computation framework in DryadLINQ:
    • public static IQueryable<R>
      MapReduce<S,M,K,R>(this IQueryable<S> source,
      Expression<Func<S,IEnumerable<M>>> mapper,
      Expression<Func<M,K>> keySelector,
      Expression<Func<K,IEnumerable<M>,R>> reducer)
      {
      return source.SelectMany(mapper).GroupBy(keySelector, reducer);
      }

    and http://research.microsoft.com/en-us/projects/dryad/

    Dryad

    The Dryad Project is investigating programming models for writing parallel and distributed programs to scale from a small cluster to a large data-center.

    Overview

    New! An academic release of DryadLINQ is now available for public download.

    Dryad is an infrastructure which allows a programmer to use the resources of a computer cluster or a data center for running data-parallel programs. A Dryad programmer can use thousands of machines, each of them with multiple processors or cores, without knowing anything about concurrent programming.

    The Structure of Dryad Jobs

    A Dryad programmer writes several sequential programs and connects them using one-way channels. The computation is structured as a directed graph: programs are graph vertices, while the channels are graph edges. A Dryad job is a graph generator which can synthesize any directed acyclic graph. These graphs can even change during execution, in response to important events in the computation.

    Dryad is quite expressive. It completely subsumes other computation frameworks, such as Google’s map-reduce, or the relational algebra. Moreover, Dryad handles job creation and management, resource management, job monitoring and visualization, fault tolerance, re-execution, scheduling, and accounting.

    The Dryad Software Stack

    As a proof of Dryad’s versatility, a rich software ecosystem has been built on top Dryad:

    • SSIS on Dryad executes many instances of SQL server, each in a separate Dryad vertex, taking advantage of Dryad’s fault tolerance and scheduling. This system is currently deployed in a live production system as part of one of Microsoft’s AdCenter log processing pipelines.
    • DryadLINQ generates Dryad computations from the LINQ Language-Integrated Query extensions to C#.
    • The distributed shell is a generalization of the pipe concept from the Unix shell in three ways. If Unix pipes allow the construction of one-dimensional (1-D) process structures, the distributed shell allows the programmer to build 2-D structures in a scripting language. The distributed shell generalizes Unix pipes in three ways:
      1. It allows processes to easily connect multiple file descriptors of each process — hence the 2-D aspect.
      2. It allows the construction of pipes spanning multiple machines, across a cluster.
      3. It virtualizes the pipelines, allowing the execution of pipelines with many more processes than available machines, by time-multiplexing processors and buffering results.
    • Several languages are compiled to distributed shell processes. PSQL is an early version, recently replaced with Scope.

    Publications

    Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
    Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly
    European Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21-23, 2007

    Video of a presentation on Dryad at the Google Campus, given by Michael Isard, Nov 1, 2007.

    Also interesting to read-

    Why does Dryad use a DAG?

    he basic computational model we decided to adopt for Dryad is the directed-acyclic graph (DAG). Each node in the graph is a computation, and each edge in the graph is a stream of data traveling in the direction of the edge. The amount of data on any given edge is assumed to be finite, the computations are assumed to be deterministic, and the inputs are assumed to be immutable. This isn’t by any means a new way of structuring a distributed computation (for example Condor had DAGMan long before Dryad came along), but it seemed like a sweet spot in the design space given our other constraints.

    So, why is this a sweet spot? A DAG is very convenient because it induces an ordering on the nodes in the graph. That makes it easy to design scheduling policies, since you can define a node to be ready when its inputs are available, and at any time you can choose to schedule as many ready nodes as you like in whatever order you like, and as long as you always have at least one scheduled you will continue to make progress and never deadlock. It also makes fault-tolerance easy, since given our determinism and immutability assumptions you can backtrack as far as you want in the DAG and re-execute as many nodes as you like to regenerate intermediate data that has been lost or is unavailable due to cluster failures.

    from

    http://blogs.msdn.com/b/dryad/archive/2010/07/23/why-does-dryad-use-a-dag.aspx

      Mapreduce Book

      Here is a new book on learning MapReduce and it has a free downloadable version as well.

      Data-Intensive Text Processing with MapReduce

      Jimmy Lin and Chris Dyer

      ABSTRACT

      Our world is being revolutionized by data-driven methods: access to large amounts of data has generated new insights and opened exciting new opportunities in commerce, science, and computing applications. Processing the enormous quantities of data necessary for these advances requires large clusters, making distributed computing paradigms more crucial than ever. MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. The programming model provides an easy-to-understand abstraction for designing scalable algorithms, while the execution framework transparently handles many system-level details, ranging from scheduling to synchronization to fault tolerance. This book focuses on MapReduce algorithm design, with an emphasis on text processing algorithms common in natural language processing, information retrieval, and machine learning. We introduce the notion of MapReduce design patterns, which represent general reusable solutions to commonly occurring problems across a variety of problem domains. This book not only intends to help the reader “think in MapReduce”, but also discusses limitations of the programming model as well.

      You can download the book here

      This book is part of the Morgan & Claypool Synthesis Lectures on Human Language Technologies. If you’re at a university, your institution may already subscribe to the series, in which case you can access the electronic version directly without cost (see this page for a list of institutional subscribers). Otherwise, to purchase:

      Quite explicitly, this book focuses on MapReduce algorithm design, not Hadoop programming. Tom White’s Hadoop: The Definitive Guide is a great resource for learning Hadoop.

      Want to be notified of updates? Interested in MapReduce algorithm design? Follow @lintool on Twitter here!

      Follow

      Get every new post delivered to your Inbox.

      Join 735 other followers