Here is an interview with Dr Usama Fayyad, founder of Open Insights LLC (www.open-insights.com). Before founding the firm he was Chief Data Officer of Yahoo!, where he built the data teams and infrastructure to manage the 25 terabytes of data per day that resulted from the company’s operations.
Ajay- Describe your career in science. How would you motivate young people today to take science careers rather than other careers
Dr Fayyad- My career started out in science and engineering. My original plan was to be in research and to become a university professor. Indeed, my first few jobs were strictly in basic research. After doing summer internships at places like GM Research Labs and JPL, my first full-time position was at the NASA – Jet Propulsion Laboratory, California Institute of Technology.
I started in research in Artificial Intelligence for autonomous monitoring and control and in Machine Learning and data mining. The first major success was with Caltech astronomers on using machine learning classification techniques to automatically recognize objects in a large sky survey (POSS-II – the 2nd Palomar Observatory Sky Survey). The Survey consists of taking high resolution images of the entire northern sky. The images, when digitized, contain over 2 billion sky objects. The main problem is to recognize whether an object is a star or a galaxy. For “faint objects” – which constitute the majority of objects – this was an exceedingly hard problem that people wrestled with for 30 years. I was surprised how well the algorithms could do at solving it.
This was a real example of data sets where the dimensionality is so high that algorithms are better suited to solving the problem than humans – even well-trained astronomers. Our methods had over 94% accuracy on faint objects that no one could reliably classify before at better than 75% accuracy. This additional accuracy made all the difference in enabling all sorts of new science, discoveries and theories about formation of large scale structure in the Universe.
The success of this work and its wide recognition in scientific and engineering communities led to the creation of a new group – I founded and managed the Machine Learning Systems group at JPL which went on to address hard problems in object recognition in scientific data – mostly from remote sensing instruments – like Magellan images of the planet Venus (we recognized and classified over a million small volcanoes on the planet in collaboration with geologists at Brown University) and Earth Observing System data, including atmospherics and storm data.
At the time, Microsoft was interested in figuring out data mining applications in the corporate world and after a long recruiting cycle they got me to join the newly formed Microsoft Research as a Senior Researcher in late 1995. My work there focused on algorithms, database systems, and basic science issues in the newly formed field of Data Mining and Knowledge Discovery. We had just finished publishing a nice edited collection of chapters in a book that became very popular, and I had agreed to become the founding Editor-in-Chief of a brand new journal called Data Mining and Knowledge Discovery. This journal today is the premier scientific journal in the field. My research work at Microsoft led to several applications – especially in databases. I founded the Data Mining & Exploration group at MSR and later a product group in SQL Server that built and shipped the first integrated data mining product in a large-scale commercial DBMS – SQL Server 2000 (Analysis Services). We created extensions to the SQL language (that we called DMX) and tried to make data mining mainstream. I really enjoyed the life of doing basic research as well as having a real product group that built and shipped components in a major DBMS.
That’s when I learned that the truly challenging problems in the real world were not in data mining but in getting the data ready and available for analysis. Data Warehousing was a field littered with failures and data stores that were write-only (meaning data never came out!). I used to call these Data Tombs at the time and I likened them to the pyramids in Ancient Egypt: great engineering feats to build, but really just tombs.
In 2000 I decided to leave the world of Research at Microsoft to do my first venture-backed start-up company – digiMine. The company wanted to solve the problem of managing the data and performing data mining and analysis over data sets, and we targeted a model of hosted data warehouses and mining applications as an ASP – one of the first Software as a Service (SaaS) firms in that arena. This began my transition from the world of research and science to business and technology. We focused on on-line data and web analytics since the data volumes there were about 10x the size of transactional databases and most companies did not know how to deal with all that data. The business grew fast and so did the company – reaching 120 employees in about 1 year.
After 3 years of doing high-growth start-up and raising some $50 million in venture capital for the company, I was beginning to feel the itch again to do technical work.
In June 2003, we had a chance to spin off part of the business that was focused on difficult high-end data mining problems. This opportunity was exactly what I needed and we formed DMX Group as a spinoff company that had a solid business from its first day. At DMX Group I got to work on some of the hardest data mining problems in predicting sales of automobiles, churn of wireless users, financial scoring and credit risk analysis, and many related deep business intelligence problems.
Our client list included many of the Fortune 500 companies. One of these clients was Yahoo! After 6 months of working with Yahoo! as a client, they decided to acquire DMX Group and use the people to build a serious data team for Yahoo! We negotiated a deal that got about half the employees into Yahoo!, and we spun off the rest of DMX Group to continue focusing on consulting work in data mining and BI. I thus became the industry’s first Chief Data Officer.
The original plan was to spend 2 years or so to help Yahoo! form the right data teams and build the data processing and targeting technology to deliver high value from its inventory of ads.
Yahoo! proved to be a wonderful experience and I learned so much about the Internet. I also learned that even someone like me, who worked on Internet data from the early days of MSN (in 1996) and who ran a web analytics firm, still did not scratch the surface on the depth of the area. I learned a lot about the Internet from Jerry Yang (Yahoo! co-founder), much about the advertising/media business from Dan Rosensweig (COO) and Terry Semel (then CEO), and lots about technology management and strategic deal-making from Farzad (Zod) Nazem, who was the CTO.
As Executive VP at Yahoo! I built one of the industry’s largest and best data teams, and we were able to process over 25 terabytes of data per day and power several hundred million dollars of new revenue for Yahoo! resulting from these data systems. A year after joining Yahoo! I was asked to form a new Research Lab to study much of what we did not understand about the Internet. This was yet another return of basic research into my life. I founded Yahoo! Research to invent the new sciences of the Internet, and I wanted them to be focused on only 4 areas (the idea of focus came from my exposure to Caltech and its philosophy in picking few areas of excellence). The goal was to become the best research lab in the world in these new focused areas. Surprisingly we did it within 2 years. I hired Prabhakar Raghavan to run Research and he did a phenomenal job in building out the Research organization. The four areas we chose were: Search and information navigation, Community Systems, Micro-economics of the Web, and Computational Advertising. We were able to attract the top talent in the world to lead or work on these emerging areas. Yahoo! Research was a success in basic research but also in influencing product. The chief scientists for all the major areas of company products came from Yahoo! Research and all owned the product development agenda and plans: Raghu Ramakrishnan (CS for Audience), Andrew Tomkins (CS for Search), Andrei Broder (CS for Monetization) and Preston McAfee (CS for Marketplaces/Exchanges). I consider this an unprecedented achievement in the world of Research in general: excellence in basic research and huge impact on company products, all within 3-4 years.
I have recently left Yahoo! and started Open Insights (www.open-insights.com) to focus on data strategy and helping enterprises realize the value of data, develop the right data strategies, and create new business models. Sort of an “outsourced version” of my Chief Data Officer job at Yahoo!
Finally, on my advice to young people: it is not just about science careers, I would call it engineering careers. My advice to any young person in fact, whether they plan to become a business person, a medical doctor, an artist, a lawyer, or a scientist – basic training in engineering and abstract problem solving will be a huge asset. Some of the best lawyers, doctors, and even CEOs started out with engineering training.
For those young people who want to become scientists, my advice is always to look for real-world applications where the research can be conducted in their context. The reason for that is technical and sociological. From a technical perspective, the reality of an application and the fact that things have to work force a regimen of technical discipline and make sure that the new ideas are tested and challenged. Socially, working on a real application forces interactions with people who care about the problem and provides continuous feedback which is really crucial in guiding good work (even if scientists deny this, social pressure is a big factor) – it also ensures that your work will be relevant and will evolve in relevant directions. I always tell people who are seeking basic research: “some of the deepest fundamental science problems can often be found lurking in the most mundane of applications”. So embrace applied work but always look for the abstract deep problems – that summarizes my advice.
Ajay- What are the challenges of running data mining for a very large website?
Dr Fayyad- There are many challenges. Most algorithms will not work due to scale. Also, most of the problems have an unusually high dimensionality – so simple tricks like sampling won’t work. You need to be very clever on how to sample and how to reduce dimensionality by applying the right variable transformations.
The variety of problems is huge, and the fact that the Internet is evolving and growing rapidly, means that the problems are not fixed or stationary. A solution that works well today will likely fail in a few months – so you need to always innovate and always look at new approaches. Also, you need to build automated tools to help detect changes and address them as soon as they arise.
Problems with 1,000, 10,000, or millions of variables are very common in web challenges. Finally, whatever you do needs to work fast or else you will not be able to keep up with the data flux. Imagine falling behind on processing 25 Terabytes of data per day. If you fall behind by two days, you will never be able to catch up again! Not within any reasonable budget constraint. So you try never to go down.
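The catch-up argument can be made concrete with a little arithmetic. The sketch below is illustrative only: the 25 TB/day inflow is the figure quoted in the interview, while the processing capacities are hypothetical numbers chosen to show how thin headroom turns a two-day outage into a very long recovery.

```python
def days_to_catch_up(backlog_days, inflow_tb_per_day, capacity_tb_per_day):
    """Days needed to clear a backlog while new data keeps arriving.

    Returns None when capacity does not exceed inflow, i.e. the
    backlog can never be cleared.
    """
    spare = capacity_tb_per_day - inflow_tb_per_day
    if spare <= 0:
        return None
    backlog_tb = backlog_days * inflow_tb_per_day
    return backlog_tb / spare


INFLOW = 25.0  # TB/day, the figure quoted in the interview

# Hypothetical capacities (assumptions, not Yahoo! numbers).
thin_headroom = days_to_catch_up(2, INFLOW, 26.25)   # 5% spare capacity
double_capacity = days_to_catch_up(2, INFLOW, 50.0)  # 100% spare capacity
no_headroom = days_to_catch_up(2, INFLOW, 25.0)      # nothing to spare

print(thin_headroom, double_capacity, no_headroom)
```

With only 5% spare capacity, a two-day backlog takes 40 days to clear; clearing it in two days requires doubling capacity, which is the budget problem described above.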
Ajay- What are the 5 most important things that the data miner should avoid in doing analysis.
Dr Fayyad- I never thought about this in terms of top 5, but here are the big ones that come to mind, not necessarily in any order
a. The algorithm knows nothing about the data, and the knowledge of the domain is in the head of the domain experts. As I always say, an ounce of knowledge is worth a ton of data – so seek and model what the experts know or your results will look silly
b. Don’t let an algorithm fish blindly when you have lots of data. Use what you know to reduce the dimensionality quickly. The curse of dimensionality is never to be under-estimated
c. Resist the temptation to cheat: selecting training and test sets can easily fool you into thinking you have something that works. Test it honestly against new data, never “peek” at the test data – what you see will force you to cheat without knowing it.
d. Business rules typically dominate data mining accuracy, so be sure to incorporate the business and legal constraints into your mining.
e. I have never seen a large database in my life that came from a static distribution that was sampled independently. Real databases grow to be big through lots of systematic changes and biases, and they are collected over years from changing underlying distribution: segmentation is a pre-requisite to any analysis. Most algorithms assume that data is IID (independent and identically distributed)
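Points (c) and (e) can be illustrated together with a small sketch. It is not from the interview: the synthetic data, the drifting decision boundary, and the toy threshold "model" are all assumptions made up for illustration. The idea is to split by time so the test set is strictly later than the training set, fit only on the training part, and look at the test part once.

```python
import random


def chronological_split(records, test_fraction=0.2):
    """Split by timestamp so every test record is later than every train record."""
    ordered = sorted(records, key=lambda r: r["t"])
    n_test = round(len(ordered) * test_fraction)
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


def accuracy(records, threshold):
    """Fraction of records where the rule 'x >= threshold' matches the label."""
    hits = sum((r["x"] >= threshold) == r["y"] for r in records)
    return hits / len(records)


def fit_threshold(train):
    """Toy model: pick the threshold with the best training accuracy."""
    candidates = sorted({r["x"] for r in train})
    return max(candidates, key=lambda c: accuracy(train, c))


# Synthetic, non-stationary data: the true boundary drifts from 0.3 to 0.7.
rng = random.Random(0)
records = []
for t in range(1000):
    x = rng.random()
    boundary = 0.3 + 0.4 * t / 999
    records.append({"t": t, "x": x, "y": x >= boundary})

train, test = chronological_split(records)
threshold = fit_threshold(train)          # the test set is never consulted here
print("train accuracy:", accuracy(train, threshold))
print("test accuracy :", accuracy(test, threshold))  # evaluated once, at the end
```

Because the boundary keeps moving, the later test period typically scores worse than the training period, which is the kind of drift a random shuffle of the same records would hide.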
Ajay- Do you think software like Hadoop and MapReduce will change the online database permanently? What further developments do you see in this area?
Dr Fayyad- I think they will (and have) changed the landscape dramatically, but they do not address everything. Many problems lend themselves naturally to Map-Reduce and many new approaches are enabled by Map-Reduce. However, there are many problems where M-R does not do much. I see a lot of problems being addressed by a large grid nowadays when they don’t need it. This is often a huge waste of computational resources. We need to learn how to deal with a mix of tools and platforms. I think M-R will be with us for a long time and will be a staple tool – but not a universal one.
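For readers who have not seen the pattern, here is a minimal single-process sketch of the map, shuffle, and reduce steps, using word counting as the example. It is only a conceptual illustration: the function names are made up for this sketch and do not correspond to the Hadoop API, and nothing here is distributed.

```python
from collections import defaultdict


def map_phase(documents, mapper):
    """Apply the mapper to every input and stream out (key, value) pairs."""
    for doc in documents:
        yield from mapper(doc)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(doc):
    """Emit (word, 1) for every word in a document."""
    for word in doc.lower().split():
        yield word, 1


def count_reducer(_key, values):
    """Sum the partial counts for one word."""
    return sum(values)


docs = ["data mining at scale", "mining web data", "web scale"]
counts = reduce_phase(shuffle(map_phase(docs, word_mapper)), count_reducer)
print(counts)
```

Problems fit this model when the work per record is independent and the results combine by key. Iterative algorithms that need many passes over shared state are an example of where the fit is poorer.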
Ajay- I look forward to the day when I have just a low priced netbook and fast internet connection, and upload a Gigabyte of data and run advanced analytics on the browser. How far or soon do you think it is possible?
Dr Fayyad- Well, I think the day is already here. In fact, much of our web search today is conducted exactly in that model. A lot of web analysis, and much of scientific analysis is done like this today.
Ajay- Describe some of the conferences you are currently involved with and the research areas that excites you the most.
Dr Fayyad- I am still very involved in knowledge discovery and data mining conferences (especially the KDD series), machine learning, some statistics, and some conferences on search and internet. Most exciting conferences for me are ones that cover a mix of topics but that address real problems. Examples include understanding how social networks evolve and behave, understanding dimensionality reductions (like random projections in very high-D spaces) and generally any work that gives us insight into why a particular technique works better and where the open challenges are.
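As a small illustration of the random projections mentioned above, the following sketch projects two 1000-dimensional points down to 200 dimensions with a Gaussian random matrix and compares their distance before and after. The dimensions, the seed, and the helper names are arbitrary choices for this example, and pure Python is used only to keep it self-contained.

```python
import math
import random


def random_projection_matrix(d, k, rng):
    """k x d matrix with independent N(0, 1/k) entries."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, scale) for _ in range(d)] for _ in range(k)]


def project(matrix, vector):
    """Multiply the projection matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


rng = random.Random(42)
d, k = 1000, 200                      # original and reduced dimensionality
u = [rng.random() for _ in range(d)]
v = [rng.random() for _ in range(d)]

R = random_projection_matrix(d, k, rng)
ratio = distance(project(R, u), project(R, v)) / distance(u, v)
print("projected / original distance:", round(ratio, 3))
```

The ratio should land close to 1, meaning the pairwise distance is roughly preserved even though 80% of the dimensions were dropped without looking at the data.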
Ajay- What are the next breakthrough areas in data mining. Can we have a Google or Yahoo in fields of business intelligence as well given their huge market potential and uncertain ROI.
Dr Fayyad- We already have some large and healthy businesses in BI and quite a huge industry in consulting. If you are asking particularly about the tools market then I think that market is very limited. The users of analysis tools are always going to be small in number. However, once the BI and Data Mining tools are embedded in vertical applications, then the number of users will be tremendous. That’s where you will see success.
Consider the examples of Google or Yahoo! – and now Microsoft with the Bing search engine. Search engines today would not be good without machine learning/data mining technology. In fact MLR (Machine Learned Ranking) is at the core of the ranking methodology that decides which search results bubble to the top of the list. The typical web query is 2.6 keywords long and has about a billion matches. What matters are the top 10. The function that determines these is a relevance ranking algorithm that uses machine learning to tune a formula that considers hundreds or thousands of variables about each document. So in many ways, you have a great example of this technology being used by hundreds of millions of people every day – without knowing it!
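To make the "formula over many variables" idea tangible, here is a deliberately tiny sketch: each document gets a score from a weighted sum of its features, and only the top results are kept. Everything in it is hypothetical. The feature names and weights are invented, a real system would learn its weights from relevance judgments, and production rankers are not necessarily linear.

```python
import heapq


def score(weights, features):
    """Weighted sum of a document's features; missing features count as 0."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


def top_k(documents, weights, k=10):
    """Return the k highest-scoring documents, best first."""
    return heapq.nlargest(k, documents, key=lambda d: score(weights, d["features"]))


# Hypothetical weights; in machine-learned ranking these would be tuned from data.
weights = {"text_match": 2.0, "link_authority": 1.0, "freshness": 0.5}

documents = [
    {"id": "a", "features": {"text_match": 0.9, "link_authority": 0.2, "freshness": 0.1}},
    {"id": "b", "features": {"text_match": 0.4, "link_authority": 0.9, "freshness": 0.5}},
    {"id": "c", "features": {"text_match": 0.7, "link_authority": 0.8}},
]

best = top_k(documents, weights, k=2)
print([d["id"] for d in best])
```

Using a heap-based top-k rather than a full sort reflects the point made above: with a billion matches, only the first ten matter.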
Success will be in applications where the technology becomes invisible – much like the internal combustion engine in your car or the electric motor in your coffee grinder or power supply fan. I think once people start building verticalized solutions that embed data mining and BI, we will hit success. This already has happened in web search, in direct marketing, in advertising targeting, in credit scoring, in fraud detection, and so on…
Ajay- What do you do to relax. What publications would you recommend for staying up to date for the data mining people especially the younger analysts.
Dr Fayyad- My favorite activity is sleep when I can get it :) But more seriously, I enjoy reading books, playing chess, skiing (on water or snow – downhill or x-country), or any activities with my kids. I swim a lot and that gives me much time to think and sort things out.
I think for keeping up with the technical advances in data mining: the KDD conferences, some of the applied analytics conferences, the WWW conferences, and the data mining journals. The ACM SIGKDD publishes a nice newsletter called SIGKDD explorations. It is free with a very low membership fee and it has a lot of announcements and survey papers on new topics and important areas (www.kdd.org). Also, a good list to keep up with is an email list called KDNuggets edited by Gregory Piatetsky-Shapiro.
Biography (www.fayyad.com/usama )-
Usama Fayyad founded Open Insights (www.open-insights.com) to deliver on the vision of bridging the gap between data and insights and to help companies develop strategies and solutions not only to turn data into working business assets, but to turn the insights available from the growing amounts of data into critical components of an enterprise’s strategy for approaching markets, dealing with competitors, and acquiring and retaining customers.
In his prior role as Chief Data Officer of Yahoo! he built the data teams and infrastructure to manage the 25 terabytes of data per day that resulted from the company’s operations. He also built up the targeting systems and the data strategy for how to utilize data to enhance revenue and to create new revenue sources for the company.
In addition, he was the founding executive for Yahoo! Research, a scientific research organization that became the top research place in the world working on inventing the new sciences of the Internet.
Ok, the events in the following poem really happened when I visited Austin, Texas on a business trip, and thanks to the Austin Post (a new media, Austin-based site) for publishing it.
And you can read the rest at http://www.austinpost.org/content/live-music-sixth-street-austin-poetry
If you like poetry, live music, Austin ….
At risk of annoying a lot of friendly people, I am going to ask an old question and try and answer it quantitatively.
Who can buy SAS institute?
As you can see from the graph (note the post 2001-2004 period), which is a nicely smoothed curve resembling the left side of a textbook normal distribution, SAS Institute grew during the tough economic year of 2008, showing slowed but firm revenue growth. However, if you use the same price/revenue multiple as for the SPSS acquisition (1.2 billion / 300 million in 2008 revenues), that would put a price of 9.2 billion USD on SAS Institute.
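The arithmetic behind that estimate is a simple price-to-revenue multiple, sketched below. One input is an assumption: the post does not state the SAS revenue figure it used, so 2.3 billion USD is back-computed here as the value that reproduces the quoted 9.2 billion under the SPSS multiple.

```python
def price_to_revenue_multiple(price, revenue):
    """How many dollars of purchase price were paid per dollar of annual revenue."""
    return price / revenue


def implied_price(multiple, revenue):
    """Price implied by applying a revenue multiple to another company."""
    return multiple * revenue


# Figures from the post: roughly 1.2 billion USD paid for about 300 million in 2008 revenue.
spss_multiple = price_to_revenue_multiple(1.2e9, 300e6)

# ASSUMPTION: SAS revenue is not given in the post; 2.3e9 is back-computed
# so that the result matches the 9.2 billion USD quoted above.
sas_revenue = 2.3e9
sas_price = implied_price(spss_multiple, sas_revenue)

print("SPSS multiple:", spss_multiple)
print("Implied SAS price (billion USD):", sas_price / 1e9)
```

A revenue multiple ignores margins, growth, and debt, so this is a rough yardstick and not a full valuation.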
Who has that kind of money? Well it seems the usual suspects are-
Cash and cash equivalents of 12.851 Billion USD as on April 30, 2009.
2) Oracle- Oracle would be hard pressed to integrate both Sun and SAS in the same year, but may have financial leverage to do both.
Fiscal year 2009
GAAP revenues were up 4% to $23.3 billion, while annual GAAP net income was up 1% to $5.6 billion. Total GAAP new software license revenues for the year were down 5% to $7.1 billion. GAAP software license updates and product support revenues were up 14% to $11.8 billion. GAAP operating income was up 6% to $8.3 billion, and GAAP operating margins were up 80 basis points to 36% in fiscal year 2009.
3) IBM -from ftp://ftp.software.ibm.com/annualreport/2008/2008_ibm_financials.pdf
Cash on hand was 12.7 Billion USD as on 31 Dec 2008, and the company repurchased its own stock in 2008.
In the current economic environment growth can come through acquisitions of newer clients (not much) or new companies. IBM has the capability to acquire BOTH SPSS and SAS Institute and merge the strong R and D facilities.
various sources of loan capital:
profit after income taxes for 2008 was slightly lower than for the previous year, we increased cash flows from operating activities 12% to € 2,158 million (2007: € 1,932 million) through efficient management of working capital.
- To finance the acquisition of Business Objects, we entered into an agreement for a credit facility that was originally for € 5 billion and is repayable by December 31, 2009 (amount outstanding on December 31, 2008: € 2.3 billion). We did not draw the full € 5 billion available under the facility because we paid part of the purchase price from available cash.
- To increase financial flexibility, in November 2004 we obtained a € 1 billion syndicated credit facility through an international group of banks. We already had other lines of credit in place; the new line was arranged to provide additional financial flexibility. As in the previous year, we did not draw on this facility during the year.
- At the end of 2008, the other, bilateral lines of credit available to SAP AG totaled approximately € 597 million (2007: € 599 million). We did not draw on these facilities during 2008 or 2007. Several subsidiaries in the SAP Group had credit lines in their local currency. These totaled € 52 million (2007: € 44 million), for which SAP AG was guarantor. At the end of the year, the subsidiaries had drawn € 21 million under these facilities (2007: € 27 million).
Given these cash positions, it seems that almost everyone can buy SAS Institute if (and this is a big IF) someone sells it. Microsoft, which some years ago allegedly tried and failed to acquire Yahoo (only to realize huge savings!) and SAS, would be another suitor for SAS. Google, which has the best text mining capabilities along with financial and operating synergies, could also act as a white knight by merging its Google Applications and Enterprise solutions (especially the cloud-based OS and cloud-based productivity suite) with SAS Institute. I personally would favor a Google-SAS Institute joint venture on enterprise software, based solely on the common history and shared values. (Note that Google has dual ownership stock, including class A and class B shares.)
Who is John Galt ?
Another option could be for SAS Institute to follow the Google way and go for a dual ownership IPO, with class A shares for the general public and class B shares for the founders and executives. A substantial endowment to colleges and universities can also be expected in the future, given the philanthropic tradition of SAS Institute owners and executives. Also, SAS could try to buy SPSS; it would lead to synergies in both software (with the SPSS GUI) and new clients. At the very minimum it would boost the valuation of other stock in this sector as well as make SPSS more realistically valued.
So who will buy SAS Institute?
I don’t know 🙂 and I am just brushing off my half a decade old financial valuation skills here
A brief study of the charts at http://tr.im/vDA4 (courtesy Google Finance) would suggest IBM is getting a bargain for SPSS Inc.
And Oracle, Microsoft and other companies ( even the privately held SAS Institute) can do well to step in and take it away or at the very minimum make the valuation even more steep for IBM to hold on to.
SPSS reported in 2007 Total Revenue of $291 million with a Net Income of $33.73 million and in 2008 Total Revenue of $302.91 million with a Net Income of $36.05 million. Shares of SPSS Inc. (Public, NASDAQ:SPSS) increased from about $35 per share before the announcement to $49.50 per share after the announcement.
Chart at http://tr.im/vDA4
Here is a good open content Journal for people wanting to keep track of latest in statistical software.
It is called Journal of Statistical Software.
Established in 1996, the Journal of Statistical Software publishes articles, book reviews, code snippets, and software reviews on the subject of statistical software and algorithms. The contents are freely available on-line. For both articles and code snippets the source code is published along with the paper.
Implementations can use languages such as C, C++, S, Fortran, Java, PHP, Python and Ruby or environments such as Mathematica, MATLAB, R, S-PLUS, SAS, Stata, and XLISP-STAT.
E.g Book Reviews of A Handbook of Statistical Analyses Using SAS (Third Edition)
It is really cutting edge stuff for someone who wants to keep up with the latest, fast-moving tech trends in statistical software, and it has convenient RSS feeds as well as email announcement alerts.
Note- Various Journals can be ranked using a quantitative index called Impact Factor
E.G For Statistics
In these columns, total citations to a journal’s published papers are divided by the total number of papers that the journal published, producing a citations-per-paper impact score over a five-year period (middle column) and a 26-year period (right-hand column).
Journals Ranked by Impact:
Statistics & Probability
J. Royal Stat. Soc. B
J. Royal Stat. Soc. B
3 Chemom. Intell. Lab.
J. Am. Stat. Assoc.
J. Computat. Biology
5 J. Royal Stat. Soc. B
Annals of Statistics
6 IEEE ACM T Comp. Bi.
7 J. Am. Stat. Assoc.
J. Am. Stat. Assoc.
8 Multivar. Behav. Res.
Multivar. Behav. Res.
9 J. Computat. Biology
Annals of Statistics
10 Annals of Statistics
Stat. in Medicine
J. Royal Stat. Soc. A
The following is an excellent list of High Performance Computing using R.
CRAN Task View: High Performance and Parallel Computing
Maintainer: Dirk Eddelbuettel Contact: Dirk.Eddelbuettel at R-project.org Version: 2009-06-12
This CRAN task view contains a list of packages, grouped by topic, that are useful for high-performance computing (HPC) with R. In this context, we are defining ‘high-performance computing’ rather loosely as just about anything related to pushing R a little further: using compiled code, parallel computing (in both explicit and implicit modes), working with large objects as well as profiling.
Unless otherwise mentioned, all packages presented with hyperlinks are available from CRAN, the Comprehensive R Archive Network.
Several of the areas discussed in this Task View are undergoing rapid change. Please send suggestions for additions and extensions for this task view to the task view maintainer.
Suggestions and corrections by Achim Zeileis, Markus Schmidberger, Martin Morgan, Max Kuhn, Tomas Radivoyevitch, Jochen Knaus, Tobias Verbeke, Hao Yu, and David Roseberg are gratefully acknowledged.
Parallel computing: Explicit parallelism
- Several packages provide the communications layer required for parallel computing. The first package in this area was rpvm by Li and Rossini which uses the PVM (Parallel Virtual Machine) standard and libraries. rpvm is no longer actively maintained.
- In recent years, the alternative MPI (Message Passing Interface) standard has become the de facto standard in parallel computing. It is supported in R via the Rmpi package by Yu. Rmpi is mature yet actively maintained and offers access to numerous functions from the MPI API, as well as a number of R-specific extensions. Rmpi can be used with the LAM/MPI, MPICH / MPICH2, Open MPI, and Deino MPI implementations. It should be noted that LAM/MPI is now in maintenance mode, and new development is focussed on Open MPI.
- An alternative is provided by the nws (NetWorkSpaces) packages from REvolution Computing. It is the successor to the earlier LindaSpaces approach to parallel computing, and is implemented on top of the Twisted networking toolkit for Python.
- The snow (Simple Network of Workstations) package by Tierney et al. can use PVM, MPI, NWS as well as direct networking sockets. It provides an abstraction layer by hiding the communications details. The snowFT package provides fault-tolerance extensions to snow.
- The snowfall package by Knaus provides a more recent alternative to snow. Functions can be used in sequential or parallel mode.
- The papply package by Currie provided a subset of the Rmpi functionality, but is no longer actively maintained either.
- The biopara package by Lazar and Schoenfeld offers socket-based parallel execution with some support for load-balancing and fault-tolerance.
- The taskPR package by Samatova et al. builds on top of LAM/MPI and offers parallel execution of tasks.
- The Simple Parallel R INTerface (SPRINT) package by Hill et al. provides a prototype framework that allows the addition of parallelised functions to R for easy exploitation of HPC systems. Currently only a parallelised correlation calculation is provided.
Parallel computing: Implicit parallelism
- The pnmath package by Tierney uses the Open MP parallel processing directives of recent compilers (such as gcc 4.2 or later) for implicit parallelism by replacing a number of internal R functions with replacements that can make use of multiple cores, without any explicit requests from the user. The alternate pnmath0 package offers the same functionality using Pthreads for environments in which the newer compilers are not available. Similar functionality is expected to become integrated into R ‘eventually’.
- The romp package by Jamitzky was presented at useR! 2008 and offers another interface to Open MP using Fortran. The code is still pre-alpha and available from the Google Code project romp. An R-Forge project romp was initiated but there is no package yet.
- The fork package by Warnes provides R-equivalents to low-level Unix system functions like fork, signal, wait, kill and exit in order to spawn sub-processes for parallel execution.
- The multicore package by Urbanek provides a way of running parallel computations in R on machines with multiple cores or CPUs.
- The R/parallel package by Vera, Jansen and Suppi offers a C++-based master-slave dispatch mechanism for parallel execution.
- The RScaLAPACK package by Samatova et al. provides an interface to the ScaLAPACK libraries which can replace the standard BLAS libraries and offer parallel execution of the same BLAS functions.
- The SPRINT package by Hill adds another parallel framework to R.
- The mapReduce package by Brown provides a simple framework for parallel computations following the Google mapReduce approach. It provides a pure R implementation, a syntax following the mapReduce paper and a flexible and parallelizable back end.
Parallel computing: Grid computing
- The GridR package by Wegener et al. can be used in a grid computing environment via a web service, via ssh or via Condor or Globus.
- The multiR package by Grose was presented at useR! 2008 but has not been released. It may offer a snow-style framework on a grid computing platform.
- The biocep-distrib project by Chine offers a Java-based framework for local, Grid, or Cloud computing. It is under active development.
- The RHIPE package by Guha provides an interface between R and Hadoop for a Map/Reduce programming framework.
Parallel computing: Random numbers
- Random-number generators for parallel computing are available via the rsprng package by Li, and the rlecuyer package by Sevcikova and Rossini.
Parallel computing: Resource managers and batch schedulers
- Job-scheduling toolkits permit management of parallel computing resources and tasks. The slurm (Simple Linux Utility for Resource Management) set of programs (written by a consortium led by Lawrence Livermore Labs) works well with MPI.
- The Condor toolkit from the University of Wisconsin-Madison has been used with R as described in an R News article.
- The sfCluster package by Knaus can be used with snowfall but is currently limited to LAM/MPI.
- The Rsge package by Bode offers an interface to the Sun Grid Engine batch-queuing system.
- The Rlsf package by Smith et al. offers an interface to the LSF cluster/grid system.
Parallel computing: Applications
- The caret package by Kuhn can use various frameworks (MPI, NWS etc.) to parallelize cross-validation and bootstrap characterizations of predictive models.
- The multtest package by Pollard et al. can use snow, Rmpi or rpvm for resampling-based testing of multiple hypotheses.
- The maanova package by Wu can use snow and Rmpi for the analysis of micro-array experiments.
- The pvclust package by Suzuki and Shimodaira can use snow and Rmpi for hierarchical clustering via multiscale bootstraps; and the scaleboot package by Shimodaira can use pvclust, snow and Rmpi for computing approximately unbiased p-values via multiscale bootstraps.
- The tm package by Feinerer can use snow and Rmpi for parallelized text mining.
- The varSelRF package by Diaz-Uriarte can use snow and Rmpi for parallelized use of variable selection via random forests; and the ADaCGH package by Diaz-Uriarte and Rueda can use Rmpi and papply for parallelized analysis of array CGH data.
- The bcp package by Erdman and Emerson for the Bayesian analysis of change points and the bigmemory package by Kane and Emerson can both use nws for parallelized operations.
- The networksis package by Admiraal and Handcock can use rpvm and snow for parallelized simulation of bipartite graphs via sequential importance sampling.
- The following packages can all use snow for parallelized operations, using any one of the MPI, PVM, NWS or socket protocols supported by snow: the BARD package by Altman for better automated redistricting; the GAMBoost package by Binder for glm and gam model fitting via boosting using b-splines; the Geneland package by Estoup, Guillot and Santos for structure detection from multilocus genetic data; the Matching package by Sekhon for multivariate and propensity score matching; the STAR package by Pouzat for spike train analysis; the bnlearn package by Scutari for Bayesian network structure learning; the latentnet package by Krivitsky and Handcock for latent position and cluster models; the lga package by Harrington for linear grouping analysis; the peperr package by Porzelius and Binder for parallelised estimation of prediction error; the orloca package by Fernandez-Palacin and Munoz-Marquez for operations research locational analysis; the rgenoud package by Mebane and Sekhon for genetic optimization using derivatives; the affyPara package by Schmidberger, Vicedo and Mansmann for parallel normalization of Affymetrix microarrays; the puma package by Pearson et al., which propagates uncertainty into standard microarray analyses such as differential expression; and the ccems package for combinatorically complex equilibrium model selection.
- The bugsparallel package uses Rmpi for distributed computing of multiple MCMC chains using WinBUGS.
- The partDSA package uses nws for generating a piecewise constant estimation list of increasingly complex predictors based on an intensive and comprehensive search over the entire covariate space.
Parallel computing: GPUs
- The gputools package by Buckner provides several common data-mining algorithms which are implemented using a mixture of nVidia’s CUDA language and CUBLAS library. Given a computer with an nVidia GPU these functions may be substantially more efficient than native R routines.
Large memory and out-of-memory data
- The biglm package by Lumley uses incremental computations to offer lm() and glm() functionality for data sets stored outside of R’s main memory.
- The ff package by Adler et al. offers file-based access to data sets that are too large to be loaded into memory, along with a number of higher-level functions.
- The bigmemory package by Kane and Emerson permits storing large objects such as matrices in memory and uses external pointer objects to refer to them. This permits transparent access from R without bumping against R’s internal memory limits. Several R processes on the same computer can also share big memory objects.
- A large number of database packages, and database-alike packages (such as sqldf by Grothendieck and data.table by Dowle) are also of potential interest but not reviewed here.
- The HadoopStreaming package provides a framework for writing map/reduce scripts for use in Hadoop Streaming; it also facilitates operating on data in a streaming fashion which does not require Hadoop.
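The idea of incremental computation that the biglm item above refers to can be illustrated with a simple linear regression fitted chunk by chunk. This is a minimal Python sketch, not biglm's algorithm or API: it accumulates plain running sums (the normal equations for one predictor), whereas a production package may use a numerically more careful update. The class name ChunkedLinearFit and the sample data are hypothetical.

```python
class ChunkedLinearFit:
    """Hypothetical sketch: fit y = a + b*x from chunks without keeping the data."""

    def __init__(self):
        # Only these five running totals stay in memory, however large the data.
        self.n = 0
        self.sx = 0.0
        self.sy = 0.0
        self.sxx = 0.0
        self.sxy = 0.0

    def update(self, xs, ys):
        # Fold one chunk into the running sums; the chunk can then be discarded.
        for x, y in zip(xs, ys):
            self.n += 1
            self.sx += x
            self.sy += y
            self.sxx += x * x
            self.sxy += x * y

    def coefficients(self):
        # Closed-form least-squares solution from the accumulated sums.
        denom = self.n * self.sxx - self.sx ** 2
        if denom == 0:
            raise ValueError("need at least two distinct x values")
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return intercept, slope


# Assumed example data following y = 1 + 2x, delivered in two chunks.
fit = ChunkedLinearFit()
for xs, ys in [([0.0, 1.0], [1.0, 3.0]), ([2.0, 3.0], [5.0, 7.0])]:
    fit.update(xs, ys)
intercept, slope = fit.coefficients()
```

Each chunk could just as well be read from a file or a database cursor; the fit only ever needs the current chunk plus the running totals.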
Easier interfaces for Compiled code
- The inline package by Sklyar, Murdoch and Smith eases adding code in C, C++ or Fortran to R. It takes care of the compilation, linking and loading of embedded code segments that are stored as R strings.
- The Rcpp package by Eddelbuettel offers a number of C++ classes that make transferring R objects to C++ functions (and back) easier, and the RInside package by Eddelbuettel allows easy embedding of R itself into C++ applications for faster and more direct data transfer.
- The rJava package by Urbanek provides a low-level interface to Java similar to the .Call() interface for C and C++.
- Rmpi (core)
- snow (core)
- HPC computing notes by Luke Tierney for HPC class at University of Iowa
- Mailing List: R Special Interest Group High Performance Computing
- Schmidberger, Morgan, Eddelbuettel, Yu, Tierney and Mansmann (2009) paper on ‘State-of-the-Art in Parallel Computing with R’
- Luke Tierney’s code directory for pnmath and pnmath0
- R-Forge Project: biocep-distrib
- R-Forge Project: RInside
- Bioconductor Package: affyPara
- Bioconductor Package: puma
- Google Code Project: romp
- Google Code Project: bugsparallel
- Slurm project at Lawrence Livermore National Laboratory
- Condor project at University of Wisconsin-Madison
- Parallel Computing in R with sfCluster/snowfall
- Wikipedia: Message Passing Interface (MPI)
- Wikipedia: Parallel Virtual Machine (PVM)
Slides from Introduction to High-Performance Computing with R tutorial / workshop presentation