For that reason, we’ve settled on the more manageable question, “which packages are most often installed by normal R users?”
This last question could potentially be answered in a variety of ways. Our current approach uses a convenience sample of installation data that we’ve collected from volunteers in the R community, who kindly agreed to send us a list of the packages they have on their systems. We’ve anonymized this data and compiled a set of metadata-based predictors that allow us to predict the installation probabilities quite well. We’re releasing all of our current work, including the data we have and all of the code we’ve used so far for our exploratory analyses. The contest itself will go live on Kaggle on Sunday and will end four months from Sunday on February 10, 2011. The rules, prizes and official data sets are all described below.
Rules and Prizes
To win the contest, you need to predict the probability that a user U has a package P installed on their system for every pair (U, P). We’ll assess your performance using ROC methods, evaluated against a held-out test data set. The winning team will receive three UseR! books of their choosing. In order to win the contest, you’ll have to provide your analysis code to us by creating a fork of our GitHub repository. You’ll also be required to provide a written description of your approach. We’re asking for so much openness from the winning team because we want this contest to serve as a stepping stone for the R community. We’re also hoping that enterprising data hackers will extend the lessons learned through this contest to other programming languages.
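To make the task concrete, here is a minimal sketch of the workflow the rules describe: fit a model on (user, package) pairs, predict installation probabilities on held-out pairs, and score them with the area under the ROC curve. The data below is simulated and the two predictors (depends_count, is_recommended) are hypothetical stand-ins, not the contest's actual metadata; the organizers' exact ROC scoring procedure is not specified here, so the rank-based AUC is only one plausible choice.

```r
# Simulated (user, package) pairs; both predictors are hypothetical.
set.seed(42)
n <- 500
pairs <- data.frame(
  depends_count  = rpois(n, 3),        # made-up metadata predictor
  is_recommended = rbinom(n, 1, 0.2)   # made-up metadata predictor
)
lin <- -1 + 0.4 * pairs$depends_count + 1.5 * pairs$is_recommended
pairs$installed <- rbinom(n, 1, plogis(lin))

# Hold out one fifth of the pairs as a test set.
test_idx <- sample(n, n / 5)
train <- pairs[-test_idx, ]
test  <- pairs[test_idx, ]

# Logistic regression gives a probability for every pair.
fit <- glm(installed ~ depends_count + is_recommended,
           data = train, family = binomial)
p_hat <- predict(fit, newdata = test, type = "response")

# AUC from the Mann-Whitney rank statistic (ties get average ranks).
auc <- function(scores, labels) {
  r <- rank(scores)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

test_auc <- auc(p_hat, test$installed)
print(test_auc)
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of installed over non-installed pairs.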
New software has just been released by the guys in California (@RevolutionR). If you are a Linux user and have academic credentials, you can download it for free (@Cmastication doesn't) and test it to see what the big fuss is all about (also see http://www.revolutionanalytics.com/why-revolution-r/benchmarks.php):
Revolution Analytics has just released Revolution R Enterprise 4.0.1 for Red Hat Enterprise Linux, a significant step forward in enterprise data analytics. Revolution R Enterprise 4.0.1 is built on R 2.11.1, the latest release of the open-source environment for data analysis and graphics. Also available is the initial release of our deployment server solution, RevoDeployR 1.0, designed to help you deliver R analytics via the Web. And coming soon to Linux: RevoScaleR, a new package for fast and efficient multi-core processing of large data sets.
As a registered user of the Academic version of Revolution R Enterprise for Linux, you can take advantage of these improvements by downloading and installing Revolution R Enterprise 4.0.1 today. You can install Revolution R Enterprise 4.0.1 side-by-side with your existing Revolution R Enterprise installations; there is no need to uninstall previous versions.
Download Information
The following information is all you will need to download and install the Academic Edition.
Supported Platforms:
Revolution R Enterprise Academic edition and RevoDeployR are supported on Red Hat® Enterprise Linux® 5.4 or greater (64-bit processors).
Approximately 300MB free disk space is required for a full install of Revolution R Enterprise. We recommend at least 1GB of RAM to use Revolution R Enterprise.
For the full list of system requirements for RevoDeployR, refer to the RevoDeployR™ Installation Guide for Red Hat® Enterprise Linux®.
Download Links:
You will first need to download the Revolution R Enterprise installer.
Installation Instructions for Revolution R Enterprise Academic Edition
After downloading the installer, do the following to install the software:
Log in as root if you have not already.
Change directory to the directory containing the downloaded installer.
Unpack the installer using the following command:
tar -xzf Revo-Ent-4.0.1-RHEL5-desktop.tar.gz
Change directory to the RevolutionR_4.0.1 directory created.
Run the installer by typing ./install.py and following the on-screen prompts.
Getting Started with Revolution R Enterprise
After you have installed the software, launch Revolution R Enterprise by typing Revo64 at the shell prompt.
Documentation is available in the form of PDF documents installed as part of the Revolution R Enterprise distribution. Type Revo.home("doc") at the R prompt to locate the directory containing the manuals Getting Started with Revolution R (RevoMan.pdf) and the ParallelR User’s Guide (parRman.pdf).
Installation Instructions for RevoDeployR (and RServe)
After downloading the RevoDeployR distribution, use the following steps to install the software:
Note: These instructions are for an automatic install. For more details or for manual install instructions, refer to RevoDeployR_Installation_Instructions_for_RedHat.pdf.
Log into the operating system as root.
su -
Change directory to the directory containing the downloaded distribution for RevoDeployR and RServe.
Extract the contents of the RevoDeployR tar file. At the prompt, type:
tar -xzf deployrRedHat.tar.gz
Change directories. At the prompt, type:
cd installFiles
Launch the automated installation script and follow the on-screen prompts. At the prompt, type:
./installRedHat.sh
Note: Red Hat installs MySQL without a password.
Getting Started with RevoDeployR
After installing RevoDeployR, you will be directed to the RevoDeployR landing page. The landing page has links to documentation, the RevoDeployR management console, the API Explorer development tool, and sample code.
The simple R-benchmark-25.R test script is a quick-running survey of general R performance. The Community-developed test consists of three sets of small benchmarks, referred to in the script as Matrix Calculation, Matrix Functions, and Program Control.
Revolution Analytics has created its own tests to simulate common real-world computations. They are described below.
Linear Algebra Computation | Base R 2.9.2 | Revolution R (1-core) | Revolution R (4-core) | Speedup (4-core)
Matrix Multiply | 243 sec | 22 sec | 5.9 sec | 41x
Cholesky Factorization | 23 sec | 3.8 sec | 1.1 sec | 21x
Singular Value Decomposition | 62 sec | 13 sec | 4.9 sec | 12.6x
Principal Components Analysis | 237 sec | 41 sec | 15.6 sec | 15.2x
Linear Discriminant Analysis | 142 sec | 49 sec | 32.0 sec | 4.4x
Speedup = slower time / faster time (here, the Base R time divided by the Revolution R 4-core time)
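The speedup column can be checked from the elapsed times in the table. The sketch below recomputes it as the plain ratio of the Base R time to the 4-core time, which is the ratio the listed figures correspond to; small differences are expected because the published times are rounded. The data frame and its column names are illustrative only.

```r
# Elapsed times (seconds) copied from the benchmark table above.
bench <- data.frame(
  test = c("Matrix Multiply", "Cholesky Factorization",
           "Singular Value Decomposition",
           "Principal Components Analysis",
           "Linear Discriminant Analysis"),
  base_r     = c(243, 23, 62, 237, 142),
  revo_4core = c(5.9, 1.1, 4.9, 15.6, 32.0),
  listed     = c(41, 21, 12.6, 15.2, 4.4)   # speedup as published
)

# Plain ratio of slower time to faster time.
bench$ratio <- round(bench$base_r / bench$revo_4core, 1)
print(bench)
```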
Matrix Multiply
This routine creates a random uniform 10,000 x 5,000 matrix A, and then times the computation of the matrix product transpose(A) * A.
set.seed(1)
m <- 10000
n <- 5000
# Random uniform m x n matrix
A <- matrix(runif(m * n), m, n)
# crossprod(A) computes t(A) %*% A
system.time(B <- crossprod(A))
The system will respond with a message in this format:
   user  system elapsed
  37.22    0.40    9.68
The “elapsed” times indicate total wall-clock time to run the timed code.
The table above reflects the elapsed time for this and the other benchmark tests. The test system was an INTEL® Xeon® 8-core CPU (model X55600) at 2.5 GHz with 18 GB system RAM running the Windows Server 2008 operating system. For the Revolution R benchmarks, the computations were limited to 1 core and 4 cores by calling setMKLthreads(1) and setMKLthreads(4) respectively. Note that Revolution R performs very well even in single-threaded tests: this is a result of the optimized algorithms in the Intel MKL library linked to Revolution R. The slightly greater-than-linear speedup may be due to the greater total cache available to all CPU cores, or simply to better OS CPU scheduling; no attempt was made to pin execution threads to physical cores. Consult Revolution R’s documentation to learn how to run benchmarks that use fewer cores than your hardware offers.
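As a rough sketch of how the 1-core and 4-core timings could be collected, the helper below sets the thread count before timing a computation. setMKLthreads() is the function named in the text and is assumed to exist only in Revolution R; the helper time_with_threads() is hypothetical, and on a plain R installation it falls back to timing with whatever BLAS is present. The matrix here is deliberately much smaller than the benchmark's.

```r
# Hypothetical helper: limit MKL threads (if possible), then time a task.
time_with_threads <- function(task, threads) {
  if (exists("setMKLthreads", mode = "function")) {
    setMKLthreads(threads)   # only available in Revolution R
  } else {
    message("setMKLthreads() not found; timing with the default BLAS")
  }
  system.time(task())[["elapsed"]]
}

set.seed(1)
A_small <- matrix(runif(500 * 200), 500, 200)  # small stand-in matrix

for (k in c(1, 4)) {
  secs <- time_with_threads(function() crossprod(A_small), k)
  cat(k, "thread(s):", secs, "sec elapsed\n")
}
```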
Cholesky Factorization
The Cholesky matrix factorization may be used to compute the solution of linear systems of equations with a symmetric positive definite coefficient matrix, to compute correlated sets of pseudo-random numbers, and other tasks. We re-use the matrix B computed in the example above:
system.time (C <- chol(B))
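The two uses mentioned above can be illustrated on a small scale. This is a sketch with made-up matrices (S, Sigma) rather than the benchmark's B: it solves a symmetric positive definite system with two triangular solves, then uses a Cholesky factor to generate correlated normal draws.

```r
set.seed(1)

# A small symmetric positive definite matrix and a right-hand side.
X <- matrix(rnorm(200 * 4), 200, 4)
S <- crossprod(X)
b <- c(1, 2, 3, 4)

# chol() returns the upper triangular R with t(R) %*% R equal to S.
R <- chol(S)

# Solve S x = b in two triangular steps: t(R) y = b, then R x = y.
y <- forwardsolve(t(R), b)
x <- backsolve(R, y)
print(max(abs(S %*% x - b)))   # residual, should be near zero

# Correlated pseudo-random numbers with target covariance Sigma.
Sigma <- matrix(c(1, 0.8, 0.8, 1), 2, 2)
Z <- matrix(rnorm(10000 * 2), 10000, 2)   # independent standard normals
Y <- Z %*% chol(Sigma)                    # columns now correlated
print(cor(Y)[1, 2])                       # sample correlation, near 0.8
```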
Singular Value Decomposition with Applications
The Singular Value Decomposition (SVD) is a numerically stable and very useful matrix decomposition. The SVD is often used to compute Principal Components and Linear Discriminant Analysis.
# Singular Value Decomposition
m <- 10000
n <- 2000
A <- matrix (runif (m*n),m,n)
system.time (S <- svd (A,nu=0,nv=0))
# Principal Components Analysis
m <- 10000
n <- 2000
A <- matrix (runif (m*n),m,n)
system.time (P <- prcomp(A))
# Linear Discriminant Analysis
library(MASS)
g <- 5
k <- round (m/2)
A <- data.frame (A, fac=sample (LETTERS[1:g],m,replace=TRUE))
train <- sample(1:m, k)
system.time (L <- lda(fac ~., data=A, prior=rep(1,g)/g, subset=train))
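The connection between the SVD and principal components noted above can be verified directly. The sketch below, on a small random matrix chosen only for illustration, shows that the standard deviations reported by prcomp() are the singular values of the centered data divided by sqrt(n - 1), and that the loadings match the right singular vectors up to sign.

```r
set.seed(1)
A_demo <- matrix(runif(200 * 5), 200, 5)   # small illustrative matrix

# Center the columns, then take the SVD of the centered data.
A_centered <- scale(A_demo, center = TRUE, scale = FALSE)
s <- svd(A_centered, nu = 0)

p <- prcomp(A_demo)

# Component standard deviations from the singular values.
sdev_from_svd <- s$d / sqrt(nrow(A_demo) - 1)
print(all.equal(p$sdev, sdev_from_svd))

# Loadings agree with the right singular vectors up to sign.
print(all.equal(abs(unname(p$rotation)), abs(s$v)))
```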
LibreOffice is a new fork of OpenOffice, started by people who want to ensure that OpenOffice remains free. It draws on efforts from just about everybody except Apple, Microsoft and Oracle (http://www.documentfoundation.org/supporters/), and it is a workable office productivity suite determined to remain free. I have used it and found it a bit shaky, but I really liked the new design and will willingly test it (and auto-submit bugs). It would be interesting to see the reaction of enterprise vendors like SAS, IBM, Dell, HP and Lenovo, as their support would be critical to both Unbreakable Oracle Linux and Unshakable LibreOffice.
Here is an interesting interview with Quentin G, CEO of AsterData. Marketing trumpeting aside, the insights on the what's-next vision are quite good.
As you look down the road, what are the three major challenges you see for vendors who keep trying to solve big data and other “now” problems with old tools?
Old tools and traditional architectures cannot scale effectively to handle massive data volumes that reach hundreds of terabytes, nor can they process large data volumes with high performance. Further, they are restricted to what SQL querying allows. The three challenges I have noted are:
First, performance, specifically, poor performance on large data volumes and heavy workloads: The pre-existing systems rely on storing data in a traditional DBMS or data warehouse and then extracting a sample of data to a separate processing tier. This greatly restricts data insights and analytics as only a sample of data is analyzed and understood. As more data is stored in these systems they suffer from performance degradation as more users try to access the system concurrently. Additionally moving masses of data out of the traditional DBMS to a separate processing tier adds latency and slows down analytics and response times. This pre-existing architecture greatly limits performance especially as data sizes grow.
Second, limited analytics: Pre-existing systems rely mostly on SQL for data querying and analysis. SQL poses several limitations and is not suited for ad hoc querying, deep data exploration and a range of other analytics. MapReduce overcomes the limitations of SQL and SQL-MapReduce in particular opens up a new class of analytics that cannot be achieved with SQL alone.
And, third, limitations of types of data that can be stored and analyzed: Traditional systems are not designed for non-relational or unstructured data. New solutions such as Aster Data’s are designed from the ground up to handle both relational and non-relational data. Organizations want to store and process a range of data types and do this in a single platform. New solutions allow for different data types to be handled in a single platform whereas pre-existing architectures and solutions are specialized around a single data type or format – this restricts the diversity of analytics that can be performed on these systems.
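For readers unfamiliar with the MapReduce model mentioned in the second point, here is a toy sketch of the general map-then-reduce pattern in base R. It is not Aster Data's SQL-MapReduce and makes no claim about how that product works; the documents, the count_words() and merge_counts() helpers, and the word-count task are all invented for illustration.

```r
# Two tiny "documents" standing in for unstructured data.
docs <- c("big data needs new tools", "old tools meet big data")

# Map step: each document independently emits its own word counts.
count_words <- function(doc) {
  tab <- table(strsplit(doc, " ")[[1]])
  setNames(as.numeric(tab), names(tab))
}
mapped <- lapply(docs, count_words)

# Reduce step: merge two partial count vectors by key.
merge_counts <- function(a, b) {
  keys <- union(names(a), names(b))
  out <- setNames(numeric(length(keys)), keys)
  out[names(a)] <- out[names(a)] + a
  out[names(b)] <- out[names(b)] + b
  out
}
totals <- Reduce(merge_counts, mapped)
print(totals)
```

Because the map step treats each document independently, it is the part that a distributed system can spread across many machines.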
Ubuntu One Basic – available now
This is the same as the current free 2 GB option but with a new name. Users can continue to sync files, contacts, bookmarks and notes for free as part of our basic service and access the integrated Ubuntu One Music Store. We are also extending our platform support to include a Windows client, which will be available in Beta very soon.
Ubuntu One Mobile – available October 7th
Ubuntu One Mobile is our first example of a service that helps you do more with the content stored in your personal cloud. With Ubuntu One Mobile’s main feature – mobile music streaming – users can listen to any MP3 songs in their personal cloud (any owned MP3s, not just those purchased from the Ubuntu One Music Store) using our custom developed apps for iPhone and Android (coming soon to their respective marketplaces). These will be open source and available from Launchpad. Ubuntu One Mobile will also include the mobile contacts sync feature that was launched in Beta for the 10.04 release.
Ubuntu One Mobile is available for $3.99 (USD) per month or $39.99 (USD) per year. Users interested in this add-on can try the service free for 30 days. Ubuntu One Mobile will be the perfect companion to your morning exercise, daily commute, and weekend at the beach – we’re really excited to bring you this service!
Ubuntu One 20-Packs – available now
A 20-Pack is 20 GB of storage for files, contacts, notes, and bookmarks. Users will be able to add multiple 20-Packs at $2.99 (USD) per month or $29.99 (USD) per year each. If you start with Ubuntu One Basic (2 GB) and add one 20-Pack (20 GB), you will have 22 GB of storage.
All add-ons are available for purchase in multiple currencies – USD, EUR and, recently added, GBP.
Users currently paying for the old 50 GB plan (including mobile contacts sync) can either keep their existing service or switch to the new plan structure to get more value from Ubuntu One at a lower price.