Events in the field of data that impacted us in 2011
1) Oracle unveiled plans for R Enterprise. This is one of the strongest statements of its focus on in-database analytics. Oracle also unveiled plans for a Public Cloud.
2) SAS Institute released version 9.3 of SAS, a major analytics software package in industry use.
3) IBM acquired many companies in analytics and high tech. Again. However, the expected benefits from Cognos-SPSS integration are yet to show a spectacular change in market share.
2011 selected acquisitions:
Q1 Labs, October 2011
Algorithmics, September 2011
i2, August 2011
Tririga, March 2011
4) SAP promised a lot with SAP HANA. Again, there were no major oohs and aahs in terms of market share fluctuations within analytics.
5) Amazon continued to lower prices of cloud computing and offer more options.
6) Google continues to dilly-dally with its analytics and cloud-based APIs. I do not expect all the APIs in the Google APIs suite to survive and be viable in the enterprise software space. This includes Google Cloud Storage, Cloud SQL, and the Prediction API (see https://code.google.com/apis/console/b/0/). Some of the location-based and translation-based APIs may have interesting spin-offs that could be very lucrative commercially.
7) Microsoft did... hmm, I forgot. Except for its investment in round 1 of Revolution Analytics many seasons ago, very little excitement has come from MS plans in data mining. The plugins for cloud-based data mining from Excel remain promising, while Azure remains a stealth-mode starter.
8) Revolution Analytics promised us a GUI and didn't deliver (as yet 🙂). But it did reveal much better enterprise software: Revolution R 5.0 is one of the strongest enterprise products in the R/statistical computing space, and R’s memory handling problem is now more an issue of perception than of substance, thanks to newer advances in how it is used.
9) More conferences, more books and more news on analytics startups in 2011. Big Data analytics remained a strong buzzword. Expect more from this space including creative uses of Hadoop based infrastructure.
10) Data privacy issues continue to hamper effective analytics usage. So does the lack of rational and balanced regulation in some of the most advanced economies. We expect more regulation and better guidelines in 2012.
An amazing example of R being used successfully in combination (and not in isolation) with other enterprise software is the add-ins functionality of JMP and its R integration.
See the following JMP add-ins which use R:
JMP Add-in: Multidimensional Scaling using R
This add-in creates a new menu command under the Add-Ins Menu in the submenu R Add-ins. The script will launch a custom dialog (or prompt for a JMP data table if one is not already open) where you can cast columns into roles for performing MDS on the data table. The analysis results in a data table of MDS dimensions and associated output graphics. MDS is a dimension reduction method that produces coordinates in Euclidean space (usually 2D or 3D) that best represent the structure of a full distance/dissimilarity matrix. MDS requires that the input be a symmetric dissimilarity matrix. Input to this application can be data that is already in the form of a symmetric dissimilarity matrix, or the dissimilarity matrix can be computed from the input data (where dissimilarity measures are calculated between rows of the input data table in R).
Submitted by: Kelci Miclaus | Initiative: All
Application: Add-Ins | Analysis: Exploratory Data Analysis
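To make the MDS description above concrete, here is a minimal sketch of classical (Torgerson) MDS in plain Python. It is not the add-in's implementation, which runs in R and whose exact routine is not stated here. The function names, the power-iteration eigen-solver and the rectangle test data are all assumptions made for illustration.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def classical_mds(dist, k=2, iters=500, seed=0):
    """Classical (Torgerson) MDS on a symmetric dissimilarity matrix.

    Returns n rows of k coordinates. Assumes the dissimilarities behave
    like Euclidean distances, so the leading eigenvalues are positive.
    """
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double-center the squared dissimilarities: B = -1/2 * J * D^2 * J.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * k for _ in range(n)]
    for dim in range(k):
        # Power iteration finds the leading eigenvector of the current B.
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iters):
            w = matvec(b, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        # Rayleigh quotient gives the matching eigenvalue.
        eigenvalue = sum(x * y for x, y in zip(v, matvec(b, v)))
        if eigenvalue <= 1e-12:
            break  # no further positive structure to extract
        scale = math.sqrt(eigenvalue)
        for i in range(n):
            coords[i][dim] = scale * v[i]
        # Deflate B so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]
    return coords


# Illustrative input: corners of a 3 x 4 rectangle, turned into distances.
points = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0), (0.0, 4.0)]
dist = [[math.dist(p, q) for q in points] for p in points]
embedding = classical_mds(dist, k=2)
```

The recovered coordinates match the original points only up to rotation and reflection, so the meaningful check is that the pairwise distances in `embedding` reproduce `dist`. For real work, a tested routine such as R's `cmdscale` would be the safer choice.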
Chernoff Faces Add-in
One way to plot multivariate data is to use Chernoff faces. For each observation in your data table, a face is drawn such that each variable in your data set is represented by a feature in the face. This add-in uses JMP’s R integration functionality to create Chernoff faces. An R install and the TeachingDemos R package are required to use this add-in.
Submitted by: Clay Barker | Initiative: All
Application: Add-Ins | Analysis: Data Visualization
Support Vector Machine for Classification
By simply opening a data table, specifying X and Y variables, selecting a kernel function, and specifying its parameters in the user-friendly dialog, you can build a classification model using a Support Vector Machine. Please note that the R package ‘e1071’ should be installed before running this dialog. The package can be found at http://cran.r-project.org/web/packages/e1071/index.html.
Submitted by: Jong-Seok Lee | Initiative: All
Application: Add-Ins | Analysis: Exploratory Data Analysis/Mining
Penalized Regression Add-in
This add-in uses JMP’s R integration functionality to provide access to several penalized regression methods. Methods included are the LASSO (least absolute shrinkage and selection operator), LARS (least angle regression), Forward Stagewise, and the Elastic Net. An R install and the “lars” and “elasticnet” R packages are required to use this add-in.
Submitted by: Clay Barker | Initiative: All
Application: Add-Ins | Analysis: Regression
JMP Add-in: Univariate Nonparametric Bootstrapping
This script performs simple univariate, nonparametric bootstrap sampling by using the JMP to R Project integration. A JMP Dialog is built by the script where the variable you wish to perform bootstrapping over can be specified. A statistic to compute for each bootstrap sample is chosen and the data are sent to R using new JSL functionality available in JMP 9. The boot package in R is used to call the boot() function and the boot.ci() function to calculate the sample statistic for each bootstrap sample and the basic bootstrap confidence interval. The results are brought back to JMP and displayed using the JMP Distribution platform.
Submitted by: Kelci Miclaus | Initiative: All
Application: Add-Ins | Analysis: Basic Statistics
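The bootstrap procedure described above (resample with replacement, recompute the statistic, then form the basic bootstrap confidence interval) can be sketched in a few lines of Python. This is only an illustration of the idea, not the add-in's JSL or R code. The function name, the simple order-statistic quantile rule and the sample values are assumptions, and R's `boot.ci()` interpolates quantiles somewhat differently.

```python
import random
import statistics


def bootstrap_basic_ci(data, statistic=statistics.mean,
                       n_resamples=2000, alpha=0.05, seed=42):
    """Return (observed statistic, (lower, upper)) using the basic bootstrap."""
    rng = random.Random(seed)
    n = len(data)
    observed = statistic(data)
    # Resample with replacement and recompute the statistic each time.
    replicates = sorted(
        statistic(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    # Simple order-statistic quantiles of the bootstrap distribution.
    q_low = replicates[int((alpha / 2) * (n_resamples - 1))]
    q_high = replicates[int((1 - alpha / 2) * (n_resamples - 1))]
    # Basic interval: reflect the quantiles around the observed value.
    return observed, (2 * observed - q_high, 2 * observed - q_low)


# Made-up sample, for illustration only.
sample = [4.1, 5.6, 3.8, 6.2, 5.0, 4.7, 5.9, 4.4, 5.3, 6.8]
estimate, (lower, upper) = bootstrap_basic_ci(sample)
```

Passing `statistic=statistics.median` would bootstrap the median instead. The fixed seed is only there to make the sketch reproducible.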
Revolution Analytics Webinar
CTO, Revolution Analytics
Tuesday, December 20th, 11:00AM – 11:30AM Pacific
Traditional IT infrastructure is simply unable to meet the demands of the new “Big Data Analytics” landscape. Many enterprises are turning to the “R” statistical programming language and Hadoop (both open source projects) as a potential solution. This webinar will introduce the statistical capabilities of R within the Hadoop ecosystem. We’ll cover:
- An introduction to new packages developed by Revolution Analytics to facilitate interaction with the data stores HDFS and HBase so that they can be leveraged from the R environment
- An overview of how to write Map Reduce jobs in R using Hadoop
- Special considerations that need to be made when working with R and Hadoop.
We’ll also provide additional resources that are available to people interested in integrating R and Hadoop.
Wed, Dec 14th, 11:00AM – 11:30AM PT
Revolution R Enterprise – 100% R and More
R users already know why the R language is the lingua franca of statisticians today: because it’s the most powerful statistical language in the world. Revolution Analytics builds on the power of open source R, and adds performance, productivity and integration features to create Revolution R Enterprise. In this webinar, author and blogger David Smith will introduce the additional capabilities of Revolution R Enterprise.
One of the seminal papers establishing the importance of data visualization (as it is now called) was the 1973 paper by F. J. Anscombe, available at http://www.sjsu.edu/faculty/gerstman/StatPrimer/anscombe1973.pdf
It has probably the most elegant introduction to an advanced statistical analysis paper that I have ever seen:
1. Usefulness of graphs
Most textbooks on statistical methods, and most statistical computer programs, pay too little attention to graphs. Few of us escape being indoctrinated with these notions:
(1) numerical calculations are exact, but graphs are rough;
(2) for any particular kind of statistical data there is just one set of calculations constituting a correct statistical analysis;
(3) performing intricate calculations is virtuous, whereas actually looking at the data is cheating.
A computer should make both calculations and graphs. Both sorts of output should be studied; each will contribute to understanding.
Of course, the dataset makes things very interesting for people who don't like graphical analysis too much.
The x values are the same for the first three datasets.
For all four datasets:
Property | Value
Mean of x in each case | 9 (exact)
Variance of x in each case | 11 (exact)
Mean of y in each case | 7.50 (to 2 decimal places)
Variance of y in each case | 4.122 or 4.127 (to 3 d.p.)
Correlation between x and y in each case | 0.816 (to 3 d.p.)
Linear regression line in each case | y = 3.00 + 0.500x (to 2 d.p. and 3 d.p. resp.)
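The summary figures above are easy to check for yourself. The sketch below recomputes them in plain Python. The data values are not part of this post: they are transcribed from Anscombe's 1973 paper as it is commonly reproduced, and the function and variable names are my own.

```python
import math

# Anscombe's quartet; sets I to III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Means, sample variances, correlation and least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "var_x": sxx / (n - 1),
        "mean_y": mean_y,
        "var_y": syy / (n - 1),
        "corr": sxy / math.sqrt(sxx * syy),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, stats in summaries.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

The numbers come out nearly identical across the four sets, which is exactly why plotting each set is the only way to see how different they really are.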
Here is an interview with Zach Goldberg, product manager of the Google Prediction API, a next-generation, cloud-based machine learning service that offers analytics as an API, with model building available from a browser app.
Ajay- Describe your journey in science and technology from high school to your current job at Google.
Zach- First, thanks so much for the opportunity to do this interview Ajay! My personal journey started in college where I worked at a startup named Invite Media. From there I transferred to the Associate Product Manager (APM) program at Google. The APM program is a two year rotational program. I did my first year working in display advertising. After that I rotated to work on the Prediction API.
Ajay- How does the Google Prediction API help an average business analytics customer who is already using enterprise software and servers to generate business forecasts? How does the Google Prediction API fit in with or complement other APIs in the Google API suite?
Zach- The Google Prediction API is a cloud based machine learning API. We offer the ability for anybody to sign up and within a few minutes have their data uploaded to the cloud, a model built and an API to make predictions from anywhere. Traditionally the task of implementing predictive analytics inside an application required a fair amount of domain knowledge; you had to know a fair bit about machine learning to make it work. With the Google Prediction API you only need to know how to use an online REST API to get started.
Ajay- What are the additional use cases of the Google Prediction API that you think traditional enterprise software in business analytics ignores, or is not so strong on? For what use cases would you suggest an enterprise NOT use the Google Prediction API?
Zach- We are living in a world that is changing rapidly thanks to technology. Storing, accessing, and managing information is much easier and more affordable than it was even a few years ago. That creates exciting opportunities for companies, and we hope the Prediction API will help them derive value from their data.
The Prediction API focuses on providing predictive solutions to two types of problems: regression and classification. Businesses facing problems where there is sufficient data to describe an underlying pattern in either of these two areas can expect to derive value from using the Prediction API.
Ajay- What are your separate incentives for teaching Google APIs to academics or researchers in universities globally?
Zach- I’d refer you to our university relations page:
Google thrives on academic curiosity. While we do significant in-house research and engineering, we also maintain strong relations with leading academic institutions world-wide pursuing research in areas of common interest. As part of our mission to build the most advanced and usable methods for information access, we support university research, technological innovation and the teaching and learning experience through a variety of programs.
Ajay- What is the biggest challenge you face while communicating about the Google Prediction API to traditional users of enterprise software?
Zach- Businesses often expect that implementing predictive analytics is going to be very expensive and require a lot of resources. Many have already begun investing heavily in this area. Quite often we’re faced with surprise, and even skepticism, when they see the simplicity of the Google Prediction API. We work really hard to provide a very powerful solution and take care of the complexity of building high quality models behind the scenes so businesses can focus more on building their business and less on machine learning.
Webinar: Using R within Oracle — Nov 30, noon EST
— URL for TechCast: https://stbeehive.oracle.com/bconf/confDetails?confID=334B:3BF0:owch:38893C00F42F38A1E0404498C8A6612B0004075AECF7&guest=true&confKey=608880
— Web Conference ID: 303397
— Web Conference Key: 608880
— Dialup: 1-866-682-4770 , ID 5548204, passcode 1234
After a steady rise in the past few years, in 2010 the open source data mining software R overtook other tools to become the tool used by more data miners (43%) than any other (http://www.rexeranalytics.com/Data-Miner-Survey-Results-2010.html).
Several analytic tool vendors have added R-integration to their software. However, Oracle is the largest company to throw their weight behind R. On October 3, Oracle unveiled their integration of R: Oracle R Enterprise (http://www.oracle.com/us/corporate/features/features-oracle-r-enterprise-498732.html) as part of their Oracle Big Data Appliance announcement (http://www.oracle.com/us/corporate/press/512001).
Oracle R Enterprise allows users to perform statistical analysis with advanced visualization on data stored in Oracle Database. Oracle R Enterprise enables scalable R solutions, while facilitating production deployment of R scripts and Hadoop based solutions, as well as integration of R results with Oracle BI Publisher and OBIEE dashboards.
This TechCast introduces the various Oracle R Enterprise components and features, along with R script demonstrations that interface with Oracle Database.
TechCast presenter: Mark Hornick, Senior Manager, Oracle Advanced Analytics Development.
This TechCast is part of the ongoing TechCasts series coordinated by Oracle BIWA: The BI, Warehousing and Analytics SIG (http://www.oracleBIWA.org).