Software Review- BigML.com – Machine Learning meets the Cloud

I had a chance to dekko the new startup BigML https://bigml.com/ and was suitably impressed by the briefing and my own puttering around the site. Here is my review-

1) The website is very intutively designed- You can create a dataset from an uploaded file in one click and you can create a Decision Tree model in one click as well. I wish other cloud computing websites like  Google Prediction API make design so intutive and easy to understand. Also unlike Google Prediction API, the models are not black box models, but have a description which can be understood.

2) It includes some well known data sources for people trying it out. They were kind enough to offer 5 invite codes for readers of Decisionstats ( if you want to check it yourself, use the codes below the post, note they are one time only , so the first five get the invites.

BigML is still invite only but plan to get into open release soon.

3) Data Sources can only be by uploading files (csv) but they plan to change this hopefully to get data from buckets (s3? or Google?) and from URLs.

4) The one click operation to convert data source into a dataset shows a histogram (distribution) of individual variables.The back end is clojure , because the team explained it made the easiest sense and fit with Java. The good news (?) is you would never see the clojure code at the back end. You can read about it from http://clojure.org/

As cloud computing takes off (someday) I expect clojure popularity to take off as well.

Clojure is a dynamic programming language that targets the Java Virtual Machine (and the CLR, and JavaScript). It is designed to be a general-purpose language, combining the approachability and interactive development of a scripting language with an efficient and robust infrastructure for multithreaded programming. Clojure is a compiled language – it compiles directly to JVM bytecode, yet remains completely dynamic. Every feature supported by Clojure is supported at runtime. Clojure provides easy access to the Java frameworks, with optional type hints and type inference, to ensure that calls to Java can avoid reflection.

Clojure is a dialect of Lisp

 

5) As of now decision trees is the only distributed algol, but they expect to roll out other machine learning stuff soon. Hopefully this includes regression (as logit and linear) and k means clustering. The trees are created and pruned in real time which gives a slightly animated (and impressive effect). and yes model building is an one click operation.

The real time -live pruning is really impressive and I wonder why /how it can ever be replicated in other software based on desktop, because of the sheer interactive nature.

 

Making the model is just half the work. Creating predictions and scoring the model is what is really the money-earner. It is one click and customization is quite intuitive. It is not quite PMML compliant yet so I hope some Zemanta like functionality can be added so huge amounts of models can be applied to predictions or score data in real time.

 

If you are a developer/data hacker, you should check out this section too- it is quite impressive that the designers of BigML have planned for API access so early.

https://bigml.com/developers

BigML.io gives you:

  • Secure programmatic access to all your BigML resources.
  • Fully white-box access to your datasets and models.
  • Asynchronous creation of datasets and models.
  • Near real-time predictions.

 

Note: For your convenience, some of the snippets below include your real username and API key.

Please keep them secret.

REST API

BigML.io conforms to the design principles of Representational State Transfer (REST)BigML.io is enterely HTTP-based.

BigML.io gives you access to four basic resources: SourceDatasetModel and Prediction. You cancreatereadupdate, and delete resources using the respective standard HTTP methods: POSTGET,PUT and DELETE.

All communication with BigML.io is JSON formatted except for source creation. Source creation is handled with a HTTP PUT using the “multipart/form-data” content-type

HTTPS

All access to BigML.io must be performed over HTTPS

and https://bigml.com/developers/quick_start ( In think an R package which uses JSON ,RCurl  would further help in enhancing ease of usage).

 

Summary-

Overall a welcome addition to make software in the real of cloud computing and statistical computation/business analytics both easy to use and easy to deploy with fail safe mechanisms built in.

Check out https://bigml.com/ for yourself to see.

The invite codes are here -one time use only- first five get the invites- so click and try your luck, machine learning on the cloud.

If you dont get an invite (or it is already used, just leave your email there and wait a couple of days to get approval)

  1. https://bigml.com/accounts/register/?code=E1FE7
  2. https://bigml.com/accounts/register/?code=09991
  3. https://bigml.com/accounts/register/?code=5367D
  4. https://bigml.com/accounts/register/?code=76EEF
  5. https://bigml.com/accounts/register/?code=742FD

JMP 10 released

JMP , the visual data exploration, statistical quality control software from SAS Institute launched version 10 of its software today.

Source-http://jmp.com/about/events/webcasts/jmp_webcast.shtml?name=jmp10

JMP 10 includes:

Numerous enhancements to the drag-and-drop Graph Builder, including a new iPad application.

A cutting-edge Control Chart Builder to create process control charts with drag-and-drop ease.

New reliability capabilities, including growth and forecast models.

Additions and improvements for sorting and filtering data, design of experiments, statistical modeling, scripting, add-in and application development, script debugging and more.

From JohnSall’s blog post at http://blogs.sas.com/content/jmp/2012/03/20/discover-more-with-jmp-10/

Much of the development centered on four focus areas:

1. Graph Builder everywhere. The Graph Builder platform itself has new features like Heatmap and Treemap, an elements palette and properties panel, making the choices more visible. But Graph Builder also has some descendents now, including the new Control Chart Builder, which makes creating control charts an interactive process. In addition, some of the drag-and-drop features that are used to change columns in Graph Builder are also available in Distribution, Fit Y by X, and a few other places. Finally, Graph Builder has been ported to the iPad. For the first time, you can use JMP for exploration and presentation on a mobile device for free. So just think of Graph Builder as gradually taking over in lots of places.

2. Expert-driven design.reliability, measurement systems, and partial least squares analyses.

3. Performance.  this release has the most new multithreading so far

4. Application development

You can read more here –http://jmp.com/about/events/webcasts/jmpwebcast_detail.shtml?reglink=70130000001r9IP

Radoop 0.3 launched- Open Source Graphical Analytics meets Big Data

What is Radoop? Quite possibly an exciting mix of analytics and big data computing

 

http://blog.radoop.eu/?p=12

What is Radoop?

Hadoop is an excellent tool for analyzing large data sets, but it lacks an easy-to-use graphical interface. RapidMiner is an excellent tool for data analytics, but its data size is limited by the memory available, and a single machine is often not enough to run the analyses on time. In this project, we combine the strengths of both projects and provide a RapidMiner extension for editing and running ETL, data analytics and machine learning processes over Hadoop.

We have closely integrated the highly optimized data analytics capabilities of Hive and Mahout, and the user-friendly interface of RapidMiner to form a powerful and easy-to-use data analytics solution for Hadoop.

 

and what’s new

http://blog.radoop.eu/?p=198

Radoop 0.3 released – fully graphical big data analytics

Today, Radoop had a major step forward with its 0.3 release. The new version of the visual big data analytics package adds full support for all major Hadoop distributions used these days: Apache Hadoop 0.20.2, 0.20.203, 1.0 and Cloudera’s Distribution including Apache Hadoop 3 (CDH3). It also adds support for large clusters by allowing the namenode, the jobtracker and the Hive server to reside on different nodes.

As Radoop’s promise is to make big data analytics easier, the 0.3 release is also focused on improving the user interface. It has an enhanced breakpointing system which allows to investigate intermediate results, and it adds dozens of quick fixes, so common process design mistakes get much easier to solve.

There are many further improvements and fixes, so please consult the release notes for more details. Radoop is in private beta mode, but heading towards a public release in Q2 2012. If you would like to get early access, then please apply at the signup page or describe your use case in email (beta at radoop.eu).

Radoop 0.3 (15 February 2012)

  • Support for Apache Hadoop 0.20.2, 0.20.203, 1.0 and Cloudera’s Distribution Including Apache Hadoop 3 (CDH3) in a single release
  • Support for clusters with separate master nodes (namenode, jobtracker, Hive server)
  • Enhanced breakpointing to evaluate intermediate results
  • Dozens of quick fixes for the most common process design errors
  • Improved process design and error reporting
  • New welcome perspective to help in the first steps
  • Many bugfixes and performance improvements

Radoop 0.2.2 (6 December 2011)

  • More Aggregate functions and distinct option
  • Generate ID operator for convenience
  • Numerous bug fixes and improvements
  • Improved user interface

Radoop 0.2.1 (16 September 2011)

  • Set Role and Data Multiplier operators
  • Management panel for testing Hadoop connections
  • Stability improvements for Hive access
  • Further small bugfixes and improvements

Radoop 0.2 (26 July 2011)

  • Three new algoritms: Fuzzy K-Means, Canopy, and Dirichlet clustering
  • Three new data preprocessing operators: Normalize, Replace, and Replace Missing Values
  • Significant speed improvements in data transmission and interactive analytics
  • Increased stability and speedup for K-Means
  • More flexible settings for Join operations
  • More meaningful error messages
  • Other small bugfixes and improvements

Radoop 0.1 (14 June 2011)

Initial release with 26 operators for data transmission, data preprocessing, and one clustering algorithm.

Note that Rapid Miner also has a great R extension so you can use R, a graphical interface and big data analytics is now easier and more powerful than ever.


SAS Institute Financials 2011

SAS Institute has release it’s financials for 2011 at http://www.sas.com/news/preleases/2011financials.html,

Revenue surged across all solution and industry categories. Software to detect fraud saw a triple-digit jump. Revenue from on-demand solutions grew almost 50 percent. Growth from analytics and information management solutions were double digit, as were gains from customer intelligence, retail, risk and supply chain solutions

AJAY- and as a private company it is quite nice that they are willing to share so much information every year.

The graphics are nice ( and the colors much better than in 2010) , but pie-charts- seriously dude there is no way to compare how much SAS revenue is shifting across geographies or even across industries. So my two cents is – lose the pie charts, and stick to line graphs please for the share of revenue by country /industry.

In 2011, SAS grew staff 9.2 percent and reinvested 24 percent of revenue into research and development

AJAY- So that means 654 million dollars spent in Research and Development.  I wonder if SAS has considered investing in much smaller startups (than it’s traditional strategy of doing all research in-house and completely acquiring a smaller company)

Even a small investment of say 5-10 million USD in open source , or even Phd level research projects could greatly increase the ROI on that.

That means

Analyzing a private company’s financials are much more fun than a public company, and I remember the words of my finance professor ( “dig , dig”) to compare 2011 results with 2010 results.

http://www.sas.com/news/preleases/2010financials.html

The percentage invested in R and D is exactly the same (24%) and the percentages of revenue earned from each geography is exactly the same . So even though revenue growth increased from 5.2 % to 9% in 2011, both the geographic spread of revenues and share  R&D costs remained EXACTLY the same.

The Americas accounted for 46 percent of total revenue; Europe, Middle East and Africa (EMEA) 42 percent; and Asia Pacific 12 percent.

Overall, I think SAS remains a 35% market share (despite all that noise from IBM, SAS clones, open source) because they are good at providing solutions customized for industries (instead of just software products), the market for analytics is not saturated (it seems to be growing faster than 12% or is it) , and its ability to attract and retain the best analytical talent (which in a non -American tradition for a software company means no stock options, job security, and great benefits- SAS remains almost Japanese in HR practices).

In 2010, SAS grew staff by 2.4 percent, in 2011 SAS grew staff by 9 percent.

But I liked the directional statement made here-and I think that design interfaces, algorithmic and computational efficiencies should increase analytical time, time to think on business and reduce data management time further!

“What would you do with the extra time if your code ran in two minutes instead of five hours?” Goodnight challenged.

WordPress 3.3 Sonny

WordPress 3.3 Sonny is out.

http://wordpress.org/news/2011/12/sonny/

Eat your heart out, blogspotters!!

http://googlesystem.blogspot.com/2011/07/try-bloggers-new-interface.html

 

On the internet- look and feel and design are the new old new buzzwords again!.

Free online education by Stanford and MIT

One more reason American education is the best in the world- it has a big heart.

Stanford just announced free courses starting from Jan 2012- and they are online (so no visa blues) and free( as in speech and free as in beer) and just the same as actual courses (yes , the homework will have to be done, and the dog cannot eat the homework)

http://www.venture-class.org/

MIT meanwhile has 2000 courses  at http://ocw.mit.edu/courses/

 

– but I liked Stanford’s minimal , clutter free interface ( I read Steve Jobs biography- the interface hangover continues).

Hurrah for Stanford!

MIT needs to DESIGN  their free online courses website and maybe do more search engine optimization at

http://ocw.mit.edu/courses/.

 

Preview- Google Cloud SQL

From –http://code.google.com/apis/sql/

What is Google Cloud SQL?

Google Cloud SQL is web service that allows you to create, configure, and use relational databases with your App Engine applications. It is a fully-managed service that maintains, manages, and administers your databases, allowing you to focus on your applications and services.

By offering the capabilities of a MySQL database, the service enables you to easily move your data, applications, and services into and out of the cloud. This allows for high data portability and helps in faster time-to-market because you can quickly leverage your existing database (using JDBC and/or DB-API) in your App Engine application.

Here is where you can get an invite to the beta only Google Cloud SQL

Sign up for Limited Preview

Google Cloud SQL is available to a limited number of users. To sign up for the service:

  1. Visit the Google APIs Console. The console opens the All services pane.
  2. Find the SQL Service line in the Services table and click Request access…
  3. Fill out the enrollment form.
  4. Our team will review your enrollment information and respond by email to the address associated with your Google Account.
  5. Follow the link in the email to view the Terms of Service. Please read these carefully before accepting.
  6. Sign up for the google-cloud-sql-announce group to receive important announcements and product news. (NOTE- Members: 384)
and after all that violence and double talk, a walk in the clouds with SQL.
1. There are three kinds of instances in the beta view
2. Wait for the Instance to be created note- the Design of the Interface uptil now is much better than Amazon’s.  
Note you need to have an appspot application from Google Apps and can choose between the Python and Java versions. Quite clearly there is a play for other languages too. I think GO is also supported.
3. You can import your data from your Google Storage bucket
4. I am not that hot at coding or maybe the interface was too pretty. Anyways- the log tells me that import of the text file has failed from Google Storage to Google Cloud SQL 
5. Incidentally the Google Cloud Storage interface is also much better than the Amazon GUI for transferring data- Note I was using the classical statistical dataset Boston Housing Data as the test case. 
6. The SQL prompt is the weakest part of the design process of the Interphase. There is no Query builder and the SELECT FROM WHERE prompt is slightly amusing/ insulting . I mean guys either throw in a fully fledged GUI for query builder similar to the MYSQL Workbench , than create a pretty white command prompt.
7. You can also export your data back to your Google Storage bucket 
These are early days, and I am trying to see if there is a play for some cloud kind of ODBC action between R, Prediction API , and the cloud SQL… so try it out yourself at http://code.google.com/apis/sql/ and see if there is any juice you can build  here.