Interview BigML.com

Here is an interview with Charlie Parker, head of large scale online algorithms at http://bigml.com

Ajay-  Describe your own personal background in scientific computing, and how you came to be involved with machine learning, cloud computing and BigML.com

Charlie- I am a machine learning Ph.D. from Oregon State University. Francisco Martin (our founder and CEO), Adam Ashenfelter (the lead developer on the tree algorithm), and I were all studying machine learning at OSU around the same time. We all went our separate ways after that.

Francisco started Strands and turned it into a 100+ million dollar company building recommender systems. Adam worked for CleverSet, a probabilistic modeling company that was eventually sold to Cisco, I believe. I worked for several years in the research labs at Eastman Kodak on data mining, text analysis, and computer vision.

When Francisco left Strands to start BigML, he brought in Justin Donaldson, who is a brilliant visualization guy from Indiana, and an ex-Googler named Jose Ortega, who is responsible for most of our data infrastructure. They pulled in Adam and me a few months later. We also have Poul Petersen, a former Strands employee, who manages our herd of servers. He is a wizard and makes everyone else’s life much easier.

Ajay- You use Clojure for the back end of BigML.com. Are there any other languages and packages you are considering? What makes Clojure such a good fit for cloud computing?

Charlie- Clojure is a great language because it offers you all of the benefits of Java (extensive libraries, cross-platform compatibility, easy integration with things like Hadoop, etc.) but has the syntactical elegance of a functional language. This makes our code base small and easy to read as well as powerful.

We’ve had occasional issues with speed, but that just means writing the occasional function or library in Java. As we build towards processing data at the Terabyte level, we’re hoping to create a framework that is language-agnostic to some extent. So if we have some great machine learning code in C, for example, we’ll use Clojure to tie everything together, but the code that does the heavy lifting will still be in C. For the API and Web layers, we use Python and Django, and Justin is a huge fan of HaXe for our visualizations.

Ajay- Current support is for Decision Trees. When can we see SVM, K-Means Clustering and Logit Regression?

Charlie- Right now we’re focused on perfecting our infrastructure and giving you new ways to put data in the system, but expect to see more algorithms appearing in the next few months. We want to make sure they are as beautiful and easy to use as the trees are. Without giving too much away, the first new thing we will probably introduce is an ensemble method of some sort (such as Boosting or Bagging). Clustering is a little further away but we’ll get there soon!

Ajay- How can we use the BigML.com API from R and Python?

Charlie- We have a public GitHub repo for the language bindings. https://github.com/bigmlcom/io Right now, there are only bash scripts but that should change very soon. The Python bindings should be there in a matter of days, and the R bindings in probably a week or two. Clojure and Java bindings should follow shortly after that. We’ll have a blog post about it each time we release a new language binding. http://blog.bigml.com/

Ajay-  How can we predict large numbers of observations using a Model  that has been built and pruned (model scoring)?

Charlie- We are in the process of refactoring our backend right now for better support for batch prediction and model evaluation. This is something that is probably only a few weeks away. Keep your eye on our blog for updates!

Ajay- How can we export models built in BigML.com for scoring data locally?

Charlie- This is as simple as a call to our API. https://bigml.com/developers/models The call gives you a JSON object representing the tree that is roughly equivalent to a PMML-style representation.
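(Editor's sketch: to make the idea of local scoring concrete, here is a minimal Python example that walks a JSON tree to score several rows. The tree layout, with its "field", "threshold", "left", "right" and "output" keys, is a made-up illustration and is not BigML's actual JSON schema; the sample rows are invented too.)

```python
import json

# Hypothetical tree layout (NOT BigML's real schema): internal nodes carry
# "field", "threshold", "left" and "right"; leaves carry "output".
MODEL_JSON = """
{
  "field": "petal_length", "threshold": 2.5,
  "left": {"output": "setosa"},
  "right": {
    "field": "petal_width", "threshold": 1.7,
    "left": {"output": "versicolor"},
    "right": {"output": "virginica"}
  }
}
"""


def predict(node, row):
    """Walk from the root to a leaf and return the leaf's output."""
    while "output" not in node:
        # Go left when the row's value is at or below the split threshold.
        branch = "left" if row[node["field"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["output"]


tree = json.loads(MODEL_JSON)

# Invented sample rows, scored in one batch.
rows = [
    {"petal_length": 1.4, "petal_width": 0.2},
    {"petal_length": 4.7, "petal_width": 1.4},
    {"petal_length": 6.0, "petal_width": 2.5},
]
predictions = [predict(tree, row) for row in rows]
print(predictions)
```

Once the tree is an ordinary dictionary, scoring a large file is just a loop over its rows, with no further API calls.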

About-

You can read about Charlie Parker at http://www.linkedin.com/pub/charles-parker/11/85b/4b5 and the rest of the BigML team at

https://bigml.com/team

 

Interview Kelci Miclaus, SAS Institute Using #rstats with JMP

Here is an interview with Kelci Miclaus, a researcher working with the JMP division of the SAS Institute, in which she demonstrates examples of how the R programming language is a great hit with JMP customers who like to be flexible.

 

Ajay- How has JMP been using integration with R? What has been the feedback from customers so far? Is there a single case study you can point out where the combination of JMP and R was better than any one of them alone?

Kelci- Feedback from customers has been very positive. Some customers are using JMP to foster collaboration between SAS and R modelers within their organizations. Many are using JMP’s interactive visualization to complement their use of R. Many SAS and JMP users are using JMP’s integration with R to experiment with more bleeding-edge methods not yet available in commercial software. It can be used simply to smooth the transition with regard to sending data between the two tools, or used to build complete custom applications that take advantage of both JMP and R.

One customer has been using JMP and R together for Bayesian analysis. He uses R to create MCMC chains and has found that JMP is a great tool for preparing the data for analysis, as well as displaying the results of the MCMC simulation. For example, the Control Chart platform and the Bubble Plot platform in JMP can be used to quickly verify convergence of the algorithm. The use of both tools together can increase productivity since the results of an analysis can be achieved faster than through scripting and static graphics alone.

I, along with a few other JMP developers, have written applications that use JMP scripting to call out to R packages and perform analyses like multidimensional scaling, bootstrapping, support vector machines, and modern variable selection methods. These really show the benefit of interactive visual analysis coupled with modern statistical algorithms. We’ve packaged these scripts as JMP add-ins and made them freely available on our JMP User Community file exchange. Customers can download them and now employ these methods as they would a regular JMP platform. We hope that our customers familiar with scripting will also begin to contribute their own add-ins so a wider audience can take advantage of these new tools.

(see http://www.decisionstats.com/jmp-and-r-rstats/)

Ajay- Are there plans to extend JMP integration with other languages like Python?

Kelci- We do have plans to integrate with other languages and are considering integrating with more based on customer requests. Python has certainly come up and we are looking into possibilities there.

Ajay- How is R a complementary fit to JMP’s technical capabilities?

Kelci- R has an incredible breadth of capabilities. JMP has extensive interactive, dynamic visualization intrinsic to its largely visual analysis paradigm, in addition to a strong core of statistical platforms. Since our brains are designed to visually process pictures and animated graphs more efficiently than numbers and text, this environment is all about supporting faster discovery. Of course, JMP also has a scripting language (JSL) that lets you incorporate SAS code and R code, and build analytical applications so that users who don’t code, or who don’t want to code, can leverage SAS, R and other applications.

JSL is a powerful scripting language on its own. It can be used for dialog creation, automation of JMP statistical platforms, and custom graphic scripting. In other ways, JSL is very similar to the R language. It can also be used for data and matrix manipulation and to create new analysis functions. With the scripting capabilities of JMP, you can create custom applications that provide both a user interface and an interactive visual back-end to R functionality. Alternatively, you could create a dashboard using statistical and/or graphical platforms in JMP to explore the data and with the click of a button, send a portion of the data to R for further analysis.

Another JMP feature that complements R is the add-in architecture, which is similar to how R packages work. If you’ve written a cool script or analysis workflow, you can package it into a JMP add-in file and send it to your colleagues so they can easily use it.

Ajay- What is the official view on R from your organization? Do you think it is a threat, or a complementary product or another statistical platform that coexists with your offerings?

Kelci- Most definitely, we view R as complementary. R contributors are providing a tremendous service to practitioners, allowing them to try a wide variety of methods in the pursuit of more insight and better results. The R community as a whole is playing a valued role in the greater analytical community by focusing attention on newer methods that hold the most promise in so many application areas. Data analysts should be encouraged to use the tools available to them in order to drive discovery and JMP can help with that by providing an analytic hub that supports both SAS and R integration.

Ajay- While you do use R, are there any plans to give back something to the R community in terms of your involvement and participation (say at useR events) or sponsoring contests?

Kelci- We are certainly open to participating in useR groups. At Predictive Analytics World in NY last October, they didn’t have a local useR group, but they did have a Predictive Analytics Meet-up group made up of many R users. We were happy to sponsor this. Some of us within the JMP division have joined local R user groups, myself included. Given that some local R user groups have entertained topics like Excel and R, Python and R, databases and R, we would be happy to participate more fully here. I also hope to attend the useR! annual meeting later this year to gain more insight on how we can continue to provide tools to help both the JMP and R communities with their work.

We are also exploring options to sponsor contests and would invite participants to use their favorite tools, languages, etc. in pursuit of the best model. Statistics is about learning from data and this is how we make the world a better place.

About- Kelci Miclaus

Kelci is a research statistician developer for JMP Life Sciences at SAS Institute. She has a PhD in Statistics from North Carolina State University and has been using SAS products and R for several years. In addition to research interests in statistical genetics, clinical trials analysis, and multivariate analysis/visualization methods, Kelci works extensively with JMP, SAS, and R integration.


 

PMML Augustus

Here is a new-old open source system for building and scoring statistical models, designed to work with data sets that are too large to fit into memory.

http://code.google.com/p/augustus/

Augustus is an open source software toolkit for building and scoring statistical models. It is written in Python and its most distinctive features are:
• Ability to be used on sets of big data; these are data sets that exceed either memory capacity or disk capacity, so that existing solutions like R or SAS cannot be used. Augustus is also perfectly capable of handling problems that can fit on one computer.
• PMML compliance and the ability to both:
– produce models with PMML-compliant formats (saved with extension .pmml).
– consume models from files with the PMML format.
Augustus has been tested and deployed on several operating systems. It is intended for developers who work in the financial or insurance industry, information technology, or in the science and research communities.
Usage
Augustus produces and consumes Baseline, Cluster, Tree, and Ruleset models. Currently, it uses an event-based approach to building Tree, Cluster and Ruleset models that is non-standard.

New to PMML ?

Read on http://code.google.com/p/augustus/wiki/PMML

The Predictive Model Markup Language or PMML is a vendor driven XML markup language for specifying statistical and data mining models. In other words, it is an XML language so that…
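To give a feel for what a PMML file looks like, here is a small sketch in Python that parses a hand-written toy document with the standard library. The document is an illustrative assumption written for this example (it follows the general PMML 4.1 element layout but was not produced by Augustus), and the field names are invented.

```python
import xml.etree.ElementTree as ET

# Toy PMML document written by hand for illustration; field names are invented.
PMML_DOC = """<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header description="toy example"/>
  <DataDictionary numberOfFields="2">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="risk" optype="categorical" dataType="string"/>
  </DataDictionary>
  <TreeModel functionName="classification">
    <MiningSchema>
      <MiningField name="age"/>
      <MiningField name="risk" usageType="predicted"/>
    </MiningSchema>
    <Node score="low">
      <True/>
      <Node score="high">
        <SimplePredicate field="age" operator="greaterThan" value="60"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>
"""

# PMML elements live in a versioned XML namespace.
NS = {"pmml": "http://www.dmg.org/PMML-4_1"}

root = ET.fromstring(PMML_DOC)

# The data dictionary lists every field the model knows about.
fields = [
    (f.get("name"), f.get("dataType"))
    for f in root.findall("pmml:DataDictionary/pmml:DataField", NS)
]

# Model elements are direct children whose tag ends in "Model".
model_types = [
    child.tag.split("}")[1] for child in root if child.tag.endswith("Model")
]

print(fields)
print(model_types)
```

Because the model is plain XML, any tool that can read the file can inspect the fields and the model type without knowing which software produced it.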

App to App Porting

I often wonder why bright, intelligent software programmers go out of their way to write turgid and lengthy words in documentation, do not make step-by-step screenshots/slides for tutorials, and practically force everyone to reinvent the wheel every time they create a new platform.

Top of my wish list for 2012-

1) Better GUI  for APP CREATION-

Example: a GUI utility to create Chrome apps, something similar to the Android app creator http://www.appinventorbeta.com/about/

2)  Automated Porting or Translation-

An automated appspot app for reading in an iOS app (or iPhone app) and churning out the necessary Android app code. This is similar to translating blogs from one blogging platform to another using Python at http://code.google.com/p/google-blog-converters-appengine/

 

but the woefully underpowered http://wordpress2blogger.appspot.com/ currently allows only downloads less than 1 MB, while WordPress itself allows 15 MB export files.

3) Better interaction between cloud and desktop apps

example – (google docs and libre office)  or webcams to (google hangouts and google voice /youtube)

Are we there yet? Not appy enough !

 

 

Preview- Google Cloud SQL

From –http://code.google.com/apis/sql/

What is Google Cloud SQL?

Google Cloud SQL is a web service that allows you to create, configure, and use relational databases with your App Engine applications. It is a fully-managed service that maintains, manages, and administers your databases, allowing you to focus on your applications and services.

By offering the capabilities of a MySQL database, the service enables you to easily move your data, applications, and services into and out of the cloud. This allows for high data portability and helps in faster time-to-market because you can quickly leverage your existing database (using JDBC and/or DB-API) in your App Engine application.
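As a rough illustration of what "using DB-API" means in practice, here is a minimal Python sketch. It uses the standard library's sqlite3 module purely as a stand-in, because it follows the same DB-API 2.0 connect/cursor/execute pattern; it is not the Cloud SQL driver, and the table and rows are invented sample data.

```python
import sqlite3  # stand-in only: exposes the same DB-API 2.0 pattern

# Connect, get a cursor, run statements: the shape is shared by DB-API drivers.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE housing (town TEXT, medv REAL)")

# Invented sample rows. sqlite3 uses "?" placeholders; MySQL drivers
# typically use "%s" instead, so this line would change with the driver.
cur.executemany(
    "INSERT INTO housing (town, medv) VALUES (?, ?)",
    [("A", 24.0), ("B", 21.6), ("A", 34.7)],
)
conn.commit()

cur.execute("SELECT town, AVG(medv) FROM housing GROUP BY town ORDER BY town")
averages = cur.fetchall()
conn.close()

print(averages)
```

Code written against this pattern mostly needs a different connect call and placeholder style when moved to another DB-API driver, which is the portability argument being made above.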

Here is where you can get an invite to the beta only Google Cloud SQL

Sign up for Limited Preview

Google Cloud SQL is available to a limited number of users. To sign up for the service:

  1. Visit the Google APIs Console. The console opens the All services pane.
  2. Find the SQL Service line in the Services table and click Request access…
  3. Fill out the enrollment form.
  4. Our team will review your enrollment information and respond by email to the address associated with your Google Account.
  5. Follow the link in the email to view the Terms of Service. Please read these carefully before accepting.
  6. Sign up for the google-cloud-sql-announce group to receive important announcements and product news. (NOTE- Members: 384)
and after all that violence and double talk, a walk in the clouds with SQL.
1. There are three kinds of instances in the beta view
2. Wait for the instance to be created. Note: the design of the interface up to now is much better than Amazon’s.
Note that you need to have an appspot application from Google Apps and can choose between the Python and Java versions. Quite clearly there is a play for other languages too. I think Go is also supported.
3. You can import your data from your Google Storage bucket
4. I am not that hot at coding or maybe the interface was too pretty. Anyways- the log tells me that import of the text file has failed from Google Storage to Google Cloud SQL 
5. Incidentally the Google Cloud Storage interface is also much better than the Amazon GUI for transferring data- Note I was using the classical statistical dataset Boston Housing Data as the test case. 
6. The SQL prompt is the weakest part of the design of the interface. There is no query builder, and the SELECT FROM WHERE prompt is slightly amusing/insulting. I mean, guys, throw in a fully fledged GUI query builder similar to MySQL Workbench rather than create a pretty white command prompt.
7. You can also export your data back to your Google Storage bucket 
These are early days, and I am trying to see if there is a play for some cloud kind of ODBC action between R, Prediction API , and the cloud SQL… so try it out yourself at http://code.google.com/apis/sql/ and see if there is any juice you can build  here.

Google Cloud SQL

Another xing bang API from the boyz in Mountain View. (Entry by invite only.) But it is free and you can test your stuff on a MySQL db of up to 10 GB.

Database as a service ? (Maybe)— while Amazon was building fires (and Fire)

—————————————————————–

https://code.google.com/apis/sql/index.html

What is Google Cloud SQL?

Google Cloud SQL is a web service that provides a highly available, fully-managed, hosted SQL storage solution for your App Engine applications.

What are the benefits of using Google Cloud SQL?

You can access a familiar, highly available SQL database from your App Engine applications, without having to worry about provisioning, management, and integration with other Google services.

How much does Google Cloud SQL cost?

We will not be billing for this service in 2011. We will give you at least 30 days’ advance notice before we begin billing in the future. Other services such as Google App Engine, Google Cloud Storage etc. that you use with Google Cloud SQL may have their own payment terms, and you need to pay for them. Please consult their documentation for details.

Currently you are limited to the three instance sizes. What if I need to store more data or need better performance?

In the Limited Preview period, we only have three sizes available. If you have specific needs, we would like to hear from you on our google-cloud-sql discussion board.

When will Google Cloud SQL be out of Limited Preview?

We are working hard to make the service generally available. We don’t have a firm date that we can announce right now.

Do you support all the features of MySQL?

In general, Google Cloud SQL supports all the features of MySQL. The following are lists of all the unsupported features and notable differences that Google Cloud SQL has from MySQL.

Unsupported Features:

  • User defined functions
  • MySql replication

Unsupported MySQL statements:

  • LOAD DATA INFILE
  • SELECT ... INTO OUTFILE
  • SELECT ... INTO DUMPFILE
  • INSTALL PLUGIN .. SONAME ...
  • UNINSTALL PLUGIN
  • CREATE FUNCTION ... SONAME ...

Unsupported SQL Functions:

  • LOAD_FILE()

Notable Differences:

  • If you want to import databases with binary data into your Google Cloud SQL instance, you must use the --hex-blob option with mysqldump. Although this is not a required flag when you are using a local MySQL server instance and the MySQL command line, it is required if you want to import any databases with binary data into your Google Cloud SQL instance. For more information, see Importing Data.

How large a database can I use with Google Cloud SQL?
Currently, in this limited preview period, your database instance must be no larger than 10GB.
How can I be notified when there are any changes to Google Cloud SQL?
You can sign up for the sql-announcements forum where we post announcements and news about the Google Cloud SQL.
How can I cancel my Google Cloud SQL account?
To remove all data from your Google Cloud SQL account and disable the service:

  1. Delete all your data. You can remove your tables, databases, and indexes using the drop command. For more information, see SQL DROP statement.
  2. Deactivate Google Cloud SQL by visiting the Services pane and clicking the On button next to Google Cloud SQL. The button changes from On to Off.
How do I report a bug, request a feature, or ask a question?
You can report bugs and request a feature on our project page. You can ask a question in our discussion forum.

Getting Started

Can I use languages other than Java or Python?
Only Java and Python are supported for Google Cloud SQL.
Can I use Google Cloud SQL outside of Google App Engine?
The Limited Preview is primarily focused on giving Google App Engine customers the ability to use a familiar relational database environment. Currently, you cannot access Google Cloud SQL from outside Google App Engine.
What database engine are we using in the Google Cloud SQL?
MySql Version 5.1.59
Do I need to install a local version of MySQL to use the Development Server?
Yes.

Managing Your Instances

Do I need to use the Google APIs Console to use Google Cloud SQL?
Yes. For basic tasks like granting access control to applications, creating instances, and deleting instances, you need to use the Google APIs Console.
Can I import or export specific databases?
No, currently it is not possible to export specific databases. You can only export your entire instance.
Do I need a Google Cloud Storage account to import or export my instances?
Yes, you need to sign up for a Google Cloud Storage account or have access to a Google Cloud Storage account to import or export your instances. For more information, see Importing and Exporting Data.
If I delete my instance, can I reuse the instance name?
Yes, but not right away. The instance name is reserved for up to two months before it can be reused.

Tools & Resources

Can I use Django with Google Cloud SQL?
No, currently Google Cloud SQL is not compatible with Django.
What is the best tool to use for interacting with my instance?
There are a variety of tools available for Google Cloud SQL. For executing simple statements, you can use the SQL prompt. For executing more complicated tasks, you might want to use the command line tool. If you want to use a tool with a graphical interface, the SQuirrel SQL Client provides an interface you can use to interact with your instance.

Common Technical Questions

Should I use InnoDB for my tables?
Yes. InnoDB is the default storage engine in MySQL 5.5 and is also the recommended storage engine for Google Cloud SQL. If you do not need any features that require MyISAM, you should use InnoDB. You can convert your existing tables using the following SQL command, replacing tablename with the name of the table to convert:

ALTER TABLE tablename ENGINE = InnoDB;

If you have a mysqldump file where all your tables are in MyISAM format, you can convert them by piping the file through a sed script:

mysqldump --databases database_name [-u username -p] --hex-blob | sed 's/ENGINE=MyISAM/ENGINE=InnoDB/g' > database_file.sql

Warning: You should not do this if your mysqldump file contains the mysql schema. Those files must remain in MyISAM.
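For readers who would rather do the conversion in Python than in sed, the same substitution can be sketched as below. The function name convert_engine and the sample dump text are assumptions made up for this illustration; the replacement itself mirrors the sed expression above and carries the same warning about the mysql schema.

```python
def convert_engine(dump_sql):
    """Swap ENGINE=MyISAM for ENGINE=InnoDB, like the sed one-liner.

    Returns the converted text and the number of replacements made.
    """
    count = dump_sql.count("ENGINE=MyISAM")
    return dump_sql.replace("ENGINE=MyISAM", "ENGINE=InnoDB"), count


# Invented sample dump text, just to exercise the function.
SAMPLE_DUMP = """CREATE TABLE `a` (id INT) ENGINE=MyISAM DEFAULT CHARSET=utf8;
CREATE TABLE `b` (id INT) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `c` (id INT) ENGINE=MyISAM;
"""

converted, n_changed = convert_engine(SAMPLE_DUMP)
print(n_changed)
print(converted)
```

Returning the count makes it easy to check how many tables were switched before importing the file.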

Are there any size or QPS limits?
Yes, the following limits apply to Google Cloud SQL:

Resource                    Limits from External Requests    Limits from Google App Engine
Queries Per Second (QPS)    5 QPS                            No limit
Maximum Request Size        16 MB
Maximum Response Size       16 MB
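
A client that sends external requests could stay under the 5 QPS figure by spacing its calls out. The sketch below is one simple way to do that in Python; the Throttle class is a hypothetical helper written for this example and is not part of any Google library.

```python
import time


class Throttle:
    """Space calls at least 1/max_qps seconds apart (hypothetical helper)."""

    def __init__(self, max_qps, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_qps
        self.clock = clock      # injectable so the class can be tested
        self.sleep = sleep
        self.next_allowed = None

    def wait(self):
        """Block until the next call is allowed, then reserve a slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.clock()
        self.next_allowed = now + self.min_interval


throttle = Throttle(max_qps=5)
start = time.monotonic()
for _ in range(3):
    throttle.wait()
    # A real query would be issued here.
elapsed = time.monotonic() - start
print(f"3 calls took {elapsed:.2f}s")
```

The first call goes through immediately and each later call waits until at least 0.2 seconds have passed since the previous one, which keeps the rate at or below 5 per second.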

Google App Engine Limits

Google App Engine applications are also subject to additional Google App Engine quotas and limits. Requests from Google App Engine applications to Google Cloud SQL are subject to the following time limits:

  • All database requests must finish within the HTTP request timer, around 60 seconds.
  • Offline requests like cron tasks have a time limit of 10 minutes.
  • Backend requests to Google Cloud SQL have a time limit of 10 minutes.

App Engine-specific quotas and access limits are discussed on the Google App Engine Quotas page.

Should I use Google Cloud SQL with my non-High Replication App Engine application?
We recommend that you use Google Cloud SQL with High Replication App Engine applications. While you can use Google Cloud SQL with applications that do not use high replication, doing so might impact performance.
Source-
https://code.google.com/apis/sql/faq.html#supportmysqlfeatures

Interesting announcement from PiCloud

An interesting announcement from PiCloud, a cloud computing startup focused on Python (as the name suggests). They have basically created a cloud library (or in R lingo, a package) that lets you call on cloud computing power from the desktop interface itself. This announcement is about multiple IP addresses. Real parallel processing or just a quick trick in technical jargon? You decide!


s1 cores are comparable in performance to c1 cores with one extra trick up their sleeve: each job running in parallel will have a different IP.

Why is this important?
Using unique IPs is necessary to minimize the automated throttling most sites will impose when seeing fast, repeated access from a single IP.

How do I use it?
If you’re already using our c1 cores, all you’ll need to do is set the _type keyword.

cloud.call(func, _type='s1')

How much?
$0.04/core/hour

Why don’t other cores have individual IPs?
For other core types, such as c2, multiple cores may be running on a single machine that is assigned only a single IP address. When using s1 cores, you’re guaranteed that each core sits on a different machine.

 

http://www.picloud.com/