PMML Augustus

Here is a new-old open source system for building and scoring statistical models, designed to work with data sets that are too large to fit into memory.

http://code.google.com/p/augustus/

Augustus is an open source software toolkit for building and scoring statistical models. It is written in Python and its
most distinctive features are:
• Ability to be used on sets of big data; these are data sets that exceed either memory capacity or disk capacity, so
that existing solutions like R or SAS cannot be used. Augustus is also perfectly capable of handling problems
that can fit on one computer.
• PMML compliance and the ability to both:
– produce models with PMML-compliant formats (saved with extension .pmml).
– consume models from files with the PMML format.
Augustus has been tested and deployed on several operating systems. It is intended for developers who work in the
financial or insurance industry, information technology, or in the science and research communities.
Usage
Augustus produces and consumes Baseline, Cluster, Tree, and Ruleset models. Currently, it uses an event-based
approach to building Tree, Cluster and Ruleset models that is non-standard.
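
To give a feel for what working on data that does not fit into memory can look like, here is a minimal sketch in plain Python. It is not Augustus code and does not use the Augustus API; it only assumes a hypothetical CSV stream with a `value` column and reads one record at a time, keeping a running mean and standard deviation (Welford's method) instead of loading the whole data set.

```python
import csv
import io
import math


def streaming_mean_std(rows, column):
    """One-pass (Welford) mean and sample standard deviation.

    Only a few running totals are kept in memory, so `rows` can be
    any iterator, including a reader over a file far larger than RAM.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for row in rows:
        x = float(row[column])
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    std = math.sqrt(m2 / (count - 1)) if count > 1 else 0.0
    return count, mean, std


# Hypothetical toy data standing in for a very large file on disk.
SAMPLE = "value\n2\n4\n4\n4\n5\n5\n7\n9\n"
count, mean, std = streaming_mean_std(csv.DictReader(io.StringIO(SAMPLE)), "value")
print(count, mean, std)
```

In real use the `io.StringIO` object would be replaced by an open file handle, and the running statistics could feed something like a simple baseline check. That last step is left out here because it depends on the model being built.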

New to PMML?

Read on http://code.google.com/p/augustus/wiki/PMML

The Predictive Model Markup Language, or PMML, is a vendor-driven XML markup language for specifying statistical and data mining models. In other words, it is an XML language so that …
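
Since PMML is plain XML, a model file can be inspected with any XML parser. The sketch below uses only the Python standard library, not Augustus. The embedded document is a hand-written toy fragment (the field names and model name are made up, and it has not been validated against the PMML schema), but the element names `PMML`, `DataDictionary`, `DataField` and `TreeModel` are standard PMML elements.

```python
import xml.etree.ElementTree as ET

# Hand-written toy fragment for illustration only; not schema-validated.
PMML_TEXT = """<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header description="toy example"/>
  <DataDictionary numberOfFields="2">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="risk" optype="categorical" dataType="string"/>
  </DataDictionary>
  <TreeModel modelName="toy_tree" functionName="classification">
    <MiningSchema>
      <MiningField name="age"/>
      <MiningField name="risk" usageType="predicted"/>
    </MiningSchema>
    <Node score="low"><True/></Node>
  </TreeModel>
</PMML>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tags."""
    return tag.rsplit("}", 1)[-1]


def summarize_pmml(text):
    """Return the PMML version, data field names and top-level model elements."""
    root = ET.fromstring(text)
    fields = [el.get("name") for el in root.iter() if local_name(el.tag) == "DataField"]
    models = [local_name(el.tag) for el in root if local_name(el.tag).endswith("Model")]
    return {"version": root.get("version"), "fields": fields, "models": models}


summary = summarize_pmml(PMML_TEXT)
print(summary)
```

Matching on the local tag name keeps the sketch independent of the PMML version in the namespace URI, which differs between releases of the standard.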

Opera Unite- the future of cloud computing browsers

The boys (and ladies) at Opera have been busy writing code, while the rest of the coders on the cloud were issuing press releases, attending meetings or just sky diving from the cloud. Judging by the language of apps and extensions, it seems that the engineers de Vikings et Slavs were busy coding while the Anglo Saxons were busy preparing for IPOs.

I really like the complete anonymity offered by Opera and especially Opera Unite:

1) The Adblock option blocks all ads (same as other extensions)

2) The lovely Opera Unite has incredible apps for peer to peer sharing. You can create your own Spotify, host your own chat application, transfer files, and remotely manage your computer. C’est magnifique!

Some really awesome apps on Opera Unite

All these apps can turn your own desktop into a remotely managed website, so SOPA is irrelevant even if passed without any protest or with non-violent protests.

(SOPA: an acronym for STOP OBAMA or STOP A (?), since OBAMA is the one the internet really supports, and he is dependent on that goodwill for fundraising; or A is the acronym of a legendary media myth of an imaginary web based organization (imaginary as in iota).)

QUOTE

I think it would be a good idea.

Mahatma Gandhi, when asked what he thought of Western civilization

Some Ways Anonymous Could Disrupt the Internet if SOPA is passed

This is a piece of science fiction. I wrote it while reading Isaac Asimov’s advice to writers in GOLD, while on a beach in Anjuna.

1) Identify senators, lobbyists, senior executives of companies advocating for SOPA. Go for selective targeting of these people than massive Denial of Service Attacks.

This could also include election fund raising websites in the United States.

2) Create hacking tools with simple interfaces to probe commonly known software errors, to enable wider audience including the Occupy Movement students to participate in hacking. thus making hacking more democratic. What are the top 25 errors as per  http://cwe.mitre.org/cwss/

http://www.decisionstats.com/top-25-most-dangerous-software-errors/ ?

 

Easy interface tools to check vulnerabilities would be the next generation to flooding tools like HOIC, LOIC. Massive DDoS attacks make good press coverage but are not so good technically.

3) Disrupt digital payment mechanisms for selected targets (in step1) using tools developed in Step 2, and introduce random noise errors in payment transfers.

4) Help create a better secure internet by embedding Tor within Chromium with all tools for anonymity embedded for easy usage – a more secure peer to peer browser (like a mashup of Opera , tor and chromium).

or maybe embed bit torrents within a browser.

5) Disrupt media companies and cloud computing based companies like iTunes, Spotify or Google Music, just like viruses and anti-viruses disrupted the desktop model of computing. After that, offer solutions to the problems like companies of anti-virus software did for decades.

6) Hacking websites is fine fun, but hacking internet databases and massively parallel data scrapers can help disrupt some of the status quo.

This applies to databases that offer data for sale, like credit bureaus etc. Making this kind of data public will eliminate data middlemen.

7) Use cross border, cross country regulatory arbitrage for better risk control of hacker attacks.

8) recruiting among universities using easy to use hacking tools to expand the pool of dedicated hacker armies.

9) using operations like those targeting child pornography to increase political acceptability of the hacker sub culture. Refrain from overtly negative and unimaginative bad Press Relations

10) If you cant convince  them to pass SOPA, confuse them 😉 Use bots for random clicks on ads to confuse internet commerce.

 

2011 Analytics Recap

Events in the field of data that impacted us in 2011

1) Oracle unveiled plans for R Enterprise. This is one of the strongest statements of its focus on in-database analytics. Oracle also unveiled plans for a Public Cloud

2) SAS Institute released version 9.3 of SAS, a major analytics software in industry use.

3) IBM acquired many companies in analytics and high tech. Again. However, the expected benefits from Cognos-SPSS integration are yet to show a spectacular change in market share.

2011 Selected acquisitions

  • Emptoris Inc., December 2011
  • Cúram Software Ltd., December 2011
  • DemandTec, December 2011
  • Platform Computing, October 2011
  • Q1 Labs, October 2011
  • Algorithmics, September 2011
  • i2, August 2011
  • Tririga, March 2011

 

4) SAP promised a lot with SAP HANA, but again there were no major oohs and aahs in terms of market share fluctuations within analytics.

http://www.sap.com/india/news-reader/index.epx?articleID=17619

5) Amazon continued to lower prices of cloud computing and offer more options.

http://aws.amazon.com/about-aws/whats-new/2011/12/21/amazon-elastic-mapreduce-announces-support-for-cc2-8xlarge-instances/

6) Google continues to dilly-dally with its analytics and cloud based APIs. I do not expect all the APIs in the Google APIs suite to survive and be viable in the enterprise software space. This includes Google Cloud Storage, Cloud SQL, and the Prediction API at https://code.google.com/apis/console/b/0/ . Some of the location based and translation based APIs may have interesting spin-offs that could be very lucrative commercially.

7) Microsoft did... hmm, I forgot. Except for its investment in Revolution Analytics round 1 many seasons ago, very little excitement has come from MS plans in data mining. The plugins for cloud based data mining from Excel remain promising, while Azure remains a stealth mode starter.

8) Revolution Analytics promised us a GUI and didn’t deliver (not yet 🙂 ). But it did reveal much better enterprise software: Revolution R 5.0 is one of the strongest enterprise offerings in the R/stat computing space, and R’s memory handling problem is now more an issue of perception than of substance, thanks to newer advances in how it is used.

9) More conferences, more books and more news on analytics startups in 2011. Big Data analytics remained a strong buzzword. Expect more from this space including creative uses of Hadoop based infrastructure.

10) Data privacy issues continue to hamper effective analytics usage. So does the lack of rational and balanced regulation in some of the most advanced economies. We expect more regulation and better guidelines in 2012.

Interview Zach Goldberg, Google Prediction API

Here is an interview with Zach Goldberg, the product manager of the Google Prediction API, a next-generation, analytics-as-an-API machine learning service for building models in the cloud.
Ajay- Describe your journey in science and technology from high school to your current job at Google.

Zach- First, thanks so much for the opportunity to do this interview Ajay!  My personal journey started in college where I worked at a startup named Invite Media.   From there I transferred to the Associate Product Manager (APM) program at Google.  The APM program is a two year rotational program.  I did my first year working in display advertising.  After that I rotated to work on the Prediction API.

Ajay- How does the Google Prediction API help an average business analytics customer who is already using enterprise software and servers to generate business forecasts? How does the Google Prediction API fit in with or complement other APIs in the Google API suite?

Zach- The Google Prediction API is a cloud based machine learning API.  We offer the ability for anybody to sign up and within a few minutes have their data uploaded to the cloud, a model built and an API to make predictions from anywhere. Traditionally the task of implementing predictive analytics inside an application required a fair amount of domain knowledge; you had to know a fair bit about machine learning to make it work.  With the Google Prediction API you only need to know how to use an online REST API to get started.

You can learn more about how we help businesses by watching our video and going to our project website.

Ajay- What are the additional use cases of the Google Prediction API that you think traditional enterprise software in business analytics ignores, or is not so strong on? For which use cases would you suggest an enterprise NOT use the Google Prediction API?

Zach- We are living in a world that is changing rapidly thanks to technology.  Storing, accessing, and managing information is much easier and more affordable than it was even a few years ago.  That creates exciting opportunities for companies, and we hope the Prediction API will help them derive value from their data.

The Prediction API focuses on providing predictive solutions to two types of problems: regression and classification. Businesses facing problems where there is sufficient data to describe an underlying pattern in either of these two areas can expect to derive value from using the Prediction API.

Ajay- What are your separate incentives for teaching academics or researchers in universities globally about Google APIs?

Zach- I’d refer you to our university relations page:

Google thrives on academic curiosity. While we do significant in-house research and engineering, we also maintain strong relations with leading academic institutions world-wide pursuing research in areas of common interest. As part of our mission to build the most advanced and usable methods for information access, we support university research, technological innovation and the teaching and learning experience through a variety of programs.

Ajay- What is the biggest challenge you face while communicating about the Google Prediction API to traditional users of enterprise software?

Zach- Businesses often expect that implementing predictive analytics is going to be very expensive and require a lot of resources.  Many have already begun investing heavily in this area.  Quite often we’re faced with surprise, and even skepticism, when they see the simplicity of the Google Prediction API.  We work really hard to provide a very powerful solution and take care of the complexity of building high quality models behind the scenes so businesses can focus more on building their business and less on machine learning.

 

 

Amazon CC2 – The Big Cloud is finally here

Finally, a powerful enough cloud computing instance from Amazon EC2: it is called CC2 and is priced at $3 per hour (for Windows instances) and $2.40 per hour for Linux.

It would be interesting to see how SAS, IBM SPSS or R can leverage these.

Storage – On the storage front, the CC2 instance type is packed with 60.5 GB of RAM and 3.37 TB of instance storage.

Processing – The CC2 instance type includes 2 Intel Xeon processors, each with 8 hardware cores. We’ve enabled Hyper-Threading, allowing each core to process a pair of instruction streams in parallel. Net-net, there are 32 hardware execution threads and you can expect 88 EC2 Compute Units (ECU’s) from this 64-bit instance type

On a somewhat smaller scale, you can launch your own array of 290 CC2 instances and create a Top500 supercomputer (63.7 teraFLOPS) at a cost of less than $1000 per hour
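
The numbers quoted above can be sanity-checked with a few lines of Python. This is only back-of-the-envelope arithmetic on the figures in this post. It assumes all 290 instances are billed at the $2.40 per hour Linux price, and the variable names are mine, not Amazon's.

```python
# Figures quoted in the post; names are illustrative only.
SOCKETS = 2                       # Intel Xeon processors per instance
CORES_PER_SOCKET = 8
THREADS_PER_CORE = 2              # Hyper-Threading enabled
LINUX_PRICE_CENTS_PER_HOUR = 240  # assumed: $2.40 per hour, kept in cents
CLUSTER_INSTANCES = 290
CLUSTER_TERAFLOPS = 63.7

# Hardware execution threads per CC2 instance.
hardware_threads = SOCKETS * CORES_PER_SOCKET * THREADS_PER_CORE

# Hourly cost of the 290-instance cluster (integer cents avoid float noise).
cluster_cost_cents = CLUSTER_INSTANCES * LINUX_PRICE_CENTS_PER_HOUR
cluster_cost_dollars = cluster_cost_cents / 100

# Rough per-instance share of the quoted cluster performance.
teraflops_per_instance = CLUSTER_TERAFLOPS / CLUSTER_INSTANCES

print(hardware_threads)
print(cluster_cost_dollars)
print(round(teraflops_per_instance, 3))
```

Under that pricing assumption the cluster works out to 32 threads per instance and $696 per hour, which squares with the "less than $1000 per hour" claim.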

http://aws.typepad.com/aws/2011/11/next-generation-cluster-computing-on-amazon-ec2-the-cc2-instance-type.html

 

 

and

http://aws.amazon.com/hpc-applications/

 

 

Cluster Compute Eight Extra Large specifications:
88 EC2 Compute Units (Eight-core 2 x Intel Xeon)
60.5 GB of memory
3370 GB of instance storage
64-bit platform
I/O Performance: Very High (10 Gigabit Ethernet)
API name: cc2.8xlarge
Price: Starting from $2.40 per hour

But some caveats:

  • The instances are available in a single Availability Zone in the US East (Northern Virginia) Region. We plan to add capacity in other EC2 Regions throughout 2012.
  • You can run 2 CC2 instances by default.
  • You cannot currently launch instances of this type within a Virtual Private Cloud (VPC).

PiCloud gives away 20 free compute hours PER month

Announcement from PiCloud (and this is apart from the 5 free hours that a beginner account gets):

http://www.picloud.com/

 

Starting this month, all users will get 20 c1 core hours worth of credits each and every month.

 

  • If you ran out of your original 5 core hour credits, you can come back and play around some more!
  • If you have minimal computing needs, this means that you can now use PiCloud regularly without even having to enter a credit card.

 

Looking for more? Don’t forget, we’re giving away $500 worth of credits as part of our Academic Research Program. Applications are due this Thursday, October 27th