Churn Analytics Contest at Crowd Analytix

Crowd Analytix- the Bangalore based Indian startup is moving fast in the

data scientist contest space (so watch out Kaggle!! )

Churn (loss of customers to competition) is a problem for telecom companies because it is more expensive to acquire a new customer than to keep your existing one from leaving. This contest is about enabling churn reduction using analytics.

To join, go to –

Interview James G Kobielus IBM Big Data

Here is an interview with  James G Kobielus, who is the Senior Program Director, Product Marketing, Big Data Analytics Solutions at IBM. Special thanks to Payal Patel Cudia of IBM’s communication team,for helping with the logistics for this.

Ajay -What are the specific parts of the IBM Platform that deal with the three layers of Big Data -variety, velocity and volume

James-Well first of all, let’s talk about the IBM Information Management portfolio. Our big data platform addresses the three layers of big data to varying degrees either together in a product , or two out of the three or even one of the three aspects. We don’t have separate products for the variety, velocity and volume separately.

Let us define these three layers-Volume refers to the hundreds of terabytes and petabytes of stored data inside organizations today. Velocity refers to the whole continuum from batch to real time continuous and streaming data.

Variety refers to multi-structure data from structured to unstructured files, managed and stored in a common platform analyzed through common tooling.

For Volume-IBM has a highly scalable Big Data platform. This includes Netezza and Infosphere groups of products, and Watson-like technologies that can support petabytes volume of data for analytics. But really the support of volume ranges across IBM’s Information Management portfolio both on the database side and the advanced analytics side.

For real time Velocity, we have real time data acquisition. We have a product called IBM Infosphere, part of our Big Data platform, that is specifically built for streaming real time data acquisition and delivery through complex event processing. We have a very rich range of offerings that help clients build a Hadoop environment that can scale.

Our Hadoop platform is the most real time capable of all in the industry. We are differentiated by our sheer breadth, sophistication and functional depth and tooling integrated in our Hadoop platform. We are differentiated by our streaming offering integrated into the Hadoop platform. We also offer a great range of modeling and analysis tools, pretty much more than any other offering in the Big Data space.

Attached- Jim’s slides from Hadoop World

Ajay- Any plans for Mahout for Hadoop

Jim- I cant speak about product plans. We have plans but I cant tell you anything more. We do have a feature in Big Insights called System ML, a library for machine learning.

Ajay- How integral are acquisitions for IBM in the Big Data space (Netezza,Cognos,SPSS etc). Is it true that everything that you have in Big Data is acquired or is the famous IBM R and D contributing here . (see a partial list of IBM acquisitions at at )

Jim- We have developed a lot on our own. We have the deepest R and D of anybody in the industry in all things Big Data.

For example – Watson has Big Insights Hadoop at its core. Apache Hadoop is the heart and soul of Big Data (see ). A great deal that makes Big Insights so differentiated is that not everything that has been built has been built by the Hadoop community.

We have built additions out of the necessity for security, modeling, monitoring, and governance capabilities into BigInsights to make it truly enterprise ready. That is one example of where we have leveraged open source and we have built our own tools and technologies and layered them on top of the open source code.

Yes of course we have done many strategic acquisitions over the last several years related to Big Data Management and we continue to do so. This quarter we have done 3 acquisitions with strong relevance to Big Data. One of them is Vivisimo ( ).

Vivisimo provides federated Big Data discovery, search and profiling capabilities to help you figure out what data is out there,what is relevance of that data to your data science project- to help you answer the question which data should you bring in your Hadoop Cluster.

 We also did Varicent , which is more performance management and we did TeaLeaf , which is a customer experience solution provider where customer experience management and optimization is one of the hot killer apps for Hadoop in the cloud. We have done great many acquisitions that have a clear relevance to Big Data.

Netezza already had a massively parallel analytics database product with an embedded library of models called Netezza Analytics, and in-database capabilties to massively parallelize Map Reduce and other analytics management functions inside the database. In many ways, Netezza provided capabilities similar to that IBM had provided for many years under the Smart Analytics Platform ( ) .

There is a differential between Netezza and ISAS.

ISAS was built predominantly in-house over several years . If you go back a decade ago IBM acquired Ascential Software , a product portfolio that was the heart and soul of IBM InfoSphere Information Manager that is core to our big Data platform. In addition to Netezza, IBM bought SPSS two years back. We already had data mining tools and predictive modeling in the InfoSphere portfolio, but we realized we needed to have the best of breed, SPSS provided that and so IBM acquired them.

 Cognos– We had some BI reporting capabilities in the InfoSphere portfolio that we had built ourselves and also acquired for various degrees from prior acquisitions. But clearly Cognos was one of the best BI vendors , and we were lacking such a rich tool set in our product in visualization and cubing and so for that reason we acquired Cognos.

There is also Unica – which is a marketing campaign optimization which in many ways is a killer app for Hadoop. Projects like that are driving many enterprises.

Ajay- How would you rank order these acquisitions in terms of strategic importance rather than data of acquisition or price paid.

Jim-Think of Big Data as an ecosystem that has components that are fitted to particular functions for data analytics and data management. Is the database the core, or the modeling tool the core, or the governance tools the core, or is the hardware platform the core. Everything is critically important. We would love to hear from you what you think have been most important. Each acquisition has helped play a critical role to build the deepest and broadest solution offering in Big Data. We offer the hardware, software, professional services, the hosting service. I don’t think there is any validity to a rank order system.

Ajay-What are the initiatives regarding open source that Big Data group have done or are planning?

Jim- What we are doing now- We are very much involved with the Apache Hadoop community. We continue to evolve the open source code that everyone leverages.. We have built BigInsights on Apache Hadoop. We have the closest, most up to date in terms of version number to Apache Hadoop ( Hbase,HDFS, Pig etc) of all commercial distributions with our BigInsights 1.4 .

We have an R library integrated with BigInsights . We have a R library integrated with Netezza Analytics. There is support for R Models within the SPSS portfolio. We already have a fair amount of support for R across the portfolio.

Ajay- What are some of the concerns (privacy,security,regulation) that you think can dampen the promise of Big Data.

Jim- There are no showstoppers, there is really a strong momentum. Some of the concerns within the Hadoop space are immaturity of the technology, the immaturity of some of the commercial offerings out there that implement Hadoop, the lack of standardization for formal sense for Hadoop.

There is no Open Standards Body that declares, ratifies the latest version of Mahout, Map Reduce, HDFS etc. There is no industry consensus reference framework for layering these different sub projects. There are no open APIs. There are no certifications or interoperability standards or organizations to certify different vendors interoperability around a common API or framework.

The lack of standardization is troubling in this whole market. That creates risks for users because users are adopting multiple Hadoop products. There are lots of Hadoop deployments in the corporate world built around Apache Hadoop (purely open source). There may be no assurance that these multiple platforms will interoperate seamlessly. That’s a huge issue in terms of just magnifying the risk. And it increases the need for the end user to develop their own custom integrated code if they want to move data between platforms, or move map-reduce jobs between multiple distributions.

Also governance is a consideration. Right now Hadoop is used for high volume ETL on multi structured and unstructured data sources, or Hadoop is used for exploratory sand boxes for data scientists. These are important applications that are a majority of the Hadoop deployments . Some Hadoop deployments are stand alone unstructured data marts for specific applications like sentiment analysis like.

Hadoop is not yet ready for data warehousing. We don’t see a lot of Hadoop being used as an alternative to data warehouses for managing the single version of truth of system or record data. That day will come but there needs to be out there in the marketplace a broader range of data governance mechanisms , master data management, data profiling products that are mature that enterprises can use to make sure their data inside their Hadoop clusters is clean and is the single version of truth. That day has not arrived yet.

One of the great things about IBM’s acquisition of Vivisimo is that a piece of that overall governance picture is discovery and profiling for unstructured data , and that is done very well by Vivisimo for several years.

What we will see is vendors such as IBM will continue to evolve security features inside of our Hadoop platform. We will beef up our data governance capabilities for this new world of Hadoop as the core of Big Data, and we will continue to build up our ability to integrate multiple databases in our Hadoop platform so that customers can use data from a bit of Hadoop,some data from a bit of traditional relational data warehouse, maybe some noSQL technology for different roles within a very complex Big Data environment.

That latter hybrid deployment model is becoming standard across many enterprises for Big Data. A cause for concern is when your Big Data deployment has a bit of Hadoop, bit of noSQL, bit of EDW, bit of in-memory , there are no open standards or frameworks for putting it all together for a unified framework not just for interoperability but also for deployment.

There needs to be a virtualization or abstraction layer for unified access to all these different Big Data platforms by the users/developers writing the queries, by administrators so they can manage data and resources and jobs across all these disparate platforms in a seamless unified way with visual tooling. That grand scenario, the virtualization layer is not there yet in any standard way across the big data market. It will evolve, it may take 5-10 years to evolve but it will evolve.

So, that’s the concern that can dampen some of the enthusiasm for Big Data Analytics.


You can read more about Jim at or

follow him on Twitter at

You can read more about IBM Big Data at

Software Review- Google Drive versus Dropbox

Here are some notes from reviewing Google Drive vs Dropbox

1) Google Drive gives more free space upfront  than Dropbox.5GB versus 2GB

2) Dropbox has a referral system 500 mb per referral while there is no referral system for Google Drive

3) The sync facility with Google Docs makes Google Drive especially useful for prior users of Google Docs.

4) API access to Google Drive is only for Chrome apps which is intriguing!

Apps will not have any API access to files unless users have first installed the app in Chrome Web Store.

You can use the Dropbox API much more easily –

See the platforms at

Choose your platform:

iOS Android Python Ruby


(though I wonder if you set the R working directory to the local shared drive for Google Drive it should sync up as well but of course be slower –

5) Google Drive icon is ugly (seriously, dude!) , but the features in the Windows app is just the same as the Dropbox App. Too similar 😉


6) Upgrade space is much more cheaper to Google Drive than Dropbox ( by Google Drive prices being exactly  a quarter of prices on Dropbox and max storage being 16 times as much). This will affect power storage users. I expect to see some slowdown in Dropbox new business unless G Drive has outage (like Gmail) . Existing users at Dropbox probably wont shift for the small dollar amount- though it is quite easy to do so.


Install Google Drive on your local workstation and cut and paste your Dropbox local folder to the Google Drive local folder!!

7) Dropbox deserves credit for being first (like Hotmail and AOL) but Google Drive is almost better in all respects!

Google Drive

5 GB of Drive (0% used)
10 GB of Gmail (48% used)
1 GB of Picasa (0% used)


25 GB
2,49 $ / Month
+25 GB for Drive and Picasa
Bonus: Your Gmail storage will be upgraded to 25 GB.
Choose this plan

100 GB
4,99 $ / Month
+100 GB for Drive and Picasa
Bonus: Your Gmail storage will be upgraded to 25 GB.
Choose this plan

 Need more storage?

Up to 16 TB available


Current account type

Large DropboxDropbox Badge greenFree
Up to 18 GB (2 GB + 500 MB per referral)
Account info 

Other account types

Large DropboxDropbox Badge orange50 GB +
Pro 50
+1 GB per referral, up to +32 GB
$9.99/month or $99.00/year Upgrade to Pro 50
Large DropboxDropbox Badge purple100 GB +
Pro 100
+1 GB per referral, up to +32 GB
$19.99/month or $199.00/year Upgrade to Pro 100
Triple DropboxDropbox For Teams Badge1 TB +
Plans starting at 1 TB
Large shared quota, centralized admin and billing, and more!




Interview: Hjálmar Gíslason, CEO of

Here is an interview with Hjálmar Gíslason, CEO of  . DataMarket is an active marketplace for structured data and statistics. Through powerful search and visual data exploration, DataMarket connects data seekers with data providers.


Ajay-  Describe your journey as an entrepreneur and techie in Iceland. What are the 10 things that surprised you most as a tech entrepreneur.

HG- DataMarket is my fourth tech start-up since at age 20 in 1996. The previous ones have been in gaming, mobile and web search. I come from a technical background but have been moving more and more to the business side over the years. I can still prototype, but I hope there isn’t a single line of my code in production!

Funny you should ask about the 10 things that have surprised me the most on this journey, as I gave a presentation – literally yesterday – titled: “9 things nobody told me about the start-up business”

They are:
* Do NOT generalize – especially not to begin with
* Prioritize – and find a work-flow that works for you
* Meet people – face to face
* You are a sales person – whether you like it or not
* Technology is not a product – it’s the entire experience
* Sell the current version – no matter how amazing the next one is
* Learn from mistakes – preferably others’
* Pick the right people – good people is not enough
* Tell a good story – but don’t make them up

I obviously elaborate on each of these points in the talk, but the points illustrate roughly some of the things I believe I’ve learned … so far 😉

9 things nobody told me about the start-up business


Both Amazon  and Google  have entered the public datasets space. Infochimps  has 14,000+ public datasets. The US has

So clearly the space is both competitive and yet the demand for public data repositories is clearly under served still. 

How does DataMarket intend to address this market in a unique way to differentiate itself from others.

HG- DataMarket is about delivering business data to decision makers. We help data seekers find the data they need for planning and informed decision making, and data publishers reaching this audience. is the meeting point, where data seekers can come to find the best available data, and data publishers can make their data available whether for free or for a fee. We’ve populated the site with a wealth of data from public sources such as the UN, Eurostat, World Bank, IMF and others, but there is also premium data that is only available to those that subscribe to and pay for the access. For example we resell the entire data offering from the EIU (Economist Intelligence Unit) (link: allows all this data to be searched, visualized, compared and downloaded in a single place in a standard, unified manner.

We see many of these efforts not as competition, but as valuable potential sources of data for our offering, while others may be competing with parts of our proposition, such as easy access to the public data sets.


Ajay- What are your views on data confidentiality and access to data owned by Governments funded by tax payer money.

HG- My views are very simple: Any data that is gathered or created for taxpayers’ money should be open and free of charge unless higher priorities such as privacy or national security indicate otherwise.

Reflecting that, any data that is originally open and free of charge is still open and free of charge on, just easier to find and work with.

Ajay-  How is the technology entrepreneurship and venture capital scene in Iceland. What things work and what things can be improved?

HG- The scene is quite vibrant, given the small community. Good teams with promising concepts have been able to get the funding they need to get started and test their footing internationally. When the rapid growth phase is reached outside funding may still be needed.

There are positive and negative things about any location. Among the good things about Iceland from the stand point of a technology start-up are highly skilled tech people and a relatively simple corporate environment. Among the bad things are a tiny local market, lack of skills in international sales and marketing and capital controls that were put in place after the crash of the Icelandic economy in 2008.

I’ve jokingly said that if a company is hot in the eyes of VCs it would get funding even if it was located in the jungles of Congo, while if they’re only lukewarm towards you, they will be looking for any excuse not to invest. Location can certainly be one of them, and in that case being close to the investor communities – physically – can be very important.

We’re opening up our sales and marketing offices in Boston as we speak. Not to be close to investors though, but to be close to our market and current customers.

Ajay- Describe your hobbies when you are not founding amazing tech startups.

HG- Most of my time is spent working – which happens to by my number one hobby.

It is still important to step away from it all every now and then to see things in perspective and come back with a clear mind.

I *love* traveling to exotic places. Me and my wife have done quite a lot of traveling in Africa and S-America: safari, scuba diving, skiing, enjoying nature. When at home I try to do some sports activities 3-4 times a week at least, and – recently – play with my now 8 month old son as much as I can.



Hjalmar GislasonHjálmar Gíslason, Founder and CEO: Hjalmar is a successful entrepreneur, founder of three startups in the gaming, mobile and web sectors since 1996. Prior to launching DataMarket, Hjalmar worked on new media and business development for companies in the Skipti Group (owners of Iceland Telecom) after their acquisition of his search startup – Spurl. Hjalmar offers a mix of business, strategy and technical expertise. DataMarket is based largely on his vision of the need for a global exchange for structured data.

To know more, have a quick  look at

Amazon gives away 750 hours /month of Windows based computing

and an additional 750 hours /month of Linux based computing. The windows instance is really quite easy for users to start getting the hang of cloud computing. and it is quite useful for people to tinker around, given Google’s retail cloud offerings are taking so long to hit the market

But it is only for new users.

WS Free Usage Tier now Includes Microsoft Windows on EC2

The AWS Free Usage Tier now allows you to run Microsoft Windows Server 2008 R2 on an EC2 t1.micro instance for up to 750 hours per month. This benefit is open to new AWS customers and to those who are already participating in the Free Usage Tier, and is available in all AWS Regions with the exception of GovCloud. This is an easy way for Windows users to start learning about and enjoying the benefits of cloud computing with AWS.

The micro instances provide a small amount of consistent processing power and the ability to burst to a higher level of usage from time to time. You can use this instance to learn about Amazon EC2, support a development and test environment, build an AWS application, or host a web site (or all of the above). We’ve fine-tuned the micro instances to make them even better at running Microsoft Windows Server.

You can launch your instance from the AWS Management Console:

We have lots of helpful resources to get you started:

Along with 750 instance hours of Windows Server 2008 R2 per month, the Free Usage Tier also provides another 750 instance hours to run Linux (also on a t1.micro), Elastic Load Balancer time and bandwidth, Elastic Block Storage, Amazon S3 Storage, and SimpleDB storage, a bunch of Simple Queue Service and Simple Notification Service requests, and some CloudWatch metrics and alarms (see the AWS Free Usage Tier page for details). We’ve also boosted the amount of EBS storage space offered in the Free Usage Tier to 30GB, and we’ve doubled the I/O requests in the Free Usage Tier, to 2 million.


2011 Analytics Recap

Events in the field of data that impacted us in 2011

1) Oracle unveiled plans for R Enterprise. This is one of the strongest statements of its focus on in-database analytics. Oracle also unveiled plans for a Public Cloud

2) SAS Institute released version 9.3 , a major analytics software in industry use.

3) IBM acquired many companies in analytics and high tech. Again.However the expected benefits from Cognos-SPSS integration are yet to show a spectacular change in market share.

2011 Selected acquisitions

Emptoris Inc. December 2011

Cúram Software Ltd. December 2011

DemandTec December 2011

Platform Computing October 2011

 Q1 Labs October 2011

Algorithmics September 2011

 i2 August 2011

Tririga March 2011


4) SAP promised a lot with SAP HANA- again no major oohs and ahs in terms of market share fluctuations within analytics.

5) Amazon continued to lower prices of cloud computing and offer more options.

6) Google continues to dilly -dally with its analytics and cloud based APIs. I do not expect all the APIs in the Google APIs suit to survive and be viable in the enterprise software space.  This includes Google Cloud Storage, Cloud SQL, Prediction API at Some of the location based , translation based APIs may have interesting spin offs that may be very very commercially lucrative.

7) Microsoft -did- hmm- I forgot. Except for its investment in Revolution Analytics round 1 many seasons ago- very little excitement has come from MS plans in data mining- The plugins for cloud based data mining from Excel remain promising yet , while Azure remains a stealth mode starter.

8) Revolution Analytics promised us a GUI and didnt deliver (till yet 🙂 ) . But it did reveal a much better Enterprise software Revolution R 5.0 is one of the strongest enterprise software in the R /Stat Computing space and R’s memory handling problem is now an issue of perception than actual stuff thanks to newer advances in how it is used.

9) More conferences, more books and more news on analytics startups in 2011. Big Data analytics remained a strong buzzword. Expect more from this space including creative uses of Hadoop based infrastructure.

10) Data privacy issues continue to hamper and impede effective analytics usage. So does rational and balanced regulation in some of the most advanced economies. We expect more regulation and better guidelines in 2012.

JMP and R – #rstats

An amazing example of R being used sucessfully in combination (and not is isolation) with other enterprise software is the add-ins functionality of JMP and it’s R integration.

See the following JMP add-ins which use R

JMP Add-in: Multidimensional Scaling using R

This add-in creates a new menu command under the Add-Ins Menu in the submenu R Add-ins. The script will launch a custom dialog (or prompt for a JMP data table is one is not already open) where you can cast columns into roles for performing MDS on the data table. The analysis results in a data table of MDS dimensions and associated output graphics. MDS is a dimension reduction method that produces coordinates in Euclidean space (usually 2D, 3D) that best represent the structure of a full distance/dissimilarity matrix. MDS requires that input be a symmetric dissimilarity matrix. Input to this application can be data that is already in the form of a symmetric dissimilarity matrix or the dissimilarity matrix can be computed based on the input data (where dissimilarity measures are calculated between rows of the input data table in R).

Submitted by: Kelci Miclaus SAS employee Initiative: All
Application: Add-Ins Analysis: Exploratory Data Analysis

Chernoff Faces Add-in

One way to plot multivariate data is to use Chernoff faces. For each observation in your data table, a face is drawn such that each variable in your data set is represented by a feature in the face. This add-in uses JMP’s R integration functionality to create Chernoff faces. An R install and the TeachingDemos R package are required to use this add-in.

Submitted by: Clay Barker SAS employee Initiative: All
Application: Add-Ins Analysis: Data Visualization

Support Vector Machine for Classification

By simply opening a data table, specifying X, Y variables, selecting a kernel function, and specifying its parameters on the user-friendly dialog, you can build a classification model using Support Vector Machine. Please note that R package ‘e1071’ should be installed before running this dialog. The package can be found from

Submitted by: Jong-Seok Lee SAS employee Initiative: All
Application: Add-Ins Analysis: Exploratory Data Analysis/Mining

Penalized Regression Add-in

This add-in uses JMP’s R integration functionality to provide access to several penalized regression methods. Methods included are the LASSO (least absolutee shrinkage and selection operator, LARS (least angle regression), Forward Stagewise, and the Elastic Net. An R install and the “lars” and “elasticnet” R packages are required to use this add-in.

Submitted by: Clay Barker SAS employee Initiative: All
Application: Add-Ins Analysis: Regression

MP Addin: Univariate Nonparametric Bootstrapping

This script performs simple univariate, nonparametric bootstrap sampling by using the JMP to R Project integration. A JMP Dialog is built by the script where the variable you wish to perform bootstrapping over can be specified. A statistic to compute for each bootstrap sample is chosen and the data are sent to R using new JSL functionality available in JMP 9. The boot package in R is used to call the boot() function and the function to calculate the sample statistic for each bootstrap sample and the basic bootstrap confidence interval. The results are brought back to JMP and displayed using the JMP Distribution platform.

Submitted by: Kelci Miclaus SAS employee Initiative: All
Application: Add-Ins Analysis: Basic Statistics