SAS Institute Financials 2011

SAS Institute has release it’s financials for 2011 at http://www.sas.com/news/preleases/2011financials.html,

Revenue surged across all solution and industry categories. Software to detect fraud saw a triple-digit jump. Revenue from on-demand solutions grew almost 50 percent. Growth from analytics and information management solutions were double digit, as were gains from customer intelligence, retail, risk and supply chain solutions

AJAY- and as a private company it is quite nice that they are willing to share so much information every year.

The graphics are nice ( and the colors much better than in 2010) , but pie-charts- seriously dude there is no way to compare how much SAS revenue is shifting across geographies or even across industries. So my two cents is – lose the pie charts, and stick to line graphs please for the share of revenue by country /industry.

In 2011, SAS grew staff 9.2 percent and reinvested 24 percent of revenue into research and development

AJAY- So that means 654 million dollars spent in Research and Development.  I wonder if SAS has considered investing in much smaller startups (than it’s traditional strategy of doing all research in-house and completely acquiring a smaller company)

Even a small investment of say 5-10 million USD in open source , or even Phd level research projects could greatly increase the ROI on that.

That means

Analyzing a private company’s financials are much more fun than a public company, and I remember the words of my finance professor ( “dig , dig”) to compare 2011 results with 2010 results.

http://www.sas.com/news/preleases/2010financials.html

The percentage invested in R and D is exactly the same (24%) and the percentages of revenue earned from each geography is exactly the same . So even though revenue growth increased from 5.2 % to 9% in 2011, both the geographic spread of revenues and share  R&D costs remained EXACTLY the same.

The Americas accounted for 46 percent of total revenue; Europe, Middle East and Africa (EMEA) 42 percent; and Asia Pacific 12 percent.

Overall, I think SAS remains a 35% market share (despite all that noise from IBM, SAS clones, open source) because they are good at providing solutions customized for industries (instead of just software products), the market for analytics is not saturated (it seems to be growing faster than 12% or is it) , and its ability to attract and retain the best analytical talent (which in a non -American tradition for a software company means no stock options, job security, and great benefits- SAS remains almost Japanese in HR practices).

In 2010, SAS grew staff by 2.4 percent, in 2011 SAS grew staff by 9 percent.

But I liked the directional statement made here-and I think that design interfaces, algorithmic and computational efficiencies should increase analytical time, time to think on business and reduce data management time further!

“What would you do with the extra time if your code ran in two minutes instead of five hours?” Goodnight challenged.

Quantitative Modeling for Arbitrage Positions in Ad KeyWords Internet Marketing

Assume you treat an ad keyword as an equity stock. There are slight differences in the cost for advertising for that keyword across various locations (Zurich vs Delhi) and various channels (Facebook vs Google) . You get revenue if your website ranks naturally in organic search for the keyword, and you have to pay costs for getting traffic to your website for that keyword.
An arbitrage position is defined as a riskless profit when cost of keyword is less than revenue from keyword. We take examples of Adsense  and Adwords primarily.
There are primarily two types of economic curves on the foundation of which commerce of the  internet  resides-
1) Cost Curve- Cost of Advertising to drive traffic into the website  (Google Adwords, Twitter Ads, Facebook , LinkedIn ads)
2) Revenue Curve – Revenue from ads clicked by the incoming traffic on website (like Adsense, LinkAds, Banner Ads, Ad Sharing Programs , In Game Ads)
The cost and revenue curves are primarily dependent on two things
1) Type of KeyWord-Also subdependent on
a) Location of Prospective Customer, and
b) Net Present Value of Good and Service to be eventually purchased
For example , keyword for targeting sales of enterprise “business intelligence software” should ideally be costing say X times as much as keywords for “flower shop for birthdays” where X is the multiple of the expected payoffs from sales of business intelligence software divided by expected payoff from sales of flowers (say in Location, Daytona Beach ,Florida or Austin, Texas)
2) Traffic Volume – Also sub-dependent on Time Series and
a) Seasonality -Annual Shoppping Cycle
b) Cyclicality– Macro economic shifts in time series
The cost and revenue curves are not linear and ideally should be continuous in a definitive exponential or polynomial manner, but in actual reality they may have sharp inflections , due to location, time, as well as web traffic volume thresholds
Type of Keyword – For example ,keywords for targeting sales for Eminem Albums may shoot up in a non linear manner after the musician dies.
The third and not so publicly known component of both the cost and revenue curves is factoring in internet industry dynamics , including relative market share of internet advertising platforms, as well as percentage splits between content creator and ad providing platforms.
For example, based on internet advertising spend, people belive that the internet advertising is currently heading for a duo-poly with Google and Facebook are the top two players, while Microsoft/Skype/Yahoo and LinkedIn/Twitter offer niche options, but primarily depend on price setting from Google/Bing/Facebook.
It is difficut to quantify  the elasticity and efficiency of market curves as most literature and research on this is by in-house corporate teams , or advisors or mentors or consultants to the primary leaders in a kind of incesteous fraternal hold on public academic research on this.
It is recommended that-
1) a balance be found in the need for corporate secrecy to protest shareholder value /stakeholder value maximization versus the need for data liberation for innovation and grow the internet ad pie faster-
2) Cost and Revenue Curves between different keywords, time,location, service providers, be studied by quants for hedging inetrent ad inventory or /and choose arbitrage positions This kind of analysis is done for groups of stocks and commodities in the financial world, but as commerce grows on the internet this may need more specific and independent quants.
3) attention be made to how cost and revenue curves mature as per level of sophistication of underlying economy like Brazil, Russia, China, Korea, US, Sweden may be in different stages of internet ad market evolution.
For example-
A study in cost and revenue curves for certain keywords across domains across various ad providers across various locations from 2003-2008 can help academia and research (much more than top ten lists of popular terms like non quantitative reports) as well as ensure that current algorithmic wightings are not inadvertently given away.
Part 2- of this series will explore the ways to create third party re-sellers of keywords and measuring impacts of search and ad engine optimization based on keywords.

2011 Analytics Recap

Events in the field of data that impacted us in 2011

1) Oracle unveiled plans for R Enterprise. This is one of the strongest statements of its focus on in-database analytics. Oracle also unveiled plans for a Public Cloud

2) SAS Institute released version 9.3 , a major analytics software in industry use.

3) IBM acquired many companies in analytics and high tech. Again.However the expected benefits from Cognos-SPSS integration are yet to show a spectacular change in market share.

2011 Selected acquisitions

Emptoris Inc. December 2011

Cúram Software Ltd. December 2011

DemandTec December 2011

Platform Computing October 2011

 Q1 Labs October 2011

Algorithmics September 2011

 i2 August 2011

Tririga March 2011

 

4) SAP promised a lot with SAP HANA- again no major oohs and ahs in terms of market share fluctuations within analytics.

http://www.sap.com/india/news-reader/index.epx?articleID=17619

5) Amazon continued to lower prices of cloud computing and offer more options.

http://aws.amazon.com/about-aws/whats-new/2011/12/21/amazon-elastic-mapreduce-announces-support-for-cc2-8xlarge-instances/

6) Google continues to dilly -dally with its analytics and cloud based APIs. I do not expect all the APIs in the Google APIs suit to survive and be viable in the enterprise software space.  This includes Google Cloud Storage, Cloud SQL, Prediction API at https://code.google.com/apis/console/b/0/ Some of the location based , translation based APIs may have interesting spin offs that may be very very commercially lucrative.

7) Microsoft -did- hmm- I forgot. Except for its investment in Revolution Analytics round 1 many seasons ago- very little excitement has come from MS plans in data mining- The plugins for cloud based data mining from Excel remain promising yet , while Azure remains a stealth mode starter.

8) Revolution Analytics promised us a GUI and didnt deliver (till yet 🙂 ) . But it did reveal a much better Enterprise software Revolution R 5.0 is one of the strongest enterprise software in the R /Stat Computing space and R’s memory handling problem is now an issue of perception than actual stuff thanks to newer advances in how it is used.

9) More conferences, more books and more news on analytics startups in 2011. Big Data analytics remained a strong buzzword. Expect more from this space including creative uses of Hadoop based infrastructure.

10) Data privacy issues continue to hamper and impede effective analytics usage. So does rational and balanced regulation in some of the most advanced economies. We expect more regulation and better guidelines in 2012.

High Performance Analytics

Marry Big Data Analytics to High Performance Computing, and you get the buzzword of this season- High Performance Analytics.

It basically consists of Parallelized code to run in parallel on custom hardware, in -database analytics for speed, and cloud computing /high performance computing environments. On an operational level, it consists of software (as in analytics) partnering with software (as in databases, Map reduce, Hadoop) plus some hardware (HP or IBM mostly). It is considered a high margin , highly profitable, business with small number of deals compared to say desktop licenses.

As per HPC Wire- which is a great tool/newsletter to keep updated on HPC , SAS Institute has been busy on this front partnering with EMC Greenplum and TeraData (who also acquired  SAS Partner AsterData to gain a much needed foot in the MR/SQL space) Continue reading “High Performance Analytics”

Top Ten Graphs for Business Analytics -Pie Charts (1/10)

I have not been really posting or writing worthwhile on the website for some time, as I am still busy writing ” R for Business Analytics” which I hope to get out before year end. However while doing research for that, I came across many types of graphs and what struck me is the actual usage of some kinds of graphs is very different in business analytics as compared to statistical computing.

The criterion of top ten graphs is as follows-

1) Usage-The order in which they appear is not strictly in terms of desirability but actual frequency of usage. So a frequently used graph like box plot would be recommended above say a violin plot.

2) Adequacy- Data Visualization paradigms change over time- but the need for accurate conveying of maximum information in a minium space without overwhelming reader or misleading data perceptions.

3) Ease of creation- A simpler graph created by a single function is more preferrable to writing 4-5 lines of code to create an elaborate graph.

4) Aesthetics– Aesthetics is relative and  in addition studies have shown visual perception varies across cultures and geographies. However , beauty is universally appreciated and a pretty graph is sometimes and often preferred over a not so pretty graph. Here being pretty is in both visual appeal without compromising perceptual inference from graphical analysis.

 

so When do we use a bar chart versus a line graph versus a pie chart? When is a mosaic plot more handy and when should histograms be used with density plots? The list tries to capture most of these practicalities.

Let me elaborate on some specific graphs-

1) Pie Chart- While Pie Chart is not really used much in stats computing, and indeed it is considered a misleading example of data visualization especially the skewed or two dimensional charts. However when it comes to evaluating market share at a particular instance, a pie chart is simple to understand. At the most two pie charts are needed for comparing two different snapshots, but three or more pie charts on same data at different points of time is definitely a bad case.

In R you can create piechart, by just using pie(dataset$variable)

As per official documentation, pie charts are not  recommended at all.

http://stat.ethz.ch/R-manual/R-patched/library/graphics/html/pie.html

Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A bar chart or dot chart is a preferable way of displaying this type of data.

Cleveland (1985), page 264: “Data that can be shown by pie charts always can be shown by a dot chart. This means that judgements of position along a common scale can be made instead of the less accurate angle judgements.” This statement is based on the empirical investigations of Cleveland and McGill as well as investigations by perceptual psychologists.

—-

Despite this, pie charts are frequently used as an important metric they inevitably convey is market share. Market share remains an important analytical metric for business.

The pie3D( ) function in the plotrix package provides 3D exploded pie charts.An exploded pie chart remains a very commonly used (or misused) chart.

From http://lilt.ilstu.edu/jpda/charts/chart%20tips/Chartstip%202.htm#Rules

we see some rules for using Pie charts.

 

  1. Avoid using pie charts.
  2. Use pie charts only for data that add up to some meaningful total.
  3. Never ever use three-dimensional pie charts; they are even worse than two-dimensional pies.
  4. Avoid forcing comparisons across more than one pie chart

 

From the R Graph Gallery (a slightly outdated but still very comprehensive graphical repository)

http://addictedtor.free.fr/graphiques/RGraphGallery.php?graph=4

par(bg="gray")
pie(rep(1,24), col=rainbow(24), radius=0.9)
title(main="Color Wheel", cex.main=1.4, font.main=3)
title(xlab="(test)", cex.lab=0.8, font.lab=3)
(Note adding a grey background is quite easy in the basic graphics device as well without using an advanced graphical package)

 

HIGHLIGHTS from REXER Survey :R gives best satisfaction

Simple graph showing hierarchical clustering. ...
Image via Wikipedia

A Summary report from Rexer Analytics Annual Survey

 

HIGHLIGHTS from the 4th Annual Data Miner Survey (2010):

 

•   FIELDS & GOALS: Data miners work in a diverse set of fields.  CRM / Marketing has been the #1 field in each of the past four years.  Fittingly, “improving the understanding of customers”, “retaining customers” and other CRM goals are also the goals identified by the most data miners surveyed.

 

•   ALGORITHMS: Decision trees, regression, and cluster analysis continue to form a triad of core algorithms for most data miners.  However, a wide variety of algorithms are being used.  This year, for the first time, the survey asked about Ensemble Models, and 22% of data miners report using them.
A third of data miners currently use text mining and another third plan to in the future.

 

•   MODELS: About one-third of data miners typically build final models with 10 or fewer variables, while about 28% generally construct models with more than 45 variables.

 

•   TOOLS: After a steady rise across the past few years, the open source data mining software R overtook other tools to become the tool used by more data miners (43%) than any other.  STATISTICA, which has also been climbing in the rankings, is selected as the primary data mining tool by the most data miners (18%).  Data miners report using an average of 4.6 software tools overall.  STATISTICA, IBM SPSS Modeler, and R received the strongest satisfaction ratings in both 2010 and 2009.

 

•   TECHNOLOGY: Data Mining most often occurs on a desktop or laptop computer, and frequently the data is stored locally.  Model scoring typically happens using the same software used to develop models.  STATISTICA users are more likely than other tool users to deploy models using PMML.

 

•   CHALLENGES: As in previous years, dirty data, explaining data mining to others, and difficult access to data are the top challenges data miners face.  This year data miners also shared best practices for overcoming these challenges.  The best practices are available online.

 

•   FUTURE: Data miners are optimistic about continued growth in the number of projects they will be conducting, and growth in data mining adoption is the number one “future trend” identified.  There is room to improve:  only 13% of data miners rate their company’s analytic capabilities as “excellent” and only 8% rate their data quality as “very strong”.

 

Please contact us if you have any questions about the attached report or this annual research program.  The 5th Annual Data Miner Survey will be launching next month.  We will email you an invitation to participate.

 

Information about Rexer Analytics is available at www.RexerAnalytics.com. Rexer Analytics continues their impressive journey see http://www.rexeranalytics.com/Clients.html

|My only thought- since most data miners are using multiple tools including free tools as well as paid software, Perhaps a pie chart of market share by revenue and volume would be handy.

Also some ideas on comparing diverse data mining projects by data size, or complexity.

 

PAW Blog Partnership

Please use the following code  to get a 15% discount on the 2 Day Conference Pass: AJAY11.

 

 

 

 

Predictive Analytics World announces new full-day workshops coming to San Francisco March 13-19, amounting to seven consecutive days of content.

These workshops deliver top-notch analytical and business expertise across the hottest topics.

Register now for one or more workshops, offered just before and after the full two-day Predictive Analytics World conference program (March 14-15). Early Bird registration ends on January 31st – take advantage of reduced pricing before then.

Driving Enterprise Decisions with Business Analytics – March 13, 2011
James Taylor, CEO, Decision Management Solutions
NEW – R for Predictive Modeling: A Hands-On Introduction – March 13, 2011
Max Kuhn, Director, Nonclinical Statistics, Pfizer
The Best and Worst of Predictive Analytics: Predictive Modeling Methods and Common Data Mining Mistakes – March 16, 2011
John Elder, Ph.D., CEO and Founder, Elder Research, Inc.
Hands-On Predictive Analytics – March 17, 2011
Dean Abbott, President, Abbott Analytics
NEW – Net Lift Models: Optimizing the Impact of Your Marketing – March 18-19, 2011
Kim Larsen, VP of Analytical Insights, Market Share Partners

Download the Conference Preview or view the Predictive Analytics World Agenda online

Make savings now with the early bird rate. Receive $200 off your registration rate for Predictive Analytics World – San Francisco (March 14-15), plus $100 off each workshop for which you register.

Register now before Early Bird Price expires on January 31st!

Additional savings of $200 on the two-day conference pass when you register a colleague at the same time.