KDNuggets Poll on SAS: Churn in Analytics Users

Here are some surprising results from the Bible of all data miners, KDNuggets.com, with some interesting comments about SAS being the Microsoft of analytics.

I believe technically advanced users will probably want to try out R, which is free, before going in for a commercial license from Revolution Analytics. WPS also offers a one-month free preview of its software; its latest release competes with Base SAS, SAS/STAT, SAS/ACCESS and SAS/GRAPH, so anyone running those installations on a server would be interested in at least testing it for free. WPS would also be interested in adding more data engines (like the ones it already has for Oracle and Teradata).

One very crucial differentiator for SAS is its ability to pull in data from almost all data formats, so if you are using SAS/CONNECT to remote-submit code, you may not be able to switch soon.

Also, the more license-heavy customers are not the kind who keep lots of data on their local desktops; data is usually pulled from a server and crunched before being analyzed. R has recently made strides here with the RevoScaleR package from Revolution Analytics, but its effectiveness will be tested and tried in the coming months; it seems like a great step in the right direction.

For SAS, the feedback should be a call to improve its product bundling, some of which can feel like overselling at times. But SAS has been fighting off challengers for the past four decades and has both the pockets and the intention to sustain market-share battles, including through discounts (for repeat customers, SAS can be much cheaper than WPS or commercial R would be for a first-time user).

http://teamwpc.co.uk/home

This really should come as a surprise to some people. You can see the comments on WPS and R at the site itself. Interesting stuff; we can check back after, say, a year to see how many actually DID switch.

http://www.kdnuggets.com/polls/2010/switching-from-sas-to-wps.html

Dryad: Microsoft’s Answer to MapReduce

While reading across the internet, I came across Microsoft’s answer to MapReduce, called Dryad, which has been around for some time but has not generated quite the buzz that Hadoop or MapReduce have.

http://research.microsoft.com/en-us/projects/dryadlinq/

DryadLINQ

DryadLINQ is a simple, powerful, and elegant programming environment for writing large-scale data parallel applications running on large PC clusters.

Overview

New! An academic release of Dryad/DryadLINQ is now available for public download.

The goal of DryadLINQ is to make distributed computing on large compute clusters simple enough for every programmer. DryadLINQ combines two important pieces of Microsoft technology: the Dryad distributed execution engine and the .NET Language Integrated Query (LINQ).

Dryad provides reliable, distributed computing on thousands of servers for large-scale data parallel applications. LINQ enables developers to write and debug their applications in a SQL-like query language, relying on the entire .NET library and using Visual Studio.

DryadLINQ translates LINQ programs into distributed Dryad computations:

  • C# and LINQ data objects become distributed partitioned files.
  • LINQ queries become distributed Dryad jobs.
  • C# methods become code running on the vertices of a Dryad job.

DryadLINQ has the following features:

  • Declarative programming: computations are expressed in a high-level language similar to SQL
  • Automatic parallelization: from sequential declarative code the DryadLINQ compiler generates highly parallel query plans spanning large computer clusters. For exploiting multi-core parallelism on each machine DryadLINQ relies on the PLINQ parallelization framework.
  • Integration with Visual Studio: programmers in DryadLINQ take advantage of the comprehensive VS set of tools: Intellisense, code refactoring, integrated debugging, build, source code management.
  • Integration with .Net: all .Net libraries, including Visual Basic, and dynamic languages are available.
  • Conciseness: the following code is a complete implementation of the Map-Reduce computation framework in DryadLINQ:
    public static IQueryable<R> MapReduce<S,M,K,R>(
        this IQueryable<S> source,
        Expression<Func<S,IEnumerable<M>>> mapper,
        Expression<Func<M,K>> keySelector,
        Expression<Func<K,IEnumerable<M>,R>> reducer)
    {
        return source.SelectMany(mapper).GroupBy(keySelector, reducer);
    }
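As a quick usage sketch (mine, not from the Microsoft page), a word count can be written directly against this extension method; the WordCount wrapper and its input source are illustrative stand-ins:

using System.Collections.Generic;
using System.Linq;

// Hypothetical usage of the MapReduce method above: count word occurrences
// across a (conceptually distributed) collection of text lines.
public static IQueryable<KeyValuePair<string, int>> WordCount(IQueryable<string> lines)
{
    return lines.MapReduce(
        line => line.Split(' '),       // mapper: each line yields its words
        word => word,                  // key selector: group identical words
        (word, group) => new KeyValuePair<string, int>(word, group.Count()));
}

Under DryadLINQ the same three lambdas would run as mappers, a distributed group-by, and reducers on cluster vertices; over plain LINQ-to-objects they run locally, which makes the sketch easy to test.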

    And from http://research.microsoft.com/en-us/projects/dryad/:

    Dryad

    The Dryad Project is investigating programming models for writing parallel and distributed programs to scale from a small cluster to a large data-center.

    Overview

    New! An academic release of DryadLINQ is now available for public download.

    Dryad is an infrastructure which allows a programmer to use the resources of a computer cluster or a data center for running data-parallel programs. A Dryad programmer can use thousands of machines, each of them with multiple processors or cores, without knowing anything about concurrent programming.

    The Structure of Dryad Jobs

    A Dryad programmer writes several sequential programs and connects them using one-way channels. The computation is structured as a directed graph: programs are graph vertices, while the channels are graph edges. A Dryad job is a graph generator which can synthesize any directed acyclic graph. These graphs can even change during execution, in response to important events in the computation.

    Dryad is quite expressive. It completely subsumes other computation frameworks, such as Google’s map-reduce, or the relational algebra. Moreover, Dryad handles job creation and management, resource management, job monitoring and visualization, fault tolerance, re-execution, scheduling, and accounting.

    The Dryad Software Stack

    As a proof of Dryad’s versatility, a rich software ecosystem has been built on top of Dryad:

    • SSIS on Dryad executes many instances of SQL Server, each in a separate Dryad vertex, taking advantage of Dryad’s fault tolerance and scheduling. This system is currently deployed in a live production system as part of one of Microsoft’s AdCenter log-processing pipelines.
    • DryadLINQ generates Dryad computations from the LINQ Language-Integrated Query extensions to C#.
    • The distributed shell generalizes the pipe concept from the Unix shell: where Unix pipes allow the construction of one-dimensional (1-D) process structures, the distributed shell lets the programmer build 2-D structures in a scripting language. It generalizes Unix pipes in three ways:
      1. It allows processes to easily connect multiple file descriptors of each process — hence the 2-D aspect.
      2. It allows the construction of pipes spanning multiple machines, across a cluster.
      3. It virtualizes the pipelines, allowing the execution of pipelines with many more processes than available machines, by time-multiplexing processors and buffering results.
    • Several languages are compiled to distributed shell processes. PSQL is an early version, recently replaced with Scope.

    Publications

    Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
    Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly
    European Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21-23, 2007

    Video of a presentation on Dryad at the Google Campus, given by Michael Isard, Nov 1, 2007.

    Also interesting to read-

    Why does Dryad use a DAG?

    The basic computational model we decided to adopt for Dryad is the directed-acyclic graph (DAG). Each node in the graph is a computation, and each edge in the graph is a stream of data traveling in the direction of the edge. The amount of data on any given edge is assumed to be finite, the computations are assumed to be deterministic, and the inputs are assumed to be immutable. This isn’t by any means a new way of structuring a distributed computation (for example Condor had DAGMan long before Dryad came along), but it seemed like a sweet spot in the design space given our other constraints.

    So, why is this a sweet spot? A DAG is very convenient because it induces an ordering on the nodes in the graph. That makes it easy to design scheduling policies, since you can define a node to be ready when its inputs are available, and at any time you can choose to schedule as many ready nodes as you like in whatever order you like, and as long as you always have at least one scheduled you will continue to make progress and never deadlock. It also makes fault-tolerance easy, since given our determinism and immutability assumptions you can backtrack as far as you want in the DAG and re-execute as many nodes as you like to regenerate intermediate data that has been lost or is unavailable due to cluster failures.

    from

    http://blogs.msdn.com/b/dryad/archive/2010/07/23/why-does-dryad-use-a-dag.aspx
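The ready-node policy described in the passage above is easy to sketch. Below is a minimal, illustrative scheduler (my C# sketch, not Dryad’s actual code); it assumes every vertex appears as a key in the edge map, and simply runs a vertex once all of its inputs are available:

using System;
using System.Collections.Generic;
using System.Linq;

// Minimal illustrative DAG scheduler: run each vertex once all of its
// predecessors have completed. Not Dryad code; just the policy quoted above.
class DagScheduler
{
    // edges: vertex -> downstream vertices. Every vertex must appear as a key.
    public static void Run(Dictionary<string, List<string>> edges,
                           Action<string> execute)
    {
        // Count the inputs (in-degree) of every vertex.
        var indegree = edges.Keys.ToDictionary(v => v, v => 0);
        foreach (var targets in edges.Values)
            foreach (var t in targets)
                indegree[t]++;

        // The ready set: vertices whose inputs are all available now.
        var ready = new Queue<string>(
            indegree.Where(kv => kv.Value == 0).Select(kv => kv.Key));

        while (ready.Count > 0)
        {
            var v = ready.Dequeue();
            execute(v);                      // run this vertex's computation
            foreach (var t in edges[v])
                if (--indegree[t] == 0)      // last input just arrived
                    ready.Enqueue(t);        // t is now ready to schedule
        }
    }
}

Because the graph is acyclic, some vertex is always ready until every vertex has run, so the loop never deadlocks; re-generating lost intermediate data only requires re-running the (deterministic) ancestors of a failed vertex. Dryad’s real scheduler also handles placement, fault tolerance and re-execution, but the core readiness rule is just this topological discipline.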

        Big Data and R: New Product Release by Revolution Analytics

        A press release from the guys at Revolution Analytics, this time claiming to enable terabyte-level analytics with R. Interesting stuff, but the technical details are awaited.

        Revolution Analytics Brings Big Data Analysis to R

        The world’s most powerful statistics language can now tackle terabyte-class data sets using Revolution R Enterprise at a fraction of the cost of legacy analytics products


        JSM 2010 – VANCOUVER (August 3, 2010) — Revolution Analytics today introduced ‘Big Data’ analysis to its Revolution R Enterprise software, taking the popular R statistics language to unprecedented new levels of capacity and performance for analyzing very large data sets. For the first time, R users will be able to process, visualize and model terabyte-class data sets in a fraction of the time of legacy products—without employing expensive or specialized hardware.

        The new version of Revolution R Enterprise introduces an add-on package called RevoScaleR that provides a new framework for fast and efficient multi-core processing of large data sets. It includes:

        • The XDF file format, a new binary ‘Big Data’ file format with an interface to the R language that provides high-speed access to arbitrary rows, blocks and columns of data.
        • A collection of widely-used statistical algorithms optimized for Big Data, including high-performance implementations of Summary Statistics, Linear Regression, Binomial Logistic Regression and Crosstabs—with more to be added in the near future.
        • Data Reading & Transformation tools that allow users to interactively explore and prepare large data sets for analysis.
        • Extensibility: expert R users can develop and extend their own statistical algorithms to take advantage of Revolution R Enterprise’s new speed and scalability capabilities.

        “The R language’s inherent power and extensibility has driven its explosive adoption as the modern system for predictive analytics,” said Norman H. Nie, president and CEO of Revolution Analytics. “We believe that this new Big Data scalability will help R transition from an amazing research and prototyping tool to a production-ready platform for enterprise applications such as quantitative finance and risk management, social media, bioinformatics and telecommunications data analysis.”

        Sage Bionetworks is the nonprofit force behind the open-source collaborative effort, Sage Commons, a place where data and disease models can be shared by scientists to better understand disease biology. David Henderson, Director of Scientific Computing at Sage, commented: “At Sage Bionetworks, we need to analyze genomic databases hundreds of gigabytes in size with R. We’re looking forward to using the high-speed data-analysis features of RevoScaleR to dramatically reduce the times it takes us to process these data sets.”

        Take Hadoop and Other Big Data Sources to the Next Level

        Revolution R Enterprise fits well within the modern ‘Big Data’ architecture by leveraging popular sources such as Hadoop, NoSQL or key value databases, relational databases and data warehouses. These products can be used to store, regularize and do basic manipulation on very large datasets—while Revolution R Enterprise now provides advanced analytics at unparalleled speed and scale: producing speed on speed.

        “Together, Hadoop and R can store and analyze massive, complex data,” said Saptarshi Guha, developer of the popular RHIPE R package that integrates the Hadoop framework with R in an automatically distributed computing environment. “Employing the new capabilities of Revolution R Enterprise, we will be able to go even further and compute Big Data regressions and more.”

        Platforms and Availability

        The new RevoScaleR package will be delivered as part of Revolution R Enterprise 4.0, which will be available for 32- and 64-bit Microsoft Windows in the next 30 days. Support for Red Hat Enterprise Linux (RHEL 5) is planned for later this year.

        On its website (http://www.revolutionanalytics.com/bigdata), Revolution Analytics has published performance and scalability benchmarks for Revolution R Enterprise analyzing a 13.2 gigabyte data set of commercial airline information containing more than 123 million rows, and 29 columns.

        Additionally, the company will showcase its new Big Data solution in a free webinar on August 25 at 9:00 a.m. Pacific.

        Additional Resources

        • Big Data Benchmark whitepaper
        • The Revolution Analytics Roadmap whitepaper
        • Revolutions Blog
        • Download free academic copy of Revolution R Enterprise
        • Visit Inside-R.org for the most comprehensive set of information on R
        • Spread the word: Add a “Download R!” badge on your website
        • Follow @RevolutionR on Twitter

        About Revolution Analytics

        Revolution Analytics (http://www.revolutionanalytics.com) is the leading commercial provider of software and support for the popular open source R statistics language. Its Revolution R products help make predictive analytics accessible to every type of user and budget. The company is headquartered in Palo Alto, Calif. and backed by North Bridge Venture Partners and Intel Capital.

        Media Contact

        Chantal Yang
        Page One PR, for Revolution Analytics
        Tel: +1 415-875-7494

        Email:  revolution@pageonepr.com

        Open Source and Software Strategy

        Curt Monash at Monash Research pointed out some ongoing open-source GPL issues for WordPress, including the Thesis theme dispute (also see http://ma.tt/2009/04/oracle-and-open-source/ and http://www.mattcutts.com/blog/switching-things-around/).

        As a user of both for upwards of two years, I believe open source and GPL license enforcement are now standard parts of most software companies’ strategy. Some thoughts on open source and software strategy: Thesis remains a very popular theme and has earned upwards of $100,000 for its creator (an estimate based on 20,000-plus installs at a $60 average price).

        • Little guys like to give away code for satisfaction or recognition; big guys give away free code only when it’s necessary, or when they are not making money in that product segment anyway.
        • As Ethan Hunt said, “Every hero needs a villain.” Every software market-share war needs one big company holding most of the market, and an open-source strategy from another player that cannot build all the code in-house and so effectively outsources it by creating an open-source project. But the same open-source proponent rarely gives away the secret to its own money-making project.
          • Examples: Google creates open-source Android but won’t reveal the secret search algorithm that drives its main profits.
          • Google again publishes a paper on MapReduce, but it is Yahoo that champions Hadoop.
          • Apple creates open-source projects (http://www.apple.com/opensource/) but won’t give away the source code of its operating systems (why?), which helps sell its more expensive hardware.
          • IBM, which helped kickstart the whole proprietary-code era (remember MS-DOS?), is the new champion of open source (http://www.ibm.com/developerworks/opensource/).
          • Microsoft continues to spark open-source debate, but read http://blogs.technet.com/b/microsoft_blog/archive/2010/07/02/a-perspective-on-openness.aspx and also http://www.microsoft.com/opensource/.
          • SAS gives away a lot of open-source code (read Jim Davis, CMO of SAS, here) but will stick to its Base SAS code (even though it seems to be making more money from its verticals focus and data mining).
          • SPSS was the first big analytics company to support R (the open-source stats software) but clings to its own code for its products.
          • WordPress.org gives away its software as open source (and I like Akismet just as much as the blogging itself), but as anyone on WordPress.com knows, you can get quite locked in by its (pricey) platform.
          • Vendor lock-in (wink wink, price escalation) is the elephant in the room for big proprietary software companies.
          • SLA quality, maintenance and IP safety are the usual worries that hold companies back from open-source software.
        • Lack of IP protection for open-source revenue models is the big bottleneck for a lot of companies, even though very few software users would know what to do with source code if you gave it to them anyway.
          • If companies were confident that they would still earn the same revenue, with less leakage or theft, they would gladly give away the source code.
          • Derivative software and extensions help popularize the original software.
            • Halfway steps like Facebook Applications (Facebook being the original big company to create a platform for third-party creators), iPhone apps and Android applications show the success of creating APIs that protect IP and retain software control while still giving some freedom to developers.
            • The alternate user interfaces to R in SAS/IML and JMP are a similar example.
        • Basically, open source is mostly done by the underdog, while the top dog rakes in money (and envy).
        • There is yet to be a big commercial success in open-source software, though there are very good open-source products. Just as Google’s success helped establish advertising as an alternate (and now dominant) revenue source for online companies, open source needs the big example of a company that made billions while giving its source code away, yet still retained control and direction of its software strategy.
        • Open-source people love to hate proprietary packages, yet there are more shades of grey (rather than black and white) and more hypocrisy (read: lies) within the open-source movement than in the regulated world of big software. People will still be people. Software is just a piece of code. 😉

        (Art citation: http://gapingvoid.com/about/ and http://gapingvoidgallery.com/)

        Algorithms and Ads: No Free Lunches and Hill Climbing

        From http://www.no-free-lunch.org/

        More formally, where
        d = training set;
        m = number of elements in training set;
        f = ‘target’ input-output relationships;
        h = hypothesis (the algorithm’s guess for f made in response to d); and
        C = off-training-set ‘loss’ associated with f and h (‘generalization error’)
        all algorithms are equivalent, on average, by any of the following measures of risk: E(C|d), E(C|m), E(C|f,d), or E(C|f,m).

        How well you do is determined by how ‘aligned’ your learning algorithm P(h|d) is with the actual posterior, P(f|d).
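In this notation, a compact statement of Wolpert’s theorem (my restatement, not a quote from the site) is that for any two learning algorithms $a_1$ and $a_2$,

\[
\sum_{f} P(C \mid f, m, a_1) \;=\; \sum_{f} P(C \mid f, m, a_2)
\]

That is, averaged uniformly over all possible targets $f$, every algorithm has exactly the same expected off-training-set loss; an algorithm only wins when the problems it faces are a restricted, non-uniform subset.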

        Wolpert’s result, in essence, formalizes Hume, extends him and calls the whole of science into question.

        Bing Ad

        Make Bing your decision engine

        Google Ad

        _null_

        From http://en.wikipedia.org/wiki/Hill_climbing

        Hill climbing is a mathematical optimization technique which belongs to the family of local search. It is relatively simple to implement, making it a popular first choice. Although more advanced algorithms may give better results, in some situations hill climbing works just as well.

        Hill climbing can be used to solve problems that have many solutions, some of which are better than others. It starts with a random (potentially poor) solution, and iteratively makes small changes to the solution, each time improving it a little. When the algorithm cannot see any improvement anymore, it terminates. Ideally, at that point the current solution is close to optimal, but it is not guaranteed that hill climbing will ever come close to the optimal solution.

        For example, hill climbing can be applied to the traveling salesman problem. It is easy to find a solution that visits all the cities but will be very poor compared to the optimal solution. The algorithm starts with such a solution and makes small improvements to it, such as switching the order in which two cities are visited. Eventually, a much better route is obtained.

        Hill climbing is used widely in artificial intelligence, for reaching a goal state from a starting node. Choice of next node and starting node can be varied to give a list of related algorithms.
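As a concrete sketch of the traveling-salesman example above (my illustration, not from the Wikipedia article; the random distance matrix is a hypothetical stand-in for real city data), this C# program starts from a random tour and keeps swapping pairs of cities while doing so shortens the tour:

using System;

// Illustrative hill climbing for the traveling salesman problem:
// start from a random tour, and keep applying two-city swaps while
// they shorten the tour. A sketch, not production code.
class HillClimbTsp
{
    static double TourLength(double[,] dist, int[] tour)
    {
        double total = 0;
        for (int i = 0; i < tour.Length; i++)
            total += dist[tour[i], tour[(i + 1) % tour.Length]];
        return total;
    }

    static void Main()
    {
        var rng = new Random(42);
        const int n = 8;

        // Hypothetical symmetric distance matrix standing in for real cities.
        var dist = new double[n, n];
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                dist[i, j] = dist[j, i] = rng.NextDouble() * 100;

        // A random (potentially poor) starting tour.
        var tour = new int[n];
        for (int i = 0; i < n; i++) tour[i] = i;
        for (int i = n - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1);
            (tour[i], tour[j]) = (tour[j], tour[i]);
        }

        double best = TourLength(dist, tour);
        bool improved = true;
        while (improved)                      // stop when no swap improves the tour
        {
            improved = false;
            for (int i = 0; i < n - 1; i++)
                for (int j = i + 1; j < n; j++)
                {
                    (tour[i], tour[j]) = (tour[j], tour[i]);      // try a swap
                    double len = TourLength(dist, tour);
                    if (len < best) { best = len; improved = true; }
                    else (tour[i], tour[j]) = (tour[j], tour[i]); // undo it
                }
        }
        Console.WriteLine($"Local optimum found, tour length {best:F1}");
    }
}

As the quoted article warns, this terminates at a local optimum; there is no guarantee it is anywhere near the globally shortest tour.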

        Bing Ad for Hill Climbing-

        Climbing at Amazon

        Buy books at Amazon.com and save. Qualified orders over $25 ship free

        Amazon.com/books

        Google Ad for Hill Climbing Algorithm

        _null_

        A year after Google’s Kill Bill OS announcements and Ballmer’s let’s-buy-our-way-outta-here, there still seems to be more sense in sticking with Google’s ad algorithms. Unless you want to climb Microsoft’s online hills, only to find there is no free lunch in their ad rates and offers.

        Just like the free and virus-prone browser.

        Window to a Blue Cloud: Azure Pricing

        Citation:

        http://www.microsoft.com/windowsazure/offers/

        Note: I have not technically evaluated it, but the cloud looks good.