Google Code Devfest – in Asia

An interesting series of conferences in Asia, courtesy of Google Code:

http://googlecode.blogspot.com/2010/09/devfest-asia-pacific-tour-registrations.html

Data Mining 2010: SAS Conference in Vegas

An interesting conference that I attended last year; this year one of the main guests is a former professor of mine at UTenn. I am India-bound this year, though, for family reasons.

http://www.sas.com/events/dmconf/over.html

Latest News

Early Bird Special
Register for M2010 before Sept. 17 and save $200 on conference fees!

Additional Data Mining Resources
Find additional data mining resources, including links to whitepapers, webinars, audio seminars, videos, blogs and online communities.

Location
Caesars Palace
Las Vegas, NV

Conference: October 25-26
Pre-conference workshops: October 24
Post-conference training: October 27-29

The M2010 Data Mining Conference is an international educational conference and exhibition for data mining practitioners, including analysts, statisticians, programmers, consultants and anyone involved with data management within their organization. Hosted by SAS, M2010 is now in its 13th year and has become the world’s largest data mining conference, attracting over 600 people from various industries including Financial Services, Retail, Insurance, Technology, Education, Healthcare, Pharmaceutical, Government and more.

This conference is the top choice for serious education and career networking. Conference highlights include:

  • 6 keynotes
  • 36 sessions
  • 6 session tracks
  • exhibit hall
  • poster session
  • SAS software training
  • educational workshops
  • special events
  • networking opportunities
  • predictive modeling certification testing event.

Session Topics

  • Business applications
  • Data augmentation
  • Perspectives from the financial services industry
  • Fraud detection
  • Perspectives from the healthcare industry
  • New and emerging technologies
  • Perspectives from the retail industry
  • Data mining in marketing
  • Retention and Life Cycle Analysis
  • Text mining
  • And more! (View session abstracts.)

Event: Predictive analytics with R, PMML and ADAPA

From http://www.meetup.com/R-Users/calendar/14405407/

The September meeting is at the Oracle campus. (This is next door to the Oracle towers, so there is plenty of free parking.) The featured talk is from Alex Guazzelli (Vice President – Analytics, Zementis Inc.) who will talk about “Predictive analytics with R, PMML and ADAPA”.

Agenda:
* 6:15 – 7:00 Networking and Pizza (with thanks to Revolution Analytics)
* 7:00 – 8:00 Talk: Predictive analytics with R, PMML and ADAPA
* 8:00 – 8:30 General discussion

Talk overview:

The rule in the past was that whenever a model was built in a particular development environment, it remained in that environment forever, unless it was manually recoded to work somewhere else. This rule has been shattered with the advent of PMML (the Predictive Model Markup Language). By providing a uniform standard to represent predictive models, PMML allows for the exchange of predictive solutions between different applications and various vendors.

Once exported as PMML files, models are readily available for deployment into an execution engine for scoring or classification. ADAPA is one example of such an engine. It takes in models expressed in PMML and transforms them into web-services. Models can be executed either remotely by using web-services calls, or via a web console. Users can also use an Excel add-in to score data from inside Excel using models built in R.

R models have been exported into PMML and uploaded in ADAPA for many different purposes. Use cases where clients have used the flexibility of R to develop and the PMML standard combined with ADAPA to deploy range from financial applications (e.g., risk, compliance, fraud) to energy applications for the smart grid. The ability to easily transition solutions developed in R to the operational IT production environment helps eliminate the traditional limitations of R, e.g. performance for high volume or real-time transactional systems and memory constraints associated with large data sets.
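To make the interchange idea concrete, here is a rough sketch of what an exported model can look like on the wire: a tiny linear regression expressed as a PMML document. The element names follow the PMML standard, but the field names and coefficients here are hypothetical, not taken from any real model discussed above:

```xml
<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0">
  <Header copyright="example" description="Hypothetical linear regression exported from R"/>
  <DataDictionary numberOfFields="2">
    <DataField name="income" optype="continuous" dataType="double"/>
    <DataField name="risk_score" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel modelName="risk_model" functionName="regression">
    <MiningSchema>
      <MiningField name="income"/>
      <MiningField name="risk_score" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="0.5">
      <NumericPredictor name="income" coefficient="0.03"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
```

Because the file is plain XML, any PMML-aware scoring engine (ADAPA being one example) can load it and score new records without knowing which tool produced the model.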

Speaker Bio:

Dr. Alex Guazzelli has co-authored the first book on PMML, the Predictive Model Markup Language, which is the de facto standard used to represent predictive models. The book, entitled PMML in Action: Unleashing the Power of Open Standards for Data Mining and Predictive Analytics, is available on Amazon.com. As the Vice President of Analytics at Zementis, Inc., Dr. Guazzelli is responsible for developing core technology and analytical solutions under ADAPA, a PMML-based predictive decisioning platform that combines predictive analytics and business rules. ADAPA is the first system of its kind to be offered as a service on the cloud.
Prior to joining Zementis, Dr. Guazzelli was involved in not only building but also deploying predictive solutions for large financial and telecommunication institutions around the globe. In academia, Dr. Guazzelli worked with data mining, neural networks, expert systems and brain theory. His work in brain theory and computational neuroscience has appeared in many peer reviewed publications. At Zementis, Dr. Guazzelli and his team have been involved in a myriad of modeling projects for financial, health-care, gaming, chemical, and manufacturing industries.

Dr. Guazzelli holds a Ph.D. in Computer Science from the University of Southern California and an M.S. and B.S. in Computer Science from the Federal University of Rio Grande do Sul, Brazil.

Dryad: Microsoft's answer to MapReduce

While reading across the internet, I came across Microsoft’s answer to MapReduce, called Dryad, which has been around for some time but has not generated quite the buzz that Hadoop and MapReduce have.

http://research.microsoft.com/en-us/projects/dryadlinq/

DryadLINQ

DryadLINQ is a simple, powerful, and elegant programming environment for writing large-scale data parallel applications running on large PC clusters.

Overview

New! An academic release of Dryad/DryadLINQ is now available for public download.

The goal of DryadLINQ is to make distributed computing on large compute clusters simple enough for every programmer. DryadLINQ combines two important pieces of Microsoft technology: the Dryad distributed execution engine and the .NET Language Integrated Query (LINQ).

Dryad provides reliable, distributed computing on thousands of servers for large-scale data parallel applications. LINQ enables developers to write and debug their applications in a SQL-like query language, relying on the entire .NET library and using Visual Studio.

DryadLINQ translates LINQ programs into distributed Dryad computations:

  • C# and LINQ data objects become distributed partitioned files.
  • LINQ queries become distributed Dryad jobs.
  • C# methods become code running on the vertices of a Dryad job.

DryadLINQ has the following features:

  • Declarative programming: computations are expressed in a high-level language similar to SQL
  • Automatic parallelization: from sequential declarative code, the DryadLINQ compiler generates highly parallel query plans spanning large computer clusters. For exploiting multi-core parallelism on each machine, DryadLINQ relies on the PLINQ parallelization framework.
  • Integration with Visual Studio: programmers in DryadLINQ take advantage of the comprehensive VS set of tools: Intellisense, code refactoring, integrated debugging, build, source code management.
  • Integration with .NET: all .NET libraries, including Visual Basic and dynamic languages, are available.
  • Conciseness: the following code is a complete implementation of the Map-Reduce computation framework in DryadLINQ:
      public static IQueryable<R> MapReduce<S,M,K,R>(
          this IQueryable<S> source,
          Expression<Func<S,IEnumerable<M>>> mapper,
          Expression<Func<M,K>> keySelector,
          Expression<Func<K,IEnumerable<M>,R>> reducer)
      {
          return source.SelectMany(mapper).GroupBy(keySelector, reducer);
      }
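For comparison, the same SelectMany/GroupBy shape can be sketched as ordinary single-machine Python. This is my own local analogue for illustration only; DryadLINQ would distribute each phase across a cluster:

```python
from itertools import groupby

def map_reduce(source, mapper, key_selector, reducer):
    # Map phase: flatten the mapper's output over all inputs (like SelectMany)
    mapped = [m for s in source for m in mapper(s)]
    # Group phase: sort then group by key (like GroupBy)
    mapped.sort(key=key_selector)
    # Reduce phase: apply the reducer to each (key, group) pair
    return [reducer(k, list(g)) for k, g in groupby(mapped, key=key_selector)]

# Word count, the canonical map-reduce example
lines = ["a b a", "b c"]
counts = map_reduce(lines,
                    mapper=lambda line: line.split(),
                    key_selector=lambda word: word,
                    reducer=lambda key, words: (key, len(words)))
# counts -> [("a", 2), ("b", 2), ("c", 1)]
```

The three lambdas correspond directly to the `mapper`, `keySelector` and `reducer` expression parameters in the C# signature above.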

    See also http://research.microsoft.com/en-us/projects/dryad/

    Dryad

    The Dryad Project is investigating programming models for writing parallel and distributed programs to scale from a small cluster to a large data-center.

    Overview

    New! An academic release of DryadLINQ is now available for public download.

    Dryad is an infrastructure which allows a programmer to use the resources of a computer cluster or a data center for running data-parallel programs. A Dryad programmer can use thousands of machines, each of them with multiple processors or cores, without knowing anything about concurrent programming.

    The Structure of Dryad Jobs

    A Dryad programmer writes several sequential programs and connects them using one-way channels. The computation is structured as a directed graph: programs are graph vertices, while the channels are graph edges. A Dryad job is a graph generator which can synthesize any directed acyclic graph. These graphs can even change during execution, in response to important events in the computation.

    Dryad is quite expressive. It completely subsumes other computation frameworks, such as Google’s map-reduce, or the relational algebra. Moreover, Dryad handles job creation and management, resource management, job monitoring and visualization, fault tolerance, re-execution, scheduling, and accounting.

    The Dryad Software Stack

    As a proof of Dryad’s versatility, a rich software ecosystem has been built on top of Dryad:

    • SSIS on Dryad executes many instances of SQL server, each in a separate Dryad vertex, taking advantage of Dryad’s fault tolerance and scheduling. This system is currently deployed in a live production system as part of one of Microsoft’s AdCenter log processing pipelines.
    • DryadLINQ generates Dryad computations from the LINQ Language-Integrated Query extensions to C#.
    • The distributed shell is a generalization of the pipe concept from the Unix shell. Where Unix pipes allow the construction of one-dimensional (1-D) process structures, the distributed shell allows the programmer to build 2-D structures in a scripting language. It generalizes Unix pipes in three ways:
      1. It allows each process to connect multiple file descriptors to other processes; hence the 2-D aspect.
      2. It allows the construction of pipes spanning multiple machines, across a cluster.
      3. It virtualizes the pipelines, allowing the execution of pipelines with many more processes than available machines, by time-multiplexing processors and buffering results.
    • Several languages are compiled to distributed shell processes. PSQL is an early version, recently replaced with Scope.

    Publications

    Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
    Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly
    European Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21-23, 2007

    Video of a presentation on Dryad at the Google Campus, given by Michael Isard, Nov 1, 2007.

    Also interesting to read-

    Why does Dryad use a DAG?

    The basic computational model we decided to adopt for Dryad is the directed-acyclic graph (DAG). Each node in the graph is a computation, and each edge in the graph is a stream of data traveling in the direction of the edge. The amount of data on any given edge is assumed to be finite, the computations are assumed to be deterministic, and the inputs are assumed to be immutable. This isn’t by any means a new way of structuring a distributed computation (for example Condor had DAGMan long before Dryad came along), but it seemed like a sweet spot in the design space given our other constraints.

    So, why is this a sweet spot? A DAG is very convenient because it induces an ordering on the nodes in the graph. That makes it easy to design scheduling policies, since you can define a node to be ready when its inputs are available, and at any time you can choose to schedule as many ready nodes as you like in whatever order you like, and as long as you always have at least one scheduled you will continue to make progress and never deadlock. It also makes fault-tolerance easy, since given our determinism and immutability assumptions you can backtrack as far as you want in the DAG and re-execute as many nodes as you like to regenerate intermediate data that has been lost or is unavailable due to cluster failures.

    from

    http://blogs.msdn.com/b/dryad/archive/2010/07/23/why-does-dryad-use-a-dag.aspx
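The ready-node scheduling policy described in that excerpt can be sketched in a few lines of Python (my own illustration, not Dryad code): a vertex becomes ready once all of its inputs have completed, and executing ready vertices in any order yields a valid schedule with no deadlock.

```python
from collections import defaultdict, deque

def run_dag(nodes, edges):
    """Execute a DAG of tasks: a node is ready once all of its inputs have run."""
    indegree = {n: 0 for n in nodes}
    children = defaultdict(list)
    for src, dst in edges:
        indegree[dst] += 1
        children[src].append(dst)
    ready = deque(n for n in nodes if indegree[n] == 0)  # nodes with no inputs
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)  # "run" the vertex here
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:  # all inputs are now available
                ready.append(child)
    if len(order) != len(nodes):
        raise ValueError("cycle detected: not a DAG")
    return order

# Diamond-shaped job: a feeds b and c, which both feed d
order = run_dag(["a", "b", "c", "d"],
                [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")])
```

Fault tolerance falls out of the same structure: with deterministic vertices and immutable inputs, any lost intermediate output can be regenerated by re-running the sub-graph that produced it.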

      Towards better analytical software

      Here are some thoughts on using existing statistical software for better analytics and/or business intelligence (reporting)-

      1) User Interface Design Matters- Most stats software has a legacy approach to user interface design. Graphical user interfaces need to be more business-friendly and user-friendly; for example, you can call a button "T Test", or you can call it Compare > Means of Samples (with a highlight called T Test). You can call a button "Chi Square Test", or call it Compare > Counts Data. Also, excessive reliance on drop-downs ignores next-generation advances in operating systems, namely touchscreens instead of mouse point-and-click.

      Given that base statistical procedures are the same across software packages, a more thoughtfully designed (or revamped) user interface can give a package an edge over legacy designs.

      2) Branding of Software Matters- One notable complaint against SAS Institute products is their premium price. But that software is actually inexpensive if you compare it with other reporting software. What separates a Cognos from a Crystal Reports or a SAS BI is often branding (and user interface design). Branding plays out in events, and social media is often the least expensive branding and marketing channel. The same goes for WPS and Revolution Analytics.

      3) Alliances matter- The alliances of parent companies are reflected in the sales of bundled software. For a complete solution, you need a database plus reporting plus analytical software. If you are not making all three of the above, you need to partner and cross-sell. Technically, this means that each piece of software (database, reporting or analytics) needs to talk to as many other kinds of software and formats as possible. This is why ODBC support in R is important, and alliances for small companies like Revolution Analytics, WPS and Netezza are just as important as those of bigger companies like IBM SPSS, SAS Institute or SAP. Tie-ins with Hadoop (like R and the Netezza appliance) or between Teradata and SAS also help create better usage.

      4) Cloud Computing Interfaces could be the edge- Maybe cloud computing is all hot air, but prudent business planning demands that any software maker in analytics or business intelligence have an extremely easy-to-load interface, whether a dedicated on-demand website or an Amazon EC2 image. Easier interfaces win, and with the cloud still in its early stages they can help create an early lead. For R software makers this is critical, since R handles larger data sets poorly on a PC in comparison to its counterparts; on the cloud that disadvantage vanishes. An easy-to-understand cloud interface framework is here (it is two years old but should still be okay): http://knol.google.com/k/data-mining-through-cloud-computing#

      5) Platforms matter- Software should either natively embrace all possible platforms or bundle in the middleware itself.

      Here is a case study. SAS stopped supporting the Apple OS after Base SAS 7. Today the Apple OS is strong (3.47 million Macs during the most recent quarter), and the only way to use SAS on a Mac is to do either

      http://goo.gl/QAs2

      or install Ubuntu on the Mac (https://help.ubuntu.com/community/MacBook) and do this

      http://ubuntuforums.org/showthread.php?t=1494027

      Why does this matter? Well, SAS is free to academics and students from this year, but the Mac is a preferred computer there. WPS, meanwhile, can be run straight away on the Mac (though they have curiously not been able to provide academic or discounted student copies 😉), as per

      http://goo.gl/aVKu

      Does this create a disadvantage based on platform? Yes. However, JMP continues to be supported on the Mac. This is also noteworthy given the upcoming Chromium OS by Google and the Windows Azure platform for cloud computing.

      Predictive Analytics World Conference

      A note from the Predictive Analytics World Conference:


      Predictive Analytics World Coming October 19-20, 2010 to Washington, DC

      Dates:        October 19-20, 2010
      Location:   Washington, DC

      Predictive Analytics World (pawcon.com) is the business-focused event for predictive analytics professionals, managers and commercial practitioners, covering today’s commercial deployment of predictive analytics, across industries and across software vendors.  The conference delivers case studies, expertise and resources to achieve two objectives:

      1) Bigger wins:  Strengthen the business impact delivered by predictive analytics

      2) Broader capabilities:  Establish new opportunities with predictive analytics

      The Top Experts

      PAW’s October 2010 program is packed with the top predictive analytics experts, practitioners, authors and business thought leaders, including keynote speakers Piyanka Jain of PayPal, Andrew Pole of Target and Program Chair Eric Siegel, Ph.D. — plus special sessions from industry heavy-weights Usama Fayyad, Ph.D. and John F. Elder, Ph.D.

      Case Studies: How the Leading Enterprises Do It

      Predictive Analytics World focuses on concrete examples of deployed predictive analytics.  Hear from the horse’s mouth precisely how Fortune 500 analytics competitors and other top practitioners deploy predictive modeling, and what kind of business impact it delivers.

      And the leading enterprises have responded, signing up to tell their stories. PAW’s October program includes over 25 sessions across two tracks, “All Audiences” and “Expert/Practitioner”, so you can witness how predictive analytics is applied at

      1-800-FLOWERS, CIBC, Corporate Executive Board, Forrester, Ingram Micro, LifeLine, MetLife, Miles Kimball, Monster, Paychex, PayPal (eBay), SunTrust, Target, UPMC Health Plan, Xerox, and Yahoo!, plus special examples from the U.S. government agencies DoD, DHS, and SSA.

      October’s agenda covers hot topics and advanced methods such as social data, text mining, search marketing, risk management, survey analysis, consumer privacy, sales force optimization and other innovative applications that benefit organizations in new and creative ways.

      Join PAW and access the best keynotes, sessions, exposition, expert panel, live demos, networking coffee breaks, reception, birds-of-a-feather lunches, leading brand-name enterprise leaders, and industry heavyweights in the business.

      Workshops

      Three pre- and post-event workshops complement the core conference program:

      “The Best and the Worst of Predictive Analytics: Predictive Modeling Methods and Common Data Mining Mistakes”
      Instructor:  John F. Elder, Ph.D., CEO and Founder, Elder Research, Inc.
      www.predictiveanalyticsworld.com/dc/2010/predictive_modeling_methods.php

      “Hands-On Predictive Analytics”
      Instructor:  Dean Abbott, President, Abbott Analytics
      www.predictiveanalyticsworld.com/dc/2010/handson_predictive_analytics.php

      “Driving Enterprise Decisions with Business Analytics and Business Rules”
      Instructor:  James Taylor, CEO, Decision Management Solutions
      www.predictiveanalyticsworld.com/dc/2010/predictive_analytics_work.php

      Cross-Industry Applications

      Predictive Analytics World is the only conference of its kind, delivering vendor-neutral sessions across verticals such as banking, financial services, e-commerce, education, government, healthcare, high technology, insurance, non-profits, publishing, retail and telecommunications.

      And PAW covers the gamut of commercial applications of predictive analytics, including response modeling, customer retention with churn modeling, product recommendations, fraud detection, online marketing optimization, behavior-based advertising, insurance pricing, sales forecasting, text mining and credit scoring.

      Why bring together such a wide range of endeavors?  No matter how you use predictive analytics, the story is the same:  Predictively scoring customers optimizes business performance.  Predictive analytics initiatives across industries leverage the same core predictive modeling technology, share similar project overhead and data requirements, and face common process challenges and analytical hurdles.

      Rave Reviews

      “Hands down, best applied analytics conference I have ever attended. Great exposure to cutting-edge predictive techniques and I was able to turn around and apply some of those learnings to my work immediately. I’ve never been able to say that after any conference I’ve attended before!”

      Jon Francis
      Senior Statistician
      T-Mobile

      Read more:  Articles and blog entries about February and October’s PAW can be found at www.predictiveanalyticsworld.com/pressroom.php

      People Who Need People

      Vendors:

      *    Meet the vendors and learn about their solutions, software and services
      *    Discover the best predictive analytics vendors available to serve your needs
      *    Learn what they do and see how they compare

      Colleagues

      *    Mingle, network and hang out with your best and brightest colleagues
      *    Exchange experiences over lunch, coffee breaks and the conference reception connecting with those professionals who face the same challenges as you

      Get Started

      If you’re new to predictive analytics, kicking off a new initiative, or exploring new ways to position it at your organization, there’s no better place to get your bearings than Predictive Analytics World.  See what other companies are doing, witness vendor demos, participate in discussions with the experts, network with your colleagues and weigh your options!

      For more information, see:
      www.predictiveanalyticsworld.com

      For a complete overview of the conference agenda, see:
      www.predictiveanalyticsworld.com/sanfrancisco/2010/agenda_overview.php

      Be sure to register by September 10th for the Early Bird rate (save $200):
      www.predictiveanalyticsworld.com/register.php

      Save more with this posting’s promotional offer:  Take an additional $150 off the Early Bird – a total savings of $350 – or the regular registration fee with this registration discount code: AOH150

      What is predictive analytics?  See the Predictive Analytics Guide:
      www.predictiveanalyticsworld.com/predictive_analytics.php

      If you’d like our informative event updates, sign up at:
      www.predictiveanalyticsworld.com/notifications.php

      To sign up for the PAW group on LinkedIn, see:
      www.linkedin.com/e/gis/1005097

      For inquiries e-mail registration@predictiveanalyticsworld.com or call (717) 798-3495.