Interview David Katz ,Dataspora /David Katz Consulting

Here is an interview with David Katz ,founder of David Katz Consulting (http://www.davidkatzconsulting.com/) and an analyst at the noted firm http://dataspora.com/. He is a featured speaker at Predictive Analytics World  http://www.predictiveanalyticsworld.com/sanfrancisco/2011/speakers.php#katz)

Ajay-  Describe your background working with analytics . How can we make analytics and science more attractive career options for young students

David- I had an interest in math from an early age, spurred by reading lots of science fiction with mathematicians and scientists in leading roles. I was fortunate to be at Harry and David (Fruit of the Month Club) when they were in the forefront of applying multivariate statistics to the challenge of targeting catalogs and other snail-mail offerings. Later I had the opportunity to expand these techniques to the retail sphere with Williams-Sonoma, who grew their retail business with the support of their catalog mailings. Since they had several catalog titles and product lines, cross-selling presented additional analytic challenges, and with the growth of the internet there was still another channel to consider, with its own dynamics.

After helping to found Abacus Direct Marketing, I became an independent consultant, which provided a lot of variety in applying statistics and data mining in a variety of settings from health care to telecom to credit marketing and education.

Students should be exposed to the many roles that analytics plays in modern life, and to the excitement of finding meaningful and useful patterns in the vast profusion of data that is now available.

Ajay-  Describe your most challenging project in 3 decades of experience in this field.

David- Hard to choose just one, but the educational field has been particularly interesting. Partnering with Olympic Behavior Labs, we’ve developed systems to help identify students who are most at-risk for dropping out of school to help target interventions that could prevent dropout and promote success.

Ajay- What do you think are the top 5 trends in analytics for 2011.

David- Big Data, Privacy concerns, quick response to consumer needs, integration of testing and analysis into business processes, social networking data.

Ajay- Do you think techniques like RFM and LTV are adequately utilized by organization. How can they be propagated further.

David- Organizations vary amazingly in how sophisticated or unsophisticated the are in analytics. A key factor in success as a consultant is to understand where each client is on this continuum and how well that serves their needs.

Ajay- What are the various software you have worked for in this field- and name your favorite per category.

David- I started out using COBOL (that dates me!) then concentrated on SAS for many years. More recently R is my favorite because of its coverage, currency and programming model, and it’s debugging capabilities.

Ajay- Independent consulting can be a strenuous job. What do you do to unwind?

David- Cycling, yoga, meditation, hiking and guitar.

Biography-

David Katz, Senior Analyst, Dataspora, and President, David Katz Consulting.

David Katz has been in the forefront of applying statistical models and database technology to marketing problems since 1980. He holds a Master’s Degree in Mathematics from the University of California, Berkeley. He is one of the founders of Abacus Direct Marketing and was previously the Director of Database Development for Williams-Sonoma.

He is the founder and President of David Katz Consulting, specializing in sophisticated statistical services for a variety of applications, with a special focus on the Direct Marketing Industry. David Katz has an extensive background that includes experience in all aspects of direct marketing from data mining, to strategy, to test design and implementation. In addition, he consults on a variety of data mining and statistical applications from public health to collections analysis. He has partnered with consulting firms such as Ernst and Young, Prediction Impact, and most recently on this project with Dataspora.

For more on David’s Session in Predictive Analytics World, San Fransisco on (http://www.predictiveanalyticsworld.com/sanfrancisco/2011/agenda.php#day2-16a)

Room: Salon 5 & 6
4:45pm – 5:05pm

Track 2: Social Data and Telecom 
Case Study: Major North American Telecom
Social Networking Data for Churn Analysis

A North American Telecom found that it had a window into social contacts – who has been calling whom on its network. This data proved to be predictive of churn. Using SQL, and GAM in R, we explored how to use this data to improve the identification of likely churners. We will present many dimensions of the lessons learned on this engagement.

Speaker: David Katz, Senior Analyst, Dataspora, and President, David Katz Consulting

Exhibit Hours
Monday, March 14th:10:00am to 7:30pm

Tuesday, March 15th:9:45am to 4:30pm

An Introduction to Data Mining-online book

I was reading David Smith’s blog http://blog.revolutionanalytics.com/

where he mentioned this interview of Norman Nie, at TDWI

http://tdwi.org/Articles/2010/11/17/R-101.aspx?Page=2

where I saw this link (its great if you want to study Data Mining btw)

http://www.kdnuggets.com/education/usa-canada.html

and I c/liked the U Toronto link

http://chem-eng.utoronto.ca/~datamining/

Best of All- I really liked this online book created by Professor S. Sayad

Its succinct and beautiful and describes all of the Data Mining you want to read in one Map (actually 4 images painstakingly assembled with perfection)

The best thing is- in the original map- even the sub items are click-able for specifics like Pie Chart and Stacked Column chart are not in one simple drop down like Charts- but rather by nature of the kind of variables that lead to these charts. For doing that- you would need to go to the site itself- ( see http://chem-eng.utoronto.ca/~datamining/dmc/categorical_variables.htm

vs

http://chem-eng.utoronto.ca/~datamining/dmc/categorical_numerical.htm

Again- there is no mention of the data visualization software used to create the images but I think I can take a hint from the Software Page which says software used are-

Software

See it on your own-online book (c)Professor S. Sayad

Really good DIY tutorial

http://chem-eng.utoronto.ca/~datamining/dmc/data_mining_map.htm

Who searches for this Blog?

Statue of Michael Jackson in Eindhoven, the Ne...
Image via Wikipedia

Using WP- Stats I set about answering this question-

What search keywords lead here-

Clearly Michael Jackson is down this year

And R GUI, Data Mining is up.

How does that affect my writing- given I get almost 250 visitors by search engines alone daily- assume I write nothing on this blog from now on.

It doesnt- I still write what ever code or poem that comes to my mind. So it is hurtful people misunderstimate the effort in writing and jump to conclusions (esp if I write about a company- I am not on payroll of that company- just like if  I write about a poem- I am not a full time poet)

Over to xkcd

All Time (for Decisionstats.Wordpress.com)

Search Views
libre office 818
facebook analytics 806
michael jackson history 240
wps sas lawsuit 180
r gui 168
wps sas 154
wordle.net 118
sas wps 116
decision stats 110
sas wps lawsuit 100
google maps jet ski 94
data mining 88
doug savage 72
hive tutorial 63
spss certification 63
hadley wickham 63
google maps jetski 62
sas sues wps 60
decisionstats 58
donald farmer microsoft 45
libreoffice 44
wps statistics 44
best statistics software 42
r gui ubuntu 41
rstat 37
tamilnadu advanced technical training institute tatti 37

YTD

2009-11-24 to Today

Search Views
libre office 818
facebook analytics 781
wps sas lawsuit 170
r gui 164
wps sas 125
wordle.net 118
sas wps 101
sas wps lawsuit 95
google maps jet ski 94
data mining 86
decision stats 82
doug savage 63
hadley wickham 63
google maps jetski 62
hive tutorial 56
donald farmer microsoft 45

Nice BI Tutorials

Tutorials screenshot.
Image via Wikipedia

Here is a set of very nice, screenshot enabled tutorials from SAP BI. They are a bit outdated (3 years old) but most of it is quite relevant- especially from a Tutorial Design Perspective –

Most people would rather see screenshot based step by step powerpoints, than cluttered or clever presentations , or even videos that force you to sit like a TV zombie. Unfortunately most tutorial presentations I see especially for BI are either slides with one or two points, that abruptly shift to “concepts” or videos that are atleast more than 10 minutes long. That works fine for scripting tutorials or hands on workshops, but cannot be reproduced for later instances of study.

The mode of tutorials especially for GUI software can vary, it may be Slideshare, Scribd, Google Presentation,Microsoft Powerpoint but a step by step screenshot by screenshot tutorial is much better for understanding than commando line jargon/ Youtub   Videos presentations, or Powerpoint with Points.

Have a look at these SAP BI 7 slideshares

and

Speaking of BI, the R Package called Brew is going to brew up something special especially combined with R Apache. However I wish R Apache, or R Web, or RServe had step by step install screenshot tutorials to increase their usage in Business Intelligence.

I tried searching for JMP GUI Tutorials too, but I believe putting all your content behind a registration wall is not so great. Do a Pareto Analysis of your training material, surely you can share a couple more tutorials without registration. It also will help new wanna-migrate users to get a test and feel for the installation complexities as well as final report GUI.

 

Cloud Computing with R

Illusion of Depth and Space (4/22) - Rotating ...
Image by Dominic's pics via Flickr

Here is a short list of resources and material I put together as starting points for R and Cloud Computing It’s a bit messy but overall should serve quite comprehensively.

Cloud computing is a commonly used expression to imply a generational change in computing from desktop-servers to remote and massive computing connections,shared computers, enabled by high bandwidth across the internet.

As per the National Institute of Standards and Technology Definition,
Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.

(Citation: The NIST Definition of Cloud Computing

Authors: Peter Mell and Tim Grance
Version 15, 10-7-09
National Institute of Standards and Technology, Information Technology Laboratory
http://csrc.nist.gov/groups/SNS/cloud-computing/cloud-def-v15.doc)

R is an integrated suite of software facilities for data manipulation, calculation and graphical display.

From http://cran.r-project.org/doc/FAQ/R-FAQ.html#R-Web-Interfaces

R Web Interfaces

Rweb is developed and maintained by Jeff Banfield. The Rweb Home Page provides access to all three versions of Rweb—a simple text entry form that returns output and graphs, a more sophisticated JavaScript version that provides a multiple window environment, and a set of point and click modules that are useful for introductory statistics courses and require no knowledge of the R language. All of the Rweb versions can analyze Web accessible datasets if a URL is provided.
The paper “Rweb: Web-based Statistical Analysis”, providing a detailed explanation of the different versions of Rweb and an overview of how Rweb works, was published in the Journal of Statistical Software (http://www.jstatsoft.org/v04/i01/).

Ulf Bartel has developed R-Online, a simple on-line programming environment for R which intends to make the first steps in statistical programming with R (especially with time series) as easy as possible. There is no need for a local installation since the only requirement for the user is a JavaScript capable browser. See http://osvisions.com/r-online/ for more information.

Rcgi is a CGI WWW interface to R by MJ Ray. It had the ability to use “embedded code”: you could mix user input and code, allowing the HTMLauthor to do anything from load in data sets to enter most of the commands for users without writing CGI scripts. Graphical output was possible in PostScript or GIF formats and the executed code was presented to the user for revision. However, it is not clear if the project is still active.

Currently, a modified version of Rcgi by Mai Zhou (actually, two versions: one with (bitmap) graphics and one without) as well as the original code are available from http://www.ms.uky.edu/~statweb/.

CGI-based web access to R is also provided at http://hermes.sdu.dk/cgi-bin/go/. There are many additional examples of web interfaces to R which basically allow to submit R code to a remote server, see for example the collection of links available from http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/StatCompCourse.

David Firth has written CGIwithR, an R add-on package available from CRAN. It provides some simple extensions to R to facilitate running R scripts through the CGI interface to a web server, and allows submission of data using both GET and POST methods. It is easily installed using Apache under Linux and in principle should run on any platform that supports R and a web server provided that the installer has the necessary security permissions. David’s paper “CGIwithR: Facilities for Processing Web Forms Using R” was published in the Journal of Statistical Software (http://www.jstatsoft.org/v08/i10/). The package is now maintained by Duncan Temple Lang and has a web page athttp://www.omegahat.org/CGIwithR/.

Rpad, developed and actively maintained by Tom Short, provides a sophisticated environment which combines some of the features of the previous approaches with quite a bit of JavaScript, allowing for a GUI-like behavior (with sortable tables, clickable graphics, editable output), etc.
Jeff Horner is working on the R/Apache Integration Project which embeds the R interpreter inside Apache 2 (and beyond). A tutorial and presentation are available from the project web page at http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/RApacheProject.

Rserve is a project actively developed by Simon Urbanek. It implements a TCP/IP server which allows other programs to use facilities of R. Clients are available from the web site for Java and C++ (and could be written for other languages that support TCP/IP sockets).

OpenStatServer is being developed by a team lead by Greg Warnes; it aims “to provide clean access to computational modules defined in a variety of computational environments (R, SAS, Matlab, etc) via a single well-defined client interface” and to turn computational services into web services.

Two projects use PHP to provide a web interface to R. R_PHP_Online by Steve Chen (though it is unclear if this project is still active) is somewhat similar to the above Rcgi and Rweb. R-php is actively developed by Alfredo Pontillo and Angelo Mineo and provides both a web interface to R and a set of pre-specified analyses that need no R code input.

webbioc is “an integrated web interface for doing microarray analysis using several of the Bioconductor packages” and is designed to be installed at local sites as a shared computing resource.

Rwui is a web application to create user-friendly web interfaces for R scripts. All code for the web interface is created automatically. There is no need for the user to do any extra scripting or learn any new scripting techniques. Rwui can also be found at http://rwui.cryst.bbk.ac.uk.

Finally, the R.rsp package by Henrik Bengtsson introduces “R Server Pages”. Analogous to Java Server Pages, an R server page is typically HTMLwith embedded R code that gets evaluated when the page is requested. The package includes an internal cross-platform HTTP server implemented in Tcl, so provides a good framework for including web-based user interfaces in packages. The approach is similar to the use of the brew package withRapache with the advantage of cross-platform support and easy installation.

Also additional R Cloud Computing Use Cases
http://wwwdev.ebi.ac.uk/Tools/rcloud/

ArrayExpress R/Bioconductor Workbench

Remote access to R/Bioconductor on EBI’s 64-bit Linux Cluster

Start the workbench by downloading the package for your operating system (Macintosh or Windows), or via Java Web Start, and you will get access to an instance of R running on one of EBI’s powerful machines. You can install additional packages, upload your own data, work with graphics and collaborate with colleagues, all as if you are running R locally, but unlimited by your machine’s memory, processor or data storage capacity.

  • Most up-to-date R version built for multicore CPUs
  • Access to all Bioconductor packages
  • Access to our computing infrastructure
  • Fast access to data stored in EBI’s repositories (e.g., public microarray data in ArrayExpress)

Using R Google Docs
http://www.omegahat.org/RGoogleDocs/run.pdf
It uses the XML and RCurl packages and illustrates that it is relatively quick and easy
to use their primitives to interact with Web services.

Using R with Amazon
Citation
http://rgrossman.com/2009/05/17/running-r-on-amazons-ec2/

Amazon’s EC2 is a type of cloud that provides on demand computing infrastructures called an Amazon Machine Images or AMIs. In general, these types of cloud provide several benefits:

  • Simple and convenient to use. An AMI contains your applications, libraries, data and all associated configuration settings. You simply access it. You don’t need to configure it. This applies not only to applications like R, but also can include any third-party data that you require.
  • On-demand availability. AMIs are available over the Internet whenever you need them. You can configure the AMIs yourself without involving the service provider. You don’t need to order any hardware and set it up.
  • Elastic access. With elastic access, you can rapidly provision and access the additional resources you need. Again, no human intervention from the service provider is required. This type of elastic capacity can be used to handle surge requirements when you might need many machines for a short time in order to complete a computation.
  • Pay per use. The cost of 1 AMI for 100 hours and 100 AMI for 1 hour is the same. With pay per use pricing, which is sometimes called utility pricing, you simply pay for the resources that you use.

Connecting to R on Amazon EC2- Detailed tutorials
Ubuntu Linux version
https://decisionstats.com/2010/09/25/running-r-on-amazon-ec2/
and Windows R version
https://decisionstats.com/2010/10/02/running-r-on-amazon-ec2-windows/

Connecting R to Data on Google Storage and Computing on Google Prediction API
https://github.com/onertipaday/predictionapirwrapper
R wrapper for working with Google Prediction API

This package consists in a bunch of functions allowing the user to test Google Prediction API from R.
It requires the user to have access to both Google Storage for Developers and Google Prediction API:
see
http://code.google.com/apis/storage/ and http://code.google.com/apis/predict/ for details.

Example usage:

#This example requires you had previously created a bucket named data_language on your Google Storage and you had uploaded a CSV file named language_id.txt (your data) into this bucket – see for details
library(predictionapirwrapper)

and Elastic R for Cloud Computing
http://user2010.org/tutorials/Chine.html

Abstract

Elastic-R is a new portal built using the Biocep-R platform. It enables statisticians, computational scientists, financial analysts, educators and students to use cloud resources seamlessly; to work with R engines and use their full capabilities from within simple browsers; to collaborate, share and reuse functions, algorithms, user interfaces, R sessions, servers; and to perform elastic distributed computing with any number of virtual machines to solve computationally intensive problems.
Also see Karim Chine’s http://biocep-distrib.r-forge.r-project.org/

R for Salesforce.com

At the point of writing this, there seem to be zero R based apps on Salesforce.com This could be a big opportunity for developers as both Apex and R have similar structures Developers could write free code in R and charge for their translated version in Apex on Salesforce.com

Force.com and Salesforce have many (1009) apps at
http://sites.force.com/appexchange/home for cloud computing for
businesses, but very few forecasting and statistical simulation apps.

Example of Monte Carlo based app is here
http://sites.force.com/appexchange/listingDetail?listingId=a0N300000016cT9EAI#

These are like iPhone apps except meant for business purposes (I am
unaware if any university is offering salesforce.com integration
though google apps and amazon related research seems to be on)

Force.com uses a language called Apex  and you can see
http://wiki.developerforce.com/index.php/App_Logic and
http://wiki.developerforce.com/index.php/An_Introduction_to_Formulas
Apex is similar to R in that is OOPs

SAS Institute has an existing product for taking in Salesforce.com data.

A new SAS data surveyor is
available to access data from the Customer Relationship Management
(CRM) software vendor Salesforce.com. at
http://support.sas.com/documentation/cdl/en/whatsnew/62580/HTML/default/viewer.htm#datasurveyorwhatsnew902.htm)

Personal Note-Mentioning SAS in an email to a R list is a big no-no in terms of getting a response and love. Same for being careless about which R help list to email (like R devel or R packages or R help)

For python based cloud see http://pi-cloud.com

Search, Sports,Social Media,SlideShares, Scribd

An image of a house fly eye surface by using S...
Image via Wikipedia

Some slideshare.net presentations I really liked.

A tutorial on SEO and SEM-

Carole Ann Matignon deals with optimization and scheduling, rules in the…….NFL!

 

 

Carole, We are waiting for the sequel on  analytics on football and the beer game.

Social Media Screw-Ups

Social Media doesnt matter at all- Social Media matters a lot- Still undecided? Take a look

Slideshare is a great VISUAL interface on sharing content. I liked Google Docs embedding as well, but Matt Mullenberg and Matt Cutts seemed to have stopped talking. Mullenberg is going like Zuckenberg, not willing to align with Sergey Mikhaylovich Brin. or maybe they are afraid of Big Brother Brin. Google loves Java and Javascript (even when they are getting sued for it)- while Matt M  hates it- bad for RIA I guess.

Scribd also is a great way to share content- and probably is small enough for. WordPress.com to allow embedding

Thats the reason why I sometimes prefer Scribd for sharing my poetry to Slideshare and Google Docs. Also I like the enhanced analytics and the much easier and evolved interface for reading. Slideshare is much more successful than Scribd because it is open to sharing with everyone- scribd tries to get you to register …;)

(* Also see MIT’s beer game at http://beergame.mit.edu/ which is ahem different from Duke’s beer games).

 

 

BI Software

Here is the brand new release from Jaspersoft at a groovy price of 9000$. Somebody stop these guys!

It’s a great company to watch for buyouts as well- given their expertise in REPORTING and clientele- especially for anyone looking to im prove thier standing in both open source world and reporting software branding.

From AOL owned Arrogantion’s site http://www.crunchbase.com/company/jaspersoft

 

Total $24.5M
Series D, 8/07 1
Scale Venture Partners
SAP Ventures
Doll Capital Management
Partech International
Morgenthaler Ventures
$12M
Unattributed, 12/08 2
Adams Street Partners
Red Hat
Morgenthaler Ventures
Doll Capital Management
Partech International

 

 

The news-

Announcing JasperReports Server Professional

More Resources

Webinar: Introducing JasperReports Server Professional

Thursday October 14

In this live webinar, learn how a new solution from Jaspersoft combines the world’s favorite reporting server with powerful, mature report server functionality—for about 80% less.

  • Date: Thu, Oct 14
  • Time: 10:00 AM PDT
  • Duration: 60 minutes

The World’s Most Powerful and Affordable Reporting Server

Limited Time Introductory Offer: Starting from $9,000 (restrictions apply)

JasperReports Server is the recommended product for organizations requiring an affordable reporting solution for interactive, operational, and production-based reporting. Deployed as a standalone reporting server or integrated inside another application, JasperReports Server is a flexible, powerful, interactive reporting environment for small or large enterprises.

Powered by the world’s most popular reporting tools in JasperReports and iReport, developers and users can take advantage of more interactivity, security, and scheduling of their reports.

Key Benefits:

  • Affordable: Unlimited reports for unlimited users starting at $9,000
  • Powerful: Report scheduling and distribution to 1,000s of users on a single server
  • Flexible: Web service architecture simplifies application integration
  • Secure: Centralized repository authenticates report access
  • Interactive: Easy to interact, self-serve parameterized-based reports
  • Visual appeal: Flash-based charts and maps engage users and enhance applications
  • Open: Access to any data source including relational, XML, Hibernate, EJB, POJO, and custom

 

Speaking of videos -here is a great video on BI from good ol Tennessee-a great 27 min tutorial on BI for newbies