PAW’s San Francisco 2011 program is the richest and most diverse yet, with over 30 sessions across two tracks – an “All Audiences” track and an “Expert/Practitioner” track – so you can witness how predictive analytics is applied at Bank of America, Bank of the West, Best Buy, CA State Automobile Association, Cerebellum Capital, Chessmetrics, Fidelity, Gaia Interactive, GE Capital, Google, HealthMedia, Hewlett-Packard, ICICI Bank (India), MetLife, Monster.com, Orbitz, PayPal/eBay, the Richmond, VA Police Department, the University of Melbourne, Yahoo!, the YMCA, and a major North American telecom, plus insights from projects for Anheuser-Busch, the SSA, and Netflix.
PAW’s agenda covers hot topics and advanced methods such as uplift modeling (net lift), ensemble models, social data (six sessions on this), search marketing, crowdsourcing, black-box trading, fraud detection, risk management, survey analysis, and other innovative applications that benefit organizations in new and creative ways.
Predictive Analytics World is the only conference of its kind, delivering vendor-neutral sessions across verticals such as banking, financial services, e-commerce, education, government, healthcare, high technology, insurance, non-profits, publishing, social gaming, retail, and telecommunications.
And PAW covers the gamut of commercial applications of predictive analytics, including response modeling, customer retention with churn modeling, product recommendations, fraud detection, online marketing optimization, human resource decision-making, law enforcement, sales forecasting, and credit scoring.
WORKSHOPS. PAW also features pre- and post-conference workshops that complement the core conference program. Workshop agendas include advanced predictive modeling methods, hands-on training and enterprise decision management.
OK, I promised a weekly cartoon on Friday, but it’s Saturday. Last week we spoofed Larry Ellison, Jim Goodnight and Bill Gates – people who generated billions in taxes for the economy but would be regarded as evil by some open source guys – though they may have created more jobs for more families than the whole Federal Reserve Bank did in 2008–10. Jobs are necessary for families. Period.
How many Facebook accounts actually belong to one unique customer?
Does 500 million human beings as Facebook customers sound like too many duplicates? (And how much more could you get from the Chinese market? FB is semi-censored there.)
Is the Facebook response rate on ads statistically the same as response rates on websites, on emails, or on spam?
Why is my Facebook account data, which I am apparently free to download, one big huge 130 MB file rather than chunks of small files I can download?
Why can’t Facebook use URL shorteners for photo links? (Ever seen those tiny-fonted, very long URLs below each photo?)
How come Facebook uses so much R (including making the jjplot package) but won’t sponsor a summer of code contest (unlike Google)? 100 million for schools and two blog posts for R? And how much money for putting e-education content and games on Facebook?
Will Facebook ever create an in-house game? Did Google put money into Zynga (FB’s top game partner) because it likes games 🙂? How dependent is FB on Zynga anyway?
So many questions… so little time.
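The response-rate question above can be made concrete: whether two channels’ response rates are “statistically the same” is a standard two-proportion z-test. Here is a minimal sketch in Python; the click and view counts are made-up numbers for illustration, not real Facebook data.

```python
import math

def two_proportion_z(clicks_a, views_a, clicks_b, views_b):
    """Two-proportion z-test: is rate A statistically different from rate B?"""
    p_a = clicks_a / views_a
    p_b = clicks_b / views_b
    # Pooled proportion under the null hypothesis that the two rates are equal
    p = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(p * (1 - p) * (1 / views_a + 1 / views_b))
    return (p_a - p_b) / se

# Hypothetical numbers: 50 clicks per 100,000 ad views vs 70 per 100,000
z = two_proportion_z(50, 100000, 70, 100000)
print(round(z, 2))  # |z| > 1.96 would reject "same rate" at the 5% level
```

With samples this large, even small absolute differences in click-through rate can be statistically significant, which is exactly why the question is worth asking channel by channel.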
Here is a short list of resources and material I put together as starting points for R and cloud computing. It’s a bit messy, but overall it should serve quite comprehensively.
Cloud computing is a commonly used expression for a generational change in computing: from desktops and local servers to remote, massive, shared computing resources, enabled by high bandwidth across the internet.
As per the National Institute of Standards and Technology Definition,
Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.
The paper “Rweb: Web-based Statistical Analysis”, providing a detailed explanation of the different versions of Rweb and an overview of how Rweb works, was published in the Journal of Statistical Software (http://www.jstatsoft.org/v04/i01/).
Rcgi is a CGI WWW interface to R by MJ Ray. It had the ability to use “embedded code”: you could mix user input and code, allowing the HTML author to do anything from load in data sets to enter most of the commands for users without writing CGI scripts. Graphical output was possible in PostScript or GIF formats, and the executed code was presented to the user for revision. However, it is not clear if the project is still active.
Currently, a modified version of Rcgi by Mai Zhou (actually, two versions: one with (bitmap) graphics and one without) as well as the original code are available from http://www.ms.uky.edu/~statweb/.
David Firth has written CGIwithR, an R add-on package available from CRAN. It provides some simple extensions to R to facilitate running R scripts through the CGI interface to a web server, and allows submission of data using both GET and POST methods. It is easily installed using Apache under Linux and in principle should run on any platform that supports R and a web server, provided that the installer has the necessary security permissions. David’s paper “CGIwithR: Facilities for Processing Web Forms Using R” was published in the Journal of Statistical Software (http://www.jstatsoft.org/v08/i10/). The package is now maintained by Duncan Temple Lang and has a web page at http://www.omegahat.org/CGIwithR/.
Jeff Horner is working on the R/Apache Integration Project which embeds the R interpreter inside Apache 2 (and beyond). A tutorial and presentation are available from the project web page at http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/RApacheProject.
Rserve is a project actively developed by Simon Urbanek. It implements a TCP/IP server which allows other programs to use facilities of R. Clients are available from the web site for Java and C++ (and could be written for other languages that support TCP/IP sockets).
OpenStatServer is being developed by a team led by Greg Warnes; it aims “to provide clean access to computational modules defined in a variety of computational environments (R, SAS, Matlab, etc) via a single well-defined client interface” and to turn computational services into web services.
Two projects use PHP to provide a web interface to R. R_PHP_Online by Steve Chen (though it is unclear if this project is still active) is somewhat similar to the above Rcgi and Rweb. R-php is actively developed by Alfredo Pontillo and Angelo Mineo and provides both a web interface to R and a set of pre-specified analyses that need no R code input.
webbioc is “an integrated web interface for doing microarray analysis using several of the Bioconductor packages” and is designed to be installed at local sites as a shared computing resource.
Rwui is a web application to create user-friendly web interfaces for R scripts. All code for the web interface is created automatically. There is no need for the user to do any extra scripting or learn any new scripting techniques. Rwui can also be found at http://rwui.cryst.bbk.ac.uk.
Finally, the R.rsp package by Henrik Bengtsson introduces “R Server Pages”. Analogous to Java Server Pages, an R server page is typically HTML with embedded R code that gets evaluated when the page is requested. The package includes an internal cross-platform HTTP server implemented in Tcl, and so provides a good framework for including web-based user interfaces in packages. The approach is similar to the use of the brew package with Rapache, with the advantage of cross-platform support and easy installation.
Remote access to R/Bioconductor on EBI’s 64-bit Linux Cluster
Start the workbench by downloading the package for your operating system (Macintosh or Windows), or via Java Web Start, and you will get access to an instance of R running on one of EBI’s powerful machines. You can install additional packages, upload your own data, work with graphics and collaborate with colleagues, all as if you are running R locally, but unlimited by your machine’s memory, processor or data storage capacity.
Most up-to-date R version built for multicore CPUs
Access to all Bioconductor packages
Access to our computing infrastructure
Fast access to data stored in EBI’s repositories (e.g., public microarray data in ArrayExpress)
Amazon’s EC2 is a type of cloud that provides on-demand computing infrastructure through Amazon Machine Images, or AMIs. In general, these types of cloud provide several benefits:
Simple and convenient to use. An AMI contains your applications, libraries, data and all associated configuration settings. You simply access it. You don’t need to configure it. This applies not only to applications like R, but also can include any third-party data that you require.
On-demand availability. AMIs are available over the Internet whenever you need them. You can configure the AMIs yourself without involving the service provider. You don’t need to order any hardware and set it up.
Elastic access. With elastic access, you can rapidly provision and access the additional resources you need. Again, no human intervention from the service provider is required. This type of elastic capacity can be used to handle surge requirements when you might need many machines for a short time in order to complete a computation.
Pay per use. The cost of 1 AMI for 100 hours and 100 AMI for 1 hour is the same. With pay per use pricing, which is sometimes called utility pricing, you simply pay for the resources that you use.
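The pay-per-use point above is worth checking with simple arithmetic: under utility pricing, cost depends only on total instance-hours, so parallelizing a job buys you speed for free. A tiny sketch in Python, using a made-up hourly rate rather than a real EC2 price:

```python
def total_cost(instances, hours, hourly_rate):
    """Utility pricing: you pay per instance-hour, nothing else."""
    return instances * hours * hourly_rate

RATE = 0.10  # hypothetical dollars per instance-hour, not a real EC2 price

one_big_job = total_cost(1, 100, RATE)    # 1 AMI running for 100 hours
one_fast_job = total_cost(100, 1, RATE)   # 100 AMIs running for 1 hour
print(one_big_job, one_fast_job)          # same cost, 100x faster turnaround
```

Both calls return 10.0: the bill is identical, but the second configuration finishes a parallelizable computation a hundred times sooner.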
# This example requires that you have previously created a bucket named data_language in your Google Storage account and uploaded a CSV file named language_id.txt (your data) into this bucket – see for details
Elastic-R is a new portal built using the Biocep-R platform. It enables statisticians, computational scientists, financial analysts, educators and students to use cloud resources seamlessly; to work with R engines and use their full capabilities from within simple browsers; to collaborate, share and reuse functions, algorithms, user interfaces, R sessions, servers; and to perform elastic distributed computing with any number of virtual machines to solve computationally intensive problems.
Also see Karim Chine’s http://biocep-distrib.r-forge.r-project.org/
R for Salesforce.com
At the time of writing, there seem to be zero R-based apps on Salesforce.com. This could be a big opportunity for developers, as both Apex and R have similar structures. Developers could write free code in R and charge for their translated version in Apex on Salesforce.com.
Personal note: mentioning SAS in an email to an R list is a big no-no in terms of getting a response and love. The same goes for being careless about which R list to email (like R-devel, R-packages, or R-help).
“As a high tech company, SAS depends on a strong educational system for its long-term success,” said SAS CEO Jim Goodnight. “Beyond that, STEM education – developing skills for a knowledge economy – is critical to American competitiveness. Without emphasis on STEM, we sacrifice innovation and export our knowledge jobs to other countries.”
Goodnight and SAS have been active in education for years. The SAS co-founder and his wife, Ann Goodnight, launched college prep school Cary Academy in 1996, and the SAS inSchool program has developed educational software for schools since the mid-1990s. In 2008, Jim Goodnight made SAS Curriculum Pathways available free to all U.S. educators. The web-based service provides content in English, mathematics, social studies, science and Spanish.
SAS is the only Triangle-based company among the Change the Equation corporate partners, but the group includes several other companies with a significant Raleigh-Durham presence: chief among them IBM (NYSE: IBM), GlaxoSmithKline (NYSE: GSK), and Cisco Systems (Nasdaq: CSCO).
Here is a new book on learning MapReduce, and it has a free downloadable version as well.
Data-Intensive Text Processing with MapReduce
Jimmy Lin and Chris Dyer
Our world is being revolutionized by data-driven methods: access to large amounts of data has generated new insights and opened exciting new opportunities in commerce, science, and computing applications. Processing the enormous quantities of data necessary for these advances requires large clusters, making distributed computing paradigms more crucial than ever. MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. The programming model provides an easy-to-understand abstraction for designing scalable algorithms, while the execution framework transparently handles many system-level details, ranging from scheduling to synchronization to fault tolerance. This book focuses on MapReduce algorithm design, with an emphasis on text processing algorithms common in natural language processing, information retrieval, and machine learning. We introduce the notion of MapReduce design patterns, which represent general reusable solutions to commonly occurring problems across a variety of problem domains. This book not only intends to help the reader “think in MapReduce”, but also discusses limitations of the programming model as well.
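To give a flavour of the programming model the book teaches, here is a minimal single-machine sketch of the canonical MapReduce word count: a mapper emits (word, 1) pairs, the framework groups pairs by key (the shuffle), and a reducer sums the counts per word. This is an illustration of the model only, not code from the book, which targets real clusters (Hadoop):

```python
from itertools import groupby
from operator import itemgetter

def mapper(document):
    # Map: emit a (word, 1) pair for every word in the document
    for word in document.lower().split():
        yield (word, 1)

def reducer(word, counts):
    # Reduce: sum all the counts emitted for one word
    return (word, sum(counts))

def map_reduce(documents):
    # Shuffle: sort and group intermediate pairs by key,
    # as the execution framework would do transparently
    pairs = sorted(pair for doc in documents for pair in mapper(doc))
    return dict(reducer(word, (c for _, c in group))
                for word, group in groupby(pairs, key=itemgetter(0)))

docs = ["the cat sat", "the dog sat on the mat"]
print(map_reduce(docs))  # {'cat': 1, 'dog': 1, 'mat': 1, 'on': 1, 'sat': 2, 'the': 3}
```

On a real cluster the mapper and reducer are the only parts you write; scheduling, data movement, and fault tolerance are handled by the framework, which is exactly the abstraction the book explores.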