Rattle Re-Introduced

Latest version of Rattle just went online-

Here is the change log- Dr Graham Williams is also coming out with a book on using Rattle- the R GUI devoted to data mining.

Source-http://cran.r-project.org/web/packages/rattle/index.html

rattle (2.5.42) unstable; urgency=low

  * Update rattle.info() to recursively identify all dependencies,
 report
    their version number and any updates available from CRAN and generate
    command to update packages that have updates available. See
    ?rattle.info for the options.

  * Fix bug causing R Dataset option of the Evaluate window to always
    revert to the first named dataset.

  * Fix bug in transforms where weights were not being handled in
    refreshing of the Data tab.

  * Fix a bug in box plots when trying to label outliers when there aren't
    any.

 -- Graham Williams <Graham.Williams@togaware.com>  Sun, 
19 Sep 2010 05:01:51 +1000

rattle (2.5.41) unstable; urgency=low

  * Use GtkBuilder for Export dialog.

  * Test use of glade vs GtkBuilder on multiple platforms.

  * Rename rattle.info to rattle.version.

  * Add weight column to data tab.

  * Support weights for nnet, multinom, survival.

  * Add weights information to PMML as a PMML Extension.

  * Ensure GtkFrame is available as a data type whilst waiting for 
updated
    RGtk2.

  * Bug fix to packageIsAvailable not reruning any result.

  * Replace destroy with withdraw for plot window as the former has
    started crashing R.

  * Improve Log formatting for various model build commands.

  * Be sure to include the car package for Anova for multinom models.

  * Release pmml 1.2.24: Bug fix glm binomial regression - note as
    classification model.

 -- Graham Williams <Graham.Williams@togaware.com>  Wed, 15 Sep 2010 
14:56:09 +1000
And a video I did of exploring various Rattle options using Camtasia,
 a very useful software for screen capture and video tutorials
from http://www.techsmith.com/download/camtasiatrial.asp
Updated- my video skils being quite bad- I replaced it with another video. 
However Camtasia is the best screen capture video tool
Also , an update Analyticdroid is on hold for now. see- for more details http://rattle.togaware.com/

Graphics Presentations

Here are some Wow Presentations on Design and User Interfaces and  Graphics (including R)- you may have seen some before.

From Dataspora-

A Survey of R Graphics

R Graphics using GGPlot

King of all R graphics-Hadley Wickham

and a rather clever Graphics User Interface presentation

Dark Patterns to Trick People

More on Design Anti Patterns

Back to designing well

Back to Polishing your Graphics with Hadley –

Making NeW R

Tal G in his excellent blog piece talks of “Why R Developers  should not be paid” http://www.r-statistics.com/2010/09/open-source-and-money-why-r-developers-shouldnt-be-paid/

His argument of love is not very original though it was first made by these four guys

I am going to argue that “some” R developers should be paid, while the main focus should be volunteers code. These R developers should be paid as per usage of their packages.

Let me expand.

Imagine the following conversation between Ross Ihaka, Norman Nie and Peter Dalgaard.

Norman- Hey Guys, Can you give me some code- I got this new startup.

Ross Ihaka and Peter Dalgaard- Sure dude. Here is 100,000 lines of code, 2000 packages and 2 decades of effort.

Norman- Thanks guys.

Ross Ihaka- Hey, What you gonna do with this code.

Norman- I will better it. Sell it. Finally beat Jim Goodnight and his **** Proc GLM and **** Proc Reg.

Ross- Okay, but what will you give us? Will you give us some code back of what you improve?

Norman – Uh, let me explain this open core …

Peter D- Well how about some royalty?

Norman- Sure, we will throw parties at all conferences, snacks you know at user groups.

Ross – Hmm. That does not sound fair. (walks away in a huff muttering)-He takes our code, sells it and wont share the code

Peter D- Doesnt sound fair. I am back to reading Hamlet, the great Dane, and writing the next edition of my book. I am glad I wrote a book- Ross didnt even write that.

Norman-Uh Oh. (picks his phone)- Hey David Smith, We need to write some blog articles pronto – these open source guys ,man…

———–I think that sums what has been going on in the dynamics of R recently. If Ross Ihaka and R Gentleman had adopted an open core strategy- meaning you can create packages to R but not share the original where would we all be?

At this point if he is reading this, David Smith , long suffering veteran of open source  flameouts is rolling his eyes while Tal G is wondering if he will publish this on R Bloggers and if so when or something.

Lets bring in another R veteran-  Hadley Wickham who wrote a book on R and also created ggplot. Thats the best quality, most often used graphics package.

In terms of economic utilty to end user- the ggplot package may be as useful if not more as the foreach package developed by Revolution Computing/Analytics.

Now http://cran.r-project.org/web/packages/foreach/index.html says that foreach is licensed under http://www.apache.org/licenses/LICENSE-2.0

However lets come to open core licensing ( read it here http://alampitt.typepad.com/lampitt_or_leave_it/2008/08/open-core-licen.html ) which is where the debate is- Revolution takes code- enhances it (in my opinion) substantially with new formats XDF for better efficieny, web services API, and soon coming next year a GUI (thanks in advance , Dr Nie and guys)

and sells this advanced R code to businesses happy to pay ( they are currently paying much more to DR Goodnight and HIS guys)

Why would any sane customer buy it from Revolution- if he could download exactly the same thing from http://r-project.org

Hence the business need for Revolution Analytics to have an enhanced R- as they are using a product based software model not software as a service model.

If Revolution gives away source code of these new enhanced codes to R core team- how will R core team protect the above mentioned intelectual property- given they have 2 decades experience of giving away free code , and back and forth on just code.

Now Revolution also has a marketing budget- and thats how they sponsor some R Core events, conferences, after conference snacks.

How would people decide if they are being too generous or too stingy in their contribution (compared to the formidable generosity of SAS Institute to its employees, stakeholders and even third party analysts).

Would it not be better- IF Revolution can shift that aspect of relationship to its Research and Development budget than it’s marketing budget- come with some sort of incentive for “SOME” developers – even researchers need grants and assistantships, scholarships, make a transparent royalty formula say 17.5 % of the NEW R sales goes to R PACKAGE Developers pool, which in turn examines usage rate of packages and need/merit before allocation- that would require Revolution to evolve from a startup to a more sophisticated corporate and R Core can use this the same way as John M Chambers software award/scholarship

Dont pay all developers- it would be an insult to many of them – say Prof Harrell creator of HMisc to accept – but can Revolution expand its dev base (and prospect for future employees) by even sponsoring some R Scholarships.

And I am sure that if Revolution opens up some more code to the community- they would the rest of the world and it’s help useful. If it cant trust people like R Gentleman with some source code – well he is a board member.

——————————————————————————————–

Now to sum up some technical discussions on NeW R

1)  An accepted way of benchmarking efficiencies.

2) Code review and incorporation of efficiencies.

3) Multi threading- Multi core usage are trends to be incorporated.

4) GUIs like R Commander E Plugins for other packages, and Rattle for Data Mining to have focussed (or Deducer). This may involve hiring User Interface Designers (like from Apple 😉  who will work for love AND money ( Even the Beatles charge royalty for that song)

5) More support to cloud computing initiatives like Biocep and Elastic R – or Amazon AMI for using cloud computers- note efficiency arguements dont matter if you just use a Chrome Browser and pay 2 cents a hour for an Amazon Instance. Probably R core needs more direct involvement of Google (Cloud OS makers) and Amazon as well as even Salesforce.com (for creating Force.com Apps). Note even more corporates here need to be involved as cloud computing doesnot have any free and open source infrastructure (YET)

_______________________________________________________

Debates will come and go. This is an interesting intellectual debate and someday the liitle guys will win the Revolution-

From Hugh M of Gaping Void-

http://www.gapingvoid.com/Moveable_Type/archives/cat_microsoft_blue_monster_series.html

HOW DOES A SOFTWARE COMPANY MAKE MONEY, IF ALL

SOFTWARE IS FREE?

“If something goes wrong with Microsoft, I can phone Microsoft up and have it fixed. With Open Source, I have to rely on the community.”

And the community, as much as we may love it, is unpredictable. It might care about your problem and want to fix it, then again, it may not. Anyone who has ever witnessed something online go “viral”, good or bad, will know what I’m talking about.

and especially-

http://gapingvoid.com/2007/04/16/how-well-does-open-source-currently-meet-the-needs-of-shareholders-and-ceos/

Source-http://gapingvoidgallery.com/

Kind of sums up why the open core licensing is all about.

Analytics and Journals

Some good journals for reading on analytics-

1) JSS

http://www.jstatsoft.org/

present research that demonstrates the joint evolution of computational and statistical methods and techniques.  Implementations can use languages such as C, C++, S, Fortran, Java, PHP, Python and Ruby or environments such as Mathematica, MATLAB, R, S-PLUS, SAS, Stata, and XLISP-STAT.

There are currently 370 articles, 23 code snippets, 86 book reviews, 4 software reviews, and 7 special volumes in archives

2) R Journal

http://journal.r-project.org/

The  Journal

3) Pharma Programming

http://maney.co.uk/index.php/journals/pha/

Pharmaceutical Programming is the official journal of the Pharmaceutical Users Software Exchange (PhUSE), a non-profit membership society with the objective of educating programmers and their managers working in the pharmaceutical industry. Available both in print and online, Pharmaceutical Programming is an international journal with focus on programming in the regulated environment of the pharmaceutical and life sciences industry.

4) SAS Papers – User Groups

http://www.lexjansen.com/

4569 SAS papers presented
at SGF/SUGI 1996-2010.
1343 SAS papers presented
at PharmaSUG 2000-2010.
1810 SAS papers presented
at NESUG 1997-2009.
1191 SAS papers presented
at SESUG 1999-2009.
463 SAS papers presented
at PhUSE 2005-2009.
787 SAS papers presented
at WUSS 2003-2009.
337 SAS papers presented
at MWSUG 2001, 2004-2009.
188 SAS papers presented
at PNWSUG 2004-2009.
246 SAS papers presented
at SCSUG 2003-2007, 2009.
221 SAS papers related to CDISC.
Easy access to the CDISC Forum.

5) http://analyticsmagazine.com/

Magazine by http://www.informs.org/

6) Data Mining Journals

Academic Journals

Journals relevant to Data Mining

Oracle Open World/ RODM package

From the press release, here comes Oracle Open World. They really have an excellent rock concert in that as well.

.NET and Windows @ Oracle Develop and Oracle OpenWorld 2010

Oracle Develop will again feature a .NET track for Oracle developers. Oracle Develop is suited for all levels of .NET developers, from beginner to advanced. It covers introductory Oracle .NET material, new features, deep dive application tuning, and includes three hours of hands-on labs apply what you learned from the sessions.

To register, go to Oracle Develop registration site.

Oracle OpenWorld will include several sessions on using the Oracle Database on Windows and .NET.

Session schedules and locations for Windows and .NET sessions at Oracle Develop and OpenWorld are now available.

Download: 32-bit ODAC 11.2.0.1.2 for Visual Studio 2010 and .NET Framework 4

With ODAC 11.2.0.1.2, developers can connect to Oracle Database versions 9.2 and higher from Visual Studio 2010 and .NET Framework 4. ODAC components support the full framework, as well as the new .NET Framework Client Profile.

Statement of Direction: Oracle Database and Microsoft Entity Framework

Learn about Oracle’s beta and production plans to support Microsoft Entity Framework with Oracle Database.

Also see http://www.oracle.com/technetwork/articles/datawarehouse/saternos-r-161569.html

for

Data Mining Using the RDOM Package

By Casimir Saternos

Some excerpts-

Open R and enter the following command.

> library(RODM)

This command loads the RODM library and as well the dependent RODBC package. The next step is to make a database connection.

> DB <- RODM_open_dbms_connection(dsn="orcl", uid="dm", pwd="dm")

Subsequent commands use the DB object (an instance of the RODBC class) to connect to the database. The DNS specified in the command is the name you used earlier for the Data Source Name during the ODBC connection configuration. You can view the actual R code being executed by the command by simply typing the function name (without parentheses).

> RODM_open_dbms_connection

And say making a Model in Oracle and R-

> numrows <- length(orange_data[,1])
> orange_data.rows <- length(orange_data[,1])
> orange_data.id <- matrix(seq(1, orange_data.rows),  nrow=orange_data.rows, ncol=1, dimnames= list(NULL, c(“CASE_ID”)))
> orange_data <- cbind(orange_data.id, orange_data)

This adjustment to the data frame then needs to be propagated to the database. You can confirm the change using the sqlColumns function, as listed earlier.

> RODM_create_dbms_table(DB, "orange_data")
> sqlColumns(DB, 'orange_data')$COLUMN_NAME

> glm <- RODM_create_glm_model(
database = DB,
data_table_name = “orange_data”,
case_id_column_name = “CASE_ID”,
target_column_name = “circumference”,
model_name = “GLM_MODEL”,
mining_function = “regression”)

Information about this model can then be obtained by analyzing value returned from the model and stored in the variable named glm.

> glm$model.model_settings
> glm$glm.globals
> $glm.coefficients

Once you have a model, you can apply the model to a new set of data. To begin, create or retrieve sample data in the same format as the training data.

> query<-('select 999 case_id, 1 tree, 120 age, 
32 circumference from dual')

> orange_test<-sqlQuery(DB, query)
> RODM_create_dbms_table(DB, "orange_test")
and 
Finally, the model can be applied to the new data set and the results analyzed.

results <- RODM_apply_model(database = DB, 
data_table_name = "orange_test",
model_name = "GLM_MODEL",
supplemental_cols = "circumference")

When your session is complete, you can clean up objects that were created (if you like) and you should close the database connection:

> RODM_drop_model(database=DB,'GLM_MODEL')
> RODM_drop_dbms_table(DB, "orange_test")
> RODM_drop_dbms_table(DB, "orange_data")
> RODM_close_dbms_connection(DB)

See the full article at http://www.oracle.com/technetwork/articles/datawarehouse/saternos-r-161569.html

More PAWS

Dr Eric Siegel  (interviewed here at https://decisionstats.wordpress.com/2009/07/14/interview_eric-siege/ )

continues his series of excellent analytical conferences-

Oct 19-20 – WASHINGTON DC: PAW Conference & Workshops (pawcon.com/dc)

Oct 28-29 – SAN FRANCISCO: Workshop (businessprediction.com)

Nov 15-16 – LONDON: PAW Conference & Workshop (pawcon.com/london)

March 14-15, 2011 – SAN FRANCISCO: PAW Conference & Workshops

* Register by Sep 30 for PAW London Early-Bird – Save £200
http://pawcon.com/london/register.php

* For the Oct 28-29 workshop, see http://businessprediction.com

———————–

INFORMATION ABOUT THE PAW CONFERENCES:

Predictive Analytics World ( http://pawcon.com ) is the business-focused event for predictive analytics professionals, managers and commercial practitioners, covering today’s commercial deployment of predictive analytics, across industries and across software vendors.

PAW delivers the best case studies, expertise, keynotes, sessions, workshops, exposition, expert panel, live demos, networking coffee breaks, reception, birds-of-a-feather lunches, brand-name enterprise leaders, and industry heavyweights in the business.

Case study presentations cover campaign targeting, churn modeling, next-best-offer, selecting marketing channels, global analytics deployment, email marketing, HR candidate search, and other innovative applications. The Conference agendas cover hot topics such as social data, text mining, search marketing, risk management, uplift (incremental lift) modeling, survey analysis, consumer privacy, sales force optimization and other innovative applications that benefit organizations in new and creative ways.

PAW delivers two rich conference programs in Oct./Nov. with very little content overlap featuring a wealth of speakers with front-line experience. See which one is best for you:

PAW’s DC 2010 (Oct 19-20) program includes over 25 sessions across two tracks – an “All Audiences” and an “Expert/Practitioner” track — so you can witness how predictive analytics is applied at 1-800-FLOWERS, CIBC, Corporate Executive Board, Forrester, LifeLine, Macy’s, MetLife, Miles Kimball, Monster, Paychex, PayPal (eBay), SunTrust, Target, UPMC Health Plan, Xerox, YMCA, and Yahoo!, plus special examples from the U.S. government agencies DoD, DHS, and SSA.

Sign up for event updates in the US http://pawcon.com/signup-us.php
View the agenda at-a-glance: http://pawcon.com/dc/2010/agenda_overview.php
For more: http://pawcon.com/dc
Register: http://pawcon.com/dc/register.php

PAW London 2010 (Nov 15-16) will feature over 20 speakers from 10 countries with case studies from leading enterprises in e-commerce, finance, healthcare, retail, and telecom such as Canadian Automobile Association, Chessmetrics, e-Dialog, Hamburger Sparkasse, Jeevansathi.com (India’s 2nd-largest matrimony portal), Life Line Screening, Lloyds TSB, Naukri.com (India’s number 1 job portal), Overtoom, SABMiller, Univ. of Melbourne, and US Bank, plus special examples from Anheuser-Busch, Disney, HP, HSBC, Pfizer, U.S. SSA, WestWind Foundation and others.

Sign up for event updates in the UK http://pawcon.com/signup-uk.php
View the agenda at-a-glance: http://pawcon.com/london/2010/agenda_overview.php
For more: http://pawcon.com/london
Register: http://pawcon.com/london/register.php

——————————-

PAW San Francisco Save-the-Date and Call-for-Speakers:

March 14-15, 2011
San Francisco Marriott Marquis
San Francisco, CA

PAW call-for-speakers information and submission form: (Due Oct 8)
http://www.predictiveanalyticsworld.com/submit.php

If you wish to receive periodic call-for-speakers notifications regarding Predictive Analytics World, email chair@predictiveanalyticsworld.com with the subject line “call-for-speakers notifications”.

Predictive Analytics World
http://www.predictiveanalyticsworld.com
Washington DC – London – San Francisco

AsterData releases nCluster 4.6

From the press release

Aster Data nCluster 4.6, which includes a column data store, making Aster Data nCluster 4.6 the first platform with a unified SQL-MapReduce analytic framework on a hybrid row and column massively parallel processing (MPP) database management system (DBMS). The unified SQL-MapReduce analytic framework and Aster Data’s suite of 1000+ MapReduce-ready analytic functions, delivers a substantial breakthrough in richer, high performance analytics on large data volumes where data can be stored in either a row or column format.

With Aster Data nCluster 4.6, customers can choose the data format best suited to their needs and benefit from the power of Aster Data’s SQL-MapReduce analytic capabilities, providing maximum query performance by leveraging row-only, column-only, or hybrid storage strategies. Aster Data makes selection of the appropriate storage strategy easy with the new Data Model Express tool that determines the optimal data model based on a customer’s query workloads.  Both row and column stores in Aster Data nCluster 4.6 benefit from platform-level services including Online Precision Scaling™ on commodity hardware, dynamic workload management, and always-on availability, all of which now operate on both row and column stores. All 1000+ MapReduce-ready analytic functions released previously through Aster Data Analytic Foundation — a powerful suite of pre-built MapReduce analytic software building blocks — now run on a hybrid row and column architecture.  Aster Data nCluster 4.6 also includes new pre-built analytic functions, including decision trees and histograms. For custom analytic application development, the Aster Data IDE, Aster Data Developer Express, also fully and seamlessly supports the hybrid row and column store in Aster DatanCluster 4.6.

More advanced analytics infrastructure.