Open Source Analytics

My guest blog at Allanalytics.com is now up

http://www.allanalytics.com/ is the exciting community which looks at the business aspects to the analytics market with a great lineup of pedigree writers.

It is basically a point/counterpoint for and against open source analytics. I feel there is a scope of lot of improvement before open source dominates the world of analytics software like Android, Linux Web Server do in their markets. Part of this reason is – there needs to be more , much more investment in analytics research, development, easier to use interfaces, Big Data integration and rewarding ALL the writers of code regardless of whether the code is proprietary or open source.

A last word- I think open source analytics AND proprietary analytics software will have to learn to live with one another, with game theory dictating their response and counter-response. More competition is good, and open source is an AND option not an OR option to existing status quo.

You can read the full blog discussion at http://www.allanalytics.com/author.asp?section_id=1408&doc_id=233454&piddl_msgorder=thrd#msgs

Hopefully discussion would be more analytical than passionate 🙂 and greater investments in made in analytics by all sides.

 

Featured: PAW & TAW NYC Hotel Reservations Due This Week

Message from PAWCON-

Space is filling up fast at the Hilton New York, host hotel for Predictive Analytics World and Text Analytics World, next month in New York City. Take advantage of the special room rate negotiated for attendees prior to Friday, September 23rd.

Space is limited so be sure to book your room before it’s too late.

You can reserve your room today by calling             212-586-7000       and reference Data Driven Business Week or online at:
http://www.hilton.com/en/hi/groups/personalized/N/NYCNHHH-RMSP-20111015/index.jhtml?WT.mc_id=POG#reservation

MORE INFORMATION:

PAW: http://www.pawcon.com/nyc
PAW REGISTRATION: http://www.pawcon.com/newyork/register.php

TAW: http://www.tawgo.com/nyc
TAW REGISTRATION: http://www.tawgo.com/newyork/2011/registration

View the PAW overview video: www.pawcon.com/newyork/2011/video_about_predictive_analytics_world.php 

Use R for Business- Competition worth $ 20,000 #rstats

All you contest junkies, R lovers and general change the world people, here’s a new contest to use R in a business application

http://www.revolutionanalytics.com/news-events/news-room/2011/revolution-analytics-launches-applications-of-r-in-business-contest.php

REVOLUTION ANALYTICS LAUNCHES “APPLICATIONS OF R IN BUSINESS” CONTEST

$20,000 in Prizes for Users Solving Business Problems with R

 

PALO ALTO, Calif. – September 1, 2011 – Revolution Analytics, the leading commercial provider of R software, services and support, today announced the launch of its “Applications of R in Business” contest to demonstrate real-world uses of applying R to business problems. The competition is open to all R users worldwide and submissions will be accepted through October 31. The Grand Prize winner for the best application using R or Revolution R will receive $10,000.

The bonus-prize winner for the best application using features unique to Revolution R Enterprise – such as itsbig-data analytics capabilities or its Web Services API for R – will receive $5,000. A panel of independent judges drawn from the R and business community will select the grand and bonus prize winners. Revolution Analytics will present five honorable mention prize winners each with $1,000.

“We’ve designed this contest to highlight the most interesting use cases of applying R and Revolution R to solving key business problems, such as Big Data,” said Jeff Erhardt, COO of Revolution Analytics. “The ability to process higher-volume datasets will continue to be a critical need and we encourage the submission of applications using large datasets. Our goal is to grow the collection of online materials describing how to use R for business applications so our customers can better leverage Big Analytics to meet their analytical and organizational needs.”

To enter Revolution Analytics’ “Applications of R in Business” competition Continue reading “Use R for Business- Competition worth $ 20,000 #rstats”

Interview Dan Steinberg Founder Salford Systems

Here is an interview with Dan Steinberg, Founder and President of Salford Systems (http://www.salford-systems.com/ )

Ajay- Describe your journey from academia to technology entrepreneurship. What are the key milestones or turning points that you remember.

 Dan- When I was in graduate school studying econometrics at Harvard,  a number of distinguished professors at Harvard (and MIT) were actively involved in substantial real world activities.  Professors that I interacted with, or studied with, or whose software I used became involved in the creation of such companies as Sun Microsystems, Data Resources, Inc. or were heavily involved in business consulting through their own companies or other influential consultants.  Some not involved in private sector consulting took on substantial roles in government such as membership on the President’s Council of Economic Advisors. The atmosphere was one that encouraged free movement between academia and the private sector so the idea of forming a consulting and software company was quite natural and did not seem in any way inconsistent with being devoted to the advancement of science.

 Ajay- What are the latest products by Salford Systems? Any future product plans or modification to work on Big Data analytics, mobile computing and cloud computing.

 Dan- Our central set of data mining technologies are CART, MARS, TreeNet, RandomForests, and PRIM, and we have always maintained feature rich logistic regression and linear regression modules. In our latest release scheduled for January 2012 we will be including a new data mining approach to linear and logistic regression allowing for the rapid processing of massive numbers of predictors (e.g., one million columns), with powerful predictor selection and coefficient shrinkage. The new methods allow not only classic techniques such as ridge and lasso regression, but also sub-lasso model sizes. Clear tradeoff diagrams between model complexity (number of predictors) and predictive accuracy allow the modeler to select an ideal balance suitable for their requirements.

The new version of our data mining suite, Salford Predictive Modeler (SPM), also includes two important extensions to the boosted tree technology at the heart of TreeNet.  The first, Importance Sampled learning Ensembles (ISLE), is used for the compression of TreeNet tree ensembles. Starting with, say, a 1,000 tree ensemble, the ISLE compression might well reduce this down to 200 reweighted trees. Such compression will be valuable when models need to be executed in real time. The compression rate is always under the modeler’s control, meaning that if a deployed model may only contain, say, 30 trees, then the compression will deliver an optimal 30-tree weighted ensemble. Needless to say, compression of tree ensembles should be expected to be lossy and how much accuracy is lost when extreme compression is desired will vary from case to case. Prior to ISLE, practitioners have simply truncated the ensemble to the maximum allowable size.  The new methodology will substantially outperform truncation.

The second major advance is RULEFIT, a rule extraction engine that starts with a TreeNet model and decomposes it into the most interesting and predictive rules. RULEFIT is also a tree ensemble post-processor and offers the possibility of improving on the original TreeNet predictive performance. One can think of the rule extraction as an alternative way to explain and interpret an otherwise complex multi-tree model. The rules extracted are similar conceptually to the terminal nodes of a CART tree but the various rules will not refer to mutually exclusive regions of the data.

 Ajay- You have led teams that have won multiple data mining competitions. What are some of your favorite techniques or approaches to a data mining problem.

 Dan- We only enter competitions involving problems for which our technology is suitable, generally, classification and regression. In these areas, we are  partial to TreeNet because it is such a capable and robust learning machine. However, we always find great value in analyzing many aspects of a data set with CART, especially when we require a compact and easy to understand story about the data. CART is exceptionally well suited to the discovery of errors in data, often revealing errors created by the competition organizers themselves. More than once, our reports of data problems have been responsible for the competition organizer’s decision to issue a corrected version of the data and we have been the only group to discover the problem.

In general, tackling a data mining competition is no different than tackling any analytical challenge. You must start with a solid conceptual grasp of the problem and the actual objectives, and the nature and limitations of the data. Following that comes feature extraction, the selection of a modeling strategy (or strategies), and then extensive experimentation to learn what works best.

 Ajay- I know you have created your own software. But are there other software that you use or liked to use?

 Dan- For analytics we frequently test open source software to make sure that our tools will in fact deliver the superior performance we advertise. In general, if a problem clearly requires technology other than that offered by Salford, we advise clients to seek other consultants expert in that other technology.

 Ajay- Your software is installed at 3500 sites including 400 universities as per http://www.salford-systems.com/company/aboutus/index.html What is the key to managing and keeping so many customers happy?

 Dan- First, we have taken great pains to make our software reliable and we make every effort  to avoid problems related to bugs.  Our testing procedures are extensive and we have experts dedicated to stress-testing software . Second, our interface is designed to be natural, intuitive, and easy to use, so the challenges to the new user are minimized. Also, clear documentation, help files, and training videos round out how we allow the user to look after themselves. Should a client need to contact us we try to achieve 24-hour turn around on tech support issues and monitor all tech support activity to ensure timeliness, accuracy, and helpfulness of our responses. WebEx/GotoMeeting and other internet based contact permit real time interaction.

 Ajay- What do you do to relax and unwind?

 Dan- I am in the gym almost every day combining weight and cardio training. No matter how tired I am before the workout I always come out energized so locating a good gym during my extensive travels is a must. I am also actively learning Portuguese so I look to watch a Brazilian TV show or Portuguese dubbed movie when I have time; I almost never watch any form of video unless it is available in Portuguese.

 Biography-

http://www.salford-systems.com/blog/dan-steinberg.html

Dan Steinberg, President and Founder of Salford Systems, is a well-respected member of the statistics and econometrics communities. In 1992, he developed the first PC-based implementation of the original CART procedure, working in concert with Leo Breiman, Richard Olshen, Charles Stone and Jerome Friedman. In addition, he has provided consulting services on a number of biomedical and market research projects, which have sparked further innovations in the CART program and methodology.

Dr. Steinberg received his Ph.D. in Economics from Harvard University, and has given full day presentations on data mining for the American Marketing Association, the Direct Marketing Association and the American Statistical Association. After earning a PhD in Econometrics at Harvard Steinberg began his professional career as a Member of the Technical Staff at Bell Labs, Murray Hill, and then as Assistant Professor of Economics at the University of California, San Diego. A book he co-authored on Classification and Regression Trees was awarded the 1999 Nikkei Quality Control Literature Prize in Japan for excellence in statistical literature promoting the improvement of industrial quality control and management.

His consulting experience at Salford Systems has included complex modeling projects for major banks worldwide, including Citibank, Chase, American Express, Credit Suisse, and has included projects in Europe, Australia, New Zealand, Malaysia, Korea, Japan and Brazil. Steinberg led the teams that won first place awards in the KDDCup 2000, and the 2002 Duke/TeraData Churn modeling competition, and the teams that won awards in the PAKDD competitions of 2006 and 2007. He has published papers in economics, econometrics, computer science journals, and contributes actively to the ongoing research and development at Salford.

Congrats to Matt Stromberg- Winner 2 free passes to PAW New York

Here is a big congrats to Matt Stromberg of San Diego for winning 2 free passes to Predictive Analytics World. Each pass can be used for 2 days of the conference, and it is exclusive to that conference alone.

Connect to Matt ?

https://www.facebook.com/profile.php?id=3611395 or http://www.linkedin.com/pub/matt-stromberg/6/a3b/47a

A coincidence- its his birthday today. Happy Birthday Matt and enjoy NY and PAW Con


WINNER- Matt Stromberg

Mgr., Project Management & Business Analytics

Greater San Diego Area 
 

 

#rstats -Basic Data Manipulation using R

Continuing my series of basic data manipulation using R. For people knowing analytics and
new to R.
1 Keeping only some variables

Using subset we can keep only the variables we want-

Sitka89 <- subset(Sitka89, select=c(size,Time,treat))

Will keep only the variables we have selected (size,Time,treat).

2 Dropping some variables

Harman23.cor$cov.arm.span <- NULL
This deletes the variable named cov.arm.span in the dataset Harman23.cor

3 Keeping records based on character condition

Titanic.sub1<-subset(Titanic,Sex=="Male")

Note the double equal-to sign
4 Keeping records based on date/time condition

subset(DF, as.Date(Date) >= '2009-09-02' & as.Date(Date) <= '2009-09-04')

5 Converting Date Time Formats into other formats

if the variable dob is “01/04/1977) then following will convert into a date object

z=strptime(dob,”%d/%m/%Y”)

and if the same date is 01Apr1977

z=strptime(dob,"%d%b%Y")

6 Difference in Date Time Values and Using Current Time

The difftime function helps in creating differences in two date time variables.

difftime(time1, time2, units='secs')

or

difftime(time1, time2, tz = "", units = c("auto", "secs", "mins", "hours", "days", "weeks"))

For current system date time values you can use

Sys.time()

Sys.Date()

This value can be put in the difftime function shown above to calculate age or time elapsed.

7 Keeping records based on numerical condition

Titanic.sub1<-subset(Titanic,Freq >37)

For enhanced usage-
you can also use the R Commander GUI with the sub menu Data > Active Dataset

8 Sorting Data

Sorting A Data Frame in Ascending Order by a variable

AggregatedData<- sort(AggregatedData, by=~ Package)

Sorting a Data Frame in Descending Order by a variable

AggregatedData<- sort(AggregatedData, by=~ -Installed)

9 Transforming a Dataset Structure around a single variable

Using the Reshape2 Package we can use melt and acast functions

library("reshape2")

tDat.m<- melt(tDat)

tDatCast<- acast(tDat.m,Subject~Item)

If we choose not to use Reshape package, we can use the default reshape method in R. Please do note this takes longer processing time for bigger datasets.

df.wide <- reshape(df, idvar="Subject", timevar="Item", direction="wide")

10 Type in Data

Using scan() function we can type in data in a list

11 Using Diff for lags and Cum Sum function forCumulative Sums

We can use the diff function to calculate difference between two successive values of a variable.

Diff(Dataset$X)

Cumsum function helps to give cumulative sum

Cumsum(Dataset$X)

> x=rnorm(10,20) #This gives 10 Randomly distributed numbers with Mean 20

> x

[1] 20.76078 19.21374 18.28483 20.18920 21.65696 19.54178 18.90592 20.67585

[9] 20.02222 18.99311

> diff(x)

[1] -1.5470415 -0.9289122 1.9043664 1.4677589 -2.1151783 -0.6358585 1.7699296

[8] -0.6536232 -1.0291181 >

cumsum(x)

[1] 20.76078 39.97453 58.25936 78.44855 100.10551 119.64728 138.55320

[8] 159.22905 179.25128 198.24438

> diff(x,2) # The diff function can be used as diff(x, lag = 1, differences = 1, ...) where differences is the order of differencing

[1] -2.4759536 0.9754542 3.3721252 -0.6474195 -2.7510368 1.1340711 1.1163064

[8] -1.6827413

Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.

12 Merging Data

Deducer GUI makes it much simpler to merge datasets. The simplest syntax for a merge statement is

totalDataframeZ <- merge(dataframeX,dataframeY,by=c("AccountId","Region"))

13 Aggregating and group processing of a variable

We can use multiple methods for aggregating and by group processing of variables.
Two functions we explore here are aggregate and Tapply.

Refering to the R Online Manual at
[http://stat.ethz.ch/R-manual/R-patched/library/stats/html/aggregate.html]

## Compute the averages for the variables in 'state.x77', grouped

## according to the region (Northeast, South, North Central, West) that

## each state belongs to

aggregate(state.x77, list(Region = state.region), mean)

Using TApply

## tapply(Summary Variable, Group Variable, Function)

Reference

[http://www.ats.ucla.edu/stat/r/library/advanced_function_r.htm#tapply]

We can also use specialized packages for data manipulation.

For additional By-group processing you can see the doBy package as well as Plyr package
 for data manipulation.Doby contains a variety of utilities including:
 1) Facilities for groupwise computations of summary statistics and other facilities for working with grouped data.
 2) General linear contrasts and LSMEANS (least-squares-means also known as population means),
 3) HTMLreport for autmatic generation of HTML file from R-script with a minimum of markup, 4) various other utilities and is available at[ http://cran.r-project.org/web/packages/doBy/index.html]
Also Available at [http://cran.r-project.org/web/packages/plyr/index.html],
Plyr is a set of tools that solves a common set of problems:
you need to break a big problem down into manageable pieces,
operate on each pieces and then put all the pieces back together.
 For example, you might want to fit a model to each spatial location or
 time point in your study, summarise data by panels or collapse high-dimensional arrays
 to simpler summary statistics.