Doing Time Series using an R GUI


Until recently I had been thinking that RKWard was the only R GUI supporting time series models. However, Bob Muenchen of http://www.r4stats.com/ was kind enough to point out that the Epack plugin provides time series functionality to R Commander.

Note that the GUI helps you explore a variety of time series functionality.

Using bulkfit you can fit various ARIMA models to a dataset and choose the one with the minimum AIC:

> bulkfit(AirPassengers$x)
$res
      ar d ma      AIC
 [1,]  0 0  0 1790.368
 [2,]  0 0  1 1618.863
 [3,]  0 0  2 1522.122
 [4,]  0 1  0 1413.909
 [5,]  0 1  1 1397.258
 [6,]  0 1  2 1397.093
 [7,]  0 2  0 1450.596
 [8,]  0 2  1 1411.368
 [9,]  0 2  2 1394.373
[10,]  1 0  0 1428.179
[11,]  1 0  1 1409.748
[12,]  1 0  2 1411.050
[13,]  1 1  0 1401.853
[14,]  1 1  1 1394.683
[15,]  1 1  2 1385.497
[16,]  1 2  0 1447.028
[17,]  1 2  1 1398.929
[18,]  1 2  2 1391.910
[19,]  2 0  0 1413.639
[20,]  2 0  1 1408.249
[21,]  2 0  2 1408.343
[22,]  2 1  0 1396.588
[23,]  2 1  1 1378.338
[24,]  2 1  2 1387.409
[25,]  2 2  0 1440.078
[26,]  2 2  1 1393.882
[27,]  2 2  2 1392.659

$min
   ar     d    ma      AIC
2.000 1.000 1.000 1378.338
> ArimaModel.5 <- Arima(AirPassengers$x, order = c(0,1,1),
+   include.mean = 1,
+   seasonal = list(order = c(0,1,1), period = 12))
> ArimaModel.5
Series: AirPassengers$x
ARIMA(0,1,1)(0,1,1)[12]
Call: Arima(x = AirPassengers$x, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 1), period = 12), include.mean = 1)
Coefficients:
          ma1     sma1
      -0.3087  -0.1074
s.e.   0.0890   0.0828
sigma^2 estimated as 135.4:  log likelihood = -507.5
AIC = 1021   AICc = 1021.19   BIC = 1029.63
> summary(ArimaModel.5, cor = FALSE)
Series: AirPassengers$x
ARIMA(0,1,1)(0,1,1)[12]
Call: Arima(x = AirPassengers$x, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 1), period = 12), include.mean = 1)
Coefficients:
          ma1     sma1
      -0.3087  -0.1074
s.e.   0.0890   0.0828
sigma^2 estimated as 135.4:  log likelihood = -507.5
AIC = 1021   AICc = 1021.19   BIC = 1029.63
In-sample error measures:
         ME        RMSE         MAE         MPE        MAPE        MASE
 0.32355285 11.09952005  8.16242469  0.04409006  2.89713514  0.31563730
> Dataset79 <- predar3(ArimaModel.5, fore1 = 5)
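
A note on the transcript above: bulkfit's best non-seasonal order is (2,1,1) with an AIC of 1378.3, but the seasonal ARIMA(0,1,1)(0,1,1)[12] does far better (AIC = 1021), which is why that model was fit. For readers without the Epack GUI, here is a rough base-R sketch of the same workflow: a grid search by AIC (roughly what bulkfit reports), followed by the seasonal fit and a 5-step forecast. It uses only stats::arima() and predict(), so the AIC values may differ slightly from Epack's output above, and predict() stands in for Epack's predar3() wrapper.

# Minimal base-R sketch of the GUI workflow above (not Epack's own code)
data(AirPassengers)
y <- AirPassengers

# Grid search over non-seasonal (p, d, q) orders by AIC
grid <- expand.grid(ar = 0:2, d = 0:2, ma = 0:2)
grid$AIC <- apply(grid, 1, function(o) {
  fit <- try(arima(y, order = o), silent = TRUE)
  if (inherits(fit, "try-error")) NA else AIC(fit)
})
grid[which.min(grid$AIC), ]    # best non-seasonal order

# The seasonal model from the transcript, with a 5-step-ahead forecast
model <- arima(y, order = c(0, 1, 1),
               seasonal = list(order = c(0, 1, 1), period = 12))
predict(model, n.ahead = 5)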


I also found an interesting reference card for time series functions in R:

http://cran.r-project.org/doc/contrib/Ricci-refcard-ts.pdf

and a slightly more exhaustive time series reference card:

http://www.statistische-woche-nuernberg-2010.org/lehre/bachelor/datenanalyse/Refcard3.pdf

Also of interest is an opinion piece on issues in time series analysis in R, at

http://www.stat.pitt.edu/stoffer/tsa2/Rissues.htm

Of course, if I were the sales manager for SAS/ETS, I would be worried, given R's increasing time series capabilities. But then again, the R GUIs for time series still have some deficiencies:

1) Layout is not very elegant

2) Not enough documented help (at least for the Epack GUI), and no integrated help across packages

3) Graphical capabilities need more help documentation for interpreting the output (especially ACF and PACF plots)
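
On the ACF/PACF point, here is a quick base-R sketch (no GUI needed) using the AirPassengers series from above; the log/diff transform is a standard choice for this series, not something Epack prescribes.

# Rough guide: significant ACF spikes suggest MA terms, PACF spikes AR terms
data(AirPassengers)
d <- diff(log(AirPassengers))    # stabilize variance, remove trend
op <- par(mfrow = c(1, 2))       # two plots side by side
acf(d,  main = "ACF of diff(log(AirPassengers))")
pacf(d, main = "PACF of diff(log(AirPassengers))")
par(op)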

More resources on time series using R:

http://people.bath.ac.uk/masgs/time%20series/TimeSeriesR2004.pdf

and http://www.statoek.wiso.uni-goettingen.de/veranstaltungen/zeitreihen/sommer03/ts_r_intro.pdf

and books

http://www.springer.com/economics/econometrics/book/978-0-387-77316-2

http://www.springer.com/statistics/statistical+theory+and+methods/book/978-0-387-75960-9

http://www.springer.com/statistics/statistical+theory+and+methods/book/978-0-387-75958-6

http://www.springer.com/statistics/statistical+theory+and+methods/book/978-0-387-75966-1

Interesting R competition at Reddit


Here is an interesting R competition going on at Reddit: the goal is to help Reddit build a recommendation engine 🙂

http://www.reddit.com/r/redditdev/comments/dtg4j/want_to_help_reddit_build_a_recommender_a_public/

by ketralnis

As promised, here is the big dump of voting information that you guys donated to research. Warning: this contains much geekery that may result in discomfort for the nerd-challenged.

I’m trying to use it to build a recommender, and I’ve got some preliminary source code. I’m looking for feedback on all of these steps, since I’m not experienced at machine learning.

Here’s what I’ve done

  • I dumped all of the raw data that we’ll need to generate the public dumps. The queries are the comments in the two .pig files and it took about 52 minutes to do the dump against production. The result of this raw dump looks like:
    $ wc -l *.dump
     13,830,070 reddit_data_link.dump
    136,650,300 reddit_linkvote.dump
         69,489 reddit_research_ids.dump
     13,831,374 reddit_thing_link.dump
    
  • I filtered the list of votes for the list of users that gave us permission to use their data. For the curious, that’s 67,059 users: 62,763 with “public votes” and 6,726 with “allow my data to be used for research”. I’d really like to see that second category significantly increased, and hopefully this project will be what does it. This filtering is done by srrecs_researchers.pig and took 83m55.335s on my laptop.
  • I converted data-dumps that were in our DB schema format to a more usable format using srrecs.pig (about 13 min)
  • From that dump I mapped all of the account_ids, link_ids, and sr_ids to salted hashes (using obscure() in srrecs.py with a random seed, so even I don’t know it). This took about 13min on my laptop. The result of this, votes.dump, is the file that is actually public. It is a tab-separated file consisting of:
    account_id,link_id,sr_id,dir
    

    There are 23,091,688 votes from 43,976 users over 3,436,063 links in 11,675 reddits. (Interestingly these ~44k users represent almost 17% of our total votes). The dump is 2.2gb uncompressed, 375mb in bz2.

What to do with it

The recommendations system that I’m trying right now turns those votes into a set of affinities. That is, “67% of user #223’s votes on /r/reddit.com are upvotes and 52% on programming.” To make these affinities (55m45.107s on my laptop):

 cat votes.dump | ./srrecs.py "affinities_m()" | sort -S200m | ./srrecs.py "affinities_r()" > affinities.dump
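
A rough R equivalent of this affinity step (illustrative only, not the actual srrecs.py; it assumes the tab-separated votes.dump layout described above, with dir equal to 1 for an upvote):

# Illustrative sketch: per-user, per-subreddit share of upvotes (0..1)
votes <- read.delim("votes.dump", header = FALSE,
                    col.names = c("account_id", "link_id", "sr_id", "dir"))
affinities <- aggregate(dir ~ account_id + sr_id, data = votes,
                        FUN = function(v) mean(v == 1))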

Then I turn the affinities into a sparse matrix representing N-dimensional co-ordinates in the vector space of affinities (scaled to -1..1 instead of 0..1), in the format used by R’s skmeans package (less than a minute on my laptop). Imagine that this matrix looks like

          reddit.com pics       programming horseporn  bacon
          ---------- ---------- ----------- ---------  -----
ketralnis -0.5       (no votes) +0.45       (no votes) +1.0
jedberg   (no votes) -0.25      +0.95       +1.0       -1.0
raldi     +0.75      +0.75      +0.7        (no votes) +1.0
...

We build it like:

# they were already grouped by account_id, so we don't have to
# sort. changes to the previous step will probably require this
# step to have to sort the affinities first
cat affinities.dump | ./srrecs.py "write_matrix('affinities.cm', 'affinities.clabel', 'affinities.rlabel')"

I pass that through an R program srrecs.r (if you don’t have R installed, you’ll need to install that, and the package skmeans, like install.packages('skmeans')). This program plots the users in this vector space, finding clusters using a spherical k-means clustering algorithm (on my laptop, it takes about 10 minutes with 15 clusters and 16 minutes with 50 clusters, during which R sits at about 220mb of RAM)

# looks for the files created by write_matrix in the current directory
R -f ./srrecs.r

The output of the program is a generated list of cluster-IDs, corresponding in order to the order of user-IDs in affinities.clabel. The numbers themselves are meaningless, but people in the same cluster ID have been clustered together.
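
As a self-contained illustration of that clustering step (toy affinity values matching the matrix sketched above, not the real dump):

# Spherical k-means on a small dense affinity matrix; missing votes are 0.
# skmeans normalizes rows, so every user needs at least one nonzero affinity.
library(skmeans)    # install.packages("skmeans") if needed
m <- rbind(ketralnis = c(-0.5,  0.00, 0.45, 0.0,  1.0),
           jedberg   = c( 0.0, -0.25, 0.95, 1.0, -1.0),
           raldi     = c( 0.75, 0.75, 0.70, 0.0,  1.0))
colnames(m) <- c("reddit.com", "pics", "programming", "horseporn", "bacon")
fit <- skmeans(m, k = 2)    # k is the number of clusters to find
fit$cluster                 # cluster ID per user, in row order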

Here are the files

These are torrents of bzip2-compressed files. If you can’t use the torrents for some reason it’s pretty trivial to figure out from the URL how to get to the files directly on S3, but please try the torrents first since it saves us a few bucks. It’s S3 seeding the torrents anyway, so it’s unlikely that direct-downloading is going to go any faster or be any easier.

  • votes.dump.bz2 — A tab-separated list of:
    account_id, link_id, sr_id, direction
    
  • For your convenience, a tab-separated list of votes already reduced to percent-affinities, affinities.dump.bz2, formatted:
    account_id, sr_id, affinity (scaled 0..1)
    
  • For your convenience, affinities-matrix.tar.bz2 contains the R CLUTO format matrix files affinities.cm, affinities.clabel, and affinities.rlabel

And the code

  • srrecs.pig, srrecs_researchers.pig — what I used to generate and format the dumps (you probably won’t need this)
  • mr_tools.py, srrecs.py — what I used to salt/hash the user information and generate the R CLUTO-format matrix files (you probably won’t need this unless you want different information in the matrix)
  • srrecs.r — the R-code to generate the clusters

Here’s what you can experiment with

  • The code isn’t nearly useable yet. We need to turn the generated clusters into an actual set of recommendations per cluster, preferably ordered by predicted match. We probably need to do some additional post-processing per user, too. (If they gave us an affinity of 0% to /r/askreddit, we shouldn’t recommend it, even if we predicted that the rest of their cluster would like it.)
  • We need a test suite to gauge the accuracy of the results of different approaches. This could be done by dividing the data-set and using 80% for training and 20% to see if the predictions made from that 80% match (see the sketch after this list).
  • We need to get the whole process to less than two hours, because that’s how often I want to run the recommender. It’s okay to use two or three machines to accomplish that, and a lot of the steps can be done in parallel. That said, we might just have to accept running it less often. It needs to run end-to-end with no user intervention, failing gracefully on error.
  • It would be handy to be able to identify the cluster of just a single user on-the-fly after generating the clusters in bulk.
  • The results need to be hooked into the reddit UI. If you’re willing to dive into the codebase, this one will be important as soon as the rest of the process is working and has a lot of room for creativity.
  • We need to find the sweet spot for the number of clusters to use. Put another way, how many different types of redditors do you think there are? This could best be done using the aforementioned test-suite and a good-old-fashioned binary search.
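
On the test-suite bullet, a toy sketch of the 80/20 idea (all names illustrative; dir is assumed to be 1 for an upvote, as in the dump):

# Hold out 20% of votes; predict direction in the held-out set from
# per-subreddit upvote rates learned on the training 80%
votes <- read.delim("votes.dump", header = FALSE,
                    col.names = c("account_id", "link_id", "sr_id", "dir"))
set.seed(1)
train <- sample(nrow(votes), floor(0.8 * nrow(votes)))
rates <- aggregate(dir ~ sr_id, data = votes[train, ],
                   FUN = function(v) mean(v == 1))
test  <- merge(votes[-train, ], rates, by = "sr_id",
               suffixes = c("", ".rate"))
mean((test$dir.rate > 0.5) == (test$dir == 1))    # naive baseline accuracy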

Some notes:

  • I’m not attached to doing this in R (I don’t even know much R, it just has a handy prebaked skmeans implementation). In fact I’m not attached to my methods here at all, I just want a good end-result.
  • This is my weekend fun project, so it’s likely to move very slowly if we don’t pick up enough participation here
  • The final version will run against the whole dataset, not just the public one. So even though I can’t release the whole dataset for privacy reasons, I can run your code and a test-suite against it

——————————————————————————————-

I am thinking of using Rattle with the arules package, and running it on EC2 to get the horsepower.
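
A minimal sketch of that arules idea, treating each user's upvoted subreddits as a "basket" and mining rules between subreddits (toy data in the dump's column layout; the support and confidence thresholds are illustrative):

# Toy association-rules sketch with the arules package
library(arules)    # install.packages("arules") if needed
votes <- data.frame(
  account_id = c("u1", "u1", "u2", "u2", "u3", "u3"),
  sr_id      = c("pics", "programming", "pics", "programming", "pics", "bacon"),
  dir        = c(1, 1, 1, 1, 1, 1))
up    <- votes[votes$dir == 1, ]    # keep upvotes only
trans <- as(split(up$sr_id, up$account_id), "transactions")
rules <- apriori(trans, parameter = list(supp = 0.3, conf = 0.5))
inspect(sort(rules, by = "lift"))   # candidate "likes X => try Y" rules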

How else do you think you can tackle a recommendation engine problem?

Ajay

Which software do we buy? -It depends


Often I am asked by clients, friends, and industry colleagues about the suitability or unsuitability of particular software for analytical needs. My answer is mostly:

It depends on:

1) The cost of a Type 1 error in the purchase decision versus a Type 2 error. (Forgive me if I mix up Type 1 with Type 2 error; I do have some weird childhood learning disabilities which crop up now and then.)

Here I define a Type 1 error as paying more for software when equivalent functionality was available at a lower price, or buying components you do not need, like SPSS Trends (when only SPSS Base is required) or SAS/ETS (when only SAS/STAT would do).

The first kind of error arises, of course, due to the presence of free tools with GUIs, like R with R Commander and Deducer (Rattle does have a $500 commercial version).

The emergence of software vendors like WPS (for SAS language aficionados), which offer similar functionality to Base SAS, as well as the increasing convergence of business analytics (read: predictive analytics) and business intelligence (read: reporting), has led to some brand clutter, in which all software packages promise to do everything, at very different prices, though they all have specific strengths and weaknesses. To add to this, there are comparatively fewer independent business analytics analysts than, say, independent business intelligence analysts.

2) Type 2 error: in this case, the opportunity cost of delayed projects, business models, or lower accuracy, the consequences of buying lower-priced software which had less functionality than you required.

To compound the magnitude of a Type 2 error, you are probably in some kind of vendor lock-in, your software budget is exhausted from buying too much or inappropriate software and hardware, and you could still use some added help in business analytics. The fear of making a business-critical error is a substantial reason why open source software has to work harder at proving itself competent: writing great software is not enough; it needs great marketing to sell it and great customer support to sustain it.

As business decisions are made under the constraints of time, information, and money, I will try to create a software purchase matrix based on my knowledge of known software (and their unknown strengths and weaknesses), pricing (versus budgets), and ranges of data handling. I will basically take an optimal approach based on known constraints, and add in flexibility for unknown operational constraints.

I will restrict this matrix to analytics software, though you could certainly extend it to other classes of enterprise software, including big data databases, infrastructure, and computing.

Noted assumptions: 1) I am vendor neutral and do not suffer from subjective bias or affection for particular software (based on conferences, books, relationships, consulting, etc.).

2) All software has bugs, so all of it needs customer support.

3) All software has particular advantages, strengths, and weaknesses in terms of functionality.

4) Cost includes the total cost of ownership and the opportunity cost of business-analytics-enabled decisions.

5) All software marketing people will praise their own software, sometimes over-selling and mis-selling product bundles.

The software compared are SPSS, KXEN, R, SAS, WPS, Revolution R, SQL Server, and various flavors and sub-components within these. The optimized approach will include parallel programming, cloud computing, hardware costs, and dependent software costs.

To be continued-

Hearst DataMining Challenge

Check out the Hearst Data Mining Challenge, a new competition sponsored by the DMA, Hearst Magazine, and EXL.

THE HEARST CHALLENGE STARTS ON OCTOBER 14TH

CHALLENGE DESCRIPTION

Over the years, the magazine publishing industry has made significant strides in improving subscription-based circulation by developing analytic frameworks that better predict customer response to acquisition and renewal offers. The objective of this contest is to apply the same analytic discipline and effectively predict newsstand location “response.” Specifically, the objective is to predict the number of copies to be placed in each newsstand location (typically referred to as draw) to optimize the overall contribution of the newsstand location.

Data for the competition is provided by CMG and Experian.

and

RULES

HOW TO ENTER: Beginning October 14th, 2010 at 12:01 AM (ET) through December 3rd, 2010 at 11:59 PM (ET) go to the Hearst Challenge website located at http://www.HearstChallenge.com (the “Site”) and complete and submit the entry form pursuant to the onscreen instructions. Entrants will be provided a historical sample of newsstand location draw, sales and associated location level data to help develop their predictive algorithm. Hearst will in turn hold back two distinct sets of draw/sales data, one to be used as a validation set by the contestant and one to be used as a final contest evaluation set. Entrants may not include any other external variables for the challenge. Additional details will be provided with the data. Entrants will be able to track their performance against the validation set throughout the course of the challenge via a leader tracking board to be made available on the Site. Entries must include the following documentation:

  • Data file with id variables and expected sales values by store and publication
  • The final model/algorithm code used to score the final data set
  • Any supporting documentation that pertains to the development of the submitted model/algorithm, including variable creation. Variables that were used in the model need to be traced through from input to coefficient/node (if using a tree-based methodology).

Check out http://www.hearstchallenge.com/index.php for further details.

John M. Chambers Statistical Software Award – 2011

Write code, win cash and glory. A deep bow to Father John M. Chambers, inventor of S, for endowing this award for statistical software creation by grads and undergrads.

An effort to be matched by companies like SAS and SPSS, which after all came from grad school work. Now back to the competition: I gotta get my homies from U Tenn in a team (I was a grad student last year, though I am taking this year off due to medico-financial reasons).

John M. Chambers Statistical Software Award – 2011
Statistical Computing Section
American Statistical Association

The Statistical Computing Section of the American Statistical
Association announces the competition for the John M. Chambers
Statistical Software Award. In 1998 the Association for Computing
Machinery presented its Software System Award to John Chambers for the
design and development of S. Dr. Chambers generously donated his award
to the Statistical Computing Section to endow an annual prize for
statistical software written by an undergraduate or graduate student.
The prize carries with it a cash award of $1000, plus a substantial
allowance for travel to the annual Joint Statistical Meetings where
the award will be presented.

Teams of up to 3 people can participate in the competition, with the
cash award being split among team members. The travel allowance will
be given to just one individual in the team, who will be presented the
award at JSM.  To be eligible, the team must have designed and
implemented a piece of statistical software.
The individual within
the team indicated to receive the travel allowance must have begun the
development while a student, and must either currently be a student,
or have completed all requirements for her/his last degree after
January 1, 2009.  To apply for the award, teams must provide the
following materials:

Current CVs of all team members.

A letter from a faculty mentor at the academic institution of the
individual indicated to receive the travel award.  The letter
should confirm that the individual had substantial participation in
the development of the software, certify her/his student status
when the software began to be developed (and either the current
student status or the date of degree completion), and briefly
discuss the importance of the software to statistical practice.

A brief, one to two page description of the software, summarizing
what it does, how it does it, and why it is an important
contribution.  If the team member competing for the travel
allowance has continued developing the software after finishing
her/his studies, the description should indicate what was developed
when the individual was a student and what has been added since.

An installable software package with its source code for use by the
award committee. It should be accompanied by enough information to allow
the judges to effectively use and evaluate the software (including
its design considerations). This information can be provided in a
variety of ways, including but not limited to a user manual (paper
or electronic), a paper, a URL, and online help to the system.

All materials must be in English.  We prefer that electronic text be
submitted in Postscript or PDF.  The entries will be judged on a
variety of dimensions, including the importance and relevance for
statistical practice of the tasks performed by the software, ease of
use, clarity of description, elegance and availability for use by the
statistical community. Preference will be given to those entries that
are grounded in software design rather than calculation.  The decision
of the award committee is final.

All application materials must be received by 5:00pm EST, Monday,
February 21, 2011 at the address below.  The winner will be announced
in May and the award will be given at the 2011 Joint Statistical
Meetings.

Information on the competition can also be accessed on the website of
the Statistical Computing Section (www.statcomputing.org or see the
ASA website, www.amstat.org for a pointer), including the names and
contributions of previous winners.  Inquiries and application
materials should be emailed or mailed to:

Chambers Software Award
c/o Fei Chen
Avaya Labs
233 Mt Airy Rd.
Basking Ridge, NJ 07920
feic@avaya.com
