Jim Goodnight on Open Source – and why he is right (sigh)


Jim Goodnight – grand old man and Godfather of the Cosa Nostra of the BI/database analytics software industry – recently said this about open source in BI. (Note that R is generally classed as business analytics, not business intelligence, software, so these remarks apply more to Pentaho and Jaspersoft.)

Asked whether open source BI and data integration software from the likes of Jaspersoft, Pentaho and Talend is a growing threat, [Goodnight] said: “We haven’t noticed that a lot. Most of our companies need industrial strength software that has been tested, put through every possible scenario or failure to make sure everything works correctly.”

The quotes from Jim Goodnight are courtesy of Jason's story here:
http://www.cbronline.com/news/sas-ceo-says-cep-open-source-and-cloud-bi-have-limited-appeal

and the Pentaho follow-up reaction is here:

http://bi.cbronline.com/news/pentaho-fires-back-across-sas-bows-over-limited-open-source-appeal


While you can rage and screech, here is the reality in terms of market share.

From Merv Adrian's excellent article on market share in BI:

http://www.enterpriseirregulars.com/22444/decoding-bi-market-share-numbers-%E2%80%93-play-sudoku-with-analysts/

The first, labeled BI Platforms, is drawn from Gartner Market Share Analysis: Business Intelligence, Analytics and Performance Management Software, Worldwide, 2009 (published May 2010) and Gartner Dataquest Market Share: Business Intelligence, Analytics and Performance Management Software, Worldwide, 2009.

and the second covers the Advanced Analytics category.

So what is the performance of Talend, Pentaho and Jaspersoft?

From http://www.dbms2.com/category/products-and-vendors/talend/

It seems that Talend’s revenue was somewhat shy of $10 million in 2008.

and Talend itself says

http://www.talend.com/press/Talend-Announces-Record-2009-and-Continues-Growth-in-the-New-Year.php

Additional 2009 highlights include:

  • Achieved record revenue, more than doubling from 2008. The fourth quarter of 2009 was Talend’s tenth consecutive quarter of growth.
  • Grew customer base by 140% to over 1,000 customers, up from 420 at the end of 2008. Of these new customers, over 50% are Fortune 1000 companies.
  • Total downloads reached seven million, with over 300,000 users of the open source products.
  • Talend doubled its staff, increasing to 200 global employees. Continuing this trend, Talend has already hired 15 people in 2010 to support its rapid growth.

Now for Jaspersoft's numbers:

http://www.dbms2.com/2008/09/14/jaspersoft-numbers/

Highlights include:

  • Revenue run rate in the double-digit millions.
  • 40% sequential growth most recent quarter. (I didn’t ask whether there was any reason to suspect seasonality.)
  • 130% annual revenue growth run rate.
  • “Not quite” profitable.
  • Several hundred commercial subscribers, at an average of $25K annually per, including >100 in Europe.
  • 9,000 paying customers of some kind.
  • 100,000+ total deployments, “very conservatively,” counting OEMs as one deployment each and not double-counting for OEMs’ customers. (Nick said Business Objects quotes 45,000 deployments by the same standards.)
  • 70% of revenue from the mid-market, defined as $100 million – $1 billion revenue. 30% from bigger enterprises. (Hmm. That begs a couple of questions, such as where OEM revenue comes in, and whether <$100 million enterprises were truly a negligible part of revenue.)

And for Pentaho's numbers:

http://www.dbms2.com/2009/01/27/introduction-to-pentaho/

and http://www.monash.com/uploads/Pentaho-January-2009.pdf

suggest they are far, far away from the top 5-6 vendors in BI.

And a special mention for PostgreSQL – a non-profit project that is nonetheless seriously denting Oracle/MySQL:

http://www.postgresql.org/about/

Limit                       Value
Maximum Database Size       Unlimited
Maximum Table Size          32 TB
Maximum Row Size            1.6 TB
Maximum Field Size          1 GB
Maximum Rows per Table      Unlimited
Maximum Columns per Table   250-1600 depending on column types
Maximum Indexes per Table   Unlimited

The leading commercial vendor is EnterpriseDB, which is both an IBM partner and IBM-funded.

http://www.sramanamitra.com/2009/05/18/enterprise-db/

and

http://www.enterprisedb.com/company/news_events/press_releases/2010_21.do

suggest it is still in the early stages.

————————————————————–

So what do we conclude?

1) There is a complete lack of transparency in open source BI market shares as almost all these companies are privately held and do not disclose revenues.

2) What looks like a pure-play open source company may actually be funded by a big BI vendor: Revolution Analytics is funded by, among others, Intel and Microsoft, and EnterpriseDB has IBM as an investor. MySQL and Sun, of course, were bought by Oracle.

The degree of control proprietary vendors have over open source vendors is still not disclosed – whether they hold a stake for strategic reasons or otherwise.

3) None of the open source vendors is even close to a $1 billion revenue figure.

Jim Goodnight is pointing out market reality when he says he has not seen much impact (in terms of market share). As for the rest of his remarks, well, he has a job to do as CEO, and that is to talk up his company and trash the competition – which he has been doing for three decades and is unlikely to change now unless there is a severe market share impact. You can hardly expect him to notice companies less than 5% of his size in revenue.


Using PostgreSQL and MySQL databases in R 2.12 for Windows


If you use Windows for your stats computing and your data is in a database (probably true for almost all corporate business analysts), R 2.12 has introduced a procedural hitch for you: NO BINARIES for the packages used until now to read from these databases.

The readme notes of the release say:

Packages related to many database systems must be linked to the exact
version of the database system the user has installed, hence it does
not make sense to provide binaries for packages
	RMySQL, ROracle, ROracleUI, RPostgreSQL
although it is possible to install such packages from sources by
	install.packages('packagename', type='source')
after reading the manual 'R Installation and Administration'.
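
Spelling out the route the readme suggests – a minimal sketch, assuming you have Rtools and the matching database client libraries installed (the package names below are from the list above):

#install a database driver package from source
#(requires Rtools plus the matching client libraries,
#per the manual 'R Installation and Administration')
install.packages("RPostgreSQL", type = "source")
install.packages("RMySQL", type = "source")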

So how do you connect to PostgreSQL and MySQL databases if you would rather not compile from source?

For PostgreSQL databases:

You can update your PostgreSQL installation here:

http://www.postgresql.org/download/windows

Fortunately the RpgSQL package is still available for PostgreSQL

  • Using the RpgSQL package

library(RpgSQL)

#creating a connection
con <- dbConnect(pgSQL(), user = "postgres", password = "XXXX", dbname = "postgres")

#writing a table from an R dataset
dbWriteTable(con, "BOD", BOD)

# table names are lower cased unless double quoted. Here we write a Select SQL query
dbGetQuery(con, 'select * from "BOD"')
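
#(optional check) list the tables on the connection to confirm the write;
#dbListTables is a standard DBI generic that RpgSQL implements
dbListTables(con)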

#disconnecting the connection
dbDisconnect(con)

You can also use the RODBC package to connect to your PostgreSQL database, but you need to configure your ODBC data sources in Windows first:

Start > Settings > Control Panel > Administrative Tools > Data Sources (ODBC)

In that dialog, note the name of your PostgreSQL DSN. (If it is not there, click Add, scroll to the appropriate driver – here PostgreSQL – and click Finish; fill in the default values for your database, or the values for a database you created; and remember to click Test to check that the username, password, port and so on are correct.)

Once the DSN is properly set up in ODBC (frightening terminology is part of databases), you can go back to R and connect using the RODBC package.


#loading RODBC

library(RODBC)

#creating a database connection
#arguments: DSN name, then username and password
#(the database itself is configured in the DSN)

chan <- odbcConnect("PostgreSQL35W", uid = "postgres", pwd = "XXXX")

#to list all table names

sqlTables(chan)

  TABLE_QUALIFIER TABLE_OWNER TABLE_NAME TABLE_TYPE REMARKS
1        postgres      public        bod      TABLE
2        postgres      public  database1      TABLE
3        postgres      public         tt      TABLE
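
From here, queries go through sqlQuery, and the channel should be closed when you are done – for instance, reading back the bod table we wrote earlier:

#querying a table over the ODBC channel
sqlQuery(chan, "select * from bod")

#closing the channel
odbcClose(chan)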

Now for MySQL databases it is exactly the same code, except that we download and install the ODBC driver from http://www.mysql.com/downloads/connector/odbc/

and then configure a DSN just as we did for PostgreSQL.

After that we use RODBC in much the same way, changing the username and password to the MySQL ones and the DSN name to the one created in the previous step.

#the database itself (Test) is configured in the DSN
channel <- odbcConnect("mysql", uid = "jasperdb", pwd = "XXX")
test2 <- sqlQuery(channel, "select * from jiuser")
test2
  id    username tenantId             fullname emailAddress                    password externallyDefined enabled
1  1 jasperadmin        1 Jasper Administrator           NA 349AFAADD5C5A2BD477309618DC                NA      01
2  2     joe1ser        1             Joe User           NA                 4DD8128D07A                NA      01
odbcClose(channel)

While being able to use RODBC for all databases is a welcome step, perhaps the release notes for Windows users of R could be more substantive than the ones given for R 2.12.2.

JMP Genomics 5 released


Close to the launch of JMP 9 with its R integration comes the announcement of JMP Genomics 5. The product brief is available at http://jmp.com/software/genomics/pdf/103112_jmpg5_prodbrief.pdf and it has an interesting mix of features. If you want to try them out, see http://jmp.com/software/license.shtml

For my part, I snagged some “new” stuff in this release:

  • Perform enrichment analysis using functional information from Ingenuity Pathways Analysis.+
  • New bar chart track allows summarization of reads or intensities.
  • New color map track displays heat plots of information for individual subjects.
  • Use a variety of continuous measures for summarization.
  • Using a common identifier, compare list membership for up to five groups and display overlaps with Venn diagrams.
  • Filter or shade segments by mean intensity, with an option to display segment mean intensity and set a reference value for shading.
  • Adjust intensities or counts for experimental samples using paired or grouped control samples.
  • Screen paired DNA and RNA intensities for allele-specific expression.
  • Standardize using a shifting factor and perform log2 transformation after standardization.
  • Use kernel density information in loess and quantile normalization.
  • Depict partition tree information graphically for standard models with the new Tree Viewer.
  • Predictive modeling for survival analysis with Harrell’s assessment method and integration with Cross-Validation Model Comparison.

That’s right – that is incorporating the work of our favorite professor from the R project himself: http://biostat.mc.vanderbilt.edu/wiki/Main/FrankHarrell

Apparently Prof. Frank E. Harrell was quite a SAS coder himself (see http://biostat.mc.vanderbilt.edu/wiki/Main/SasMacros).

Back to JMP Genomics 5. The JMP software platform provides:

• New integration capabilities that let R users leverage JMP’s interactive graphics to display analytic results.

• Tools for R programmers to build and package user interfaces that let them share customized R analytics with a broader audience.

• A new add-in infrastructure that simplifies the integration of external analytics into JMP.

 

+ For people in life sciences who like new stats software, you can also download a trial version of IPA at http://www.ingenuity.com/products/IPA/Free-Trial-Software.html

Doing Time Series using an R GUI


Until recently I had been thinking that RKWard was the only R GUI supporting time series models; however, Bob Muenchen of http://www.r4stats.com/ helpfully pointed out that the Epack plugin provides time series functionality to R Commander.

Note how the GUI helps you explore various time series functions.
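
If you want to try it yourself, here is a minimal sketch of getting the plugin going (assuming the CRAN package name RcmdrPlugin.epack, which is my reading of where Epack lives):

#install R Commander and the Epack time series plugin
install.packages(c("Rcmdr", "RcmdrPlugin.epack"))
#start R Commander, then load the plugin via
#Tools > Load Rcmdr plug-in(s) and let Rcmdr restart
library(Rcmdr)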

Using bulkfit you can fit various ARIMA models to a dataset and choose one based on minimum AIC:

 

> bulkfit(AirPassengers$x)
$res
ar d ma      AIC
[1,]  0 0  0 1790.368
[2,]  0 0  1 1618.863
[3,]  0 0  2 1522.122
[4,]  0 1  0 1413.909
[5,]  0 1  1 1397.258
[6,]  0 1  2 1397.093
[7,]  0 2  0 1450.596
[8,]  0 2  1 1411.368
[9,]  0 2  2 1394.373
[10,]  1 0  0 1428.179
[11,]  1 0  1 1409.748
[12,]  1 0  2 1411.050
[13,]  1 1  0 1401.853
[14,]  1 1  1 1394.683
[15,]  1 1  2 1385.497
[16,]  1 2  0 1447.028
[17,]  1 2  1 1398.929
[18,]  1 2  2 1391.910
[19,]  2 0  0 1413.639
[20,]  2 0  1 1408.249
[21,]  2 0  2 1408.343
[22,]  2 1  0 1396.588
[23,]  2 1  1 1378.338
[24,]  2 1  2 1387.409
[25,]  2 2  0 1440.078
[26,]  2 2  1 1393.882
[27,]  2 2  2 1392.659
$min
ar        d       ma      AIC
2.000    1.000    1.000 1378.338
> ArimaModel.5 <- Arima(AirPassengers$x,order=c(0,1,1),
+ include.mean=1,
+   seasonal=list(order=c(0,1,1),period=12))
> ArimaModel.5
Series: AirPassengers$x
ARIMA(0,1,1)(0,1,1)[12]
Call: Arima(x = AirPassengers$x, order = c(0, 1, 1), seasonal = list(order = c(0,      1, 1), period = 12), include.mean = 1)
Coefficients:
ma1     sma1
-0.3087  -0.1074
s.e.   0.0890   0.0828
sigma^2 estimated as 135.4:  log likelihood = -507.5
AIC = 1021   AICc = 1021.19   BIC = 1029.63
> summary(ArimaModel.5, cor=FALSE)
Series: AirPassengers$x
ARIMA(0,1,1)(0,1,1)[12]
Call: Arima(x = AirPassengers$x, order = c(0, 1, 1), seasonal = list(order = c(0,      1, 1), period = 12), include.mean = 1)
Coefficients:
ma1     sma1
-0.3087  -0.1074
s.e.   0.0890   0.0828
sigma^2 estimated as 135.4:  log likelihood = -507.5
AIC = 1021   AICc = 1021.19   BIC = 1029.63
In-sample error measures:
ME        RMSE         MAE         MPE        MAPE        MASE
0.32355285 11.09952005  8.16242469  0.04409006  2.89713514  0.31563730
Dataset79 <- predar3(ArimaModel.5,fore1=5)
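
If you skip the GUI entirely, the same minimum-AIC idea is only a few lines of base R. Here is a rough sketch of what bulkfit is doing (my reconstruction with stats::arima over a small (p,d,q) grid, not the Epack code itself):

#grid-search ARIMA orders on AirPassengers, keeping the minimum-AIC fit
best <- list(aic = Inf)
for (p in 0:2) for (d in 0:2) for (q in 0:2) {
  fit <- try(arima(AirPassengers, order = c(p, d, q)), silent = TRUE)
  if (!inherits(fit, "try-error") && fit$aic < best$aic)
    best <- list(order = c(p, d, q), aic = fit$aic)
}
best  #best (p,d,q) by AIC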

 

And I also found an interesting reference card for time series functions in R:

http://cran.r-project.org/doc/contrib/Ricci-refcard-ts.pdf

and a slightly more exhaustive time series ref card

http://www.statistische-woche-nuernberg-2010.org/lehre/bachelor/datenanalyse/Refcard3.pdf

Also of interest, a matter of opinion on issues in time series analysis in R:

http://www.stat.pitt.edu/stoffer/tsa2/Rissues.htm

Of course, if I were the sales manager for SAS ETS I would be worried, given R's increasing time series capabilities. But then again, the R GUIs for time series have some deficiencies:

1) The layout is not very elegant.

2) Not enough documented help (at least for the Epack GUI, and no integrated help ACROSS packages).

3) The graphical capabilities need more help documentation for interpreting the output (especially ACF and PACF plots).

More resources on time series using R:

http://people.bath.ac.uk/masgs/time%20series/TimeSeriesR2004.pdf

and http://www.statoek.wiso.uni-goettingen.de/veranstaltungen/zeitreihen/sommer03/ts_r_intro.pdf

and books

http://www.springer.com/economics/econometrics/book/978-0-387-77316-2

http://www.springer.com/statistics/statistical+theory+and+methods/book/978-0-387-75960-9

http://www.springer.com/statistics/statistical+theory+and+methods/book/978-0-387-75958-6

http://www.springer.com/statistics/statistical+theory+and+methods/book/978-0-387-75966-1

Getting Inside R


I loved the newly upgraded design of Inside-R, Revo's new(?) community.

And promptly shot up a blog application.

What makes Inside-R slightly better than SDC, AnalyticBridge, PlanetR and R-bloggers (with due respect):

  1. OpenID logins (I think that's a new and good step)
  2. Options for automated feed parsing for blogs
  3. More than just a blog aggregator – it includes sections on other stuff, so it is more like a community than one big feed
  4. Abbreviated feeds – just two or three lines of summary per post rather than the whole big schmakaround; that's a time saver for me (D. Smith is the only, lonely blogger there at the moment)
  5. The more the merrier – one more place to read and write R.


By the way, is the name Insider (as in a guy who knows inside stuff) or Inside-R (as in get inside the R box)? Just kidding. With PlyR, ManipulatR, ApplyR and now Inside-R, the pun gets MerrieR.

If my blog app gets rejected, these views may change. Grr.


Interesting R competition at Reddit


Here is an interesting R competition going on at Reddit and it is to help Reddit make a recommendation engine 🙂

http://www.reddit.com/r/redditdev/comments/dtg4j/want_to_help_reddit_build_a_recommender_a_public/

by ketralnis

As promised, here is the big dump of voting information that you guys donated to research. Warning: this contains much geekery that may result in discomfort for the nerd-challenged.

I’m trying to use it to build a recommender, and I’ve got some preliminary source code. I’m looking for feedback on all of these steps, since I’m not experienced at machine learning.

Here’s what I’ve done

  • I dumped all of the raw data that we’ll need to generate the public dumps. The queries are the comments in the two .pig files and it took about 52 minutes to do the dump against production. The result of this raw dump looks like:
    $ wc -l *.dump
     13,830,070 reddit_data_link.dump
    136,650,300 reddit_linkvote.dump
         69,489 reddit_research_ids.dump
     13,831,374 reddit_thing_link.dump
    
  • I filtered the list of votes for the list of users that gave us permission to use their data. For the curious, that’s 67,059 users: 62,763 with “public votes” and 6,726 with “allow my data to be used for research”. I’d really like to see that second category significantly increased, and hopefully this project will be what does it. This filtering is done by srrecs_researchers.pig and took 83m55.335s on my laptop.
  • I converted data-dumps that were in our DB schema format to a more usable format using srrecs.pig (about 13 min).
  • From that dump I mapped all of the account_ids, link_ids, and sr_ids to salted hashes (using obscure() in srrecs.py with a random seed, so even I don’t know it). This took about 13 min on my laptop. The result of this, votes.dump, is the file that is actually public. It is a tab-separated file consisting of:
    account_id,link_id,sr_id,dir
    

    There are 23,091,688 votes from 43,976 users over 3,436,063 links in 11,675 reddits. (Interestingly these ~44k users represent almost 17% of our total votes). The dump is 2.2gb uncompressed, 375mb in bz2.

What to do with it

The recommendations system that I’m trying right now turns those votes into a set of affinities. That is, “67% of user #223’s votes on /r/reddit.com are upvotes, and 52% on /r/programming”. To make these affinities (55m45.107s on my laptop):

 cat votes.dump | ./srrecs.py "affinities_m()" | sort -S200m | ./srrecs.py "affinities_r()" > affinities.dump

Then I turn the affinities into a sparse matrix representing N-dimensional co-ordinates in the vector space of affinities (scaled to -1..1 instead of 0..1), in the format used by R’s skmeans package (less than a minute on my laptop). Imagine that this matrix looks like

          reddit.com pics       programming horseporn  bacon
          ---------- ---------- ----------- ---------  -----
ketralnis -0.5       (no votes) +0.45       (no votes) +1.0
jedberg   (no votes) -0.25      +0.95       +1.0       -1.0
raldi     +0.75      +0.75      +0.7        (no votes) +1.0
...

We build it like:

# they were already grouped by account_id, so we don't have to
# sort. changes to the previous step will probably require this
# step to have to sort the affinities first
cat affinities.dump | ./srrecs.py "write_matrix('affinities.cm', 'affinities.clabel', 'affinities.rlabel')"

I pass that through an R program, srrecs.r (if you don’t have R installed, you’ll need to install it, plus the package skmeans, like install.packages('skmeans')). This program plots the users in this vector space, finding clusters using a spherical k-means clustering algorithm (on my laptop it takes about 10 minutes with 15 clusters and 16 minutes with 50 clusters, during which R sits at about 220mb of RAM).

# looks for the files created by write_matrix in the current directory
R -f ./srrecs.r

The output of the program is a generated list of cluster-IDs, corresponding in order to the order of user-IDs in affinities.clabel. The numbers themselves are meaningless, but people with the same cluster ID have been clustered together.
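
For the curious, the R side boils down to very little code. A minimal sketch with the skmeans package, using a made-up dense affinity matrix instead of the CLUTO files (illustrative only, not srrecs.r itself):

#load the spherical k-means package
library(skmeans)
set.seed(42)
#toy affinity matrix: rows = users, columns = subreddits, values in [-1, 1]
m <- matrix(runif(300, -1, 1), nrow = 30,
            dimnames = list(paste0("user", 1:30), paste0("sr", 1:10)))
#cluster users by the cosine similarity of their affinity vectors
fit <- skmeans(m, k = 5)
fit$cluster  #cluster ID per user, analogous to srrecs.r's output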

Here are the files

These are torrents of bzip2-compressed files. If you can’t use the torrents for some reason it’s pretty trivial to figure out from the URL how to get to the files directly on S3, but please try the torrents first since it saves us a few bucks. It’s S3 seeding the torrents anyway, so it’s unlikely that direct-downloading is going to go any faster or be any easier.

  • votes.dump.bz2 — A tab-separated list of:
    account_id, link_id, sr_id, direction
    
  • For your convenience, a tab-separated list of votes already reduced to percent-affinities affinities.dump.bz2, formatted:
    account_id, sr_id, affinity (scaled 0..1)
    
  • For your convenience, affinities-matrix.tar.bz2 contains the R CLUTO format matrix files affinities.cm, affinities.clabel and affinities.rlabel

And the code

  • srrecs.pig, srrecs_researchers.pig — what I used to generate and format the dumps (you probably won’t need this)
  • mr_tools.py, srrecs.py — what I used to salt/hash the user information and generate the R CLUTO-format matrix files (you probably won’t need this unless you want different information in the matrix)
  • srrecs.r — the R-code to generate the clusters

Here’s what you can experiment with

  • The code isn’t nearly useable yet. We need to turn the generated clusters into an actual set of recommendations per cluster, preferably ordered by predicted match. We probably need to do some additional post-processing per user, too. (If they gave us an affinity of 0% to /r/askreddit, we shouldn’t recommend it, even if we predicted that the rest of their cluster would like it.)
  • We need a test suite to gauge the accuracy of the results of different approaches. This could be done by dividing the data-set and using 80% for training and 20% to see whether the predictions made from that 80% match (see the sketch after this list).
  • We need to get the whole process to less than two hours, because that’s how often I want to run the recommender. It’s okay to use two or three machines to accomplish that, and a lot of the steps can be done in parallel. That said, we might just have to accept running it less often. It needs to run end-to-end with no user intervention, failing gracefully on error.
  • It would be handy to be able to identify the cluster of just a single user on the fly after generating the clusters in bulk.
  • The results need to be hooked into the reddit UI. If you’re willing to dive into the codebase, this one will be important as soon as the rest of the process is working and has a lot of room for creativity
  • We need to find the sweet spot for the number of clusters to use. Put another way, how many different types of redditors do you think there are? This could best be done using the aforementioned test-suite and a good-old-fashioned binary search.
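
On the test-suite bullet above: the 80/20 split itself is only a few lines of R – a generic sketch against the votes.dump columns, not part of ketralnis's code:

#hold out 20% of users so their votes can test predictions
#trained on the other 80%
votes <- read.delim("votes.dump", header = FALSE,
                    col.names = c("account_id", "link_id", "sr_id", "dir"))
users <- unique(votes$account_id)
train_users <- sample(users, size = floor(0.8 * length(users)))
train <- votes[votes$account_id %in% train_users, ]
test  <- votes[!votes$account_id %in% train_users, ]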

Some notes:

  • I’m not attached to doing this in R (I don’t even know much R, it just has a handy prebaked skmeans implementation). In fact I’m not attached to my methods here at all, I just want a good end-result.
  • This is my weekend fun project, so it’s likely to move very slowly if we don’t pick up enough participation here
  • The final version will run against the whole dataset, not just the public one. So even though I can’t release the whole dataset for privacy reasons, I can run your code and a test-suite against it

——————————————————————————————-

 

I am thinking of using Rattle with the arules package, and running it on EC2 to get the horsepower.
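
To make that concrete, here is a minimal sketch of the arules idea – each user's upvoted subreddits become one transaction. Column names follow the votes.dump format above, and dir == 1 marking an upvote is my assumption:

#market-basket view of the votes: each user's set of
#upvoted subreddits is one transaction
library(arules)
votes <- read.delim("votes.dump", header = FALSE,
                    col.names = c("account_id", "link_id", "sr_id", "dir"))
up <- votes[votes$dir == 1, ]  #assumes dir == 1 means an upvote
trans <- as(split(as.character(up$sr_id), up$account_id), "transactions")
#mine rules of the form "users who like X and Y also like Z"
rules <- apriori(trans, parameter = list(supp = 0.001, conf = 0.5))
inspect(head(sort(rules, by = "lift"), 10))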

How else do you think you can tackle a recommendation engine problem?

 

Ajay

 

Scoring SAS and SPSS Models in the cloud


An announcement from Zementis and Predixion Software about using cloud computing to score models via PMML. Note that R has a pmml package as well, which Rattle, the data mining GUI, uses for exporting models.
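
For context, exporting a model from R to PMML is nearly a one-liner with that package – a minimal sketch (the lm model is just an example):

#fit a simple model and export it as PMML XML
library(pmml)
library(XML)
model <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)
saveXML(pmml(model), file = "model_lm.pmml")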

Source: http://www.marketwatch.com/story/predixion-software-introduces-new-product-to-run-sas-and-spss-predictive-models-in-the-cloud-2010-10-19?reflink=MW_news_stmp

——————————————————————————————————–

ALISO VIEJO, Calif., Oct 19, 2010 (BUSINESS WIRE) — Predixion Software today introduced Predixion PMML Connexion(TM), an interface that provides Predixion Insight(TM), the company’s low-cost, self-service in the cloud predictive analytics solution, direct and seamless access to SAS, SPSS (IBM) and other predictive models for use by Predixion Insight customers. Predixion PMML Connexion enables companies to leverage their significant investments in legacy predictive analytics solutions at a fraction of the cost of conventional licensing and maintenance fees.

The announcement was made at the Predictive Analytics World conference in Washington, D.C. where Predixion also announced a strategic partnership with Zementis, Inc., a market leader in PMML-based solutions. Zementis is exhibiting in Booth #P2.

The Predictive Model Markup Language (PMML) standard allows for true interoperability, offering a mature standard for moving predictive models seamlessly between platforms. Predixion has fully integrated this PMML functionality into Predixion Insight, meaning Predixion Insight users can now effortlessly import PMML-based predictive models, enabling information workers to score the models in the cloud from anywhere and publish reports using Microsoft Excel(R) and SharePoint(R). In addition, models can also be written back into SAS, SPSS and other platforms for a truly collaborative, interoperable solution.

“Predixion’s investment in this PMML interface makes perfect business sense as the lion’s share of the models in existence today are created by the SAS and SPSS platforms, creating compelling opportunity to leverage existing investments in predictive and statistical models on a low-cost cloud predictive analytics platform that can be fed with enterprise, line of business and cloud-based data,” said Mike Ferguson, CEO of Intelligent Business Strategies, a leading analyst and consulting firm specializing in the areas of business intelligence and enterprise business integration. “In this economy, Predixion’s low-cost, self-service predictive analytics solutions might be welcome relief to IT organizations chartered with quickly adding additional applications while at the same time cutting costs and staffing.”

“We are pleased to be partnering with Zementis, truly a PMML market leader and innovator,” said Predixion CEO Simon Arkell. “To allow any SAS or SPSS customer to immediately score any of their predictive models in the cloud from within Predixion Insight, compare those models to those created by Predixion Insight, and share the results within Excel and Sharepoint is an exciting step forward for the industry. SAS and SPSS customers are fed up with the high prices they must pay for their business users just to access reports generated by highly skilled PhDs who are burdened by performing routine tasks and thus have become a massive bottleneck. That frustration is now a thing of the past because any information worker can now unlock the power of predictive analytics without relying on experts — for a fraction of the cost and from anywhere they can connect to the cloud,” Arkell said.

Dr. Michael Zeller, Zementis CEO, added, “Our mission is to significantly shorten the time-to-market for predictive models in any industry. We are excited to be contributing to Predixion’s self-service, cloud-based predictive analytics solution set.”

About Predixion Software

Predixion Software develops and markets collaborative predictive analytics solutions in the public and private cloud. Predixion enables self-service predictive analytics, allowing customers to use and analyze large amounts of data to make actionable decisions, all within the familiar environment of Excel and PowerPivot. Predixion customers are achieving immediate results across a multitude of industries including: retail, finance, healthcare, marketing, telecommunications and insurance/risk management.

Predixion Software is headquartered in Aliso Viejo, California with development offices in Redmond, Washington. The company has venture capital backing from established investors including DFJ Frontier, Miramar Venture Partners and Palomar Ventures. For more information please contact us at 949-330-6540, or visit us at www.predixionsoftware.com.

About Zementis

Zementis, Inc. is a leading software company focused on the operational deployment and integration of predictive analytics and data mining solutions. Its ADAPA(R) decision engine successfully bridges the gap between science and engineering. ADAPA(R) was designed from the ground up to benefit from open standards and to significantly shorten the time-to-market for predictive models in any industry. For more information, please visit www.zementis.com.