Medi-ocre

I am surrounded by people
of dazzling brilliance, beauty and mind
Sometimes they are in the room in my face
Sometimes we interact digitally online


I would never be so cunning
So sharp, astute and yet so polite
I feel sometimes like a little cave man
who has stumbled upon the first artificial light

Or like a flattened sunflower
in a field of tall yellow poppy flower
I am bright but still a medium-ochre
In the middle of all that bright golden power

Maybe I will never be a genius
Die unrequited unsung like billions before
Hey I tried to live up to all that potential
But the pretending and defending was too much of a chore

so mediocre and such a medium ochre
my shining shall be twinkly winkly so-so
it was a blast and at least we tried
played, laughed, partied then died.

(images courtesy-http://sprott.physics.wisc.edu/fractals/carlson/)

Jim Goodnight on Open Source- and why he is right -sigh


Jim Goodnight, grand old man and Godfather of the Cosa Nostra of the BI/database analytics software industry, recently commented on open source in BI. (By the way, R is generally classed as business analytics and NOT business intelligence software, so these remarks apply more to Pentaho and Jaspersoft.)

Asked whether open source BI and data integration software from the likes of Jaspersoft, Pentaho and Talend is a growing threat, [Goodnight] said: “We haven’t noticed that a lot. Most of our companies need industrial strength software that has been tested, put through every possible scenario or failure to make sure everything works correctly.”

Quotes from Jim Goodnight are courtesy of Jason’s story here:
http://www.cbronline.com/news/sas-ceo-says-cep-open-source-and-cloud-bi-have-limited-appeal

and the Pentaho follow-up reaction is here

http://bi.cbronline.com/news/pentaho-fires-back-across-sas-bows-over-limited-open-source-appeal

 

 

While you can rage and screech, here is the reality in terms of market share.

From Merv Adrian’s excellent article on market shares in BI:

http://www.enterpriseirregulars.com/22444/decoding-bi-market-share-numbers-%E2%80%93-play-sudoku-with-analysts/

The first, labeled BI Platforms, is drawn from Gartner Market Share Analysis: Business Intelligence, Analytics and Performance Management Software, Worldwide, 2009, published May 2010, and Gartner Dataquest Market Share: Business Intelligence, Analytics and Performance Management Software, Worldwide, 2009.

and the Advanced Analytics category.

So what is the performance of Talend, Pentaho and Jaspersoft?

From http://www.dbms2.com/category/products-and-vendors/talend/

It seems that Talend’s revenue was somewhat shy of $10 million in 2008.

and Talend itself says

http://www.talend.com/press/Talend-Announces-Record-2009-and-Continues-Growth-in-the-New-Year.php

Additional 2009 highlights include:

  • Achieved record revenue, more than doubling from 2008. The fourth quarter of 2009 was Talend’s tenth consecutive quarter of growth.
  • Grew customer base by 140% to over 1,000 customers, up from 420 at the end of 2008. Of these new customers, over 50% are Fortune 1000 companies.
  • Total downloads reached seven million, with over 300,000 users of the open source products.
  • Talend doubled its staff, increasing to 200 global employees. Continuing this trend, Talend has already hired 15 people in 2010 to support its rapid growth.

Now for the Jaspersoft numbers:

http://www.dbms2.com/2008/09/14/jaspersoft-numbers/

Highlights include:

  • Revenue run rate in the double-digit millions.
  • 40% sequential growth most recent quarter. (I didn’t ask whether there was any reason to suspect seasonality.)
  • 130% annual revenue growth run rate.
  • “Not quite” profitable.
  • Several hundred commercial subscribers, at an average of $25K annually per, including >100 in Europe.
  • 9,000 paying customers of some kind.
  • 100,000+ total deployments, “very conservatively,” counting OEMs as one deployment each and not double-counting for OEMs’ customers. (Nick said Business Objects quotes 45,000 deployments by the same standards.)
  • 70% of revenue from the mid-market, defined as $100 million – $1 billion revenue. 30% from bigger enterprises. (Hmm. That begs a couple of questions, such as where OEM revenue comes in, and whether <$100 million enterprises were truly a negligible part of revenue.)

And for the Pentaho numbers:

http://www.dbms2.com/2009/01/27/introduction-to-pentaho/

and http://www.monash.com/uploads/Pentaho-January-2009.pdf

suggest they are far, far away from the top 5-6 vendors in BI.

And a special mention for PostgreSQL, which is a non-profit project but is seriously denting Oracle/MySQL:

http://www.postgresql.org/about/

Limit                        Value
Maximum Database Size        Unlimited
Maximum Table Size           32 TB
Maximum Row Size             1.6 TB
Maximum Field Size           1 GB
Maximum Rows per Table       Unlimited
Maximum Columns per Table    250 – 1600 depending on column types
Maximum Indexes per Table    Unlimited

The leading vendor is EnterpriseDB, which again is both an IBM partner and IBM-funded.

http://www.sramanamitra.com/2009/05/18/enterprise-db/

and

http://www.enterprisedb.com/company/news_events/press_releases/2010_21.do

suggest it is still in early stages.

————————————————————–

So what do we conclude?

1) There is a complete lack of transparency in open source BI market shares as almost all these companies are privately held and do not disclose revenues.

2) What may look like a pure-play open source company may actually be funded by a big BI vendor: Revolution Analytics is funded by, among others, Intel and Microsoft, and EnterpriseDB has IBM as an investor. MySQL and Sun, of course, were bought by Oracle.

The degree of control that proprietary vendors have over open source vendors is still not disclosed, nor is whether they hold a stake for strategic reasons or otherwise.

3) None of the Open Source Vendors are even close to a 1 Billion dollar revenue number.

Jim Goodnight is pointing out market reality when he says he has not seen much impact (in terms of market share). As for the rest of his remarks, well, he has a job to do as CEO, and that is to talk up his company and trash the competition, which he has been doing for three decades and is unlikely to change now unless there is a severe market share impact. You can hardly expect him to notice companies less than 5% of his size in revenue.


 

Using PostgreSQL and MySQL databases in R 2.12 for Windows


If you use Windows for your stats computing and your data is in a database (probably true for almost all corporate business analysts), R 2.12 has a unique procedural hitch for you: there are NO BINARIES for the packages used until now to read from these databases.

The Readme notes of the release say-

Packages related to many database system must be linked to the exact
version of the database system the user has installed, hence it does
not make sense to provide binaries for packages
	RMySQL, ROracle, ROracleUI, RPostgreSQL
although it is possible to install such packages from sources by
	install.packages('packagename', type='source')
after reading the manual 'R Installation and Administration'.

So how do you connect to PostgreSQL and MySQL databases when the Windows binaries are not available?

For PostgreSQL databases:

You can update your PostgreSQL installation here:

http://www.postgresql.org/download/windows

Fortunately the RpgSQL package is still available for PostgreSQL

  • Using the RpgSQL package

library(RpgSQL)

#creating a connection
con <- dbConnect(pgSQL(), user = "postgres", password = "XXXX",dbname="postgres")

#writing a table from a R Dataset
dbWriteTable(con, "BOD", BOD)

# table names are lower cased unless double quoted. Here we write a Select SQL query
dbGetQuery(con, 'select * from "BOD"')

#disconnecting the connection
dbDisconnect(con)

You can also use the RODBC package to connect to your PostgreSQL database, but you first need to configure your ODBC connections under

Windows Start > Settings > Control Panel > Administrative Tools > Data Sources (ODBC)

You should probably see something like this screenshot.

Coming back to R, note the name of your PostgreSQL DSN from the screenshot above. (If it is not there, click Add, scroll to the appropriate database driver (here PostgreSQL) and click Finish. Then fill in the default values for your database, or the values for a database you created yourself; see the screenshot for help with the other settings. Remember to click Test to check that the username and password work, the port is correct, etc.)

So once the DSN is properly set up in ODBC (frightening terminology is part of databases), you can go to R and connect using the RODBC package.


#loading RODBC

library(RODBC)

#creating a Database connection
# for username,password,database name and DSN name

chan=odbcConnect("PostgreSQL35W","postgres;Password=X;Database=postgres")

#to list all table names

sqlTables(chan)

   TABLE_QUALIFIER TABLE_OWNER TABLE_NAME TABLE_TYPE REMARKS
 1        postgres      public        bod      TABLE
 2        postgres      public  database1      TABLE
 3        postgres      public         tt      TABLE
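
The connect, query and close steps above can be wrapped into one small helper so that the channel is always closed, even if the query fails. This is only a sketch: the function name query_dsn is hypothetical, and it assumes the RODBC package is installed and that the DSN (here "PostgreSQL35W") already stores the database name, so only the user name and password are passed through the uid and pwd arguments of odbcConnect.

```r
# Hypothetical helper (not part of RODBC): open a DSN, run one query,
# and close the channel again when the function exits.
query_dsn <- function(dsn, uid, pwd, sql) {
  channel <- RODBC::odbcConnect(dsn, uid = uid, pwd = pwd)
  # odbcConnect() returns an object of class "RODBC" on success
  if (!inherits(channel, "RODBC")) stop("could not connect to DSN: ", dsn)
  on.exit(RODBC::odbcClose(channel), add = TRUE)
  RODBC::sqlQuery(channel, sql)
}

# Example call (not run here; DSN name and credentials are placeholders):
# bod <- query_dsn("PostgreSQL35W", "postgres", "XXXX", "select * from bod")
```

The same helper should work for any DSN configured in the ODBC panel, since only the DSN name and credentials change.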

For MySQL databases the code is exactly the same, except that we first download and install the ODBC driver from http://www.mysql.com/downloads/connector/odbc/

and then configure a DSN in the same way as we did for PostgreSQL.

After that we use RODBC in pretty much the same way, changing only the username and password to those for MySQL and the DSN name to the one created in the previous step.

channel <- odbcConnect("mysql","jasperdb;Password=XXX;Database=Test")
test2=sqlQuery(channel,"select * from jiuser")
test2
  id    username tenantId             fullname emailAddress                    password externallyDefined enabled previousPasswordChangeTime
1  1 jasperadmin        1 Jasper Administrator           NA 349AFAADD5C5A2BD477309618DC                NA      01
2  2     joe1ser        1             Joe User           NA                 4DD8128D07A                NA      01
odbcClose(channel)

While using RODBC for all databases is a welcome step, perhaps the release notes for Windows users of R need to be more substantive than the one given for R 2.12.2.

JMP Genomics 5 released


Close on the heels of the launch of JMP9 with its R integration comes the release of JMP Genomics 5. The product brief is available here http://jmp.com/software/genomics/pdf/103112_jmpg5_prodbrief.pdf and it has an interesting mix of features. If you want to try out the features you can see http://jmp.com/software/license.shtml

Here is some of the “new” stuff I spotted in this release:

  • Perform enrichment analysis using functional information from Ingenuity Pathways Analysis.+
  • New bar chart track allows summarization of reads or intensities.
  • New color map track displays heat plots of information for individual subjects.
  • Use a variety of continuous measures for summarization.
  • Using a common identifier, compare list membership for up to five groups and display overlaps with Venn diagrams.
  • Filter or shade segments by mean intensity, with an option to display segment mean intensity and set a reference value for shading.
  • Adjust intensities or counts for experimental samples using paired or grouped control samples.
  • Screen paired DNA and RNA intensities for allele-specific expression.
  • Standardize using a shifting factor and perform log2 transformation after standardization.
  • Use kernel density information in loess and quantile normalization.
  • Depict partition tree information graphically for standard models with new Tree Viewer
  • Predictive modeling for survival analysis with Harrell’s assessment method and integration with Cross-Validation Model Comparison.

That’s right, that last one incorporates the work of our favorite professor from the R Project himself: http://biostat.mc.vanderbilt.edu/wiki/Main/FrankHarrell

Apparently Prof Frank E was quite a SAS coder himself (see http://biostat.mc.vanderbilt.edu/wiki/Main/SasMacros)

Back to JMP Genomics 5-

The JMP software platform provides:

  • New integration capabilities let R users leverage JMP’s interactive graphics to display analytic results.
  • Tools for R programmers to build and package user interfaces that let them share customized R analytics with a broader audience.
  • A new add-in infrastructure that simplifies the integration of external analytics into JMP.

 

+ For people in life sciences who like new stats software you can also download a trial version of IPA here at http://www.ingenuity.com/products/IPA/Free-Trial-Software.html

The Best and Worst Graphs Ever

From http://www.math.yorku.ca/SCS/Gallery/ a great selection

March of the Pies

and the renowned best graph ever

and the graph with the biggest lie factor, i.e. the most distorted graph

Amazon goes free for users next month


Amazon EC2 and company announced a free year-long tier for new users. You can’t beat free 🙂

http://aws.amazon.com/free/

AWS Free Usage Tier

To help new AWS customers get started in the cloud, AWS is introducing a new free usage tier. Beginning November 1, new AWS customers will be able to run a free Amazon EC2 Micro Instance for a year, while also leveraging a new free usage tier for Amazon S3, Amazon Elastic Block Store, Amazon Elastic Load Balancing, and AWS data transfer. AWS’s free usage tier can be used for anything you want to run in the cloud: launch new applications, test existing applications in the cloud, or simply gain hands-on experience with AWS.

Below are the highlights of AWS’s new free usage tiers. All are available for one year (except Amazon SimpleDB, SQS, and SNS which are free indefinitely):

AWS’s free usage tier starts November 1, 2010. A valid credit card is required to sign up.

AWS Free Usage Tier (Per Month):

In addition to these services, the AWS Management Console is available at no charge to help you build and manage your application on AWS.

* These free tiers are only available to new AWS customers and are available for 12 months following your AWS sign-up date. When your free usage expires or if your application use exceeds the free usage tiers, you simply pay standard, pay-as-you-go service rates (see each service page for full pricing details). Restrictions apply; see offer terms for more details.

** These free tiers do not expire after 12 months and are available to both existing and new AWS customers indefinitely.

The new AWS free usage tier applies to participating services across all AWS regions: US – N. Virginia, US – N. California, EU – Ireland, and APAC – Singapore. Your free usage is calculated each month across all regions and automatically applied to your bill – free usage does not accumulate.

 

Doing Time Series using a R GUI


Until recently I had thought that RKWard was the only R GUI supporting time series models.

However, Bob Muenchen of http://www.r4stats.com/ helpfully pointed out that the Epack plugin provides time series functionality to R Commander.

Note that the GUI helps you explore the various time series functions.

Using Bulkfit you can fit a range of ARIMA models (AR order, degree of differencing d and MA order each from 0 to 2) to a dataset and choose the one with the minimum AIC:

 

> bulkfit(AirPassengers$x)
$res
      ar d ma      AIC
 [1,]  0 0  0 1790.368
 [2,]  0 0  1 1618.863
 [3,]  0 0  2 1522.122
 [4,]  0 1  0 1413.909
 [5,]  0 1  1 1397.258
 [6,]  0 1  2 1397.093
 [7,]  0 2  0 1450.596
 [8,]  0 2  1 1411.368
 [9,]  0 2  2 1394.373
[10,]  1 0  0 1428.179
[11,]  1 0  1 1409.748
[12,]  1 0  2 1411.050
[13,]  1 1  0 1401.853
[14,]  1 1  1 1394.683
[15,]  1 1  2 1385.497
[16,]  1 2  0 1447.028
[17,]  1 2  1 1398.929
[18,]  1 2  2 1391.910
[19,]  2 0  0 1413.639
[20,]  2 0  1 1408.249
[21,]  2 0  2 1408.343
[22,]  2 1  0 1396.588
[23,]  2 1  1 1378.338
[24,]  2 1  2 1387.409
[25,]  2 2  0 1440.078
[26,]  2 2  1 1393.882
[27,]  2 2  2 1392.659
$min
      ar        d       ma      AIC
   2.000    1.000    1.000 1378.338
> ArimaModel.5 <- Arima(AirPassengers$x,order=c(0,1,1),
+ include.mean=1,
+   seasonal=list(order=c(0,1,1),period=12))
> ArimaModel.5
Series: AirPassengers$x
ARIMA(0,1,1)(0,1,1)[12]
Call: Arima(x = AirPassengers$x, order = c(0, 1, 1), seasonal = list(order = c(0,      1, 1), period = 12), include.mean = 1)
Coefficients:
          ma1     sma1
      -0.3087  -0.1074
s.e.   0.0890   0.0828
sigma^2 estimated as 135.4:  log likelihood = -507.5
AIC = 1021   AICc = 1021.19   BIC = 1029.63
> summary(ArimaModel.5, cor=FALSE)
Series: AirPassengers$x
ARIMA(0,1,1)(0,1,1)[12]
Call: Arima(x = AirPassengers$x, order = c(0, 1, 1), seasonal = list(order = c(0,      1, 1), period = 12), include.mean = 1)
Coefficients:
          ma1     sma1
      -0.3087  -0.1074
s.e.   0.0890   0.0828
sigma^2 estimated as 135.4:  log likelihood = -507.5
AIC = 1021   AICc = 1021.19   BIC = 1029.63
In-sample error measures:
         ME        RMSE         MAE         MPE        MAPE        MASE
 0.32355285 11.09952005  8.16242469  0.04409006  2.89713514  0.31563730
Dataset79 <- predar3(ArimaModel.5,fore1=5)
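
For readers without the Epack plugin, the kind of minimum-AIC search shown in the Bulkfit output above can be sketched in base R. This is an approximation under stated assumptions, not Epack's own code: it uses the built-in AirPassengers series rather than the R Commander dataset, fits each order with stats::arima at its default settings, and the helper name fit_aic is made up for illustration. The AIC values may therefore differ from those printed above.

```r
# Grid search over ARIMA(ar, d, ma) orders, keeping the fit with the lowest AIC.
# Assumption: this mimics the Bulkfit table above; it is not Epack's implementation.
y <- AirPassengers  # built-in monthly airline passenger series (a ts object)

# All 27 combinations of ar, d, ma in 0..2; ma varies fastest, as in the table above
orders <- expand.grid(ma = 0:2, d = 0:2, ar = 0:2)

# Hypothetical helper: AIC of one fit, or NA if arima() fails for that order
fit_aic <- function(ar, d, ma) {
  fit <- tryCatch(arima(y, order = c(ar, d, ma)), error = function(e) NULL)
  if (is.null(fit)) NA_real_ else AIC(fit)
}

orders$AIC <- mapply(fit_aic, orders$ar, orders$d, orders$ma)
res <- orders[, c("ar", "d", "ma", "AIC")]

# which.min() skips NA values, so failed fits are ignored
best <- res[which.min(res$AIC), ]
print(best)
```

Note that this search covers non-seasonal orders only; the seasonal ARIMA(0,1,1)(0,1,1)[12] model fitted above is specified separately through the seasonal argument.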

 

I also found an interesting reference sheet for time series functions in R:

http://cran.r-project.org/doc/contrib/Ricci-refcard-ts.pdf

and a slightly more exhaustive time series ref card

http://www.statistische-woche-nuernberg-2010.org/lehre/bachelor/datenanalyse/Refcard3.pdf

Also of interest is an opinion piece on issues in time series analysis in R at

http://www.stat.pitt.edu/stoffer/tsa2/Rissues.htm

Of course, if I were the sales manager for SAS ETS, I would be worried, given the increasing capabilities in time series in R. But then again, the R GUIs for time series still have some deficiencies:

1) Layout is not very elegant

2) Not enough documented help (at least for the Epack GUI, and there is no integrated help ACROSS packages)

3) Graphical capabilities need more help documentation to interpret the output (especially in ACF and PACF plots)

More resources on Time Series using R.

http://people.bath.ac.uk/masgs/time%20series/TimeSeriesR2004.pdf

and http://www.statoek.wiso.uni-goettingen.de/veranstaltungen/zeitreihen/sommer03/ts_r_intro.pdf

and books

http://www.springer.com/economics/econometrics/book/978-0-387-77316-2

http://www.springer.com/statistics/statistical+theory+and+methods/book/978-0-387-75960-9

http://www.springer.com/statistics/statistical+theory+and+methods/book/978-0-387-75958-6

http://www.springer.com/statistics/statistical+theory+and+methods/book/978-0-387-75966-1