Tableau, which has been making waves recently with its great new data visualization tool, announced a partnership with my old friends at AsterData. It is a really cool piece of data visualization and very, very fast on the desktop, so I can imagine the speed it can bring to AsterData's MPP row-and-column Zingbang AND parallel analytical functions.
Tableau and AsterData also share the common Stanfordian connection (though it seems the software industry is divided quite equally between Stanford, Harvard dropouts, and North Carolina).
It remains to be seen how much each company can leverage the partnership, whether it turns out like the SAS Institute-AsterData partnership of last year, or whether it is just an announcement of connectors so their software can talk to each other.
AsterData remains the guys with the potential, but I would be wrong to say MapReduce-SQL is as hot in December 2010 as it was in June 2009, and the elephant in the room would be Hadoop. That, and Google's continued shyness about cashing in on its principal competency of handling Big Data (but hush, I signed an NDA with the Google Prediction API, so things maaaay change very rapidly on, ahem, that cloud).
Disclaimer: AsterData was my internship sponsor during my winter training while at the University of Tennessee.
Complex Event Processing (CEP, not to be confused with Circular Error Probable) is defined as processing the many events happening across all the layers of an organization, identifying the most meaningful events within the event cloud, analyzing their impact, and taking subsequent action in real time.
Oracle CEP is a Java application server for the development and deployment of high-performance event-driven applications. It can detect patterns in the flow of events and message payloads, often based on filtering, correlation, and aggregation across event sources, and includes industry-leading temporal and ordering capabilities. It supports ultra-high throughput (over 1 million events per second) and microsecond latency.
TIBCO is also trying to get into this market (it claims a 40% share of the public CEP market 😉 though probably it has not counted the DoE and DoD as worthy of market share yet).
What it is: Methods 1 through 3 look at historical data and traditional architectures with information stored in the warehouse. In this environment, it often takes months of data cleansing and preparation to get the data ready to analyze. Now, what if you want to make a decision or determine the effect of an action in real time, as a sale is made, for instance, or at a specific step in the manufacturing process? With streaming data architectures, you can look at data in the present and make immediate decisions. The larger flood of data coming from smartphones, online transactions, and smart-grid houses will continue to increase the amount of data that you might want to analyze but not keep. Real-time streaming, complex event processing (CEP) and analytics will all come together here to let you decide on the fly which data is worth keeping and which data to analyze in real time and then discard.
When you use it: Radio-frequency identification (RFID) offers a good use case for this type of architecture. RFID tags provide a lot of information, but unless the state of the item changes, you don't need to keep warehousing the data about that object every day. You only keep data when it moves through the door and out of the warehouse.
The same concept applies to a customer who does the same thing over and over. You don't need to keep storing data for analysis on a regular pattern, but if the customer changes that pattern, you might want to start paying attention.
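To make the keep-only-what-changes idea concrete, here is a minimal R sketch (the data frame and column names are hypothetical, not taken from any particular CEP product): it scans a stream of RFID-style readings and retains a reading only when the item's state changes.

# Hypothetical stream of RFID readings: one row per scan of an item
readings <- data.frame(
  item  = c("A", "A", "A", "B", "B", "A"),
  state = c("in_warehouse", "in_warehouse", "shipped",
            "in_warehouse", "in_warehouse", "shipped"),
  time  = as.POSIXct("2010-12-01") + c(0, 3600, 7200, 0, 3600, 10800),
  stringsAsFactors = FALSE
)

# Keep a reading only when an item's state differs from its previous state,
# i.e. discard the repeated "nothing changed" events and store only transitions
keep_state_changes <- function(stream) {
  stream <- stream[order(stream$item, stream$time), ]
  keep <- logical(nrow(stream))
  for (it in unique(stream$item)) {
    idx <- which(stream$item == it)
    s <- stream$state[idx]
    keep[idx] <- c(TRUE, s[-1] != s[-length(s)])
  }
  stream[keep, ]
}

keep_state_changes(readings)
# Only the first sighting of each item and the rows where its state flips survive;
# the rest of the flood can be processed on the fly and discarded.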
Figure 4: Traditional architecture vs. streaming architecture
In academia, there is something called the SASE language.
The query below retrieves the total trading volume of Google stock in the 4-hour period after some bad news occurred.
PATTERN SEQ(News a, Stock+ b[])
WHERE [symbol] AND a.type = 'bad' AND b[i].symbol = 'GOOG'
WITHIN 4 hours
HAVING b[b.LEN].volume < 80% * b[1].volume
RETURN sum(b[].volume)
The next query reports a one-hour period in which the price of a stock increased from 10 to 20 and its trading volume stayed relatively stable.
PATTERN SEQ(Stock+ a[])
WHERE [symbol] AND a[1].price = 10 AND a[i].price > a[i-1].price AND a[a.LEN].price = 20
WITHIN 1 hour
HAVING avg(a[].volume) ≥ a[1].volume
RETURN a[1].symbol, a[].price
The third query detects a more complex trend: in an hour, the volume of a stock started high, but after a period of price increasing or staying relatively stable, the volume plummeted.
PATTERN SEQ(Stock+ a[], Stock b)
WHERE [symbol] AND a[1].volume > 1000 AND a[i].price > avg(a[..i-1].price) AND b.volume < 80% * a[a.LEN].volume
WITHIN 1 hour
RETURN a[1].symbol, a[].(price, volume), b.(price, volume)
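For readers who think in R rather than event-pattern syntax, here is a rough sketch of what the first query computes, using a hypothetical data frame of time-stamped events (the column names are my own, not part of SASE, and the HAVING condition on falling volume is omitted for brevity): the total GOOG volume in the four hours after a bad-news event.

# Hypothetical event log: news items and stock ticks interleaved in time
events <- data.frame(
  type   = c("news", "stock", "stock", "stock", "stock"),
  tone   = c("bad",  NA,      NA,      NA,      NA),
  symbol = c(NA,     "GOOG",  "GOOG",  "AAPL",  "GOOG"),
  volume = c(NA,     1000,    800,     500,     600),
  time   = as.POSIXct("2010-12-01 09:30:00") + c(0, 600, 1800, 2400, 3 * 3600),
  stringsAsFactors = FALSE
)

# total trading volume of GOOG in the 4-hour window after the bad-news event
bad_news_time <- min(events$time[events$type == "news" & events$tone == "bad"])
in_window <- events$type == "stock" &
             events$symbol == "GOOG" &
             events$time > bad_news_time &
             events$time <= bad_news_time + 4 * 3600
sum(events$volume[in_window])   # analogous to RETURN sum(b[].volume)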
(Note from Ajay: I was not really happy with the depth of resources on CEP available online. There seem to be missing bits and pieces across open source, academic, and corporate information; one reason for this is the obvious military dual use of this technology, like feeds from satellites, audio scans, etc.)
Cisco SocialMiner is a social media customer care solution that can help you proactively respond to customers and prospects communicating through public social media networks like Twitter, Facebook, or other public forums or blogging sites. By providing social media monitoring, queuing, and workflow to organize customer posts on social media networks and deliver them to your social media customer care team, your company can respond to customers in real time using the same social network they are using.
Cisco SocialMiner provides:
• The ability to configure multiple campaigns to search for customer postings on the public social web about your company's products, services, or area of expertise
• Filtering of social contacts based on preconfigured campaign filters to focus campaign searches
• Routing of social contacts to skilled customer care representatives in the contact center or to experts in the enterprise; multiple people can work together to handle responses to customer postings through shared work queues
• Detailed metrics for social media customer care activities, campaign reports, and team reports
With Cisco SocialMiner, your company can listen and respond to customer conversations originating in the social web. Being proactive can help your company enhance its service, improve customer loyalty, garner new customers, and protect your brand.
Table 1. Features and Benefits of Cisco SocialMiner 8.5
Product Baseline Features
Social media feeds
• Feeds are configurable sources to capture public social contacts that contain specific words, terms, or phrases.
• Feeds enable you to search for information on the public social web about your company’s products, services, or area of expertise.
• Cisco SocialMiner supports several types of feeds, including Twitter, Facebook, and other public forums and blogging sites
Campaigns
• Groups feeds into campaigns to organize all posting activity related to a product category or business objective
• Produces metrics on campaign activity
• Provides the ability to configure multiple campaigns to search for customer postings on specific products or services
• Groups social contacts for handling by the social media customer care team
• Enables filtering of social contacts based on preconfigured campaign filters to focus campaign searches
Route and queue social contacts
• Enables routing of social contacts to skilled customer care representatives in the contact center
• Draws on expertise in the enterprise by allowing multiple people in the enterprise to work together to handle responses to customer postings through shared work queues
• Enables automated distribution of work to improve efficiency and effectiveness of social media engagement
Tagging
• Allows work to be routed to the appropriate team by grouping each post or social contact into different categories; for example, a post can be marked with the “customer_support” tag; this post will then appear on a customer support agent’s queue for processing
Social media customer care metrics
• Provides detailed metrics on social media customer care activities, campaign reports, and team reports
• Measures work and results
• Manages to service-level goals
• Supports brand management
• Optimizes staffing
• Includes dashboarding of social media posting activity when Cisco Unified Intelligence Center is used
Reporting for social contacts
• Provides a reporting database that can be accessed using any reporting tool, including Cisco Unified Intelligence Center
• Enables customer care management to accurately report on and track social media interactions by the contact center
OpenSocial-compliant gadgets
Representational State Transfer (REST) application programming interfaces (APIs)
• Provides flexible user interface options
• Enables extensive opportunities for customization
Optional integration with full suite of Cisco Collaboration tools
• Allows you to take advantage of the full suite of Cisco Collaboration tools, including Cisco Quad, Cisco Show and Share, and Cisco Pulse technology, to help your social media customer care team quickly find answers to help customers efficiently and effectively
• Requires a Cisco UCS C-Series or B-Series Server.
• Server consolidation means lower cost per server with Cisco UCS Servers.
Architecture
Scalability
• One server supports up to 30 simultaneous social media customer care users and 10,000 social contacts per hour.
Management
Cisco Unified Real-Time Monitoring Tool (RTMT)
• Operational management is enhanced through integration with the Cisco Unified RTMT, providing consistent application monitoring across Cisco Unified Communications Solutions.
Simple Network Management Protocol (SNMP)
• SNMP with an associated MIB is supported through the Cisco Voice Operating System (VOS).
Reporting
Cisco Unified Intelligence Center
• Create customizable reports of social media customer care events using Cisco Unified Intelligence Center (purchased separately).
Jim Goodnight, grand old man and godfather of the Cosa Nostra of the BI/database analytics software industry, recently said this about open source in BI (btw, R is generally classed as business analytics and NOT business intelligence software, so these remarks apply more to Pentaho and Jaspersoft):
Asked whether open source BI and data integration software from the likes of Jaspersoft, Pentaho and Talend is a growing threat, [Goodnight] said: “We haven’t noticed that a lot. Most of our companies need industrial strength software that has been tested, put through every possible scenario or failure to make sure everything works correctly.”
The first chart, labeled BI Platforms, is drawn from Gartner Market Share Analysis: Business Intelligence, Analytics and Performance Management Software, Worldwide, 2009 (published May 2010) and Gartner Dataquest Market Share: Business Intelligence, Analytics and Performance Management Software, Worldwide, 2009. The second chart covers the Advanced Analytics category.
So what is the performance of Talend, Pentaho and Jaspersoft?
Achieved record revenue, more than doubling from 2008. The fourth quarter of 2009 was Talend's tenth consecutive quarter of growth.
Grew customer base by 140% to over 1,000 customers, up from 420 at the end of 2008. Of these new customers, over 50% are Fortune 1000 companies.
Total downloads reached seven million, with over 300,000 users of the open source products.
Talend doubled its staff, increasing to 200 global employees. Continuing this trend, Talend has already hired 15 people in 2010 to support its rapid growth.
40% sequential growth most recent quarter. (I didn’t ask whether there was any reason to suspect seasonality.)
130% annual revenue growth run rate.
“Not quite” profitable.
Several hundred commercial subscribers, at an average of $25K annually per, including >100 in Europe.
9,000 paying customers of some kind.
100,000+ total deployments, “very conservatively,” counting OEMs as one deployment each and not double-counting for OEMs’ customers. (Nick said Business Objects quotes 45,000 deployments by the same standards.)
70% of revenue from the mid-market, defined as $100 million – $1 billion revenue. 30% from bigger enterprises. (Hmm. That begs a couple of questions, such as where OEM revenue comes in, and whether <$100 million enterprises were truly a negligible part of revenue.)
1) There is a complete lack of transparency in open source BI market shares as almost all these companies are privately held and do not disclose revenues.
2) What may be a pure-play open source company may actually be a company funded by a big BI vendor (Revolution Analytics is funded, among others, by Intel-Microsoft, and EnterpriseDB has IBM as an investor). MySQL and Sun, of course, were bought by Oracle.
The degree of control that proprietary vendors have over open source vendors is still not disclosed, nor is it clear whether they hold a stake for strategic reasons or otherwise.
3) None of the open source vendors is even close to a billion dollars in revenue.
Jim Goodnight is pointing out market reality when he says he has not seen much impact (in terms of market share). As for the rest of his remarks, well, he has a job to do as CEO, and that is to talk up his company and trash the competition, which he has been doing for three decades and is unlikely to change now unless there is a severe market share impact. You can hardly expect him to notice companies less than 5% of his size in revenue.
If you use Windows for your stats computing and your data is in a database (probably true for almost all corporate business analysts), R 2.12 presents a procedural hitch for you: NO BINARIES for the packages used until now to read from these databases.
The README notes of the release say:
Packages related to many database system must be linked to the exact
version of the database system the user has installed, hence it does
not make sense to provide binaries for packages
RMySQL, ROracle, ROracleUI, RPostgreSQL
although it is possible to install such packages from sources by
install.packages('packagename', type='source')
after reading the manual 'R Installation and Administration'.
So how do you connect to PostgreSQL and MySQL databases if the Windows binary is not available?
Fortunately the RpgSQL package is still available for PostgreSQL
Using the RpgSQL package
library(RpgSQL)
#creating a connection
con <- dbConnect(pgSQL(), user = "postgres", password = "XXXX",dbname="postgres")
#writing a table from a R Dataset
dbWriteTable(con, "BOD", BOD)
# table names are lower cased unless double quoted. Here we write a Select SQL query
dbGetQuery(con, 'select * from "BOD"')
#disconnecting the connection
dbDisconnect(con)
You can also use the RODBC package to connect to your PostgreSQL database, but you first need to configure your ODBC connections in Windows:
Start >
Settings > Control Panel >
Administrative Tools > Data Sources (ODBC)
You should probably see something like this screenshot.
Coming back to R, note the name of your PostgreSQL DSN from the screenshot above. (If it is not there, just click Add, scroll to the appropriate database driver, here PostgreSQL, and click Finish. Fill in the default values for your database, or the values for a database you have created; see the screenshot for help with the other settings, and remember to click Test to check that the username and password work, the port is correct, etc.)
Once the DSN is properly set up in ODBC (frightening terminology is part of databases), you can go to R and connect using the RODBC package.
#loading RODBC
library(RODBC)
#creating a Database connection
# connection string: DSN name, then username, password and database name
chan <- odbcConnect("PostgreSQL35W", "postgres;Password=X;Database=postgres")
#to list all table names
sqlTables(chan)
TABLE_QUALIFIER TABLE_OWNER TABLE_NAME TABLE_TYPE REMARKS
1 postgres public bod TABLE
2 postgres public database1 TABLE
3 postgres public tt TABLE
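As a small follow-up (assuming the same open channel as above), you can pull any of the listed tables straight into an R data frame with RODBC's sqlFetch, and push an R data frame back with sqlSave (the table name below is only illustrative):

# read the "bod" table (seen in the sqlTables listing above) into a data frame
bod_df <- sqlFetch(chan, "bod")
head(bod_df)
# write an R data frame back to the database as a new table
sqlSave(chan, data.frame(x = 1:3), tablename = "tt2")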
For MySQL, we then configure a DSN in the same way as we did for PostgreSQL.
After that we use RODBC in pretty much the same way, except that we change the username and password to the MySQL database's values and use the DSN name from the previous step.
# create a connection using the MySQL DSN name, username, password and database
channel <- odbcConnect("mysql", "jasperdb;Password=XXX;Database=Test")
# run a SQL query against the jiuser table and print the result
test2 <- sqlQuery(channel, "select * from jiuser")
test2
id username tenantId fullname emailAddress password externallyDefined enabled previousPasswordChangeTime
1 1 jasperadmin 1 Jasper Administrator NA 349AFAADD5C5A2BD477309618DC NA 0 1
2 2 joe1ser 1 Joe User NA 4DD8128D07A NA 0 1
odbcClose(channel)
While using RODBC for all databases is a welcome workaround, perhaps the change release notes for Windows users of R need to be more substantive than the one given for R 2.12.2.
Here is an interesting interview with Quentin G, CEO of AsterData. Marketing trumpeting apart, the insights on the "what's next" vision thing are quite good.
As you look down the road, what are the three major challenges you see for vendors who keep trying to solve big data and other “now” problems with old tools?
Old tools and traditional architectures cannot scale effectively to handle massive data volumes that reach hundreds of terabytes, nor can they process large data volumes in a high-performance manner. Further, they are restricted to what SQL querying allows. The three challenges I have noted are:
First, performance, specifically, poor performance on large data volumes and heavy workloads: The pre-existing systems rely on storing data in a traditional DBMS or data warehouse and then extracting a sample of data to a separate processing tier. This greatly restricts data insights and analytics as only a sample of data is analyzed and understood. As more data is stored in these systems they suffer from performance degradation as more users try to access the system concurrently. Additionally moving masses of data out of the traditional DBMS to a separate processing tier adds latency and slows down analytics and response times. This pre-existing architecture greatly limits performance especially as data sizes grow.
Second, limited analytics: Pre-existing systems rely mostly on SQL for data querying and analysis. SQL poses several limitations and is not suited for ad hoc querying, deep data exploration and a range of other analytics. MapReduce overcomes the limitations of SQL and SQL-MapReduce in particular opens up a new class of analytics that cannot be achieved with SQL alone.
And, third, limitations of types of data that can be stored and analyzed: Traditional systems are not designed for non-relational or unstructured data. New solutions such as Aster Data’s are designed from the ground up to handle both relational and non-relational data. Organizations want to store and process a range of data types and do this in a single platform. New solutions allow for different data types to be handled in a single platform whereas pre-existing architectures and solutions are specialized around a single data type or format – this restricts the diversity of analytics that can be performed on these systems.
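To make the second challenge (the limits of plain SQL) a little more concrete, here is a toy sketch in R of the map/reduce idea behind SQL-MapReduce. It is not Aster Data's implementation, just an illustration on hypothetical data: sessionizing a clickstream, an analysis that is awkward to express in plain SQL but natural as a per-key function applied in parallel.

# Hypothetical clickstream: one row per page view
clicks <- data.frame(
  user = c("u1", "u1", "u1", "u2", "u2"),
  time = as.POSIXct("2010-12-01 10:00:00") + c(0, 120, 4000, 0, 200),
  stringsAsFactors = FALSE
)

# "Map" step: partition the events by user key
by_user <- split(clicks, clicks$user)

# "Reduce" step: within each user, start a new session whenever the gap
# between successive clicks exceeds 30 minutes, then count the sessions
count_sessions <- function(df, gap = 30 * 60) {
  df <- df[order(df$time), ]
  new_session <- c(TRUE, diff(as.numeric(df$time)) > gap)
  sum(new_session)
}

sapply(by_user, count_sessions)
# u1 has two sessions (the 4000-second gap splits it); u2 has one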