Tableau, which has been making waves recently with its great new data visualization tool, announced a partnership with my old friends at AsterData. It is a really cool piece of data visualization and very, very fast on the desktop, so I can imagine the speed it can bring to AsterData's MPP row-and-column Zingbang AND parallel analytical functions.
Tableau and AsterData also share the common Stanfordian connection (though it seems the software industry is divided quite equally between Stanford, Harvard dropouts, and North Carolina).
It remains to be seen how much each company can leverage the partnership, whether it turns out like the SAS Institute-AsterData partnership of last year, or whether it is just an announcement of connectors so their software can talk to each other.
AsterData remains the guys with the potential, but I would be wrong to say MapReduce-SQL is as hot in December 2010 as it was in June 2009, and the elephant in the room would be Hadoop. That, and Google's continued shyness about cashing in on its principal competency of handling Big Data (but hush, I signed an NDA with the Google Prediction API, so things maaaay change very rapidly on, ahem, that cloud).
Disclaimer: AsterData was my internship sponsor during my winter training while at the University of Tennessee.
Complex Event Processing (CEP, not to be confused with Circular Error Probable) is defined as processing the many events happening across all the layers of an organization, identifying the most meaningful events within the event cloud, analyzing their impact, and taking subsequent action in real time.
Oracle CEP is a Java application server for the development and deployment of high-performance event-driven applications. It can detect patterns in the flow of events and message payloads, often based on filtering, correlation, and aggregation across event sources, and includes industry-leading temporal and ordering capabilities. It supports ultra-high throughput (over 1 million events per second) and microsecond latency.
TIBCO is also trying to get into this market (it claims a 40% share of the public CEP market 😉 though probably it has not counted the DoE and DoD as worthy of market share yet).
What it is: Methods 1 through 3 look at historical data and traditional architectures with information stored in the warehouse. In this environment, it often takes months of data cleansing and preparation to get the data ready to analyze. Now, what if you want to make a decision or determine the effect of an action in real time, as a sale is made, for instance, or at a specific step in the manufacturing process? With streaming data architectures, you can look at data in the present and make immediate decisions. The larger flood of data coming from smartphones, online transactions, and smart-grid houses will continue to increase the amount of data that you might want to analyze but not keep. Real-time streaming, complex event processing (CEP) and analytics will all come together here to let you decide on the fly which data is worth keeping and which data to analyze in real time and then discard.
When you use it: Radio-frequency identification (RFID) offers a good use case for this type of architecture. RFID tags provide a lot of information, but unless the state of the item changes, you don't need to keep warehousing the data about that object every day. You only keep data when it moves through the door and out of the warehouse.
The same concept applies to a customer who does the same thing over and over. You don't need to keep storing data for analysis on a regular pattern, but if the customer changes that pattern, you might want to start paying attention.
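To make the keep-only-what-changes idea concrete, here is a minimal R sketch (the data frame and column names are hypothetical, not taken from any particular CEP product): it scans a stream of RFID-style readings and retains a reading only when the item's state changes.

# Hypothetical stream of RFID readings: one row per scan of an item
readings <- data.frame(
  item  = c("A", "A", "A", "B", "B", "A"),
  state = c("in_warehouse", "in_warehouse", "shipped",
            "in_warehouse", "in_warehouse", "shipped"),
  time  = as.POSIXct("2010-12-01") + c(0, 3600, 7200, 0, 3600, 10800),
  stringsAsFactors = FALSE
)

# Keep a reading only when an item's state differs from its previous state,
# i.e. discard the repeated "nothing changed" events and store only transitions
keep_state_changes <- function(stream) {
  stream <- stream[order(stream$item, stream$time), ]
  keep <- logical(nrow(stream))
  for (it in unique(stream$item)) {
    idx <- which(stream$item == it)
    s <- stream$state[idx]
    keep[idx] <- c(TRUE, s[-1] != s[-length(s)])
  }
  stream[keep, ]
}

keep_state_changes(readings)
# Only the first sighting of each item and the rows where its state flips survive;
# the rest of the flood can be processed on the fly and discarded.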
Figure 4: Traditional architecture vs. streaming architecture
In academia, there is something called the SASE language.
The query below retrieves the total trading volume of Google stock in the 4-hour period after some bad news occurred.
PATTERN SEQ(News a, Stock+ b[])
WHERE [symbol] AND a.type = 'bad' AND b[i].symbol = 'GOOG'
WITHIN 4 hours
HAVING b[b.LEN].volume < 80% * b[1].volume
RETURN sum(b[].volume)
The next query reports a one-hour period in which the price of a stock increased from 10 to 20 and its trading volume stayed relatively stable.
PATTERN SEQ(Stock+ a[])
WHERE [symbol] AND a[1].price = 10 AND a[i].price > a[i-1].price AND a[a.LEN].price = 20
WITHIN 1 hour
HAVING avg(a[].volume) ≥ a[1].volume
RETURN a[1].symbol, a[].price
The third query detects a more complex trend: in an hour, the volume of a stock started high, but after a period of price increasing or staying relatively stable, the volume plummeted.
PATTERN SEQ(Stock+ a[], Stock b)
WHERE [symbol] AND a[1].volume > 1000 AND a[i].price > avg(a[..i-1].price) AND b.volume < 80% * a[a.LEN].volume
WITHIN 1 hour
RETURN a[1].symbol, a[].(price, volume), b.(price, volume)
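For readers who think in R rather than event-pattern syntax, here is a rough sketch of what the first query computes, using a hypothetical data frame of time-stamped events (the column names are my own, not part of SASE, and the HAVING condition on falling volume is omitted for brevity): the total GOOG volume in the four hours after a bad-news event.

# Hypothetical event log: news items and stock ticks interleaved in time
events <- data.frame(
  type   = c("news", "stock", "stock", "stock", "stock"),
  tone   = c("bad",  NA,      NA,      NA,      NA),
  symbol = c(NA,     "GOOG",  "GOOG",  "AAPL",  "GOOG"),
  volume = c(NA,     1000,    800,     500,     600),
  time   = as.POSIXct("2010-12-01 09:30:00") + c(0, 600, 1800, 2400, 3 * 3600),
  stringsAsFactors = FALSE
)

# total trading volume of GOOG in the 4-hour window after the bad-news event
bad_news_time <- min(events$time[events$type == "news" & events$tone == "bad"])
in_window <- events$type == "stock" &
             events$symbol == "GOOG" &
             events$time > bad_news_time &
             events$time <= bad_news_time + 4 * 3600
sum(events$volume[in_window])   # analogous to RETURN sum(b[].volume)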
(Note from Ajay: I was not really happy with the depth of resources on CEP available online. There seem to be missing bits and pieces across open source, academic, and corporate information; one reason for this is the obvious military dual use of this technology, like feeds from satellites, audio scans, etc.)
Cisco SocialMiner is a social media customer care solution that can help you proactively respond to customers and prospects communicating through public social media networks like Twitter, Facebook, or other public forums or blogging sites. By providing social media monitoring, queuing, and workflow to organize customer posts on social media networks and deliver them to your social media customer care team, your company can respond to customers in real time using the same social network they are using.
Cisco SocialMiner provides:
• The ability to configure multiple campaigns to search for customer postings on the public social web about your company's products, services, or area of expertise
• Filtering of social contacts based on preconfigured campaign filters to focus campaign searches
• Routing of social contacts to skilled customer care representatives in the contact center or to experts in the enterprise; multiple people can work together to handle responses to customer postings through shared work queues
• Detailed metrics for social media customer care activities, campaign reports, and team reports
With Cisco SocialMiner, your company can listen and respond to customer conversations originating in the social web. Being proactive can help your company enhance its service, improve customer loyalty, garner new customers, and protect your brand.
Table 1. Features and Benefits of Cisco SocialMiner 8.5
Product Baseline Features
Social media feeds
• Feeds are configurable sources to capture public social contacts that contain specific words, terms, or phrases.
• Feeds enable you to search for information on the public social web about your company’s products, services, or area of expertise.
• Cisco SocialMiner supports several types of feeds, including Twitter, Facebook, and other public forums and blogging sites
Campaigns
• Groups feeds into campaigns to organize all posting activity related to a product category or business objective
• Produces metrics on campaign activity
• Provides the ability to configure multiple campaigns to search for customer postings on specific products or services
• Groups social contacts for handling by the social media customer care team
• Enables filtering of social contacts based on preconfigured campaign filters to focus campaign searches
Route and queue social contacts
• Enables routing of social contacts to skilled customer care representatives in the contact center
• Draws on expertise in the enterprise by allowing multiple people in the enterprise to work together to handle responses to customer postings through shared work queues
• Enables automated distribution of work to improve efficiency and effectiveness of social media engagement
Tagging
• Allows work to be routed to the appropriate team by grouping each post or social contact into different categories; for example, a post can be marked with the “customer_support” tag; this post will then appear on a customer support agent’s queue for processing
Social media customer care metrics
• Provides detailed metrics on social media customer care activities, campaign reports, and team reports
• Measures work and results
• Manages to service-level goals
• Supports brand management
• Optimizes staffing
• Includes dashboarding of social media posting activity when Cisco Unified Intelligence Center is used
Reporting for social contacts
• Provides a reporting database that can be accessed using any reporting tool, including Cisco Unified Intelligence Center
• Enables customer care management to accurately report on and track social media interactions by the contact center
OpenSocial-compliant gadgets
Representational State Transfer (REST) application programming interfaces (APIs)
• Provides flexible user interface options
• Enables extensive opportunities for customization
Optional integration with full suite of Cisco Collaboration tools
• Allows you to take advantage of the full suite of Cisco Collaboration tools, including Cisco Quad, Cisco Show and Share, and Cisco Pulse technology, to help your social media customer care team quickly find answers to help customers efficiently and effectively
• Requires a Cisco UCS C-Series or B-Series Server.
• Server consolidation means lower cost per server with Cisco UCS Servers.
Architecture
Scalability
• One server supports up to 30 simultaneous social media customer care users and 10,000 social contacts per hour.
Management
Cisco Unified Real-Time Monitoring Tool (RTMT)
• Operational management is enhanced through integration with the Cisco Unified RTMT, providing consistent application monitoring across Cisco Unified Communications Solutions.
Simple Network Management Protocol (SNMP)
• SNMP with an associated MIB is supported through the Cisco Voice Operating System (VOS).
Reporting
Cisco Unified Intelligence Center
• Create customizable reports of social media customer care events using Cisco Unified Intelligence Center (purchased separately).
Jim Goodnight, grand old man and godfather of the Cosa Nostra of the BI/database analytics software industry, recently said this about open source in BI (btw, R is generally classed as business analytics and NOT business intelligence software, so these remarks apply more to Pentaho and Jaspersoft):
Asked whether open source BI and data integration software from the likes of Jaspersoft, Pentaho and Talend is a growing threat, [Goodnight] said: “We haven’t noticed that a lot. Most of our companies need industrial strength software that has been tested, put through every possible scenario or failure to make sure everything works correctly.”
The first chart, labeled BI Platforms, is drawn from Gartner Market Share Analysis: Business Intelligence, Analytics and Performance Management Software, Worldwide, 2009 (published May 2010) and Gartner Dataquest Market Share: Business Intelligence, Analytics and Performance Management Software, Worldwide, 2009. The second chart covers the Advanced Analytics category.
So what is the performance of Talend, Pentaho and Jaspersoft?
Achieved record revenue, more than doubling from 2008. The fourth quarter of 2009 was Talend's tenth consecutive quarter of growth.
Grew customer base by 140% to over 1,000 customers, up from 420 at the end of 2008. Of these new customers, over 50% are Fortune 1000 companies.
Total downloads reached seven million, with over 300,000 users of the open source products.
Talend doubled its staff, increasing to 200 global employees. Continuing this trend, Talend has already hired 15 people in 2010 to support its rapid growth.
40% sequential growth most recent quarter. (I didn’t ask whether there was any reason to suspect seasonality.)
130% annual revenue growth run rate.
“Not quite” profitable.
Several hundred commercial subscribers, at an average of $25K annually per, including >100 in Europe.
9,000 paying customers of some kind.
100,000+ total deployments, “very conservatively,” counting OEMs as one deployment each and not double-counting for OEMs’ customers. (Nick said Business Objects quotes 45,000 deployments by the same standards.)
70% of revenue from the mid-market, defined as $100 million – $1 billion revenue. 30% from bigger enterprises. (Hmm. That begs a couple of questions, such as where OEM revenue comes in, and whether <$100 million enterprises were truly a negligible part of revenue.)
1) There is a complete lack of transparency in open source BI market shares as almost all these companies are privately held and do not disclose revenues.
2) What may be a pure-play open source company may actually be a company funded by a big BI vendor (Revolution Analytics is funded, among others, by Intel-Microsoft, and EnterpriseDB has IBM as an investor). MySQL and Sun, of course, were bought by Oracle.
The degree of control that proprietary vendors have over open source vendors is still not disclosed, nor is it clear whether they hold a stake for strategic reasons or otherwise.
3) None of the open source vendors is even close to a billion dollars in revenue.
Jim Goodnight is pointing out market reality when he says he has not seen much impact (in terms of market share). As for the rest of his remarks, well, he has a job to do as CEO, and that is to talk up his company and trash the competition, which he has been doing for three decades and is unlikely to change now unless there is a severe market share impact. You can hardly expect him to notice companies less than 5% of his size in revenue.
If you use Windows for your stats computing and your data is in a database (probably true for almost all corporate business analysts), R 2.12 presents a procedural hitch for you: NO BINARIES for the packages used until now to read from these databases.
The README notes of the release say:
Packages related to many database system must be linked to the exact
version of the database system the user has installed, hence it does
not make sense to provide binaries for packages
RMySQL, ROracle, ROracleUI, RPostgreSQL
although it is possible to install such packages from sources by
install.packages('packagename', type='source')
after reading the manual 'R Installation and Administration'.
So how do you connect to PostgreSQL and MySQL databases if the Windows binary is not available?
Fortunately the RpgSQL package is still available for PostgreSQL
Using the RpgSQL package
library(RpgSQL)
#creating a connection
con <- dbConnect(pgSQL(), user = "postgres", password = "XXXX",dbname="postgres")
#writing a table from a R Dataset
dbWriteTable(con, "BOD", BOD)
# table names are lower cased unless double quoted. Here we write a Select SQL query
dbGetQuery(con, 'select * from "BOD"')
#disconnecting the connection
dbDisconnect(con)
You can also use the RODBC package to connect to your PostgreSQL database, but you first need to configure your ODBC connections in Windows:
Start >
Settings > Control Panel >
Administrative Tools > Data Sources (ODBC)
You should probably see something like this screenshot.
Coming back to R, note the name of your PostgreSQL DSN from the screenshot above. (If it is not there, just click Add, scroll to the appropriate database driver, here PostgreSQL, and click Finish. Fill in the default values for your database, or the values for a database you have created; see the screenshot for help with the other settings, and remember to click Test to check that the username and password work, the port is correct, etc.)
Once the DSN is properly set up in ODBC (frightening terminology is part of databases), you can go to R and connect using the RODBC package.
#loading RODBC
library(RODBC)
#creating a Database connection
# connection string: DSN name, then username, password and database name
chan <- odbcConnect("PostgreSQL35W", "postgres;Password=X;Database=postgres")
#to list all table names
sqlTables(chan)
TABLE_QUALIFIER TABLE_OWNER TABLE_NAME TABLE_TYPE REMARKS
1 postgres public bod TABLE
2 postgres public database1 TABLE
3 postgres public tt TABLE
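As a small follow-up (assuming the same open channel as above), you can pull any of the listed tables straight into an R data frame with RODBC's sqlFetch, and push an R data frame back with sqlSave (the table name below is only illustrative):

# read the "bod" table (seen in the sqlTables listing above) into a data frame
bod_df <- sqlFetch(chan, "bod")
head(bod_df)
# write an R data frame back to the database as a new table
sqlSave(chan, data.frame(x = 1:3), tablename = "tt2")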
For MySQL, we then configure a DSN in the same way as we did for PostgreSQL.
After that we use RODBC in pretty much the same way, except that we change the username and password to the MySQL database's values and use the DSN name from the previous step.
# create a connection using the MySQL DSN name, username, password and database
channel <- odbcConnect("mysql", "jasperdb;Password=XXX;Database=Test")
# run a SQL query against the jiuser table and print the result
test2 <- sqlQuery(channel, "select * from jiuser")
test2
id username tenantId fullname emailAddress password externallyDefined enabled previousPasswordChangeTime
1 1 jasperadmin 1 Jasper Administrator NA 349AFAADD5C5A2BD477309618DC NA 0 1
2 2 joe1ser 1 Joe User NA 4DD8128D07A NA 0 1
odbcClose(channel)
While using RODBC for all databases is a welcome workaround, perhaps the change release notes for Windows users of R need to be more substantive than the one given for R 2.12.2.
Here is an interesting interview with Quentin G, CEO of AsterData. Marketing trumpeting apart, the insights on the "what's next" vision thing are quite good.
As you look down the road, what are the three major challenges you see for vendors who keep trying to solve big data and other “now” problems with old tools?
Old tools and traditional architectures cannot scale effectively to handle massive data volumes that reach hundreds of terabytes, nor can they process large data volumes in a high-performance manner. Further, they are restricted to what SQL querying allows. The three challenges I have noted are:
First, performance, specifically, poor performance on large data volumes and heavy workloads: The pre-existing systems rely on storing data in a traditional DBMS or data warehouse and then extracting a sample of data to a separate processing tier. This greatly restricts data insights and analytics as only a sample of data is analyzed and understood. As more data is stored in these systems they suffer from performance degradation as more users try to access the system concurrently. Additionally moving masses of data out of the traditional DBMS to a separate processing tier adds latency and slows down analytics and response times. This pre-existing architecture greatly limits performance especially as data sizes grow.
Second, limited analytics: Pre-existing systems rely mostly on SQL for data querying and analysis. SQL poses several limitations and is not suited for ad hoc querying, deep data exploration and a range of other analytics. MapReduce overcomes the limitations of SQL and SQL-MapReduce in particular opens up a new class of analytics that cannot be achieved with SQL alone.
And, third, limitations of types of data that can be stored and analyzed: Traditional systems are not designed for non-relational or unstructured data. New solutions such as Aster Data’s are designed from the ground up to handle both relational and non-relational data. Organizations want to store and process a range of data types and do this in a single platform. New solutions allow for different data types to be handled in a single platform whereas pre-existing architectures and solutions are specialized around a single data type or format – this restricts the diversity of analytics that can be performed on these systems.
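To make the second challenge (the limits of plain SQL) a little more concrete, here is a toy sketch in R of the map/reduce idea behind SQL-MapReduce. It is not Aster Data's implementation, just an illustration on hypothetical data: sessionizing a clickstream, an analysis that is awkward to express in plain SQL but natural as a per-key function applied in parallel.

# Hypothetical clickstream: one row per page view
clicks <- data.frame(
  user = c("u1", "u1", "u1", "u2", "u2"),
  time = as.POSIXct("2010-12-01 10:00:00") + c(0, 120, 4000, 0, 200),
  stringsAsFactors = FALSE
)

# "Map" step: partition the events by user key
by_user <- split(clicks, clicks$user)

# "Reduce" step: within each user, start a new session whenever the gap
# between successive clicks exceeds 30 minutes, then count the sessions
count_sessions <- function(df, gap = 30 * 60) {
  df <- df[order(df$time), ]
  new_session <- c(TRUE, diff(as.numeric(df$time)) > gap)
  sum(new_session)
}

sapply(by_user, count_sessions)
# u1 has two sessions (the 4000-second gap splits it); u2 has one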