oracle – Page 4 – DECISION STATS

LibreOffice Stable Release launched

Non Oracle Open Office completes important milestone- from the press release

The Document Foundation launches LibreOffice 3.3

The first stable release of the free office suite is available for download

The Internet, January 25, 2011 – The Document Foundation launches LibreOffice 3.3, the first stable release of the free office suite developed by the community. In less than four months, the number of developers hacking LibreOffice has grown from less than twenty in late September 2010, to well over one hundred today. This has allowed us to release ahead of the aggressive schedule set by the project.

Not only does it ship a number of new and original features, LibreOffice 3.3 is also a significant achievement for a number of reasons:

– the developer community has been able to build their own and independent process, and get up and running in a very short time (with respect to the size of the code base and the project’s strong ambitions);

– thanks to the high number of new contributors having been attracted into the project, the source code is quickly undergoing a major clean-up to provide a better foundation for future development of LibreOffice;

– the Windows installer, which is going to impact the largest and most diverse user base, has been integrated into a single build containing all language versions, thus reducing the size for download sites from 75 to 11GB, making it easier for us to deploy new versions more rapidly and lowering the carbon footprint of the entire infrastructure.

Caolán McNamara from RedHat, one of the developer community leaders, comments, “We are excited: this is our very first stable release, and therefore we are eager to get user feedback, which will be integrated as soon as possible into the code, with the first enhancements being released in February. Starting from March, we will be moving to a real time-based, predictable, transparent and public release schedule, in accordance with Engineering Steering Committee’s goals and users’ requests”. The LibreOffice development roadmap is available at http://wiki.documentfoundation.org/ReleasePlan

LibreOffice 3.3 brings several unique new features. The 10 most-popular among community members are, in no particular order:

the ability to import and work with SVG files;
an easy way to format title pages and their numbering in Writer;
a more-helpful Navigator Tool for Writer;
improved ergonomics in Calc for sheet and cell management;
and Microsoft Works and Lotus Word Pro document import filters.

In addition, many great extensions are now bundled, providing

PDF import,

a slide-show presenter console,

a much improved report builder, and more besides.

A more-complete and detailed list of all the new features offered by LibreOffice 3.3 is viewable on the following web page: http://www.libreoffice.org/download/new-features-and-fixes/

LibreOffice 3.3 also provides all the new features of OpenOffice.org 3.3, such as new custom properties handling; embedding of standard PDF fonts in PDF documents; new Liberation Narrow font; increased document protection in Writer and Calc; auto decimal digits for “General” format in Calc; 1 million rows in a spreadsheet; new options for CSV import in Calc; insert drawing objects in Charts; hierarchical axis labels for Charts; improved slide layout handling in Impress; a new easier-to-use print interface; more options for changing case; and colored sheet tabs in Calc. Several of these new features were contributed by members of the LibreOffice team prior to the formation of The Document Foundation.

LibreOffice hackers will be meeting at FOSDEM in Brussels on February 5 and 6, and will be presenting their work during a one-day workshop on February 6, with speeches and hacking sessions coordinated by several members of the project.

The home of The Document Foundation is at http://www.documentfoundation.org

The home of LibreOffice is at http://www.libreoffice.org where the download page has been redesigned by the community to be more user-friendly.

*** About The Document Foundation

The Document Foundation has the mission of facilitating the evolution of the OOo Community into a new, open, independent, and meritocratic organization within the next few months. An independent Foundation is a better reflection of the values of our contributors, users and supporters, and will enable a more effective, efficient and transparent community. TDF will protect past investments by building on the achievements of the first decade, will encourage wide participation within the community, and will co-ordinate activity across the community.

*** Media Contacts for TDF

Florian Effenberger (Germany)

Mobile: +49 151 14424108 – E-mail: floeff@documentfoundation.org

Olivier Hallot (Brazil)

Mobile: +55 21 88228812 – E-mail: olivier.hallot@documentfoundation.org

Charles H. Schulz (France)

Mobile: +33 6 98655424 – E-mail: charles.schulz@documentfoundation.org

Italo Vignoli (Italy)

Mobile: +39 348 5653829 – E-mail: italo.vignoli@documentfoundation.org

LibreOffice now default Office Suite in Ubuntu 11.04 (omgubuntu.co.uk)
Ubuntu 11.04 switches to LibreOffice in latest daily builds (downloadsquad.switched.com)
Ubuntu Ditches OpenOffice For LibreOffice (informationweek.com)
Ubuntu opts for LibreOffice over Oracle’s OpenOffice (zdnet.com)
LibreOffice Is Taking Shape With Third Beta (pcworld.com)
Ubuntu 11 Switches To Libre Office (lockergnome.com)

Handling time and date in R

One of the most frustrating things I had to do while working as financial business analysts was working with Data Time Formats in Base SAS. The syntax was simple enough and SAS was quite good with handing queries to the Oracle data base that the client was using, but remembering the different types of formats in SAS language was a challenge (there was a date9. and date6 and mmddyy etc )

Data and Time variables are particularly important variables in financial industry as almost everything is derived variable from the time (which varies) while other inputs are mostly constants. This includes interest as well as late fees and finance fees.

In R, date and time are handled quite simply-

Use the strptime( dataset, format) function to convert the character into string

For example if the variable dob is “01/04/1977) then following will convert into a date object

z=strptime(dob,”%d/%m/%Y”)

and if the same date is 01Apr1977

z=strptime(dob,"%d%b%Y")

does the same

For troubleshooting help with date and time, remember to enclose the formats

%d,%b,%m and % Y in the same exact order as the original string- and if there are any delimiters like ” -” or “/” then these delimiters are entered in exactly the same order in the format statement of the strptime

Sys.time() gives you the current date-time while the function difftime(time1,time2) gives you the time intervals( say if you have two columns as date-time variables)

What are the various formats for inputs in date time?

%a: Abbreviated weekday name in the current locale. (Also matches full name on input.)
%A: Full weekday name in the current locale. (Also matches abbreviated name on input.)
%b: Abbreviated month name in the current locale. (Also matches full name on input.)
%B: Full month name in the current locale. (Also matches abbreviated name on input.)
%c: Date and time. Locale-specific on output, "%a %b %e %H:%M:%S %Y" on input.
%d: Day of the month as decimal number (01–31).
%H: Hours as decimal number (00–23).
%I: Hours as decimal number (01–12).
%j: Day of year as decimal number (001–366).
%m: Month as decimal number (01–12).
%M: Minute as decimal number (00–59).
%p: AM/PM indicator in the locale. Used in conjunction with %I and not with %H. An empty string in some locales.
%S: Second as decimal number (00–61), allowing for up to two leap-seconds (but POSIX-compliant implementations will ignore leap seconds).

%U: Week of the year as decimal number (00–53) using Sunday as the first day 1 of the week (and typically with the first Sunday of the year as day 1 of week 1). The US convention.
%w: Weekday as decimal number (0–6, Sunday is 0).
%W: Week of the year as decimal number (00–53) using Monday as the first day of week (and typically with the first Monday of the year as day 1 of week 1). The UK convention.
%x: Date. Locale-specific on output, "%y/%m/%d" on input.
%X: Time. Locale-specific on output, "%H:%M:%S" on input.
%y: Year without century (00–99). Values 00 to 68 are prefixed by 20 and 69 to 99 by 19 – that is the behaviour specified by the 2004 POSIX standard, but it does also say ‘it is expected that in a future version the default century inferred from a 2-digit year will change’.
%Y: Year with century.
%z: Signed offset in hours and minutes from UTC, so -0800 is 8 hours behind UTC.
%Z: (output only.) Time zone as a character string (empty if not available).; Also to read the helpful documentation (especially for time zone level, and leap year seconds and differences); http://stat.ethz.ch/R-manual/R-patched/library/base/html/difftime.html; http://stat.ethz.ch/R-manual/R-patched/library/base/html/strptime.html; http://stat.ethz.ch/R-manual/R-patched/library/base/html/Ops.Date.html; http://stat.ethz.ch/R-manual/R-patched/library/base/html/Dates.html

Mark-up Your Events Online with Microformats (seomoz.org)
How do you convert octal numbers to decimal numbers (wiki.answers.com)
The Rollover of Doom: a Trap for Good Programmers (esr.ibiblio.org)
Formatting Dates, Times and Numbers in ASP.NET (4guysfromrolla.com)
JavaScript Date Format (stevenlevithan.com)
Rcpp 0.9.0 and RcppClassic 0.9.0 (dirk.eddelbuettel.com)
Comparing times and dates in Ruby (nofluffjuststuff.com)
C#: Programatically Convert between ASCII, Decimal, and Hexidecimal (lockergnome.com)
Scale and Scalability: Rethinking the Most Overused IT System Selling Point for the Cloud Era (itexpertvoice.com)
Coding Horror: A Visual Explanation of SQL Joins (codinghorror.com)

The Year 2010

My annual traffic to this blog was almost 99,000 . Add in additional views on networking sites plus the 400 plus RSS readers- so I can say traffic was 1,20,000 for 2010. Nice. Thanks for reading and hope it was worth your time. (this is a long post and will take almost 440 secs to read but the summary is just given)

My intent is either to inform you, give something useful or atleast something interesting.

see below-

	Jan	Feb	Mar	Apr	May	Jun

2010	6,311	4,701	4,922	5,463	6,493	4,271

Jul	Aug	Sep	Oct	Nov	Dec	Total

5,041

5,403

17,913

16,430

11,723

10,096

98,767

Sandro Saita from http://www.dataminingblog.com/ just named me for an award on his blog (but my surname is ohRi , Sandro left me without an R- What would I be without R :)) ).

Aw! I am touched. Google for “Data Mining Blog” and Sandro is the best that it is in data mining writing.

”

DMR People Award 2010
There are a lot of active people in the field of data mining. You can discuss with them on forums. You can read their blogs. You can also meet them in events such as PAW or KDD. Among the people I follow on a regular basis, I have elected:

Ajay Ori

He has been very active in 2010, especially on his blog . Good work Ajay and continue sharing your experience with us!”

What did I write in 2010- stuff.

What did you read on this blog- well thats the top posts list.

2009-12-31 to Today

Title		Views
Home page		21,150
Top 10 Graphical User Interfaces in Statistical Software		6,237
Wealth = function (numeracy, memory recall)		2,014
Matlab-Mathematica-R and GPU Computing		1,946
The Top Statistical Softwares (GUI)		1,405
About DecisionStats		1,352
Using Facebook Analytics (Updated)		1,313
Test drive a Chrome notebook.		1,170
Top ten RRReasons R is bad for you ?		1,157
Libre Office		1,151
Interview Hadley Wickham R Project Data Visualization Guru		1,007
Using Red R- R with a Visual Interface		854
SAS Institute files first lawsuit against WPS- Episode 1		790
Interview Professor John Fox Creator R Commander		764
R Package Creating		754
Windows Azure vs Amazon EC2 (and Google Storage)		726
Norman Nie: R GUI and More		716
Startups for Geeks		682
Google Maps – Jet Ski across Pacific Ocean		670
Not so AWkward after all: R GUI RKWard		579
Red R 1.8- Pretty GUI		570
Parallel Programming using R in Windows		569
R is an epic fail or is it just overhyped		559
Enterprise Linux rises rapidly:New Report		537
Rapid Miner- R Extension		518
Creating a Blog Aggregator for free		504
So which software is the best analytical software? Sigh- It depends		473
Revolution R for Linux		465
John Sall sets JMP 9 free to tango with R		460

So how do people come here –

well I guess I owe Tal G for almost 9000 views ( incidentally I withdrew posting my blog from R- Bloggers and Analyticbridge blogs – due to SEO keyword reasons and some spam I was getting see (below))

http://r-bloggers.com is still the CAT’s whiskers and I read it a lot.

I still dont know who linked my blog to a free sex movie site with 400 views but I have a few suspects.

2009-12-31 to Today

Referrer	Views
r-bloggers.com	9,131
Reddit	3,829
rattle.togaware.com	1,500
Twitter	1,254
Google Reader	1,215
linkedin.com	717
freesexmovie.irwanaf.com	422
analyticbridge.com	341
Google	327
coolavenues.com	322
Facebook	317
kdnuggets.com	298
dataminingblog.com	278
en.wordpress.com	185
google.co.in	151
xianblog.wordpress.com	130
inside-r.org	124
decisionstats.com	119
ifreestores.com	117
bits.blogs.nytimes.com	108

–

Still reading this post- gosh let me sell you some advertising. It is only $100 a month (yes its a recession)

Advertisers are treated on First in -Last out (FILO)

I have been told I am obsessed with SEO , but I dont care much for search engines apart from Google, and yes SEO is an interesting science (they should really re name it GEO or Google Engine Optimization)

Apparently Hadley Wickham and Donald Farmer are big keywords for me so I should be more respectful I guess.

Search Terms for 365 days ending 2010-12-31 (Summarized)

2009-12-31 to Today

Search	Views
libre office	925
facebook analytics	798
test drive a chrome notebook	467
test drive a chrome notebook.	215
r gui	203
data mining	163
wps sas lawsuit	158
wordle.net	133
wps sas	123
google maps jet ski	123
test drive chrome notebook	96
sas wps	89
sas wps lawsuit	85
chrome notebook test drive	83
decision stats	83
best statistics software	74
hadley wickham	72
google maps jetski	72
libreoffice	70
doug savage	65
hive tutorial	58
funny india	56
spss certification	52
donald farmer microsoft	51
best statistical software	49

What about outgoing links? Apparently I need to find a way to ask Google to pay me for the free advertising I gave their chrome notebook launch. But since their search engine and browser is free to me, guess we are even steven.

Clicks for 365 days ending 2010-12-31 (Summarized)

2009-12-31 to Today

URL	Clicks
rattle.togaware.com	378
facebook.com/Decisionstats	355
rapid-i.com/content/view/182/196	319
services.google.com/fb/forms/cr48basic	313
red-r.org	228
decisionstats.wordpress.com/2010/05/07/the-top-statistical-softwares-gui	199
teamwpc.co.uk/products/wps	162
r4stats.com/popularity	148
r-statistics.com/2010/04/r-and-the-google-summer-of-code-2010-accepted-students-and-projects	138
socserv.mcmaster.ca/jfox/Misc/Rcmdr	138
spss.com/certification	116
learnr.wordpress.com	114
dudeofdata.com/decisionstats	108
r-project.org	107
documentfoundation.org/faq	104
goo.gl/maps/UISY	100
inside-r.org/download	96
en.wikibooks.org/wiki/R_Programming	92
nytimes.com/external/readwriteweb/2010/12/07/07readwriteweb-report-google-offering-chrome-notebook-test-11919.html	92
sourceforge.net/apps/mediawiki/rkward/index.php?title=Main_Page	92
analyticdroid.togaware.com	88
yeroon.net/ggplot2	87

so in 2010,

SAS remained top daddy in business analytics,

R made revolutionary strides in terms of new packages,

JMP launched a new version,

SPSS got integrated with Cognos,

Oracle sued Google and did build a great Data Mining GUI,

Libre Office gave you a non Oracle Open office ( or open even more office)

2011 looks like a fun year. Have safe partying .

IBM SPSS 19 Now Available to the Global Academic Community via e-academy’s OnTheHub eStore (prweb.com)
ACM Data Mining Camp 3 (revolutionanalytics.com)
Accessing R from Python using RPy2 (r-bloggers.com)
Mining of Massive Data Sets (kinlane.com)
5 FeedBurner Alternatives You Should Know About (techie-buzz.com)
Uncertainty, Risk, Statistics and Data Mining (zyxo.wordpress.com)
‘Data Mining’ Gains Traction in Education (edreformer.com)
If you cut your RSS short I will ignore your post (chrisabraham.com)
Solar trends for 2011 (cleanbreak.ca)

Choosing R for business – What to consider?

Image via Wikipedia

Additional features in R over other analytical packages-

1) Source Code is given to ensure complete custom solution and embedding for a particular application. Open source code has an advantage that is extensively peer- reviewed in Journals and Scientific Literature. This means bugs will found, shared and corrected transparently.

2) Wide literature of training material in the form of books is available for the R analytical platform.

3) Extensively the best data visualization tools in analytical software (apart from Tableau Software ‘s latest version). The extensive data visualization available in R is of the form a variety of customizable graphs, as well as animation. The principal reason third-party software initially started creating interfaces to R is because the graphical library of packages in R is more advanced as well as rapidly getting more features by the day.

4) Free in upfront license cost for academics and thus budget friendly for small and large analytical teams.

5) Flexible programming for your data environment. This includes having packages that ensure compatibility with Java, Python and C++.

6) Easy migration from other analytical platforms to R Platform. It is relatively easy for a non R platform user to migrate to R platform and there is no danger of vendor lock-in due to the GPL nature of source code and open community.

Statistics are numbers that tell (descriptive), advise ( prescriptive) or forecast (predictive). Analytics is a decision-making help tool. Analytics on which no decision is to be made or is being considered can be classified as purely statistical and non analytical. Thus ease of making a correct decision separates a good analytical platform from a not so good analytical platform. The distinction is likely to be disputed by people of either background- and business analysis requires more emphasis on how practical or actionable the results are and less emphasis on the statistical metrics in a particular data analysis task. I believe one clear reason between business analytics is different from statistical analysis is the cost of perfect information (data costs in real world) and the opportunity cost of delayed and distorted decision-making.

Specific to the following domains R has the following costs and benefits

Business Analytics
- R is free per license and for download
- It is one of the few analytical platforms that work on Mac OS
- It’s results are credibly established in both journals like Journal of Statistical Software and in the work at LinkedIn, Google and Facebook’s analytical teams.
- It has open source code for customization as per GPL
- It also has a flexible option for commercial vendors like Revolution Analytics (who support 64 bit windows) as well as bigger datasets
- It has interfaces from almost all other analytical software including SAS,SPSS, JMP, Oracle Data Mining, Rapid Miner. Existing license holders can thus invoke and use R from within these software
- Huge library of packages for regression, time series, finance and modeling
- High quality data visualization packages
- Data Mining
  - R as a computing platform is better suited to the needs of data mining as it has a vast array of packages covering standard regression, decision trees, association rules, cluster analysis, machine learning, neural networks as well as exotic specialized algorithms like those based on chaos models.
  - Flexibility in tweaking a standard algorithm by seeing the source code
  - The RATTLE GUI remains the standard GUI for Data Miners using R. It was created and developed in Australia.
  - Business Dashboards and Reporting
  - Business Dashboards and Reporting are an essential piece of Business Intelligence and Decision making systems in organizations. R offers data visualization through GGPLOT, and GUI like Deducer and Red-R can help even non R users create a metrics dashboard
    - For online Dashboards- R has packages like RWeb, RServe and R Apache- which in combination with data visualization packages offer powerful dashboard capabilities.
    - R can be combined with MS Excel using the R Excel package – to enable R capabilities to be imported within Excel. Thus a MS Excel user with no knowledge of R can use the GUI within the R Excel plug-in to use powerful graphical and statistical capabilities.

Additional factors to consider in your R installation-

There are some more choices awaiting you now-
1) Licensing Choices-Academic Version or Free Version or Enterprise Version of R

2) Operating System Choices-Which Operating System to choose from? Unix, Windows or Mac OS.

3) Operating system sub choice- 32- bit or 64 bit.

4) Hardware choices-Cost -benefit trade-offs for additional hardware for R. Choices between local ,cluster and cloud computing.

5) Interface choices-Command Line versus GUI? Which GUI to choose as the default start-up option?

6) Software component choice- Which packages to install? There are almost 3000 packages, some of them are complimentary, some are dependent on each other, and almost all are free.

7) Additional Software choices- Which additional software do you need to achieve maximum accuracy, robustness and speed of computing- and how to use existing legacy software and hardware for best complementary results with R.

1) Licensing Choices-
You can choose between two kinds of R installations – one is free and open source from http://r-project.org The other R installation is commercial and is offered by many vendors including Revolution Analytics. However there are other commercial vendors too.

Commercial Vendors of R Language Products-
1) Revolution Analytics http://www.revolutionanalytics.com/
2) XL Solutions- http://www.experience-rplus.com/
3) Information Builder – Webfocus RStat -Rattle GUI http://www.informationbuilders.com/products/webfocus/PredictiveModeling.html
4) Blue Reference- Inference for R http://inferenceforr.com/default.aspx

Choosing Operating System
1. 1. Windows

Windows remains the most widely used operating system on this planet. If you are experienced in Windows based computing and are active on analytical projects- it would not make sense for you to move to other operating systems. This is also based on the fact that compatibility problems are minimum for Microsoft Windows and the help is extensively documented. However there may be some R packages that would not function well under Windows- if that happens a multiple operating system is your next option.

1. 1. 1. Enterprise R from Revolution Analytics- Enterprise R from Revolution Analytics has a complete R Development environment for Windows including the use of code snippets to make programming faster. Revolution is also expected to make a GUI available by 2011. Revolution Analytics claims several enhancements for it’s version of R including the use of optimized libraries for faster performance.
  2. MacOS

Reasons for choosing MacOS remains its considerable appeal in aesthetically designed software- but MacOS is not a standard Operating system for enterprise systems as well as statistical computing. However open source R claims to be quite optimized and it can be used for existing Mac users. However there seem to be no commercially available versions of R available as of now for this operating system.

1. 1. Linux

1. 1. 1. Ubuntu
    2. Red Hat Enterprise Linux
    3. Other versions of Linux

Linux is considered a preferred operating system by R users due to it having the same open source credentials-much better fit for all R packages and it’s customizability for big data analytics.

Ubuntu Linux is recommended for people making the transition to Linux for the first time. Ubuntu Linux had an marketing agreement with revolution Analytics for an earlier version of Ubuntu- and many R packages can installed in a straightforward way as Ubuntu/Debian packages are available. Red Hat Enterprise Linux is officially supported by Revolution Analytics for it’s enterprise module. Other versions of Linux popular are Open SUSE.

1. 1. Multiple operating systems-
    1. Virtualization vs Dual Boot-

You can also choose between having a VMware VM Player for a virtual partition on your computers that is dedicated to R based computing or having operating system choice at the startup or booting of your computer. A software program called wubi helps with the dual installation of Linux and Windows.

64 bit vs 32 bit – Given a choice between 32 bit versus 64 bit versions of the same operating system like Linux Ubuntu, the 64 bit version would speed up processing by an approximate factor of 2. However you need to check whether your current hardware can support 64 bit operating systems and if so- you may want to ask your Information Technology manager to upgrade atleast some operating systems in your analytics work environment to 64 bit operating systems.

Hardware choices- At the time of writing this book, the dominant computing paradigm is workstation computing followed by server-client computing. However with the introduction of cloud computing, netbooks, tablet PCs, hardware choices are much more flexible in 2011 than just a couple of years back.

Hardware costs are a significant cost to an analytics environment and are also remarkably depreciated over a short period of time. You may thus examine your legacy hardware, and your future analytical computing needs- and accordingly decide between the various hardware options available for R.
Unlike other analytical software which can charge by number of processors, or server pricing being higher than workstation pricing and grid computing pricing extremely high if available- R is well suited for all kinds of hardware environment with flexible costs. Given the fact that R is memory intensive (it limits the size of data analyzed to the RAM size of the machine unless special formats and /or chunking is used)- it depends on size of datasets used and number of concurrent users analyzing the dataset. Thus the defining issue is not R but size of the data being analyzed.

1. Local Computing- This is meant to denote when the software is installed locally. For big data the data to be analyzed would be stored in the form of databases.
  1. Server version- Revolution Analytics has differential pricing for server -client versions but for the open source version it is free and the same for Server or Workstation versions.
  2. Workstation
2. Cloud Computing- Cloud computing is defined as the delivery of data, processing, systems via remote computers. It is similar to server-client computing but the remote server (also called cloud) has flexible computing in terms of number of processors, memory, and data storage. Cloud computing in the form of public cloud enables people to do analytical tasks on massive datasets without investing in permanent hardware or software as most public clouds are priced on pay per usage. The biggest cloud computing provider is Amazon and many other vendors provide services on top of it. Google is also coming for data storage in the form of clouds (Google Storage), as well as using machine learning in the form of API (Google Prediction API)
  1. Amazon
  2. Google
  3. Cluster-Grid Computing/Parallel processing- In order to build a cluster, you would need the RMpi and the SNOW packages, among other packages that help with parallel processing.
3. How much resources
  1. RAM-Hard Disk-Processors- for workstation computing
  2. Instances or API calls for cloud computing
Interface Choices
1. Command Line
2. GUI
3. Web Interfaces
Software Component Choices
1. R dependencies
2. Packages to install
3. Recommended Packages
Additional software choices
1. Additional legacy software
2. Optimizing your R based computing
3. Code Editors
  1. Code Analyzers
  2. Libraries to speed up R

citation- R Development Core Team (2010). R: A language and environment for statistical computing. R Foundation for Statistical Computing,Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

(Note- this is a draft in progress)

Are You a Computer Science Student? Want To Be a Data Scientist? Then Check This Out (readwriteweb.com)
Why Use R? (r-bloggers.com)

Complex Event Processing- SASE Language

Image via Wikipedia

Complex Event Processing (CEP- not to be confused by Circular Probability Error) is defined processing many events happening across all the layers of an organization, identifying the most meaningful events within the event cloud, analyzing their impact, and taking subsequent action in real time.

Software supporting CEP are-

Oracle http://www.oracle.com/us/technologies/soa/service-oriented-architecture-066455.html

Oracle CEP is a Java application server for the development and deployment of high-performance event driven applications. It can detect patterns in the flow of events and message payloads, often based on filtering, correlation, and aggregation across event sources, and includes industry leading temporal and ordering capabilities. It supports ultra-high throughput (1 million/sec++) and microsecond latency.

Tibco is also trying to get into this market (it claims to have a 40 % market share in the public CEP market 😉 though probably they have not measured the DoE and DoD as worthy of market share yet

– see webcast by TIBCO ‘s head here http://www.tibco.com/products/business-optimization/complex-event-processing/default.jsp

and product info here-http://www.tibco.com/products/business-optimization/complex-event-processing/businessevents/default.jsp

TIBCO is the undisputed leader in complex event processing (CEP) software with over 40 percent market share, according to a recent IDC Study.

A good explanation of how social media itself can be used as an analogy for CEP is given in this SAS Global Paper

http://support.sas.com/resources/papers/proceedings10/040-2010.pdf

You can see a report on Predictive Analytics and Data Mining in q1 2010 also from SAS’s website at –http://www.sas.com/news/analysts/forresterwave-predictive-analytics-dm-104388-0210.pdf

A very good explanation on architecture involved is given by SAS CTO Keith Collins here on SAS’s Knowledge Exchange site,

http://www.sas.com/knowledge-exchange/risk/four-ways-divide-conquer.html

What it is: Methods 1 through 3 look at historical data and traditional architectures with information stored in the warehouse. In this environment, it often takes months of data cleansing and preparation to get the data ready to analyze. Now, what if you want to make a decision or determine the effect of an action in real time, as a sale is made, for instance, or at a specific step in the manufacturing process. With streaming data architectures, you can look at data in the present and make immediate decisions. The larger flood of data coming from smart phones, online transactions and smart-grid houses will continue to increase the amount of data that you might want to analyze but not keep. Real-time streaming, complex event processing (CEP) and analytics will all come together here to let you decide on the fly which data is worth keeping and which data to analyze in real time and then discard.

When you use it: Radio-frequency identification (RFID) offers a good user case for this type of architecture. RFID tags provide a lot of information, but unless the state of the item changes, you don’t need to keep warehousing the data about that object every day. You only keep data when it moves through the door and out of the warehouse.

The same concept applies to a customer who does the same thing over and over. You don’t need to keep storing data for analysis on a regular pattern, but if they change that pattern, you might want to start paying attention.

Figure 4: Traditional architecture vs. streaming architecture

In academia here is something called SASE Language

A rich declarative event language
Formal semantics of the event language
Theorectical underpinnings of CEP
An efficient automata-based implementation

http://sase.cs.umass.edu/

and

http://avid.cs.umass.edu/sase/index.php?page=navleft_1col

Financial Services

The query below retrieves the total trading volume of Google stocks in the 4 hour period after some bad news occurred.

PATTERN SEQ(News a, Stock+ b[ ])WHERE   [symbol]    AND	a.type = 'bad'    AND	b[i].symbol = 'GOOG' WITHIN  4 hoursHAVING  b[b.LEN].volume < 80%*b[1].volumeRETURN  sum(b[ ].volume)

The next query reports a one-hour period in which the price of a stock increased from 10 to 20 and its trading volume stayed relatively stable.

PATTERN	SEQ(Stock+ a[])WHERE 	 [symbol]   AND	  a[1].price = 10   AND	  a[i].price > a[i-1].price   AND	  a[a.LEN].price = 20            WITHIN  1 hourHAVING	avg(a[].volume) ≥ a[1].volumeRETURN	a[1].symbol, a[].price

The third query detects a more complex trend: in an hour, the volume of a stock started high, but after a period of price increasing or staying relatively stable, the volume plummeted.

PATTERN SEQ(Stock+ a[], Stock b)WHERE 	 [symbol]   AND	  a[1].volume > 1000   AND	  a[i].price > avg(a[…i-1].price))   AND	  b.volume < 80% * a[a.LEN].volume           WITHIN  1 hourRETURN	a[1].symbol, a[].(price,volume), b.(price,volume)

(note from Ajay-

I was not really happy about the depth of resources on CEP available online- there seem to be missing bits and pieces in both open source, academic and corporate information- one reason for this is the obvious military dual use of this technology- like feeds from Satellite, Audio Scans, etc)

Quant Capital Deploys Aleri CEP Technology for Algorithmic Trading (eon.businesswire.com)
Sybase Positions Itself as a Clear Market Leader in CEP and Strengthens Its Real-Time Analytics Platform with Acquisition of Aleri Assets (eon.businesswire.com)
Event Processing in Action (i-programmer.info)

Quantifying Analytics ROI

I had a brief twitter exchange with Jim Davis, Chief Marketing Officer, SAS Institute on Return of Investment on Business Analytics Projects for customers. I have interviewed Jim Davis before last year https://decisionstats.com/2009/06/05/interview-jim-davis-sas-institute/

Now Jim Davis is a big guy, and he is rushing from the launch of SAS Institute’s Social Media Analytics in Japan- to some arguably difficult flying conditions in time to be home in America for Thanksgiving. That and and I have not been much of a good Blog Boy recently, more swayed by love of open source, than love of software per se. I love equally, given I am bad at both equally.

Anyways, Jim’s contention ( http://twitter.com/Davis_Jim ) was customers should go in business analytics only if there is Positive Return on Investment. I am quoting him here-

What is important is that there be a positive ROI on each and every BA project. Otherwise don’t do it.

That’s not the marketing I was taught in my business school- basically it was sell, sell, sell.

However I see most BI sales vendors also go through -let me meet my sales quota for this quarter- and quantifying customer ROI is simple maths than predictive analytics but there seems to be some information assymetry in it.

Here is a paper from North Western University on ROI in IT projects-.

View this document on Scribd

but overall it would be in the interest of customers and Business Analytics Vendors to publish aggregated ROI.

The opponents to this transparency in ROI would be market leaders in market share, who have trapped their customers by high migration costs (due to complexity) or contractually.

A recent study listed Oracle having a large percentage of unhappy customers who would still renew!, SAP had problems when it raised prices for licensing arbitrarily (that CEO is now CEO of HP and dodging legal notices from Oracle).

Indeed Jim Davis’s famous unsettling call for focusing on Business Analytics,as Business Intelligence is dead- that call has been implemented more aggressively by IBM in analytical acquisitions than even SAS itself which has been conservative about inorganic growth. Quantifying ROI, should theoretically aid open source software the most (since they are cheapest in up front licensing) or newer technologies like MapReduce /Hadoop (since they are quite so fast)- but I think that market has a way of factoring in these things- and customers are not as foolish neither as unaware of costs versus benefits of migration.

The contrary to this is Business Analytics and Business Intelligence are imperfect markets with duo-poly or big players thriving in absence of customer regulation.

You get more protection as a customer of $20 bag of potato chips, than as a customer of a $200,000 software. Regulators are wary to step in to ensure ROI fairness (since most bright techies are qither working for private sector, have their own startup or invested in startups)- who in Govt understands Analytics and Intelligence strong enough to ensure vendor lock-ins are not done, and market flexibility is done. It is also a lower choice for embattled regulators to ensure ROI on enterprise software unlike the aggressiveness they have showed in retail or online software.

Who will Analyze the Analysts and who can quantify the value of quants (or penalize them for shoddy quantitative analytics)- is an interesting phenomenon we expect to see more of.

SAS Turns out in Force for Internet Summit 2010 (eon.businesswire.com)
From Web Analytics & SEM To Business Intelligence (searchengineland.com)
Twitter Rolls Out Analytics Tool (hubspot.com)
Social Media ROI: Customer Engagement, Brand Interactivity, And Revenue (lockergnome.com)
Drowning in data? Start with a clear strategy to effectively measure your marketing. (kilgannonsays.wordpress.com)
Twitter Testing Analytics Tool (informationweek.com)

Interview James Dixon Pentaho

Here is an interview with James Dixon the founder of Pentaho, self confessed Chief Geek and CTO. Pentaho has been growing very rapidly and it makes open source Business Intelligence solutions- basically the biggest chunk of enterprise software market currently.

Ajay- How would you describe Pentaho as a BI product for someone who is completely used to traditional BI vendors (read non open source). Do the Oracle lawsuits over Java bother you from a business perspective?

James-

Pentaho has a full suite of BI software:

* ETL: Pentaho Data Integration

* Reporting: Pentaho Reporting for desktop and web-based reporting

* OLAP: Mondrian ROLAP engine, and Analyzer or Jpivot for web-based OLAP client

* Dashboards: CDF and Dashboard Designer

* Predictive Analytics: Weka

* Server: Pentaho BI Server, handles web-access, security, scheduling, sharing, report bursting etc

We have all of the standard BI functionality.

The Oracle/Java issue does not bother me much. There are a lot of software companies dependent on Java. If Oracle abandons Java a lot resources will suddenly focus on OpenJDK. It would be good for OpenJDK and might be the best thing for Java in the long term.

Ajay- What parts of Pentaho’s technology do you personally like the best as having an advantage over other similar proprietary packages.

Describe the latest Pentaho for Hadoop offering and Hadoop/HIVE ‘s advantage over say Map Reduce and SQL.

James- The coolest thing is that everything is pluggable:

* ETL: New data transformation steps can be added. New orchestration controls (job entries) can be added. New perspectives can be added to the design UI. New data sources and destinations can be added.

* Reporting: New content types and report objects can be added. New data sources can be added.

* BI Server: Every factory, engine, and layer can be extended or swapped out via configuration. BI components can be added. New visualizations can be added.

This means it is very easy for Pentaho, partners, customers, and community member to extend our software to do new things.

In addition every engine and component can be fully embedded into a desktop or web-based application. I made a youtube video about our philosophy: http://www.youtube.com/watch?v=uMyR-In5nKE

Our Hadoop offerings allow ETL developers to work in a familiar graphical design environment, instead of having to code MapReduce jobs in Java or Python.

90% of the Hadoop use cases we hear about are transformation/reporting/analysis of structured/semi-structured data, so an ETL tool is perfect for these situations.

Using Pentaho Data Integration reduces implementation and maintenance costs significantly. The fact that our ETL engine is Java and is embeddable means that we can deploy the engine to the Hadoop data nodes and transform the data within the nodes.

Ajay- Do you think the combination of recession, outsourcing,cost cutting, and unemployment are a suitable environment for companies to cut technology costs by going out of their usual vendor lists and try open source for a change /test projects.

Jamie- Absolutely. Pentaho grew (downloads, installations, revenue) throughout the recession. We are on target to do 250% of what we did last year, while the established vendors are flat in terms of new license revenue.

Ajay- How would you compare the user interface of reports using Pentaho versus other reporting software. Please feel free to be as specific.

James- We have all of the everyday, standard reporting features covered.

Over the years the old tools, like Crystal Reports, have become bloated and complicated.

We don’t aim to have 100% of their features, because we’d end us just as complicated.

The 80:20 rule applies here. 80% of the time people only use 20% of their features.

We aim for 80% feature parity, which should cover 95-99% of typical use cases.

Ajay- Could you describe the Pentaho integration with R as well as your relationship with Weka. Jaspersoft already has a partnership with Revolution Analytics for RevoDeployR (R on a web server)-

Any R plans for Pentaho as well?

James- The feature set of R and Weka overlap to a small extent – both of them include basic statistical functions. Weka is focused on predictive models and machine learning, whereas R is focused on a full suite of statistical models. The creator and main Weka developer is a Pentaho employee. We have integrated R into our ETL tool. (makes me happy 🙂 )

(probably not a good time to ask if SAS integration is done as well for a big chunk of legacy base SAS/ WPS users)

About-

As “Chief Geek” (CTO) at Pentaho, James Dixon is responsible for Pentaho’s architecture and technology roadmap. James has over 15 years of professional experience in software architecture, development and systems consulting. Prior to Pentaho, James held key technical roles at AppSource Corporation (acquired by Arbor Software which later merged into Hyperion Solutions) and Keyola (acquired by Lawson Software). Earlier in his career, James was a technology consultant working with large and small firms to deliver the benefits of innovative technology in real-world environments.

Reducing the Cost of Business Intelligence with Open Source (itexpertvoice.com)
Pentaho brings BI, integration to Hadoop (newstatesman.com)
Interview: Pentaho Exec Doug Johnson on His New Role as CFO (ostatic.com)
SAS vs Open Source (revolutionanalytics.com)

Related Articles

Please share:

Related Articles

Please share:

2009-12-31 to Today

2009-12-31 to Today

Search Terms for 365 days ending 2010-12-31 (Summarized)

2009-12-31 to Today

Clicks for 365 days ending 2010-12-31 (Summarized)

2009-12-31 to Today

Related Articles

Please share:

Related Articles

Please share:

Financial Services

Related Articles

Please share:

Related Articles

Please share:

Related Articles

Please share: