Interview Michael Zeller, CEO Zementis, on PMML

Here is a topic-specific interview with Michael Zeller of Zementis on PMML, the de facto standard for data mining models.


Ajay- What is PMML?

Mike- The Predictive Model Markup Language (PMML) is the leading standard for statistical and data mining models and is supported by all leading analytics vendors and organizations. With PMML, it is straightforward to develop a model on one system using one application and deploy the model on another system using another application. PMML reduces complexity and bridges the gap between development and production deployment of predictive analytics.
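To make this concrete, here is a minimal, hand-written sketch of what a PMML document looks like, a simple linear regression model. The field names and coefficients are invented for illustration; a real model exported from SPSS, SAS, or R would carry its own schema and parameters.

```xml
<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0">
  <Header copyright="example"/>
  <!-- The data dictionary declares every field the model refers to -->
  <DataDictionary numberOfFields="2">
    <DataField name="income" optype="continuous" dataType="double"/>
    <DataField name="risk_score" optype="continuous" dataType="double"/>
  </DataDictionary>
  <!-- A simple regression model: risk_score = 10.5 - 0.002 * income -->
  <RegressionModel modelName="risk_example" functionName="regression"
                   targetFieldName="risk_score">
    <MiningSchema>
      <MiningField name="income"/>
      <MiningField name="risk_score" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="10.5">
      <NumericPredictor name="income" coefficient="-0.002"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
```

Because the model is expressed entirely in this declarative XML, any PMML-aware scoring engine can execute it without access to the tool that trained it.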

PMML is governed by the Data Mining Group (DMG), an independent, vendor-led consortium that develops data mining standards.

Ajay- How can PMML help any business?

Mike– PMML ensures business agility with respect to data mining, predictive analytics, and enterprise decision management. It provides one standard, one deployment process, across all applications, projects and business divisions. In this way, business stakeholders, analytic scientists, and IT are finally speaking the same language.

In the current global economic crisis more than ever, a company must become more efficient and optimize business processes to remain competitive. Predictive analytics is widely regarded as the next logical step, implementing more intelligent, real-time decisions across the enterprise.

However, the deployment of decisions based on predictive models and statistical algorithms has been a hurdle for many companies. Typically, it has been a complex, costly process to get such models integrated into operational systems. With the PMML standard, this no longer is the case. PMML simply eliminates the deployment complexity for predictive models.

A standard also provides choices among vendors, allowing us to implement best-of-breed solutions, and creating a common knowledge framework for internal teams – analytics, IT, and business – as well external vendors and consultants. In general, having a solid standard is a sign of a mature analytics industry, creating more options for users and, most importantly, propelling the total analytics market to the next level.

Ajay- Can PMML help your existing software in analytics and BI?

Mike- PMML has been widely accepted among vendors; almost all major analytics and business intelligence vendors already support the standard. If you have any such software package in-house, you most likely already have PMML at your disposal.

For example, you can develop your models in any of the tools that support PMML, e.g., SPSS, SAS, Microstrategy, or IBM, and then deploy that model in ADAPA, which is the Zementis decision engine. Or you can even choose from various open source tools, like R and KNIME.


Ajay- How do Zementis, ADAPA, and PMML fit together?

Mike- Zementis has been an avid supporter of the PMML standard and is very active in its development. We contributed to the PMML package for the open source R Project. Furthermore, we created a free PMML Converter tool which helps users validate and correct PMML files from various vendors and convert legacy PMML files to the latest version of the standard.

Most prominently with ADAPA, Zementis launched the first cloud-computing scoring engine on the Amazon EC2 cloud. ADAPA is a highly scalable deployment, integration and execution platform for PMML-based predictive models. Not only does it give you all the benefits of being fully standards-based, using PMML and web services, but it also leverages the cloud for scalability and cost-effectiveness.

By being a Software as a Service (SaaS) application on Amazon EC2, ADAPA provides extreme flexibility, from casual usage that costs only a few dollars a month all the way to high-volume, mission-critical enterprise decision management, which users can seamlessly launch in United States or European data centers.

Ajay- What are some examples where PMML helped companies save money?

Mike- For any consulting company focused on developing predictive analytics models for clients, PMML provides tremendous benefits, for both clients and service providers. Standardizing on PMML defines a clear deliverable – a PMML model – which clients can deploy instantly. There are no fixed requirements on which specific tools to choose for development or deployment; it is only important that the model adhere to the PMML standard, which becomes the common interface between the business partners. This eliminates miscommunication and lowers overall project cost. Another example is where a company has taken advantage of the capability to move models instantly from development to operational deployment. This allows them to quickly update models based on market conditions, say in the area of risk management and fraud detection, or to roll out new marketing campaigns.

Personally, I think the biggest opportunities are still ahead of us as more and more businesses embrace operational predictive analytics. The true value of PMML is to facilitate a real-time decision environment where we leverage predictive models in every business process, at every customer touch point, and on demand to maximize value.

Ajay- Where can I find more information about PMML?

Mike- First there is the Data Mining Group (DMG) web site at http://www.dmg.org

I strongly encourage any company that has a significant interest in predictive analytics to become a member and help drive the development of the standard.

We also created a knowledge base of PMML-related information at http://www.predictive-analytics.info and there is a PMML interest group on LinkedIn at http://www.linkedin.com/groupRegistration?gid=2328634

This group is more geared toward a general discussion forum for business benefits and end-user questions, and it is a great way to get started with PMML.

Last but not least, there is the Zementis web site at http://www.zementis.com

It contains various PMML example files, the PMML Converter tool, as well as links to PMML resource pages on the web.

For more on Michael Zeller and Zementis read his earlier interview at https://decisionstats.wordpress.com/2009/02/03/interview-michael-zeller-ceozementis-2/

Interview Ken O'Connor, Business Intelligence Consultant

Here is an interview with an industry veteran of Business Intelligence, Ken O'Connor.

Ajay- Describe your career journey across the full development cycle of Business Intelligence.

Ken- I started my career in the early 80’s in the airline industry, where I worked as an application programmer and later as a systems programmer. I took a computer science degree by night. The airline industry was one of the first to implement computer systems in the ‘60s, and the legacy of being an early adopter was that airline reservation systems were developed in Assembler. Remarkable as it sounds now, as application programmers, we wrote our own file access methods. Even more remarkable, as systems programmers, we modified the IBM supplied Operating System, originally known as the Airline Control Program (ACP), later renamed as Transaction Processing Facility (TPF). The late ‘80s saw the development of Global “Computer Reservations Systems” (CRS systems) including AMADEUS and GALILEO. I moved from Aer Lingus, a small Irish airline, to work in London on the British Airways systems, to enable the British Airways systems to share information and communicate with the new Global CRS systems.

I learnt very important lessons during those years.

* The criticality of standards

* The drive for interoperability of systems

* The drive towards information sharing

* The drive away from bespoke development

In the 90’s I returned to Dublin, where I worked as an independent consultant with IBM on many data intensive projects. On one project I was lead developer in the IBM Dublin Laboratory on the development of the Data Replication tool called “Data Propagator NonRelational”. This tool automatically propagates updates made on IMS databases to DB2 databases. On this project, we successfully piloted the Cleanroom Development Method, as part of IBM’s drive towards Six Sigma quality.

In the past 15 years I have moved away from IT towards the business. I describe myself as a Hybrid. I believe there is a serious communications gap between business users and IT, and this is a frequent cause of project failures. I seek to bridge that gap. I ensure that requirements are clear, measurable, testable, and capable of being easily understood and signed off by business owners.

One of my favorite programmes was the Euro Changeover. This was a hugely data intensive programme, and the largest changeover undertaken by European Financial Institutions. I worked as an independent consultant with the IBM Euro Centre of Competence. I developed changeover strategies for a number of Irish Enterprises, and was the End to End IT changeover process owner in a major Irish bank. Every application and every data store holding currency sensitive data (not just amounts, but currency signs etc.) had to be converted at exactly the same time to ensure that all systems successfully switched to euro processing on 1st January 2002.

I learnt many, many lasting lessons about data the hard way on Euro Changeover programmes, such as:

* The extent to which seemingly separate applications share operational data – often without the knowledge of the owning application.

* The extent to which business users use (abuse) data fields to hold information never intended for the data field.

* The critical distinction between the underlying data (in a data store) and the information displayed to a business user.

I have worked primarily on what I call “End of food chain” projects and programmes, such as Single View of Customer, data migrations, and data population of repositories for BASEL II and Anti Money Laundering (AML) systems. Business Intelligence is another example of an “End of food chain” project. “End of food-chain” projects share the following characteristics:

* Dependent on existing data

* No control over the quality of existing data they depend on

* No control over the data entry processes by which the data they require is captured.

* The data required may have been captured many years previously.

Recently, I have shared my experience of “Enterprise wide data issues” in a series of posts on my blog, together with a process for assessing the status of those issues within an Enterprise (more details). In my experience, the success of a Business Intelligence programme and the ease with which an Enterprise completes “End of food chain” data dependent programmes directly depends on the status of the common Enterprise Wide data issues I have identified.

Ajay -Describe the educational scene for science graduates in Ireland. What steps do you think governments and universities can do to better teach science and keep young people excited about it?

Ken- I am not in a position to comment on the educational scene for science graduates in Ireland. However, I can say that currently there are insufficient numbers of school children studying science in primary and 2nd level education. There is a need to excite young people about science. There is a need for more interactive science museums, like W5 in Belfast which is hugely successful. Kids love to get involved, and practical science can be great fun.

Ajay- What are some of the key trends in business intelligence that you have seen?

Ken- Since the earliest days of my career, I have seen an ever increasing move towards standards based interoperability of systems, and interchange of data. This has accelerated dramatically in recent years. This is the good news. Further good news is the drive towards the use of external reference databases to verify the accuracy of data, at point of data entry (See blog post on Upstream prevention by Henrik Liliendahl Sørensen). One example of this drive is cloud based verification services from new companies like Ireland based Clavis Technology.

The harsh reality is that “Old hardware goes into museums, while old software goes into production every night”. Enterprises have invested vast amounts of money in legacy applications over decades. These legacy systems access legacy data in legacy data stores. This legacy data will continue to pose challenges in the delivery of Business Intelligence to the Business community that needs it. These challenges will continue to provide opportunities for Data Quality professionals.

Ajay- What is going to be the next fundamental change in this industry in your opinion?

Ken- The financial crisis will result in increased regulatory requirements. This will be good news for the Business Intelligence / Data Quality industry. In time, it will no longer be sufficient to provide the regulator with ‘just’ the information requested. The regulator will want to see the process by which the information was gathered, the process controls, and evidence of the quality of the underlying data from which the information was derived. This move will result in funding for Data Governance programmes, which will lead to increased innovation in our industry.

Ajay- Describe your startup Map My Business, your target customer and your vision for it.

Ken- I started MapMyBusiness.com as a “recession buster”. Ireland was hit particularly hard by the financial crisis. I had become over dependent on the financial services industry, and a blanket ban on the use of external consultants left me with no option but to reinvent myself. MapMyBusiness.com helps small businesses to attract clients, by getting them on Google page one. Having been burnt by an over dependence on one industry, my vision is to diversify. I believe that Data Governance is industry independent, and I am focussing on increasing my customer base for my Data Governance consultancy skills, via my company Professional IT Personnel Ltd.

Ajay- What do you do when not working with customers or blogging on your website?

Ken- I try to achieve a reasonable work/life balance. I am married with two children aged 12 and 10, and like to spend time with them, especially outdoors, walking, hiking, playing tennis etc. I am involved in my community, lobbying for improved cycling infrastructure in our area (more details). Ireland, like most countries, is facing an obesity epidemic, due to an increasingly sedentary lifestyle. Too many people get little or no exercise, and don’t have the time, willpower, or perhaps money, to regularly work out in a gym. By including “Active Travel” in our daily lives – by walking or cycling to schools and local amenities, we can get enough physical exercise to prevent obesity, and obesity related health problems. We need to make our cities, towns and villages more pedestrian and cyclist friendly, to encourage “active travel”. My voluntary work in this area introduced me to mapping (see example), and enabled me to set up MapMyBusiness.com.

Biography-

Ken O’Connor is an independent IT Consultant with almost 30 years of work experience. He specialises in Data: Data Migration, Data Population, Data Governance, Data Quality, Data Profiling. His company is called Professional IT Personnel Ltd.

Ken started his blog (Ken O’Connor Data Consultant) to share his experience and to learn from the experience of others. Dylan Jones, editor of dataqualitypro, describes Ken as a “grizzled veteran”, with almost 30 years’ experience across the full development lifecycle.

New York Diner

In a New York Thai restaurant
I dine alone being new to York town
Borrowing conversation from left and right
Bringing no conversation of my own in the fading twilight

As bubbles slowly bubble from a sparkling dollar five glass
I watch from shadows as pretty people come and go as they say excuse me and quickly pass

I am an odd ball I know
Brown monkey nowhere to go
The waiters give me a look best called quizzical
What on the napkin do I scribble

Will the fellow eat and clear in peace start by giving chicken panang a nibble

Will I pay up this after all is west Harlem
Asians don’t tip they have been before on this trip

And I drink and devour
Dinner fine and dine
Watching conversation sparkle up
As sparkling wine goes down
I nod I say people are just the same
Appearances change but they play the same old games

Up when happy when sad they are down
Every big big city every new yet old town

We drink different wines
But then think similar thoughts

Daily joys and same different struggles
That our love and life bought

Wine brings heat to our face
Letting my jacket slip a bit
The waiter slips me a seen it all look
Are you. Serious he thinks you silly twit

Leaving all pretences
I chug wine like we chug beer
Expensive to my sponsors
But hey it brings me cheer

Ole lady on my left
Drunk college chicks on my right
Smart  dame right across room
Cute Thai waitress completes a pleasant sight


Chug chug chug
We drink sparkling wine
Eating and being merry
Old wine makes new troubles all fine

Now thinking deeper-

In the middle of urban sub arcana
Face to face verbal smacks in your space
Comes a concept called Americana

Passionate adjectives and superlative passions
Americana is a euphemism for monetary nirvana

Nasal voices on my right
Deep bass slightly in front
To my left a wavering voice wavers
Aromatic cacophony my ears take the brunt

Wine slipping down slowly
But hey rising so fast
After effects may disappear soon
But the mellow pleasure promises to last

The Big Data Event- Why am I here?

I am here braving New York’s cold weather as I prepare for this evening’s events. If you follow this blog closely (including the poems), it is a welcome change. New York is a nice city: people are friendly if you ask them nicely, the bus is a great way to watch the city, and best of all I like the crowds, which I have grown used to while living in India.

Why Am I here?

Because the topics discussed here are cutting edge, to the point that I cannot find anyone willing to teach me Hadoop and MapReduce while in university and, at the same time, teach me statistics on them as well (as in, how do we do a K-means clustering on a 1-terabyte dataset?).
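The k-means-on-MapReduce question above can be sketched in a few lines of Python. This is a toy, single-process illustration of one round of Lloyd's algorithm expressed as a map step and a reduce step, not production Hadoop code; in a real job the map phase would be sharded across many worker nodes, and the framework would shuffle the emitted pairs by cluster id before the reducers run.

```python
from collections import defaultdict

def mapper(point, centroids):
    """Map step: emit (index of nearest centroid, point)."""
    nearest = min(range(len(centroids)),
                  key=lambda i: sum((p - c) ** 2
                                    for p, c in zip(point, centroids[i])))
    return nearest, point

def reducer(cluster_id, points):
    """Reduce step: the new centroid is the mean of the assigned points."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]

def kmeans_iteration(points, centroids):
    """One MapReduce round; the shuffle is simulated with a dict."""
    groups = defaultdict(list)
    for point in points:
        cid, p = mapper(point, centroids)
        groups[cid].append(p)
    return {cid: reducer(cid, pts) for cid, pts in groups.items()}

points = [[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [7.8, 8.2]]
centroids = [[0.0, 0.0], [10.0, 10.0]]
new = kmeans_iteration(points, centroids)
# points near (1,1) pull the first centroid toward them,
# points near (8,8) pull the second
```

On a terabyte-scale dataset the same two functions would simply run over many input splits in parallel, with the driver repeating rounds until the centroids stop moving.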

I asked the organizers what makes the event special (every event promises special mojo, after all).

This is what they said-

What is the unique value proposition of the event that will help developers and both current and potential customers?

The essence of the event is to explore new innovations in massively-parallel processing data warehousing technology and how it can help companies gain more insight from their data.  Applications include fraud detection, behavioral targeting, social network analysis, better predictions/forecasting, bioinformatics, etc.  We are exploring how MapReduce and Hadoop can be integrated into the enterprise IT system to help evolve data warehousing/BI/data mining.

And to put it even more nicely:

“The industry’s first big data event, Big Data Summit ‘09, being held this evening in New York City, will showcase Hadoop’s fit with MPP data warehouses. Aster Data will be presenting alongside Colin White, President and Founder of BI Research, Mike Brown of comScore Inc., and Jonathan Goldman, who represents LinkedIn.”

That’s good enough for me to drop into the Roosevelt Hotel on East 45th Street at around 6 pm for some reluctant networking (read: beers). 5 years ago, while working for GE, I used to run queries using SAS on a 147-million-row database (the size of the DB) and wait 3 hours for them to come back. Today that much data fits very snugly in my laptop. How soon will we have terabyte-level personal computing and petabyte-level business computing, and what challenges will they pose to standard statistical assumptions and the synching of hardware and software? Big Big Data is an interesting area to watch.

Interview Shawn Kung, Senior Director, Aster Data

Here is an interview with Shawn Kung, Senior Director of Product Management at Aster Data. Shawn explains the differences between the various database technologies, Aster’s rising appeal due to its unique technological approach, and various other topics of interest to people in the BI and technology space.


Ajay- Describe your career journey from a high school student of science till today. Do you think science is a more lucrative career?

Shawn: My career journey has spanned over a decade in several Silicon Valley technology companies.  In both high school and my college studies at Princeton, I had a fervent interest in math and quantitative economics.  Silicon Valley drew me to companies like upstart procurement software maker Ariba and database giant Oracle.  I continued my studies by returning to get a Master’s in Management Science at Stanford before going on to lead core storage systems for nearly 5 years at NetApp and subsequently Aster.

Science (whether it is math, physics, economics, or the hard engineering sciences) provides a solid foundation.  It teaches you to think and test your assumptions – those are valuable skills that can lead to both a financially lucrative and a personally inspiring career.

Ajay- How would you describe the differences between MapReduce and Hadoop, and the Oracle, SAS, Teradata, and Aster Data products, to a class of undergraduate engineers?

Shawn: Let’s start with the database guys – Oracle and Teradata.  They focus on structured data – data that has a logical schema and is manipulated via a standards-based structured query language (SQL).  Oracle tries to be everything to everyone – it does OLTP (low-latency transactions like credit card or stock trade execution apps) and some data warehousing (typically summary reporting).  Oracle’s data warehouse is not known for large-scale data warehousing and is more often used for back-office reporting.

Teradata is focused on data warehousing and scales very well, but is extremely expensive – it runs on high-end custom hardware and takes a mainframe approach to data processing.  This approach makes less sense as commodity hardware becomes more compute-rich and better software comes along to support large-scale MPP data warehousing.

SAS is very different – it’s not a relational database. It really offers an application platform for data analysis, specifically data mining.  Unlike Oracle and Teradata, which are used by SQL developers and managed by DBAs, SAS is typically run in business units by data analysts – for example a quantitative marketing analyst, a statistician/mathematician, or a savvy engineer with a data mining/math background.  SAS is used to try to find patterns, understand behaviors, and offer predictive analytics that enable businesses to identify trends and make smarter decisions than their competitors.

Hadoop offers an open-source framework for large-scale data processing.  MapReduce is a component of Hadoop, which also contains multiple other modules including a distributed filesystem (HDFS).  MapReduce offers a programming paradigm for distributed computing (a parallel data flow processing framework).

Both Hadoop and MapReduce are catered toward the application developer or programmer.  It’s not catered for enterprise data centers or IT.  If you have a finite project in a line of business and want to get it done, Hadoop offers a low-cost way to do this.  For example, if you want to do large-scale data munging like aggregations, transformations, manipulations of unstructured data – Hadoop offers a solution for this without compromising on the performance of your main data warehouse.  Once the data munging is finished, the post-processed data set can be loaded into a database for interactive analysis or analytics. It is a great combination of big data technologies for certain use-cases.
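The kind of data munging Shawn describes – large-scale aggregation of semi-structured records – maps naturally onto the two MapReduce phases. Here is a minimal, single-process Python sketch of the pattern; the log format and field names are made up for illustration, and a real Hadoop job would distribute the map phase across HDFS blocks and sort during the shuffle.

```python
from itertools import groupby

def map_phase(lines):
    # Map: parse each raw log line ("user action timestamp")
    # into (key, 1) pairs
    for line in lines:
        user = line.split()[0]
        yield user, 1

def reduce_phase(pairs):
    # Shuffle/sort, then reduce: sum the counts for each key
    pairs = sorted(pairs)
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=lambda kv: kv[0])}

logs = ["alice click 100", "bob view 101", "alice click 102"]
counts = reduce_phase(map_phase(logs))
# counts now holds events per user: alice has 2, bob has 1
```

Once a munging job like this finishes, the post-processed dataset can be bulk-loaded into a database for interactive SQL analysis, which is exactly the combination of technologies described above.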

Aster takes a very unique approach.  Our Aster nCluster software offers the best of all worlds – we offer the potential for deep analytics of SAS, the low-cost scalability and parallel processing of Hadoop/MapReduce, and the structured data advantages (schema, SQL, ACID compliance and transactional integrity, indexes, etc) of a relational database like Teradata and Oracle.  Often, we find complementary approaches and therefore view SAS and Hadoop/MapReduce as synergistic to a complete solution.  Data warehouses like Teradata and Oracle tend to be more competitive.

Ajay- What exciting products have you launched so far and what makes them unique both from a technical developer perspective and a business owner perspective

Shawn: Aster was the first to market with In-Database MapReduce, which combines the standards and familiarity of SQL and databases with the analytic power of MapReduce.  This is very unique, as it allows technical developers and application programmers to write embedded procedural algorithms once, upload them, and let business analysts or IT folks (SQL developers, DBAs, etc) invoke these SQL-MapReduce functions forever.

It is highly polymorphic (re-usable), highly fault-tolerant, highly flexible (any language – Java, Python, Ruby, Perl, R statistical language, C# in the .NET world, etc) and natively massively parallel – all of which differentiate these SQL extensions from traditional dumb user-defined functions (UDFs).

Ajay- “I am happy with my databases and I don’t need too much diversity or experimentation in my systems”, says a CEO to you.

How do you convince him using quantitative numbers and not marketing adjectives?

Shawn: Aster has dozens of production customers including big-names like MySpace, LinkedIn, Akamai, Full Tilt Poker, comScore, and several yet-to-be-named retail and financial service accounts.  We have quantified proof points that show orders of magnitude improvements in scalability, performance, and analytic insights compared to incumbent or competitor solutions.  Our highly referenceable customers would be happy to discuss their positive experiences with the CEO.

But taking a step back, there’s a fundamental concept that this CEO needs to first understand.  The world is changing – data growth is proliferating due to the digitization of so many applications and the emergence of unstructured data and new data types.  As the book “Competing on Analytics” argues, the world is shifting to a paradigm where companies that don’t take risks and push the limits on analytics will die like the dinosaurs.

IDC is projecting 10x+ growth in data over the next few years, to zettabytes of aggregate data, driven by digitization (Internet, digital television, RFID, etc).  The data is there, and in order to compete effectively and understand your customers more intimately, you need a large-scale analytics solution like the one Aster nCluster offers.  If you hold off on experimentation and innovation, it will be too late by the time you realize you have a problem at hand.

Ajay- How important is work life balance for you?

Shawn: Very important.  I hang out with my wife most weekends – we do a lot of outdoors activities like hiking and gardening.  In Silicon Valley, it’s all too easy to get caught up in the rush of things.  Taking breaks, especially during the weekend, is important to recharge and re-energize to be as productive as possible.

Ajay- Are you looking for college interns and new hires? What makes Aster exciting for you, so that you are pumped up every day to go to work?

Shawn: We’re always looking for smart, innovative, and entrepreneurial new college grads and interns, especially on the technical side.  So if you are a computer science major or recent grad or graduate student, feel free to contact us for opportunities.

What makes Aster exciting is two things.

First, the people.  Everyone is very smart and innovative, so you learn a tremendous amount, which is personally gratifying and professionally useful long-term.

Second, Aster is changing the world!

Distributed systems computing focused on big data processing and analytics – these are massive game-changers that will fundamentally change the landscape in data warehousing and analytics.  Traditional databases have been an oligopoly for over a generation – they haven’t been challenged, and so the 1970s-era technology has stuck around.  The emergence of big data and low-cost commodity hardware has created a unique opportunity to carve out a brand new market…

What gets me pumped every day is that I have the ability to contribute to a pioneer that is quickly becoming Silicon Valley’s next great success story!

Biography-

Over the past decade, Shawn has led product management for some of Silicon Valley’s most successful and innovative technology companies.  Most recently, he spent nearly 5 years at Network Appliance leading Core Systems storage product management, where he oversaw the development of high availability software and Storage Systems hardware products that grew in annual revenue from $200M to nearly $800M.  Prior to NetApp, Shawn held senior product management and corporate strategy roles at Oracle Corporation and Ariba Inc.

Shawn holds an M.S. in Management Science and Engineering from Stanford University, where he was awarded the Valentine Fellowship (endowed by Don Valentine of Sequoia Capital).  He also received a B.A. with high honors from Princeton University.

About Aster

Aster Data Systems is a proven leader in high-performance database systems for data warehousing and analytics – the first DBMS to tightly integrate SQL with MapReduce – providing deep insights on data analyzed on clusters of low-cost commodity hardware. The Aster nCluster database cost-effectively powers frontline analytic applications for companies such as MySpace, aCerno (an Akamai company), and ShareThis.

Running on low-cost off-the-shelf hardware, and providing ‘hands-free’ administration, Aster enables enterprises to meet their data warehousing needs within their budget. Aster is headquartered in San Carlos, California and is backed by Sequoia Capital, JAFCO Ventures, IVP, Cambrian Ventures, and First-Round Capital, as well as industry visionaries including David Cheriton and Ron Conway.