Here is an interview with James G Kobielus, who is the Senior Program Director, Product Marketing, Big Data Analytics Solutions at IBM. Special thanks to Payal Patel Cudia of IBM’s communication team,for helping with the logistics for this.
Ajay -What are the specific parts of the IBM Platform that deal with the three layers of Big Data -variety, velocity and volume
James-Well first of all, let’s talk about the IBM Information Management portfolio. Our big data platform addresses the three layers of big data to varying degrees either together in a product , or two out of the three or even one of the three aspects. We don’t have separate products for the variety, velocity and volume separately.
Let us define these three layers-Volume refers to the hundreds of terabytes and petabytes of stored data inside organizations today. Velocity refers to the whole continuum from batch to real time continuous and streaming data.
Variety refers to multi-structure data from structured to unstructured files, managed and stored in a common platform analyzed through common tooling.
For Volume-IBM has a highly scalable Big Data platform. This includes Netezza and Infosphere groups of products, and Watson-like technologies that can support petabytes volume of data for analytics. But really the support of volume ranges across IBM’s Information Management portfolio both on the database side and the advanced analytics side.
For real time Velocity, we have real time data acquisition. We have a product called IBM Infosphere, part of our Big Data platform, that is specifically built for streaming real time data acquisition and delivery through complex event processing. We have a very rich range of offerings that help clients build a Hadoop environment that can scale.
Our Hadoop platform is the most real time capable of all in the industry. We are differentiated by our sheer breadth, sophistication and functional depth and tooling integrated in our Hadoop platform. We are differentiated by our streaming offering integrated into the Hadoop platform. We also offer a great range of modeling and analysis tools, pretty much more than any other offering in the Big Data space.
Attached- Jim’s slides from Hadoop World
Ajay- Any plans for Mahout for Hadoop
Jim- I cant speak about product plans. We have plans but I cant tell you anything more. We do have a feature in Big Insights called System ML, a library for machine learning.
Ajay- How integral are acquisitions for IBM in the Big Data space (Netezza,Cognos,SPSS etc). Is it true that everything that you have in Big Data is acquired or is the famous IBM R and D contributing here . (see a partial list of IBM acquisitions at at http://www.ibm.com/investor/strategy/acquisitions.wss )
Jim- We have developed a lot on our own. We have the deepest R and D of anybody in the industry in all things Big Data.
For example – Watson has Big Insights Hadoop at its core. Apache Hadoop is the heart and soul of Big Data (see http://www-01.ibm.com/software/data/infosphere/hadoop/ ). A great deal that makes Big Insights so differentiated is that not everything that has been built has been built by the Hadoop community.
We have built additions out of the necessity for security, modeling, monitoring, and governance capabilities into BigInsights to make it truly enterprise ready. That is one example of where we have leveraged open source and we have built our own tools and technologies and layered them on top of the open source code.
Yes of course we have done many strategic acquisitions over the last several years related to Big Data Management and we continue to do so. This quarter we have done 3 acquisitions with strong relevance to Big Data. One of them is Vivisimo (http://www-03.ibm.com/press/us/en/pressrelease/37491.wss ).
Vivisimo provides federated Big Data discovery, search and profiling capabilities to help you figure out what data is out there,what is relevance of that data to your data science project- to help you answer the question which data should you bring in your Hadoop Cluster.
We also did Varicent , which is more performance management and we did TeaLeaf , which is a customer experience solution provider where customer experience management and optimization is one of the hot killer apps for Hadoop in the cloud. We have done great many acquisitions that have a clear relevance to Big Data.
Netezza already had a massively parallel analytics database product with an embedded library of models called Netezza Analytics, and in-database capabilties to massively parallelize Map Reduce and other analytics management functions inside the database. In many ways, Netezza provided capabilities similar to that IBM had provided for many years under the Smart Analytics Platform (http://www-01.ibm.com/software/data/infosphere/what-is-advanced-analytics/ ) .
There is a differential between Netezza and ISAS.
ISAS was built predominantly in-house over several years . If you go back a decade ago IBM acquired Ascential Software , a product portfolio that was the heart and soul of IBM InfoSphere Information Manager that is core to our big Data platform. In addition to Netezza, IBM bought SPSS two years back. We already had data mining tools and predictive modeling in the InfoSphere portfolio, but we realized we needed to have the best of breed, SPSS provided that and so IBM acquired them.
Cognos– We had some BI reporting capabilities in the InfoSphere portfolio that we had built ourselves and also acquired for various degrees from prior acquisitions. But clearly Cognos was one of the best BI vendors , and we were lacking such a rich tool set in our product in visualization and cubing and so for that reason we acquired Cognos.
There is also Unica – which is a marketing campaign optimization which in many ways is a killer app for Hadoop. Projects like that are driving many enterprises.
Ajay- How would you rank order these acquisitions in terms of strategic importance rather than data of acquisition or price paid.
Jim-Think of Big Data as an ecosystem that has components that are fitted to particular functions for data analytics and data management. Is the database the core, or the modeling tool the core, or the governance tools the core, or is the hardware platform the core. Everything is critically important. We would love to hear from you what you think have been most important. Each acquisition has helped play a critical role to build the deepest and broadest solution offering in Big Data. We offer the hardware, software, professional services, the hosting service. I don’t think there is any validity to a rank order system.
Ajay-What are the initiatives regarding open source that Big Data group have done or are planning?
Jim- What we are doing now- We are very much involved with the Apache Hadoop community. We continue to evolve the open source code that everyone leverages.. We have built BigInsights on Apache Hadoop. We have the closest, most up to date in terms of version number to Apache Hadoop ( Hbase,HDFS, Pig etc) of all commercial distributions with our BigInsights 1.4 .
We have an R library integrated with BigInsights . We have a R library integrated with Netezza Analytics. There is support for R Models within the SPSS portfolio. We already have a fair amount of support for R across the portfolio.
Ajay- What are some of the concerns (privacy,security,regulation) that you think can dampen the promise of Big Data.
Jim- There are no showstoppers, there is really a strong momentum. Some of the concerns within the Hadoop space are immaturity of the technology, the immaturity of some of the commercial offerings out there that implement Hadoop, the lack of standardization for formal sense for Hadoop.
There is no Open Standards Body that declares, ratifies the latest version of Mahout, Map Reduce, HDFS etc. There is no industry consensus reference framework for layering these different sub projects. There are no open APIs. There are no certifications or interoperability standards or organizations to certify different vendors interoperability around a common API or framework.
The lack of standardization is troubling in this whole market. That creates risks for users because users are adopting multiple Hadoop products. There are lots of Hadoop deployments in the corporate world built around Apache Hadoop (purely open source). There may be no assurance that these multiple platforms will interoperate seamlessly. That’s a huge issue in terms of just magnifying the risk. And it increases the need for the end user to develop their own custom integrated code if they want to move data between platforms, or move map-reduce jobs between multiple distributions.
Also governance is a consideration. Right now Hadoop is used for high volume ETL on multi structured and unstructured data sources, or Hadoop is used for exploratory sand boxes for data scientists. These are important applications that are a majority of the Hadoop deployments . Some Hadoop deployments are stand alone unstructured data marts for specific applications like sentiment analysis like.
Hadoop is not yet ready for data warehousing. We don’t see a lot of Hadoop being used as an alternative to data warehouses for managing the single version of truth of system or record data. That day will come but there needs to be out there in the marketplace a broader range of data governance mechanisms , master data management, data profiling products that are mature that enterprises can use to make sure their data inside their Hadoop clusters is clean and is the single version of truth. That day has not arrived yet.
One of the great things about IBM’s acquisition of Vivisimo is that a piece of that overall governance picture is discovery and profiling for unstructured data , and that is done very well by Vivisimo for several years.
What we will see is vendors such as IBM will continue to evolve security features inside of our Hadoop platform. We will beef up our data governance capabilities for this new world of Hadoop as the core of Big Data, and we will continue to build up our ability to integrate multiple databases in our Hadoop platform so that customers can use data from a bit of Hadoop,some data from a bit of traditional relational data warehouse, maybe some noSQL technology for different roles within a very complex Big Data environment.
That latter hybrid deployment model is becoming standard across many enterprises for Big Data. A cause for concern is when your Big Data deployment has a bit of Hadoop, bit of noSQL, bit of EDW, bit of in-memory , there are no open standards or frameworks for putting it all together for a unified framework not just for interoperability but also for deployment.
There needs to be a virtualization or abstraction layer for unified access to all these different Big Data platforms by the users/developers writing the queries, by administrators so they can manage data and resources and jobs across all these disparate platforms in a seamless unified way with visual tooling. That grand scenario, the virtualization layer is not there yet in any standard way across the big data market. It will evolve, it may take 5-10 years to evolve but it will evolve.
So, that’s the concern that can dampen some of the enthusiasm for Big Data Analytics.
You can read more about Jim at http://www.linkedin.com/pub/james-kobielus/6/ab2/8b0 or
follow him on Twitter at http://twitter.com/jameskobielus
You can read more about IBM Big Data at http://www-01.ibm.com/software/data/bigdata/
ACM, the Association for Computing Machinery www.acm.org, is the world’s largest educational and scientific computing society, uniting computing educators, researchers and professionals to inspire dialogue, share resources and address the field’s challenges. )
- the volume of data that is available is growing at a rate we have never seen before. Cisco has measured an 8 fold increase in the volume of IP traffic over the last 5 years and predicts that we will reach the zettabyte of data over IP in 2016
- more of the data is becoming publicly available. This isn’t only on social networks such as Facebook and twitter, but joins a more general trend involving open research initiatives and open government programs
- the desired time to get meaningful results is going down dramatically. In the past 5 years we have seen the half life of data on Facebook, defined as the amount of time that half of the public reactions to any given post (likes, shares., comments) take place, go from about 12 hours to under 3 hours currently
- our access to the net is always on via mobile device. You are always connected.
- the CPU and GPU capabilities of mobile devices is huge (an iPhone has 10 times the compute power of a Cray-1 and more graphics capabilities than early SGI workstations)
- thought leadership by tracking content that your readership is interested in via TrendSpottr you can be seen as a thought leader on the subject by being one of the first to share trending content on a given subject. I personally do this on my Facebook page (http://www.facebook.com/alain.chesnais) and have seen my klout score go up dramatically as a result
- brand marketing to be able to know when something is trending about your brand and take advantage of it as it happens.
- competitive analysis to see what is being said about two competing elements. For instance, searching TrendSpottr for “Obama OR Romney” gives you a very good understanding about how social networks are reacting to each politician. You can also do searches like “$aapl OR $msft OR $goog” to get a sense of what is the current buzz for certain hi tech stocks.
- understanding your impact in real time to be able to see which of the content that you are posting is trending the most on social media so that you can highlight it on your main page. So if all of your content is hosted on common domain name (ourbrand.com), searching for ourbrand.com will show you the most active of your site’s content. That can easily be set up by putting a TrendSpottr widget on your front page
Ajay- What are some of the privacy guidelines that you keep in mind- given the fact that you collect individual information but also have government agencies as potential users.
Prior to his election as ACM president, Chesnais was vice president from July 2008 – June 2010 as well as secretary/treasurer from July 2006 – June 2008. He also served as president of ACM SIGGRAPH from July 2002 – June 2005 and as SIG Governing Board Chair from July 2000 – June 2002.
As a French citizen now residing in Canada, he has more than 20 years of management experience in the software industry. He joined the local SIGGRAPH Chapter in Paris some 20 years ago as a volunteer and has continued his involvement with ACM in a variety of leadership capacities since then.
TrendSpottr is a real-time viral search and predictive analytics service that identifies the most timely and trending information for any topic or keyword. Our core technology analyzes real-time data streams and spots emerging trends at their earliest acceleration point — hours or days before they have become “popular” and reached mainstream awareness.
TrendSpottr serves as a predictive early warning system for news and media organizations, brands, government agencies and Fortune 500 companies and helps them to identify emerging news, events and issues that have high viral potential and market impact. TrendSpottr has partnered with HootSuite, DataSift and other leading social and big data companies.
If you bleed red,white and blue and know some geo-spatial analysis ,social network analysis and some supervised and unsupervised learning (and unlearning)- here is a chance for you to put your skills for an awesome project
For this challenge, Darpa will lodge a selected six to eight teams at George Mason University and provide them with an initial $10,000 for equipment and access to unclassified data sets including “ground-level video of human activity in both urban and rural environments; high-resolution wide-area LiDAR of urban and mountainous terrain, wide-area airborne full motion video; and unstructured amateur photos and videos, such as would be taken from an adversary’s cell phone.” However, participants are encouraged to use any open sourced, legal data sets they want. (In the hackathon spirit, we would encourage the consumption of massive quantities of pizza and Red Bull, too.)
DARPA Innovation House Project
Proposals must be one to three pages. Team resumes of any length must be attached and do not count against the page limit. Proposals must have 1-inch margins, use a font size of at least 11, and be delivered in Microsoft Word or Adobe PDF format.
Proposals must be emailed to InnovationHouse@c4i.gmu.edu by 4:00PM ET on Tuesday, July 31, 2012.
Proposals must have a Title and contain at least the following sections with the following contents.
- Team Members
Each team member must be listed with name, email and phone.
The Lead Developer should be indicated.
The statement “All team members are proposed as Key Personnel.” must be included.
- Capability Description
The description should clearly explain what capability the software is designed to provide the user, how it is proposed to work, and what data it will process.
In addition, a clear argument should be made as to why it is a novel approach that is not incremental to existing methods in the field.
- Proposed Phase 1 Demonstration
This section should clearly explain what will be demonstrated at the end of Session I. The description should be expressive, and as concrete as possible about the nature of the designs and software the team intends to produce in Session I.
- Proposed Phase 2 Demonstration
This section should clearly explain how the final software capability will be demonstrated as quantitatively as possible (for example, positing the amount of data that will be processed during the demonstration), how much time that will take, and the nature of the results the processing aims to achieve.
In addition, the following sections are optional.
- Technical Approach
The technical approach section amplifies the Capability Description, explaining proposed algorithms, coding practices, architectural designs and/or other technical details.
- Team Qualifications
Team qualifications should be included if the team?s experience base does not make it obvious that it has the potential to do this level of software development. In that case, this section should make a credible argument as to why the team should be considered to have a reasonable chance of completing its goals, especially under the tight timelines described.
Other sections may be included at the proposers? discretion, provided the proposal does not exceed three pages.
One of those cool conferences that is on my bucket list- this time in Hungary (That’s a nice place)
But I am especially interested in seeing how far Radoop has come along !
Disclaimer- Rapid Miner has been a Decisionstats.com sponsor for many years. It is also a very cool software but I like the R Extension facility even more!
and not very expensive too compared to other User Conferences in Europe!-
Information about Registration
- Early Bird registration until July 20th, 2012.
- Normal registration from July 21st, 2012 until August 13th, 2012.
- Latest registration from August 14th, 2012 until August 24th, 2012.
- Students have to provide a valid Student ID during registration.
- The Dinner is included in the All Days and in the Conference packages.
- All prices below are net prices. Value added tax (VAT) has to be added if applicable.
Prices for Regular Visitors
Days and Event
Early Bird Rate
(Training / Development 1)
|190 Euro||230 Euro||280 Euro|
|Wednesday + Thursday
|290 Euro||350 Euro||420 Euro|
(Training / Development 2 and Exam)
|190 Euro||230 Euro||280 Euro|
|610 Euro||740 Euro||900 Euro|
Prices for Authors and Students
In case of students, please note that you will have to provide a valid student ID during registration.
Days and Event
Early Bird Rate
(Training / Development 1)
|90 Euro||110 Euro||140 Euro|
|Wednesday + Thursday
|140 Euro||170 Euro||210 Euro|
(Training / Development 2 and Exam)
|90 Euro||110 Euro||140 Euro|
|290 Euro||350 Euro||450 Euro|
09:00 – 10:30
Ingo Mierswa; Rapid-I
NeurophRM: Integration of the Neuroph framework into RapidMiner
|To be announced (Invited Talk)
To be announced
Extending RapidMiner with Recommender Systems Algorithms
Implementation of User Based Collaborative Filtering in RapidMiner
|Parallel Training / Workshop Session
10:30 – 12:30
Nearest-Neighbor and Clustering based Anomaly Detection Algorithms for RapidMiner
Customers’ LifeStyle Targeting on Big Data using Rapid Miner
Robust GPGPU Plugin Development for RapidMiner
Image Mining Extension – Year After
Incorporating R Plots into RapidMiner Reports
An Octave Extension for RapidMiner
12:30 – 13:30
13:30 – 15:00
|Parallel Training / Workshop Session
Application of RapidMiner in Steel Industry Research and Development
A Comparison of Data-driven Models for Forecast River Flow
Portfolio Optimization Using Local Linear Regression Ensembles in Rapid Miner
Processing Data Streams with the RapidMiner Streams-Plugin
Automated Creation of Corpuses for the Needs of Sentiment Analysis
News from the Rapid-I Labs
This short session demonstrates the latest developments from the Rapid-I lab and will let you how you can build powerful analysis processes and routines by using those RapidMiner tools.
15:00 – 17:00
|Book Presentation and Game Show
Data Mining for the Masses: A New Textbook on Data Mining for Everyone
Matthew North presents his new book “Data Mining for the Masses” introducing data mining to a broader audience and making use of RapidMiner for practical data mining problems.
Get some Coffee for free – Writing Operators with RapidMiner Beans
Meta-Modeling Execution Times of RapidMiner operators
Social Event (Conference Dinner)
Social Event (Visit of Bar District)
This is a short introductory training course for users who are not yet familiar with RapidMiner or only have a few experiences with RapidMiner so far. The topics of this training session include
- Basic Usage
- User Interface
- Creating and handling RapidMiner repositories
- Starting a new RapidMiner project
- Operators and processes
- Loading data from flat files
- Storing data, processes, and results
- Predictive Models
- Linear Regression
- Naïve Bayes
- Decision Trees
- Basic Data Transformations
- Changing names and roles
- Handling missing values
- Changing value types by discretization and dichotimization
- Normalization and standardization
- Filtering examples and attributes
- Scoring and Model Evaluation
- Applying models
- Splitting data
- Evaluation methods
- Performance criteria
- Visualizing Model Performance
This is a short introductory training course for users who already know some basic concepts of RapidMiner and data mining and have already used the software before, for example in the first training on Tuesday. The topics of this training session include
- Advanced Data Handling
- Balancing data
- Joins and Aggregations
- Detection and removal of outliers
- Dimensionality reduction
- Control process execution
- Remember process results
- Recall process results
- Using branches and conditions
- Exception handling
- Definition of macros
- Usage of macros
- Definition of log values
- Clearing log tables
- Transforming log tables to data
Want to exchange ideas with the developers of RapidMiner? Or learn more tricks for developing own operators and extensions? During our development workshops on Tuesday and Friday, we will build small groups of developers each working on a small development project around RapidMiner. Beginners will get a comprehensive overview of the architecture of RapidMiner before making the first steps and learn how to write own operators. Advanced developers will form groups with our experienced developers, identify shortcomings of RapidMiner and develop a new extension which might be presented during the conference already. Unfinished work can be continued in the second workshop on Friday before results might be published on the Marketplace or can be taken home as a starting point for new custom operators.
While SAS language has a beautifully designed ODS (Output Delivery System) for saving output from certain analysis in excel files (and html and others), in R one can simply use the object, put it in a write.table and save it a csv file using the file parameter within write.table.
As a business analytics consultant, the output from a Proc Means, Proc Freq (SAS) or a summary/describe/table command (in R) is to be presented as a final report. Copying and pasting is not feasible especially for large amounts of text, or remote computers.
Using the following we can simple save the output in R
#We shifted the directory, so we can save output without putting the entire path again and again for each step.
#I have found the summary command most useful for initial analysis and final display (particularly during the data munging step)
# I assigned a new object to the analysis step (summary), it could also be summary,names, describe (HMisc) or table (for frequency analysis),
Note: This is for basic beginners in R using it for business analytics dealing with large number of variables.
If you have a large number of files in a local directory to be read in R, you can avoid typing the entire path again and again by modifying the file parameter in the read.table and changing the working directory to that folder
and so on…
maybe there is a better approach somewhere on Stack Overflow or R help, but this will work just as well.
you can then merge the objects created ajayt1 and ajayt2… (to be continued)
I love GUIs (graphical user interfaces)- they might be TCL/TK based or GTK based or even QT based. As a researcher they help me with faster coding, as a consultant they help with faster transition of projects from startup to handover stage and as an R instructor helps me get people to learn R faster.
I wish Python had some GUIs though 😉
from the open access journal of statistical software-
JSS Special Volume 49: Graphical User Interfaces for R
Pedro M. Valero-Mora, Ruben Ledesma
Vol. 49, Issue 1, Jun 2012
Submitted 2012-06-03, Accepted 2012-06-03
Ya-Shan Cheng, Chien-Yu Peng
Vol. 49, Issue 2, Jun 2012
Submitted 2010-12-31, Accepted 2011-06-29
Joris J. Snellenburg, Sergey Laptenok, Ralf Seger, Katharine M. Mullen, Ivo H. M. van Stokkum
Vol. 49, Issue 3, Jun 2012
Submitted 2011-01-20, Accepted 2011-09-16
Marcel Austenfeld, Wolfram Beyschlag
Vol. 49, Issue 4, Jun 2012
Submitted 2011-01-05, Accepted 2012-02-20
Byron C. Wallace, Issa J. Dahabreh, Thomas A. Trikalinos, Joseph Lau, Paul Trow, Christopher H. Schmid
Vol. 49, Issue 5, Jun 2012
Submitted 2010-11-01, Accepted 2012-12-20
Bei Huang, Dianne Cook, Hadley Wickham
Vol. 49, Issue 6, Jun 2012
Submitted 2011-01-20, Accepted 2012-04-16
John Fox, Marilia S. Carvalho
Vol. 49, Issue 7, Jun 2012
Submitted 2010-12-26, Accepted 2011-12-28
Vol. 49, Issue 8, Jun 2012
Submitted 2011-02-28, Accepted 2011-09-08
Stefan Rödiger, Thomas Friedrichsmeier, Prasenjit Kapat, Meik Michalke
Vol. 49, Issue 9, Jun 2012
Submitted 2010-12-28, Accepted 2011-05-06
Vol. 49, Issue 10, Jun 2012
Submitted 2010-12-17, Accepted 2011-05-11
Vol. 49, Issue 11, Jun 2012
Submitted 2010-12-08, Accepted 2011-07-15