Tag Archives: mining

Top 7 Business Strategy Models

UPDATED POST- Some models I use for business strategy, to analyze the huge reams of qualitative and uncertain data that business generates. I have added a bonus, the Business Canvas model (number 2).

  1. Porter’s 5 Forces Model- To analyze industries
  2. Business Canvas
  3. BCG Matrix- To analyze product portfolios
  4. Porter’s Diamond Model- To analyze locations
  5. McKinsey 7 S Model- To analyze teams
  6. Greiner Theory- To analyze growth of organizations
  7. Herzberg Hygiene Theory- To analyze soft aspects of individuals
  8. Marketing Mix Model- To analyze the marketing mix


Analytics 2012 Conference

from http://www.sas.com/events/analytics/us/index.html

SAS and more than 1,000 analytics experts gather at Caesars Palace.

Analytics 2012 Conference Details

Pre-Conference Workshops – Oct 7
Conference – Oct 8-9
Post-Conference Training – Oct 10-12
Caesars Palace, Las Vegas

Keynote Speakers

The following are confirmed keynote speakers for Analytics 2012.

Jim Goodnight: Since he co-founded SAS in 1976, Jim Goodnight has served as the company’s Chief Executive Officer.

William Hakes: Dr. William Hakes is the CEO and co-founder of Link Analytics, an analytical technology company focused on mobile, energy and government verticals.

Tim Rey: Tim Rey has written over 100 internal papers, published 21 external papers, and delivered numerous keynote presentations and technical talks at various quantitative methods forums. Recently he has co-chaired both forecasting and data mining conferences. He is currently in the process of co-writing a book, Applied Data Mining for Forecasting.

http://www.sas.com/events/analytics/us/train.html

Pre-Conference

Plan to come to Analytics 2012 a day early and participate in one of the pre-conference workshops or take a SAS Certification exam. Prices for all of the preconference workshops, except for SAS Sentiment Analysis Studio: Introduction to Building Models and the Business Analytics Consulting Workshops, are included in the conference package pricing. You will be prompted to select your pre-conference training options when you register.

Sunday Morning Workshop

SAS Sentiment Analysis Studio: Introduction to Building Models

This course provides an introduction to SAS Sentiment Analysis Studio. It is designed for system designers, developers, analytical consultants and managers who want to understand techniques and approaches for identifying sentiment in textual documents.
View outline
Sunday, Oct. 7; 8:30 a.m.-12 p.m. – $250

Sunday Afternoon Workshops

Business Analytics Consulting Workshops

This workshop is designed for the analyst, statistician, or executive who wants to discuss best-practice approaches to solving specific business problems, in the context of analytics. The two-hour workshop will be customized to discuss your specific analytical needs and will be designed as a one-on-one session for you, including up to five individuals within your company sharing your analytical goal. This workshop is specifically geared for an expert tasked with solving a critical business problem who needs consultation for developing the analytical approach required. The workshop can be customized to meet your needs, from a deep-dive into modeling methods to a strategic plan for analytic initiatives. In addition to the two hours at the conference location, this workshop includes some advanced consulting time over the phone, making it a valuable investment at a bargain price.
View outline
Sunday, Oct. 7; 1-3 p.m. or 3:30-5:30 p.m. – $200

Demand-Driven Forecasting: Sensing Demand Signals, Shaping and Predicting Demand

This half-day lecture teaches students how to integrate demand-driven forecasting into the consensus forecasting process and how to make the current demand forecasting process more demand-driven.
View outline
Sunday, Oct. 7; 1-5 p.m.

Forecast Value Added Analysis

Forecast Value Added (FVA) is the change in a forecasting performance metric (such as MAPE or bias) that can be attributed to a particular step or participant in the forecasting process. FVA analysis is used to identify those process activities that are failing to make the forecast any better (or might even be making it worse). This course provides step-by-step guidelines for conducting FVA analysis – to identify and eliminate the waste, inefficiency, and worst practices from your forecasting process. The result can be better forecasts, with fewer resources and less management time spent on forecasting.
View outline
Sunday, Oct. 7; 1-5 p.m.
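
The FVA definition above can be illustrated with a small worked example. This is only a minimal sketch, not material from the SAS course: the demand figures, the three forecast stages, and the function names `mape` and `forecast_value_added` are all hypothetical, and the sign convention (positive means the step reduced MAPE) is an assumption.

```python
def mape(actuals, forecasts):
    """Mean absolute percentage error in percent (assumes no actual is zero)."""
    errors = [abs(a - f) / abs(a) for a, f in zip(actuals, forecasts)]
    return 100.0 * sum(errors) / len(errors)


def forecast_value_added(actuals, before, after):
    """MAPE before a process step minus MAPE after it.

    Sign convention (an assumption): positive means the step reduced error.
    """
    return mape(actuals, before) - mape(actuals, after)


# Hypothetical demand history and the forecast at each stage of the process.
actuals = [100, 120, 80, 110]
naive = [90, 100, 120, 80]               # previous period's actual carried forward
statistical = [105, 115, 90, 100]        # output of a statistical model
analyst_override = [110, 130, 100, 100]  # after manual adjustment

fva_model = forecast_value_added(actuals, naive, statistical)
fva_override = forecast_value_added(actuals, statistical, analyst_override)
print(f"Statistical model vs naive: {fva_model:+.1f} MAPE points")
print(f"Analyst override vs model:  {fva_override:+.1f} MAPE points")
```

With these made-up numbers the statistical model adds value over the naive forecast, while the manual override makes the forecast worse, which is the kind of step FVA analysis is meant to flag.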

SAS Enterprise Content Categorization: An Introduction

This course gives an introduction to methods of unstructured data analysis, document classification and document content identification. The course also uses examples as the basis for constructing parse expressions and resulting entities.
View outline
Sunday, Oct. 7; 1-5 p.m.

Introduction to Data Mining and SAS Enterprise Miner

This course serves as an introduction to data mining and SAS Enterprise Miner for Desktop software. It is designed for data analysts and qualitative experts as well as those with less of a technical background who want a general understanding of data mining.
View outline
Sunday, Oct. 7, 1-5 p.m.

Modeling Trend, Cycles, and Seasonality in Time Series Data Using PROC UCM

This half-day lecture teaches students how to model, interpret, and predict time series data using UCMs. The UCM procedure analyzes and forecasts equally spaced univariate time series data using the unobserved components models (UCM). This course is designed for business analysts who want to analyze time series data to uncover patterns such as trend, seasonal effects, and cycles using the latest techniques.
View outline
Sunday, Oct. 7, 1-5 p.m.

SAS Rapid Predictive Modeler

This seminar will provide a brief introduction to the use of SAS Enterprise Guide for graphical and data analysis. However, the focus will be on using SAS Enterprise Guide and SAS Enterprise Miner along with the Rapid Predictive Modeling component to build predictive models. Predictive modeling will be introduced using the SEMMA process developed with the introduction of SAS Enterprise Miner. Several examples will be used to illustrate the use of the Rapid Predictive Modeling component, and interpretations of the model results will be provided.
View outline
Sunday, Oct. 7, 1-5 p.m.

Using Rapid Miner and R for Sports Analytics #rstats

Rapid Miner is one of the oldest open source analytics packages, dating from long before open source or even analytics was considered a fashionable buzzword. The Rapid Miner software has been a pioneer in many areas (like establishing a marketplace for Rapid Miner extensions), and the Rapid Miner R extension was one of the most promising enablers of using R in an enterprise setting.
The following interview was conducted with a manager of analytics for a sports organization. The organization considers analytics a strategic differentiator, hence its name is confidential. No part of the interview has been edited or manipulated.

Ajay- Why did you choose Rapid Miner and R? What were the other software alternatives you considered and discarded?

Analyst- We considered most of the other major players in statistics/data mining or enterprise BI.  However, we found that the value proposition for an open source solution was too compelling to justify the premium pricing that the commercial solutions would have required.  The widespread adoption of R and the variety of packages and algorithms available for it, made it an easy choice.  We liked RapidMiner as a way to design structured, repeatable processes, and the ability to optimize learner parameters in a systematic way.  It also handled large data sets better than R on 32-bit Windows did.  The GUI, particularly when 5.0 was released, made it more usable than R for analysts who weren’t experienced programmers.

Ajay- What analytics do you think Rapid Miner and R are best suited for?

 Analyst- We use RM+R mainly for sports analysis so far, rather than for more traditional business applications.  It has been quite suitable for that, and I can easily see how it would be used for other types of applications.

 Ajay- Any experiences as an enterprise customer? How was the installation process? How good is the enterprise level support?

Analyst- Rapid-I has been one of the most responsive tech companies I’ve dealt with, either in my current role or with previous employers.  They are small enough to be able to respond quickly to requests, and in more than one case, have fixed a problem, or added a small feature we needed within a matter of days.  In other cases, we have contracted with them to add larger pieces of specific functionality we needed at reasonable consulting rates.  Those features are added to the mainline product, and become fully supported through regular channels.  The longer consulting projects have typically had a turnaround of just a few weeks.

Ajay- What challenges, if any, did you face in executing a pure open source analytics bundle?

Analyst- As Rapid-I is a smaller company based in Europe, the availability of training and consulting in the USA isn’t as extensive as for the major enterprise software players, and the time zone differences sometimes slow down the communications cycle. There were times where we were the first customer to attempt a specific integration point in our technical environment, and with no prior experiences to fall back on, we had to work with Rapid-I to figure out how to do it. Compared to what traditional software vendors provide, both R and RM tend to have sparse, terse, occasionally incomplete documentation. The situation is getting better, but still lags behind what the traditional enterprise software vendors provide.

Ajay- What are the things you can do in R, and what are the things you prefer to do in Rapid Miner (comparison for technical synergies)?

Analyst- Our experience has been that RM is superior to R at writing and maintaining structured processes, better at handling larger amounts of data, and more flexible at fine-tuning model parameters automatically. The biggest limitation we’ve had with RM compared to R is that R has a larger library of user-contributed packages for additional data mining algorithms. Sometimes we opted to use R because RM hadn’t yet implemented a specific algorithm. The introduction of the R extension has allowed us to combine the strengths of both tools in a very logical and productive way.

In particular, extending RapidMiner with R helped address RM’s weakness in the breadth of algorithms, because it brings the entire R ecosystem into RM (similar to how Rapid-I implemented much of the Weka library early on in RM’s development).  Further, because the R user community releases packages that implement new techniques faster than the enterprise vendors can, this helps turn a potential weakness into a potential strength.  However, R packages tend to be of varying quality, and are more prone to go stale due to lack of support/bug fixes.  This depends heavily on the package’s maintainer and its prevalence of use in the R community.  So when RapidMiner has a learner with a native implementation, it’s usually better to use it than the R equivalent.

RCOMM 2012 goes live in August

An awesome conference by an awesome software. Rapid Miner remains one of the leading enterprise-grade open source software packages, and it can help you do a lot of things, including flow-driven data modeling, web mining and web crawling, which even other software can’t.

Presentations include:

  • Mining Machine 2 Machine Data (Katharina Morik, TU Dortmund University)
  • Handling Big Data (Andras Benczur, MTA SZTAKI)
  • Introduction of RapidAnalytics at Telenor (Telenor and United Consult)
  • and more

Here is the complete program.

Program

Time Slot | Tuesday: Training / Workshop 1 | Wednesday: Conference 1 | Thursday: Conference 2 | Friday: Training / Workshop 2

09:00 – 10:30

Introductory Speech
Ingo Mierswa (Rapid-I)

Resource-aware Data Mining or M2M Mining (Invited Talk)
Katharina Morik (TU Dortmund University)

Data Analysis

 

NeurophRM: Integration of the Neuroph framework into RapidMiner
Miloš Jovanović, Jelena Stojanović, Milan Vukićević, Vera Stojanović, Boris Delibašić (University of Belgrade)

To be announced (Invited Talk)
Andras Benczur 

Recommender Systems

 

Extending RapidMiner with Recommender Systems Algorithms
Matej Mihelčić, Nino Antulov-Fantulin, Matko Bošnjak, Tomislav Šmuc (Ruđer Bošković Institute)

Implementation of User Based Collaborative Filtering in RapidMiner
Sérgio Morais, Carlos Soares (Universidade do Porto)

Parallel Training / Workshop Session

Advanced Data Mining and Data Transformations

or

Development Workshop Part 2

10:30 – 11:00
Coffee Break
11:00 – 12:30
Data Analysis

Nearest-Neighbor and Clustering based Anomaly Detection Algorithms for RapidMiner
Mennatallah Amer, Markus Goldstein (DFKI)

Customers’ LifeStyle Targeting on Big Data using Rapid Miner
Maksim Drobyshev (LifeStyle Marketing Ltd)

Robust GPGPU Plugin Development for RapidMiner
Andor Kovács, Zoltán Prekopcsák (Budapest University of Technology and Economics)

Extensions

 

Optimization Plugin For RapidMiner
Venkatesh Umaashankar, Sangkyun Lee (TU Dortmund University; presented by Hendrik Blom)

 

Image Mining Extension – Year After
Radim Burget, Václav Uher, Jan Mašek (Brno University of Technology)

Incorporating R Plots into RapidMiner Reports
Peter Jeszenszky (University of Debrecen)

12:30 – 13:30
Lunch
13:30 – 15:30
Parallel Training / Workshop Session

Basic Data Mining and Data Transformations

or

Development Workshop Part 1

Applications

 

Introduction of RapidAnalytics Enterprise Edition at Telenor Hungary
t.b.a. (Telenor Hungary and United Consult)

 

Application of RapidMiner in Steel Industry Research and Development
Bengt-Henning Maas, Hakan Koc, Martin Bretschneider (Salzgitter Mannesmann Forschung)

A Comparison of Data-driven Models for Forecast River Flow
Milan Cisty, Juraj Bezak (Slovak University of Technology)

Portfolio Optimization Using Local Linear Regression Ensembles in Rapid Miner
Gábor Nagy, Tamás Henk, Gergő Barta (Budapest University of Technology and Economics)

Extensions

 

An Octave Extension for RapidMiner
Sylvain Marié (Schneider Electric)

 

Unstructured Data

 

Processing Data Streams with the RapidMiner Streams-Plugin
Christian Bockermann, Hendrik Blom (TU Dortmund)

Automated Creation of Corpuses for the Needs of Sentiment Analysis
Peter Koncz, Jan Paralic (Technical University of Kosice)

 

Demonstration: News from the Rapid-I Labs
Simon Fischer; Rapid-I

This short session demonstrates the latest developments from the Rapid-I Labs and will show you how you can build powerful analysis processes and routines by using those RapidMiner tools.

Certification Exam
15:30 – 16:00
Coffee Break
16:00 – 18:00
Book Presentation and Game Show

Data Mining for the Masses: A New Textbook on Data Mining for Everyone
Matthew North (Washington & Jefferson College)

Matthew North presents his new book “Data Mining for the Masses” introducing data mining to a broader audience and making use of RapidMiner for practical data mining problems.

 

Game Show
Did you miss last year’s game show “Who wants to be a data miner?”? Use RapidMiner for problems it was never created for and beat the time and other contestants!

User Support

Get some Coffee for free – Writing Operators with RapidMiner Beans
Christian Bockermann, Hendrik Blom (TU Dortmund)

Meta-Modeling Execution Times of RapidMiner operators
Matija Piškorec, Matko Bošnjak, Tomislav Šmuc (Ruđer Bošković Institute)

Conference day ends at ca. 17:00.

19:30
Social Event (Conference Dinner)
Social Event (Visit of Bar District)

 

You should also have a look at https://rapid-i.com/rcomm2012f/index.php?option=com_content&view=article&id=65

The conference is in Budapest, Hungary.

(Disclaimer- Rapid Miner is an advertising sponsor of Decisionstats.com, in case you did not notice the two banner-sized ads.)

 

Big Noise on Big Data

Increasingly, Big Data is used in writing where Business Analytics was used before, and data mining is thrown in as a word just to keep liberal arts majors happy that they are reading a scientific article.

Some Big Words I have noticed in my short life-

Big Data? High Performance Analytics? High Performance Computing? Cloud Computing? Time Sharing? Data Mining? SEMMA? CRISP-DM? KDD? Business Intelligence? Business Analytics and Optimization? (Pick a card, any card.)

(or Just Moore’s Law catching up with the analytics)

Some examples-

Replace Big Data with Analytics in these articles and let me know if you can make out much of a difference

  • Big Data on Campus

http://www.nytimes.com/2012/07/22/education/edlife/colleges-awakening-to-the-opportunities-of-data-mining.html

  • The man who famously said BI is dead, SAS CMO Jim Davis, is now burying Business Analytics within the new buzzword

How to transform big data from an obstacle into an asset

http://blogs.sas.com/content/corneroffice/2012/07/22/how-to-transform-big-data-from-an-obstacle-into-an-asset/

(Related- Is big data over hyped? by Jim Davis

http://www.sas.com/knowledge-exchange/business-analytics/featured/is-big-data-over-hyped/index.html )

I am sure by 2015, Jim Davis, NYT and the merry men of analytics will find some other buzzwords to rally the troops. In the meantime, let me throw out the flag and call it Big  .

Interview James G Kobielus IBM Big Data

Here is an interview with James G Kobielus, Senior Program Director, Product Marketing, Big Data Analytics Solutions at IBM. Special thanks to Payal Patel Cudia of IBM’s communications team for helping with the logistics.

Ajay- What are the specific parts of the IBM platform that deal with the three layers of Big Data: variety, velocity and volume?

James- Well, first of all, let’s talk about the IBM Information Management portfolio. Our big data platform addresses the three layers of big data to varying degrees: some products cover all three together, some cover two of the three, and some just one. We don’t have separate products for variety, velocity and volume.

Let us define these three layers. Volume refers to the hundreds of terabytes and petabytes of stored data inside organizations today. Velocity refers to the whole continuum from batch to real-time continuous and streaming data.

Variety refers to multi-structured data, from structured to unstructured files, managed and stored in a common platform and analyzed through common tooling.

For Volume, IBM has a highly scalable Big Data platform. This includes the Netezza and InfoSphere groups of products, and Watson-like technologies that can support petabyte volumes of data for analytics. But really the support of volume ranges across IBM’s Information Management portfolio, both on the database side and the advanced analytics side.

For real-time Velocity, we have real-time data acquisition. We have a product called IBM InfoSphere Streams, part of our Big Data platform, that is specifically built for streaming real-time data acquisition and delivery through complex event processing. We have a very rich range of offerings that help clients build a Hadoop environment that can scale.

Our Hadoop platform is the most real time capable of all in the industry. We are differentiated by our sheer breadth, sophistication and functional depth and tooling integrated in our Hadoop platform. We are differentiated by our streaming offering integrated into the Hadoop platform. We also offer a great range of modeling and analysis tools, pretty much more than any other offering in the Big Data space.

Attached- Jim’s slides from Hadoop World

Ajay- Any plans for Mahout for Hadoop?

Jim- I can’t speak about product plans. We have plans, but I can’t tell you anything more. We do have a feature in BigInsights called SystemML, a library for machine learning.

Ajay- How integral are acquisitions for IBM in the Big Data space (Netezza, Cognos, SPSS etc.)? Is it true that everything you have in Big Data is acquired, or is the famous IBM R and D contributing here? (See a partial list of IBM acquisitions at http://www.ibm.com/investor/strategy/acquisitions.wss )

Jim- We have developed a lot on our own. We have the deepest R and D of anybody in the industry in all things Big Data.

For example, Watson has BigInsights Hadoop at its core. Apache Hadoop is the heart and soul of Big Data (see http://www-01.ibm.com/software/data/infosphere/hadoop/ ). A great deal of what makes BigInsights so differentiated is that not everything in it has been built by the Hadoop community.

We have built additions out of the necessity for security, modeling, monitoring, and governance capabilities into BigInsights to make it truly enterprise ready. That is one example of where we have leveraged open source and we have built our own tools and technologies and layered them on top of the open source code.

Yes of course we have done many strategic acquisitions over the last several years related to Big Data Management and we continue to do so. This quarter we have done 3 acquisitions with strong relevance to Big Data. One of them is Vivisimo (http://www-03.ibm.com/press/us/en/pressrelease/37491.wss ).

Vivisimo provides federated Big Data discovery, search and profiling capabilities to help you figure out what data is out there and what the relevance of that data is to your data science project, to help you answer the question of which data you should bring into your Hadoop cluster.

We also did Varicent, which is more performance management, and we did TeaLeaf, which is a customer experience solution provider; customer experience management and optimization is one of the hot killer apps for Hadoop in the cloud. We have done a great many acquisitions that have a clear relevance to Big Data.

Netezza already had a massively parallel analytics database product with an embedded library of models called Netezza Analytics, and in-database capabilities to massively parallelize Map Reduce and other analytics management functions inside the database. In many ways, Netezza provided capabilities similar to those IBM had provided for many years under the Smart Analytics Platform (http://www-01.ibm.com/software/data/infosphere/what-is-advanced-analytics/ ).

There is a differential between Netezza and ISAS.

ISAS was built predominantly in-house over several years. If you go back a decade, IBM acquired Ascential Software, a product portfolio that was the heart and soul of IBM InfoSphere Information Manager, which is core to our Big Data platform. In addition to Netezza, IBM bought SPSS two years back. We already had data mining tools and predictive modeling in the InfoSphere portfolio, but we realized we needed to have the best of breed; SPSS provided that, and so IBM acquired them.

Cognos- We had some BI reporting capabilities in the InfoSphere portfolio that we had built ourselves and also acquired to various degrees from prior acquisitions. But clearly Cognos was one of the best BI vendors, and we were lacking such a rich tool set in our product in visualization and cubing, so for that reason we acquired Cognos.

There is also Unica, a marketing campaign optimization product, which in many ways is a killer app for Hadoop. Projects like that are driving many enterprises.

Ajay- How would you rank order these acquisitions in terms of strategic importance rather than date of acquisition or price paid?

Jim- Think of Big Data as an ecosystem that has components that are fitted to particular functions for data analytics and data management. Is the database the core, or the modeling tool the core, or the governance tools the core, or is the hardware platform the core? Everything is critically important. We would love to hear from you what you think have been most important. Each acquisition has helped play a critical role to build the deepest and broadest solution offering in Big Data. We offer the hardware, software, professional services, the hosting service. I don’t think there is any validity to a rank order system.

Ajay- What open source initiatives has the Big Data group undertaken, or is it planning?

Jim- What we are doing now: we are very much involved with the Apache Hadoop community. We continue to evolve the open source code that everyone leverages. We have built BigInsights on Apache Hadoop. Of all commercial distributions, our BigInsights 1.4 is the closest and most up to date in terms of version number to Apache Hadoop (HBase, HDFS, Pig etc.).

We have an R library integrated with BigInsights. We have an R library integrated with Netezza Analytics. There is support for R models within the SPSS portfolio. We already have a fair amount of support for R across the portfolio.

Ajay- What are some of the concerns (privacy, security, regulation) that you think can dampen the promise of Big Data?

Jim- There are no showstoppers; there is really strong momentum. Some of the concerns within the Hadoop space are the immaturity of the technology, the immaturity of some of the commercial offerings out there that implement Hadoop, and the lack of standardization in any formal sense for Hadoop.

There is no Open Standards Body that declares, ratifies the latest version of Mahout, Map Reduce, HDFS etc. There is no industry consensus reference framework for layering these different sub projects. There are no open APIs. There are no certifications or interoperability standards or organizations to certify different vendors interoperability around a common API or framework.

The lack of standardization is troubling in this whole market. That creates risks for users because users are adopting multiple Hadoop products. There are lots of Hadoop deployments in the corporate world built around Apache Hadoop (purely open source). There may be no assurance that these multiple platforms will interoperate seamlessly. That’s a huge issue in terms of just magnifying the risk. And it increases the need for the end user to develop their own custom integrated code if they want to move data between platforms, or move map-reduce jobs between multiple distributions.

Also, governance is a consideration. Right now Hadoop is used for high-volume ETL on multi-structured and unstructured data sources, or Hadoop is used for exploratory sandboxes for data scientists. These are important applications that make up a majority of the Hadoop deployments. Some Hadoop deployments are stand-alone unstructured data marts for specific applications like sentiment analysis.

Hadoop is not yet ready for data warehousing. We don’t see a lot of Hadoop being used as an alternative to data warehouses for managing the single version of truth of system-of-record data. That day will come, but there needs to be out there in the marketplace a broader range of mature data governance mechanisms, master data management and data profiling products that enterprises can use to make sure the data inside their Hadoop clusters is clean and is the single version of truth. That day has not arrived yet.

One of the great things about IBM’s acquisition of Vivisimo is that a piece of that overall governance picture is discovery and profiling for unstructured data, and that has been done very well by Vivisimo for several years.

What we will see is that vendors such as IBM will continue to evolve security features inside of our Hadoop platform. We will beef up our data governance capabilities for this new world of Hadoop as the core of Big Data, and we will continue to build up our ability to integrate multiple databases in our Hadoop platform so that customers can use some data from a bit of Hadoop, some data from a bit of traditional relational data warehouse, and maybe some NoSQL technology for different roles within a very complex Big Data environment.

That latter hybrid deployment model is becoming standard across many enterprises for Big Data. A cause for concern is that when your Big Data deployment has a bit of Hadoop, a bit of NoSQL, a bit of EDW and a bit of in-memory, there are no open standards or frameworks for putting it all together into a unified framework, not just for interoperability but also for deployment.

There needs to be a virtualization or abstraction layer for unified access to all these different Big Data platforms by the users/developers writing the queries, by administrators so they can manage data and resources and jobs across all these disparate platforms in a seamless unified way with visual tooling. That grand scenario, the virtualization layer is not there yet in any standard way across the big data market. It will evolve, it may take 5-10 years to evolve but it will evolve.

So, that’s the concern that can dampen some of the enthusiasm for Big Data Analytics.

About-

You can read more about Jim at http://www.linkedin.com/pub/james-kobielus/6/ab2/8b0 or

follow him on Twitter at http://twitter.com/jameskobielus

You can read more about IBM Big Data at http://www-01.ibm.com/software/data/bigdata/

Interview Alain Chesnais Chief Scientist Trendspottr.com

Here is a brief interview with Alain Chesnais, Chief Scientist at Trendspottr.com. It is a big honor to interview such a legend in computer science, and I am grateful to both him and Mark Zohar for taking time to write these down.

Ajay- Describe your career from your student days to being the President of ACM (Association for Computing Machinery http://www.acm.org/ ). How can we increase the interest of students in STEM education, particularly in view of the shortage of data scientists?
 
Alain- I’m trying to sum up a career of over 35 years. This may be a bit long winded…
I started my career in CS when I was in high school in the early 70’s. I was accepted in the National Science Foundation’s Science Honors Program in 9th grade and the first course I took was a Fortran programming course at Columbia University. This was on an IBM 360 using punch cards.
The next year my high school got a donation from DEC of a PDP-8E mini computer. I ended up spending a lot of time in the machine room all through high school at a time when access to computers wasn’t all that common. I went to college in Paris and ended up at l’Ecole Normale Supérieure de Cachan in the newly created Computer Science department.
My first job after finishing my graduate studies was as a research assistant at the Centre National de la Recherche Scientifique where I focused my efforts on modelling the behaviour of distributed database systems in the presence of locking. When François Mitterrand was elected president of France in 1981, he invited Nicholas Negroponte and Seymour Papert to come to France to set up the Centre Mondial Informatique. I was hired as a researcher there and continued on to become director of software development until it was closed down in 1986. I then started up my own company focusing on distributed computer graphics. We sold the company to Abvent in the early 90’s.
After that, I was hired by Thomson Digital Image to lead their rendering team. We were acquired by Wavefront Technologies in 1993, then by SGI in 1995, and merged with Alias Research. In the merged company, Alias|Wavefront, I was director of engineering on the Maya project. Our team received an Oscar in 2003 for the creation of the Maya software system.
Since then I’ve worked at various companies, most recently focusing on social media and Big Data issues associated with it. Mark Zohar and I worked together at SceneCaster in 2007 where we developed a Facebook app that allowed users to create their own 3D scenes and share them with friends via Facebook without requiring a proprietary plugin. In December 2007 it was the most popular app in its category on Facebook.
Recently Mark approached me with a concept related to mining the content of public tweets to determine what was trending in real time. Using math similar to what I had developed during my graduate studies to model the performance of distributed databases in the presence of locking, we built up a real time analytics engine that ranks the content of tweets as they stream in. The math is designed to scale linearly in complexity with the volume of data that we analyze. That is the basis for what we have created for TrendSpottr.
In parallel to my professional career, I have been a very active volunteer at ACM. I started out as a member of the Paris ACM SIGGRAPH chapter in 1985 and volunteered to help do our mailings (snail mail at the time). After taking on more responsibilities with the chapter, I was elected chair of the chapter in 1991. I was first appointed to the SIGGRAPH Local Groups Steering Committee, then became ACM Director for Chapters. Later I was successively elected SIGGRAPH Vice Chair, ACM SIG Governing Board (SGB) Vice Chair for Operations, SGB Chair, ACM SIGGRAPH President, ACM Secretary/Treasurer, ACM Vice President, and finally, in 2010, I was elected ACM President. My term as ACM President has just ended on July 1st. Vint Cerf is our new President. I continue to serve on the ACM Executive Committee in my role as immediate Past President.
(Note- About ACM
ACM, the Association for Computing Machinery www.acm.org, is the world’s largest educational and scientific computing society, uniting computing educators, researchers and professionals to inspire dialogue, share resources and address the field’s challenges. )
Ajay- What sets TrendSpottr apart from other startups out there in terms of vision in trying to achieve a more coherent experience on the web?
 
Alain- The basic difference with other approaches that we are aware of is that we have developed an incremental solution that calculates the results on the fly as the data streams in. Our evaluators are based on solid mathematical foundations that have proven their usefulness over time. One way to describe what we do is to think of it as signal processing where the tweets are the signal and our evaluators are like triggers that tell you what elements of the signal have the characteristics that we are filtering for (velocity and acceleration). One key result of using this approach is that our unit cost per tweet analyzed does not go up with increased volume. Using more traditional data analysis approaches involving an implicit sort would imply a complexity of N*log(N), where N is the volume of tweets being analyzed. That would imply that the cost per tweet analyzed would go up with the volume of tweets, roughly in proportion to log(N). Our approach was designed to avoid that, so that we can maintain a cap on our unit costs of analysis, no matter what volume of data we analyze.
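(Editor's illustration: the incremental idea described above can be sketched in a few lines of Python. This is not TrendSpottr's actual engine, whose math is not described in this interview. The class name, the fixed 60-second window, and the trigger thresholds are all assumptions made for the example. Each incoming mention updates a few counters for its item in constant time, with no sort over the whole stream, so the work per tweet stays flat as volume grows.)

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ItemState:
    bucket: int = -1      # index of the time window currently being counted
    count: int = 0        # mentions seen so far in the current window
    prev_count: int = 0   # mentions seen in the window immediately before it


class StreamingTrendTracker:
    """Hypothetical sketch: constant-time update per mention, no global sort."""

    def __init__(self, window_seconds=60.0):
        self.window = window_seconds
        self.items = defaultdict(ItemState)

    def observe(self, key, timestamp):
        """Record one mention of `key`; timestamps are assumed non-decreasing.

        Returns (velocity, acceleration), both in mentions per window:
        velocity is the count in the current window so far, acceleration
        is that count minus the previous window's count.
        """
        state = self.items[key]
        bucket = int(timestamp // self.window)
        if bucket != state.bucket:
            # The window rolled over. The old count only becomes "previous"
            # if the two windows are adjacent; after a gap it resets to 0.
            state.prev_count = state.count if bucket == state.bucket + 1 else 0
            state.count = 0
            state.bucket = bucket
        state.count += 1
        return state.count, state.count - state.prev_count

    def is_trending(self, key, min_velocity=3, min_acceleration=1):
        """Trigger-style check with assumed, purely illustrative thresholds."""
        state = self.items[key]
        acceleration = state.count - state.prev_count
        return state.count >= min_velocity and acceleration >= min_acceleration


tracker = StreamingTrendTracker(window_seconds=60.0)
for t in (0, 10, 61, 62, 63):
    velocity, acceleration = tracker.observe("example.com/story", t)
print(velocity, acceleration, tracker.is_trending("example.com/story"))
```

(The trade-off in this sketch is memory rather than time: it keeps one small record per distinct item, so a production system would also need some way to expire items that have gone quiet.)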
Ajay- What do you think is the future of big data visualization going to look like? What are some of the technologies that you are currently bullish on?
Alain- I see several trends that will have a deep impact on Big Data visualization. I firmly believe that with large amounts of data, visualization is a key tool for understanding both the structure and the relationships that exist between data elements. Let’s focus on some of the key things that are pushing in this direction:
  • the volume of data that is available is growing at a rate we have never seen before. Cisco has measured an 8 fold increase in the volume of IP traffic over the last 5 years and predicts that we will reach the zettabyte of data over IP in 2016
  • more of the data is becoming publicly available. This isn’t only on social networks such as Facebook and Twitter, but joins a more general trend involving open research initiatives and open government programs
  • the desired time to get meaningful results is going down dramatically. In the past 5 years we have seen the half life of data on Facebook, defined as the amount of time in which half of the public reactions to any given post (likes, shares, comments) take place, go from about 12 hours to under 3 hours currently
  • our access to the net is always on via mobile devices. You are always connected.
  • the CPU and GPU capabilities of mobile devices are huge (an iPhone has 10 times the compute power of a Cray-1 and more graphics capabilities than early SGI workstations)
Put all of these observations together and you quickly come up with a massive opportunity to analyze data visually on the go as it happens no matter where you are. We can’t afford to have to wait for results. When something of interest occurs we need to be aware of it immediately.
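(Editor's illustration: the "half life" defined in the list above can be computed from a post's reaction timestamps. The sketch below is a minimal reading of that definition; the function name and the sample numbers are made up for the example and are not Facebook measurements.)

```python
def reaction_half_life(post_time, reaction_times):
    """Return the delay after post_time by which half of all reactions occurred.

    `reaction_times` holds the timestamps of every like, share and comment,
    in the same units as `post_time` (hours in the example below).
    """
    if not reaction_times:
        raise ValueError("need at least one reaction")
    delays = sorted(t - post_time for t in reaction_times)
    # Index of the reaction that brings the running total to at least half.
    half_index = (len(delays) + 1) // 2 - 1
    return delays[half_index]


# Hypothetical post at hour 0 with reactions at these hours.
print(reaction_half_life(0, [0.5, 1, 2.5, 6, 30]))  # prints 2.5
```

(Note that this measure is retrospective: it needs the full set of reactions, so it can only be computed once a post has stopped attracting activity.)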
Ajay- What are some of the applications we could use TrendSpottr for? Could we predict events like the Arab Spring, or even the next viral thing?
 
Alain- TrendSpottr won’t predict what will happen next. What it *will* do is alert you immediately when it happens. You can think of it like a smoke detector. It doesn’t tell you that a fire will take place, but it will save your life when a fire does break out.
Typical uses for TrendSpottr are
  • thought leadership: by tracking content that your readership is interested in via TrendSpottr, you can be seen as a thought leader by being one of the first to share trending content on a given subject. I personally do this on my Facebook page (http://www.facebook.com/alain.chesnais) and have seen my Klout score go up dramatically as a result
  • brand marketing: to be able to know when something is trending about your brand and take advantage of it as it happens.
  • competitive analysis: to see what is being said about two competing elements. For instance, searching TrendSpottr for “Obama OR Romney” gives you a very good understanding of how social networks are reacting to each politician. You can also do searches like “$aapl OR $msft OR $goog” to get a sense of the current buzz for certain hi tech stocks.
  • understanding your impact in real time: to be able to see which of the content that you are posting is trending the most on social media so that you can highlight it on your main page. So if all of your content is hosted on a common domain name (ourbrand.com), searching for ourbrand.com will show you the most active of your site’s content. That can easily be set up by putting a TrendSpottr widget on your front page

Ajay- What are some of the privacy guidelines that you keep in mind, given the fact that you collect individual information but also have government agencies as potential users?

 
Alain- We take privacy very seriously and anonymize all of the data that we collect. We don’t keep explicit records of the data we collected through the various incoming streams and only store the aggregate results of our analysis.
About-
Alain Chesnais is immediate Past President of ACM, elected for the two-year term beginning July 1, 2010. Chesnais studied at l’Ecole Normale Supérieure de l’Enseignement Technique and l’Université de Paris where he earned a Maîtrise de Mathematiques, a Maitrise de Structure Mathématique de l’Informatique, and a Diplôme d’Etudes Approfondies in Computer Science. He was a high school student at the United Nations International School in New York, where, along with preparing his International Baccalaureate with a focus on Math, Physics and Chemistry, he also studied Mandarin Chinese. Chesnais recently founded Visual Transitions, which specializes in helping companies move to HTML 5, the newest standard for structuring and presenting content on the World Wide Web. He was the CTO of SceneCaster.com from June 2007 until April 2010, and was Vice President of Product Development at Tucows Inc. from July 2005 – May 2007. He also served as director of engineering at Alias|Wavefront on the team that received an Oscar from the Academy of Motion Picture Arts and Sciences for developing the Maya 3D software package.

Prior to his election as ACM president, Chesnais was vice president from July 2008 – June 2010 as well as secretary/treasurer from July 2006 – June 2008. He also served as president of ACM SIGGRAPH from July 2002 – June 2005 and as SIG Governing Board Chair from July 2000 – June 2002.

A French citizen now residing in Canada, he has more than 20 years of management experience in the software industry. He joined the local SIGGRAPH Chapter in Paris some 20 years ago as a volunteer and has continued his involvement with ACM in a variety of leadership capacities since then.

About Trendspottr.com

TrendSpottr is a real-time viral search and predictive analytics service that identifies the most timely and trending information for any topic or keyword. Our core technology analyzes real-time data streams and spots emerging trends at their earliest acceleration point — hours or days before they have become “popular” and reached mainstream awareness.

TrendSpottr serves as a predictive early warning system for news and media organizations, brands, government agencies and Fortune 500 companies and helps them to identify emerging news, events and issues that have high viral potential and market impact. TrendSpottr has partnered with HootSuite, DataSift and other leading social and big data companies.
