Advanced Analytics on Multi-Terabyte Datasets- Conferences

Some news on Data Mining 2009 by Aster Data –

SAS and Aster Data to Present “Advanced Analytics on Multi-Terabyte Datasets” at M2009 in Las Vegas – Oct. 26-27
Learn how the tight coupling of SQL and MapReduce provided by Aster Data creates new ‘big data’ analytics opportunities when combined with SAS. Aster Data will exhibit throughout the event.

And also a nice  webcast by Curt Monash on the same Big Data topic-

Mastering MapReduce Webinar Series, Session 1
“Big Data Reality: The Role of MapReduce in Big Data Management and Analysis”- Oct. 15
Industry analyst Curt Monash explains the basics of MapReduce, key use cases, and which industries and applications are heavily using MapReduce. Topics include recommendations for integrating MapReduce in an enterprise business intelligence and data warehousing environment.

Also,

Here is a brief synopsis of the Aster Data ( http://www.facebook.com/pages/Aster-Data-Systems/5601042375 ) sponsored Big Data Summit ( http://www.facebook.com/pages/Big-Data-Summit/143312171156 ), which I attended:

  • A Plan for Large Scale Data Analytics: How to Utilize Aster nCluster and Hadoop in a Symbiotic
    Relationship to Support Processing in Excess of 100 Billion Rows Per Month
    – Michael Brown and Will Duckworth
    (EVP, Software Engineering, comScore, Inc. and Director, Software Engineering, comScore, Inc.)

This talk covered the special needs of comScore in handling big data, and why MapReduce and Hadoop seem to be the cost-effective solutions for big, big data while the RDBMS seems stuck in the middle of middle data. It was also broadly informative on the statistical challenges of the future, given the explosion of data.

  • Making Sense of Hadoop – Its Fit With Data Warehouses – Colin White
    (President and Founder of BI Research)

Colin brought a nice perspective on open source Hadoop vis-a-vis the proprietary packages and the traditional DBMS. His view is that no software is perfect for all needs, that every product that sells has its own good points, and that the converging solution could be a heterogeneous mix of the above.

  • MapReduce Inside a Database System – When and How Case Studies from ShareThis, Specific Media, and Other – Tasso Argyros (Chief Technology Officer and Co-Founder of Aster Data)

This was a more detailed look at the big product launch (the Hadoop Connector) by Tasso, and an interesting look at time series analysis using nPath rather than SQL. Interesting, given the ongoing convergence of analytics and business intelligence.

Also Tasso lived up to his presenting charm with an excellent pitch on nPath (as his interview said ).

  • Large-Scale Analytics at LinkedIn – Jonathan Goldman
    (Former Principal Scientist at LinkedIn)

This was a nice talk given Jonathan’s perspective (he has a PhD in physics and now consults for LinkedIn while maintaining his interest in education). He covered the special needs of social media websites, designing experiments on the fly with huge real-time datasets, and some interesting visualizations (for example, India and the USA have the second-largest number of cross-country LinkedIn connections, after USA-UK). Apparently LinkedIn ( http://www.facebook.com/group.php?gid=2211231478 ) does not sound so good when translated into Chinese. (At dinner I learnt from a fellow student from China that China censors Facebook. Sigh!)

  • Networking Mixer: Beer, wine, hot hors d’oeuvres

I got interviewed AFTER I had mixed some beer and wine for myself. It was the first video interview I have given. (You know, I have taken SOME interviews by email and plan to do some more while in Vegas for Data Mining 2009 with SAS: http://www.facebook.com/group.php?gid=2227381262 )

They are still editing that interview 😉

—That was all – you need to send me a Facebook invite to see the rest of the NY trip or better still just join the Facebook page of Decision Stats at

http://www.facebook.com/pages/DecisionStats/191421035186

After two weeks I hope to have some more coverage on Data Mining 2009 while at the same time enjoying my much needed Fall Break. Life at the University of Tennessee is looking up ( since we beat Georgia 45-19 🙂 )


Interview Shawn Kung Sr Director Aster Data

Here is an interview with Shawn Kung, Senior Director of Product Management at Aster Data. Shawn explains the differences between the various database technologies and Aster’s rising appeal due to its unique technological approach, and touches upon other topics of interest to people in the BI and technology space.


Ajay- Describe your career journey from a high school student of science till today. Do you think science is a more lucrative career?

Shawn: My career journey has spanned over a decade in several Silicon Valley technology companies.  In both high school and my college studies at Princeton, I had a fervent interest in math and quantitative economics.  Silicon Valley drew me to companies like upstart procurement software maker Ariba and database giant Oracle.  I continued my studies by returning to get a Master’s in Management Science at Stanford before going on to lead core storage systems for nearly 5 years at NetApp and subsequently Aster.

Science (whether it is math, physics, economics, or the hard engineering sciences) provides a solid foundation.  It teaches you to think and test your assumptions – those are valuable skills that can lead to both a financially lucrative and personally inspiring career.

Ajay- How would you describe the difference between Map Reduce and Hadoop and Oracle and SAS, DBMS and Teradata and Aster Data products to a class of undergraduate engineers ?

Shawn: Let’s start with the database guys – Oracle and Teradata.  They focus on structured data – data that has a logical schema and is manipulated via a standards-based structured query language (SQL).  Oracle tries to be everything to everyone – it does OLTP (low-latency transactions like credit card or stock trade execution apps) and some data warehousing (typically summary reporting).  Oracle’s data warehouse is not known for large-scale data warehousing and is more often used for back-office reporting.

Teradata is focused on data warehousing and scales very well, but is extremely expensive – it runs on high-end custom hardware and takes a mainframe approach to data processing.  This approach makes less sense as commodity hardware becomes more compute-rich and better software comes along to support large-scale MPP data warehousing.

SAS is very different – it’s not a relational database. It really offers an application platform for data analysis, specifically data mining.  Unlike Oracle and Teradata, which are used by SQL developers and managed by DBAs, SAS is typically run in business units by data analysts – for example a quantitative marketing analyst, a statistician/mathematician, or a savvy engineer with a data mining/math background.  SAS is used to try to find patterns, understand behaviors, and offer predictive analytics that enable businesses to identify trends and make smarter decisions than their competitors.

Hadoop offers an open-source framework for large-scale data processing.  MapReduce is a component of Hadoop, which also contains multiple other modules including a distributed filesystem (HDFS).  MapReduce offers a programming paradigm for distributed computing (a parallel data flow processing framework).
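To make the MapReduce paradigm described above a little more concrete, here is a minimal single-machine sketch in Python. It is only an illustration of the map, shuffle, and reduce steps using a word count; it is not Hadoop's or Aster's API, and the function names (map_phase, shuffle, reduce_phase, run_mapreduce) are hypothetical names chosen for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit (key, value) pairs for one input record: here, (word, 1)."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group all values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine the values collected for one key: here, sum the counts."""
    return key, sum(values)


def run_mapreduce(records):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    logs = ["big data big analytics", "data warehouse"]
    print(run_mapreduce(logs))
```

In a real cluster each map call and each reduce call can run on a different node, because map works on one record at a time and reduce works on one key at a time. That independence is what lets the framework parallelize the work.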

Both Hadoop and MapReduce cater to the application developer or programmer.  They do not cater to enterprise data centers or IT.  If you have a finite project in a line of business and want to get it done, Hadoop offers a low-cost way to do this.  For example, if you want to do large-scale data munging like aggregations, transformations, manipulations of unstructured data – Hadoop offers a solution for this without compromising on the performance of your main data warehouse.  Once the data munging is finished, the post-processed data set can be loaded into a database for interactive analysis or analytics. It is a great combination of big data technologies for certain use-cases.
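The "munge first, then load into a database" workflow described above can be sketched at toy scale in Python. This is a sketch under stated assumptions: an in-memory SQLite database stands in for the data warehouse, a plain Python loop stands in for the Hadoop job, and the log format, the table name page_views, and the function names munge and load are all hypothetical.

```python
import sqlite3
from collections import Counter


def munge(raw_lines):
    """Aggregate raw 'user<TAB>url' log lines into page-view counts per user."""
    counts = Counter()
    for line in raw_lines:
        parts = line.strip().split("\t")
        if len(parts) != 2:  # skip malformed lines instead of failing
            continue
        user, _url = parts
        counts[user] += 1
    return sorted(counts.items())


def load(conn, rows):
    """Load the post-processed rows into a table for interactive SQL analysis."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page_views "
        "(user_id TEXT PRIMARY KEY, views INTEGER)"
    )
    conn.executemany("INSERT INTO page_views VALUES (?, ?)", rows)
    conn.commit()


if __name__ == "__main__":
    raw = ["alice\t/home", "bob\t/home", "alice\t/pricing", "garbage line"]
    conn = sqlite3.connect(":memory:")
    load(conn, munge(raw))
    query = "SELECT user_id, views FROM page_views ORDER BY views DESC"
    print(conn.execute(query).fetchall())
```

The heavy, one-pass cleaning happens outside the database, so the database only ever sees the small, structured result and stays free for interactive queries.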

Aster takes a very unique approach.  Our Aster nCluster software offers the best of all worlds – we offer the potential for deep analytics of SAS, the low-cost scalability and parallel processing of Hadoop/MapReduce, and the structured data advantages (schema, SQL, ACID compliance and transactional integrity, indexes, etc) of a relational database like Teradata and Oracle.  Often, we find complementary approaches and therefore view SAS and Hadoop/MapReduce as synergistic to a complete solution.  Data warehouses like Teradata and Oracle tend to be more competitive.

Ajay- What exciting products have you launched so far and what makes them unique both from a technical developer perspective and a business owner perspective

Shawn: Aster was the first-to-market to offer In-Database MapReduce, which provides the standards and familiarity of SQL and databases with the analytic power of MapReduce.  This is very unique as it allows technical developers and application programmers to write embedded procedural algorithms once, upload them, and allow business analysts or IT folks (SQL developers, DBAs, etc) to invoke these SQL-MapReduce functions forever.

It is highly polymorphic (re-usable), highly fault-tolerant, highly flexible (any language – Java, Python, Ruby, Perl, R statistical language, C# in the .NET world, etc) and natively massively parallel – all of which differentiate these SQL extensions from traditional dumb user-defined functions (UDFs).

Ajay- “I am happy with my databases and I don’t need too much diversity or experimentation in my systems”, says a CEO to you.

How do you convince him using quantitative numbers and not marketing adjectives?

Shawn: Aster has dozens of production customers including big-names like MySpace, LinkedIn, Akamai, Full Tilt Poker, comScore, and several yet-to-be-named retail and financial service accounts.  We have quantified proof points that show orders of magnitude improvements in scalability, performance, and analytic insights compared to incumbent or competitor solutions.  Our highly referenceable customers would be happy to discuss their positive experiences with the CEO.

But taking a step back, there’s a fundamental concept that this CEO needs to first understand.  The world is changing – data growth is proliferating due to the digitization of so many applications and the emergence of unstructured data and new data types.  Like the book “Competing on Analytics”, the world is shifting to a paradigm where companies that don’t take risks and push the limits on analytics will die like the dinosaurs.

IDC is projecting 10x+ growth in data over the next few years to zettabytes of aggregate data driven by digitization (Internet, digital television, RFID, etc).  The data is there and in order to compete effectively and understand your customers more intimately, you need a large-scale analytics solution like the one Aster nCluster offers.  If you hold off on experimentation and innovation, it will be too late by the time you realize you have a problem at hand.

Ajay- How important is work life balance for you?

Shawn: Very important.  I hang out with my wife most weekends – we do a lot of outdoors activities like hiking and gardening.  In Silicon Valley, it’s all too easy to get caught up in the rush of things.  Taking breaks, especially during the weekend, is important to recharge and re-energize to be as productive as possible.

Ajay- Are you looking for college interns and new hires? What makes Aster exciting for you, so that you are pumped up every day to go to work?

Shawn: We’re always looking for smart, innovative, and entrepreneurial new college grads and interns, especially on the technical side.  So if you are a computer science major or recent grad or graduate student, feel free to contact us for opportunities.

What makes Aster exciting is 2 things –

First, the people.  Everyone is very smart and innovative so you learn a tremendous amount, which is personally gratifying and professionally useful long-term.

Second, Aster is changing the world!

Distributed systems computing focused on big data processing and analytics – these are massive game-changers that will fundamentally change the landscape in data warehousing and analytics.  Traditional databases have been an oligopoly for over a generation – they haven’t been challenged and so the 1970’s based technology has stuck around.  The emergence of big data and low-cost commodity hardware has created a unique opportunity to carve out a brand new market…

What gets me pumped every day is I have the ability to contribute to a pioneer that is quickly becoming Silicon Valley’s next great success story!

Biography-

Over the past decade, Shawn has led product management for some of Silicon Valley’s most successful and innovative technology companies.  Most recently, he spent nearly 5 years at Network Appliance leading Core Systems storage product management, where he oversaw the development of high availability software and Storage Systems hardware products that grew in annual revenue from $200M to nearly $800M.  Prior to NetApp, Shawn held senior product management and corporate strategy roles at Oracle Corporation and Ariba Inc.

Shawn holds an M.S. in Management Science and Engineering from Stanford University, where he was awarded the Valentine Fellowship (endowed by Don Valentine of Sequoia Capital).  He also received a B.A. with high honors from Princeton University.

About Aster

Aster Data Systems is a proven leader in high-performance database systems for data warehousing and analytics – the first DBMS to tightly integrate SQL with MapReduce – providing deep insights on data analyzed on clusters of low-cost commodity hardware. The Aster nCluster database cost-effectively powers frontline analytic applications for companies such as MySpace, aCerno (an Akamai company), and ShareThis.

Running on low-cost off-the-shelf hardware, and providing ‘hands-free’ administration, Aster enables enterprises to meet their data warehousing needs within their budget. Aster is headquartered in San Carlos, California and is backed by Sequoia Capital, JAFCO Ventures, IVP, Cambrian Ventures, and First-Round Capital, as well as industry visionaries including David Cheriton and Ron Conway.

R and SAS- Together again at PAWS

Two of my favorite speakers (though maybe not favorites of each other) speak at PAWS:

Anne Milley from SAS and David Smith from REvolution Computing. Also speaking is a great author and writer, Stephen Baker of The Numerati (that mathematical equivalent of The Godfather). More events at the link below.

Hmmmm- I hope they attend each other’s sessions just to keep up, but is that asking too much?

Citation-http://www.predictiveanalyticsworld.com/dc/2009/agenda.php#day1-22

7:30pm-10:00pm
useR Meeting
Room: Magnolia
– Sponsored by  Please join the group at www.meetup.com/R-users-DC/

R is an open source programming language for statistical computing, data analysis, and graphical visualization. R has an estimated one million users worldwide, and its user base is growing. While most commonly used within academia, in fields such as computational biology and applied statistics, it is gaining currency in commercial areas such as quantitative finance and business intelligence.

Among R’s strengths as a language are its powerful built-in tools for inferential statistics, its compact modeling syntax, its data visualization capabilities, and its ease of connectivity with persistent data stores (from databases to flatfiles).

In addition, R is open source in nature and extensible via add-on “packages”, allowing it to keep up with the leading edge in academic research.

For all its strengths, though, R has an admittedly steep learning curve; the first steps towards learning and using R can be challenging.

This DC R Users Group is dedicated to bringing together area practitioners of R to exchange knowledge, inspire new users, and spur the adoption of R for innovative research and commercial applications.


Wednesday October 21, 2009

8:00am-9:00am
Registration & Continental Breakfast


9:00am-9:50am
Keynote
Room: Magnolia
Opportunities and Pitfalls:
What the World Does and Doesn’t Want from Predictive Analytics

Mathematicians and statisticians are churning through mountains of data in their efforts to model and predict human behavior. The goal is to optimize every function possible, from sales and marketing to the enterprise itself. These Numerati are guided by the two dominant models of the late 20th century, the modeling of financial markets and of industrial systems. How do humans fit into these systems? And what will their response be when the analytic systems appear to misunderstand them or invade their privacy?

Stephen Baker joins PAW to directly address the Numerati. In his keynote presentation, Mr. Baker will guide us toward the untapped goldmines where predictive analytics will be embraced and thrive, and teach us to anticipate and maneuver around two central pitfalls: Consumer misperception of us, and our inadvertent mistreatment of them.

Moderator: Eric Siegel, Program Chair, Predictive Analytics World

Speaker: Stephen Baker, BusinessWeek – author, The Numerati


9:50am-10:10am
Platinum Sponsor Presentation
Room: Magnolia
Strength in Numbers: ACE!

As more organizations are beginning their analytical journey or reinvigorating their existing efforts, Analytic Centers of Excellence (ACEs) are helping them along the way. The interest in ACEs is growing across industries as organizations seek better ways to tap into their analytic infrastructure – most importantly, scarce high-end analytic expertise – to improve results. We will highlight valuable best practices for achieving greater analytic bandwidth, realizing more and better evidence-based decisions.

Moderator: Eric Siegel, Program Chair, Predictive Analytics World

Speaker: Anne Milley, Senior Director of Tech. Product Marketing, SAS

Interview Jeff Bass, Bass Institute (Part 2)

During the 1980s and early 1990s, the Bass Institute attracted a loyal following with its SAS language compiler, before ultimately bowing to the financial and technological pressures of the move to the desktop. In 2009, as the SAS language gains a new compiler in the form of WPS, and computing paradigms begin to shift from the desktop to cloud computing, Jeff Bass, founder of the Bass Institute and genius tech coder, brings a perspective rich in experience.

If we don’t learn from history, we are condemned to repeat it.

Ajay- Describe your career in science. How would you motivate children in class rooms today to be as excited about science as the moon generation was?

J Bass- My graduate training was in economics and statistics.  I have used that training in ways that I would never have anticipated when I was in graduate school 30 years ago.  But it is still exciting for me.  I started out building microeconomic models, then went on to write statistical language compilers and build health policy macroeconomic models.  These days I develop and articulate health policy to help increase patients’ access to cutting edge medicines.  The company I work for now is very science based and even applies scientific thinking, measurement and testing of alternatives in the business side of its operations.

I spend volunteer time as a guest teacher at local middle schools, high schools and community colleges.  I often talk about math and statistics and have found that one way to help motivate students is to give them “fun” example problems.  I often use an example of the 1969 lunar orbital calculations to motivate basic trigonometry and quite a number of students who say they don’t like math end up loving solving parts of that problem.  I think our school curriculums need to come up with problems and examples that the students find interesting.  I’m not sure our existing curricula processes make this an easy thing to do.  All too often we teach techniques without combining that teaching with strong motivating examples that make learning fun.

Ajay-  What are the changes in paradigms that you have seen across the decades? What are the key insights and summaries that you can provide.

J Bass- Our increasing understanding of biology and DNA is a major paradigm shift that is combining molecular biology and protein chemistry with computer science.  Identifying the human DNA sequence was only the beginning.  Imagine that you were handed the bit sequence of a CD-ROM and were told to figure out what parts of it were a text document, what parts were a JPEG photograph and what parts were an MP3 music file – if you did NOT know the coding schemes of such files.  That’s analogous to where we are today with DNA sequences…we know the ATCG sequence, but we are only scratching the surface of understanding the things that the DNA sequence codes for – proteins, cell metabolism, differentiating cell reproduction.

Jump to JMP: Using Data Analysis in a visual manner

Over the past month or so, I have really begun to appreciate the GUI of JMP. It is very clean and intuitively designed, and excellent for a SAS environment.

Best of all you can easily download a 30 day trial and pricing for this software is quite reasonable.

The worst part of JMP: the droll website. In fact, on the subject of websites, I can deduce something of an Ohri’s Law on Websites.

The better the software, the worse off is the website.

Corollary- The worse off the software, the better is the website in terms of glitz.

JMP is definitely worth a trial for 30 days if you

a) Want to learn a new stats software skill fast

b) Are unhappy with the visual data analysis in your current software.

Integrating JMP’s functionality with a BI reporting tool makes for a formidable data decision-making tool, and it works nicely for me in the data analysis I do.

http://www.sas.com/apps/demosdownloads/jmptrial8_PROD__sysdep.jsp?packageID=000503&jmpflag=Y


SAS Program for Students

Here is a good program for students who have learned or are learning SAS software products to take part in a big, fully paid conference, the SAS Global Forum, next year. The last date for submitting papers is October 26, so there is plenty of time if you know someone who would benefit from going there.

Citation-
http://support.sas.com/learn/ap/student/amb.html

Select students will be named SAS Student Ambassadors and earn the opportunity to present their research at the 2010 SAS Global Forum in Seattle, Washington on April 11-14.

What are benefits of being a SAS Student Ambassador?
Have your travel expenses and registration fees paid to attend SAS Global Forum 2010.
Be featured and prominently recognized before an international audience by presenting at SAS Global Forum 2010.
Interact with a global audience of SAS users from every industry and sector.
Gain the unique title of SAS Student Ambassador – a true resume or CV differentiator!

Am I eligible for the SAS Student Ambassador Program?
Graduate and undergraduate students are eligible to apply to this program. Recent graduates also may be eligible to apply. The submitted project must have been conducted by the student within 12 months of the submission deadline.