Hive Tutorial: Cloud Computing

Here is a nice Hive tutorial video from Cloudera. I wonder what would happen if they put a real analytical system like R, SPSS, JMP or SAS, and not just basic analytics and reporting, on a big database system like Hadoop (including some text-mined data from legacy company documents).

Unlike Oracle or other database systems, Hadoop is free now and should remain so for the foreseeable future (as MySQL used to be before it was acquired by big fish Sun, which was in turn acquired by the bigger Oracle).

Citation-

http://wiki.apache.org/hadoop/Hive

Hive is a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, ad hoc querying and analysis of large datasets stored in Hadoop files.

Hive is based on Hadoop which is a batch processing system. Accordingly, this system does not and cannot promise low latencies on queries. The paradigm here is strictly of submitting jobs and being notified when the jobs are completed as opposed to real time queries. As a result it should not be compared with systems like Oracle where analysis is done on a significantly smaller amount of data but the analysis proceeds much more iteratively with the response times between iterations being less than a few minutes. For Hive queries response times for even the smallest jobs can be of the order of 5-10 minutes and for larger jobs this may even run into hours.

If your input data is small you can execute a query in a short time. For example, if a table has 100 rows you can ‘set mapred.reduce.tasks=1’ and ‘set mapred.map.tasks=1’ and the query time will be ~15 seconds.
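
As a minimal sketch of the small-input case above, the two `set` statements can be placed ahead of a query in one HiveQL script. The property names come from the quoted text. The helper `build_hive_script` and the table `tiny_table` are hypothetical and only illustrate the idea.

```python
# Settings quoted above for very small inputs (e.g. a 100-row table).
SMALL_INPUT_SETTINGS = {
    "mapred.reduce.tasks": 1,
    "mapred.map.tasks": 1,
}


def build_hive_script(query, settings=SMALL_INPUT_SETTINGS):
    """Return a HiveQL script: one `set key=value;` line per setting, then the query."""
    lines = [f"set {key}={value};" for key, value in settings.items()]
    # Normalise the query so it ends with exactly one semicolon.
    lines.append(query.rstrip().rstrip(";") + ";")
    return "\n".join(lines)


# `tiny_table` is a hypothetical table name used only for illustration.
script = build_hive_script("SELECT COUNT(1) FROM tiny_table")
print(script)
```

The resulting text could be saved to a file and submitted through the Hive command line (for example with `hive -f`). The ~15 second figure is the wiki's own estimate and will vary by cluster.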


Reactions to IBM-SPSS takeover

The business intelligence, business analytics and data mining industry (or, as James Taylor would say, the Decision Management industry) has had some reactions to the IBM-SPSS deal (which was NOT a surprise to many, including me). Really.

From SAS Institute, Anne Milley

http://blogs.sas.com/sascom/index.php?/archives/557-Analytics-is-still-our-middle-name.html

Besides SAS, SPSS was one of the last independent analytic software companies. A colleague says, “It’s the end of the analytics cold war.”

I’ve been saying all along that analytics is required for success. Yes, data integration, data quality, and query & reporting are important too but, as W. Edwards Deming says, “The object of taking data is to provide a basis for action.”

The end of the analytics cold war, hmm. We all know what the end of the real cold war brought us: Google, cloud computing, and other non-technical issues.

From KXEN, Roger Haddad

“The price paid for SPSS of four times revenues and 25 times earnings shows just how valuable this sector really is,” says Haddad. “But the deal has also created a tremendous opportunity for the sector’s remaining independent vendors that

KXEN is well placed to capitalize on. “There is no For Sale sign hanging in our window,” continues Haddad. “We launched KXEN in 1998 to democratize the benefits of data mining and predictive analytics, making them practical and affordable across the whole enterprise and not just the exclusive preserve of a few specialists. It’s going to take up to two years for the dust to settle following the IBM

“Former SPSS partners, systems integrators and distributors will face uncertainty.”

I think the PE multiple was still low. SPSS was worth more if you count the client base, the active community and the brand itself in the valuation. There are tremendous cross-sell opportunities, and IBM, with its nice research and development, is a good supporter of pure science. Yes, the next two years will see increasing consolidation and more “surprising” news. At 4 times revenues, anyone can be bought in the present market if it is a publicly listed company. 😉
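
As a rough check on the multiples quoted by Haddad above (four times revenues and 25 times earnings), the two figures together imply a net margin. This is a minimal sketch and assumes both multiples refer to the same purchase price and the same period. The numbers are the rounded ones from the quote, not audited financials.

```python
from fractions import Fraction

# Rounded multiples from the quote: price = 4 x revenue, price = 25 x earnings.
PRICE_TO_REVENUE = Fraction(4)
PRICE_TO_EARNINGS = Fraction(25)


def implied_net_margin(price_to_revenue, price_to_earnings):
    """earnings / revenue = (price / revenue) / (price / earnings)."""
    return price_to_revenue / price_to_earnings


margin = implied_net_margin(PRICE_TO_REVENUE, PRICE_TO_EARNINGS)
print(f"Implied net margin: {float(margin):.0%}")  # prints "Implied net margin: 16%"
```

In other words, the quoted multiples are consistent with earnings of roughly 16% of revenue.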

From the rather subdued voices on the SPSS list, some subjective and non-quantitative “strategic” forecasts.

http://www.listserv.uga.edu/cgi-bin/wa?A2=ind0907&L=spssx-l&F=&S=&P=36324

I think the Ancient Chinese said it best “May you live in interesting times”.

Having worked with some flavors of Cognos and SPSS, I think there could be areas for technical integration for querying and GUI-based forecasting as well, apart from financial mergers and administrative readjustments. I mean, people pull data not just to report it, but to estimate what comes next as well.

This could also spell the end of single-platform analysts. You now need to learn at least two different platforms, like SAS, SPSS or KXEN, R or Cognos, Business Objects, to hedge your chances of getting offshored (Note: I worked in offshoring for almost 4 years in India in data analytics).

As for what IBM will do with SPSS and its open source commitment to R, and the consequences for employees, customers, vendors and partners, who have more choices now than ever

…. well it depends. Who is John Galt?

Interview Karen Lopez Data Modeling Expert

Here is an interview with Karen Lopez who has worked in data modeling for almost three decades and is a renowned data management expert in her field.

Data professionals need to know about the data domain in addition to the data structure domain – Karen Lopez

Ajay- Describe your career in science. How would you persuade younger students to take more science courses?

Karen- I’ve always had an interest in science and I attribute that to the great science teachers I had. I studied information systems at Purdue University through a unique program that focuses on systems analysis and computer technologies. I’m one of the few who studied data and process modeling in an undergraduate program 25+ years ago.

I believe that it is very important that we find a way of attracting more scientists to teach. In both the natural and computer sciences, it’s difficult for institutions to tempt scientists away from professional positions that offer much greater compensation. So I support programs that find ways to make that happen.

Ajay- If you had to give a young person starting their career in BI advice in just three points, what would they be?

Karen- Wow. It’s tough to think of just three things, but these are recommendations that I make often:

– Remember that every design decision should be made based on cost, benefit, and risk. If you can’t clearly describe these for every side of a decision, then you aren’t doing design; you are guessing.

– No one besides you is responsible for advancing your skills and keeping an eye on emerging practices. Don’t expect your employer to lay out a career plan that is in your best interest. That’s not their job. Data professionals need to know about the data domain in addition to the data structure domain. The best database or data warehouse design in the world is worse than useless if how the data is processed is wrong. Remember to expand your knowledge about data, not just the data structures and tools.

– All real-world work involves collaboration and negotiation. There is no one right answer that works for every situation. Building your skills in these areas will pay off significantly.

Ajay- What do you think is the best way for a technical consultant and client to be on the same page regarding requirements? Which methodology or template have you used, and which has given you the most success?

Karen- While I’m a huge fan of modeling (data modeling and other modeling), I still think that giving clients a prototype or mockup of something that looks real to them goes a long way. We need to build tools and competencies to develop these prototypes quickly. It’s a lost art in the data world.

Ajay- What are the special incentives that make Canada a great place for tech entrepreneurs, rather than, say, going to the United States? (Disclaimer: I have family in Canada and study in the US.)

Karen- I prefer not to think of this as an either-or decision. I immigrated to Canada from the US about 15 years ago, but most of our business is outside of Canada. I have enjoyed special incentives here in Canada for small businesses as well as special programs that allowed me to work in Canada as a technical professional before I moved here permanently.

Overall, I have found Canadian employers more open to sponsoring foreign workers and it is easier for them to do so than what my US clients experience. Having said that, a significant portion of my work over the last few years has been on global projects where we leverage online collaboration tools to meet our goals. The advent of these tools has made it much easier to work from wherever I am and to work with others regardless of their visa statuses.

Where a company forms is less tied to where one lives or works these days.

Ajay- Could you tell us more about the Zachman framework (apart from the wikipedia reference)? A practical example on how you used it on an actual project would be great.

Karen- Of course the best resource for finding out about the Zachman framework is from John Zachman himself http://www.zachmaninternational.com/index.php/home-article/13 . He offers some excellent courses and does a great deal of public speaking at government and DAMA events. I highly recommend anyone interested in the Framework to hear about it directly from him.

There are many misunderstandings about John’s intent, such as the myth that he requires big upfront modeling (he doesn’t), that the Framework is a methodology (it isn’t), or that it can only be used to build computer systems (it can be used for more than that).

I have used the Zachman Framework to develop a joint Business-IT Strategic Information Systems Plan as well as to inventory and track progress of multi-project programs. One interesting use was a paper I authored for the Canadian Information Processing Society (CIPS) on how various educational programs, specializations, and certifications map to the Zachman Framework. I later developed a presentation about this mapping for a Zachman conference.

For a specific project, the Zachman Framework allows business to understand where their enterprise assets are being managed – and how well they are managed. It’s not an IT thing; it’s an enterprise architecture thing.

Ajay- What does Karen Lopez do for fun when not at work, traveling, speaking or blogging?

Karen- Sometimes it seems that’s all I do. I enjoy volunteering for IT-related organizations such as DAMA and CIPS. I participate in the accreditation of college and university educational programs in Canada and abroad. As a member of data-related standards bodies, namely the Association for Retail Technology Standards and the American Dental Association, I help develop industry standard data models. I’ve also been a spokesperson for a CIPS program to encourage girls to take more math and science courses throughout their student careers so that they may have access to great opportunities in the future.

I like to think of myself as a runner; last year I completed my first half marathon, which I’d never thought was possible. I am studying Hindi and Sanskrit. I’m also addicted to reading and am thankful that some of it I actually get paid to do.

Biography

Karen López is a Senior Project Manager at InfoAdvisors, Inc. Karen is a frequent speaker at DAMA conferences and DAMA Chapters. She has 20+ years of experience in project and data management on large, multi-project programs. Karen specializes in the practical application of data management principles. Karen is also the ListMistress and moderator of the InfoAdvisors Discussion Groups at http://www.infoadvisors.com. You can reach her at www.twitter.com/datachick

Interview John Sall Founder JMP/SAS Institute

Here is an interview with John Sall, inventor of SAS and JMP and co-founder and co-owner of SAS Institute, the largest independent business intelligence and analytics software firm. In a freewheeling and exclusive interview, John talks of the long journey within SAS and his experiences in helping make JMP the data visualization software of choice.

JMP is perfect for anyone who wants to do exploratory data analysis and modeling in a visual and interactive way – John Sall

Ajay- Describe your early science career. How would you encourage today’s generation to take up science and math careers?

John- I was a history major in college, but I graduated into a weak job market. So I went to graduate school and discovered statistics and computer science to be very captivating. Of course, I grew up in the moon-race science generation and was always a science enthusiast.

Ajay- Archimedes leapt out the bath shouting “Eureka” when he discovered his principle. Could you describe a “Eureka” moment while creating the SAS language when you and Jim Goodnight were working on it?

John- I think that the moments of discovery were more like “Oh, we were idiots” as we kept having to rewrite much of the product to handle emerging environments, like CMS, minicomputers, bitmap workstations, personal computers, Windows, client-server, and now the cloud. Several of the rewrites were even changing the language we implemented it in. But making the commitment to evolve led to an amazing sequence of growth that is still going on after 35 years.

Ajay- Describe the origins of JMP. What specific market segments does the latest release of JMP target?

John- JMP emerged from a recognition of two things: size and GUI. SAS’ enterprise footprint was too big a commitment for some potential users, and we needed a product to really take advantage of graphical interactivity. It was a little later that JMP started being dedicated more to the needs of engineering and science users, who are most of our current customers.

Ajay- What other non-SAS Institute software do you admire or have you worked with? Which areas is JMP best suited for? For which areas would you recommend software other than JMP to customers?

John- My favorite software was the Metrowerks CodeWarrior development environment. Sadly, it was abandoned among various Macintosh transitions, and now we are stuck with the open-source GCC and Xcode. It’s free, but it’s not as good.

JMP is perfect for anyone who wants to do exploratory data analysis and modeling in a visual and interactive way. This is something organizations of all kinds want to do. For analytics beyond what JMP can do, I recommend SAS, which has unparalleled breadth, depth and power in its analytic methods.

Ajay- I have yet to hear of a big academic push for JMP distribution in Asia. Are there any plans to distribute JMP for free or at very discounted prices in academic institutions in countries like India, China or even the rest of the USA?

John- We are increasing our investment in supporting academic institutions, but it has not been an area of strength for us. Professors seem to want the package they learned long ago, the language that is free or the spreadsheet program their business students already have. JMP’s customers do tell us that they wish the universities would train their prospective future employees in JMP, but the universities haven’t been hearing them. Fortunately, JMP is easy enough to pick up after you enter the work world. JMP does substantially discount prices for academic users.

Ajay- What are your views on tech offshoring, given the recession in the United States?

John- As you know, our products are mostly made in the USA, but we do have growing R&D operations in Pune and Beijing that have been performing very well. Even when the software is authored in the US, considerable work happens in each country to localize, customize and support our local users, and this will only increase as we become more service-oriented. In this recession, JMP has still been growing steadily.

Ajay-  What advice would you give to young graduates in this recession? How does learning JMP enhance their prospect of getting a job?

John- Quantitative fields have been fairly resistant to the recession. North Carolina State University, near the SAS campus, even has a Master of Science in Analytics < http://analytics.ncsu.edu/ > to get people job-ready. JMP experience certainly helps get jobs at our major customers.

Ajay- What does John Sall do in his free time, when not creating world-class companies or groovy statistical discovery software?

John- I lead the JMP division, which has been a fairly small part of a large software company (SAS), but JMP is becoming bigger than the whole company was when JMP was started. In my spare time, I go to meetings and travel with the Nature Conservancy <http://www.nature.org/>, North Carolina State University <http://ncsu.edu/>, WWF <http://wwf.org/>, CARE <http://www.care.org/> and several other nonprofit organizations that my wife or I work with.

Official Biography

John Sall is a co-founder and Executive Vice President of SAS, the world’s largest privately held software company. He also leads the JMP business division, which creates interactive and highly visual data analysis software for the desktop.

Sall joined Jim Goodnight and two others in 1976 to establish SAS. He designed, developed and documented many of the earliest analytical procedures for Base SAS® software and was the initial author of SAS/ETS® software and SAS/IML®. He also led the R&D effort that produced SAS/OR®, SAS/QC® and Version 6 of Base SAS.

Sall was elected a Fellow of the American Statistical Association in 1998 and has held several positions in the association’s Statistical Computing section. He serves on the board of The Nature Conservancy, reflecting his strong interest in international conservation and environmental issues. He also is a member of the North Carolina State University (NCSU) Board of Trustees. In 1997, Sall and his wife, Ginger, contributed to the founding of Cary Academy, an independent college preparatory day school for students in grades 6 through 12.

Sall received a bachelor’s degree in history from Beloit College in Beloit, WI, and a master’s degree in economics from Northern Illinois University in DeKalb, IL. He studied graduate-level statistics at NCSU, which awarded him an honorary doctorate in 2003.

About JMP-

Originally nicknamed as John’s Macintosh Program, JMP is a leading software program in data visualization for statistical software. Researchers and engineers – whose jobs didn’t revolve solely around statistical analysis – needed an easy-to-use and affordable stats program. A new software product, today known as JMP®, was launched in 1989 to dynamically link statistical analysis with the graphical capabilities of Macintosh computers. Now running on all platforms, JMP continues to play an important role in modeling processes across industries as a desktop data visualization tool. It also provides a visual interface to SAS in an expanding line of solutions that includes SAS Visual BI and SAS Visual Data Discovery. Sall remains the lead architect for JMP.

Citation- http://www.sas.com/presscenter/bios/jsall.html

Ajay- I am thankful to John and his marketing communication specialist Arati for this interview. With an increasing focus on data to drive more rational decision making, SAS remains an interesting company to watch in the era of mega-vendors, and any SAS Institute deal or alliance will make potential investment bankers as well as newer customers drool. For previous interviews and coverage of SAS, please see www.decisionstats.com/tag/sas

SPSS bought by Big Blue

SPSS Inc, maker of the PASW series of analytics software, is being bought by IBM (unless Oracle spikes this deal too). IBM is seeking a play in the rapidly growing analytics market and is also a strategic partner to the makers of WPS (a SAS-language alternative to Base SAS).

On a personal note, I just entered the University of Tennessee as a statistics student.

Interesting community event by R/Statistical community

Citation-
http://en.oreilly.com/oscon2009/public/schedule/detail/10432

StackOverflow Flash Mob for the R User Community
Moderated by: Michael E. Driscoll
7:00pm Wednesday, 07/22/2009
Location: Ballroom A2

In concert with users online across the country, this session will lead a flashmob to populate StackOverflow with R language content.

R, the open source statistical language, has a notoriously steep learning curve. The same technical questions tend to be asked repeatedly on the R-help mailing lists, to the detriment of both R experts (who tire of repeating themselves) and the learners (who often receive a technically correct, but terse response).

We have developed a list of the most common 100 technical R questions, based on (i) an analysis of queries sent to the RSeek.org web portal, (ii) an examination of the R-help list archives, and (iii) a survey of members of R Users Groups in San Francisco, LA, and New York City.

In the first hour, participants will pair up to claim a question, formulate it on StackOverflow, and provide a comprehensive answer. In the second hour, participants will rate, review, and comment on the set of submitted questions and answers.

While Stackoverflow currently lacks content for the R language, we believe this effort will provide the spark to attract more R users, and emerge as a valuable resource to the growing R community.

This is an interesting example of a statistical software community using Twitter for a tech help event. I hope this trend gets replicated again and again.

Statisticians worldwide unite in the language of maths !!!

Please follow @rstatsmob to participate. See you at 7 PM PST!

twitter.com/Rstatsmob

Growing Rapidly: Rapid Miner 4.5

The Europe-based Rapid Miner came out with version 4.5 of their data mining tool (formerly known as YALE), with a very promising “Script” tool.

Also, Rapid Miner came first among open source data mining tools in a poll by the industry benchmark www.kdnuggets.com

They have a brilliant video here for people who just want to have a look at the new Rapid Miner

http://rapid-i.com/videos/rapidminer_tour_3_4_en.html

Citation-

http://rapid-i.com/content/view/147/1/

New Operators:

  • FormulaExtractor
  • Trend
  • LagSeries
  • VectorLinearRegression
  • ExampleSetMinus
  • ExampleSetIntersect
  • Partition
  • Script
  • ForwardSelection
  • NeuralNetImproved
  • KernelNaiveBayes
  • ExhaustiveSubgroupDiscovery
  • URLExampleSource
  • NonDominatedSorting

More Features:

  • The new Script operator allows for arbitrary user defined operations based on Groovy script combined with a simplified RapidMiner syntax
  • Improved the join operator and added options for left and right outer joins
  • New notification mail mechanism at the end of processes
  • Most file based data input operators now provide an option to skip error lines
  • Most file based example source operators as well as the IOObjectReader and the new URLExampleSource now accept URLs instead of a filename for the input source location