WebFOCUS RStat: Pervasive BI using R

Here is a great reporting and BI tool from Information Builders that uses the Rattle R GUI (covered earlier here: http://www.decisionstats.com/2009/01/13/interview-dr-graham-williams/).

So if you are looking for a next-generation reporting solution, here is one called WebFOCUS RStat.

Citation:

http://www.informationbuilders.com/products/webfocus/predictivemodeling.html

Predict the Future and Make Effective Decisions Today

Traditional reporting solutions provide a clear picture of past occurrences, but have little power to shed light on the future. The ability to anticipate and prepare for upcoming events can greatly impact the decisions that need to be made today.

WebFOCUS RStat is the market’s first fully-integrated business intelligence and data mining environment, seamlessly bridging the gap between backward and forward-facing views of business operations. With WebFOCUS RStat, companies can easily and cost-effectively deploy predictive models as intuitive scoring applications. So business users at all levels can make decisions based on accurate, validated future predictions, instead of relying on gut instinct alone.

WebFOCUS RStat provides a single platform for BI, data modeling, and scoring. This eliminates the need to purchase and maintain multiple tools, and frees analysts and other statisticians from spending countless hours extracting and querying data. At the same time, it reduces costs, simplifies maintenance, and optimizes IT resources.

But, the greatest benefit WebFOCUS RStat offers is significantly increased accuracy. With the R engine – a powerful and flexible open source statistical programming language – as its underlying analysis tool, WebFOCUS RStat can deliver results that are consistent, complete, and correct – every time.

WebFOCUS RStat provides:

  • A single tool, fully integrated with Developer Studio and WebFOCUS Reporting Servers with access to over 300 data sources, for both BI developers and data miners
  • Comprehensive data exploration, descriptive statistics, and interactive graphs
  • In-depth data visualization and transformation
  • Hypothesis testing, clustering, and correlation analysis

Other key WebFOCUS RStat features include:

  • The ability to build and export models for prediction and classification
  • Comprehensive model evaluation
  • Rapid application creation through easy incorporation of scoring routines into WebFOCUS reports

Incidentally, the parent company, which is based in Tennessee, has some interesting numbers:

http://www.informationbuilders.com/about_us/index.html

Company At A Glance
  • $300 million in revenue
  • Over 30 years of experience
  • More than 1,400 employees
  • Over 12,000 customers
  • Over 350 business partners
  • 47 offices and 26 worldwide distributors

See Also-

http://www.informationbuilders.com/cgi-shell/press/intpr/f_intpr.pl?intpr_code=06_03_08_rstat

http://rattle.togaware.com/

Poem: The Fine Print

did you read the fine print
when you signed your life away
or did you believe them badly
when they said your life was good to give today

did all the drums, the ribbons and the music
tilt your head to emotion away from fact
and was the inherent absurdity of it all
swallowed by you intact

for as the world spins tilted
around the bright unforgiving sun
words in a language built to deceive
mask the coming pain below the frosting of fun

deception is the game here
and an unwilling player you have to be
fool them or be fooled in turn
reality is spotless for you to see

what old promises were tokens of love
it is all cash and carry now
as willed in your destiny from above

and even though eyes grow misty
by potential of what could be
you keep one eye on the rolling ball
lest more surprises it brings to see

Interview Gregory Piatetsky KDNuggets.com

Here is an interview with Gregory Piatetsky, founder and editor of KDnuggets (www.KDnuggets.com), one of the oldest and biggest independent industry websites for data mining and analytics.


Ajay- Please describe your career in science, and the many challenges and rewards that came with it. Name any scientific research, degrees, teaching etc.


Gregory-
I was born in Moscow, Russia and went to a top math high school in Moscow. A unique challenge for me was that my father was one of the leading mathematicians in the Soviet Union. While I liked math (and still do), I quickly realized while still in high school that I would never be as good as my father, and that a math career was not for me.

Fortunately, I discovered computers and really liked the process of programming and solving applied problems.  At that time (late 1970s) computers were not very popular and it was not clear that one can make a career in computers.  However I was very lucky that I was able to pursue what I liked and find demand for my skills.

I got my MS in 1979 and PhD in 1984 in Computer Science from New York University.
I was interested in AI (perhaps thanks to a lot of science fiction I read as a kid), but found a job in databases, so I was looking for ways to combine them.

In 1984 I joined GTE Labs where I worked on research in databases and AI, and in 1989 started the first project on Knowledge Discovery in data. To help convince my management that there would be a demand for this thing called “data mining” (GTE management did not see much future for it), I also organized an AAAI workshop on the topic.

I thought “data mining” was not a sexy enough name, and so I called it “Knowledge Discovery in Data”, or KDD. Since 1989, I have been working on KDD and data mining in all aspects – more on my page www.kdnuggets.com/gps.html

Ajay- How would you encourage a young science entrepreneur in this recession?

Gregory- Many great companies were started or grew in a recession, e.g.
http://www.insidecrm.com/features/businesses-started-slump-111108/

Recession may be compared to a brush fire which removes dead wood and allows new trees to grow.

Ajay- What prompted you to set up KD Nuggets? Any reasons for the name (kNowledge Discovery Nuggets). Describe some key milestones in this iconic website for data mining people.

Gregory- After the third KDD workshop in 1993 I started a newsletter to connect about 50 people who attended the workshop and possibly others who were interested in data mining and KDD. The idea was that it would have short items or “nuggets” of information. Also, at that time a popular metaphor for data miners was gold miners who were looking for gold “nuggets”. So, I wanted a newsletter with “nuggets” – short, valuable items about Knowledge Discovery. Thus, the name KDnuggets.

In 1994 I created a website on data mining at GTE and in 1997, after I left GTE, I moved it to the current domain name www.kdnuggets.com.

In 1999, I was working for a startup which provided data mining services to the financial industry. However, because of Y2K issues, all banks etc. froze their systems in the second half of 1999, and we had very little work (and our salaries were reduced as well). I decided that I would try to get some ads and was able to get companies like SPSS and Megaputer to advertise.

Since 2001, I have been an independent consultant and KDnuggets is only part of what I am doing. I also do data mining consulting, and actively participate in SIGKDD (Director 1998-2005, Chair 2005-2009).

Some people think that KDnuggets is a large company, with publisher, webmaster, editor, ad salesperson, billing dept, etc. KDnuggets indeed has all these functions, but it is all me and my two cats.

Ajay- I am impressed by the fact KD nuggets is almost a dictionary or encyclopedia for data mining. But apart from advertising you have not been totally commercial: many features of your newsletter remain ad free, you still maintain a minimalistic look and you do not take sponsorship aligned with one big vendor. What is your vision for KD Nuggets for the years to come to keep it truly independent?

Gregory- My vision for KDnuggets is to be a comprehensive resource for the data mining community, and I really enjoyed maintaining such a resource for the first 7-8 years completely non-commercially. However, when I became self-employed, I could not do KDnuggets without any income, so I selectively introduced ads, and only those which are relevant to data mining.

I like to think of KDnuggets as a Craigslist for the data mining community.

I certainly realize the importance of social media and Web 2.0 (and interested people can follow my tweets at twitter.com/kdnuggets) and plan to add more social features to KDnuggets.

Still, just like Wikipedia and Facebook do not make New York Times obsolete, I think there is room and need for an edited website, especially for such a nerdy and not very social group like data miners.

Ajay- What is the worst mistake/error in writing or publishing that you made? What is the biggest triumph or high moment in the Nuggets history?

Gregory- My biggest mistake is probably in choosing the name kdnuggets – in retrospect,  I could have used a shorter and easier to spell domain name, but in 1997 I never expected that I will still be publishing www.KDnuggets.com 12 years later.

Ajay- Who are your favourite data mining students (having known so many people)? What qualities do you think set a data mining person apart from other sciences?

Gregory- I was only an adjunct professor for a short time, so I did not really have data mining students, but I was privileged enough to know many current data mining leaders when they were students.  Among more recent students, I am very impressed with Jure Leskovec, who just finished his PhD and got the best KDD dissertation award.

Ajay- What does Gregory Piatetsky do for fun when he is not informing the world on analytics and knowledge discovery.

Gregory- I enjoy travelling with my family, and in the summer I like biking and windsurfing.
I also read a lot, and am currently in the middle of reading Proust (which I periodically dilute with other, lighter books).

Ajay- What is your favourite blog and website? Any plans to visit India?

Gregory- I visit many blogs listed on www.kdnuggets.com/websites/blogs.html

and I like especially
– Matthew Hurst blog: Data Mining: Text Mining, Visualization, and Social Media
– Occam’s Razor by Avinash Kaushik, examining web analytics.
– Juice Analytics, blogging about analytics and visualization
– Geeking with Greg, exploring the future of personalized information.

I also like your website decisionstats.com and plan to visit it more frequently.

I have visited many countries, but not yet India – waiting for the right occasion!

Biography

(http://www.kdnuggets.com/gps.html)

Gregory Piatetsky-Shapiro, Ph.D. is the President of KDnuggets, which provides research and consulting services in the areas of data mining, web mining, and business analytics. Gregory is considered to be one of the founders of the data mining and knowledge discovery field. Gregory edited or co-edited many collections on data mining and knowledge discovery, including two best-selling books: Knowledge Discovery in Databases (AAAI/MIT Press, 1991) and Advances in Knowledge Discovery in Databases (AAAI/MIT Press, 1996), and has over 60 publications in the areas of data mining, artificial intelligence and database research.

Gregory is the founder of Knowledge Discovery in Database (KDD) conference series. He organized and chaired the first three Knowledge Discovery in Databases (KDD) workshops in 1989, 1991, and 1993. He then served as the Chair of KDD Steering committee and guided the conversion of KDD workshops into leading international conferences on data mining. He also was the General Chair of the KDD-98 conference.

Interview Tasso Argyros CTO Aster Data Systems

Here is an interview with Tasso Argyros, the CTO and co-founder of Aster Data Systems (www.asterdata.com). Aster Data Systems offers one of the first DBMSs to tightly integrate SQL with MapReduce.


Ajay- Maths and Science students the world over are facing a major decline. What would you recommend to young students to get careers in science?

[TA]- My father is a professor of Mathematics and I spent a lot of my college time studying advanced math. What I would say to new students is that Math is not a way to get a job, it’s a way to learn how to think. As such, a Math education can lead to success in any discipline that requires intellectual abilities. As long as they take the time to specialize at some point – via postgraduate education or a job where they can learn a new discipline from smart people – they won’t regret the investment.

Ajay- Describe your career in Science, particularly your time at Stanford. What made you think of starting up Asterdata? How important is it for a team rather than an individual to begin startups? Could you describe the startup moment when your team came together?

[TA]- While at Stanford I became very familiar with the world of startups through my advisor, David Cheriton (who was an angel investor in VMWare, Google and founder of two successful companies). My research was about processing large amounts of data on large, low-cost computer farms. A year into my research it became obvious that this approach had huge processing power advantages and it was superior to anything else I could see in the marketplace. I then happened to meet my other two co-founders, Mayank Bawa & George Candea, who were looking at a similar technical problem from the database and reliability perspective, respectively.

I distinctly remember George walking into my office one day (I barely knew him back then) and saying “I want to talk to you about startups and the future” – the rest has become history.

Ajay- How would you describe your product Aster nCluster Cloud Edition to somebody who does not know anything beyond the traditional server/data warehouse technologies? Could you rate it against some known vendors and give a price point, specific to the level of usage at which the Total Cost of Ownership in Asterdata becomes cheaper than, say, an Oracle, SAP or Microsoft data warehousing solution?

[TA]- Aster allows businesses to reduce the data analytics TCO in two interesting ways. First, it has a much lower hardware cost than any traditional DW technology because of its use of commodity servers or cloud infrastructure like Amazon EC2. Secondly, Aster has implemented a lot of innovations that simplify the (previously tedious and expensive) management of the system, which includes scaling the system elastically up/down as needed – so they are not paying for capacity they don’t need at a given point in time.

But cutting costs is one side of the equation; what makes me even more excited is the ability to make a business more profitable, competitive and efficient through analyzing more data at greater depth. We have customers that have cut their costs and increased their customers and revenue by using Aster to analyze their valuable (and usually underutilized) data. If you have data – and you think you’re not taking full advantage of it – Aster can help.

Ajay- I always have this one favourite question. When can I analyze 100 gigabytes of data using just a browser and some statistical software like R or the advanced forecasting software that is available? Describe some of Asterdata’s work in enhancing the analytical capabilities of big data.

Can I run R (free, open source) on an on-demand basis with an Asterdata solution? How much would it cost me to crunch 100 GB of data and make segmentations and models with, say, 50 hours of processing time per month?

[TA]- One of the big innovations that Aster offers is to allow analytical applications like R to be embedded in the database via our SQL/MapReduce framework. We actually have customers right now that are using R to do advanced analytics over terabytes of data. 100GB is actually on the lower end of what our software can enable and as such the cost would not be significant.
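
For readers new to the idea, here is a rough sketch of the general map/reduce pattern that frameworks like SQL/MapReduce build on. It is not Aster's API: the function names and the sample data are hypothetical, and everything runs in one Python process, whereas a real cluster would run the map step on each node next to its slice of the data.

```python
from collections import defaultdict

def map_partition(rows):
    """Map step: summarise one partition as {segment: (count, total)}."""
    partial = defaultdict(lambda: [0, 0.0])
    for segment, amount in rows:
        partial[segment][0] += 1
        partial[segment][1] += amount
    return {seg: tuple(stats) for seg, stats in partial.items()}

def reduce_partials(partials):
    """Reduce step: merge per-partition summaries into per-segment means."""
    merged = defaultdict(lambda: [0, 0.0])
    for partial in partials:
        for segment, (count, total) in partial.items():
            merged[segment][0] += count
            merged[segment][1] += total
    return {seg: total / count for seg, (count, total) in merged.items()}

# Hypothetical data: (customer segment, purchase amount), split over two partitions
partitions = [
    [("retail", 10.0), ("web", 30.0)],
    [("retail", 20.0), ("web", 50.0), ("web", 40.0)],
]
segment_means = reduce_partials(map_partition(p) for p in partitions)
print(segment_means)
```

Only the small per-partition summaries travel between steps, which is why this style of analysis can scale to data sizes well beyond what a single machine holds.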

Ajay- What do people at Asterdata do when not making complex software.

[TA]- A lot of Asterites love to travel around the world – we are, after all, a very diverse company. We also love coffee, Indian food as well as international and US sports like soccer, cricket, cycling, and football!

Ajay- Name some competing products to Asterdata and where Asterdata products are more suitable from a TCO viewpoint. Name specific areas where you would not recommend your own products.

[TA]- We go against products like Oracle Database, Teradata and IBM DB2. If you need to do analytics over 100s of GBs or terabytes of data, our price/performance ratio would be orders of magnitude better.

Ajay- How do you convince well-known and experienced VCs like Sequoia Capital to invest in a start-up? (e.g. I could do with some financing for server costs)

[TA]- You need to convince Sequoia of three things. (a) that the market you’re going after is very large (in the billions of dollars, if you’re successful). (b) that your team is the best set of people that could ever come together to solve the particular problem you’re trying to solve. And (c) that the technology you’ve developed gives you an “unfair advantage” over incumbents or new market entrants. Most importantly, you have to smile a lot! :)

Biography

About Tasso:

Tasso (Tassos) Argyros is the CTO and co-founder of Aster Data Systems, where he is responsible for all product and engineering operations of the company. Tasso was recently recognized as one of BusinessWeek’s Best Young Tech Entrepreneurs for 2009 and was an SAP fellow at the Stanford Computer Science department. Prior to Aster, Tasso was pursuing a Ph.D. in the Stanford Distributed Systems Group with a focus on designing cluster architectures for fast, parallel data processing using large farms of commodity servers. He holds an MSc in Computer Science from Stanford University and a Diploma in Computer and Electrical Engineering from Technical University of Athens.

About Aster:

Aster Data Systems is a proven leader in high-performance database systems for data warehousing and analytics – the first DBMS to tightly integrate SQL with MapReduce – providing deep insights on data analyzed on clusters of low-cost commodity hardware.

The Aster nCluster database cost-effectively powers frontline analytic applications for companies such as MySpace, aCerno (an Akamai company), and ShareThis. Running on low-cost off-the-shelf hardware, and providing ‘hands-free’ administration, Aster enables enterprises to meet their data warehousing needs within their budget.

Aster is headquartered in San Carlos, California and is backed by Sequoia Capital, JAFCO Ventures, IVP, Cambrian Ventures, and First-Round Capital, as well as industry visionaries including David Cheriton, Rajeev Motwani and Ron Conway.


Interview Steve Sarsfield Author The Data Governance Imperative

Here is an interview with Steve Sarsfield, data quality evangelist and author of The Data Governance Imperative.


Ajay- Describe your early career to the present point. At what point did you decide to specialize or focus on data quality and data governance? What were the causes for it?


Steve- When I was growing up, not many normal people had aspirations of becoming data management professionals. Back in those days, we had aspirations to be NFL wide receivers, writers, engineers, and lawyers. Data management careers tend to find you.

My career path has wandered through technical support, technical writing and managing editor roles, consulting, and product management for Lotus Development. I’ve been working for the past nine years at a major data quality vendor – the longest job I’ve had to date. The good news is that this latest gig has given me a chance to meet with a LOT of people who have been implementing data quality and data governance projects.

When you get involved with the projects, you’ll begin to realize the power it has. You begin to love data governance for the efficiencies it brings, and for the impact it will have on your organization as it becomes more competitive.


Ajay- Some people think data quality is a boring job and data governance is an abstract philosophy. How would you interest a young high school/college student, with the right aptitude, in taking up a business intelligence career and staying focused on it?


Steve- In my opinion, if you promote a geeky view of data governance the message will tend to fall flat. If there’s one thing I have written most about, it is about bridging the gap between technology and business. Those who succeed in this field now and in the future will be people who are a bit of a jack-of-all-trades.

You need to be a good technologist, critical thinker, marketer, and strategist, and you need to use those skills every day to succeed. Leadership skills are also important, especially if you are trying to bootstrap a data governance program at your corporation. Those job attributes are not boring, they are challenging and exciting.

In terms of being persuasive about getting involved in a data career, it’s clear that data is not likely to decrease in volume in the coming years, quite the contrary, so your job will have a reasonable amount of security.  Nor will there be less of a need in the future for developing accurate business metrics from the data.

In my book, I talk about the fact that the decision of a corporation to move toward data governance is really a choice between optimism and fear. Your company must decide either to be haunted by a never-ending vision that there will only be more data, more mergers and more complexity in the years to come, or to take charge for a more hopeful future that will bring more opportunity, more efficiency and a more agile working environment. When you choose data governance as a career, you choose to provide that optimism for your employer.


Ajay- What are the salient points in your book The Data Governance Imperative? Do you think data governance is an idea whose time has come?


Steve- The book is about the increasing importance of data to a business. As your company collects more and more data about customers, products, suppliers, transactions and billing, it becomes more difficult to accurately maintain that information without a centralized approach and a team devoted to the data management mission.

The book comes from discussions with folks in the business who are trying to get a data governance program started in their corporation.  They are the data champions who “get it”, but are yet to convince their management that data is crucial to the success of the company.

The fact is, there are metrics you can follow, processes that you can put in place, conversations that you can have, and technology that you can implement in order to make your managers and co-workers see the importance of data governance.  We know this because it has worked for so many companies who are far more advanced in managing their data than most.

The most evolved companies will have support from executive management and the entire company to define reusable processes for data governance, and a center of excellence is formed around it. Much of the book is about garnering support and setting up the processes to prove enterprise data’s importance. Only when you do that will your company evolve its data governance strategy.


Ajay- Garbage Data In and Garbage Data Analysis Out. What percentage of a BI installation budget goes to input data quality at the data entry center? What is the kind of budget you would like it to be?


Steve- I’m sure this varies depending upon many factors, including the number of sources, age and quality of the source data, etc. Anecdotally, the percentage of budget five years ago was near zero. You really only saw realization of the problem LATE in the project, after the first data warehouse loading occurred. What has happened over the years is that we’ve gotten a lot smarter about this, perhaps as a result of our past failures. In the past, if the data worked well in the source systems it was assumed that it would work in the target.

A lot of those projects failed because the team incorrectly scoped the project with regard to the data integration. Today we have the wisdom and experience to know that the old assumption is not true. In order to really assess our needs for data quality, we know we need to profile the data as one of the first tasks in the process. This will help us create a more accurate timeline and budget and assure management that we know what we’re doing with regard to data integration and business intelligence.
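
To make “profile the data” concrete, here is a minimal sketch of what a first profiling pass might count for each column: rows, missing values, and distinct values. The `profile` helper and the sample customer records are hypothetical, and commercial profiling tools go much further (patterns, ranges, cross-field rules).

```python
def profile(records):
    """Return per-column counts of rows, missing values, and distinct values."""
    columns = sorted({key for row in records for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in records]
        # Treat None and empty strings as missing
        present = [v for v in values if v not in (None, "")]
        report[col] = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return report

# Hypothetical source extract
customers = [
    {"id": 1, "state": "MA", "phone": "617-555-0101"},
    {"id": 2, "state": "ma", "phone": ""},
    {"id": 3, "state": "MA", "phone": None},
]
report = profile(customers)
for column, stats in report.items():
    print(column, stats)
```

Even this small pass surfaces two things worth budgeting for: the phone column is mostly empty, and the state column holds two spellings of what is probably one value.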


Ajay- Do you think Federal Governments can focus stimulus spending smarter with better input data quality?


Steve- Believe it or not, I’m encouraged by the US Government’s plan on data quality. To varying degrees, Presidents Clinton, Bush and Obama have all supported plans for greater transparency and openness. To accomplish that, you have to govern data. In Washington, many government agencies now have a Chief Information Officer. The government is recruiting leading universities like MIT to work toward better data governance in government. The sheer number of databases even within a single US government agency will be a huge challenge, but the direction is good.

This year’s MIT Information Quality Symposium, for example, had a very solid government track with speakers from the Army, Air Force, Department of Defense, EPA, HUD, and National Institutes of Health, to name just a few.

Outside the US, it gets even cloudier. There are governments ahead of the US, like the UK and Germany, and those who still need to catch up.


Ajay- Name some actual anecdotes in which 1) bad data quality led to disaster 2) good data quality gave great insights


Steve- There are certainly plenty of typical examples, but I always like the unusual ones, like:

A major motorcycle manufacturer used data quality tools to pull out nicknames from their customer records. Many of the names they had acquired for their prospect list were from motorcycle events and contests where the entries were, shall we say, colorful. The name fields contained data like “John the Mad Dog Smith” or “Frank Motor-head Jones”. The client used the tool to separate the name from the nickname, making it a more valuable marketing list.
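
As a toy illustration of the name/nickname split in the motorcycle example above, the sketch below assumes every entry looks like “First [the] Nickname Last”. That assumption is deliberately naive (a genuine middle name would be mistaken for a nickname), and the `split_nickname` helper is hypothetical; real data quality tools rely on name dictionaries and much richer rules.

```python
def split_nickname(raw_name):
    """Split 'First [the] Nickname Last' into (first, last, nickname or None)."""
    tokens = raw_name.split()
    if len(tokens) < 2:
        # Nothing to split: return the input as the first name
        return raw_name.strip(), "", None
    first, last = tokens[0], tokens[-1]
    middle = tokens[1:-1]
    # Drop a leading article, as in "John the Mad Dog Smith"
    if middle and middle[0].lower() == "the":
        middle = middle[1:]
    nickname = " ".join(middle) or None
    return first, last, nickname

for entry in ["John the Mad Dog Smith", "Frank Motor-head Jones", "Jane Doe"]:
    print(split_nickname(entry))
```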

One major utility company used data quality tools to identify and record notations on meter-reader records that were important to keep for operational uses, but not in the customer billing record. Upon analysis of the data, the company noticed random text like “LDIY” and “MOR” along with the customer records. After some work with the business users, they figured out that LDIY meant “Large Dog in Yard”, which was particularly important for meter readers. MOR meant “Meter on Right”, which was also valuable. The readers were given their own notes field, so that they could maintain the integrity of the name and address while also keeping this valuable data. It probably saved a lot of meter readers from dog bite situations.

Financial organizations have used data quality tools to separate items like “John and Judy Smith/221453789 ITF George Smith”. The organization wanted to consider this type of record as three separate records “John Smith” and “Judy Smith” and “George Smith” with obvious linkage between the individuals. This type of data is actually quite common on mainframe migrations.

A food manufacturer standardizes and cleanses ingredient names to get better control of manufacturing costs. In data from their worldwide manufacturing plants, an ingredient might be “carrots” “chopped frozen carrots” “frozen carrots, chopped” “chopped carrots, frozen” and so on. (Not to mention all the possible abbreviations for the words carrots, chopped and frozen.) Without standardization of these ingredients, there was really no way to tell how many carrots the company purchased worldwide.

There was no bargaining leverage with the carrot supplier, and all the other ingredient suppliers, until the data was fixed.

In terms of disasters, I’d recommend the IAIDQ’s web site, IQ Trainwrecks: http://www.iqtrainwrecks.com/ The IAIDQ does a great job and I contribute when I can.
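
The food manufacturer example lends itself to a small sketch. One simple way (an assumption for illustration, not necessarily what the manufacturer's tool did) to make “chopped frozen carrots” and “frozen carrots, chopped” match is to lowercase the name, drop punctuation, and sort the words, then total purchases by that key. The purchase figures below are made up, and abbreviations are not handled.

```python
import re
from collections import defaultdict

def canonical_key(name):
    """Lowercase, drop punctuation, and sort words so word order no longer matters."""
    words = re.findall(r"[a-z]+", name.lower())
    return " ".join(sorted(words))

# Hypothetical purchase lines: (ingredient name as entered, kilograms)
purchases = [
    ("chopped frozen carrots", 100),
    ("frozen carrots, chopped", 250),
    ("Chopped Carrots, Frozen", 50),
    ("carrots", 75),
]
totals = defaultdict(int)
for name, kg in purchases:
    totals[canonical_key(name)] += kg
print(dict(totals))
```

The three spellings of chopped frozen carrots collapse into one line, while plain “carrots” stays separate. Rolling both up into a single carrot total would need an extra mapping from each variant to a base ingredient.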


Ajay- What are the essential 5 things a CEO should ask his CTO to ensure good data quality in an enterprise.


Steve- What a great question. I can think of more than five, but let’s start with:


1) What is poor quality data costing us?
This should inspire your CTO to go out and seek problem areas in partnership with the business and ways to improve processes.

2) Do I have to make decisions on gut-feel, or should I trust the business intelligence you give our employees?  What confidence level do you have in our BI?

The CEO should be confident in the metrics delivered with BI and he should make sure the CTO has the same concerns.

3) Are we in compliance with all laws regarding our governance of data?

CEOs are often culpable for non-compliance, so he/she should be concerned about any laws that govern the company’s industry. Even in unregulated industries, organizations must comply with spam laws and “do not mail” laws for marketing.

4) Are you working across business units to work towards data governance, or is data quality done in silos?

When possible, data quality should be a process that is reusable and able to be implemented in a similar manner across business units.

5) Do you have the access to data you need?

The CEO should understand if any office politics are getting in the way of ensuring data quality and this question opens the door to that discussion.

Ajay- What does Steve Sarsfield do when not writing blogs and books.


Steve- These days, when I’m not thinking about data or my blog, I’m thinking about my fantasy football team and the upcoming season. I’ve got a ticket to the New England Patriots opening game vs the Buffalo Bills and I’m looking forward to it. On the weekends, you may find me playing a game of Mafia Wars on Facebook or cooking up a big pot of chili for the family.


Biography-


Steve Sarsfield is a data governance business expert, speaker, author of The Data Governance Imperative (at http://www.itgovernance.co.uk/products/2446) and blogger at http://data-governance.blogspot.com/. He is a product marketing professional at a major data quality vendor. He was a guest speaker at the MIT Information Quality Symposium (July 2007 and July 2008), at the International Association for Information and Data Quality (IAIDQ) Symposium (December 2006) and at the SAP CRM 2006 summit.

Social Network Analysis: Using R

Here is a great video and slides on doing statistical network analysis using R. It is by Drew Conway from NYU.

Social Network Analysis in R from Drew Conway on Vimeo.