Short Interview: Jill Dyché

Here is a brief one-question interview with Jill Dyché, co-founder of Baseline Consulting.

In 2010, it was more about consciousness-raising in the executive suite: getting C-level managers to understand the ongoing value proposition of BI, why MDM isn't their father's database, and how data governance can pay for itself over time. Some companies succeeded with these consciousness-raising efforts. Some didn't.

But three big ones in 2011 would be:

  1. Predictive analytics in the cloud. The technology is now ready, and so is the market—and that includes SMB companies.
  2. Enterprise search being baked into (commoditized) BI software tools. (The proliferation of static reports is SO 2006!)
  3. Data governance will begin paying dividends. Until now it was all about common policies for data. In 2011, it will be about ROI.

I do a "Predictions for the coming year" article every January for TDWI.

Note- Jill's January TDWI article seems worth waiting for in this case.

About-

Source-http://www.baseline-consulting.com/pages/page.asp?page_id=49125

Partner and Co-Founder

Jill Dyché is a partner and co-founder of Baseline Consulting.  She is responsible for key client strategies and market analysis in the areas of data governance, business intelligence, master data management, and customer relationship management. 

Jill counsels boards of directors on the strategic importance of their information investments.

Author

Jill is the author of three books on the business value of IT. Jill’s first book, e-Data (Addison Wesley, 2000) has been published in eight languages. She is a contributor to Impossible Data Warehouse Situations: Solutions from the Experts (Addison Wesley, 2002), and her book, The CRM Handbook (Addison Wesley, 2002), is the bestseller on the topic. 

Jill’s work has been featured in major publications such as Computerworld, Information Week, CIO Magazine, the Wall Street Journal, the Chicago Tribune and Newsweek.com. Jill’s latest book, Customer Data Integration (John Wiley and Sons, 2006) was co-authored with Baseline partner Evan Levy, and shows the business breakthroughs achieved with integrated customer data.

Industry Expert

Jill is a featured speaker at industry conferences, university programs, and vendor events. She serves as a judge for several IT best practice awards. She is a member of the Society of Information Management and Women in Technology, a faculty member of TDWI, and serves as a co-chair for the MDM Insight conference. Jill is a columnist for DM Review, and a blogger for BeyeNETWORK and Baseline Consulting.


Statistical Analysis with R- by John M Quick

I was asked to be a technical reviewer for John M. Quick's new R book "Statistical Analysis with R" from Packt Publishing some months ago (very much to my surprise, I confess).

I agreed, and technical reviewer work does take time; it's like being a midwife, with a whole team trying to get the book to birth.

Statistical Analysis with R is a Beginner's Guide, so it has nice screenshots, simple case studies, and quizzes to check the reader's recall. I remember struggling with the official "beginner's guide to R", so this one is different: it presents the story of a Chinese army and how to use R to plan resources to fight a battle. It is recommended especially for undergraduate courses; R need not be an elitist language, and given my experience with Asian programming acumen, I am sure it is a matter of time before high schools in India teach basic R in the final years (I learnt quite a shitload of quantum physics as compulsory topics in an Indian high school, but I guess we didn't have Jersey Shore things to do).

Congrats to author Mr. John M. Quick, who is doing his PhD in education at ASU. I am sure both he and his approach to making education simple, informative, and fun will go places.

Only bad thing: the name "Statistical Analysis with R" is shared by at least three other books, but I guess Google will catch up to it.

This book is here-https://www.packtpub.com/statistical-analysis-with-r-beginners-guide/book

Bringing Poetry to Life

Here is a new poetry book.

———————————————————————————————–

I'm excited to let you know about Carol Calkins, who is releasing her first book of poetry, entitled Bring Poetry to Life. This book is a powerful compilation of poetry touching on the most important moments in our everyday lives, from new beginnings, to special people and events, to endings and saying goodbye. Carol, who found her life purpose through poetry, is excited to release the first of a series of poetry books on Amazon. Grab your copy of Bring Poetry to Life today on Amazon.com. Find out more about Carol and her new book at http://www.bringpoetrytolife.com

We Said Goodbye a Thousand Times

 

Don’t be sad about my parting

Don’t feel like you never said goodbye

For you and I both know deep in our hearts

That We Said Goodbye a Thousand Times

And shared so much love and joy every day

 

Be happy that I am now at peace

Be joyful that I have lived a wonderful life

Be happy that we have shared so much together

 

And remember I am always with you in a thought and a sigh

Every day when you see the beauty in nature think of me

Every day when you see the colorful flowers think of me

Every day when you see a frisky animal prancing around think of me

Every day when you look into the eyes of someone you love think of me

 

And know beyond a doubt that I am with you in everything you do

And know beyond a doubt that I am with you in everything you say

And know beyond a doubt that I am with you in every quiet moment of your life

 

Don’t be sad about my parting

Don’t feel like you never said goodbye

For you and I both know deep in our hearts

That We Said Goodbye a Thousand Times

And shared so much love and joy every day


Dataists shake up R community with a rocking contest

Image by Johan Larsson via Flickr

The newly created Dataists blog is making waves on Hacker News and beyond with its innovative contest: a recommendation engine for R packages.

Not only is the contest useful, it is also likely to teach R users some data hacking skills, as well as the basics of creating a GitHub project.

Read more here-http://www.dataists.com/2010/10/using-data-tools-to-find-data-tools-the-yo-dawg-of-data-hacking/

For that reason, we’ve settled on the more manageable question, “which packages are most often installed by normal R users?”

This last question could potentially be answered in a variety of ways. Our current approach uses a convenience sample of installation data that we’ve collected from volunteers in the R community, who kindly agreed to send us a list of the packages they have on their systems. We’ve anonymized this data and compiled a set of metadata-based predictors that allow us to predict the installation probabilities quite well. We’re releasing all of our current work, including the data we have and all of the code we’ve used so far for our exploratory analyses. The contest itself will go live on Kaggle on Sunday and will end four months from Sunday on February 10, 2011. The rules, prizes and official data sets are all described below.

Rules and Prizes

To win the contest, you need to predict the probability that a user U has a package P installed on their system for every pair, (U, P). We’ll assess your performance using ROC methods, which will be evaluated against a held out test data set. The winning team will receive 3 UseR! books of their choosing. In order to win the contest, you’ll have to provide your analysis code to us by creating a fork of our GitHub repository. You’ll also be required to provide a written description of your approach. We’re asking for so much openness from the winning team because we want this contest to serve as a stepping stone for the R community. We’re also hoping that enterprising data hackers will extend the lessons learned through this contest to other programming languages.

Extract from-http://www.dataists.com/2010/10/using-data-tools-to-find-data-tools-the-yo-dawg-of-data-hacking/

Read the full article there.
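For flavor, here is a minimal sketch in R of the kind of model the contest invites: predict installation with a logistic regression on package metadata, scored by AUC in the spirit of the ROC assessment described above. The data frame and predictors below are made-up placeholders, not the contest's official data.

```r
# Hypothetical stand-in for the contest data: one row per (user, package)
# pair, a 0/1 outcome, and a couple of metadata-based predictors.
set.seed(42)
installs <- data.frame(
  Installed       = rbinom(1000, 1, 0.3),  # placeholder outcome
  DependencyCount = rpois(1000, 2),        # placeholder predictor
  CorePackage     = rbinom(1000, 1, 0.1)   # placeholder predictor
)

# Logistic regression gives an installation probability for each pair
fit  <- glm(Installed ~ DependencyCount + CorePackage,
            data = installs, family = binomial)
pred <- predict(fit, type = "response")

# AUC via the rank-sum identity, so no extra packages are needed
auc <- function(labels, scores) {
  r  <- rank(scores)
  n1 <- sum(labels == 1)
  n0 <- sum(labels == 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
auc(installs$Installed, pred)
```

In a real entry you would of course train on the released installation data and predict on the held-out test pairs rather than on simulated noise.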

Interview: Michael J. A. Berry, Data Miners, Inc.

Here is an interview with noted data mining practitioner Michael Berry, author of seminal books in data mining, and a noted trainer and consultant.

Ajay- Your famous book "Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management" came out in 2004, and an update is planned for 2011. What new data mining techniques and applications do you intend to cover in that book?

Michael- Each time we do a revision, it feels like writing a whole new book. The first edition came out in 1997 and it is hard to believe how much the world has changed since then. I'm currently spending most of my time in the on-line retailing world. The things I worry about today (improving recommendations for cross-sell and up-sell, and search engine optimization) wouldn't have even made sense to me back then. And the data sizes that are routine today were beyond the capacity of the most powerful supercomputers of the nineties. But, if possible, Gordon and I have changed even more than the data mining landscape. What has changed us is experience. We learned an awful lot between the first and second editions, and I think we've learned even more between the second and third.

One consequence is that we now have to discipline ourselves to avoid making the book too heavy to lift. For the first edition, we could write everything we knew (and arguably, a bit more!); now we have to remind ourselves that our intended audience is still the same–intelligent laymen with a practical interest in getting more information out of data. Not statisticians. Not computer scientists. Not academic researchers. Although we welcome all readers, we are primarily writing for someone who works in a marketing department and has a title with the word “analyst” or “analytics” in it. We have relaxed our “no equations” rule slightly for cases when the equations really do make things easier to explain, but the core explanations are still in words and pictures.

The third edition completes a transition that was already happening in the second edition. We have fully embraced standard statistical modeling techniques as full-fledged components of the data miner’s toolkit. In the first edition, it seemed important to make a distinction between old, dull, statistics, and new, cool, data mining. By the second edition, we realized that didn’t really make sense, but remnants of that attitude persisted. The third edition rectifies this. There is a chapter on statistical modeling techniques that explains linear and logistic regression, naive Bayes models, and more. There is also a brand new chapter on text mining, a curious omission from previous editions.

There is also a lot more material on data preparation. Three whole chapters are devoted to various aspects of data preparation. The first focuses on creating customer signatures. The second is focused on using derived variables to bring information to the surface, and the third deals with data reduction techniques such as principal components. Since this is where we spend the greatest part of our time in our work, it seemed important to spend more time on these subjects in the book as well.

Some of the chapters have been beefed up a bit. The neural network chapter now includes radial basis functions in addition to multi-layer perceptrons. The clustering chapter has been split into two chapters to accommodate new material on soft clustering, self-organizing maps, and more. The survival analysis chapter is much improved and includes material on some of our recent application of survival analysis methods to forecasting. The genetic algorithms chapter now includes a discussion of swarm intelligence.

Ajay- Describe your early career and how you came into data mining as a profession. What do you think of the various universities now offering an MS in Analytics? How do you balance your own teaching experience with your consulting projects at Data Miners?

Michael- I fell into data mining quite by accident. I guess I always had a latent interest in the topic. As a high school and college student, I was a fan of Martin Gardner's mathematical games in Scientific American. One of my favorite things he wrote about was a game called New Eleusis, in which one player, God, makes up a rule to govern how cards can be played ("an even card must be followed by a red card", say) and the other players have to figure out the rule by watching which plays are allowed by God and which ones are rejected. Just for my own amusement, I wrote a computer program to play the game and presented it at the IJCAI conference in, I think, 1981.

That paper became a chapter in a book on computer game playing–so my first book was about finding patterns in data. Aside from that, my interest in finding patterns in data lay dormant for years. At Thinking Machines, I was in the compiler group. In particular, I was responsible for the run-time system of the first Fortran Compiler for the CM-2 and I represented Thinking Machines at the Fortran 8X (later Fortran-90) standards meetings.

What changed my direction was that Thinking Machines got an export license to sell our first machine overseas. The machine went to a research lab just outside of Paris. The Connection Machine was so hard to program that if you bought one, you got an applications engineer to go along with it. None of the applications engineers wanted to go live in Paris for a few months, but I did.

Paris was a lot of fun, and so, I discovered, was actually working on applications. When I came back to the States, I stuck with that applied focus, and my next assignment was to spend a couple of years at Epsilon (then a subsidiary of American Express), working on a database marketing system that stored all the "records of charge" for American Express card members. The purpose of the system was to pick ads to go in the billing envelope. I also worked on some more general-purpose data mining software for the CM-5.

When Thinking Machines folded, I had the opportunity to open a Cambridge office for a Virginia-based consulting company called MRJ that had been a major channel for placing Connection Machines in various government agencies. The new group at MRJ was focused on data mining applications in the commercial market. At least, that was the idea. It turned out that they were more interested in data warehousing projects, so after a while we parted company.

That led to the formation of Data Miners. My two partners in Data Miners, Gordon Linoff and Brij Masand, share the Thinking Machines background.

To tell the truth, I really don’t know much about the university programs in data mining that have started to crop up. I’ve visited the one at NC State, but not any of the others.

I myself teach a class in “Marketing Analytics” at the Carroll School of Management at Boston College. It is an elective part of the MBA program there. I also teach short classes for corporations on their sites and at various conferences.

Ajay- At the previous Predictive Analytics World, you gave a session on forecasting and predicting subscriber levels (http://www.predictiveanalyticsworld.com/dc/2009/agenda.php#day2-6).

It seems the inability to forecast is a problem many companies face today. What do you think are the top five principles of business forecasting that companies need to follow?

Michael- I don't think I can come up with five. Our approach to forecasting is essentially simulation. We try to model the underlying processes and then turn the crank to see what happens. If there is a principle behind that, I guess it is to approach a forecast from the bottom up rather than treating aggregate numbers as a time series.
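As a toy illustration of that bottom-up idea (emphatically not Mr. Berry's actual model), here is a minimal R sketch that simulates subscriber counts from assumed acquisition and churn processes instead of fitting a time series to the aggregate totals. All the rates are made-up placeholders.

```r
# Bottom-up forecast by simulation: model the underlying processes
# (new subscribers arriving, existing ones churning) and turn the crank.
set.seed(1)
months      <- 24
current     <- 10000                    # assumed starting subscriber base
subscribers <- numeric(months)
for (m in 1:months) {
  new_subs <- rpois(1, 400)             # assumed monthly acquisition process
  churned  <- rbinom(1, current, 0.03)  # assumed 3% monthly churn probability
  current  <- current + new_subs - churned
  subscribers[m] <- current
}
plot(subscribers, type = "b", xlab = "Month", ylab = "Forecast subscribers")
```

Re-running the simulation many times would give a distribution of outcomes rather than a single point forecast.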

Ajay- You often partner your talks with SAS Institute, and your blog at http://blog.data-miners.com/ sometimes contains SAS code as well. What particular features of the SAS software do you like? Do you use just Enterprise Miner, or other modules as well, for survival analysis or forecasting?

Michael- Our first data mining class used SGI's MineSet for the hands-on examples. Later we developed versions using Clementine, Quadstone, and SAS Enterprise Miner. Then market forces took hold. We don't market our classes ourselves; we depend on others to market them and then share in the revenue.

SAS turned out to be much better at marketing our classes than the other companies, so over time we stopped updating the other versions. An odd thing about our relationship with SAS is that it is only with the education group. They let us use Enterprise Miner to develop course materials, but we are explicitly forbidden to use it in our consulting work. As a consequence, we don’t use it much outside of the classroom.

Ajay- Any other software you use (apart from SQL and J)?

Michael- We try to fit in with whatever environment our client has set up. That almost always is SQL-based (Teradata, Oracle, SQL Server, . . .). Often SAS Stat is also available and sometimes Enterprise Miner.

We run into SPSS, Statistica, Angoss, and other tools as well. We tend to work in big data environments so we’ve also had occasion to use Ab Initio and, more recently, Hadoop. I expect to be seeing more of that.

Biography-

Together with his colleague, Gordon Linoff, Michael Berry is author of some of the most widely read and respected books on data mining. These best sellers in the field have been translated into many languages. Michael is an active practitioner of data mining. His books reflect many years of practical, hands-on experience down in the data mines.


Data Mining Techniques for Marketing, Sales and Customer Relationship Management

by Michael J. A. Berry and Gordon S. Linoff
copyright 2004 by John Wiley & Sons
ISBN


Mining the Web

by Michael J.A. Berry and Gordon S. Linoff
copyright 2002 by John Wiley & Sons
ISBN 0-471-41609-6

Non-English editions available in Traditional Chinese and Simplified Chinese

This book looks at the new opportunities and challenges for data mining that have been created by the web. The book demonstrates how to apply data mining to specific types of online businesses, such as auction sites, B2B trading exchanges, click-and-mortar retailers, subscription sites, and online retailers of digital content.

Mastering Data Mining

by Michael J.A. Berry and Gordon S. Linoff
copyright 2000 by John Wiley & Sons
ISBN 0-471-33123-6

Non-English editions available in Japanese, Italian, Traditional Chinese, and Simplified Chinese

A case study-based guide to applying data mining techniques for solving practical business problems. These “warts and all” case studies are drawn directly from consulting engagements performed by the authors.

A data mining educator as well as a consultant, Michael is in demand as a keynote speaker and seminar leader in the area of data mining generally and the application of data mining to customer relationship management in particular.

Prior to founding Data Miners in December 1997, Michael spent 8 years at Thinking Machines Corporation. There he specialized in the application of massively parallel supercomputing techniques to business and marketing applications, including one of the largest database marketing systems of the time.

Business Analytics: Analyst Relations / Ethics / White Papers

Curt Monash, whom I respect and have tried (unsuccessfully) to interview, points out some ethical dilemmas and gray areas in Analyst Relations in Business Intelligence at http://www.dbms2.com/2010/07/30/advice-for-some-non-clients/

If you don't know what Analyst Relations are, well, they're like credit rating agencies for BI software. Read Curt's landscaping of the field (I am quoting a summary below) at http://www.strategicmessaging.com/the-ethics-of-white-papers/2010/08/01/

Vendors typically pay for white papers because:

  1. They want to connect with sales prospects.
  2. They want general endorsement from the analyst.
  3. They specifically want endorsement from the analyst for their marketing claims.
  4. They want the analyst to do a better job of explaining something than they think they could do themselves.
  5. They want to give the analyst some money to enhance the relationship.

Merv Adrian (I interviewed Merv here at http://www.dudeofdata.com/?p=2505) has responded well here at http://www.enterpriseirregulars.com/23040/white-paper-sponsorship-and-labeling/

None of the sites I checked clearly identify the work as having been sponsored in any way I found obvious in my (admittedly) quick scan. So this is an issue, but it's not confined to Oracle.

My 2 cents (not being so well paid 😉) are:

I think Curt was calling out Oracle (which didn't respond) and not Merv (whose subsequent blog post does much to clarify).

As a comparatively new/younger blogger in this field, I applaud both Curt for trying to bell the cat (or point out what everyone in AR winks at) and Merv for standing by him.

In the long run, it would strengthen analyst relations as a channel if analysts separated financial payment for content from bias. Credit rating agencies forgot to do so in BFSI, and see what happened.

Customers invest millions of dollars in BI systems, trusting marketing collateral, white papers, webinars, tests, and the like. Perhaps it's time for an industry association for analysts, so that individual analysts don't knuckle under to vendor pressure.

It is easier for someone of Curt's or Merv's stature to declare editing policy and disclosures before they write a white paper. It is much harder for everyone else who is not so well established.

White papers can cost as much as $25,000 to produce, and I know people in Business Analytics (as opposed to Business Intelligence) who slog for cents per hour cranking out books on R and SAS, webinars, and trainings, yet there are almost no white papers in BA. Are there any independent analytics analysts who are not biased toward R or SAS or SPSS, etc.? I am not sure, but this looks like a good line to pursue 😉 provided ethical checks and balances are established.

Personally, I know many so-called analytics communities that go all out to please their sponsors, so bias in writing does exist (you can't praise SAS on an R blogging forum or at an R users' meet, and you can't write about WPS at SASCommunity.org).

At the same time, someone once told me it is tough to make a living as a writer, and that choice between easy money and credible writing needs to be respected.

Most sponsored white papers I read are pure advertisements, directed at CEOs rather than the techie community at large.

Almost every BI vendor claims to have the fastest database with 5X speed; benchmarking in technical terms is something they could do too.

Gadget sites benchmark products, but you cannot benchmark BI or even BA products, as many licensing terms explicitly forbid doing so.

Probably that is the reason billions are spent on BI while the positive claims remain doubtful (except to the sellers). Similarly, in analytics, many vendors would have difficulty justifying their claims or prices if subjected to a side-by-side comparison. Unfortunately, the resulting confusion lets shoddy technology come out stronger through more aggressive marketing.

Interview: R for Stata Users

Here is an interview with Bob Muenchen, author of "R for SAS and SPSS Users" and co-author, with Joe Hilbe, of "R for Stata Users".

Describe your new book R for Stata Users and how it is helpful to users.

Stata is a marvelous software package. Its syntax is well designed, concise, and easy to learn. However, R offers Stata users advantages in two key areas: education and analysis.

Regarding education, R is quickly becoming the universal language of data analysis. Books, journal articles and conference talks often include R code because it’s a powerful language and everyone can run it. So R has become an essential part of the education of data analysts, statisticians and data miners.

Regarding analysis, R offers a vast array of methods that R users have written. Next to R, Stata probably has more useful user-written add-ons than any other analytic software. The Statistical Software Components collection at Boston College’s Department of Economics is quite impressive (http://ideas.repec.org/s/boc/bocode.html), containing hundreds of useful additions to Stata. However, R’s collection of add-ons currently contains 3,680 packages, and more are being added every week.  Stata users can access these fairly easily by doing their data management in Stata, saving a Stata format data set, importing it into R and running what they need. Working this way, the R program may only be a few lines long.
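To make that workflow concrete, here is a minimal sketch. The file name and variables are hypothetical; read.dta() from the foreign package (which ships with R) is one common way to pull a Stata-format data set into R.

```r
# Data management done in Stata, analysis done in R.
# "analysis.dta" and its variables are hypothetical placeholders.
library(foreign)                    # provides read.dta()
mydata <- read.dta("analysis.dta")  # the Stata-format data set saved earlier
str(mydata)                         # check what came across
fit <- glm(outcome ~ age + income, data = mydata, family = binomial)
summary(fit)
```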

In our book, the section “Getting Started Quickly” outlines the most essential 50 pages for Stata users to read to work in this way. Of course the book covers all the basics of R, should the reader wish to learn more. Being enthusiastic programmers, we’ll be surprised if they don’t want to read it all.

There are many good books on R, but as I learned the language I found myself constantly wondering how each concept related to the packages I already knew. So in this book we describe R first using Stata terminology and then using R terminology. For example, when introducing the R data frame, we start out saying that it’s just like a Stata data set: a rectangular set of variables that are usually numeric with perhaps one or two character variables. Then we move on to say that R also considers it a special type of “list” which constrains all its “components” to be equal in length. That then leads into entirely new territory.
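A quick sketch of that dual nature, using a made-up example:

```r
# A data frame looks like a Stata data set: a rectangular set of variables.
mydata <- data.frame(
  workshop = c(1, 2, 1, 2),
  gender   = c("f", "f", "m", "m"),
  q1       = c(1, 2, 2, 3)
)
mydata$q1        # a column by name, much like a Stata variable
mydata[["q1"]]   # the same column, accessed as a list component
length(mydata)   # the number of components (variables): 3
```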

The entire book is laid out to make learning easy for Stata users. The names used in the table of contents are Stata-based. The reader may look up how to "collapse" a data set by a grouping variable to find that one way R can do that is with the mysteriously named "tapply" function. A Stata user would never have guessed to look for that name. When reading from cover to cover that may not be that big of a deal, but as you go back to look things up it's a huge time saver. The index is similar in that you can look every subject up by its Stata name to find the R function or vice versa. People see me with both my books near my desk and chuckle that they're there for advertising. Not true! I look details up in them all the time.
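For instance, here is a small sketch of that "collapse" example, with made-up data:

```r
# Roughly what Stata's "collapse (mean) q1, by(workshop)" does:
mydata <- data.frame(workshop = c(1, 2, 1, 2), q1 = c(1, 2, 2, 3))
tapply(mydata$q1, mydata$workshop, mean)       # collapse by a grouping variable
aggregate(q1 ~ workshop, data = mydata, mean)  # same summary, as a data frame
```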

I didn’t have enough in-depth knowledge of Stata to pull this off by myself, so I was pleased to get Joe Hilbe as a co-author. Joe is a giant in the world of Stata. He wrote several of the Stata commands that ship with the product including glm, logistic and manova. He was also the first editor of the Stata Technical Bulletin, which later turned into the Stata Journal. I have followed his work from his days as editor of the statistical software reviews section in the journal The American Statistician. There he not only edited but also wrote many of the reviews which I thoroughly enjoyed reading over the years. If you don’t already know Stata, his review of Stata 9.0 is still good reading (November 1, 2005, 59(4): 335-348).

Describe the relationship between Stata and R, and how it is the same as or different from that between SAS/SPSS and R.

This is a very interesting question. I pointed out in R for SAS and SPSS Users that SAS and SPSS are structured very similarly while R is totally different. Stata, on the other hand, has many similarities to R. Here I’ll quote directly from the book:

• Both include rich programming languages designed for writing new analytic methods, not just a set of prewritten commands.

• Both contain extensive sets of analytic commands written in their own languages.

• The pre-written commands in R, and most in Stata, are visible and open for you to change as you please.

• Both save command or function output in a form you can easily use as input to further analysis.

• Both do modeling in a way that allows you to readily apply your models for tasks such as making predictions on new data sets. Stata calls these postestimation commands and R calls them extractor functions (see the sketch after this list).

• In both, when you write a new command, it is on an equal footing with commands written by the developers. There are no additional “Developer’s Kits” to purchase.

• Both have legions of devoted users who have written numerous extensions and who continue to add the latest methods many years before their competitors.

• Both can search the Internet for user-written commands and download them automatically to extend their capabilities quickly and easily.

• Both hold their data in the computer’s main memory, offering speed but limiting the amount of data they can handle.
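A brief illustrative sketch of extractor functions at work, using the mtcars data set that ships with R (generic R usage, not an excerpt from the book):

```r
# Fit a model once, then "extract" what you need, much as Stata users
# run postestimation commands after a model:
fit <- lm(mpg ~ wt + hp, data = mtcars)
coef(fit)                              # parameter estimates
confint(fit)                           # confidence intervals
predict(fit, newdata = mtcars[1:3, ])  # apply the model to new data
residuals(fit)[1:3]                    # first few residuals
```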

Can the book be used by an R user for learning Stata?

That’s certainly not ideal. The sections that describe the relationship between the two languages would be good to know and all the example programs are presented in both R and Stata form. However, we spend very little time explaining the Stata programs while going into the R ones step by step. That said, I continue to receive e-mails from R experts who learned SAS or SPSS from R for SAS and SPSS Users, so it is possible.

Describe the response to your earlier work, R for SAS and SPSS Users, and whether any new edition is forthcoming.

I am very pleased with the reviews for R for SAS and SPSS Users. You can read them all, even the one really bad one, at http://r4stats.com. We incorporated all the advice from those reviews into R for Stata Users, so we hope that this book will be well received too.

In the first book, Appendix B: A Comparison of SAS and SPSS Products with R Packages and Functions has been particularly popular for helping people find the R packages they need. As it expanded, I moved it to the web site: http://r4stats.com/add-on-modules. All three packages are changing so fast that I sometimes edit that table several times per week!
The second edition to R for SAS and SPSS Users is due to the publisher by the end of February, so it will be in the bookstores by sometime in April 2011, if all goes as planned. I have a list of thirty new topics to add, and those won’t all fit. I have some tough decisions to make!
On a personal note, Ajay, it was a pleasure getting to meet you when you came to UT, especially our chats on the current state of the analytics market and where it might be headed. I love the fact that the Internet allows people to meet across thousands of miles. I look forward to reading more on DecisionStats!
About –

Bob Muenchen has twenty-eight years of experience consulting, managing, and teaching in a variety of complex, research-oriented computing environments. You can read about him here: http://web.utk.edu/~muenchen/RobertMuenchenResume.html