Updated-R for SAS and SPSS Users

Updated – I finally got my hardback copy of R for SAS and SPSS Users. Digital copies are one thing, but a paper book is really beautiful. I had written an article on R (with some mild sarcasm about some other software packages that are mildly more expensive) at Smart Data Collective. That article drew around 711 views, and my website got X00 hits that day, which is a personal best, ehmm 🙂

It also inspired Sandro, a terrific data miner from Switzerland with a PhD, to write an article called Top 5 Reasons R is Good for You, which can be accessed at http://smartdatacollective.com/Home/15756 and http://dataminingresearch.blogspot.com/2009/01/top-5-reasons-r-is-good-for-you.html

The story of how I wrote that Top Ten R article is also amusing – it is mentioned here by Jerry, who creates terrific content communities, all extremely digital and informative: http://www.socialmediatoday.com/SMC/67268

Now, the reason I originally became involved with R was that I couldn't afford SAS and SPSS on my own computer after years of getting companies to pick up the tab. A question on the R help list led me to Bob Muenchen, who had written a short guidebook on R for SAS and SPSS users and was then finishing his book. The following interview is interesting given that it was done almost 3-4 months back, yet some themes and events have recurred exactly as Bob mentioned them. I still bounce between Bob's book and the Rattle guide for R programming, but I am getting there!

Note: Robert Muenchen (pronounced Min'-chen) is the author of the famous R for SAS and SPSS Users, and his book is an extensive tutorial for anyone wanting to learn SAS, SPSS, or R, or to migrate from one platform to another. In an exclusive interview, Bob agreed to answer some questions on the book and on students planning to enter science careers.

What made you write R for SAS and SPSS Users?

The book:

A few years ago, all my colleagues seemed to be suddenly talking about R. Had I tried it? What did I think? Wasn't it amazing? I searched around for a review and found an article by Patrick Burns, "R Relative to Statistics Packages", which is posted on the UCLA site (http://www.ats.ucla.edu/stat/technicalreports/). That article pointed out the many advantages of R, and in it Burns claimed that knowing a standard statistics package interfered with learning R. That really got my interest up. Pat's article was a rejoinder to "Strategically using General Purpose Statistics Packages: A Look at Stata, SAS and SPSS" by Michael Mitchell, then the manager of statistical consulting at UCLA (it's at that same site). In it he said little about R, other than that he had "enormous difficulties" learning it and that he especially found the documentation lacking.

I dove in and started learning R. It was incredibly hard work, most of which was caused by my expectations of how I thought it ought to work. I did have a lot to "unlearn", but once I figured a certain step out, I could see that explaining it to another SAS or SPSS user would be relatively easy. I started keeping notes on these differences, initially for myself. I finally posted them on the Internet as the first version of R for SAS and SPSS Users. It was only 80 pages, and much of its explanation was in the form of extensive R program comments. I provided 27 example programs, each done in SAS, SPSS and R. A person could see how they differed, topic by topic. When a person ran the sections of the R programs and read all the comments, he or she would learn how R worked.

A web page counter on that document showed it was getting about 10,000 hits a month. That translates into about 300 users paging back and forth through the document (roughly 30 page views per reader). An editor from Springer emailed me to ask if I could make it a book. I said it might be 150 pages when I wrote out the prose to replace all the comments. It turned out to be 480 pages!

What are the salient points in this book?

The main point is that having R taught to you using terms you already know will make R much easier to learn. SAS and SPSS concepts are used in the body of the book as well as the table of contents, the index and even the glossary. For example, the table of contents has an entry for "Value Labels or Formats" even though R uses neither of those terms as SPSS and SAS do, respectively. The index alone took over 80 hours to compile because it is important for people to be able to look up things like "length" as both a SAS statement and as an R function. The glossary defines R terms using SAS/SPSS jargon and then again using proper R definitions.

SAS and SPSS each have five main parts: 1) commands to read and manage data, 2) procedures for statistics & graphics, 3) output management systems that allow you to use output as input to other analyses, 4) a macro language to automate the above steps and finally 5) a matrix language to help you extend the packages. All five of these parts use different statements and rules that do not apply to the others. Due to the complexity of all this, many SAS and SPSS users never get past the first two parts.

R instead has all these functions unified into a single common structure. That makes it much more flexible and powerful. This claim may seem a matter of opinion, but the evidence to back it up comes from the companies themselves: the developers at SAS Institute and SPSS Inc. don't write their procedures in their own languages; R developers do.
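To make that contrast concrete, here is a minimal R sketch of all five "parts" living in one language. The file name and the variables y and x are hypothetical, not from Bob's book:

    # Data management, statistics, output reuse, automation and
    # matrix algebra -- all with the same syntax and rules.
    mydata <- read.csv("mydata.csv")            # read data (like a DATA step)
    model  <- lm(y ~ x, data = mydata)          # fit a model (like a PROC)
    coefs  <- coef(model)                       # output is an ordinary object, reusable as input
    run_it <- function(df) lm(y ~ x, data = df) # automation: a plain function, no macro language
    X <- model.matrix(model)                    # the matrix language is built in
    crossprod(X)                                # t(X) %*% X, ready for further analysis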

How do you think R will impact the statistical software vendors?

With more statistical procedures than any other package, and its free price, some people think R will put many of the proprietary vendors out of business. R is a tsunami coming at the vendors and how they respond will determine their future. Take SPSS Inc. for example. They have written an excellent interface to R that lets you transfer your data back and forth, letting you run R functions in the middle of your SPSS programs. I show how to use it in my book. Starting with SPSS 17, you can also add R functions to the SPSS menus. This is particularly important because most SPSS users prefer to use menus. The company itself is adding menus to R functions, letting them rapidly expand SPSS’ capabilities at very little expense. They saw the R tsunami coming and they hopped on a surfboard to make the most of it. I think this attitude will help them thrive in the future.

SAS Institute has so far been ignoring R. That means if you need to use an analytic method that is only available in R, you must learn much more R than an SPSS user would. Once you have done that, you might be much more likely to switch over completely to R. Colleagues inside SAS Institute tell me they are debating whether they should follow SPSS' lead and write a link to R. This has already been done by MineQuest, LLC (see http://www.minequest.com/Products.html) with their amusingly named "A Bridge to R" product (playing off "A Bridge Too Far").

Statistica is officially supporting R. You can read about the details at http://www.statsoft.com/industries/Rlanguage.htm. StataCorp has not supported R in Stata yet, although a user, Roger Newson, has written an R interface to it (http://ideas.repec.org/c/boc/bocode/s456847.html).

The company with the most to lose is the maker of S-PLUS. That was Insightful Corp. until they were recently bought out by Tibco. Since R is an implementation of the S language, S-PLUS could be hit pretty hard. On the other hand, they do have functions that handle "big data", so there is a chance that people will develop programs in R, run out of memory and then end up porting them to S-PLUS. S-PLUS also has a more comprehensive graphical user interface than R does, giving them an advantage. However, XL-Solutions Corp. has a new R-PLUS version that adds a slick GUI to R (http://www.experience-rplus.com/). There could be a rocky road ahead for S-PLUS. IBM faced a similar dilemma when computing hardware started becoming a commodity. They prospered by making up the difference with service income. Perhaps Tibco can too.

Do you have special discounts for students?

My original version of R for SAS and SPSS Users is still online at http://RforSASandSPSSusers.com, so students can get it there for free. The book version has a small market that is mostly students, so pricing was set with that in mind.

What made you choose a career in science, and what have been the reasons for your success in it?

I started out as an accounting major. I was lucky enough to have had two years of bookkeeping in high school, and I worked part-time in the accounting department of ServiceMaster Industries for several years. I got to fill in for whoever was on vacation, so I got a broad range of accounting experience. I also got my first experience with statistics by helping the auditors. We took a stratified sample of transactions: transactions were divided into segments by their value, and we sampled a greater proportion as the value increased. For the most expensive transactions, we examined them all. My job was to be the "gofer" who collected all the invoices, checks, etc. to prove that the transactions were real. For a kid in high school, that was great fun!
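As a purely illustrative sketch of that kind of audit sample, here is how a value-stratified sample might look in R. The amounts, strata cut-offs and sampling rates are all invented for the example:

    set.seed(42)
    transactions <- data.frame(id = 1:1000,
                               amount = round(rlnorm(1000, meanlog = 6), 2))
    # divide transactions into strata by value
    transactions$stratum <- cut(transactions$amount,
                                breaks = c(0, 500, 5000, Inf),
                                labels = c("small", "medium", "large"))
    # sample a greater proportion as the value increases;
    # examine the most expensive stratum in full
    rates <- c(small = 0.05, medium = 0.25, large = 1.00)
    audit <- do.call(rbind, lapply(split(transactions, transactions$stratum),
      function(s) s[sample(nrow(s), ceiling(rates[[as.character(s$stratum[1])]] * nrow(s))), ]))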

By the time I was a freshman at Bradley University, I became excited by three new areas: mathematics, computing and psychology. I got to work in a lab at the Peoria Addictions Research Institute, studying addiction in rats and the parts of the brain that were involved. I wrote a simple stat package in FORTRAN to analyze data. After getting my B.A. in psychology, I worked on a PhD in Educational Psychology at Arizona State University. I loved that field and did well, but the job market for professors in that field was horrible at the time. So I transferred to a PhD program in Industrial/Organizational Psychology at The University of Tennessee. It turned out that I did not really care for that area at all, and I spent much of my time studying computing and calculus. My assistantship was with the Department of Statistics. By the time my first year was up, I transferred to statistics. At the time the department lacked a PhD program, so after four years of grad school I stopped with an M.S. in Statistics and got a job as a computing consultant helping people with their SAS, SPSS and STATGRAPHICS programs. Later I was able to expand that role, creating a full-fledged statistical consulting center in partnership with the Department of Statistics. Ongoing funding cuts have been chipping away at that concept though.

What made me a success? I love my job! I get to work with a lot of smart scientists and their grad students, expanding scientific knowledge. What could be better?

Science is boring, and not a well-paying career compared to being a lawyer or working in sales. People think you are a nerd. Please comment based on your experiences.

Science is constantly making new discoveries. That’s not boring! An area that most people can relate to is medicine. When we finish a study that shows a new treatment is better than an old one, our efforts will help thousands of people. In one study we compared a new, very expensive anti-nausea drug to an old one that was quite cheap. The pharmaceutical company claimed the new drug was better of course, but our study showed that it was not. That ended up helping to control health care costs that we all see escalating rapidly.

Another study found, for the first time, a measure that could predict how well a hearing aid would help a person. Now, it's easy to measure a hearing aid and see that it is doing what it is supposed to do, but a huge proportion of people who buy them don't like them and stop wearing them after a brief period. Scientists tried for decades to predict which people would not be good candidates for hearing aids. A very sharp scientist at UT, Anna Nabelek, came up with the concept of Acceptable Noise Level. We measured how much background noise people were willing to tolerate before trying a hearing aid. That allowed us to develop a model that could, for the first time, predict well whether someone should bother spending up to $5,000 on hearing aids. For retired people on a fixed income, that was an important finding. An audiology journal devoted an entire issue to the work.
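The study itself is in the audiology literature; as a purely illustrative sketch (simulated data, invented coefficients, not the actual model), a predictive model of that kind might be fit in R like this:

    set.seed(1)
    anl     <- rnorm(200, mean = 10, sd = 4)          # simulated noise-tolerance scores (dB)
    success <- rbinom(200, 1, plogis(2 - 0.3 * anl))  # lower ANL -> more likely to keep the aid
    fit <- glm(success ~ anl, family = binomial)      # logistic regression
    # predicted probability of success for two hypothetical candidates
    predict(fit, newdata = data.frame(anl = c(5, 15)), type = "response")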

It's true that you can make more money in many other fields. But the excitement of discovery and the feeling that I'm helping to extend science are very satisfying and well worth the lower salary. Plus, having a job in science means you will never have a chance to get bored!

What is your view on Rice University's initiative to create open source textbooks at http://cnx.org/ ?

I think this is a really good idea. One of my favorite statistics books is Statnotes: Topics in Multivariate Analysis, by G. David Garson. You can read it for free at http://www2.chass.ncsu.edu/garson/pa765/statnote.htm.

Universities pay professors to spend their time doing research, which must be published to get credit. So why not pay professors to write textbooks too? There have probably been hundreds of introductory books in every imaginable field. They cannot all make it in the marketplace, so when they drop out of publication, why not make them available for free? I still have my old introductory statistics textbook from 30 years ago, and the material is still good. It may be missing a few modern things like boxplots, but it would not take much effort to bring it up to date.

I'm also a huge fan of Project Gutenberg (http://www.archive.org/details/gutenberg). It is a collection of over 20,000 books, articles, etc. available for free download. My wife does volunteer project management and post-processing with Distributed Proofreaders (http://www.pgdp.net/), which supplies books for Gutenberg.

What are your views on students uploading scanned copies of books to torrent sharing web sites because books are so expensive?

The cost of textbooks has gotten out of hand. I think students should pressure universities and professors to consider cheaper alternatives. However, scanning books and putting them up on web sites isn't sharing; it's stealing. I put in most of my weekends and nights for 2½ years on a book that will be lucky to sell a few thousand copies. That works out to pennies per hour. Seeing it scanned in would be quite depressing.

When is the book coming out? What is taking so long?

We ran into problems when the book was translated from Microsoft Word to LaTeX. The translator program did not anticipate that an index would already be in place. That resulted in 2-3 errors per page. We’re working through that and should finally get it printed in early October.

Biography

Robert A. Muenchen is a consulting statistician with 28 years of experience. He is currently the manager of the Statistical Consulting Center at the University of Tennessee. He holds a B.A. in Psychology and an M.S. in Statistics. Bob has conducted research for a variety of public and private organizations and has assisted on more than 1,000 graduate theses and dissertations. He has coauthored over 40 articles published in scientific journals and conference proceedings. Bob has served on the advisory boards of SPSS Inc., the Statistical Graphics Corporation and PC Week Magazine. His suggested improvements have been incorporated into SAS, SPSS, JMP, STATGRAPHICS and several R packages. His research interests include statistical computing, data graphics and visualization, text analysis, data mining, psychometrics and resampling.

Ajay: He is also a very modest and great human being.

http://www.amazon.com/SAS-SPSS-Users-Statistics-Computing/dp/0387094172/ref=pd_bbs_sr_1?ie=UTF8&s=books&qid=1217456813&sr=8-1

Give yourself a Tax Rebate: Google Docs and other stuff you already knew

If I remember correctly, the last time the US government mailed out checks to many people, the tax rebate was as low as $300. You can save yourself much more than that by doing the following:

1) Switch to Ubuntu Linux at http://www.ubuntu.com/products/WhatIsUbuntu/desktopedition

2) Use only Google Docs from http://docs.google.com (keep data securely online) and OpenOffice (which comes with Ubuntu above, or at http://download.openoffice.org/)

3) Use a trusted antivirus solution from AVG (http://free.avg.com/). Hesitant? Well, it happens to be the most downloaded software on CNET's Download.com.

4) Insist on this free software with your IT department and at your store, even if your new laptop or PC comes bundled with other software. Those costs are embedded within your hardware costs.

5) Start using Amazon EC2 more if you are a large data user at the office.

6) Use R for analytics work instead of the hugely expensive closed-source analytical programs. Here is the easy-to-learn GUI: http://www.rattle.togaware.com (see the sketch below). See the book on that from the right sidebar or at www.rforsasandspssusers.com.
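Getting Rattle running takes only a few lines of R, assuming R itself is installed (on some systems you may also need the GTK+ libraries Rattle depends on):

    install.packages("rattle")   # one-time download from CRAN
    library(rattle)
    rattle()                     # opens the point-and-click data mining GUI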

Chances are you just saved yourself more than $1,000 per head by doing this. If you used options 5 and 6, the savings could be even more substantial, running into tens of thousands of dollars.

If you have to CHOOSE between saving costs, maybe saving your job or even your subordinates' jobs, OR making Bill Gates richer so he can give YOUR money away to charity, what would you choose? The time is RIGHT NOW.

The declining relevance of LinkedIn

I can still remember two years ago when a friend and erstwhile client from the United States sent me a link to www.LinkedIn.com. While today social media is all the rage, back in early 2006 LinkedIn re-defined social networking from chatting with teenagers to actual value delivered to customers. Over a period of time both my network and LinkedIn grew: my network is now 6,200 members, Decision Stats on LinkedIn has 570+ members, and LinkedIn has 30 million people and a reported 1 billion dollar valuation.

 

Yet LinkedIn has slowly been losing my interest, to the point where it is now just a directory service of contacts.


 

Some reasons for the declining relevance of LinkedIn are –

1) Average user interface updates – While www.Facebook.com successfully rolled out a new look to 70 million users, the UI at LI leaves some things to be desired. Glitches include a slower-than-promised rollout of third-party applications, and bugs aplenty in how you update your status, how you remove connections, and the new cluttered home page.

2) Thrust on groups rather than content (Questions and Answers) – Q&A at LI was a great interactive feature, as people answered and posted interesting questions. Its focus has been reduced in favor of group discussion features, which are a halfway effort, somewhere between a free-for-all discussion board and a proper group newsletter. Many successful LI groups made the transition to being full communities in their own right, mostly using www.ning.com. LI was also unable to capture the whole value chain of engaged communities, by not having a newsletter function in groups and by not letting group owners customize things.

3) Top-down user limits – Limits such as joining at most 50 groups and sending at most 3,000 invites meant that LI was slowly punishing active users more than controlling spam. The Open Networkers movement (people who network openly with everyone) was neither predicted nor monetized well by LI.

4) Other reasons for the decline include the inability to fully monetize recruiters (they exist and flourish thanks to LI's inability to channel them into a paying medium), the failure to cut down on spam (which now exists in much bigger volumes due to the bigger user base), and the refusal to create connection-specific privacy (as on Facebook, which lets you set levels of privacy display for your connections).

LI has been a pioneer not only in professional networking but also in using non-ad pricing strategies to keep a steady cash flow. Some new features like LinkedIn Polls are promising, and hopefully the next generation of third-party applications will make the site interesting again.

So there is hope it will get its act together again. However, in a very competitive online ad market, time and speed of reaction are critical. LI does have the first-mover advantage, but it can lose relevance just as Lycos and Yahoo did if it changes more slowly than users want. With the current recession, there is an opportunity for communities like LI to tap into the recruiting market and also to focus on owning and creating, if not just enabling, relevant content for users to read and share.

Blog Boy wins…

Dear List – I just became blogger of the week on http://www.socialmediatoday.com/SMC/

That's because of the R article, the interview with Dr Graham, and other India-specific things that I write about. Even though I did lose the alumni President elections by some miles at www.iiml.org.

You can read the complete article at Social Media here at-

http://www.socialmediatoday.com/SMC/67268

How do I feel? Well, my creation Blog Boy (below) says it best…

Fudging Data: The How, The Why and Catching it

An often-encountered problem in data management as well as reporting is data inaccuracy. I was tempted to write about this while poring through reams of data that I had specifically been told to investigate for veracity.

Why data is fudged

Some data problems are due to bad data-gathering systems, some are due to wrong specifications, and some are due to plain bad or simplistic assumptions.

Data fudging, on the other hand, is clearly inventing data to fit the curve or trend; it is deliberate and thus harder to catch.

It can also be used to present confusing rather than outright inaccurate data, just to avoid greater scrutiny.

Sometimes it may be termed over-fitting, but over-fitting generally has statistical and programmatic causes rather than human ones.

 

Note that fudging data, or even talking about it, is not really politically correct in the data world, yet it exists at all levels, from students preparing survey samples to budgetary requests.

I am outlining some ways to recognize data fudging – and to catch a fudger, you sometimes have to think like one.

How data is often fudged-

  1. Factors – This starts by recognizing all the factors that can positively or negatively impact the final numbers being presented. Note that the list can be expanded to many more factors than needed, just to divert attention from the main causal factors.
  2. Sensitivity – This gives the range of answers obtained by tweaking individual factors within a certain range, say ±10%, and noting the final figures (see the sketch after this list). Assumptions can be either conservative or aggressive in weighting the causal factors, in order to suit the final numbers.
  3. Causal equation – Recognizing the interplay between the various factors, via their correlations as well as their contributions to the variance of the final numbers. The causal equation can then be tweaked, including playing with weights, the powers in a polynomial expression, and the correlations between factors.
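As a concrete (and entirely hypothetical) illustration of point 2, here is an R sketch of tweaking each assumed factor by ±10% and noting the range of final figures an analyst could then choose from; the base numbers and the profit formula are invented:

    base   <- c(price = 100, volume = 5000, cost_rate = 0.6)   # invented assumptions
    profit <- function(f) unname(f["price"] * f["volume"] * (1 - f["cost_rate"]))
    sensitivity <- sapply(names(base), function(nm) {
      lo <- base; hi <- base
      lo[nm] <- 0.9 * base[nm]                # tweak one factor down 10%
      hi[nm] <- 1.1 * base[nm]                # and up 10%
      c(low = profit(lo), high = profit(hi))
    })
    sensitivity   # a wide low-high spread marks the factors a fudger would lean on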

How data fudging is often caught-

  1. Sampling – Use a random sample or holdout sample and check whether the final answer converges to what is known to happen. The validation sample technique is powerful for recognizing data modeling inaccuracies.
  2. Checking assumptions – For reasons of risk management, always consider conservative or worst-case scenarios first and then build up your analysis. Similarly, when checking an analysis, look for over-optimism in the period of history on which the assumed growth factors and sensitivities are based.
  3. Missing value and central value movements – If a portion of the data is missing, check the mean as well as the median for both the reported and the overall data. You can also resample, taking random samples from the data and checking these values repeatedly to see if they hold firm (see the sketch after this list).
  4. Early warning indicators – Ask the question (loudly): if this analysis were totally wrong, what indicator would give us the first sign of it? That indicator could then be incorporated into a metric-tracking early warning system.
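Here is a hedged R sketch of point 3, with simulated figures standing in for real reported data: resample repeatedly and see whether the reported mean and median stay inside the resampled bands:

    set.seed(7)
    reported <- rlnorm(2000, meanlog = 3)   # simulated stand-in for the reported figures
    resamples <- t(replicate(500, {
      s <- sample(reported, size = 500)     # random subsample
      c(mean = mean(s), median = median(s))
    }))
    # 95% bands for the resampled mean and median
    apply(resamples, 2, quantile, probs = c(0.025, 0.975))
    # reported central values falling well outside these bands deserve scrutiny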

Note that the above are simplistic descriptions of numbers I have seen being presented wrongly, or being fudged. They are based on my experiences, so feel free to add your share of data anecdotes.

Using these simple techniques could have helped many people in financial as well as other decision making, including budgetary and even strategic areas.

As the saying goes: In God we trust; everybody else has to bring data (which we will have to check before trusting it).

A Base SAS to Java Compiler

Republished by demand: here is a nice SAS-to-Java compiler. It cuts away at the problem of executing legacy SAS code (and of SAS training) by executing the tasks in Java, thus making them much faster.

It’s available at http://dullesopen.com/

And it's free for personal and academic use.


I quote from the website "

Carolina Benefits

Converting Base SAS® to Java with Carolina provides two main benefits to enterprises:

  • Savings on license fees. Carolina costs about 70% less than SAS.
  • Performance gains. Carolina-converted code runs significantly faster than the native SAS program.

Additional Benefits

  • Greater flexibility. Java is an industry-standard environment that runs on all platforms. It is much easier to support than the legacy SAS environment it replaces.
  • Better integration. Carolina, as a Java application, supports web services through true J2EE integration.
  • Flawless automated conversion. Eliminate time-consuming, error-prone manual conversion.
  • Simpler contracts. Carolina is licensed in a simple, straightforward fashion.
  • Reduced training costs. Carolina-converted programs can be understood by analysts without training in SAS, and SAS-trained analysts don’t need to learn a new programming language."

Zazzle.com and Cafepress.com

Here is a nice new-age Web 2.0 website for creating customized merchandise. You get a share of the royalty and can create products like caps, T-shirts and mugs. I used to have an account some years back at www.cafepress.com, and this website, www.zazzle.com, seems to do the same job, perhaps taking it up a notch.


 

Here is an example of a cap – remember, you can buy it by clicking on it.

 

PS: I got the tip from Sandro at http://dataminingresearch.blogspot.com/

He is offering a lot more stuff for sale.

PPS: Coming up –

A Survey Poll on

  • Online Web Advertisements (Text, Graphic, Flash) and
  • Merchandising (Branded, Third Party, Affiliated)