Give yourself a Tax Rebate: Google Docs and other stuff you already knew

If I remember correctly, the last time the US government mailed out checks to many people, the tax rebate was as low as $300. You can save yourself much more than that by doing the following-

1) Switch to Ubuntu Linux at http://www.ubuntu.com/products/WhatIsUbuntu/desktopedition

2) Use Google Docs at http://docs.google.com (keep your data securely online) and OpenOffice (which comes with Ubuntu above, or at http://download.openoffice.org/)

3) Use a trusted antivirus solution from AVG (http://free.avg.com/). Hesitant? Well, it happens to be the most downloaded software on CNET’s Download.com

4) Insist on this free software with your IT department and at the store, even if your new laptop or PC comes bundled with other software. Those licensing costs are embedded within your hardware costs.

5) Start using Amazon EC2 if you work with large data at the office.

6) Use R for analytics work instead of the hugely expensive closed source analytical programs. Rattle is an easy to learn GUI for it: http://www.rattle.togaware.com . See the book on this from the right sidebar or at www.rforsasandspssusers.com .
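Getting started with Rattle is quick. A minimal sketch, assuming a working R installation with internet access (on some systems Rattle also needs the GTK+ libraries installed):

    # Install and launch the Rattle GUI for data mining
    install.packages("rattle")   # one-time download from CRAN
    library(rattle)
    rattle()                     # opens the point-and-click interface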

Chances are you just saved yourself more than $1000 per head by doing this. If you used options 5 and 6, the savings could be even more substantial, running into tens of thousands of dollars.

If you have to CHOOSE between saving costs, maybe saving your job or even your subordinates' jobs, OR making Bill Gates richer so he can give YOUR money away to charity, what would you choose? The time is RIGHT NOW.

Fudging Data: The How, The Why and Catching it

An often encountered problem in data management as well as reporting is data inaccuracy. I was tempted to write about this while poring through reams of data that I had specifically been told to investigate for veracity.

Why data is fudged

Some data problems are due to bad data gathering systems, some are due to wrong specifications, and some are due to plain bad or simplistic assumptions.

Data fudging, on the other hand, is inventing data to fit a curve or trend; it is deliberate and thus harder to catch.

It can also take the form of presenting confusing rather than inaccurate data, just to avoid greater scrutiny.

Sometimes it may be termed over-fitting, but over-fitting is generally due to statistical and programmatic reasons rather than human ones.

 

Note that fudging data, or even talking about it, is not really politically correct in the data world, yet it exists at all levels, from students preparing survey samples to budgetary requests.

I am outlining some ways to recognize data fudging – and to catch a fudge, you sometimes have to think like a fudger.

How data is often fudged-

  1. Factors - This starts by recognizing all the factors that can positively or negatively impact the final numbers being presented. Note that the list can be padded with many more factors than needed, just to divert attention from the main causal factors.
  2. Sensitivity - This gives the range of answers obtained by tweaking individual factors within a certain range, say +/- 10%, and noting the final figures. Assumptions can be made conservative or aggressive in weighting the causal factors, in order to suit the desired final numbers (see the sketch after this list).
  3. Causal Equation - This means recognizing the interplay between the various factors: their correlation with each other, and their contribution to variance in the final numbers. The causal equation can then be tweaked by playing with weights, the powers in a polynomial expression, and the correlation between factors.
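To make the sensitivity step in point 2 concrete, here is a minimal R sketch; the factor names, weights and the +/- 10% range are invented purely for illustration:

    # Toy sensitivity check: vary each factor by +/- 10% and watch the output
    factors <- c(price = 100, volume = 5000, cost = 80)   # hypothetical inputs
    weights <- c(price = 1, volume = 0.01, cost = -1)     # hypothetical weights
    outcome <- function(f) sum(weights * f)

    baseline <- outcome(factors)
    for (name in names(factors)) {
      for (shock in c(0.9, 1.1)) {
        f <- factors
        f[name] <- f[name] * shock
        cat(name, shock, "->", outcome(f), "(baseline:", baseline, ")\n")
      }
    }

A fudger who wants a particular final number simply searches this same space of weights and shocks until the output matches; a checker can rerun the sweep and ask why one particular combination was chosen.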

How data fudging is often caught-

  1. Sampling - Use a random or holdout sample and check whether the final answer converges to what is reported. The validation sample technique is powerful for recognizing data modeling inaccuracies.
  2. Checking assumptions - For risk management reasons, always consider conservative or worst case scenarios first and then build up your analysis. Similarly, when checking an analysis, look for over-optimism, and scrutinize the period of history on which the assumed growth factors and sensitivities are based.
  3. Missing Value and Central Value Movements - If a portion of the data is missing, check the mean as well as the median for both the reported and the overall data. You can also resample, repeatedly taking random samples from the data and checking whether these values hold firm (a minimal sketch follows this list).
  4. Early Warning Indicators - Ask the question (loudly): if this analysis were totally wrong, which indicator would give us the first sign? That indicator can then be incorporated into a metric tracking early warning system.
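Here is a minimal R sketch of the resampling check in point 3, on simulated data (the numbers are made up; substitute your reported figures):

    # Do the mean and median stay stable across repeated random samples?
    set.seed(42)
    reported <- rnorm(1000, mean = 50, sd = 10)   # stand-in for reported data

    for (i in 1:5) {
      s <- sample(reported, size = 200)
      cat(sprintf("sample %d: mean = %.2f, median = %.2f\n",
                  i, mean(s), median(s)))
    }
    cat(sprintf("overall : mean = %.2f, median = %.2f\n",
                mean(reported), median(reported)))

If the sample statistics wander far from the overall figures, or a reported subset disagrees sharply with the rest, the gap is worth investigating.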

Note that the above are simplified accounts of numbers I have seen being presented wrongly, or being fudged. They are based on my own experiences, so feel free to add your share of data anecdotes.

Using these simple techniques could have helped many people in financial as well as other decision making, including budgetary and even strategic areas.

As the saying goes: In God we trust; everybody else has to bring data (which we will have to check before trusting it).

A Base SAS to Java Compiler

Republished by demand: here is a nice SAS to Java compiler. It basically cuts away at the problems of executing legacy SAS code and of SAS training, and focuses on executing the tasks in Java, thus making them much faster.

It’s available at http://dullesopen.com/

And it’s free for personal use. And academic use.


I quote from the website "

Carolina Benefits

Converting Base SAS® to Java with Carolina provides two main benefits to enterprises:

  • Savings on license fees. Carolina costs about 70% less than SAS.
  • Performance gains. Carolina-converted code runs significantly faster than the native SAS program.

Additional Benefits

  • Greater flexibility. Java is an industry-standard environment that runs on all platforms. It is much easier to support than the legacy SAS environment it replaces.
  • Better integration. Carolina, as a Java application, supports web services through true J2EE integration.
  • Flawless automated conversion. Eliminate time-consuming, error-prone manual conversion.
  • Simpler contracts. Carolina is licensed in a simple, straightforward fashion.
  • Reduced training costs. Carolina-converted programs can be understood by analysts without training in SAS, and SAS-trained analysts don’t need to learn a new programming language."

Images to Data: OCR Software

Some software packages I have used to convert images into text, and even tables, are –

 http://code.google.com/p/ocropus/

and http://code.google.com/p/tesseract-ocr/

Note: both are open source and funded by Google, which uses OCR to enhance search as well as for its book scanning project. These packages are greatly helpful for, say, email marketing or for converting images into rows and columns of text.

An additional piece of open source imaging software is available from http://www.leptonica.com/

You may need to tweak the resolution a bit, and the highlighted scan area, in order to get good results; you can thus convert images into text and numeric data with a simple desktop scanner.
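As an example, here is a minimal sketch of driving Tesseract from R; the file names are hypothetical, and it assumes the tesseract binary is installed and on your path:

    # Run the Tesseract OCR engine on a scanned image, then read the text back
    system("tesseract scan.tif out")   # writes recognized text to out.txt
    text <- readLines("out.txt")
    print(text)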

For higher end needs, like production environments for questionnaires and responses-

The following comes from the SPSS X list (it’s a nice list, with many business problems that are also familiar from the SAS and R lists).

The software that SPSS recommends is ReadSoft (http://www.readsoft.com/). Additionally, SPSS has a couple of complementary products, mrPaper (http://www.spss.com/mrpaper/) and mrScan (http://www.spss.com/mrScan/).

And from Dr Steven Lars on the same list:

On top of the high 90%+ accuracy of modern scanning OCR technology, Remark’s OMR (optical mark recognition) algorithms produce 99.9%+ accuracy in detecting full or empty closed circles and squares. It worked very well.

Scanning has been accomplished using upper end, hobby grade scanners with automatic form-feed options, driven by Windows software. Though these machines were purchased through university purchasing, all could have been bought at Best Buy (a discount technology store), a comparable store, or from internet discount sources.

You can store data for research purposes, access the data using SPSS or Excel, and respond to counselor questions within a week of receiving the raw, paper surveys.

The current preference is to use web-based information gathering developed through university information technology resources, or developed on www.Zoomerang.com (or even www.surveymonkey.com).

The biggest challenges were:
You have to be careful in the production of the to-be-scanned forms. Our forms had to be printed on the same copy machine from a set master, or printed on the same laser printer, to ensure accuracy. Poor quality copies and printing yield inaccurate scanning. Survey color also has to be managed carefully, as some colors are opaque to some optical scanners.
 

A link to the publishers of Remark OMR:
http://www.gravic.com/remark/officeomr/index.html?gclid=CJj01MTSiZgCFRxNagodd0_IDQ

The SAS-L Rookie of the Year

Well, I have been told I am on the SAS-L Rookie of the Year list at http://www.listserv.uga.edu/cgi-bin/wa?A1=ind0901b&L=sas-l#29.

With 351 posts in 2008 and 0 in 2007, you can certainly say I have been an active rocky, I mean rookie on the list.

Some of the things I did were-

  • Share experiences with SAS language code, including automation
  • Share and ask about non-SAS software areas like Google Docs, cloud computing, and WPS comparisons
  • Provoke, by design and mostly by accident, discussion on R, SAS software pricing, the relationship and dependence between sasCommunity.org and the SAS Institute, and diversity and international issues on the list

I believe SAS is good software and the SAS Institute has been a pioneer, but it needs to listen to feedback from its retail customers just as much as it needs to make money. In particular:

  1. a more transparent way of announcing strategic intent on where they are going to concentrate research, and
  2. maybe a more nuanced public relations stance on rival software, with
  3. a readiness to once again experiment with, if not embrace, open source contributions (particularly via some interface to R code, as well as R datasets), could lead to great stuff from the SAS Institute again.

P.S. I don’t expect to win, though. I am bad at elections.

Top ten RRReasons R is bad for you?

 


 

R is the programming language from www.r-project.org

R is bad for you because –

1) It is slower with bigger datasets than the SPSS and SAS languages. If you use bigger datasets, then you should either consider more hardware, or try some of the ODBC connectivity packages (see the sketch after this list).

2) It needs more time to learn than the SAS language. Much more time, to learn how to do much more.

3) R programmers are paid less than SAS programmers. They prefer it that way. It equates the satisfaction of creating a package, developed with a worldwide community, with the satisfaction of using a package and earning much more money per hour.

4) It forces you to learn the exact details of what you are doing, due to its object oriented structure. Thus you either get no answer or get an exact answer. Your customer pays you by the hour, not by the correct answer.

5) You cannot push a couple of buttons, or refer to a list of the ten most commonly used commands, to finish the project.

6) It is free. And open for all. It is socialism expressed in code. Some of the packages are built by university professors. It is free. Free is bad. Who pays the mortgages of the software programmers if all software were free? Who pays for the Friday picnics? Who pays for the Good Night cruises?

7) It is free. Your organization will not commend you for saving it money; it will question why you did not recommend this before, and why you approved all those packages that expire in 2011. R is fReeeeee. Customers feel good while spending money. The bigger the software budgets you approve, the bigger your salary. R thReatens all that.

8) It is impossible to install a package you do not need or want. There is no one calling you on the phone to consider one more package or solution. R can make you lonely.

9) R mostly uses the command line. The command line is from the Seventies. Or the Eighties. The GUIs Rcmdr and Rattle are there, but still…

10) R forces you to learn new stuff by the month. You prefer to only earn by the month. Till the day your job got offshored…
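As an aside to point 1 above: rather than waiting, you can already push the heavy lifting to a database today. A minimal sketch using the RODBC package (the DSN, table and column names are hypothetical):

    # Query a large table through ODBC instead of loading it all into R
    library(RODBC)                 # install.packages("RODBC") first
    ch  <- odbcConnect("mydsn")    # an ODBC DSN configured on your system
    dat <- sqlQuery(ch, "SELECT id, amount FROM big_table WHERE year = 2008")
    odbcClose(ch)
    summary(dat)                   # only the subset ever enters R's memory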

Written by an R user, in the English language

(which fortunately was not copyrighted, otherwise we would be paying Britain for each word)

The above post was reprinted by request.

Using R and Excel Together

I put up a question on the R list about using VBA macros from within Excel. It seems you can use R from within Excel, and can customize it so that the end user does not even know R is involved. It is called RExcel (what else!).

Quoting Erich from the R archives: “There is RExcel (available by downloading the CRAN package RExcelInstaller). It allows you to transfer data between R and Excel, and to run R code from within Excel. So you can start with your data in Excel, let R do an analysis, and transfer the results back to Excel. You can write VBA macros which do this, but ‘hidden from exposure’, so the Excel user does not even notice that R is doing the hard work.

It also has an Excel worksheet function RApply, which allows you to call an R function from an Excel cell formula: =RApply("rfun", A1) would apply the R function rfun to the value in cell A1. If the value in A1 changes, Excel will force R to recalculate the formula.

There is a (half hour long) video demo about RExcel
at http://rcom.univie.ac.at/RExcelDemo/

http://rcom.univie.ac.at/ has more information about the project.”
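To make the RApply mechanism concrete, here is a small sketch: an ordinary function defined on the R side, and the Excel formula from the quote above that would call it (rfun and cell A1 are just the example names used in the quote):

    # Define an R function to be called from an Excel cell via RExcel
    rfun <- function(x) {
      round(sqrt(x), 2)   # any R computation will do
    }
    # In Excel, with RExcel loaded, entering the cell formula
    #   =RApply("rfun", A1)
    # applies rfun to the value in A1 and recalculates when A1 changes.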

 

 

This can help save a huge amount in costs, as Excel is the least expensive analytical software and is present at virtually all analytics companies.

 

More news on R here: http://bits.blogs.nytimes.com/2009/01/08/r-you-ready-for-r/