Blog Boy wins…

Dear List – I just became blogger of the week on http://www.socialmediatoday.com/SMC/

That’s because of the R article, the interview with Dr Graham, and the other India-specific things that I write about. Even though I did lose the alumni president elections by some miles at www.iiml.org.

You can read the complete article at Social Media Today here:

http://www.socialmediatoday.com/SMC/67268

How do I feel? Well, my creation Blog Boy, below, says it best…

Fudging Data: The How, the Why, and Catching It

A frequently encountered problem in data management, as well as in reporting, is data inaccuracy. I was tempted to write about this while poring through reams of data that I had specifically been told to investigate for veracity.

Why data is fudged

Some data problems are due to bad data-gathering systems, some are due to wrong specifications, and some come from plain bad or simplistic assumptions.

Data fudging, on the other hand, is outright inventing data to fit a curve or trend; it is deliberate, and thus harder to catch.

It can also take the form of presenting confusing, rather than inaccurate, data just to avoid greater scrutiny.

Sometimes it may be confused with over-fitting, but over-fitting generally has statistical and programmatic causes rather than human ones.

Note that fudging data, or even talking about it, is not really politically correct in the data world, yet it exists at all levels, from students preparing survey samples to budgetary requests.

I am outlining some ways to recognize data fudging – and to catch a fudger, you sometimes have to think like one.

How data is often fudged-

  1. Factors – This starts by recognizing all the factors that can positively or negatively impact the final numbers being presented. Note that the list can be padded with many more factors than needed, just to divert attention from the main causal factors.
  2. Sensitivity – This gives the range of answers obtained by tweaking individual factors within a certain range, say ±10%, and noting the final figures (see the sketch after this list). Assumptions can be conservative or aggressive in weighting the causal factors, so as to suit the desired final numbers.
  3. Causal equation – Recognizing the interplay between the various factors (through correlation) and their effect on the final numbers (through changes in variance). The causal equation can then be tweaked by playing with weights, the powers in a polynomial expression, and the correlations between factors.
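
To make the sensitivity step concrete, here is a minimal R sketch; the profit function and every number in it are hypothetical, purely for illustration.

```r
# Hypothetical causal equation: profit driven by price, cost and volume
profit <- function(price, cost, volume) (price - cost) * volume

# Invented base-case assumptions
base <- c(price = 100, cost = 62, volume = 5000)

# Tweak the cost factor within +/-10% and note how the final figure moves
tweaks <- seq(0.90, 1.10, by = 0.05)
sapply(tweaks, function(t) profit(base["price"], base["cost"] * t, base["volume"]))
```

A fudger runs the same loop in reverse: pick the tweak that lands on the desired number, then present that assumption as the base case.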

How data fudging is often caught-

  1. Sampling – Use a random sample or holdout sample, and check whether the final answer converges to what is known to happen. The validation-sample technique is powerful for recognizing data modeling inaccuracies.
  2. Checking assumptions – For risk management, always consider conservative or worst-case scenarios first and then build up your analysis. Similarly, when checking an analysis, look for over-optimism, and scrutinize the period or history from which the growth factors and sensitivities were assumed.
  3. Missing values and central value movements – If a portion of the data is missing, check the mean as well as the median for both the reported and the overall data. You can also resample, taking random samples from the data and checking these values repeatedly to see if they hold firm (see the sketch after this list).
  4. Early warning indicators – Ask the question (loudly): if this analysis were totally wrong, what indicator would give us the first sign of it being wrong? That indicator can then be incorporated into a metric-tracking early warning system.
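
As a worked example of points 1 and 3, here is a minimal R sketch, on simulated data, of checking whether the central values of a reported subset hold firm under repeated resampling.

```r
set.seed(42)
overall  <- rnorm(10000, mean = 50, sd = 10)  # simulated "complete" data
reported <- overall[1:2000]                   # the portion actually presented

# Central values of the reported slice versus the whole
c(mean(reported), median(reported))
c(mean(overall), median(overall))

# Resample repeatedly: a genuinely random subset should sit inside these
# intervals; a cherry-picked or invented subset often will not
sims <- replicate(1000, {
  s <- sample(overall, length(reported))
  c(mean = mean(s), median = median(s))
})
apply(sims, 1, quantile, probs = c(0.025, 0.975))
```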

Note that the above are simplified accounts of numbers I have seen presented wrongly, or fudged. They are based on my own experiences, so feel free to add your share of data anecdotes.

Using these simple techniques could have helped many people in financial and other decision-making, including budgetary and even strategic areas.

As the saying goes: in God we trust; everybody else has to bring data (which we will have to check before trusting it).

A Base SAS to Java Compiler

Republished by demand: here is a nice SAS to Java compiler. It cuts away at the problems of executing legacy SAS code and of SAS training, and focuses on executing the tasks in Java, thus making them much faster.

It’s available at http://dullesopen.com/

And it’s free for personal use. And academic use.


I quote from the website "

Carolina Benefits

Converting Base SAS® to Java with Carolina provides two main benefits to enterprises:

  • Savings on license fees. Carolina costs about 70% less than SAS.
  • Performance gains. Carolina-converted code runs significantly faster than the native SAS program.

Additional Benefits

  • Greater flexibility. Java is an industry-standard environment that runs on all platforms. It is much easier to support than the legacy SAS environment it replaces.
  • Better integration. Carolina, as a Java application, supports web services through true J2EE integration.
  • Flawless automated conversion. Eliminate time-consuming, error-prone manual conversion.
  • Simpler contracts. Carolina is licensed in a simple, straightforward fashion.
  • Reduced training costs. Carolina-converted programs can be understood by analysts without training in SAS, and SAS-trained analysts don’t need to learn a new programming language."

Zazzle.com and Cafepress.com

Here is a nice new-age Web 2.0 website for creating customized merchandise. You get a share of the royalty and can create products like caps, T-shirts, and mugs. I used to have an account some years back at www.cafepress.com, and this website, www.zazzle.com, seems to do the same job, perhaps taking it up a notch.


Here is an example of a cap – remember, you can buy it by clicking on it.

 

PS – I got the tip from Sandro at http://dataminingresearch.blogspot.com/

He is offering a lot more stuff for sale.

PPS – Coming up:

A Survey Poll on

  • Online Web Advertisements (Text, Graphic, Flash) and
  • Merchandising (Branded, Third Party, Affiliated)

Interview: Dr Graham Williams

(Updated with comments from Dr Graham in the comments section.)


I have often talked about how Rattle, the graphical user interface for the R language, makes learning R and building models quite simple. Rattle’s latest version has been released and received extensive publicity, including in KDnuggets. I wrote to its creator, Dr Graham Williams, and he agreed to an extensive interview explaining data mining, its evolution, and the philosophy and logic behind open source software like R and Rattle.

Dr Graham Williams is the author of the Rattle data mining software and an Adjunct Professor at the University of Canberra and the Australian National University. Rattle is available from rattle.togaware.com.

Ajay – Could you describe your career journey? What made you enter this field, and what experiences helped shape your perspectives? What would your advice be to young professionals entering this field today?

Graham – With a PhD in Artificial Intelligence (topic: combining multiple decision trees to build ensembles) and a strong interest in practical applications, I started out in the late 1980s developing expert systems for business and government, including bank loan assessment systems and bushfire prediction.

When data mining emerged as a discipline in the early 1990s, I was involved in setting up the first data mining team in Australia within the government research organization (CSIRO). In 2004 I joined the Australian Taxation Office and provided the technical lead for the deployment of its Analytics team, overseeing the development of a data mining capability. I have been teaching data mining at the Australian National University (and elsewhere) since 1995 and continue to do so.

The business need for Data Mining and Analytics continues to grow, although courses in Data Mining are still not so common. A data miner combines good backgrounds in Computer Science and Statistics. The Computer Science is too little emphasized, but it is crucial for skills in developing repeatable procedures and good software engineering practices, which I believe to be important in Data Mining.

Data Mining is more than just using a point-and-click graphical user interface (GUI). It is an experimental endeavor where we really need to be able to follow our nose as we explore our data, and then capture the whole process in an automatically repeatable manner that can be readily communicated to others. A programming language offers this sophisticated level of communication.

Too often, I see analysts, when given a new dataset that updates last year’s data, essentially start from scratch with the data pre-processing, cleaning, and then mining, rather than beginning with last year’s captured processes and tuning them to this year’s data. The GUI generation of software often does not encourage repeatability.
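
(Note from Ajay – here is a minimal R sketch of what “capturing the process” can look like; the file name and the particular cleaning steps are hypothetical.)

```r
# Last year's data preparation, captured once as a function...
prepare <- function(df) {
  df$region <- factor(trimws(df$region))                          # tidy categories
  df$amount[is.na(df$amount)] <- median(df$amount, na.rm = TRUE)  # impute missing
  df[df$amount >= 0, ]                                            # drop bad rows
}

# ...and simply re-applied when this year's extract arrives
this_year <- prepare(read.csv("returns_2009.csv"))
```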

Ajay – What made you get involved with R? What is the advantage of using Rattle versus normal R?

Graham – I have used Clementine and SAS Enterprise Miner over many years (and IBM’s original Intelligent Miner and Thinking Machines’ Darwin, and many other tools that emerged early on with Data Mining). Commercial vendors come and go (even large ones like IBM, in terms of the products they support).

Lock-in is one problem with commercial tools. Another is that many vendors, understandably, won’t put resources into new algorithms until they are well accepted.

Because it is open source, R is robust, reliable, and provides access to the most advanced statistics. Many research statisticians publish their new algorithms in R. But what is most important is that the source code is always going to be available. Not everyone has the skill to delve into that source code, but at least we have a chance to do so. We also know that there is a team of highly qualified developers whose work is openly peer reviewed. I can monitor their coding changes, if I so wanted. This helps ensure quality and integrity.

Rolling out R to a community of data analysts, though, does present challenges. Being primarily a language for statistics, we need to learn to speak that language. That is, we need to communicate with language rather than pictures (or GUI). It is, of course, easier to draw pictures, but pictures can be limiting. I believe a written language allows us to express and communicate ideas better and more formally. But it needs to be with the philosophy that we are communicating those ideas to our fellow humans, not just writing code to be executed by the computer.

Nonetheless, GUIs are great as memory aids, for doing simple tasks, and for learning how to perform particular tasks. Rattle aims to do the standard data mining steps, but also to expose everything that is done as R commands in the log. In fact, the log is designed to be able to be run as an R script, and to teach the user the R commands.

Ajay – What are the advantages of using Rattle instead of SAS or SPSS? And what are the disadvantages?

Graham – Because it is free and open source, Rattle (and R) can be readily used in teaching data mining. In business it is, initially, useful for people who want to experiment with data mining without the sometimes quite significant up-front costs of the commercial offerings. For serious data mining, Rattle and R offer all of the data mining algorithms offered by the commercial vendors, and many more. Rattle provides a simple, tab-based user interface which is not as graphically sophisticated as SPSS Clementine or SAS Enterprise Miner.

But with just 4 button clicks you will have built your first data mining model.

The usual disadvantage quoted for R (and so Rattle) is in the handling of large datasets – SAS and SPSS can handle datasets out of memory, although they do slow down when doing so. R is memory-based, so going to a 64-bit platform is often necessary for larger datasets. A very rough rule of thumb has been that the 2–3GB limit of the common 32-bit processors can handle a dataset of up to about 50,000 rows with 100 columns (or 100,000 rows and 10 columns, etc.), depending on the algorithms you deploy. I generally recommend, as quite a powerful yet inexpensive data mining machine, one with an AMD64 processor, running the Debian GNU/Linux operating system, with as much memory as you can afford (e.g., 4GB to 32GB, although some machines today can go up to 128GB, but memory gets expensive at that end of the scale).
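
(Note from Ajay – you can sanity-check the raw footprint of such a dataset yourself; the gap between this raw size and the 2–3GB ceiling is taken up by the working copies R makes during modeling, which is an assumption about your workload rather than a fixed figure.)

```r
# Raw size of a 50,000-row by 100-column numeric dataset
print(object.size(matrix(0, nrow = 50000, ncol = 100)), units = "MB")
# Roughly 38 MB raw; copy-on-modify semantics and algorithm working
# copies multiply this several times over in practice
```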

Ajay – Rattle is free to download and use, yet it must have taken you some time to build. What are your revenue streams to support your time and effort?

Graham – Yes, Rattle is free software: free for anyone to use, free to review the code, free to extend the code, free to use for whatever purpose. I have been developing Rattle for a few years now, with a number of contributions from other users. Rattle, of course, gets its full power from R.

The R community works together to help each other, and others, for the benefit of all. Rattle and R can be the basic toolkit for knowledge workers providing analyses. I know of a number of data mining consultants around the world who are using Rattle to support their day-to-day consultancy work.

As a company, Togaware provides user support, installations of R and Rattle, runs training in using Rattle and in doing data mining. It also delivers data mining projects to clients. Togaware also provides support for incorporating Rattle (and R) into other products (e.g., as RStat for Information Builders).

Ajay – What is your vision of analytics for the future? How do you think the recession of 2008 and the slowdown in 2009 will affect the choice of software?

Graham – Watching the growth of data mining and analytics over the past 18 years, it does seem that there has been, and continues to be, a monotonically increasing interest in and demand for analytics. Analytics continues to demonstrate benefit.

The global financial crisis, as others have suggested, should lead organizations to consider alternatives to expensive software. Good quality free and open source software has been available for a while now, but the typical CTO is still more comfortable purchasing expensive software. A purchase gives some sense of (false?) security but formally provides no warranty. My philosophy has been that we should invest in our people within an organization, and treat software as a commodity that we openly contribute back to.

Imagine a world where we only use free open source software. The savings made by all would be substantial (consider OpenOffice versus MS/Office license fees paid by governments worldwide, or Rattle versus SAS Enterprise Miner annual license fees). A small part of that saving might be expended on ensuring we have staff who are capable of understanding and extending that software to suit our needs, rather than vice versa (i.e., changing our needs to suit the software). We feed our extensions back into the grid of open source software, whilst also benefiting from the contributions others are making. Some commercial vendors like to call this “communism” as part of their attempt to discredit open source, but we had better learn to share, for the good of the planet, before we lose it.

(Note from Ajay – If you are curious about R and have just 15 minutes to try it in, download Rattle from rattle.togaware.com. It has a point-and-click interface and auto-generates R code in its log. Trust me, it would be time well spent.)
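
Getting that far is three lines of R (assuming the GTK libraries that Rattle depends on are already installed):

```r
install.packages("rattle")  # one-time install from CRAN
library(rattle)
rattle()  # opens the GUI; the Log tab records everything as runnable R code
```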

Images to Data: OCR Software

Some software packages I have used to convert images into text, and even tables, are:

http://code.google.com/p/ocropus/

and http://code.google.com/p/tesseract-ocr/

Note – both are open source and funded by Google, which uses OCR to enhance search as well as for its book-scanning project. These tools are greatly helpful for, say, email marketing, or for converting images into rows and columns of text.

An additional open source imaging library is available from http://www.leptonica.com/

You may need to tweak the resolution a bit, as well as the highlighted scan area, to get good results; with that, you can convert images into text and numeric data using a simple desktop scanner.
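
If you want to drive Tesseract from R, here is a minimal sketch, assuming the tesseract binary is installed and on your PATH; the file name scan.tif is illustrative.

```r
# Tesseract writes recognized text to <output base>.txt
system("tesseract scan.tif scan")  # produces scan.txt
text <- readLines("scan.txt")
head(text)
```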

For higher-end needs, like production environments for questionnaires and responses:

The following comes from the SPSSX list (it’s a nice list, with many business problems that are also familiar from the SAS and R lists):

The software that SPSS recommends is ReadSoft (http://www.readsoft.com/). Additionally, SPSS has a couple of complementary products, mrPaper (http://www.spss.com/mrpaper/) and mrScan (http://www.spss.com/mrScan/).

And from Dr Steven Lars on the same list:

On top of the high 90%+ accuracy of modern scanners’ OCR technology, Remark’s OMR (optical mark recognition) algorithms produce 99.9%+ accuracy for detecting full or empty closed circles and squares. It worked very well.

Scanning has been accomplished using upper-end, hobby-grade scanners with automatic form-feed options, driven by Windows software. Though these machines were purchased through university purchasing, all could have been bought at Best Buy (a discount technology store), a comparable store, or from internet discount sources.

You can store data for research purposes, access the data using SPSS or Excel, and respond to counselor questions within a week of receiving the raw, paper surveys.

The current preference is to use web-based information gathering, developed through university information technology resources or on www.Zoomerang.com (or even www.surveymonkey.com).

The biggest challenges were: you have to be careful in the production of the to-be-scanned forms. Our forms had to be printed on the same copy machine from a set master, or printed on the same laser printer, to ensure accuracy. Poor-quality copies and printing yield inaccurate scanning. Survey color also has to be managed carefully, as some colors are opaque to some optical scanners.

A link to the publishers of Remark OMR:
http://www.gravic.com/remark/officeomr/index.html