Interview: Dr Graham Williams

(Updated with comments from Dr Graham in the comments section.)


I have often talked about how Rattle, the graphical user interface for the R language, makes learning R and building models quite simple. Rattle's latest version has been released and got extensive publicity, including in KD Nuggets. I wrote to its creator, Dr Graham, and he agreed to an extensive interview explaining data mining, its evolution, and the philosophy and logic behind open source languages like R as well as Rattle.

Dr Graham Williams is the author of the Rattle data mining software and Adjunct Professor, University of Canberra and Australian National University.  Rattle is available from rattle.togaware.com.

Ajay – Could you describe your career journey? What made you enter this field, and what experiences helped shape your perspectives? What would your advice be to young professionals entering this field today?

Graham – With a PhD in Artificial Intelligence (topic: combining multiple decision trees to build ensembles) and a strong interest in practical applications, I started out in the late 1980s developing expert systems for business and government, including bank loan assessment systems and bush fire prediction.

When data mining emerged as a discipline in the early 1990s I was involved in setting up the first data mining team in Australia with the government research organization (CSIRO). In 2004 I joined the Australian Taxation Office and provide the technical lead for the deployment of its Analytics team, overseeing the development of a data mining capability. I have been teaching data mining at the Australian National University (and elsewhere) since 1995 and continue to do so.

The business need for Data Mining and Analytics continues to grow, although courses in Data Mining are still not so common. A data miner combines good backgrounds in Computer Science and Statistics. The Computer Science is too little emphasized, but is crucial for skills in developing repeatable procedures and good software engineering practices, which I believe to be important in Data Mining.

Data Mining is more than just using a point and click graphical user interface (GUI). It is an experimental endeavor where we really need to be able to follow our nose as we explore through our data, and then capture the whole process in an automatically repeatable manner that can be readily communicated to others. A programming language offers this sophisticated level of communications.

Too often, I see analysts, when given a new dataset that updates last year's data, essentially start from scratch with the data pre-processing, cleaning, and then mining, rather than beginning with last year's captured processes and tuning to this year's data. The GUI generation of software often does not encourage repeatability.

Ajay – What made you get involved with R? What is the advantage of using Rattle versus normal R?

Graham – I have used Clementine and SAS Enterprise Miner over many years (and IBM's original Intelligent Miner and Thinking Machines' Darwin, and many other tools that emerged early on with Data Mining). Commercial vendors come and go (even large ones like IBM, in terms of the products they support).

Lock-in is one problem with commercial tools. Another is that many vendors, understandably, won't put resources into new algorithms until they are well accepted.

Because it is open source, R is robust, reliable, and provides access to the most advanced statistics. Many research Statisticians publish their new algorithms in R. But what is most important is that the source code is always going to be available. Not everyone has the skill to delve into that source code, but at least we have a chance to do so. We also know that there is a team of highly qualified developers whose work is openly peer reviewed. I can monitor their coding changes, if I so wanted. This helps ensure quality and integrity.

Rolling out R to a community of data analysts, though, does present challenges. Being primarily a language for statistics, we need to learn to speak that language. That is, we need to communicate with language rather than pictures (or GUI). It is, of course, easier to draw pictures, but pictures can be limiting. I believe a written language allows us to express and communicate ideas better and more formally. But it needs to be with the philosophy that we are communicating those ideas to our fellow humans, not just writing code to be executed by the computer.

Nonetheless, GUIs are great as memory aids, for doing simple tasks, and for learning how to perform particular tasks. Rattle aims to do the standard data mining steps, but also to expose everything that is done as R commands in the log. In fact, the log is designed so that it can be run as an R script, and to teach the user the R commands.

Ajay – What are the advantages of using Rattle instead of SAS or SPSS? What are the disadvantages of using Rattle instead of SAS or SPSS?

Graham – Because it is free and open source, Rattle (and R) can be readily used in teaching data mining. In business it is, initially, useful for people who want to experiment with data mining without the sometimes quite significant up-front costs of the commercial offerings. For serious data mining, Rattle and R offer all of the data mining algorithms offered by the commercial vendors, but also many more. Rattle provides a simple, tab-based user interface which is not as graphically sophisticated as Clementine in SPSS and SAS Enterprise Miner.

But with just 4 button clicks you will have built your first data mining model.

The usual disadvantage quoted for R (and so Rattle) is in the handling of large datasets – SAS and SPSS can handle datasets out of memory although they do slow down when doing so. R is memory based, so going to a 64bit platform is often necessary for the larger datasets. A very rough rule of thumb has been that the 2-3GB limit of the common 32bit processors can handle a dataset of up to about 50,000 rows with 100 columns (or 100,000 rows and 10 columns, etc), depending on the algorithms you deploy. I generally recommend, as quite a powerful yet inexpensive data mining machine, one running on an AMD64 processor, running the Debian GNU/Linux operating system, with as much memory as you can afford (e.g., 4GB to 32GB, although some machines today can go up to 128 GB, but memory gets expensive at that end of the scale).
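To put the rule of thumb in perspective, here is a minimal sketch (an illustration added here, not something from the interview) of how much memory the raw numbers alone would occupy. It assumes every cell is a numeric value stored as an 8-byte double; the function name raw_size_mb is hypothetical. The raw table comes out far smaller than the 2-3GB address limit, which suggests that most of the headroom is consumed by working copies and by the modelling algorithms themselves rather than by the data.

```python
# Assumption: every cell is numeric and stored as an 8-byte double.
BYTES_PER_DOUBLE = 8


def raw_size_mb(rows: int, cols: int) -> float:
    """Approximate in-memory size of a purely numeric table, in megabytes."""
    return rows * cols * BYTES_PER_DOUBLE / (1024 ** 2)


# The two dataset shapes mentioned in the rule of thumb above.
for rows, cols in [(50_000, 100), (100_000, 10)]:
    size = raw_size_mb(rows, cols)
    print(f"{rows:>7} rows x {cols:>3} columns: about {size:.1f} MB of raw data")
```

Text columns, factors and intermediate copies made during modelling are not counted, so actual usage can be many times this figure.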

Ajay – Rattle is free to download and use, yet it must have taken you some time to build it. What are your revenue streams to support your time and efforts?

Graham – Yes, Rattle is free software: free for anyone to use, free to review the code, free to extend the code, free to use it for whatever purpose. I have been developing Rattle for a few years now, with a number of contributions from other users. Rattle, of course, gets its full power from R. The R community works together to help each other, and others, for the benefit of all. Rattle and R can be the basic toolkit for knowledge workers providing analyses. I know of a number of data mining consultants around the world who are using Rattle to support their day-to-day consultancy work.

As a company, Togaware provides user support and installations of R and Rattle, and runs training in using Rattle and in doing data mining. It also delivers data mining projects to clients. Togaware also provides support for incorporating Rattle (and R) into other products (e.g., as RStat for Information Builders).

Ajay – What is your vision of analytics for the future? How do you think the recession of 2008 and the slowdown in 2009 will affect the choice of software?

Graham – Watching the growth of data mining and analytics over the past 18 years, it does seem that there has been and continues to be a monotonically increasing interest and demand for Analytics. Analytics continues to demonstrate benefit.

The global financial crisis, as others have suggested, should lead organizations to consider alternatives to expensive software. Good quality free and open source software has been available for a while now, but the typical CTO is still more comfortable purchasing expensive software. A purchase gives some sense of (false?) security but formally provides no warranty. My philosophy has been that we should invest in our people, within an organization, and treat software as a commodity that we openly contribute back into.

Imagine a world where we only use free open source software. The savings made by all will be substantial (consider OpenOffice versus MS/Office license fees paid by governments world wide, or Rattle versus SAS Enterprise Miner annual license fees). A small part of that saving might be expended on ensuring we have staff who are capable of understanding and extending that software to suit our needs, rather than vice versa (i.e., changing our needs to suit the software). We feed our extensions back into the grid of open source software, whilst also benefiting from contributions others are making. Some commercial vendors like to call this “communism” as part of their attempt to discredit open source, but we had better learn to share, for the good of the planet, before we lose it.

(Note from Ajay – If you are curious to try R, and have just 15 minutes to try it in, download Rattle from rattle.togaware.com. It has a point and click interface and auto-generates R code in its log. Trust me, it would be time well spent.)

Images to Data: OCR Software

Some software I have used to convert images into text and even tables:

 http://code.google.com/p/ocropus/

and http://code.google.com/p/tesseract-ocr/

Note – Both are open source and funded by Google, which uses OCR to enhance search as well as for its book scanning project. These packages are greatly helpful for, say, email marketing or converting images into rows and columns of text.

An additional open source imaging package is available from http://www.leptonica.com/

You may need to tweak the resolution and the highlighted scan area a bit in order to get good results, but you can then convert images into text and numeric data with a simple desktop scanner.

For higher end needs, like production environments for questionnaires and responses:

The following come from the SPSS X List (it's a nice list with many business problems that are also familiar in the SAS and R lists).

The software that SPSS recommends is ReadSoft (http://www.readsoft.com/). Additionally, SPSS have a couple of complementary products, mrPaper (http://www.spss.com/mrpaper/) and mrScan (http://www.spss.com/mrScan/).

and from Dr Steven Lars in the same list

Beyond the 90%+ accuracy of OCR technology in modern scanning, Remark OMR (optical mark recognition) algorithms produce 99.9%+ accuracy in detecting filled or empty closed circles and squares. It worked very well.

Scanning has been accomplished using upper end, hobby grade scanners with automatic form-feed options driven by Windows software. Though these machines were purchased through university purchasing, all could have been purchased at Best Buy (a discount technology store), a comparable store, or at internet discount sources.

You can store data for research purposes, access the data using SPSS or Excel, and respond to counselor questions within a week of receiving the raw, paper surveys.

The current preference is to use web-based information gathering developed through university information technology resources or developed on www.Zoomerang.com (or even www.surveymonkey.com).

The biggest challenges were:
You have to be careful in the production of the to-be-scanned forms. Our forms had to be printed on the same copy machine from a set master or printed on the same laser printer to ensure accuracy. Poor quality copies and printing yield inaccurate scanning. Survey color also has to be managed carefully, as some colors are opaque to some optical scanners.
 

A link to the publishers of Remark OMR:
http://www.gravic.com/remark/officeomr/index.html?gclid=CJj01MTSiZgCFRxNagodd0_IDQ

The SAS-L Rookie of the Year

Well I have been told, I am on the SAS-L rookie of the year list at http://www.listserv.uga.edu/cgi-bin/wa?A1=ind0901b&L=sas-l#29.

With 351 posts in 2008 and 0 in 2007, you can certainly say I have been an active rocky, I mean rookie on the list.

Some of the things I did were:

  • Share experiences with SAS language code, including automation
  • Share and ask about non-SAS software areas like Google Docs, cloud computing and WPS comparisons
  • Provoke, by design and mostly by accident, discussion on R, SAS software pricing, the relationship and dependence between sasCommunity.org and the SAS Institute, and diversity and international issues on the list

I believe SAS is good software and the SAS Institute has been a pioneer, and it needs to listen to feedback from its retail customers just as much as it needs to make money.

  1. A more transparent way of announcing strategic intent on where they are going to concentrate research and
  2. maybe a more nuanced public relations stance on rival software, with
  3. a readiness to once again experiment if not embrace open source contributions (particularly by using some interface to R code, as well as R datasets) could lead to great stuff from SAS Institute again.

P.S. I don't expect to win though. I am bad at elections.

OT: The Little Child of the Holy Land

A child, a small child,

Roams around his yard, for a little while,

Till he hears the wail of the siren,

Too late now, he’s killed by shards of iron.

 

 

Ten rockets launched, but only one Jewish kid is dead.

His folks vow vengeance and war lies ahead.

 

A child, a small child,

Roams around his neighborhood, for a little while,

When with a whoosh his world explodes,

He wakes up in hospital, with a melted nose.

 

His parents dead, it was collateral damage,

They were in the wrong neighborhood,

so the story goes.

 

One more Arab kid, scarred for life,

One more statistic added to the score.

 

His career options on growing up are just two-

Suicide Bomber or fight  with a rifle too.

 

A child, one more  little child,

Looks up to the blue skies,

From where rockets and tank shells come.

 

God, he cries out, or Gods (whoever is there)

Before the sons of Abraham

and the sons of Arabs,

finish each other,

 

I am the son of Adam and Eve,

their common mother.

 

Take me please,

Far away from this place.

 

I want to grow up in some happy place,

where no one thinks I am a Jew or an Arab

Just a small kid, who needs some love.

 

((Image from http://www.stolenchildhood.net/page/3/))

 

Ajay – I don't know how to analyze non-quantitative things like politics. As Gandhi once said, "An eye for an eye will lead to a world of the blind."

Kids deserve to play.

Top Seven Reasons: Why Outsourcing is Bad for India

Sometimes too much of a good thing can be a bad thing. Here are some reasons why outsourcing is bad for India.

1) Micro Economic Benefit is Overstated for working class people – An average Indian worker in outsourcing would earn INR 20,000-30,000 per month for working a 5-day week of 8-hour shifts (assumed). Thus his wage is no more than 30000 / (22 × 8 × 45) ≈ 3.8 dollars per hour, taking 22 working days a month, 8 hours a day and about 45 rupees to the dollar. That means roughly 2.5 to 3.8 dollars per hour. Most people fresh from college in their first job in a KPO or BPO start at 15,000 rupees. That means around 1.9 dollars per hour.

2) Social Impact – The impact on social life can be seen by going to any Gurgaon pub or discotheque at around 12-2pm, where you would see raunchy scenes as young people aged 20-25 relax after working for 8 hours interacting with mostly Western people. The reality of how life is in the West is distorted by their perceptions of Hollywood. It's like thinking Indian society is like Bollywood movies. Families are torn over accepting the immediate cash that the young son or daughter brings, and the rising number of teenage pregnancies in ITES centers is a statistic that is ignored, as abortion is legal in India. Many Indian companies have fired CEOs for having office affairs, yet these have been hidden from the press. Many a time young couples on the same night shifts have experimented with live-in relationships, as they feel that is "okay", without knowing the tremendous impact the breakdown of family life has on people. The working hours and stresses have also affected divorce rates in India, which are now shooting up. Alas, these inconvenient truths are not tabulated by the same companies that are creating a database of all ITES workers to ensure they can track people from company to company.

3) Macro Economic Dependency – Diverting most of the young people of an economy to low end repetitive jobs is a brain drain, or immigration without actual transport thanks to technology, as these people would otherwise be working to develop India's economy. The linking of globalized economies is not always a good thing, as it sometimes leads to a small crisis infecting economies globally, multiplied by their dependencies.

4) Ownership – Most Indian outsourcers are owned by private equity funds, which coincidentally sit on the boards of some of their biggest clients. Western private equity funds do a regulatory arbitrage of labor conditions and corporate governance, as Indians are relatively new to the concepts of work-life balance, overtime pay, or societal ethics. Lack of protection for whistle blowers has been evident in the murders of engineers (involving politicians) in construction, and there are very few whistle-blowing laws in India. Thus economically it is a transfer of money from American middle class workers, in terms of job loss, to middle class Indian workers at 40% of the wage paid earlier, but the majority share goes back to the top end of American society, which invests in these private equity funds.

5) Labor conditions – If the NYSE listed companies of Indian ITES were asked by regulators in the United States how many extra hours workers have put in beyond 40 hours a week, and how much overtime money has been paid to them for that, the answer would suffice to tell you that Indians are intellectual chimney sweeps in terms of pay as well as health care benefits. Joining a union is illegal in most Indian companies. Work-life balance is an alien concept, as heady promotions lure people to work even harder. Culturally, Indians find it difficult to say no to superiors, and working extra hard is considered better than trying for a balanced life.

6) Illegal Contracts – Renowned Indian KPOs and BPOs make people sign an employment bond, especially for an overseas trip. The rationale is that people should not quit immediately after returning from an on-site client visit. Yet for a two or even a three week trip, people are forced to sign employment bonds of one to two years, with heavy financial penalties for leaving the job earlier. These would be illegal in the country in which the Western client is situated, yet they are enforced compulsorily. Some companies even force people to sign a bond saying they will not quit for the first two years, thus creating a unique bonded labor system for white collar workers. Some companies sign anti-poaching contracts, promising not to hire people from the other company.

7) Health Impact – I have personally noted many promising people sidelined by bad backs in their late twenties and early thirties because of constant sitting in ergonomically badly designed chairs and working more than 8-9 hours a day. Yet chronic organizational overwork of employees is recorded as a productivity increase without noting the additional costs, as India lacks adequate medical infrastructure for all its people. I have seen BPO workers sip gin and water between double shifts, and gorge on fatty food at roadside dhabas to keep their energy levels up. Who pays for their health costs? Mostly it is the family.
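The hourly wage estimate in point 1 can be reproduced with a minimal sketch. It reads the 22 in the post's formula as working days per month, the 8 as hours per day, and the 45 as an approximate rupee-to-dollar exchange rate; that last reading is an assumption, and the function name hourly_wage_usd is hypothetical.

```python
WORKING_DAYS_PER_MONTH = 22  # taken from the formula in point 1
HOURS_PER_DAY = 8            # the 8-hour shift assumed in point 1
INR_PER_USD = 45             # assumption: the 45 in the formula is the exchange rate


def hourly_wage_usd(monthly_salary_inr: float) -> float:
    """Convert a monthly salary in rupees into an hourly wage in US dollars."""
    hours_per_month = WORKING_DAYS_PER_MONTH * HOURS_PER_DAY
    return monthly_salary_inr / hours_per_month / INR_PER_USD


# Starting salary and the low and high ends of the range quoted in point 1.
for salary in (15_000, 20_000, 30_000):
    print(f"INR {salary:>6} per month is about USD {hourly_wage_usd(salary):.2f} per hour")
```

Changing INR_PER_USD shows how sensitive the dollar figure is to the exchange rate chosen.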

 

Not all outsourcing is bad. Exposure to cutting edge technology and research is one area where offshoring is really good. Not all of it is good either.

The way out is a balanced approach in which all companies are forced to adopt the same labor conditions at all their vendors as they have for their own employees (maybe with different wages adjusted for purchasing power parity).

 

An additional point is financially creative outsourcing. Many times companies shift headquarters to, say, a tax-free location like Dubai, then offshore their work to India at 40% of the cost and 120% of the hours worked per worker, while keeping a small staff in the US.

Exploiting tax incentives for this kind of offshoring is plain and simple cheating.

Consulting companies may publish reports on how good offshoring is, but there is no such thing as a free lunch. There are hidden costs involved in this industry, just like anything else, and it is for regulators and governments to ensure fairness and openness.

 

The Author has worked with some of India’s top offshoring companies. These are his personal perspectives.

Top ten RRReasons R is bad for you?

 


 

R is a programming language, based at www.r-project.org

R is bad for you because:

1) It is slower with bigger datasets than the SPSS language and the SAS language. If you use bigger datasets, then you should either consider more hardware, or try and wait for some of the ODBC connect packages.

2) It needs more time to learn than the SAS language. Much more time to learn how to do much more.

3) R programmers are paid less than SAS programmers. They prefer it that way. It equates the satisfaction of creating a package in development with a worldwide community with the satisfaction of using a package and earning much more money per hour.

4) It forces you to learn the exact details of what you are doing due to its object oriented structure. Thus you either get no answer or get an exact answer. Your customer pays you by the hour not by the correct answers.

5) You cannot push a couple of buttons or refer to a list of the top ten most commonly used commands to finish the project.

6) It is free. And open for all. It is socialism expressed in code. Some of the packages are built by university professors. It is free. Free is bad. Who pays for the mortgage of the software programmers if all software were free? Who pays for the Friday picnics? Who pays for the Good Night cruises?

7) It is free. Your organization will not commend you for saving them money; they will question why you did not recommend this before. And why did you approve all those packages that expire in 2011? R is fReeeeee. Customers feel good while spending money. The more software budgets you approve, the higher your salary is. R thReatens all that.

8) It is impossible to install a package you do not need or want. There is no one calling you on the phone to consider one more package or solution. R can make you lonely.

9) R mostly uses the command line. The command line is from the Seventies. Or the Eighties. The GUIs Rcmdr and Rattle are there but still…..

10) R forces you to learn new stuff by the month. You prefer to only earn by the month. Till the day your job got offshored…

Written by an R user in the English language

(which fortunately was not copyrighted, otherwise we would be paying Britain for each word)

The above post was reprinted by request.