SAS with the GUI Enterprise Guide (Updated)

Here is a slideshow I made using Google Docs (which is good, except that the PDF version is much worse than SlideShare's). It is on the latest R GUI, called AwkWard. It is based on the webpage here:

http://docs.google.com/View?id=dcvss358_1015frg4k8gj

In my last post on WPS, R and SAS I had briefly shown a screenshot of SAS Enterprise Guide, with a single comment on how it could do with an upgrade to its GUI. Well, it seems that the upgrade has been available since March 2009, but probably not applied, since no one noticed even once in the Fall semester here in Tennessee (including people from the University who read this blog 🙂 ). Actually the upgrade was made to local machines, but there is also a cloud version, which did not get the upgrade; there we can use a Citrix server to run analytics right in the browser.

Here is an updated overview of SAS Enterprise Guide 4.2.

SAS Enterprise Guide is a Windows interface to SAS that allows for SAS programming *and* point-and-click tasks for reporting, graphs, analytics, and data filter/query/manipulation. SAS Enterprise Guide can work with SAS on your local machine, and it can connect to SAS servers on Windows, Unix/Linux, and the mainframe.

It doesn’t have decision tree support; that’s provided by a more specialized application for data mining called SAS Enterprise Miner.

And you can easily extend SAS Enterprise Guide with your own tasks. See http://support.sas.com/eguide. You do not need SAS/Toolkit. You can use off-the-shelf development tools for Microsoft .NET, including the freely available express editions of Microsoft Visual C# or Visual Basic .NET.

Credit to Chris from SAS for forwarding me the correct document and answers.

PS-
It would be great if the SAS User Conference archives used SlideShare or Google Docs (PDFs are so 90s) for displaying the documents at sascommunity.org (which took the Twitter id @sascommunity after two months of requests, threats and friendly pleas from me, only to not use it actively except for one Tip of the Day tweet, sigh).

Weak Security in Internet Databases for Statisticians

A year ago, while working as a virtual research assistant to Dr Vincent Granville (of Analyticbridge.com, who signed my recommendation form for the University of Tennessee), I helped download almost 22,000 records of almost all the statisticians and economists of the world. This included databases like the American Statistical Association and the Royal Society (ASA, ACME, RS etc.).

After joining the University of Tennessee, I sent a sample of the code and database by email to two professors (one a fellow of the ASA, the other an expert in internet protocols) to turn it into an academic paper, except they did not know any journal or professor who knew about data scraping 😦.

I am publishing this now in the hope that the gap has been plugged before someone gets that kind of database and exploits it for spamming or commercial misuse.

The weak link was that, once you were in the database with a valid login and password, you could use automated HTML capture to do a lot of data scraping using the iMacros Firefox plugin. Since the logins were done on Christmas Eve and during the year end, this also exploited the fact that admins were likely to overlook their analytics logs (if they had software like Clicky, or were preserving logs at all).

Here is the code that was used for scraping the whole database for the ASA (note: the scraped data was not used by me; it was sent to Dr Granville as part of an academic research project).

See complete code here- http://docs.google.com/View?id=dcvss358_335dg2xmdcp

1) Use the Firefox browser (or download it from http://www.mozilla.com/en-US/firefox/ )

2) Install  IMacros from https://addons.mozilla.org/en-US/firefox/addon/3863

3) Use the following code: paste it into a Notepad file and save it as "macro1.iim".

VERSION BUILD=6111213 RECORDER=FX

Note: the ' prefix denotes commented-out code.

'AUTOMATED ENTRY INTO WEBSITE IN CORRECT POSITION
TAB T=1
'URL GOTO=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
'TAG POS=1:TEXT FORM=NAME:frmLogin ATTR=NAME:txtUser CONTENT=USERNAME
'SET !ENCRYPTION NO
'TAG POS=1:PASSWORD FORM=NAME:frmLogin ATTR=NAME:txtPassword CONTENT=USERPASSWORD
'TAG POS=1:SUBMIT FORM=NAME:frmLogin ATTR=NAME:btnSubmit&&VALUE:Login
'TAG POS=1 ATTR=ID:el34

'ENTER FORM INPUTS
'TAG POS=1 FORM=NAME:frmSearch ATTR=NAME:txtState CONTENT=%CA
'TAG POS=1:TEXT FORM=NAME:frmSearch ATTR=NAME:txtName CONTENT=b
'TAG POS=1:SUBMIT FORM=NAME:frmSearch ATTR=NAME:btnSubmit&&VALUE:Submit
'END FORM INPUTS

SET !ERRORIGNORE YES
SET !EXTRACT_TEST_POPUP NO
SET !LOOP 1
SET !ERRORIGNORE YES
SET !EXTRACT_TEST_POPUP NO
TAG POS=1 ATTR=TXT:Name
TAG POS=R{{!LOOP}} ATTR=HREF:* EXTRACT=HREF
SET !VAR1 {{!EXTRACT}}
'PROMPT {{!EXTRACT}}
URL GOTO={{!VAR1}}
TAG POS=1 ATTR=TXT:Name
TAG POS=R1 ATTR=TXT:* EXTRACT=TXT
TAG POS=1 ATTR=TXT:Email
TAG POS=R1 ATTR=TXT:* EXTRACT=TXT
'PROMPT {{!EXTRACT}}

BACK
SAVEAS FOLDER=* FILE=*

4) Run the code after logging in and entering inputs for name (use a single-letter wildcard, say "a") and state from the drop-down.

5) Click Submit to get the number of records.

6) Click the iOpus iMacros button next to the address bar in Firefox and load the macro file above.

7) Run the macro (click the Loop button, looping from 1 to X, where X is the number of records returned in step 5).

Repeat steps 4 to 7 until a single state (the group-by variable here) is complete.

8) Go to C:\Documents and Settings\admin\My Documents\iMacros\Downloads (check the exact path under iMacros settings and options in your installation).

9) Rename the file "index" to "state.csv".

10) Open the CSV file.

11) Use the following Office 2003 macro to clean the file:

Sub Macro1()
' Macro1 Macro
' Macro recorded 12/22/2008 by ajay
' Keyboard Shortcut: Ctrl+q

    Cells.Select
    Selection.Replace What:="#NEWLINE#", Replacement:="", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=False, SearchFormat:=False, _
        ReplaceFormat:=False
    Columns("B:B").Select
    Selection.TextToColumns Destination:=Range("B1"), DataType:=xlDelimited, _
        TextQualifier:=xlDoubleQuote, ConsecutiveDelimiter:=True, Tab:=True, _
        Semicolon:=False, Comma:=False, Space:=False, Other:=False, _
        FieldInfo:=Array(Array(1, 9), Array(2, 1)), TrailingMinusNumbers:=True
    Columns("C:C").Select
    Selection.TextToColumns Destination:=Range("C1"), DataType:=xlDelimited, _
        TextQualifier:=xlDoubleQuote, ConsecutiveDelimiter:=True, Tab:=True, _
        Semicolon:=False, Comma:=False, Space:=False, Other:=False, _
        FieldInfo:=Array(Array(1, 9), Array(2, 1)), TrailingMinusNumbers:=True
    Columns("B:B").ColumnWidth = 23.71
    Columns("A:A").EntireColumn.AutoFit
    ActiveWindow.SmallScroll Down:=9
    ActiveWorkbook.Save
End Sub

 

12) If you have Office 2007, use the Record Macro feature to create your own macro in your Personal Macro Workbook: replace all #NEWLINE# with a space (using Ctrl+H), then use Text to Columns on columns 2 and 3 (type Delimited; check "Treat consecutive delimiters as one"; and do not import the first sub-column, by selecting that column and choosing "Do not import").
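If Excel is not handy, the same cleanup can be sketched in Python. This is a rough equivalent, not the author's method; it assumes, as the Excel macro does, that cells may contain #NEWLINE# markers and that columns 2 and 3 hold tab-separated sub-fields of which only the second is wanted:

```python
# Rough Python equivalent of the Excel cleanup (an assumption-laden sketch,
# not the original workflow).
def clean_rows(rows):
    cleaned = []
    for row in rows:
        # Step 12's Ctrl+H: replace every #NEWLINE# marker with a space.
        row = [cell.replace("#NEWLINE#", " ") for cell in row]
        # Text to Columns on columns 2 and 3: consecutive tabs count as one
        # delimiter, and the first sub-field is discarded (mirroring the
        # macro's FieldInfo of Array(Array(1, 9), Array(2, 1))).
        for i in (1, 2):
            if i < len(row) and "\t" in row[i]:
                parts = [p for p in row[i].split("\t") if p]
                row[i] = parts[1] if len(parts) > 1 else parts[0]
        cleaned.append(row)
    return cleaned
```

The row layout above is hypothetical; adjust the column indexes to whatever your iMacros extract actually produces.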

13) To append many files into one file, use the following R commands.

 

Download R from www.r-project.org

 

> setwd("C:\\Documents and Settings\\admin\\My Documents\\iMacros\\Downloads")

Note: this is the same folder as in step 8 above.

> list.files(path = ".", pattern = NULL, all.files = FALSE, full.names = FALSE,
+     recursive = FALSE, ignore.case = FALSE)

 

The R output is something like this:

> list.files(path = ".", pattern = NULL, all.files = FALSE, full.names = FALSE,
+     recursive = FALSE, ignore.case = FALSE)
 [1] "Automation Robot – Documents – Office Live Workspace" "Book1.xls"
 [3] "cala.csv"   "calb.csv"
 [5] "calc.csv"   "cald.csv"
 [7] "cale.csv"   "calf.csv"
 [9] "calg.csv"   "calh.csv"
[11] "cali.csv"   "calj.csv"
[13] "calk.csv"   "call.csv"
[15] "calm.csv"   "caln.csv"
[17] "calo.csv"   "calp.csv"
[19] "calq.csv"   "calr.csv"
[21] "cals.csv"   "calt.csv"
[23] "calu.csv"   "calv.csv"
[25] "calw.csv"   "calx.csv"
[27] "caly.csv"   "calz.csv"
[29] "cola.csv"   "colac.csv"
[31] "colad.csv"  "colae.csv"
[33] "colaf.csv"  "colag.csv"
[35] "coloa.csv"  "colob.csv"
[37] "index"      "login"
> file.append("coloa.csv", "colob.csv")
[1] TRUE
> file.append("coloa.csv", "colac.csv")
[1] TRUE
> file.append("coloa.csv", "colad.csv")
[1] TRUE
> file.append("coloa.csv", "colae.csv")
[1] TRUE
> file.append("coloa.csv", "colaf.csv")
[1] TRUE
> file.append("coloa.csv", "colag.csv")
[1] TRUE
> file.append("cala.csv", "calb.csv")
[1] TRUE
> file.append("cala.csv", "calc.csv")
[1] TRUE
> file.append("cala.csv", "cald.csv")
[1] TRUE
> file.append("cala.csv", "cale.csv")
[1] TRUE
> file.append("cala.csv", "calf.csv")
[1] TRUE
> file.append("cala.csv", "calg.csv")
[1] TRUE
> file.append("cala.csv", "calh.csv")
[1] TRUE
> file.append("cala.csv", "cali.csv")
[1] TRUE
> file.append("cala.csv", "calj.csv")
[1] TRUE
> file.append("cala.csv", "calk.csv")
[1] TRUE
> file.append("cala.csv", "call.csv")
[1] TRUE
> file.append("cala.csv", "calm.csv")
[1] TRUE
> file.append("cala.csv", "caln.csv")
[1] TRUE
> file.append("cala.csv", "calo.csv")
[1] TRUE
> file.append("cala.csv", "calp.csv")
[1] TRUE
> file.append("cala.csv", "calq.csv")
[1] TRUE
> file.append("cala.csv", "calr.csv")
[1] TRUE
> file.append("cala.csv", "cals.csv")
[1] TRUE
> file.append("cala.csv", "calt.csv")
[1] TRUE
> file.append("cala.csv", "calu.csv")
[1] TRUE
> file.append("cala.csv", "calv.csv")
[1] TRUE
> file.append("cala.csv", "calw.csv")
[1] TRUE
> file.append("cala.csv", "calx.csv")
[1] TRUE
> file.append("cala.csv", "caly.csv")
[1] TRUE
> file.append("cala.csv", "calz.csv")
[1] TRUE

REVISED MACRO (FASTER ACTUAL EXECUTION TIME)

 

This uses multiple tabs (TAB T=1 and TAB T=2) to switch between tabs. Thus you can search for a name in Tab 1, while Tab 2 holds the details of the table components (here Name and Email, positioned relatively).

 

Execution of the loop is via the Loop button in iMacros.

 

 

VERSION BUILD=6111213 RECORDER=FX
TAB T=1
SET !LOOP 1
' sets the initial value of the loop to 1
SET !ERRORIGNORE YES
' errors are ignored (e.g. in cases when Email is not present) so the rest of the code resumes
SET !EXTRACT_TEST_POPUP NO
' popups are disabled; they are useful while creating the code, but reduce execution speed
TAG POS=1 ATTR=TXT:Name
TAG POS=R{{!LOOP}} ATTR=HREF:* EXTRACT=HREF
' the extracted value is the link (HREF) at relative row R{{!LOOP}}, referenced from the text "Name"
SET !VAR1 {{!EXTRACT}}
' passes the extracted value to the variable !VAR1
TAB T=2
' opens a new tab in the same Firefox window
URL GOTO={{!VAR1}}
' goes to the new URL (the link for the table entry, referenced by its name)
TAG POS=1 ATTR=TXT:Name
TAG POS=R1 ATTR=TXT:* EXTRACT=TXT
' extracts Name
TAG POS=1 ATTR=TXT:Email
TAG POS=R1 ATTR=TXT:* EXTRACT=TXT
' extracts Email
'ONDIALOG POS=1 BUTTON=OK CONTENT=
' commented out; used when Firefox gives a message to resubmit the data
TAB T=1
' back to Tab 1, where the search form inputs are
'BACK
' commented out; instead of going back in the same tab, we move across tabs to avoid submitting the search again and again
SAVEAS FOLDER=* FILE=*
' downloads the data to the default folder in the default format
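The TAG extraction idea (anchor on the label text "Name" or "Email", then capture the value positioned next to it) can be sketched outside iMacros too. A minimal Python version using only the standard library; the page layout assumed in the example is purely hypothetical:

```python
from html.parser import HTMLParser

# Sketch of label-anchored extraction, the same idea the TAG commands use:
# find a label ("Name", "Email"), then capture the next piece of text.
class LabelValueParser(HTMLParser):
    def __init__(self, labels):
        super().__init__()
        self.labels = set(labels)
        self.pending = None   # label whose value is expected next
        self.values = {}

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.pending is not None:
            # The text following a label is taken as that label's value.
            self.values[self.pending] = text
            self.pending = None
        elif text.rstrip(":") in self.labels:
            self.pending = text.rstrip(":")

def scrape_record(html):
    parser = LabelValueParser(["Name", "Email"])
    parser.feed(html)
    return parser.values
```

This is only a sketch of the relative-positioning idea; a real page would need its actual markup inspected first, just as the iMacros recorder does.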

If you are interested in knowing more, you can see the Google Docs document:


http://docs.google.com/View?id=dcvss358_335dg2xmdcp

 

 

The declining market for Telecommunication Churn Models


Users of predictive analytics within the telecom sector can look into an interesting side effect of the iPhone and AT&T agreement. With Google also jumping into the market with its Droid, the new norm in telecom agreements is locked-in contracts for consumers. While this is permitted by telecom regulators as fair to competition, it also means that there is very little churn within these locked-in contracts. This leads to further savings for the telecom provider, allowing higher profits and even sharing of those profits through price decreases,

and thus the traditional bugbear of telecom analytics, churn modeling, is slowly losing importance to plain-vanilla reporting or dashboard-like data mining solutions. Lower churn also means lower spending on analytics software to predict churn.

As competition within the 3G mobile market ramps up due to Google's entry and exclusive licensing with partners, the trend toward reduced churn from locked-in customers will likely strengthen. Even existing mobile providers can offer discounts to lock in customers against switching (especially in mobile markets like India, where I have personally interacted with large players like Bharti, and China, which has an even bigger mobile market).

Ergo, a lower need to buy software that predicts churn.

See the image below from Teradata's churn model.

Twitter Cloud and a note on Cloud Computing

That's what I use Twitter for. If you have a Twitter account, you can follow me here:

http://twitter.com/decisionstats

A couple of weeks ago I accidentally deleted many followers using a Twitter app called Refollow; I was trying to clean up the people I follow and checked the wrong tick box.

So please, if you feel I unfollowed you, it was a mistake. Seriously.



On cloud computing and Google: rumours ( 🙂 ) are emerging that Google's push for cloud computing is to turn desktop computing back into IBM-style mainframe computing. Except that there are too many players this time. Where is the Department of Justice and antitrust? Does Amazon qualify as being too big in cloud computing currently?

Or the rumours could be spread by Microsoft, Apple or Amazon competitors. Geeks are like that sometimes.

Creating Customized Packages in SAS Software

It seems there is a little-known component called SAS/TOOLKIT that enables you to create customized SAS commands.


I am still trying to find actual usage of this software, but it can basically be used to create additional customization in SAS. The price is reportedly 12,000 USD a year for the toolkit, though academics could be encouraged to write theses or projects implementing newer algorithms using standard SAS discounting. In addition, there is as of now no licensing constraint on reselling your customized SAS algorithm (but check with Cary, NC or http://www.sas.com on this before you go ahead and develop).

So if you have an existing open source R package and someone wants to port it to the SAS language or SAS software, they can simply use SAS/TOOLKIT to transport the algorithm (which to my knowledge is mostly open in R). Specific instances are graphics, Hmisc, plyr, or even lattice and clustering (like mclust) packages; or maybe even license it.

Citation: http://www.sas.com/products/toolkit/index.html

SAS/TOOLKIT® software enables you to write your own customized SAS procedures (including graphics procedures), informats, formats, functions (including IML and DATA step functions), CALL routines, and database engines in several languages including C, FORTRAN, PL/I, and IBM assembler.

SAS Procedures: A SAS procedure is a program that interfaces with the SAS System to perform a given action. The SAS System provides services to the procedure such as:

  • statement processing
  • data set management
  • memory allocation

SAS Informats, Formats, Functions, and CALL Routines (IFFCs): You can use SAS/TOOLKIT software to write your own SAS informats, formats, functions, and CALL routines in the same choice of languages: C, FORTRAN, PL/I, and IBM assembler. Like procedures, user-written functions and CALL routines add capabilities to the SAS System that enable you to tailor the system to your site's specific needs. Many of the same reasons for writing procedures also apply to writing SAS formats and CALL routines.

SAS/TOOLKIT Software and PROC FORMAT: You may wonder why you should use SAS/TOOLKIT software to create user-written formats and informats when base SAS software includes PROC FORMAT. SAS/TOOLKIT software enables you to create formats and informats that perform more than the simple table lookup functions provided by the FORMAT procedure. When you write formats and informats with SAS/TOOLKIT software, you can do the following:

  • assign values according to an algorithm instead of looking up a value in a table.
  • look up values in a database to assign formatted values.

Writing a SAS IFFC

The routines you are most likely to use when writing an IFFC perform the following tasks:

  • provide a mechanism to interface with functions that are already written at your site
  • use algorithms to implement existing programs
  • handle problems specific to the SAS environment, such as missing values.

SAS Engines: SAS engines allow data to be presented to the SAS System so that it appears to be a standard SAS data set. Engines supplied by SAS Institute consist of a large number of subroutines, all of which are called by the portion of the SAS System known as the engine supervisor.

However, with SAS/TOOLKIT software, an additional level of software, the engine middle-manager, simplifies how you write your user-written engine.

An Engine versus a Procedure: To process data from an external file, you can write either an engine or a SAS procedure. In general, it is a good idea to implement data extraction mechanisms as procedures instead of engines. If your applications need to read most or all of a data file, you should consider creating a procedure; but if they need random access to the file, you should consider creating an engine.

Writing SAS Engines: When you write an engine, you must include in your program a prescribed set of routines to perform the various tasks required to access the file and interact with the SAS System. These routines:

  • open and close the data set
  • obtain information about variables
  • provide information about an external file or database
  • read and write observations.

In addition, your program uses several structures defined by the SAS System for storing information needed by the engine and the SAS System. The SAS System interacts with your engine through the SAS engine middle-manager.

Using the USERPROC Procedure: Before you run your grammar, procedure, IFFC, or engine, use SAS/TOOLKIT software's USERPROC procedure.

  • For grammars, the USERPROC procedure produces a grammar function.
  • For procedures, IFFCs, and engines, the USERPROC procedure produces a program constants object file, which is necessary for linking all of the compiled object files into an executable module.

Compile and link the output of PROC USERPROC with the SAS System so that the system can access the procedure, IFFC, or engine when a user invokes it.

Using User-Written Procedures, IFFCs, and Engines: After you have created a SAS procedure, IFFC, or engine, you need to tell the SAS System where to find the module in order to run it. You can store your executable modules in any appropriate library. Before you invoke the SAS System, use operating system control language to specify the fileref SASLIB for the directory or load library where your executables are stored. When you invoke the SAS System and use the name of your procedure, IFFC, or engine, the SAS System checks its own libraries first and then looks in the SASLIB library for a module with that name.

Debugging Capabilities: The TLKTDBG facility allows you to obtain debug information concerning SAS routines called by your code, and works with any of the supported programming languages. You can turn this facility on and off without having to recompile or relink your code. Debug messages are sent to the SAS log. In addition to the SAS/TOOLKIT internal debugger, the C language compiler used to create your extension to the SAS System can be used to debug your program.

The SAS/C compiler, the VMS compiler, and the dbx debugger for AIX can all be used. NOTE: SAS/TOOLKIT software is used to develop procedures, IFFCs, and engines. Users do not need to license SAS/TOOLKIT software to run procedures developed with the software.

SAS/C Compiler attention

March 2008: Level B support is effective beginning January 1, 2008 until December 31, 2009. March 2005: The SAS/C and SAS/C++ compiler and runtime components are reclassified as SAS Retired products for z/OS, VM/ESA and cross-compiler platforms. SAS has no plans to develop or deliver a new release of the SAS/C product.

 

The SAS/C and SAS/C++ family of products provides a versatile development environment for IBM zSeries® and System/390® processors. Enhancements and product features for SAS/C 7.50F include support for z/Architecture instructions and 64-bit addressing, IEEE floating-point, C99 math library and a number of C++ language enhancements and extensions. The SAS/C runtime library, optimizer and debugging environments have been updated and enhanced to fully support the breadth of C/C++ 64-bit addressing, IEEE and C++ product features.

Finally, the SAS/C and SAS/C++ 7.50.06 cross-compiler products for Windows, Linux, Solaris and AIX incorporate the same enhancements and features that are provided with SAS/C and SAS/C++ 7.50F for z/OS.

Also see- http://support.sas.com/kb/15/647.html

The Great Game- How social media changes the Intelligence Industry

Since time immemorial, countries and corporations have used spies to displace existing equilibria in the balance of power or in market share dynamics. An integral part of that was technology. From the pox-infested rugs given to natives, to the plague rats, to the smuggling of the secrets of silk and gunpowder from China to the West, to the latest research in cloud seeding by China and glacier melting by India, technology espionage has been an integral part of keeping up with each other.

For the first time in history, technology has evolved to the point where tools for communicating securely and storing data have become cheap, to the point of just needing a small iPhone 3GS with applications for secure transmission. From an analytical perspective, the need for separating signal from noise, and the criticality of mapping chatter to events (like Major Hasan's online activities), has created an opportunity for social media as well as a headache for the people involved. With citizen journalism, foreign relations offices and ambassadors with their bully pulpits have been reduced to defending news leaked by Twitter (Iran), YouTube (Thailand/Burma/Tibet) and blogs (Russia/Georgia). The use of botnets and dark clouds to create disruptions, as well as to hack into accounts to amplify favourable noise and reduce unfavourable signals, has only increased. Blogs have the potential to influence customer behavior, as they are seen as more credible than public relations, which is mostly public and rarely about relations.

Techniques like sentiment analysis, social network analysis, text mining and the correlation of keywords to triggers remain active research areas.


The United States remains a leader, as you can only think creatively out of the box if you are permitted to behave accordingly out of the box. The remaining countries are torn between a mix of admiration, envy and plain old copycat techniques. The rising importance of communities that act more tribal than the hitherto loyal technology user lists is the reason almost all major corporates actively seek to cultivate social media communities. The market for blogs and Twitter in China, Iran or Russia will have impacts on those governments' efforts to manage their growth as per their national strategic interests. Like the title of an old and quaint novel, the "Brave New World" of social media and its convergence with increasing amounts of text data generated on customers or citizens is evolving to create new boundaries and space for itself. A fascinating Great Game in itself.

News on R Commercial Development -Rattle- R Data Mining Tool

R RANT: while the European R Core leadership led by the Great Dane, Peter Dalgaard, focuses on the small picture and virtually hands the whole commercial side to Prof Nie and David Smith at Revolution Computing, other, smaller package developers have refused to be treated as cheap R&D developers for enterprise software. How are the book sales coming along, Prof Peter? Any plans to write another R book, or are you done with writing your version of Mathematica (ref: Newton)? Running the R Core project team must be so hard; I recommend the Tarantino movie "Inglorious B…" for the Herr Doktors. -END

I believe that individual R package creators like Prof Harrell (Hmisc) or Hadley Wickham (plyr) deserve a share of the royalties or revenue of Revolution Computing, or of any software company that uses R.

On this note, some updated news on Rattle, the data mining tool created by Dr Graham Williams. Once again, R development is taken ahead by chaps Down Under while the big guys thrash out the road map across the pond.

Data Mining Resources

Citation: http://datamining.togaware.com/

Rattle is a free and open source data mining toolkit written in the statistical language R using the Gnome graphical interface. It runs under GNU/Linux, Macintosh OS X, and MS/Windows. Rattle is being used in business, government, research and for teaching data mining in Australia and internationally. Rattle can be purchased on DVD (or made available as a downloadable CD image) as a standalone installation for $450USD ($560AUD), using one of the following payment buttons.

The free and open source book, The Data Mining Desktop Survival Guide (ISBN 0-9757109-2-3), simply explains the otherwise complex algorithms and concepts of data mining, with examples to illustrate each algorithm using the statistical language R. The book is being written by Dr Graham Williams, based on his 20 years of research and consulting experience in machine learning and data mining. An electronic PDF version is available for a small fee from Togaware ($40AUD/$35USD to cover costs and ongoing development).

Other Resources

  • The Data Mining Software Repository makes available a collection of free (as in libre) open source software tools for data mining
  • The Data Mining Catalogue lists many of the free and commercial data mining tools that are available on the market.
  • The Australasian Data Mining Conferences are supported by Togaware, which also hosts the web site.
  • Information about the Pacific Asia Knowledge Discovery and Data Mining series of conferences is also available.
  • Data Mining course is taught at the Australian National University.
  • See also the Canberra Analytics Practise Group.
  • A Data Mining Course was held at the Harbin Institute of Technology Shenzhen Graduate School, China, 6 December – 13 December 2006. This course introduced the basic concepts and algorithms of data mining from an applications point of view and introduced the use of R and Rattle for data mining in practise.
  • Data Mining Workshop was held over two days at the University of Canberra, 27-28 November, 2006. This course introduced the basic concepts and algorithms for data mining and the use of R and Rattle.

Using R for Data Mining

The open source statistical programming language R (based on S) is in daily use in academia and in business and government. We use R for data mining within the Australian Taxation Office. Rattle is used by those wishing to interact with R through a GUI.

R is memory-based, so on 32-bit CPUs you are limited to smaller datasets (perhaps 50,000 up to 100,000 records, depending on what you are doing). Deploying R on 64-bit multiple-CPU (AMD64) servers running GNU/Linux with 32GB of main memory provides a powerful platform for data mining.

R is open source, thus providing assurance that there will always be the opportunity to fix and tune things that suit our specific needs, rather than rely on having to convince a vendor to fix or tune their product to suit our needs.

Also, by being open source, we can be sure that the code will always be available, unlike some data mining products that have disappeared (e.g., IBM's Intelligent Miner).

See earlier interview-

https://decisionstats.wordpress.com/2009/01/13/interview-dr-graham-williams/