model – Page 8 – DECISION STATS

Open Source Compiler for SAS language/ GNU -DAP

I am still testing this out.

But if you know bit more about make and .compile in Ubuntu check out

http://www.gnu.org/software/dap/

I loved the humorous introduction

Dap is a small statistics and graphics package based on C. Version 3.0 and later of Dap can read SBS programs (based on the utterly famous, industry standard statistics system with similar initials – you know the one I mean)! The user wishing to perform basic statistical analyses is now freed from learning and using C syntax for straightforward tasks, while retaining access to the C-style graphics and statistics features provided by the original implementation. Dap provides core methods of data management, analysis, and graphics that are commonly used in statistical consulting practice (univariate statistics, correlations and regression, ANOVA, categorical data analysis, logistic regression, and nonparametric analyses).

Anyone familiar with the basic syntax of C programs can learn to use the C-style features of Dap quickly and easily from the manual and the examples contained in it; advanced features of C are not necessary, although they are available. (The manual contains a brief introduction to the C syntax needed for Dap.) Because Dap processes files one line at a time, rather than reading entire files into memory, it can be, and has been, used on data sets that have very many lines and/or very many variables.

I wrote Dap to use in my statistical consulting practice because the aforementioned utterly famous, industry standard statistics system is (or at least was) not available on GNU/Linux and costs a bundle every year under a lease arrangement. And now you can run programs written for that system directly on Dap! I was generally happy with that system, except for the graphics, which are all but impossible to use, but there were a number of clumsy constructs left over from its ancient origins.

http://www.gnu.org/software/dap/#Sample output

Unbalanced ANOVA

Crossed, nested ANOVA

Random model, unbalanced

Mixed model, balanced

Mixed model, unbalanced

Split plot

Latin square

Missing treatment combinations

Linear regression

Linear regression, model building

Ordinal cross-classification

Stratified 2×2 tables

Loglinear models

Logit model for linear-by-linear association

Logistic regression

sounds too good to be true- GNU /DAP joins WPS workbench and Dulles Open’s Carolina as the third SAS language compiler (besides the now defunct BASS software) see http://en.wikipedia.org/wiki/SAS_language#Controversy

Also see http://en.wikipedia.org/wiki/DAP_(software)

Dap was written to be a free replacement for SAS, but users are assumed to have a basic familiarity with the C programming language in order to permit greater flexibility. Unlike R it has been designed to be used on large data sets.

It has been designed so as to cope with very large data sets; even when the size of the data exceeds the size of the computer’s memory

R courses from Statistics.com (revolutionanalytics.com)
Categorical Data Analysis for the Behavioral and Social Sciences (psypress.com)
Skills of a good data miner (zyxo.wordpress.com)
GNU Octave 3.4 has just been released (gnu.org)
Revolution R Enterprise 4.2 now available (revolutionanalytics.com)

R Commander Plugins-20 and growing!

R Commander Extensions: Enhancing a Statistical Graphical User Interface by extending menus to statistical packages

R Commander ( see paper by Prof J Fox at http://www.jstatsoft.org/v14/i09/paper ) is a well known and established graphical user interface to the R analytical environment.
While the original GUI was created for a basic statistics course, the enabling of extensions (or plug-ins http://www.r-project.org/doc/Rnews/Rnews_2007-3.pdf ) has greatly enhanced the possible use and scope of this software. Here we give a list of all known R Commander Plugins and their uses along with brief comments.

DoE – http://cran.r-project.org/web/packages/RcmdrPlugin.DoE/RcmdrPlugin.DoE.pdf
doex
EHESampling
epack- http://cran.r-project.org/web/packages/RcmdrPlugin.epack/RcmdrPlugin.epack.pdf
Export- http://cran.r-project.org/web/packages/RcmdrPlugin.Export/RcmdrPlugin.Export.pdf
FactoMineR
HH
IPSUR
MAc- http://cran.r-project.org/web/packages/RcmdrPlugin.MAc/RcmdrPlugin.MAc.pdf
MAd
orloca
PT
qcc- http://cran.r-project.org/web/packages/RcmdrPlugin.qcc/RcmdrPlugin.qcc.pdf and http://cran.r-project.org/web/packages/qcc/qcc.pdf
qual
SensoMineR
SLC
sos
survival-http://cran.r-project.org/web/packages/RcmdrPlugin.survival/RcmdrPlugin.survival.pdf
SurvivalT
Teaching Demos

Note the naming convention for above e plugins is always with a Prefix of “RCmdrPlugin.” followed by the names above
Also on loading a Plugin, it must be already installed locally to be visible in R Commander’s list of load-plugin, and R Commander loads the e-plugin after restarting.Hence it is advisable to load all R Commander plugins in the beginning of the analysis session.

However the notable E Plugins are
1) DoE for Design of Experiments-
Full factorial designs, orthogonal main effects designs, regular and non-regular 2-level fractional
factorial designs, central composite and Box-Behnken designs, latin hypercube samples, and simple D-optimal designs can currently be generated from the GUI. Extensions to cover further latin hypercube designs as well as more advanced D-optimal designs (with blocking) are planned for the future.
2) Survival- This package provides an R Commander plug-in for the survival package, with dialogs for Cox models, parametric survival regression models, estimation of survival curves, and testing for differences in survival curves, along with data-management facilities and a variety of tests, diagnostics and graphs.
3) qcc -GUI for Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. Multivariate control charts
4) epack- an Rcmdr “plug-in” based on the time series functions. Depends also on packages like , tseries, abind,MASS,xts,forecast. It covers Log-Exceptions garch
and following Models -Arima, garch, HoltWinters
5)Export- The package helps users to graphically export Rcmdr output to LaTeX or HTML code,
via xtable() or Hmisc::latex(). The plug-in was originally intended to facilitate exporting Rcmdr
output to formats other than ASCII text and to provide R novices with an easy-to-use,
easy-to-access reference on exporting R objects to formats suited for printed output. The
package documentation contains several pointers on creating reports, either by using
conventional word processors or LaTeX/LyX.
6) MAc- This is an R-Commander plug-in for the MAc package (Meta-Analysis with
Correlations). This package enables the user to conduct a meta-analysis in a menu-driven,
graphical user interface environment (e.g., SPSS), while having the full statistical capabilities of
R and the MAc package. The MAc package itself contains a variety of useful functions for
conducting a research synthesis with correlational data. One of the unique features of the MAc
package is in its integration of user-friendly functions to complete the majority of statistical steps
involved in a meta-analysis with correlations. It uses recommended procedures as described in
The Handbook of Research Synthesis and Meta-Analysis (Cooper, Hedges, & Valentine, 2009).

A query to help for ??Rcmdrplugins reveals the following information which can be quite overwhelming given that almost 20 plugins are now available-

RcmdrPlugin.DoE::DoEGlossary
Glossary for DoE terminology as used in
RcmdrPlugin.DoE
RcmdrPlugin.DoE::Menu.linearModelDesign
RcmdrPlugin.DoE Linear Model Dialog for
experimental data
RcmdrPlugin.DoE::Menu.rsm
RcmdrPlugin.DoE response surface model Dialog
for experimental data
RcmdrPlugin.DoE::RcmdrPlugin.DoE-package
R-Commander plugin package that implements
design of experiments facilities from packages
DoE.base, FrF2 and DoE.wrapper into the
R-Commander
RcmdrPlugin.DoE::RcmdrPlugin.DoEUndocumentedFunctions
Functions used in menus
RcmdrPlugin.doex::ranblockAnova
Internal RcmdrPlugin.doex objects
RcmdrPlugin.doex::RcmdrPlugin.doex-package
Install the DOEX Rcmdr Plug-In
RcmdrPlugin.EHESsampling::OpenSampling1
Internal functions for menu system of
RcmdrPlugin.EHESsampling
RcmdrPlugin.EHESsampling::RcmdrPlugin.EHESsampling-package
Help with EHES sampling
RcmdrPlugin.Export::RcmdrPlugin.Export-package
Graphically export objects to LaTeX or HTML
RcmdrPlugin.FactoMineR::defmacro
Internal RcmdrPlugin.FactoMineR objects
RcmdrPlugin.FactoMineR::RcmdrPlugin.FactoMineR
Graphical User Interface for FactoMineR
RcmdrPlugin.IPSUR::IPSUR-package
An IPSUR Plugin for the R Commander
RcmdrPlugin.MAc::RcmdrPlugin.MAc-package
Meta-Analysis with Correlations (MAc) Rcmdr
Plug-in
RcmdrPlugin.MAd::RcmdrPlugin.MAd-package
Meta-Analysis with Mean Differences (MAd) Rcmdr
Plug-in
RcmdrPlugin.orloca::activeDataSetLocaP
RcmdrPlugin.orloca: A GUI for orloca-package
(internal functions)
RcmdrPlugin.orloca::RcmdrPlugin.orloca-package
RcmdrPlugin.orloca: A GUI for orloca-package
RcmdrPlugin.orloca::RcmdrPlugin.orloca.es
RcmdrPlugin.orloca.es: Una interfaz grafica
para el paquete orloca
RcmdrPlugin.qcc::RcmdrPlugin.qcc-package
Install the Demos Rcmdr Plug-In
RcmdrPlugin.qual::xbara
Internal RcmdrPlugin.qual objects
RcmdrPlugin.qual::RcmdrPlugin.qual-package
Install the quality Rcmdr Plug-In
RcmdrPlugin.SensoMineR::defmacro
Internal RcmdrPlugin.SensoMineR objects
RcmdrPlugin.SensoMineR::RcmdrPlugin.SensoMineR
Graphical User Interface for SensoMineR
RcmdrPlugin.SLC::Rcmdr.help.RcmdrPlugin.SLC
RcmdrPlugin.SLC: A GUI for slc-package
(internal functions)
RcmdrPlugin.SLC::RcmdrPlugin.SLC-package
RcmdrPlugin.SLC: A GUI for SLC R package
RcmdrPlugin.sos::RcmdrPlugin.sos-package
Efficiently search R Help pages
RcmdrPlugin.steepness::Rcmdr.help.RcmdrPlugin.steepness
RcmdrPlugin.steepness: A GUI for
steepness-package (internal functions)
RcmdrPlugin.steepness::RcmdrPlugin.steepness
RcmdrPlugin.steepness: A GUI for steepness R
package
RcmdrPlugin.survival::allVarsClusters
Internal RcmdrPlugin.survival Objects
RcmdrPlugin.survival::RcmdrPlugin.survival-package
Rcmdr Plug-In Package for the survival Package
RcmdrPlugin.TeachingDemos::RcmdrPlugin.TeachingDemos-package
Install the Demos Rcmdr Plug-In

New edition of “R Companion to Applied Regression” – by John Fox and Sandy Weisberg (r-bloggers.com)
Reasons for Transitioning to Vim: Bringing LaTeX, R, Sweave and More under One Roof (r-bloggers.com)

Data Visualization: Central Banks

Iron Ore Company of Canada — Image via Wikipedia

Trying to compare the transparency of central banks via the data visualization of two very different central banks.

One is Reserve Bank of India and the other is Federal Reserve Bank of New York

Here are some points-

1) The federal bank gives you a huge clutter of charts to choose from and sometimes gives you very difficult to understand charts.

see http://www.newyorkfed.org/research/global_economy/usecon_charts.html

and http://www.newyorkfed.org/research/directors_charts/us18chart.pdf

us18chart

2) The Reserve bank of India choose Business Objects and gives you a proper drilldown kind of graph and tables. ( thats a lot of heavy metal and iron ore China needs from India 😉 😉

Foreign Trade – Export Time-line: ALL

TIME LINE	COUNTRY	COMMODITY	AMOUNT (US $ MILLION)	EXPORT QUANTITY
2010:07 (JUL) – P	China	IRON ORE (Units: TON)	205.06	1878456
2010:06 (JUN) – P	China	IRON ORE (Units: TON)	427.68	6808528
2010:05 (MAY) – P	China	IRON ORE (Units: TON)	550.67	5290450
2010:04 (APR) – P	China	IRON ORE (Units: TON)	922.46	9931500
2010:03 (MAR) – P	China	IRON ORE (Units: TON)	829.75	13177672
2010:02 (FEB) – P	China	IRON ORE (Units: TON)	706.04	10141259
2010:01 (JAN) – P	China	IRON ORE (Units: TON)	577.13	8498784
2009:12 (DEC) – P	China	IRON ORE (Units: TON)	545.68	9264544
2009:11 (NOV) – P	China	IRON ORE (Units: TON)	508.17	9509213
2009:10 (OCT) – P	China	IRON ORE (Units: TON)	422.6	7691652
2009:09 (SEP) – P	China	IRON ORE (Units: TON)	278.04	4577943
2009:08 (AUG) – P	China	IRON ORE (Units: TON)	276.96	4371847
2009:07 (JUL)	China	IRON ORE (Units: TON)	266.11	4642237
2009:06 (JUN)	China	IRON ORE (Units: TON)	241.08	4584354

Source : DGCI & S, Ministry of Commerce & Industry, GoI

You can see the screenshots of the various visualization tools of the New York Fed Reserve Bank and Indian Reserve Bank- if the US Fed is serious about cutting the debt maybe it should start publishing better visuals

Iron-Ore Miners Raise Prices (online.wsj.com)
Rival bidders join forces to take over key iron ore deposit in Canadian Arctic (mnn.com)
UPDATE 5-Floods knock BHP coal output; iron ore hits record (reuters.com)
BHP quarterly iron ore hits record, floods knock coal (reuters.com)
Swaps with foreign cen banks total $70 million – NY Fed (reuters.com)
Foreign central banks’ U.S. debt holdings fall – Fed (reuters.com)
The U.S. FED: Role Model for Brazil’s Central Bank: Estadao, Brazil (themoderatevoice.com)
Data Visualization References (dashboardspy.wordpress.com)

Common Analytical Tasks

WorldWarII-DeathsByCountry-Barchart — Image via Wikipedia

Some common analytical tasks from the diary of the glamorous life of a business analyst-

1) removing duplicates from a dataset based on certain key values/variables
2) merging two datasets based on a common key/variable/s
3) creating a subset based on a conditional value of a variable
4) creating a subset based on a conditional value of a time-date variable
5) changing format from one date time variable to another
6) doing a means grouped or classified at a level of aggregation
7) creating a new variable based on if then condition
8) creating a macro to run same program with different parameters
9) creating a logistic regression model, scoring dataset,
10) transforming variables
11) checking roc curves of model
12) splitting a dataset for a random sample (repeatable with random seed)
13) creating a cross tab of all variables in a dataset with one response variable
14) creating bins or ranks from a certain variable value
15) graphically examine cross tabs
16) histograms
17) plot(density())
18)creating a pie chart
19) creating a line graph, creating a bar graph
20) creating a bubbles chart
21) running a goal seek kind of simulation/optimization
22) creating a tabular report for multiple metrics grouped for one time/variable
23) creating a basic time series forecast

and some case studies I could think of-

As the Director, Analytics you have to examine current marketing efficiency as well as help optimize sales force efficiency across various channels. In addition you have to examine multiple sales channels including inbound telephone, outgoing direct mail, internet email campaigns. The datawarehouse is an RDBMS but it has multiple data quality issues to be checked for. In addition you need to submit your budget estimates for next year’s annual marketing budget to maximize sales return on investment.

As the Director, Risk you have to examine the overdue mortgages book that your predecessor left you. You need to optimize collections and minimize fraud and write-offs, and your efforts would be measured in maximizing profits from your department.

As a social media consultant you have been asked to maximize social media analytics and social media exposure to your client. You need to create a mechanism to report particular brand keywords, as well as automated triggers between unusual web activity, and statistical analysis of the website analytics metrics. Above all it needs to be set up in an automated reporting dashboard .

As a consultant to a telecommunication company you are asked to monitor churn and review the existing churn models. Also you need to maximize advertising spend on various channels. The problem is there are a large number of promotions always going on, some of the data is either incorrectly coded or there are interaction effects between the various promotions.

As a modeller you need to do the following-
1) Check ROC and H-L curves for existing model
2) Divide dataset in random splits of 40:60
3) Create multiple aggregated variables from the basic variables

4) run regression again and again

5) evaluate statistical robustness and fit of model

6) display results graphically

All these steps can be broken down in little little pieces of code- something which i am putting down a list of.

Are there any common data analysis tasks that you think I am missing out- any common case studies ? let me know.

Some Basics about Stats (psipsychologytutor.org)
Learn Logistic Regression (and beyond) (win-vector.com)
So You Call Yourself an Analyst? Part 2: Analysis Redefined (seomoz.org)
3 More Google Analytics Tips (gigaom.com)
What do practitioners need to know about regression? (stat.columbia.edu)
Using R for Introductory Statistics, Chapter 4, Model Formulae (r-bloggers.com)
sab-R-metrics: Basics of Vectors and Data Calling (r-bloggers.com)
sab-R-metrics: Subsetting, Conditional Statements, ‘tapply()’, and VERY simple ‘for loops’ (r-bloggers.com)
A Volatile Look At HSQLDB (designbygravity.wordpress.com)
Calpont InfiniDB Speeds Web Analytics Performance For Cognitive Match (prweb.com)
Case Study: Using a Scenario to Select Business Intelligence Software (customerthink.com)

R for Predictive Modeling:Workshop

A workshop on using R for Predictive Modeling, by the Director, Non Clinical Stats, Pfizer. Interesting Bay Area Event- part of next edition of Predictive Analytics World

Sunday, March 13, 2011 in San Francisco

R for Predictive Modeling:
A Hands-On Introduction

Intended Audience: Practitioners who wish to learn how to execute on predictive analytics by way of the R language; anyone who wants “to turn ideas into software, quickly and faithfully.”

Knowledge Level: Either hands-on experience with predictive modeling (without R) or hands-on familiarity with any programming language (other than R) is sufficient background and preparation to participate in this workshop.

Workshop Description

This one-day session provides a hands-on introduction to R, the well-known open-source platform for data analysis. Real examples are employed in order to methodically expose attendees to best practices driving R and its rich set of predictive modeling packages, providing hands-on experience and know-how. R is compared to other data analysis platforms, and common pitfalls in using R are addressed.

The instructor, a leading R developer and the creator of CARET, a core R package that streamlines the process for creating predictive models, will guide attendees on hands-on execution with R, covering:

A working knowledge of the R system
The strengths and limitations of the R language
Preparing data with R, including splitting, resampling and variable creation
Developing predictive models with R, including decision trees, support vector machines and ensemble methods
Visualization: Exploratory Data Analysis (EDA), and tools that persuade
Evaluating predictive models, including viewing lift curves, variable importance and avoiding overfitting

Hardware: Bring Your Own Laptop
Each workshop participant is required to bring their own laptop running Windows or OS X. The software used during this training program, R, is free and readily available for download.

Attendees receive an electronic copy of the course materials and related R code at the conclusion of the workshop.

Price and Registration Info:

Schedule

Workshop starts at 9:00am
Morning Coffee Break at 10:30am – 11:00am
Lunch provided at 12:30 – 1:15pm
Afternoon Coffee Break at 2:30pm – 3:00pm
End of the Workshop: 4:30pm

Instructor

Max Kuhn, Director, Nonclinical Statistics, Pfizer

Max Kuhn is a Director of Nonclinical Statistics at Pfizer Global R&D in Connecticut. He has been apply models in the pharmaceutical industries for over 15 years.

He is a leading R developer and the author of several R packages including the CARET package that provides a simple and consistent interface to over 100 predictive models available in R.

Mr. Kuhn has taught courses on modeling within Pfizer and externally, including a class for the India Ministry of Information Technology.

http://www.predictiveanalyticsworld.com/sanfrancisco/2011/r_for_predictive_modeling.php

In-Depth Hands-on Workshops Delivered By Analytics Experts and Leading Practitioners at Predictive Analytics World March 13-17, 2011, San Francisco, California (prweb.com)
Rapid Insight Provides Low-Cost Options for Desktop Data Transformation and Predictive Modeling (customerthink.com)
An R interface to the Google Prediction API (revolutionanalytics.com)
THE NEW NEXT: TrendsSpotting’s Trend Prediction Model (trendsspotting.com)
JMP Launches Global Online Store Powered by e-academy, Inc. (prweb.com)
Drug’s likelihood of causing birth defects predicted by model (physorg.com)

Interview Luis Torgo Author Data Mining with R

Example of k-nearest neighbour classification — Image via Wikipedia

Here is an interview with Prof Luis Torgo, author of the recent best seller “Data Mining with R-learning with case studies”.

Ajay- Describe your career in science. How do you think can more young people be made interested in science.

Luis- My interest in science only started after I’ve finished my degree. I’ve entered a research lab at the University of Porto and started working on Machine Learning, around 1990. Since then I’ve been involved generally in data analysis topics both from a research perspective as well as from a more applied point of view through interactions with industry partners on several projects. I’ve spent most of my career at the Faculty of Economics of the University of Porto, but since 2008 I’m at the department of Computer Science of the Faculty of Sciences of the same university. At the same time I’ve been a researcher at LIAAD / Inesc Porto LA (www.liaad.up.pt).

I like a lot what I do and like science and the “scientific way of thinking”, but I cannot say that I’ve always thought of this area as my “place”. Most of all I like solving challenging problems through data analysis. If that translates into some scientific outcome than I’m more satisfied but that is not my main goal, though I’m kind of “forced” to think about that because of the constraints of an academic career.

That does not mean I’m not passionate about science, I just think there are many more ways of “doing science” than what is reflected in the usual “scientific indicators” that most institutions seem to be more and more obsessed about.

Regards interesting young people in science that is a hard question that I’m not sure I’m qualified to answer. I do tend to think that young people are more sensible to concrete examples of problems they think are interesting and that science helps in solving, as a way of finding a motivation for facing the hard work they will encounter in a scientific career. I do believe in case studies as a nice way to learn and motivate, and thus my book 😉

Ajay- Describe your new book “Data Mining with R, learning with case studies” Why did you choose a case study based approach? who is the target audience? What is your favorite case study from the book

Luis- This book is about learning how to use R for data mining. The book follows a “learn by doing it” approach to data mining instead of the more common theoretical description of the available techniques in this discipline. This is accomplished by presenting a series of illustrative case studies for which all necessary steps, code and data are provided to the reader. Moreover, the book has an associated web page (www.liaad.up.pt/~ltorgo/DataMiningWithR) where all code inside the book is given so that easy copy-paste is possible for the more lazy readers.

The language used in the book is very informal without many theoretical details on the used data mining techniques. For obtaining these theoretical insights there are already many good data mining books some of which are referred in “further readings” sections given throughout the book. The decision of following this writing style had to do with the intended target audience of the book.

In effect, the objective was to write a monograph that could be used as a supplemental book for practical classes on data mining that exist in several courses, but at the same time that could be attractive to professionals working on data mining in non-academic environments, and thus the choice of this more practically oriented approach.

Regards my favorite case study that is a hard question for an author… still I would probably choose the “Predicting Stock Market Returns” case study (Chapter 3). Not only because I like this challenging problem, but mainly because the case study addresses all aspects of knowledge discovery in a real world scenario and not only the construction of predictive models. It tackles data collection, data pre-processing, model construction, transforming predictions into actions using different trading policies, using business-related performance metrics, implementing a trading simulator for “real-world” evaluation, and laying out grounds for constructing an online trading system.

Obviously, for all these steps there are far too many options to be possible to describe/evaluate all of them in a chapter, still I do believe that for the reader it is important to see the overall picture, and read about the relevant questions on this problem and some possible paths that can be followed at these different steps.

In other words: do not expect to become rich with the solution I describe in the chapter !

Ajay- Apart from R, what other data mining software do you use or have used in the past. How would you compare their advantages and disadvantages with R

Luis- I’ve played around with Clementine, Weka, RapidMiner and Knime, but really only playing with teaching goals, and no serious use/evaluation in the context of data mining projects. For the latter I mainly use R or software developed by myself (either in R or other languages). In this context, I do not think it is fair to compare R with these or other tools as I lack serious experience with them. I can however, tell you about what I see as the main pros and cons of R. The main reason for using R is really not only the power of the tool that does not stop surprising me in terms of what already exists and keeps appearing as contributions of an ever growing community, but mainly the ability of rapidly transforming ideas into prototypes. Regards some of its drawbacks I would probably mention the lack of efficiency when compared to other alternatives and the problem of data set sizes being limited by main memory.

I know that there are several efforts around for solving this latter issue not only from the community (e.g. http://cran.at.r-project.org/web/views/HighPerformanceComputing.html), but also from the industry (e.g. Revolution Analytics), but I would prefer that at this stage this would be a standard feature of the language so the the “normal” user need not worry about it. But then this is a community effort and if I’m not happy with the current status instead of complaining I should do something about it!

Ajay- Describe your writing habit- How do you set about writing the book- did you write a fixed amount daily or do you write in bursts etc

Luis- Unfortunately, I write in bursts whenever I find some time for it. This is much more tiring and time consuming as I need to read back material far too often, but I cannot afford dedicating too much consecutive time to a single task. Actually, I frequently tease my PhD students when they “complain” about the lack of time for doing what they have to, that they should learn to appreciate the luxury of having a single task to complete because it will probably be the last time in their professional life!

Ajay- What do you do to relax or unwind when not working?

Luis- For me, the best way to relax from work is by playing sports. When I’m involved in some game I reset my mind and forget about all other things and this is very relaxing for me. A part from sports I enjoy a lot spending time with my family and friends. A good and long dinner with friends over a good bottle of wine can do miracles when I’m too stressed with work! Finally,I do love traveling around with my family.

Luis Torgo

Short Bio: Luis Torgo has a degree in Systems and Informatics Engineering and a PhD in Computer Science. He is an Associate Professor of the Department of Computer Science of the Faculty of Sciences of the University of Porto. He is also a researcher of the Laboratory of Artificial Intelligence and Data Analysis (LIAAD) belonging to INESC Porto LA. Luis Torgo has been an active researcher in Machine Learning and Data Mining for more than 20 years. He has lead several academic and industrial Data Mining research projects. Luis Torgo accompanies the R project almost since its beginning, using it on his research activities. He teaches R at different levels and has given several courses in different countries.

For reading “Data Mining with R” – you can visit this site, also to avail of a 20% discount the publishers have generously given (message below)-

For more information and to place an order, visit us at http://www.crcpress.com. Order online and apply 20% Off discount code 907HM at checkout. CRC is pleased to offer free standard shipping on all online orders!

link to the book page http://www.crcpress.com/product/isbn/9781439810187

Price: $79.95
Cat. #: K10510
ISBN: 9781439810187
ISBN 10: 1439810184
Publication Date: November 09, 2010
Number of Pages: 305
Availability: In Stock
Binding(s): Hardback

Finally! A practical R book on Data Mining: “Data Mining With R, Learning with Case Studies,” by Luis Torgo (r-bloggers.com)
INFORMS Data Mining Competition leaders used Open Source software (r-bloggers.com)
Is Data-Mining Free Speech? The Supreme Court Agrees to Decide a Crucial Case (dailyfinance.com)
Mining of Massive Data Sets (kinlane.com)
Case Study (jonathanlewis.wordpress.com)
Statistical Aspects of Data Mining (kinlane.com)
5 of the Best Free and Open Source Data Mining Software (junauza.com)
US top court to decide state drug data mining law (reuters.com)
Data-mining Google Books: Does the Reader Have To Be Human? (scholarlykitchen.sspnet.org)
Data Mining Competitions | TunedIT (tunedit.org)

PySpread Magic

Image via Wikipedia

Just working with PySpread- and worked on a 1 million by 1 million spreadsheet- Python sure looks promising for the way ahead for stat computing ( you need to

sudo apt-get install python-numpy python-rpy python-scipy python-gmpy wxpython*,

cd to the untarred bz2 file from http://pyspread.sourceforge.net/download.html, (like

:~/Downloads$ cd pyspread-0.1.2

:~/Downloads/pyspread-0.1.2

sudo python setup.py install

)

http://pyspread.sourceforge.net/

by Martin Manns

about	Pyspread is a cross-platform Python spreadsheet application. It is based on and written in the programming language Python. Instead of spreadsheet formulas, Python expressions are entered into the spreadsheet cells. Each expression returns a Python object that can be accessed from other cells. These objects can represent anything including lists or matrices.
features	Three dimensional grid with up to 85,899,345 rows and 14,316,555 columns (64 bit systems, depends on row height and column width). Note that a million cells require about 500 MB of memory. Complex data types such as lists, trees or matrices within a single cell. Macros for functionalities that are too complex for a single Python expression. Python module access from each cell, which allows: Arbitrary size rational numbers (via gmpy), Fixed point decimal numbers for business calculations, (via the decimal module from the standard library) Advanced statistics including plotting functions (via RPy) Much more via <your favourite module>. CSV import and export Clipboard access
warning	The concept of pyspread allows doing everything from each cell that a Python script can do. This powerful feature has its drawbacks. A spreadsheet may very well delete your hard drive or send your data via the Internet. Of course this is a non-issue if you sandbox properly or if you only use self developed spreadsheets. Since this is not the case for everyone (see discussion at lwn.net), a GPG signature based trust model for spreadsheet files has been introduced. It ensures that only your own trusted files are executed on loading. Untrusted files are displayed in safe mode. You can approve a file manually. Inspect carefully.

about

Pyspread is a cross-platform Python spreadsheet application. It is based on and written in the programming language Python.

Instead of spreadsheet formulas, Python expressions are entered into the spreadsheet cells. Each expression returns a Python object that can be accessed from other cells. These objects can represent anything including lists or matrices.

features

Three dimensional grid with up to 85,899,345 rows and 14,316,555 columns (64 bit systems, depends on row height and column width). Note that a million cells require about 500 MB of memory.
Complex data types such as lists, trees or matrices within a single cell.
Macros for functionalities that are too complex for a single Python expression.
Python module access from each cell, which allows:
- Arbitrary size rational numbers (via gmpy),
- Fixed point decimal numbers for business calculations, (via the decimal module from the standard library)
- Advanced statistics including plotting functions (via RPy)
- Much more via <your favourite module>.
CSV import and export
Clipboard access

warning

The concept of pyspread allows doing everything from each cell that a Python script can do. This powerful feature has its drawbacks. A spreadsheet may very well delete your hard drive or send your data via the Internet. Of course this is a non-issue if you sandbox properly or if you only use self developed spreadsheets.

Since this is not the case for everyone (see discussion at lwn.net), a GPG signature based trust model for spreadsheet files has been introduced. It ensures that only your own trusted files are executed on loading. Untrusted files are displayed in safe mode. You can approve a file manually. Inspect carefully.

Python Package Index : PyPI (pypi.python.org)
SciPy – (scipy.org)
Top Ten Articles of 2010 (blog.pythonlibrary.org)
Ride the snake: Calling Python libraries from Haskell (john-millikin.com)
PyPy 1.4: Ouroboros in practice (morepypy.blogspot.com)
pyFLTK Home Page (pyfltk.sourceforge.net)
PyPy 1.4.1 (morepypy.blogspot.com)
python -me : a silly but useful command line trick (voidspace.org.uk)
PyPM Index for Python Developers (descentintodarkness.wordpress.com)
Python Extension Packages for Windows – Christoph Gohlke (lfd.uci.edu)
Ruby, Python, and Science (johndcook.com)
Compiling Python Code (effbot.org)
Deep end is deep (ask.metafilter.com)

Related Articles

Please share:

Related Articles

Please share:

Related Articles

Please share:

Related Articles

Please share:

Sunday, March 13, 2011 in San Francisco

R for Predictive Modeling: A Hands-On Introduction

Workshop Description

Schedule

Instructor

Max Kuhn, Director, Nonclinical Statistics, Pfizer

Related Articles

Please share:

Related Articles

Please share:

Related Articles

Please share:

R for Predictive Modeling:
A Hands-On Introduction