Chapman/Hall announces new series on R

R Authors get more choice and variety now-
http://www.mail-archive.com/r-help@r-project.org/msg122965.html
We are pleased to announce the launch of a new series of books on R. 

Chapman & Hall/CRC: The R Series

Aims and Scope
This book series reflects the recent rapid growth in the development and 
application of R, the programming language and software environment for 
statistical computing and graphics. R is now widely used in academic research, 
education, and industry. It is constantly growing, with new versions of the 
core software released regularly and more than 2,600 packages available. It is 
difficult for the documentation to keep pace with the expansion of the 
software, and this vital book series provides a forum for the publication of 
books covering many aspects of the development and application of R.

The scope of the series is wide, covering three main threads:
• Applications of R to specific disciplines such as biology, epidemiology, 
genetics, engineering, finance, and the social sciences.
• Using R for the study of topics of statistical methodology, such as linear 
and mixed modeling, time series, Bayesian methods, and missing data.
• The development of R, including programming, building packages, and graphics.

The books will appeal to programmers and developers of R software, as well as 
applied statisticians and data analysts in many fields. The books will feature 
detailed worked examples and R code fully integrated into the text, ensuring 
their usefulness to researchers, practitioners and students.

Series Editors
John M. Chambers (Department of Statistics, Stanford University, USA; 
j...@stat.stanford.edu)
Torsten Hothorn (Institut für Statistik, Ludwig-Maximilians-Universität, 
München, Germany; torsten.hoth...@stat.uni-muenchen.de)
Duncan Temple Lang (Department of Statistics, University of California, Davis, 
USA; dun...@wald.ucdavis.edu)
Hadley Wickham (Department of Statistics, Rice University, Houston, Texas, USA; 
had...@rice.edu)

Call for Proposals
We are interested in books covering all aspects of the development and 
application of R software. If you have an idea for a book, please contact one 
of the series editors above or one of the Chapman & Hall/CRC statistics 
acquisitions editors below. Please provide brief details of topic, audience, 
aims and scope, and include an outline if possible.

We look forward to hearing from you.

Best regards,Rob Calver (rob.cal...@informa.com)
David Grubbs (david.gru...@taylorandfrancis.com)
John Kimmel (john.kim...@taylorandfrancis.com)

Call for proposals for writing a book about R (via Chapman & Hall/CRC) (r-statistics.com)

Common Analytical Tasks

WorldWarII-DeathsByCountry-Barchart — Image via Wikipedia

Some common analytical tasks from the diary of the glamorous life of a business analyst-

1) removing duplicates from a dataset based on certain key values/variables
2) merging two datasets based on a common key/variable/s
3) creating a subset based on a conditional value of a variable
4) creating a subset based on a conditional value of a time-date variable
5) changing format from one date time variable to another
6) doing a means grouped or classified at a level of aggregation
7) creating a new variable based on if then condition
8) creating a macro to run same program with different parameters
9) creating a logistic regression model, scoring dataset,
10) transforming variables
11) checking roc curves of model
12) splitting a dataset for a random sample (repeatable with random seed)
13) creating a cross tab of all variables in a dataset with one response variable
14) creating bins or ranks from a certain variable value
15) graphically examine cross tabs
16) histograms
17) plot(density())
18)creating a pie chart
19) creating a line graph, creating a bar graph
20) creating a bubbles chart
21) running a goal seek kind of simulation/optimization
22) creating a tabular report for multiple metrics grouped for one time/variable
23) creating a basic time series forecast

and some case studies I could think of-

As the Director, Analytics you have to examine current marketing efficiency as well as help optimize sales force efficiency across various channels. In addition you have to examine multiple sales channels including inbound telephone, outgoing direct mail, internet email campaigns. The datawarehouse is an RDBMS but it has multiple data quality issues to be checked for. In addition you need to submit your budget estimates for next year’s annual marketing budget to maximize sales return on investment.

As the Director, Risk you have to examine the overdue mortgages book that your predecessor left you. You need to optimize collections and minimize fraud and write-offs, and your efforts would be measured in maximizing profits from your department.

As a social media consultant you have been asked to maximize social media analytics and social media exposure to your client. You need to create a mechanism to report particular brand keywords, as well as automated triggers between unusual web activity, and statistical analysis of the website analytics metrics. Above all it needs to be set up in an automated reporting dashboard .

As a consultant to a telecommunication company you are asked to monitor churn and review the existing churn models. Also you need to maximize advertising spend on various channels. The problem is there are a large number of promotions always going on, some of the data is either incorrectly coded or there are interaction effects between the various promotions.

As a modeller you need to do the following-
1) Check ROC and H-L curves for existing model
2) Divide dataset in random splits of 40:60
3) Create multiple aggregated variables from the basic variables

4) run regression again and again

5) evaluate statistical robustness and fit of model

6) display results graphically

All these steps can be broken down in little little pieces of code- something which i am putting down a list of.

Are there any common data analysis tasks that you think I am missing out- any common case studies ? let me know.

Some Basics about Stats (psipsychologytutor.org)
Learn Logistic Regression (and beyond) (win-vector.com)
So You Call Yourself an Analyst? Part 2: Analysis Redefined (seomoz.org)
3 More Google Analytics Tips (gigaom.com)
What do practitioners need to know about regression? (stat.columbia.edu)
Using R for Introductory Statistics, Chapter 4, Model Formulae (r-bloggers.com)
sab-R-metrics: Basics of Vectors and Data Calling (r-bloggers.com)
sab-R-metrics: Subsetting, Conditional Statements, ‘tapply()’, and VERY simple ‘for loops’ (r-bloggers.com)
A Volatile Look At HSQLDB (designbygravity.wordpress.com)
Calpont InfiniDB Speeds Web Analytics Performance For Cognitive Match (prweb.com)
Case Study: Using a Scenario to Select Business Intelligence Software (customerthink.com)

Libreoffice 3.3 released

What does LibreOffice give you?

http://www.libreoffice.org/features/

WRITER is the word processor inside LibreOffice. Use it for everything, from dashing off a quick letter to producing an entire book with tables of contents, embedded illustrations, bibliographies and diagrams. The while-you-type auto-completion, auto-formatting and automatic spelling checking make difficult tasks easy (but are easy to disable if you prefer). Writer is powerful enough to tackle desktop publishing tasks such as creating multi-column newsletters and brochures. The only limit is your imagination.

CALC tames your numbers and helps with difficult decisions when you’re weighing the alternatives. Analyze your data with Calc and then use it to present your final output. Charts and analysis tools help bring transparency to your conclusions. A fully-integrated help system makes easier work of entering complex formulas. Add data from external databases such as SQL or Oracle, then sort and filter them to produce statistical analyses. Use the graphing functions to display large number of 2D and 3D graphics from 13 categories, including line, area, bar, pie, X-Y, and net – with the dozens of variations available, you’re sure to find one that suits your project.

IMPRESS is the fastest and easiest way to create effective multimedia presentations. Stunning animation and sensational special effects help you convince your audience. Create presentations that look even more professional than the standard presentations you commonly see at work. Get your collegues’ and bosses’ attention by creating something a little bit different.

DRAW lets you build diagrams and sketches from scratch. A picture is worth a thousand words, so why not try something simple with box and line diagrams? Or else go further and easily build dynamic 3D illustrations and special effects. It’s as simple or as powerful as you want it to be.

BASE is the database front-end of the LibreOffice suite. With Base, you can seamlessly integrate into your existing database structures. Based on imported and linked tables and queries from MySQL, PostgreSQL or Microsoft Access and many other data sources, you can build powerful databases containing forms, reports, views and queries. Full integration is possible with the in-built HSQL database.

MATH is a simple equation editor that lets you lay-out and display your mathematical, chemical, electrical or scientific equations quickly in standard written notation. Even the most-complex calculations can be understandable when displayed correctly. E=mc²

Open Documentation just announced release candidate 3 of Libre office.

New Features-

http://www.libreoffice.org/download/new-features/

General

Added the LibreColors to the palette;
Added Quickstarter for Unix builds;
Introduced Linux “Libertine G” and Linux “Biolinum G” fonts;
Implement import of alpha channel for RGBA .tiffs [http://bugs.freedesktop.org/show_bug.cgi?id=30472];
Show all appropiate formats by default on “Save As” [http://qa.openoffice.org/issues/show_bug.cgi?id=113141];
Use radio buttons for mutually exclusive menu options;
Replace the “Help Support” menu item by the “License Information” one;
Load and save documents in flat XML;
Made Help system available via the WikiHelp;
Option to enable saving of documents at all times (see Tools -> Options -> LibreOffice -> General -> “Allow to save document…”).

Calc

[http://bugs.freedesktop.org/show_bug.cgi?id=30559]: Added new tab page ‘Compatibility’ in the Options dialog;
Better default key bindings;
Use Ctrl-Shift-D to launch selection list in LibreOffice;
Added new image file used in the “insert new sheet” button. This image is not visible in read-only mode;
Fix fake small caps resizing factor [http://qa.openoffice.org/issues/show_bug.cgi?id=1526];
Added dotted/dashed borders in Calc;
Added icons for toggling sheet grids in Calc;
Better performance and interoperability on Excel doc import;
Better performance on DBF import;
Slightly better performance on ODS import;
Possibility to use English formula names;
Distributed alignment – allows one to specify ‘distributed’ horizontal alignment and ‘justified’ and ‘distributed’ vertical alignments within cells. This is notably useful for CJK locales;
Support for 3 different formula syntaxes: Calc A1, Excel A1 and Excel R1C1;
Configurable argument and array separators in formula expressions;
External reference works within OFFSET function;
Hitting TAB during auto-complete commits current selection and moves to the next cell;
Shift-TAB cycles through auto-complete selections;
Find and replace skips those cells that are filtered out (thus hidden);
Protecting sheet provides two additional sheet protection options, to optionally limit cursor placement in protected and unprotected areas;
Copying a range highlights the range being copied. It also allows you to paste it by hitting ENTER key. Hitting ESC removes the range highlight;
Jumping to and from references in formula cells via “Ctrl-[” and “Ctrl-]”;
Cell cursor stays at the original cell during range selection.

Writer

AutoCorrections match case of the words that AutoCorrect replaces. (Issuezilla 2838);
Ability to turn off number recognition in Writer;
RTF export (from GSoc);
Port of Lotus Word Pro filter;
New dialog box for title page.

Impress/Draw

PPTX chart import feature;
[http://qa.openoffice.org/issues/show_bug.cgi?id=112421] make “Presenter Screen” default to laptop, not projector;
Improve randomization in “Dissolve” transition.

Math

Default to just printing the formula itself in Math;
[http://qa.openoffice.org/issues/show_bug.cgi?id=113400] Maths brackets malformed in presentation mode.

Base

[http://qa.openoffice.org/issues/show_bug.cgi?id=112597] Added display properties to control shapes.

Development

UNO APIs for size and moveProtect of notes;
Via Issuezilla bug #i80184: allow addition of drawing documents to gallery via API.

Productivity Enhancements

New custom properties handling;
Embedding of standard PDF fonts;
New “Narrow” font family;
Increased document protection in Writer and Calc;
Automatic decimals digits for “General” format in Calc;
1 million rows in a spreadsheet;
New options for CSV (Comma-Separated Value) importation in Calc;
Insert drawing objects in charts;
Hierarchical axis labels for charts;
Improved slide layout handling in Impress;
Manual setting for primary key support for databases;
Support of Read-Only database registration;
New Math command: ‘nospace’.

Internationalization

Additional locale data.

Usability and Interface

Common search toolbar;
New easier-to-use print interface;
More options for changing case;
Redesign of thesaurus;
Resetting of text to the default language in Writer;
Text rendering of form controls in Writer;
Changed defaults for charts;
Colored sheet tabs in Calc;
Adaptation to marked selection for filter area in Calc;
Sort dialog box for DataPilot in Calc;
Display custom names for DataPilot fields, items and totals in Calc.

Developer Features and Extensibility

Grid control enhancements;
New MetaData node for database;
Extending database drivers using extensions.

Make Numbers Easier to Read in OpenOffice Calc (helpdeskgeek.com)
Libre Office, Using Java To A Lesser Extent (lockergnome.com)
OpenOffice vs. Office 2011: Rooting for the Underdog (appreaders.com)
LibreOffice RC 3 now available (omgubuntu.co.uk)
Libre Office Beta 3 released (omgubuntu.co.uk)
Rumblings From the LibreOffice Camp Signal Good Things Ahead (ostatic.com)
LibreOffice 3.3 RC2 released, available for download (omgubuntu.co.uk)
LibreOffice: Ready for Liftoff (zdnet.com)
LibreOffice – The Likely Future of OpenOffice (maketecheasier.com)
Replace OpenOffice.org with LibreOffice in Ubuntu [Linux Tip] (lifehacker.com)
LibreOffice Ubuntu PPA makes installation easy (omgubuntu.co.uk)

Using R from within Python

Image via Wikipedia

I came across this excellent JSS paper at www.jstatsoft.org/v35/c02/paper

on a Python package called PypeR which allows you to use R from within Python using the pipe functionality.

It is an interesting package and given Python’s increasing buzz , one worthy to be checked out by people using or thinking Python in their packages.

























Citation:
	@article{Xia:McClelland:Wang:2010:JSSOBK:v35c02,
	  author =	"Xiao-Qin Xia and Michael McClelland and Yipeng Wang",
	  title =	"PypeR, A Python Package for Using R in Python",
	  journal =	"Journal of Statistical Software, Code Snippets",
	  volume =	"35",
	  number =	"2",
	  pages =	"1--8",
	  day =  	"30",
	  month =	"7",
	  year = 	"2010",
	  CODEN =	"JSSOBK",
	  ISSN = 	"1548-7660",
	  bibdate =	"2010-03-23",
	  URL =  	"http://www.jstatsoft.org/v35/c02",
	  accepted =	"2010-03-23",
	  acknowledgement = "",
	  keywords =	"",
	  submitted =	"2009-10-23",
	}

Pyke – Python Logic Programming (pyke.sourceforge.net)
Python Package Index : PyPI (pypi.python.org)
Machine Learning in Python (dorai.wordpress.com)
Paul Hummer: Book Report: Gray Hat Python (theironlion.net)
Sad that Python doesn’t moo? A reason to be happy (voidspace.org.uk)
PSF core grant, day 1 (sayspy.blogspot.com)
Keeping Python packages tidy on AppEngine. (stochastictechnologies.com)
python -me : a silly but useful command line trick (voidspace.org.uk)
pyhi: Python Package/Module Hello World with all the Trimmings (lingpipe-blog.com)

Interview Luis Torgo Author Data Mining with R

Example of k-nearest neighbour classification — Image via Wikipedia

Here is an interview with Prof Luis Torgo, author of the recent best seller “Data Mining with R-learning with case studies”.

Ajay- Describe your career in science. How do you think can more young people be made interested in science.

Luis- My interest in science only started after I’ve finished my degree. I’ve entered a research lab at the University of Porto and started working on Machine Learning, around 1990. Since then I’ve been involved generally in data analysis topics both from a research perspective as well as from a more applied point of view through interactions with industry partners on several projects. I’ve spent most of my career at the Faculty of Economics of the University of Porto, but since 2008 I’m at the department of Computer Science of the Faculty of Sciences of the same university. At the same time I’ve been a researcher at LIAAD / Inesc Porto LA (www.liaad.up.pt).

I like a lot what I do and like science and the “scientific way of thinking”, but I cannot say that I’ve always thought of this area as my “place”. Most of all I like solving challenging problems through data analysis. If that translates into some scientific outcome than I’m more satisfied but that is not my main goal, though I’m kind of “forced” to think about that because of the constraints of an academic career.

That does not mean I’m not passionate about science, I just think there are many more ways of “doing science” than what is reflected in the usual “scientific indicators” that most institutions seem to be more and more obsessed about.

Regards interesting young people in science that is a hard question that I’m not sure I’m qualified to answer. I do tend to think that young people are more sensible to concrete examples of problems they think are interesting and that science helps in solving, as a way of finding a motivation for facing the hard work they will encounter in a scientific career. I do believe in case studies as a nice way to learn and motivate, and thus my book 😉

Ajay- Describe your new book “Data Mining with R, learning with case studies” Why did you choose a case study based approach? who is the target audience? What is your favorite case study from the book

Luis- This book is about learning how to use R for data mining. The book follows a “learn by doing it” approach to data mining instead of the more common theoretical description of the available techniques in this discipline. This is accomplished by presenting a series of illustrative case studies for which all necessary steps, code and data are provided to the reader. Moreover, the book has an associated web page (www.liaad.up.pt/~ltorgo/DataMiningWithR) where all code inside the book is given so that easy copy-paste is possible for the more lazy readers.

The language used in the book is very informal without many theoretical details on the used data mining techniques. For obtaining these theoretical insights there are already many good data mining books some of which are referred in “further readings” sections given throughout the book. The decision of following this writing style had to do with the intended target audience of the book.

In effect, the objective was to write a monograph that could be used as a supplemental book for practical classes on data mining that exist in several courses, but at the same time that could be attractive to professionals working on data mining in non-academic environments, and thus the choice of this more practically oriented approach.

Regards my favorite case study that is a hard question for an author… still I would probably choose the “Predicting Stock Market Returns” case study (Chapter 3). Not only because I like this challenging problem, but mainly because the case study addresses all aspects of knowledge discovery in a real world scenario and not only the construction of predictive models. It tackles data collection, data pre-processing, model construction, transforming predictions into actions using different trading policies, using business-related performance metrics, implementing a trading simulator for “real-world” evaluation, and laying out grounds for constructing an online trading system.

Obviously, for all these steps there are far too many options to be possible to describe/evaluate all of them in a chapter, still I do believe that for the reader it is important to see the overall picture, and read about the relevant questions on this problem and some possible paths that can be followed at these different steps.

In other words: do not expect to become rich with the solution I describe in the chapter !

Ajay- Apart from R, what other data mining software do you use or have used in the past. How would you compare their advantages and disadvantages with R

Luis- I’ve played around with Clementine, Weka, RapidMiner and Knime, but really only playing with teaching goals, and no serious use/evaluation in the context of data mining projects. For the latter I mainly use R or software developed by myself (either in R or other languages). In this context, I do not think it is fair to compare R with these or other tools as I lack serious experience with them. I can however, tell you about what I see as the main pros and cons of R. The main reason for using R is really not only the power of the tool that does not stop surprising me in terms of what already exists and keeps appearing as contributions of an ever growing community, but mainly the ability of rapidly transforming ideas into prototypes. Regards some of its drawbacks I would probably mention the lack of efficiency when compared to other alternatives and the problem of data set sizes being limited by main memory.

I know that there are several efforts around for solving this latter issue not only from the community (e.g. http://cran.at.r-project.org/web/views/HighPerformanceComputing.html), but also from the industry (e.g. Revolution Analytics), but I would prefer that at this stage this would be a standard feature of the language so the the “normal” user need not worry about it. But then this is a community effort and if I’m not happy with the current status instead of complaining I should do something about it!

Ajay- Describe your writing habit- How do you set about writing the book- did you write a fixed amount daily or do you write in bursts etc

Luis- Unfortunately, I write in bursts whenever I find some time for it. This is much more tiring and time consuming as I need to read back material far too often, but I cannot afford dedicating too much consecutive time to a single task. Actually, I frequently tease my PhD students when they “complain” about the lack of time for doing what they have to, that they should learn to appreciate the luxury of having a single task to complete because it will probably be the last time in their professional life!

Ajay- What do you do to relax or unwind when not working?

Luis- For me, the best way to relax from work is by playing sports. When I’m involved in some game I reset my mind and forget about all other things and this is very relaxing for me. A part from sports I enjoy a lot spending time with my family and friends. A good and long dinner with friends over a good bottle of wine can do miracles when I’m too stressed with work! Finally,I do love traveling around with my family.

Luis Torgo

Short Bio: Luis Torgo has a degree in Systems and Informatics Engineering and a PhD in Computer Science. He is an Associate Professor of the Department of Computer Science of the Faculty of Sciences of the University of Porto. He is also a researcher of the Laboratory of Artificial Intelligence and Data Analysis (LIAAD) belonging to INESC Porto LA. Luis Torgo has been an active researcher in Machine Learning and Data Mining for more than 20 years. He has lead several academic and industrial Data Mining research projects. Luis Torgo accompanies the R project almost since its beginning, using it on his research activities. He teaches R at different levels and has given several courses in different countries.

For reading “Data Mining with R” – you can visit this site, also to avail of a 20% discount the publishers have generously given (message below)-

For more information and to place an order, visit us at http://www.crcpress.com. Order online and apply 20% Off discount code 907HM at checkout. CRC is pleased to offer free standard shipping on all online orders!

link to the book page http://www.crcpress.com/product/isbn/9781439810187

Price: $79.95
Cat. #: K10510
ISBN: 9781439810187
ISBN 10: 1439810184
Publication Date: November 09, 2010
Number of Pages: 305
Availability: In Stock
Binding(s): Hardback

Finally! A practical R book on Data Mining: “Data Mining With R, Learning with Case Studies,” by Luis Torgo (r-bloggers.com)
INFORMS Data Mining Competition leaders used Open Source software (r-bloggers.com)
Is Data-Mining Free Speech? The Supreme Court Agrees to Decide a Crucial Case (dailyfinance.com)
Mining of Massive Data Sets (kinlane.com)
Case Study (jonathanlewis.wordpress.com)
Statistical Aspects of Data Mining (kinlane.com)
5 of the Best Free and Open Source Data Mining Software (junauza.com)
US top court to decide state drug data mining law (reuters.com)
Data-mining Google Books: Does the Reader Have To Be Human? (scholarlykitchen.sspnet.org)
Data Mining Competitions | TunedIT (tunedit.org)

Quick-R and Statmethods.net

Image via Wikipedia

I was searching for some basic syntax in R (basically cross tabs and density plots) and I came across the Quick R site.

http://www.statmethods.net/

Its really a nice site for R beginners and anyone trying to remember some syntax.

R syntax can be very simple- a histoigram is just hist(), boxplot is just boxplot() and t test is just t.test(dataset)

Here is an example from the site-

http://www.statmethods.net/graphs/density.html

# Simple Histogram hist(mtcars$mpg)

click to view

# Colored Histogram with Different Number of Bins hist(mtcars$mpg, breaks=12, col="red")

click to view

# Add a Normal Curve (Thanks to Peter Dalgaard) x <- mtcars$mpg h<-hist(x, breaks=10, col="red", xlab="Miles Per Gallon", main="Histogram with Normal Curve") xfit<-seq(min(x),max(x),length=40) yfit<-dnorm(xfit,mean=mean(x),sd=sd(x)) yfit <- yfit*diff(h$mids[1:2])*length(x) lines(xfit, yfit, col="blue", lwd=2)

click to view

Histograms can be a poor method for determining the shape of a distribution because it is so strongly affected by the number of bins used.

KERNEL DENSITY PLOTS

Kernal density plots are usually a much more effective way to view the distribution of a variable. Create the plot using plot(density(x)) where x is a numeric vector.

# Kernel Density Plot d <- density(mtcars$mpg) # returns the density data plot(d) # plots the results

click to view

# Filled Density Plot d <- density(mtcars$mpg) plot(d, main="Kernel Density of Miles Per Gallon") polygon(d, col="red", border="blue")

click to view

COMPARING GROUPS VIA KERNAL DENSITY

The sm.density.compare( ) function in the sm package allows you to superimpose the kernal density plots of two or more groups. The format is sm.density.compare(x, factor) where x is a numeric vector and factor is the grouping variable.

# Compare MPG distributions for cars with # 4,6, or 8 cylinders library(sm) attach(mtcars)


# create value labels

cyl.f <- factor(cyl, levels= c(4,6,8),

labels = c("4 cylinder", "6 cylinder", "8 cylinder"))
# plot densities

sm.density.compare(mpg, cyl, xlab="Miles Per Gallon")

title(main="MPG Distribution by Car Cylinders")

# add legend via mouse click colfill<-c(2:(2+length(levels(cyl.f)))) legend(locator(1), levels(cyl.f), fill=colfill)

click to view

It is not as exhaustive as http://cran.r-project.org/doc/manuals/R-intro.html

but it is much more simpler and easy to follow.

The site is created by Robert I. Kabacoff, Ph.D.

and he is working on a book called “R in Action”

I have received numerous requests for a hardcopy version of this site, so over the past year I have been writing a book that takes the material here and significantly expands upon it. If you are interested, early access is available.

If you have not been to that website, I recommend it highly (though the tagline or logo of R for SAS/SPSS/Stata users seems a bit familiar)-http://www.statmethods.net/index.html

Quick-R

for SAS/SPSS/Stata Users

Two Thoughts on Lisp Syntax. (kazimirmajorinc.blogspot.com)
Some Basics about Stats (psipsychologytutor.org)
Bone Density Tests: A Clue to Your Future (webmd.com)
Net Access Corporation Unveils 50,000 Square Foot, State-of-the-Art Data Center in Parsippany, New Jersey (prweb.com)
programming languages – What makes lisp macros so special – Stack Overflow (stackoverflow.com)
Thinking about Syntax (latenightpc.com)
Our minds use syntax to understand actions, just like with language [Mad Psychology] (io9.com)
Syntax highlighting for Django using Pygments (ofbrooklyn.com)
People of HTML5 – Bruce Lawson (hacks.mozilla.org)
Haskell syntax vs. Lisp syntax | LispCast (lispcast.com)

Interview Ajay Ohri Decisionstats.com with DMR

From-

http://www.dataminingblog.com/data-mining-research-interview-ajay-ohri/

Here is the winner of the Data Mining Research People Award 2010: Ajay Ohri! Thanks to Ajay for giving some time to answer Data Mining Research questions. And all the best to his blog, Decision Stat!

Data Mining Research (DMR): Could you please introduce yourself to the readers of Data Mining Research?

Ajay Ohri (AO): I am a business consultant and writer based out of Delhi- India. I have been working in and around the field of business analytics since 2004, and have worked with some very good and big companies primarily in financial analytics and outsourced analytics. Since 2007, I have been writing my blog at http://decisionstats.com which now has almost 10,000 views monthly.

All in all, I wrote about data, and my hobby is also writing (poetry). Both my hobby and my profession stem from my education ( a masters in business, and a bachelors in mechanical engineering).

My research interests in data mining are interfaces (simpler interfaces to enable better data mining), education (making data mining less complex and accessible to more people and students), and time series and regression (specifically ARIMAX)
In business my research interests software marketing strategies (open source, Software as a service, advertising supported versus traditional licensing) and creation of technology and entrepreneurial hubs (like Palo Alto and Research Triangle, or Bangalore India).

DMR: I know you have worked with both SAS and R. Could you give your opinion about these two data mining tools?

AO: As per my understanding, SAS stands for SAS language, SAS Institute and SAS software platform. The terms are interchangeably used by people in industry and academia- but there have been some branding issues on this.
I have not worked much with SAS Enterprise Miner , probably because I could not afford it as business consultant, and organizations I worked with did not have a budget for Enterprise Miner.
I have worked alone and in teams with Base SAS, SAS Stat, SAS Access, and SAS ETS- and JMP. Also I worked with SAS BI but as a user to extract information.
You could say my use of SAS platform was mostly in predictive analytics and reporting, but I have a couple of projects under my belt for knowledge discovery and data mining, and pattern analysis. Again some of my SAS experience is a bit dated for almost 1 year ago.

I really like specific parts of SAS platform – as in the interface design of JMP (which is better than Enterprise Guide or Base SAS ) -and Proc Sort in Base SAS- I guess sequential processing of data makes SAS way faster- though with computing evolving from Desktops/Servers to even cheaper time shared cloud computers- I am not sure how long Base SAS and SAS Stat can hold this unique selling proposition.

I dislike the clutter in SAS Stat output, it confuses me with too much information, and I dislike shoddy graphics in the rendering output of graphical engine of SAS. Its shoddy coding work in SAS/Graph and if JMP can give better graphics why is legacy source code preventing SAS platform from doing a better job of it.

I sometimes think the best part of SAS is actually code written by Goodnight and Sall in 1970’s , the latest procs don’t impress me much.

SAS as a company is something I admire especially for its way of treating employees globally- but it is strange to see the rest of tech industry not following it. Also I don’t like over aggression and the SAS versus Rest of the Analytics /Data Mining World mentality that I sometimes pick up when I deal with industry thought leaders.

I think making SAS Enterprise Miner, JMP, and Base SAS in a completely new web interface priced at per hour rates is my wishlist but I guess I am a bit sentimental here- most data miners I know from early 2000’s did start with SAS as their first bread earning software. Also I think SAS needs to be better priced in Business Intelligence- it seems quite cheap in BI compared to Cognos/IBM but expensive in analytical licensing.

If you are a new stats or business student, chances are – you may know much more R than SAS today. The shift in education at least has been very rapid, and I guess R is also more of a platform than a analytics or data mining software.

I like a lot of things in R- from graphics, to better data mining packages, modular design of software, but above all I like the can do kick ass spirit of R community. Lots of young people collaborating with lots of young to old professors, and the energy is infectious. Everybody is a CEO in R ’s world. Latest data mining algols will probably start in R, published in journals.

Which is better for data mining SAS or R? It depends on your data and your deadline. The golden rule of management and business is -it depends.

Also I have worked with a lot of KXEN, SQL, SPSS.

DMR: Can you tell us more about Decision Stats? You have a traffic of 120′000 for 2010. How did you reach such a success?

AO: I don’t think 120,000 is a success. Its not a failure. It just happened- the more I wrote, the more people read.In 2007-2008 I used to obsess over traffic. I tried SEO, comments, back linking, and I did some black hat experimental stuff. Some of it worked- some didn’t.

In the end, I started asking questions and interviewing people. To my surprise, senior management is almost always more candid , frank and honest about their views while middle managers, public relations, marketing folks can be defensive.

Social Media helped a bit- Twitter, Linkedin, Facebook really helped my network of friends who I suppose acted as informal ambassadors to spread the word.
Again I was constrained by necessity than choices- my middle class finances ( I also had a baby son in 2007-my current laptop still has some broken keys – by my inability to afford traveling to conferences, and my location Delhi isn’t really a tech hub.

The more questions I asked around the internet, the more people responded, and I wrote it all down.

I guess I just was lucky to meet a lot of nice people on the internet who took time to mentor and educate me.

I tried building other websites but didn’t succeed so i guess I really don’t know. I am not a smart coder, not very clever at writing but I do try to be honest.

Basic economics says pricing is proportional to demand and inversely proportional to supply. Honest and candid opinions have infinite demand and an uncertain supply.

DMR: There is a rumor about a R book you plan to publish in 2011 Can you confirm the rumor and tell us more?

AO: I just signed a contract with Springer for ” R for Business Analytics”. R is a great software, and lots of books for statistically trained people, but I felt like writing a book for the MBAs and existing analytics users- on how to easily transition to R for Analytics.

Like any language there are tricks and tweaks in R, and with a focus on code editors, IDE, GUI, web interfaces, R’s famous learning curve can be bent a bit.

Making analytics beautiful, and simpler to use is always a passion for me. With 3000 packages, R can be used for a lot more things and a lot more simply than is commonly understood.
The target audience however is business analysts- or people working in corporate environments.

Brief Bio-
Ajay Ohri has been working in the field of analytics since 2004 , when it was a still nascent emerging Industries in India. He has worked with the top two Indian outsourcers listed on NYSE,and with Citigroup on cross sell analytics where he helped sell an extra 50000 credit cards by cross sell analytics .He was one of the very first independent data mining consultants in India working on analytics products and domestic Indian market analytics .He regularly writes on analytics topics on his web site www.decisionstats.com and is currently working on open source analytical tools like R besides analytical software like SPSS and SAS.

Skills of a good data miner (zyxo.wordpress.com)
Data Mining with WEKA (r-bloggers.com)
How Data Mining Can Help You Score on the First Date (volokh.com)
Upcoming webinar on investigative analytics (dbms2.com)
IBM SPSS 19 Now Available to the Global Academic Community via e-academy’s OnTheHub eStore (prweb.com)

Related Articles

Please share:

Related Articles

Please share:

What does LibreOffice give you?

General

Calc

Writer

Impress/Draw

Math

Base

Development

Productivity Enhancements

Internationalization

Usability and Interface

Developer Features and Extensibility

Related Articles

Please share:

Related Articles

Please share:

Related Articles

Please share:

KERNEL DENSITY PLOTS

COMPARING GROUPS VIA KERNAL DENSITY

Quick-R

for SAS/SPSS/Stata Users

Related Articles

Please share:

Related Articles

Please share: