Here is an interview with one of the younger researchers and rock stars of the R Project, John Myles White, co-author of Machine Learning for Hackers.
Ajay- What inspired you guys to write Machine Learning for Hackers. What has been the public response to the book. Are you planning to write a second edition or a next book?
John-We decided to write Machine Learning for Hackers because there were so many people interested in learning more about Machine Learning who found the standard textbooks a little difficult to understand, either because they lacked the mathematical background expected of readers or because it wasn’t clear how to translate the mathematical definitions in those books into usable programs. Most Machine Learning books are written for audiences who will not only be using Machine Learning techniques in their applied work, but also actively inventing new Machine Learning algorithms. The amount of information needed to do both can be daunting, because, as one friend pointed out, it’s similar to insisting that everyone learn how to build a compiler before they can start to program. For most people, it’s better to let them try out programming and get a taste for it before you teach them about the nuts and bolts of compiler design. If they like programming, they can delve into the details later.
Ajay- What are the key things that a potential reader can learn from this book?
John- We cover most of the nuts and bolts of introductory statistics in our book: summary statistics, regression and classification using linear and logistic regression, PCA and k-Nearest Neighbors. We also cover topics that are less well known, but are as important: density plots vs. histograms, regularization, cross-validation, MDS, social network analysis and SVM’s. I hope a reader walks away from the book having a feel for what different basic algorithms do and why they work for some problems and not others. I also hope we do just a little to shift a future generation of modeling culture towards regularization and cross-validation.
Ajay- Describe your journey as a science student up till your Phd. What are you current research interests and what initiatives have you done with them?
John-As an undergraduate I studied math and neuroscience. I then took some time off and came back to do a Ph.D. in psychology, focusing on mathematical modeling of both the brain and behavior. There’s a rich tradition of machine learning and statistics in psychology, so I got increasingly interested in ML methods during my years as a grad student. I’m about to finish my Ph.D. this year. My research interests all fall under one heading: decision theory. I want to understand both how people make decisions (which is what psychology teaches us) and how they should make decisions (which is what statistics and ML teach us). My thesis is focused on how people make decisions when there are both short-term and long-term consequences to be considered. For non-psychologists, the classic example is probably the explore-exploit dilemma. I’ve been working to import more of the main ideas from stats and ML into psychology for modeling how real people handle that trade-off. For psychologists, the classic example is the Marshmallow experiment. Most of my research work has focused on the latter: what makes us patient and how can we measure patience?
Ajay- How can academia and private sector solve the shortage of trained data scientists (assuming there is one)?
John- There’s definitely a shortage of trained data scientists: most companies are finding it difficult to hire someone with the real chops needed to do useful work with Big Data. The skill set required to be useful at a company like Facebook or Twitter is much more advanced than many people realize, so I think it will be some time until there are undergraduates coming out with the right stuff. But there’s huge demand, so I’m sure the market will clear sooner or later.
(TIL he has played in several rock bands!)
While Bob M, my old mentor and fellow TN man maintains the website http://r4stats.com/ how popular R is across various forums, I am interested in who within R community of 3 million (give or take a few) is contributing more. I am very sure by 2014, we can have a new fork of R called Hadley R, in which all packages would be made by Hadley Wickham and you wont need anything else.
But jokes apart, since I didnt have the time to
1) scrape CRAN for all package authors
2) scrape for lines of code across all packages
3) allocate lines of code (itself a dubious software productivity metric) to various authors of R packages-
1) scraping the entire and 2011’s R help list
2) determine who is the most frequent r question and answer user (ala SAS-L’s annual MVP and rookie of the year awards)
I did the following to atleast who is talking about R across easily scrapable Q and A websites
Stack Overflow still rules over all.
http://stackoverflow.com/tags/r/topusers shows the statistics on who made whom in R on Stack Overflow
All in all, initial ardour seems to have slowed for #Rstats on Stack Overflow ? or is it just summer?
No the answer- credit to Rob J Hyndman is most(?) activity is shifting to Stats Exchange
You could also paste this in Notepad and some graphs on Average Score / Answer or even make a social network graph if you had the time.
Do NOT (Go/Bi) search for Stack Overflow API or web scraping stack overflow- it gives you all the answers on the website but 0 answers on how to scrape these websites.
I have added a new website called Meta Optimize to this list based on Tal G’s interview of Joseph Turian, at http://www.r-statistics.com/2010/07/statistical-analysis-qa-website-did-stackoverflow-just-lose-it-to-metaoptimize-and-is-it-good-or-bad/
There are only 17 questions tagged R but it seems a lot of views is being generated.
I also decided to add views from Quora since it is Q and A site (and one which I really like)
Again very few questions but lot many followers