Interview John Myles White , Machine Learning for Hackers

Here is an interview with one of the younger researchers  and rock stars of the R Project, John Myles White,  co-author of Machine Learning for Hackers.

Ajay- What inspired you guys to write Machine Learning for Hackers. What has been the public response to the book. Are you planning to write a second edition or a next book?

John-We decided to write Machine Learning for Hackers because there were so many people interested in learning more about Machine Learning who found the standard textbooks a little difficult to understand, either because they lacked the mathematical background expected of readers or because it wasn’t clear how to translate the mathematical definitions in those books into usable programs. Most Machine Learning books are written for audiences who will not only be using Machine Learning techniques in their applied work, but also actively inventing new Machine Learning algorithms. The amount of information needed to do both can be daunting, because, as one friend pointed out, it’s similar to insisting that everyone learn how to build a compiler before they can start to program. For most people, it’s better to let them try out programming and get a taste for it before you teach them about the nuts and bolts of compiler design. If they like programming, they can delve into the details later.

We once said that Machine Learning for Hackers  is supposed to be a chemistry set for Machine Learning and I still think that’s the right description: it’s meant to get readers excited about Machine Learning and hopefully expose them to enough ideas and tools that they can start to explore on their own more effectively. It’s like a warmup for standard academic books like Bishop’s.
The public response to the book has been phenomenal. It’s been amazing to see how many people have bought the book and how many people have told us they found it helpful. Even friends with substantial expertise in statistics have said they’ve found a few nuggets of new information in the book, especially regarding text analysis and social network analysis — topics that Drew and I spend a lot of time thinking about, but are not thoroughly covered in standard statistics and Machine Learning  undergraduate curricula.
I hope we write a second edition. It was our first book and we learned a ton about how to write at length from the experience. I’m about to announce later this week that I’m writing a second book, which will be a very short eBook for O’Reilly. Stay tuned for details.

Ajay-  What are the key things that a potential reader can learn from this book?

John- We cover most of the nuts and bolts of introductory statistics in our book: summary statistics, regression and classification using linear and logistic regression, PCA and k-Nearest Neighbors. We also cover topics that are less well known, but are as important: density plots vs. histograms, regularization, cross-validation, MDS, social network analysis and SVM’s. I hope a reader walks away from the book having a feel for what different basic algorithms do and why they work for some problems and not others. I also hope we do just a little to shift a future generation of modeling culture towards regularization and cross-validation.

Ajay- Describe your journey as a science student up till your Phd. What are you current research interests and what initiatives have you done with them?

John-As an undergraduate I studied math and neuroscience. I then took some time off and came back to do a Ph.D. in psychology, focusing on mathematical modeling of both the brain and behavior. There’s a rich tradition of machine learning and statistics in psychology, so I got increasingly interested in ML methods during my years as a grad student. I’m about to finish my Ph.D. this year. My research interests all fall under one heading: decision theory. I want to understand both how people make decisions (which is what psychology teaches us) and how they should make decisions (which is what statistics and ML teach us). My thesis is focused on how people make decisions when there are both short-term and long-term consequences to be considered. For non-psychologists, the classic example is probably the explore-exploit dilemma. I’ve been working to import more of the main ideas from stats and ML into psychology for modeling how real people handle that trade-off. For psychologists, the classic example is the Marshmallow experiment. Most of my research work has focused on the latter: what makes us patient and how can we measure patience?

Ajay- How can academia and private sector solve the shortage of trained data scientists (assuming there is one)?

John- There’s definitely a shortage of trained data scientists: most companies are finding it difficult to hire someone with the real chops needed to do useful work with Big Data. The skill set required to be useful at a company like Facebook or Twitter is much more advanced than many people realize, so I think it will be some time until there are undergraduates coming out with the right stuff. But there’s huge demand, so I’m sure the market will clear sooner or later.

The changes that are required in academia to prepare students for this kind of work are pretty numerous, but the most obvious required change is that quantitative people need to be learning how to program properly, which is rare in academia, even in many CS departments. Writing one-off programs that no one will ever have to reuse and that only work on toy data sets doesn’t prepare you for working with huge amounts of messy data that exhibit shifting patterns. If you need to learn how to program seriously before you can do useful work, you’re not very valuable to companies who need employees that can hit the ground running. The companies that have done best in building up data teams, like LinkedIn, have learned to train people as they come in since the proper training isn’t typically available outside those companies.
Of course, on the flipside, the people who do know how to program well need to start learning more about theory and need to start to have a better grasp of basic mathematical models like linear and logistic regressions. Lots of CS students seem not to enjoy their theory classes, but theory really does prepare you for thinking about what you can learn from data. You may not use automata theory if you work at Foursquare, but you will need to be able to reason carefully and analytically. Doing math is just like lifting weights: if you’re not good at it right now, you just need to dig in and get yourself in shape.
About-
John Myles White is a Phd Student in  Ph.D. student in the Princeton Psychology Department, where he studies human decision-making both theoretically and experimentally. Along with the political scientist Drew Conway, he is  the author of a book published by O’Reilly Media entitled “Machine Learning for Hackers”, which is meant to introduce experienced programmers to the machine learning toolkit. He is also working with Mark Hansenon a book for laypeople about exploratory data analysis.John is the lead maintainer for several R packages, including ProjectTemplate and log4r.

(TIL he has played in several rock bands!)

—–
You can read more in his own words at his blog at http://www.johnmyleswhite.com/about/
He can be contacted via social media at Google Plus at https://plus.google.com/109658960610931658914 or twitter at twitter.com/johnmyleswhite/

Interview Kelci Miclaus, SAS Institute Using #rstats with JMP

Here is an interview with Kelci Miclaus, a researcher working with the JMP division of the SAS Institute, in which she demonstrates examples of how the R programming language is a great hit with JMP customers who like to be flexible.

 

Ajay- How has JMP been using integration with R? What has been the feedback from customers so far? Is there a single case study you can point out where the combination of JMP and R was better than any one of them alone?

Kelci- Feedback from customers has been very positive. Some customers are using JMP to foster collaboration between SAS and R modelers within their organizations. Many are using JMP’s interactive visualization to complement their use of R. Many SAS and JMP users are using JMP’s integration with R to experiment with more bleeding-edge methods not yet available in commercial software. It can be used simply to smooth the transition with regard to sending data between the two tools, or used to build complete custom applications that take advantage of both JMP and R.

One customer has been using JMP and R together for Bayesian analysis. He uses R to create MCMC chains and has found that JMP is a great tool for preparing the data for analysis, as well as displaying the results of the MCMC simulation. For example, the Control Chart platform and the Bubble Plot platform in JMP can be used to quickly verify convergence of the algorithm. The use of both tools together can increase productivity since the results of an analysis can be achieved faster than through scripting and static graphics alone.

I, along with a few other JMP developers, have written applications that use JMP scripting to call out to R packages and perform analyses like multidimensional scaling, bootstrapping, support vector machines, and modern variable selection methods. These really show the benefit of interactive visual analysis of coupled with modern statistical algorithms. We’ve packaged these scripts as JMP add-ins and made them freely available on our JMP User Community file exchange. Customers can download them and now employ these methods as they would a regular JMP platform. We hope that our customers familiar with scripting will also begin to contribute their own add-ins so a wider audience can take advantage of these new tools.

(see http://www.decisionstats.com/jmp-and-r-rstats/)

Ajay- Are there plans to extend JMP integration with other languages like Python?

Kelci- We do have plans to integrate with other languages and are considering integrating with more based on customer requests. Python has certainly come up and we are looking into possibilities there.

 Ajay- How is R a complimentary fit to JMP’s technical capabilities?

Kelci- R has an incredible breadth of capabilities. JMP has extensive interactive, dynamic visualization intrinsic to its largely visual analysis paradigm, in addition to a strong core of statistical platforms. Since our brains are designed to visually process pictures and animated graphs more efficiently than numbers and text, this environment is all about supporting faster discovery. Of course, JMP also has a scripting language (JSL) allowing you to incorporate SAS code, R code, build analytical applications for others to leverage SAS, R and other applications for users who don’t code or who don’t want to code.

JSL is a powerful scripting language on its own. It can be used for dialog creation, automation of JMP statistical platforms, and custom graphic scripting. In other ways, JSL is very similar to the R language. It can also be used for data and matrix manipulation and to create new analysis functions. With the scripting capabilities of JMP, you can create custom applications that provide both a user interface and an interactive visual back-end to R functionality. Alternatively, you could create a dashboard using statistical and/or graphical platforms in JMP to explore the data and with the click of a button, send a portion of the data to R for further analysis.

Another JMP feature that complements R is the add-in architecture, which is similar to how R packages work. If you’ve written a cool script or analysis workflow, you can package it into a JMP add-in file and send it to your colleagues so they can easily use it.

Ajay- What is the official view on R from your organization? Do you think it is a threat, or a complimentary product or another statistical platform that coexists with your offerings?

Kelci- Most definitely, we view R as complimentary. R contributors are providing a tremendous service to practitioners, allowing them to try a wide variety of methods in the pursuit of more insight and better results. The R community as a whole is providing a valued role to the greater analytical community by focusing attention on newer methods that hold the most promise in so many application areas. Data analysts should be encouraged to use the tools available to them in order to drive discovery and JMP can help with that by providing an analytic hub that supports both SAS and R integration.

Ajay-  While you do use R, are there any plans to give back something to the R community in terms of your involvement and participation (say at useR events) or sponsoring contests.

 Kelci- We are certainly open to participating in useR groups. At Predictive Analytics World in NY last October, they didn’t have a local useR group, but they did have a Predictive Analytics Meet-up group comprised of many R users. We were happy to sponsor this. Some of us within the JMP division have joined local R user groups, myself included.  Given that some local R user groups have entertained topics like Excel and R, Python and R, databases and R, we would be happy to participate more fully here. I also hope to attend the useR! annual meeting later this year to gain more insight on how we can continue to provide tools to help both the JMP and R communities with their work.

We are also exploring options to sponsor contests and would invite participants to use their favorite tools, languages, etc. in pursuit of the best model. Statistics is about learning from data and this is how we make the world a better place.

About- Kelci Miclaus

Kelci is a research statistician developer for JMP Life Sciences at SAS Institute. She has a PhD in Statistics from North Carolina State University and has been using SAS products and R for several years. In addition to research interests in statistical genetics, clinical trials analysis, and multivariate analysis/visualization methods, Kelci works extensively with JMP, SAS, and R integration.

.

 

Interview JJ Allaire Founder, RStudio

Here is an interview with JJ Allaire, founder of RStudio. RStudio is the IDE that has overtaken other IDE within the R Community in terms of ease of usage. On the eve of their latest product launch, JJ talks to DecisionStats on RStudio and more.

Ajay-  So what is new in the latest version of RStudio and how exactly is it useful for people?

JJ- The initial release of RStudio as well as the two follow-up releases we did last year were focused on the core elements of using R: editing and running code, getting help, and managing files, history, workspaces, plots, and packages. In the meantime users have also been asking for some bigger features that would improve the overall work-flow of doing analysis with R. In this release (v0.95) we focused on three of these features:

Projects. R developers tend to have several (and often dozens) of working contexts associated with different clients, analyses, data sets, etc. RStudio projects make it easy to keep these contexts well separated (with distinct R sessions, working directories, environments, command histories, and active source documents), switch quickly between project contexts, and even work with multiple projects at once (using multiple running versions of RStudio).

Version Control. The benefits of using version control for collaboration are well known, but we also believe that solo data analysis can achieve significant productivity gains by using version control (this discussion on Stack Overflow talks about why). In this release we introduced integrated support for the two most popular open-source version control systems: Git and Subversion. This includes changelist management, file diffing, and browsing of project history, all right from within RStudio.

Code Navigation. When you look at how programmers work a surprisingly large amount of time is spent simply navigating from one context to another. Modern programming environments for general purpose languages like C++ and Java solve this problem using various forms of code navigation, and in this release we’ve brought these capabilities to R. The two main features here are the ability to type the name of any file or function in your project and go immediately to it; and the ability to navigate to the definition of any function under your cursor (including the definition of functions within packages) using a keystroke (F2) or mouse gesture (Ctrl+Click).

Ajay- What’s the product road map for RStudio? When can we expect the IDE to turn into a full fledged GUI?

JJ- Linus Torvalds has said that “Linux is evolution, not intelligent design.” RStudio tries to operate on a similar principle—the world of statistical computing is too deep, diverse, and ever-changing for any one person or vendor to map out in advance what is most important. So, our internal process is to ship a new release every few months, listen to what people are doing with the product (and hope to do with it), and then start from scratch again making the improvements that are considered most important.

Right now some of the things which seem to be top of mind for users are improved support for authoring and reproducible research, various editor enhancements including code folding, and debugging tools.

What you’ll see is us do in a given release is to work on a combination of frequently requested features, smaller improvements to usability and work-flow, bug fixes, and finally architectural changes required to support current or future feature requirements.

While we do try to base what we work on as closely as possible on direct user-feedback, we also adhere to some core principles concerning the overall philosophy and direction of the product. So for example the answer to the question about the IDE turning into a full-fledged GUI is: never. We believe that textual representations of computations provide fundamental advantages in transparency, reproducibility, collaboration, and re-usability. We believe that writing code is simply the right way to do complex technical work, so we’ll always look for ways to make coding better, faster, and easier rather than try to eliminate coding altogether.

Ajay -Describe your journey in science from a high school student to your present work in R. I noticed you have been very successful in making software products that have been mostly proprietary products or sold to companies.

Why did you get into open source products with RStudio? What are your plans for monetizing RStudio further down the line?

JJ- In high school and college my principal areas of study were Political Science and Economics. I also had a very strong parallel interest in both computing and quantitative analysis. My first job out of college was as a financial analyst at a government agency. The tools I used in that job were SAS and Excel. I had a dim notion that there must be a better way to marry computation and data analysis than those tools, but of course no concept of what this would look like.

From there I went more in the direction of general purpose computing, starting a couple of companies where I worked principally on programming languages and authoring tools for the Web. These companies produced proprietary software, which at the time (between 1995 and 2005) was a workable model because it allowed us to build the revenue required to fund development and to promote and distribute the software to a wider audience.

By 2005 it was however becoming clear that proprietary software would ultimately be overtaken by open source software in nearly all domains. The cost of development had shrunken dramatically thanks to both the availability of high-quality open source languages and tools as well as the scale of global collaboration possible on open source projects. The cost of promoting and distributing software had also collapsed thanks to efficiency of both distribution and information diffusion on the Web.

When I heard about R and learned more about it, I become very excited and inspired by what the project had accomplished. A group of extremely talented and dedicated users had created the software they needed for their work and then shared the fruits of that work with everyone. R was a platform that everyone could rally around because it worked so well, was extensible in all the right ways, and most importantly was free (as in speech) so users could depend upon it as a long-term foundation for their work.

So I started RStudio with the aim of making useful contributions to the R community. We started with building an IDE because it seemed like a first-rate development environment for R that was both powerful and easy to use was an unmet need. Being aware that many other companies had built successful businesses around open-source software, we were also convinced that we could make RStudio available under a free and open-source license (the AGPLv3) while still creating a viable business. At this point RStudio is exclusively focused on creating the best IDE for R that we can. As the core product gets where it needs to be over the next couple of years we’ll then also begin to sell other products and services related to R and RStudio.

About-

http://rstudio.org/docs/about

Jjallaire

JJ Allaire

JJ Allaire is a software engineer and entrepreneur who has created a wide variety of products including ColdFusion,Windows Live WriterLose It!, and RStudio.

From http://en.wikipedia.org/wiki/Joseph_J._Allaire
In 1995 Joseph J. (JJ) Allaire co-founded Allaire Corporation with his brother Jeremy Allaire, creating the web development tool ColdFusion.[1] In March 2001, Allaire was sold to Macromedia where ColdFusion was integrated into the Macromedia MX product line. Macromedia was subsequently acquired by Adobe Systems, which continues to develop and market ColdFusion.
After the sale of his company, Allaire became frustrated at the difficulty of keeping track of research he was doing using Google. To address this problem, he co-founded Onfolio in 2004 with Adam Berrey, former Allaire co-founder and VP of Marketing at Macromedia.
On March 8, 2006, Onfolio was acquired by Microsoft where many of the features of the original product are being incorporated into the Windows Live Toolbar. On August 13, 2006, Microsoft released the public beta of a new desktop blogging client called Windows Live Writer that was created by Allaire’s team at Microsoft.
Starting in 2009, Allaire has been developing a web-based interface to the widely used R technical computing environment. A beta version of RStudio was publicly released on February 28, 2011.
JJ Allaire received his B.A. from Macalester College (St. Paul, MN) in 1991.
RStudio-

RStudio is an integrated development environment (IDE) for R which works with the standard version of R available from CRAN. Like R, RStudio is available under a free software license. RStudio is designed to be as straightforward and intuitive as possible to provide a friendly environment for new and experienced R users alike. RStudio is also a company, and they plan to sell services (support, training, consulting, hosting) related to the open-source software they distribute.

Oracle Public Cloud

The slick website of Oracle Public Cloud- coming soon to an office near your location.

 

and including the oracle social network 😉

http://cloud.oracle.com/mycloud/f?p=service:social:0

Oracle Social Network

A secure collaboration tool for everyone you work with.

Public Cloud

http://cloud.oracle.com/mycloud/f?p=5001:1:0#

 

Google Cloud Connect

 

Interestingly a Google Plugin to share Microsoft Office on the Cloud.

Google Cloud Connect is a plug-in for Microsoft Office® 2003, 2007, and 2010 that lets you share and edit Microsoft Word, PowerPoint, and Excel documents simultaneously with other people in your organization. You get the collaboration benefits of Google Docs, while still using Microsoft Office.

http://tools.google.com/dlpage/cloudconnect

Google Cloud Connect for Microsoft Office

Google Cloud Connect for Microsoft Office brings collaborative multi-person editing to the familiar Microsoft® Office experience. You can share, backup, and simultaneously edit Microsoft Word, PowerPoint®, and Excel® documents with coworkers.

Watch the videos below to learn how Google Cloud Connect teaches your old docs new tricks.

Learn how Cloud Connect helped Mazda Raceway Laguna Seca (English only)

Help and information for users
Help and information for administrators

Download Google Cloud Connect

It’s free and downloads in seconds.

Requirements:

  • Windows XP with .NET Framework 2.0, Windows Vista, or Windows 7
  • Microsoft Office 2003, Office 2007, or Office 2010

Scholarships for students via #rstatsjobs and R-lings

A vector drawing of the University of York coa...
Image via Wikipedia

Outstandingly attractive scholarships are available for students willing to travel to Yorkshire. Thats where the Battle of Roses was fought by the British Royal Family.

see http://en.wikipedia.org/wiki/Wars_of_the_Roses

Emphasis  and spaces in email above are made by me.

Message from Dr Top i   bell ow-


It is not New York but very old York, in the North of England.

The scholarships carry a tax-free stipend and financial assistance will be
given for travel expenses to and from York. Accommodation for successful
students is available on the University of York Campus.

For information about the tax-free stipend please write to
scholarships@yccsa.org.

Continue reading “Scholarships for students via #rstatsjobs and R-lings”