So many R Packages Everywhere, which one do I use? #rstats

Some thoughts on R Packages

  • CRAN is no longer the sole repository for many useful R packages. Alternatives include R-Forge, Google Code and, increasingly, GitHub.
  • CRAN lacks the flexibility and social aspect of GitHub.
  • CRAN Task Views are the only subject-wide listing of R packages. The categorization, however, is based more on methods than on use cases or business domains.
  • There are multiple R packages for the same thing. Which one do I use? Only Stack Overflow helps with that. There is no rating and no recommendation system.
  • The "suggested packages" feature of R packages needs better, automated association analysis. Right now it is manual and depends on the package author and maintainer.
  • Quis custodiet ipsos custodes? Who guards the guardians of R packages? In an era of cyber security, we need better transparency on security measures within R packages, especially given the international nature of the project. I am very sure I (or anyone) can create R code to communicate discreetly, especially on Windows.

  • I would rather not install anything on my local machine, and instead read the package directly from CRAN. CRAN was designed in an era of low bandwidth; this needs to be upgraded.
  • Note that I am respectfully refraining from commenting on the atrocious aesthetics of the home website. Many statisticians see no use in making R user friendly. My professors at U Tenn (from which I dropped out after 2 semesters) were horrified when I took courses in graphic design, as I wanted to know more about the A and the B that make up the A/B testing of statistical design. Now that I am getting older, I am horrified by the lack of HTML, CSS and jQuery skills among some of the brightest programmers in this project.
  • Please comment below.

 

Interview Jeff Allen Trestle Technology #rstats #rshiny

Here is an interview with Jeff Allen, who works with R and the new package Shiny in his technology startup. We featured his RGL demo in our list of Shiny demos here.

Ajay- Describe how you started using R. What are some of the benefits you noticed on moving to R?

Jeff- I began using R in an internship while working on my undergraduate degree. I was provided with some unformatted R code and asked to modularize the code then wrap it up into an R package for distribution alongside a publication.

To be honest, as a Computer Science student with training more heavily emphasizing the big high-level languages, R took some getting used to for me. It wasn’t until after I concluded that initial project and began using R to do my own data analysis that I began to realize its potential and value. It was the first scripting language which really made interactive use appealing to me: the experience of exploring a dataset in R was unlike anything

Continue reading “Interview Jeff Allen Trestle Technology #rstats #rshiny”

Shiny 0.3 released. New era for #rstats

Message from Winston Chang of RStudio.

—-

We’ve released Shiny 0.3.0, and it’s available on CRAN now. Glimmer will be updated with the latest version of Shiny some time later today.
To update your installation of Shiny, run:
  install.packages("shiny")
Highlights of the new version include:
* Some bugs were fixed in `reactivePrint()` and `reactiveText()`, so they now have slightly different rules for collecting the output. Please be aware that some changes to your apps’ text output are possible. The help pages for these functions explain the behavior.
* New `runGitHub()` function, which can run apps directly from a repository on GitHub
* New `runUrl()` function, which can run apps stored as zip or tar files on a remote web server.
* New `isolate()` function, which allows you to access reactive values (from input) without making the function dependent on them.
* Improved scheduling of evaluation of reactive functions, which should reduce the number of “extra” times a reactive function is called.
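
To make the `isolate()` item concrete, here is a minimal sketch. It is not from the announcement: it uses function names from later Shiny releases (`fluidPage()`, `actionButton()`, `renderPlot()`, `shinyApp()`), and the helper name `make_isolate_demo` and the input ids `n` and `go` are made up for illustration. The plot depends on the button only; the sample size is read through `isolate()`, so editing it does not trigger a redraw by itself.

```r
# Sketch only: assumes a recent 'shiny' release, not the 0.3.0 API.
# make_isolate_demo, "n", "go" and "hist" are hypothetical names.
make_isolate_demo <- function() {
  ui <- shiny::fluidPage(
    shiny::numericInput("n", "Sample size", value = 100, min = 1),
    shiny::actionButton("go", "Draw"),
    shiny::plotOutput("hist")
  )

  server <- function(input, output) {
    output$hist <- shiny::renderPlot({
      input$go                      # reactive dependency: redraw on each click
      n <- shiny::isolate(input$n)  # read the value without depending on it
      hist(rnorm(n), main = paste("n =", n))
    })
  }

  # Return the app object; nothing is launched here.
  shiny::shinyApp(ui = ui, server = server)
}

# To launch interactively (requires shiny to be installed):
# shiny::runApp(make_isolate_demo())
```

Without the `isolate()` call, every change to the sample size box would re-run the plot, which is exactly the dependency the new function lets you avoid.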

Data Visualization for R packages at Github #rstats

I noticed this article some time back by the most excellent hacker, John Myles White (author of Machine Learning for Hackers)

http://www.johnmyleswhite.com/notebook/2012/08/12/the-social-dynamics-of-the-r-core-team/

Professor John Fox, whom we have interviewed here as the creator of R Commander, spoke on this at useR! 2008: http://www.statistik.uni-dortmund.de/useR-2008/slides/Fox.pdf

I also noticed that the R Project is stuck on SVN (yes or no? comment please) while some part of the rest of the world has moved on to Git. See http://en.wikipedia.org/wiki/Git_%28software%29

Is Git really that good compared to SVN? http://stackoverflow.com/questions/871/why-is-git-better-than-subversion

Maybe. I think that with 5000 packages and more, the R Project needs to have more presence on GitHub and at least consider Git for the distributed and international project R is becoming.

Continue reading “Data Visualization for R packages at Github #rstats”

Little Book of R For Time Series #rstats

I loved this book. Only 75 pages, very lucidly written, and available on GitHub for free. Nice job by Avril Coghlan a.coghlan@ucc.ie

Of course, my usual suspects for time series reading are:

1) The seminal pdf (2008!!) by  a certain Prof Hyndman

Rtimeseries-ohp.pdf

2) JSS paper: Automatic Time Series Forecasting: The forecast Package for R http://www.jstatsoft.org/v27/i03/paper

3) The CRAN View http://cran.r-project.org/web/views/TimeSeries.html

This view is cluttered, and getting more so. Some guidance for recent converts to R, especially in the field of corporate forecasting or time series for business analytics, would really help.

Avril does an awesome job with this curiously named ( 😉 ) booklet  at http://a-little-book-of-r-for-time-series.readthedocs.org/en/latest/src/timeseries.html

Interview BigML.com

Here is an interview with Charlie Parker, head of large scale online algorithms at http://bigml.com

Ajay- Describe your own personal background in scientific computing, and how you came to be involved with machine learning, cloud computing and BigML.com.

Charlie- I am a machine learning Ph.D. from Oregon State University. Francisco Martin (our founder and CEO), Adam Ashenfelter (the lead developer on the tree algorithm), and myself were all studying machine learning at OSU around the same time. We all went our separate ways after that.

Francisco started Strands and turned it into a 100+ million dollar company building recommender systems. Adam worked for CleverSet, a probabilistic modeling company that was eventually sold to Cisco, I believe. I worked for several years in the research labs at Eastman Kodak on data mining, text analysis, and computer vision.

When Francisco left Strands to start BigML, he brought in Justin Donaldson who is a brilliant visualization guy from Indiana, and an ex-Googler named Jose Ortega who is responsible for most of our data infrastructure. They pulled in Adam and I a few months later. We also have Poul Petersen, a former Strands employee, who manages our herd of servers. He is a wizard and makes everyone else’s life much easier.

Ajay- You use Clojure for the back end of BigML.com. Are there any other languages and packages you are considering? What makes Clojure such a good fit for cloud computing?

Charlie- Clojure is a great language because it offers you all of the benefits of Java (extensive libraries, cross-platform compatibility, easy integration with things like Hadoop, etc.) but has the syntactical elegance of a functional language. This makes our code base small and easy to read as well as powerful.

We’ve had occasional issues with speed, but that just means writing the occasional function or library in Java. As we build towards processing data at the Terabyte level, we’re hoping to create a framework that is language-agnostic to some extent. So if we have some great machine learning code in C, for example, we’ll use Clojure to tie everything together, but the code that does the heavy lifting will still be in C. For the API and Web layers, we use Python and Django, and Justin is a huge fan of HaXe for our visualizations.

Ajay- Current support is for Decision Trees. When can we see SVM, K-Means Clustering and Logit Regression?

Charlie- Right now we’re focused on perfecting our infrastructure and giving you new ways to put data in the system, but expect to see more algorithms appearing in the next few months. We want to make sure they are as beautiful and easy to use as the trees are. Without giving too much away, the first new thing we will probably introduce is an ensemble method of some sort (such as Boosting or Bagging). Clustering is a little further away but we’ll get there soon!

Ajay- How can we use the BigML.com API from R and Python?

Charlie- We have a public GitHub repo for the language bindings. https://github.com/bigmlcom/io Right now, there are only bash scripts but that should change very soon. The Python bindings should be there in a matter of days, and the R bindings in probably a week or two. Clojure and Java bindings should follow shortly after that. We’ll have a blog post about it each time we release a new language binding. http://blog.bigml.com/

Ajay- How can we predict large numbers of observations using a model that has been built and pruned (model scoring)?

Charlie- We are in the process of refactoring our backend right now for better support for batch prediction and model evaluation. This is something that is probably only a few weeks away. Keep your eye on our blog for updates!

Ajay- How can we export models built in BigML.com for scoring data locally?

Charlie- This is as simple as a call to our API. https://bigml.com/developers/models The call gives you a JSON object representing the tree that is roughly equivalent to a PMML-style representation.
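
As a rough illustration of what local scoring from an exported tree could look like in R, here is a minimal sketch. The nested-list layout below (`field`, `threshold`, `left`, `right`, `prediction`) is a made-up stand-in, not BigML's actual JSON schema, and the function names `predict_one` and `score` are hypothetical; a real export would first need to be parsed and mapped to whatever structure the API returns.

```r
# Hypothetical tree layout (NOT BigML's real JSON schema): internal nodes
# carry a field name and a numeric threshold, leaves carry a prediction.
tree <- list(
  field = "petal_length", threshold = 2.5,
  left  = list(prediction = "setosa"),
  right = list(
    field = "petal_width", threshold = 1.75,
    left  = list(prediction = "versicolor"),
    right = list(prediction = "virginica")
  )
)

# Walk from the root to a leaf for a single one-row data frame.
predict_one <- function(node, row) {
  while (is.null(node$prediction)) {
    go_left <- row[[node$field]] <= node$threshold
    node <- if (go_left) node$left else node$right
  }
  node$prediction
}

# Score every row of a data frame; returns a character vector.
score <- function(tree, data) {
  vapply(
    seq_len(nrow(data)),
    function(i) predict_one(tree, data[i, , drop = FALSE]),
    character(1)
  )
}

new_data <- data.frame(
  petal_length = c(1.4, 4.5, 6.0),
  petal_width  = c(0.2, 1.5, 2.1)
)
score(tree, new_data)
```

The loop visits one node per tree level, so scoring cost grows with tree depth rather than tree size, which is what makes local batch scoring from an exported tree cheap.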

About-

You can read about Charlie Parker at http://www.linkedin.com/pub/charles-parker/11/85b/4b5 and the rest of the BigML team at https://bigml.com/team

 
