open source – DECISION STATS

Hack for Change #hackforchange

As part of National Day of Hacking, I took part in the hackathon in New York. Here were some insights

A few teams had come prepared reading the challenge questions. These teams had an advantage on time

Creating something in time was a big challenge ( how do you make a product in a single day)

Hackathon consists of 1) organizer giving challenge questions 2) people coming to venue 3) making teams 4) working together as team 5) presenting results (usually one person per team)

The idea is the most important in how relevant and closely aligned to the questions in hackthon you were. Creativity rules

The next important thing was making a balanced team in which everyone gels well, and have skill sets that are complementary ( one front end, one back end, one data scientist in Python, one person who is good at presentation etc)

The next important thing was not getting intimidated by other teams and working on your team idea till last moment

The presentation should be given to a person who is best at expressing 1) what you did 2) how the solution is innovative 3) how it is relevant and useful to challenge

Lastly have fun hacking. People who have fun hacking generally tend to be better hackers.

Screenshot from 2016-06-10 06:40:04

Running R and RStudio Server on Red Hat Linux RHEL #rstats

Installing R

sudo rpm -ivh http://dl.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm

(OR sudo rpm -ivh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm )

THEN

sudo yum install R

THEN

sudo R

(and to paste in Linux Window- just use Shift + Insert)

To Install RStudio (from http://www.rstudio.com/ide/download/server)

32-bit

wget http://download2.rstudio.org/rstudio-server-0.97.320-i686.rpm

sudo yum install --nogpgcheck rstudio-server-0.97.320-i686.rpm

OR 64-bit

wget http://download2.rstudio.org/rstudio-server-0.97.320-x86_64.rpm

sudo yum install --nogpgcheck rstudio-server-0.97.320-x86_64.rpm

Then

sudo rstudio-server verify-installation

Changing Firewalls in your RHEL

-Change to Root

```
sudo bash 
```

-Change directory

cd etc/sysconfig

-Read Iptables ( or firewalls file)

vi iptables

( to quite vi , press escape, then colon : then q )

-Change Iptables to open port 8787

/sbin/iptables -A INPUT -p tcp --dport 8787 -j ACCEPT

Add new user name (here newuser1)

sudo useradd newuser1

Change password in new user name

sudo passwd newuser1

Now just login to IPADDRESS:8787 with user name and password above

(credit- IBM SmartCloud Support ,http://www.youtube.com/watch?v=woVjq83gJkg&feature=player_embedded, Rstudio help, David Walker http://datamgmt.com/installing-r-and-rstudio-on-redhat-or-centos-linux/, www.google.com ,Michael Grieb)

The dichotomy in being a writer on open source with a non-open access publisher

The publisher adds credibility to your work

versus

A self fulfilling prophecy where researchers want to publish in exclusive journals and closed -access books, for the sole reason that others did so as well before them and thereby donate their knowledge and money to the publisher

The dichotomy in being a writer on open source with a non-open access publisher?

I write on open source R ,
and I have been published (one book )
and am on contract to write two more ( R for Cloud Computing) and (R for Web and Social Media Analytics)
My publisher does have open access journals.
But the book is at $50. Most of India lives at less than 2$ per day. Thats 800 million people in my country alone.

But the publisher is the most reputed in this field. So what are my choices? How do I get more people to have choices to read books.

Take open knowledge , curate it, and turn it behind a $50 paywall. I am sorry, Aaron. People like me are the reason ……

The making of a R startup Part 1 #rstats

Note- Decisionstats.com has done almost 105 interviews in the field of analytics, technology startups and thought leaders ( you can see them here http://goo.gl/m3l31). We have covered some of the R authors ( R for SAS and SPSS users, Data Mining using R, Machine Learning for Hackers) , and noted R package creators (ggplot2, RCommander, rattle GUI, forecast)

But what we truly enjoy is interviews with startups in R ecosystem , including founders of Revolution Analytics,Inference for R, RStudio, Cloudnumbers

The latest startup in the R ecosystem with a promising product is RApporter.net . It has actually been there for some time, but with the launch of their new product we ask them the trials and tribulations of creating an open source startup in the data science field.

This is part 1 of the interview with Gergely Daróczi, co-founder of the Rapporter project.

Ajay- Describe the journey of Rapporter till now, and your product plans for 2013.

Greg- The idea of Rapporter presented itself more then 3 years ago while giving statistics, SPSS and R courses at different Hungarian universities and also creating custom statistical reports for a number of companies for a living at the same time.
Long story short, the three Hungarian co-founder faced similar problems at both sectors: students, just like business clients, admired the capabilities of R and the wide variety of tools found on CRAN,but were not eager at all to get into learn how to use that.
So we tried to make up some plans how to let the non-R users also build on the resources of R, and we came up with the idea of an intuitive web-interface as an R front-end.

The real development of a helper R package (which later become “rapport”) started in the January of 2011 by Aleksandar Blagotić and me1 in our spare time and rather just for fun, as we had a dream about using “annotated statistical templates” in R after a few conversations on StackOverflow. We also worked on a front-end in the means of an Rserve driven PHP engine with MySQL – to be dropped and completely rewritten later after some trying experiences and serious benchmarking.

We have released “rapport” package to the public at the end of 2011 on GitHub, and after a few weeks on CRAN too. Despite the fact that we did our best with creating a decent documentation and also some live examples, we somehow forgot to spread the news of the new package to the R community, so “rapport” did not attract any serious attention.

Even so, our enthusiasm for annotated R “templates” did not wane as time passed, so we continued to work on “rapport” by adding new features and also Aleksandar started to fortify his Ruby on Rails skills. We also dropped Rserve with MySQL back-end, and introduced Jeffrey Horner’s awesome RApache with some NoSQL databases.
To be honest, this change resulted in a one-year delay of releasing Rapporter and no ends of headaches on our end, but in the long run, it was a really smart move after all, as we own an easily scalable and a highly available cluster of servers at the moment.

But back to 2012.

As “rapport” got too complex as time passed with newly added features, Aleksandar and I decided to split the package, which move gave birth to “pander”. At that time “knitr” got more and more familiar among R users, so it was a brave move to release “another” similar package, but the roots of “pander” were more then one year old, we used some custom methods not available in “knitr” (like
capturing the R object beside the printed output of chunks), we needed tweakable global options instead of chunk options and we really wanted to build on the power of Pandoc – just like before.

So we had a package for converting R objects to Pandoc’s markdown with a general S3 method, another package to automatically run that and also capture plots and images a brew-like document with various output formats – like pdf, docx, odt etc.
In the summer, while Aleksandar dealt with the web interface, I worked on some new features in our packages:
• automatic and robust caching of chunks with various options for performance reasons,
• automatically unifying “base”, “lattice” and “ggplot2” images to the same style with user options – like major/minor grid color, font family, color palette, margins etc.
• adding other global options to “pander”, to let our expected clients later personalize their
custom report style with a few clicks.

At the same time, we were searching for different options to prevent running malicious code in the parallel R sessions, which might compromise all our users’ sensitive data. Unfortunately no full blown solution existed at that time, and we really wanted to stand clear of running some Java based interpreters in our network.
So I started to create a parser for R commands, which was supposed to filter out malicious R commands before evaluation, and a handful flu got me some spare time to implement “sandboxR” with an open and live “hack my R server” demo, which ended up in a great challenge on my side, but proved to really work after all.
I also had a few conversations with Jeroen Ooms (the author of the awesome OpenCPU), who faced similar problems on his servers and was eager to prevent the issues with the help of AppArmor. The great news of “RAppArmor” did make “sandboxR” needless (as AppArmor just cannot regulate inner R calls), but we started to evaluate all user specified R commands in a separate hat, which allowed me to make “sanboxR” more permissive with black-filtered functions.
In the middle of the summer, I realized that we have an almost working web application with any number of R workers being able to serve tons of users based on the flexible NoSQL database back- ends, but we had no legal background to release such a service, nor had I any solid financial background to found one – moreover the Rapporter project already took huge amount from my family budget.

As I was against of letting some venture capital to dominate the project, and did not found any accelerator that would take on a project with a maturing, almost market-ready product, me and a few associates decided to found a UK company on our own and having confidence in the future and God.

So we founded Easystats Ltd, the company running rapporter.net, in July, and decided to release the first beta and pretty stable version of the application to the public at the end of September. At that time users could:
• upload and use text or SPSS sav data sets,
• specify more then 20 global options to be applied to all generated reports (like plot themes, table width, date format, decimal mark and number of digits, separators and copula in vectors etc.),
• create reports with the help of predefined statistical “templates”,
• “fork” (clone) any of our templates and modify without restriction, or create new statistical templates from scratch,
• edit the body or remove any part of the reports, resize images with the mouse or even with finger on touch-devices,
• and export reports to pdf, odt or docx formats.

A number of new features were introduced since then:

OpenBUGS integration with more permissive security profiles, users can create custom styles for the exported documents (in LaTeX, docx and odt format) to generate unique and possibly branded reports, to share public or even private reports with anyone without the need for registering on rapporter.net by a simple hyperlink, and to let our users to integrate their templates in any homepage, blog post or even HTML mail, so that let anyone use the power of R with a few clicks building on the knowledge of template authors and our reliable back-end.
Although 2 years ago I was pretty sure that this job would be finished in a few months and that we would possibly have a successful project in a year or two, now I am certain, that bunch of new features will make Rapporter more and more user-friendly, intuitive and extensible in the next few years.
Currently, we are working hard on a redesigned GUI with the help of a dedicated UX team at last (which was a really important structural change in the life of Rapporter, as we can really assign and split tasks now just like we dreamed of when the project was a two-men show), which is to be finished no later then the first quarter of the year. Beside design issues, this change would also result
in some new features, like ordering the templates, data sets and reports by popularity, rating or relevance for the currently active data set; and also letting users to alter the style of the resulting reports in a more seamless way.

The next planned tasks for 2013 include:
• a “data transformation” front-end, which would let users to rename and label variables in any uploaded data set, specify the level of measurement, recode/categorize or create new variables with the help of existing ones and/or any R functions,
• edit tables in reports on the fly (change the decimal mark, highlight some elements, rename columns and split tables to multiple pages with a simple click),
• a more robust API to let third-party users temporary upload data to be used in the analysis,
• option to use multiple data sets in a template and to let users merge or connect data online,
• and some top-secret surprises.

Beside the above tasks, which was made up by us, our team is really interested in any feedback from the users, which might change the above order or add new tasks with higher priority, so be sure to add your two cent on our support page.

And we will have to come up with some account plans with reasonable pricing in 2013 for the hosted service to let us cover the server fees and development expenses. But of course Rapporter will remain free for ever for users with basic needs (like analyzing data sets with only a few hundreds of cases) or anyone in the academic sector, and we also plan to provide an option to run Rapporter “off-site” on any Unix-like environment.

Ajay- What are some of the Big Data use cases I can do with Rapporter?

Greg- Although we have released Rapporter beta only a few months ago, we already heard some pretty promising use-cases from our (potential) clients.

But I must emphasize that at first we are not committed to deal with Big Data in the means of user contributed data sets with billions of cases, but rather concentrating on providing an intuitive and responsive way of analyzing traditional, survey-like data frames up to about 100.000 cases.

Anyway, to be on topic: a really promising project of Optimum Dosing Strategies has been using Rapporter’s API for a number of weeks even in 2012 to compute optimal doses for different kind of antibiotics based on Monte-Carlo simulation and Bayesian adaptive feedback among other methods.
This collaboration lets the ID-ODS team develop a powerful calculator with full-blown reports ready to be attached to medical records – without any special technical knowledge on their side, as we maintain the R engine and the integration part, they code in R. This results in pleased clients all over the world, which makes us happy too.

We really look forward to ship a number of educational templates to be used in real life at several (multilingual) universities from September 2013. These templates would let teachers show customizable and interactive reports to the students with any number of comments and narrative paragraphs, which statistical introductory modules would provide a free alternative to other desktop
software used in education.

In the next few months, a part of our team will focus on spatial analysis templates, which would mean that our users could not just map, but really analyze any of their spatially related data with a few clicks and clear parameters.

Another feature request of a client seems to be a really exciting idea. Currently, Google Analytics and other tracking services provide basic options to view, filter and export the historical data of websites, blogs etc.
As creating an interface between Rapporter and the tracking services to be able to fetch the most recent data is not beyond possibility any more with the help of existing API resources, so our clients could generate annotated usage reports of any specified period of time – without restrictions. Just to emphasize some potential add-ons: using the time-series R packages in the analysis or creating real- time “dashboards” with optional forecasts about live data.

Of course you could think of other kind of live or historical data instead of Google Analytics, as creating a template for e.g. transaction data or gas usage of a household could be addressed at any time, and please do not forget about the above referenced use-cases in the 3 rd question (“[…]Rapporter can help: […]”).

But wait: the beauty of Rapporter is that you could implement all of the above ideas by yourself in our system, even without any help from us.

Ajay- What are some of things that can be easily done with Rapporter than with your plain vanilla R?

Greg- Rapporter is basically developed for creating reproducible, literative and annotated statistical modules (a.k.a. “templates”), which means the passing a data set and the list of variables with some optional arguments would end up in a full-blown written report with automatically styled tables and charts.

So using Rapporter is like writing “Sweave” or “knitr” documents, but you write the template only once, and then apply that to any number of data sets with a simple click on an intuitive user interface.

Beside this major objective: as Rapporter is running in the cloud and sharing reports and templates (or even data sets) with collaborators or with anyone on the Internet is really easy, our users can post, share any R code for free and without restrictions or release the templates with specified license and/or fees in a secured environment.

This means that Rapporter can help:

scholars sharing scientific results or methods with reproducible and instantly available demo and/or dedicated implementation along with publications,
teachers to create self-explanatory statistical templates which would help the students internationalize the subject by practice,
any R developer to share a live and interactive demo of the implemented features of the functions with a few clicks,
businesses could use a statistical platform without restrictions for a reasonable monthly fee instead of expensive and non-portable statistical programs,
governments and national statistical offices to publicize census or other big data with a scientific and reliable analytic tool with annotated and clear reports while insuring the anonymity of the respondents by automatically applying custom methods (like data swapping, rounding, micro-aggregation, PRAM, adding noise etc.) to the tables and results, etc.

And of course, do not forget about one of our main objectives to let us open up the world of R to non-R users too with an intuitive, driving user interface.

(To be continued)-

About

Gergely Daróczi is co-ordinating the development of Rapporter and maintaining their R packages. Beside he tries to be active in some open-source projects and on StackOverflow, he is a PhD candidate in sociology and also a lecturer at Corvinus University of Budapest and Pázmány Péter Catholic University in Hungary

Rapporter is a web application helping you to create comprehensive, reliable statistical reports on any mobile device or PC, using an intuitive user interface.

The application builds on the power of R beside other technologies and intended to be used in any browser doing the heavy computations on the server side. Some might consider Rapporter as a customizable graphical user interface to R – running in the cloud.

Currently, Rapporter is under heavily development and only invited alpha testers can access the application. Please sign up for an invitation if you want to have an early-bird insight on Rapporter.

Data Frame in Python

Exploring some Python Packages and R packages to move /work with both Python and R without melting your brain or exceeding your project deadline

—————————————

If you liked the data.frame structure in R, you have some way to work with them at a faster processing speed in Python.

Here are three packages that enable you to do so-

(1) pydataframe http://code.google.com/p/pydataframe/

An implemention of an almost R like DataFrame object. (install via Pypi/Pip: “pip install pydataframe”)

Usage:

        u = DataFrame( { "Field1": [1, 2, 3],
                        "Field2": ['abc', 'def', 'hgi']},
                        optional:
                         ['Field1', 'Field2']
                         ["rowOne", "rowTwo", "thirdRow"])

A DataFrame is basically a table with rows and columns.

Columns are named, rows are numbered (but can be named) and can be easily selected and calculated upon. Internally, columns are stored as 1d numpy arrays. If you set row names, they’re converted into a dictionary for fast access. There is a rich subselection/slicing API, see help(DataFrame.get_item) (it also works for setting values). Please note that any slice get’s you another DataFrame, to access individual entries use get_row(), get_column(), get_value().

DataFrames also understand basic arithmetic and you can either add (multiply,…) a constant value, or another DataFrame of the same size / with the same column names, like this:

#multiply every value in ColumnA that is smaller than 5 by 6.
my_df[my_df[:,'ColumnA'] < 5, 'ColumnA'] *= 6

#you always need to specify both row and column selectors, use : to mean everything
my_df[:, 'ColumnB'] = my_df[:,'ColumnA'] + my_df[:, 'ColumnC']

#let's take every row that starts with Shu in ColumnA and replace it with a new list (comprehension)
select = my_df.where(lambda row: row['ColumnA'].startswith('Shu'))
my_df[select, 'ColumnA'] = [row['ColumnA'].replace('Shu', 'Sha') for row in my_df[select,:].iter_rows()]

Dataframes talk directly to R via rpy2 (rpy2 is not a prerequiste for the library!)

(2) pandas http://pandas.pydata.org/

Library Highlights

A fast and efficient DataFrame object for data manipulation with integrated indexing;
Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
Flexible reshaping and pivoting of data sets;
Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
Columns can be inserted and deleted from data structures for size mutability;
Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
High performance merging and joining of data sets;
Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
Time series-functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;
The library has been ruthlessly optimized for performance, with critical code paths compiled to C;
Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.

Why not R?

First of all, we love open source R! It is the most widely-used open source environment for statistical modeling and graphics, and it provided some early inspiration for pandas features. R users will be pleased to find this library adopts some of the best concepts of R, like the foundational DataFrame (one user familiar with R has described pandas as “R data.frame on steroids”). But pandas also seeks to solve some frustrations common to R users:

R has barebones data alignment and indexing functionality, leaving much work to the user. pandas makes it easy and intuitive to work with messy, irregularly indexed data, like time series data. pandas also provides rich tools, like hierarchical indexing, not found in R;
R is not well-suited to general purpose programming and system development. pandas enables you to do large-scale data processing seamlessly when developing your production applications;
Hybrid systems connecting R to a low-productivity systems language like Java, C++, or C# suffer from significantly reduced agility and maintainability, and you’re still stuck developing the system components in a low-productivity language;
The “copyleft” GPL license of R can create concerns for commercial software vendors who want to distribute R with their software under another license. Python and pandas use more permissive licenses.

(3) datamatrix http://pypi.python.org/pypi/datamatrix/0.8

datamatrix 0.8

A Pythonic implementation of R’s data.frame structure.

Latest Version: 0.9

This module allows access to comma- or other delimiter separated files as if they were tables, using a dictionary-like syntax. DataMatrix objects can be manipulated, rows and columns added and removed, or even transposed

—————————————————————–

Modeling in Python

Continue reading “Data Frame in Python”

JMP Student Edition

I really liked the initiatives at JMP/Academic. Not only they offer the software bundled with a textbook, which is both good common sense as well as business sense given how fast students can get confused

(Rant 1 Bundling with textbooks is something I think is Revolution Analytics should think of doing instead of just offering the academic version for free downloading- it would be interesting to see the penetration of R academic market with Revolution’s version and the open source version with the existing strategy)

From http://www.jmp.com/academic/textbooks.shtml

Major publishers of introductory statistics textbooks offer a 12-month license to JMP Student Edition, a streamlined version of JMP, with their textbooks.

and a glance through this http://www.jmp.com/academic/pdf/jmp_se_comparison.pdf shows it is a credible and not extremely whittled down version which would be just dishonest.

And I loved this Reference Card at http://www.jmp.com/academic/pdf/jmp10_se_quick_guide.pdf

Oracle, SAP- Hana, Revolution Analytics and even SAS/STAT itself can make more reference cards like this- elegant solutions for students and new learners!

More- creative-rants Honestly why do corporate sites use PDFs anymore when they can use Instapaper , or any of these SlideShare/Scribd formats to show information in a better way without diverting the user from the main webpage.

But I digress, back to JMP

Resources for Faculty Using JMP^® Student Edition

Faculty who select a JMP Student Edition bundle for their courses may be eligible for additional resources, including course materials and training.

Special JMP^® Student Edition for AP Statistics

JMP Student Edition is available in a convenient five-year license for qualified Advanced Placement statistics programs.

Try and have a look yourself at http://www.jmp.com/academic/student.shtml

Using Rapid Miner and R for Sports Analytics #rstats

Rapid Miner has been one of the oldest open source analytics software, long long before open source or even analytics was considered a fashion buzzword. The Rapid Miner software has been a pioneer in many areas (like establishing a marketplace for Rapid Miner Extensions) and the Rapid Miner -R extension was one of the most promising enablers of using R in an enterprise setting.

The following interview was taken with a manager of analytics for a sports organization. The sports organization considers analytics as a strategic differentiator , hence the name is confidential. No part of the interview has been edited or manipulated.

Ajay- Why did you choose Rapid Miner and R? What were the other software alternatives you considered and discarded?

Analyst- We considered most of the other major players in statistics/data mining or enterprise BI. However, we found that the value proposition for an open source solution was too compelling to justify the premium pricing that the commercial solutions would have required. The widespread adoption of R and the variety of packages and algorithms available for it, made it an easy choice. We liked RapidMiner as a way to design structured, repeatable processes, and the ability to optimize learner parameters in a systematic way. It also handled large data sets better than R on 32-bit Windows did. The GUI, particularly when 5.0 was released, made it more usable than R for analysts who weren’t experienced programmers.

Ajay- What analytics do you do think Rapid Miner and R are best suited for?

Analyst- We use RM+R mainly for sports analysis so far, rather than for more traditional business applications. It has been quite suitable for that, and I can easily see how it would be used for other types of applications.

Ajay- Any experiences as an enterprise customer? How was the installation process? How good is the enterprise level support?

Analyst- Rapid-I has been one of the most responsive tech companies I’ve dealt with, either in my current role or with previous employers. They are small enough to be able to respond quickly to requests, and in more than one case, have fixed a problem, or added a small feature we needed within a matter of days. In other cases, we have contracted with them to add larger pieces of specific functionality we needed at reasonable consulting rates. Those features are added to the mainline product, and become fully supported through regular channels. The longer consulting projects have typically had a turnaround of just a few weeks.

Ajay- What challenges if any did you face in executing a pure open source analytics bundle ?

Analyst- As Rapid-I is a smaller company based in Europe, the availability of training and consulting in the USA isn’t as extensive as for the major enterprise software players, and the time zone differences sometimes slow down the communications cycle. There were times where we were the first customer to attempt a specific integration point in our technical environment, and with no prior experiences to fall back on, we had to work with Rapid-I to figure out how to do it. Compared to the what traditional software vendors provide, both R and RM tend to have sparse, terse, occasionally incomplete documentation. The situation is getting better, but still lags behind what the traditional enterprise software vendors provide.

Ajay- What are the things you can do in R ,and what are the things you prefer to do in Rapid Miner (comparison for technical synergies)

Analyst- Our experience has been that RM is superior to R at writing and maintaining structured processes, better at handling larger amounts of data, and more flexible at fine-tuning model parameters automatically. The biggest limitation we’ve had with RM compared to R is that R has a larger library of user-contributed packages for additional data mining algorithms. Sometimes we opted to use R because RM hadn’t yet implemented a specific algorithm. The introduction the R extension has allowed us to combine the strengths of both tools in a very logical and productive way.

In particular, extending RapidMiner with R helped address RM’s weakness in the breadth of algorithms, because it brings the entire R ecosystem into RM (similar to how Rapid-I implemented much of the Weka library early on in RM’s development). Further, because the R user community releases packages that implement new techniques faster than the enterprise vendors can, this helps turn a potential weakness into a potential strength. However, R packages tend to be of varying quality, and are more prone to go stale due to lack of support/bug fixes. This depends heavily on the package’s maintainer and its prevalence of use in the R community. So when RapidMiner has a learner with a native implementation, it’s usually better to use it than the R equivalent.

Please share:

Please share:

Please share:

Please share:

Library Highlights

Why not R?

Please share:

Please share:

Please share: