Tag Archives: Analytics
Dhiraj Rajaram was recently featured in the Economic Times as the CEO and founder of India’s first analytics startup with a billion-dollar valuation.
This year, the company which employs 2,500 people across a development centre in Bangalore and offices in the US, UK and Australia, will build a data analytics lab in the US and hire 400 data scientists there.
I first met Dhiraj in Q1 2008 for a job. We didn’t reach an agreement, partly because I needed to be close to my son (who was 4 months old), and I ended up taking a contract with another Bangalore-based company. What impressed me at that time was something I rarely see in India’s analytics entrepreneurs-
1) A Grand Vision- Dhiraj said- I am trying to build the largest math factory in the world.
2) Focus- Dhiraj was focused only on analytics projects. No quick and easy outsourcing of low-end tasks for him.
3) Positivity- Not once during the entire two hour interaction did he say a negative word on competition, attrition, challenges, pressures.
4) Flamboyance- I wonder sometimes why a colorful culture like India’s ends up with people being so meek in corporate culture. Dhiraj was probably one of the most flamboyant senior analytics leaders.
But there were some concerns I had in Q1 2008, including plans for an IPO (I thought that was early) and senior management flux (the COO left within a few months).
Anyway, Dhiraj grew the 200-strong team to around 900 by Q3 2010. He again called me for a job interview, and again we found that there was nothing I was really good at for an analytics company, given my interest in open source, blogging and writing books, and my morbid fear of managing people in operations. However, I noticed some changes-
- There were greater signs of a process-driven orientation (including messages to keep meetings short)
- There were newer people in senior management
- Dhiraj was slightly more restrained in his frank talk (given his increasing stature and the demands on his time and attention)
- I loved the sign on his office- Jugad. Literally that means ingenuity in Hindi, and it shows a glimpse of the maverick, brilliant and flamboyant nature of the CEO.
Again, there were some odd points. Mu Sigma continued to have the perception (true or false, I don’t know) of high attrition at junior levels. There were also rumours that Dhiraj had become a bit autocratic in management (of which I found no sign). I found that the biggest problem that Mu Sigma and Dhiraj had was that they were creating enemies just by shaking up the slow IT services mindset of India, where easy money was available just through low-quality labor arbitrage. This cultural opposition to anything new (like a pure analytics company) or anything rapid (like a company that scales up organically) could have stopped lesser men, but Mu Sigma moved on.
So it was quite nice to read the news that, finally, an Indian company had broken the 1 billion mark. Allow me some leeway here. I truly believe analytics and maths have no nationality. But if you see the rampant poverty in India, what we need is more aggressive and impatient businessmen like Mr Rajaram, rather than the “chalta hain” (“it is okay”) attitude.
Dhiraj and team, take a bow. You make us proud!
Ajay- Describe how you started using R. What are some of the benefits you noticed on moving to R?
Jeff- I began using R in an internship while working on my undergraduate degree. I was provided with some unformatted R code and asked to modularize the code then wrap it up into an R package for distribution alongside a publication.
To be honest, as a Computer Science student with training more heavily emphasizing the big high-level languages, R took some getting used to for me. It wasn’t until after I concluded that initial project and began using R to do my own data analysis that I began to realize its potential and value. It was the first scripting language which really made interactive use appealing to me — the experience of exploring a dataset in R was unlike anything (more…)
Note- Decisionstats.com has done almost 105 interviews in the field of analytics, technology startups and thought leaders (you can see them here http://goo.gl/m3l31). We have covered some of the R authors (R for SAS and SPSS users, Data Mining using R, Machine Learning for Hackers) and noted R package creators (ggplot2, RCommander, rattle GUI, forecast).
The latest startup in the R ecosystem with a promising product is Rapporter.net. It has actually been around for some time, but with the launch of their new product we ask them about the trials and tribulations of creating an open source startup in the data science field.
This is part 1 of the interview with Gergely Daróczi, co-founder of the Rapporter project.
Ajay- Describe the journey of Rapporter till now, and your product plans for 2013.
Greg- The idea of Rapporter presented itself more than 3 years ago, while I was giving statistics, SPSS and R courses at different Hungarian universities and, at the same time, creating custom statistical reports for a number of companies for a living.
Long story short, the three Hungarian co-founders faced similar problems in both sectors: students, just like business clients, admired the capabilities of R and the wide variety of tools found on CRAN, but were not at all eager to learn how to use them.
So we tried to come up with plans for letting non-R users also build on the resources of R, and we arrived at the idea of an intuitive web interface as an R front-end.
The real development of a helper R package (which later became “rapport”) was started in January 2011 by Aleksandar Blagotić and me in our spare time, and rather just for fun, as we had a dream about using “annotated statistical templates” in R after a few conversations on StackOverflow. We also worked on a front-end in the form of an Rserve-driven PHP engine with MySQL, which was later dropped and completely rewritten after some trying experiences and serious benchmarking.
We released the “rapport” package to the public at the end of 2011 on GitHub, and a few weeks later on CRAN too. Although we did our best to create decent documentation and some live examples, we somehow forgot to spread the news of the new package to the R community, so “rapport” did not attract any serious attention.
Even so, our enthusiasm for annotated R “templates” did not wane as time passed, so we continued to work on “rapport” by adding new features, and Aleksandar also started to fortify his Ruby on Rails skills. We also dropped Rserve with its MySQL back-end, and introduced Jeffrey Horner’s awesome RApache with some NoSQL databases.
To be honest, this change resulted in a one-year delay in releasing Rapporter and no end of headaches on our side, but in the long run it was a really smart move after all, as we now own an easily scalable and highly available cluster of servers.
But back to 2012.
As “rapport” got too complex over time with newly added features, Aleksandar and I decided to split the package, a move which gave birth to “pander”. At that time “knitr” was becoming more and more familiar among R users, so it was a brave move to release “another” similar package, but the roots of “pander” were more than one year old, we used some custom methods not available in “knitr” (like capturing the R object beside the printed output of chunks), we needed tweakable global options instead of chunk options, and we really wanted to build on the power of Pandoc – just like before.
So we had a package for converting R objects to Pandoc’s markdown with a general S3 method, and another package to automatically run that and also capture plots and images in a brew-like document with various output formats – like pdf, docx, odt etc.
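As a rough illustration of the “one generic method, many object types” idea described above, here is a minimal sketch in Python rather than R. It is not the “pander” implementation: the function name `to_pandoc` and the rendering choices are assumptions made for this example, and `functools.singledispatch` only stands in for R’s S3 dispatch.

```python
from functools import singledispatch


@singledispatch
def to_pandoc(obj) -> str:
    """Fallback method: render any object as plain inline text."""
    return str(obj)


@to_pandoc.register
def _(obj: list) -> str:
    # A list becomes a markdown bullet list, one item per line.
    return "\n".join(f"* {to_pandoc(item)}" for item in obj)


@to_pandoc.register
def _(obj: dict) -> str:
    # A dict of equal-length columns (string keys) becomes a pipe table.
    headers = list(obj)
    lines = [
        "| " + " | ".join(headers) + " |",
        "|" + "|".join("---" for _ in headers) + "|",
    ]
    for row in zip(*obj.values()):
        lines.append("| " + " | ".join(to_pandoc(cell) for cell in row) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_pandoc(["mean", "median"]))
    print(to_pandoc({"var": ["age", "income"], "n": [120, 118]}))
```

The caller only ever uses `to_pandoc(x)`; supporting a new object type means registering one more method, which is roughly the convenience a general S3 method gives in R.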
In the summer, while Aleksandar dealt with the web interface, I worked on some new features in our packages:
• automatic and robust caching of chunks with various options for performance reasons,
• automatically unifying “base”, “lattice” and “ggplot2” images to the same style with user options – like major/minor grid color, font family, color palette, margins etc.
• adding other global options to “pander”, to let our expected clients later personalize their custom report style with a few clicks.
At the same time, we were searching for options to prevent malicious code from running in the parallel R sessions, which might compromise all our users’ sensitive data. Unfortunately no full-blown solution existed at that time, and we really wanted to steer clear of running Java-based interpreters in our network.
So I started to create a parser for R commands, which was supposed to filter out malicious R commands before evaluation, and a bout of flu gave me some spare time to implement “sandboxR” with an open and live “hack my R server” demo, which turned out to be a great challenge on my side, but proved to really work after all.
I also had a few conversations with Jeroen Ooms (the author of the awesome OpenCPU), who faced similar problems on his servers and was eager to prevent the issues with the help of AppArmor. The great news of “RAppArmor” did not make “sandboxR” needless (as AppArmor just cannot regulate inner R calls), but we started to evaluate all user-specified R commands in a separate hat, which allowed me to make “sandboxR” more permissive with blacklisted functions.
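The approach described above, parsing a command and rejecting it before evaluation if it calls a blacklisted function, can be sketched as follows. This is a Python analogue built on the standard `ast` module, not the “sandboxR” code itself (which parses R), and the `BLACKLIST` contents are purely illustrative assumptions.

```python
import ast

# Hypothetical blacklist, chosen only for illustration.
BLACKLIST = {"eval", "exec", "open", "compile", "__import__"}


def forbidden_calls(source: str) -> set:
    """Return the names of blacklisted functions called anywhere in `source`."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            # Plain calls like open(...) or attribute calls like builtins.open(...)
            if isinstance(func, ast.Name):
                name = func.id
            else:
                name = getattr(func, "attr", None)
            if name in BLACKLIST:
                found.add(name)
    return found


def is_safe(source: str) -> bool:
    """True if the parsed source contains no call to a blacklisted function."""
    return not forbidden_calls(source)


if __name__ == "__main__":
    print(is_safe("sum([1, 2, 3])"))
    print(forbidden_calls("data = open('secrets.txt').read()"))
```

A name-based filter like this is easy to evade (for example by binding a blacklisted function to another name before calling it), so it is best seen as one layer next to OS-level confinement rather than a replacement for it.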
In the middle of the summer, I realized that we had an almost working web application with any number of R workers able to serve tons of users based on the flexible NoSQL database back-ends, but we had no legal background to release such a service, nor did I have any solid financial background to found one – moreover, the Rapporter project had already taken a huge amount from my family budget.
As I was against letting venture capital dominate the project, and did not find any accelerator that would take on a project with a maturing, almost market-ready product, a few associates and I decided to found a UK company on our own, having confidence in the future and God.
So we founded Easystats Ltd, the company running rapporter.net, in July, and decided to release the first beta and pretty stable version of the application to the public at the end of September. At that time users could:
• upload and use text or SPSS sav data sets,
• specify more than 20 global options to be applied to all generated reports (like plot themes, table width, date format, decimal mark and number of digits, separators and copula in vectors etc.),
• create reports with the help of predefined statistical “templates”,
• “fork” (clone) any of our templates and modify without restriction, or create new statistical templates from scratch,
• edit the body or remove any part of the reports, resize images with the mouse or even with finger on touch-devices,
• and export reports to pdf, odt or docx formats.
A number of new features have been introduced since then:
• OpenBUGS integration with more permissive security profiles,
• custom styles for the exported documents (in LaTeX, docx and odt format) to generate unique and possibly branded reports,
• sharing public or even private reports with anyone through a simple hyperlink, without the need to register on rapporter.net,
• embedding templates in any homepage, blog post or even HTML mail, so that anyone can use the power of R with a few clicks, building on the knowledge of template authors and our reliable back-end.
Although 2 years ago I was pretty sure that this job would be finished in a few months and that we would possibly have a successful project in a year or two, now I am certain that a bunch of new features will make Rapporter more and more user-friendly, intuitive and extensible in the next few years.
Currently, we are at last working hard on a redesigned GUI with the help of a dedicated UX team (which was a really important structural change in the life of Rapporter, as we can now really assign and split tasks just like we dreamed of when the project was a two-man show), which is to be finished no later than the first quarter of the year. Besides design issues, this change will also result in some new features, like ordering the templates, data sets and reports by popularity, rating or relevance for the currently active data set, and also letting users alter the style of the resulting reports in a more seamless way.
The next planned tasks for 2013 include:
• a “data transformation” front-end, which would let users rename and label variables in any uploaded data set, specify the level of measurement, recode/categorize or create new variables with the help of existing ones and/or any R functions,
• editing tables in reports on the fly (change the decimal mark, highlight some elements, rename columns and split tables to multiple pages with a simple click),
• a more robust API to let third-party users temporarily upload data to be used in the analysis,
• option to use multiple data sets in a template and to let users merge or connect data online,
• and some top-secret surprises.
Besides the above tasks, which we came up with ourselves, our team is really interested in any feedback from the users, which might change the above order or add new tasks with higher priority, so be sure to add your two cents on our support page.
And in 2013 we will have to come up with some account plans with reasonable pricing for the hosted service, to let us cover the server fees and development expenses. But of course Rapporter will remain free forever for users with basic needs (like analyzing data sets with only a few hundred cases) or anyone in the academic sector, and we also plan to provide an option to run Rapporter “off-site” on any Unix-like environment.
Ajay- What are some of the Big Data use cases I can do with Rapporter?
Greg- Although we released the Rapporter beta only a few months ago, we have already heard some pretty promising use-cases from our (potential) clients.
But I must emphasize that at first we are not committed to dealing with Big Data in the sense of user-contributed data sets with billions of cases, but are rather concentrating on providing an intuitive and responsive way of analyzing traditional, survey-like data frames of up to about 100,000 cases.
Anyway, to be on topic: a really promising project of Optimum Dosing Strategies has been using Rapporter’s API since late 2012 to compute optimal doses for different kinds of antibiotics, based on Monte-Carlo simulation and Bayesian adaptive feedback among other methods.
This collaboration lets the ID-ODS team develop a powerful calculator with full-blown reports ready to be attached to medical records – without any special technical knowledge on their side, as we maintain the R engine and the integration part while they code in R. This results in pleased clients all over the world, which makes us happy too.
We really look forward to shipping a number of educational templates to be used in real life at several (multilingual) universities from September 2013. These templates would let teachers show customizable and interactive reports to the students with any number of comments and narrative paragraphs, and these introductory statistical modules would provide a free alternative to other desktop software used in education.
In the next few months, a part of our team will focus on spatial analysis templates, which would mean that our users could not just map, but really analyze any of their spatially related data with a few clicks and clear parameters.
Another feature request of a client seems to be a really exciting idea. Currently, Google Analytics and other tracking services provide basic options to view, filter and export the historical data of websites, blogs etc.
Creating an interface between Rapporter and the tracking services to fetch the most recent data is no longer beyond possibility with the help of existing API resources, so our clients could generate annotated usage reports for any specified period of time – without restrictions. Just to emphasize some potential add-ons: using the time-series R packages in the analysis or creating real-time “dashboards” with optional forecasts about live data.
Of course you could think of other kinds of live or historical data instead of Google Analytics, as creating a template for e.g. transaction data or the gas usage of a household could be addressed at any time, and please do not forget about the use-cases listed under the 3rd question (“[...]Rapporter can help: [...]”).
But wait: the beauty of Rapporter is that you could implement all of the above ideas by yourself in our system, even without any help from us.
Ajay- What are some of the things that can be done more easily with Rapporter than with plain vanilla R?
Greg- Rapporter is basically developed for creating reproducible, literate and annotated statistical modules (a.k.a. “templates”), which means that passing a data set and a list of variables with some optional arguments ends up in a full-blown written report with automatically styled tables and charts.
So using Rapporter is like writing “Sweave” or “knitr” documents, but you write the template only once, and then apply that to any number of data sets with a simple click on an intuitive user interface.
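The “write the template once, apply it to any data set” idea can be sketched in a few lines. The following is a hedged Python illustration, not Rapporter’s template format: the `REPORT` text and the `run_template` helper are assumptions invented for this example.

```python
from statistics import mean, stdev
from string import Template

# A hypothetical one-paragraph "statistical template" with placeholders.
REPORT = Template("## $name\n\nThe mean of *$name* is $mean (SD $sd, n = $n).")


def run_template(name, values):
    """Fill the template with summary statistics computed from `values`."""
    return REPORT.substitute(
        name=name,
        mean=f"{mean(values):.2f}",
        sd=f"{stdev(values):.2f}",
        n=len(values),
    )


if __name__ == "__main__":
    # The same template is reused unchanged for two different variables.
    print(run_template("age", [20, 30, 40]))
    print(run_template("income", [1200.0, 1500.0, 900.0, 2100.0]))
```

The narrative text lives in the template and the numbers come from whichever data set is passed in, so the report author and the person running the report do not have to be the same.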
Beside this major objective: as Rapporter runs in the cloud, sharing reports and templates (or even data sets) with collaborators or with anyone on the Internet is really easy, and our users can post and share any R code for free and without restrictions, or release their templates with a specified license and/or fees in a secured environment.
This means that Rapporter can help:
- scholars to share scientific results or methods with a reproducible and instantly available demo and/or dedicated implementation along with publications,
- teachers to create self-explanatory statistical templates which would help the students internalize the subject by practice,
- any R developer to share a live and interactive demo of the implemented features of their functions with a few clicks,
- businesses to use a statistical platform without restrictions for a reasonable monthly fee instead of expensive and non-portable statistical programs,
- governments and national statistical offices to publicize census or other big data with a scientific and reliable analytic tool with annotated and clear reports, while ensuring the anonymity of the respondents by automatically applying custom methods (like data swapping, rounding, micro-aggregation, PRAM, adding noise etc.) to the tables and results, etc.
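To make one of the anonymization methods named in the last item concrete, here is a small sketch of rounding applied to a table of counts. It is written in Python, it is not taken from Rapporter, and the function names and the default base of 5 are assumptions for illustration only.

```python
def round_to_base(count: int, base: int = 5) -> int:
    """Round a non-negative count to the nearest multiple of `base`."""
    return base * ((count + base // 2) // base)


def protect_table(table: dict, base: int = 5) -> dict:
    """Apply the rounding to every cell of a {category: count} table."""
    return {category: round_to_base(count, base) for category, count in table.items()}


if __name__ == "__main__":
    raw = {"District A": 7, "District B": 12, "District C": 2}
    print(protect_table(raw))
```

Small cells such as a count of 2 are published as 0, so a reader cannot single out the few respondents behind them, at the cost of a small loss in precision for every cell.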
And of course, do not forget one of our main objectives: to open up the world of R to non-R users too, with an intuitive, guiding user interface.
(To be continued)-
Gergely Daróczi is co-ordinating the development of Rapporter and maintaining their R packages. Besides trying to be active in some open-source projects and on StackOverflow, he is a PhD candidate in sociology and also a lecturer at Corvinus University of Budapest and Pázmány Péter Catholic University in Hungary.
Rapporter is a web application helping you to create comprehensive, reliable statistical reports on any mobile device or PC, using an intuitive user interface.
The application builds on the power of R beside other technologies and is intended to be used in any browser, doing the heavy computations on the server side. Some might consider Rapporter as a customizable graphical user interface to R – running in the cloud.
Currently, Rapporter is under heavy development and only invited alpha testers can access the application. Please sign up for an invitation if you want to have an early-bird insight into Rapporter.
A free online course on learning R (sponsored by O’Reilly)
Table of Contents
- R Syntax: A gentle introduction to R expressions, variables, and functions
- Vectors: Grouping values into vectors, then doing arithmetic and graphs with them
- Matrices: Creating and graphing two-dimensional data sets
- Summary Statistics: Calculating and plotting some basic statistics: mean, median, and standard deviation
- Factors: Creating and plotting categorized data
- Data Frames: Organizing values into data frames, loading frames from files and merging them
- Working With Real-World Data: Testing for correlation between data sets, linear models and installing additional packages
Inspired by David Smith’s blog post at http://blog.revolutionanalytics.com/2012/10/r-user-group-sponsorship-applications-open-for-2013.html I set up a meetup group for New Delhi at http://www.meetup.com/New-Delhi-R-UseR-Group/ (India, to my surprise, had only 1 R user meetup group before this, in Bangalore). The first meeting was awesome: we met in a cafe, and the plan going forward is to cover cross-domain learning and collaboration on tools, startups, mashups and training.
Hopefully we can reach out to analytics enthusiasts in Mumbai and Chennai to help kickstart R user groups there. Indian companies like Mu Sigma have been using R more and more in analytics (offshoring). You can even use the sponsorship from Revolution Analytics to start your meetup group, Meetup.com gives you a 50% discount if you pay 6 months in advance, and given Oracle’s and IBM/Google’s big Indian presence I hope they lend a hand to user groups for R in India as well.
I came across this lovely analytics company, Think Big Analytics, and I really liked their lovely explanation of the whole shebang of big data. Hadoop isn’t rocket science, and it can be made simpler to explain and deploy.
Check them out yourself at http://www.thinkbiganalytics.com/resources_reference
Also they have an awesome series of lectures coming up-
Up and Running with Big Data: 3 Day Deep-Dive
Over three days, explore the Big Data tools, technologies and techniques which allow organisations to gain insight and drive new business opportunities by finding signal in their data. Using Amazon Web Services, you’ll learn how to use the flexible map/reduce programming model to scale your analytics, use Hadoop with Elastic MapReduce, write queries with Hive, develop real world data flows with Pig and understand the operational needs of a production data platform.
- MapReduce concepts
- Hadoop implementation: Jobtracker, Namenode, Tasktracker, Datanode, Shuffle & Sort
- Introduction to Amazon AWS and EMR with console and command-line tools
- Implementing MapReduce with Java and Streaming
- Hive Introduction
- Hive Relational Operators
- Hive Implementation to MapReduce
- Hive Partitions
- Hive UDFs, UDAFs, UDTFs
- Pig Introduction
- Pig Relational Operators
- Pig Implementation to MapReduce
- Pig UDFs
- NoSQL discussion
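For readers new to the map/reduce programming model named in the course outline, here is a minimal single-machine sketch of its three steps (map, shuffle, reduce) as a word count. It is plain Python with no Hadoop involved, and the function names are assumptions chosen for this illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big ideas"]))
```

In a real cluster the map and reduce calls run in parallel on different machines and the framework performs the shuffle over the network; the sketch only shows how the three steps fit together.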