2013 Thank You Note

I would like to write a thank-you note to some of the people who helped make Decisionstats.com possible. We had a total of 150,644 views this year. For that, I have to thank you, dear readers, for putting up with me- it is now our seventh year.

Jan: 13,940 | Feb: 12,153 | Mar: 12,948 | Apr: 13,371
May: 12,778 | Jun: 12,085 | Jul: 12,894 | Aug: 11,934
Sep: 9,914 | Oct: 14,764 | Nov: 12,907 | Dec: 10,956
Total: 150,644

I would like to thank Chris (of Mashape) for helping me with some of the interviews I wrote here. I did 26 interviews this year for Programmable Web, and a total of 30+ articles in 2013 including those interviews.

Of course, we have now reached 116 excellent interviews on Decisionstats.com alone (see http://goo.gl/V6UsCG). I would like to thank each one of the interviewees who took precious time to answer the questions.

Sponsors- I would like to thank Dr Eric Siegel (individually as an author and as founder chair of www.pawcon.com), Nadja and Ingo (for RapidMiner), Dr Jonathan (for Datamind), Chris M (for Statace.com), Gergely (author), and the many others who, during all these six years, have kept us afloat and the servers warm in these days of cold reflection, including Gregory (of KDNuggets.com) and the erstwhile AsterData founders.

Training Partners- I would like to thank Lovleen Bhatia (of Edureka) for giving me the opportunity to make http://www.edureka.in/r-for-analytics, which now has 1721 learners as per http://www.edureka.in/.

I would also like to specially thank Jigsaw Academy for giving me the opportunity to create the first affordable and quality R course in Asia: http://analyticstraining.com/2013/jigsaw-completes-training-of-300-students-on-r/

These training courses, including those by Datamind and Coursera, remain a formidable and affordable alternative to the many others catching up in the analytics education game in India (an issue I wrote about here).

Thanks also to each and every one of my students (past and present) and everyone in the #rstats and SAS-L communities, including anyone who may have been left out of this note.

Thank you, sir, for helping me and Decisionstats.com!

I wish each one of you a very happy and joyous New Year and a great and prosperous 2014!

Karl Rexer Interview on the state of Analytics

To cap off a wonderful year, we have decided to interview Karl Rexer, founder of http://www.rexeranalytics.com/ and of the data mining survey that is considered the industry benchmark for the state of the analytics industry.

Ajay: Describe the history behind doing the survey- how you came up with the idea, and which other players you think survey the data mining and statistical software market apart from you.

 Karl: Since the early 2000s I’ve been involved on the organizing and review committees for several data mining conferences and workshops. Early in the 2000s, in the hallways at these conferences I heard many analytic practitioners discussing and comparing their algorithms, data sources, challenges, tools, etc. Since we were already conducting online surveys for several of our clients, and my network of data miners is pretty large, I realized that we could easily do a survey of data miners, and share the results with the data mining community. I saw that the gap was there (and the interest), and we could help fill it. It was a way to give back to the data mining community, and also to raise awareness in the marketplace for my company, Rexer Analytics. So in 2007 we launched the first Data Miner Survey. In the first year, 314 data miners participated, and it’s just grown from there. In each of the last two surveys, over 1200 people participated. The interest we’ve seen in our research summary reports has also been astounding – we get thousands of requests for the summary reports each year. Overall, this just confirms what we originally thought: both inside the industry and beyond, people are hungry for information about data mining.

Are there other surveys and reviews of analytic professionals and the analytic marketplace? Sure. And there’s room for a variety of methodologies and perspectives. Forrester and Gartner produce several reports that cover the analytic marketplace – they largely focus on software evaluations and IT trends. There are also surveys of CIOs and IT professionals that sometimes cover analytic topics. James Taylor (Decision Management Solutions) conducted an interesting study this year of Predictive Analytics in the Cloud. And of course, there are also the KDnuggets single-question polls that provide a pulse on people’s views of topical issues.

Ajay: Over the years, what broad trends have you seen in the survey in terms of paradigms? Name your top 5 insights over these years.

Karl: Well, I can’t think of a fifth one, but I’ve got four key findings and trends we’ve seen over the years we’ve been doing the Data Miner Surveys:

  1. The dramatic rise of open-source data mining tools, especially R. Since 2010, R has been the most-used data mining tool. And in 2013, 70% of data miners report using R. R is frequently used along with other tools, but we also see an increasing number of data miners selecting R as their primary tool.
  2. Data miners consistently report that regression, decision trees, and cluster analysis are the key algorithms they turn to. Each of the surveys, from 2007 through 2013, has shown this same core triad of algorithms.
  3. The challenges data miners face are also consistent: across multiple years, the #1 challenge data miners report has been “dirty data.” The other top challenges are “explaining data mining to others” and “difficult access to data”. In response to the 2010 survey, data miners described their best practices in overcoming these three key challenges. A summary of their ideas is available on our website here: http://www.rexeranalytics.com/Overcoming_Challenges.html. And three linked “challenge” pages contain almost 200 verbatim best practice ideas collected from survey respondents.
  4. We also see that there is excitement among analytic professionals, high job satisfaction, and room for more and better analytics. People report that the number of analytic projects is increasing, and the size of analytic teams is increasing too. But still there’s room for much wider and more sophisticated use of analytics – only a minority of data miners consider their companies to be analytically sophisticated.

Ajay: What percentage of people are now doing analytics on the cloud, on mobile or tablet, versus the desktop?

Karl: In the past few years we’ve seen a doubling in the percentage of people who report doing some of their analytics in cloud environments. It’s still a minority of data miners, but it’s grown from 7% in 2010 to 10% in 2011, and to 19% in 2013.

Ajay: Your survey is free. How does it help your consulting practice?

Karl: Our main motivation for doing the Data Miner Survey is to contribute to the data mining community. We don’t want to charge a fee for the summary reports, because we want to get the information into as many people’s hands as possible. And we want people to feel free to send the report on to their friends and colleagues.

However, the Data Miner Survey does also help Rexer Analytics. It helps to raise the visibility of our company. It increases the traffic and links to our website, and therefore helps our Google rankings. And it is a great conversation starter.

Ajay: Name some statistics on how popular your survey has become over time, in terms of people filling out the survey and people reading it.

Karl: In 2007 when we launched the first Data Miner Survey, 314 data miners participated, and it’s grown nicely from there. In each of the last two surveys, over 1200 people participated. The interest we’ve seen in our research summary reports has also been growing at a dramatic rate – recently we’ve been getting thousands of requests for the summary reports each year. Additionally, we have been unveiling the highlights of the surveys with a presentation at the Fall Predictive Analytics World conferences, and it is always a popular talk.

But the most gratifying aspects of the expanded interest in our Data Miner Survey are these two things:

  1. The great conversations that the Data Miner Survey has initiated. I have wonderful conversations with people by phone, email and at conferences and at colleges about the findings, the trends, and about all the great ideas people have for new and exciting ways that they want to apply analytics in their domains – everything from human resource planning to cancer research, and customer retention to fraud detection. And many people have contributed ideas for new questions or topics that we have incorporated into the survey.
  2. Seeing that people in the data mining community find the survey results useful. Many students and young people entering the field have told us the summary reports provide a great overview of the field and emerging trends. And many software vendors have told us that the survey helps them better understand the needs and preferences of hands-on data mining practitioners. I’m often surprised to see new people and places that are reading and appreciating our survey. We get emails from all corners of the globe, asking questions about the survey, or asking to share it with others. Sometime last year after receiving a question from an academic researcher in Asia, I decided to check Google Scholar to see who is citing the Data Miner Survey in their books and published papers. The list was long. And the list of online news stories, blogs and other mentions of the Data Miner Survey was even longer. I started a list of citations, with links back to the places that are citing the Data Miner Survey – you can look at the list here: http://www.rexeranalytics.com/Data_Miner_Survey_Citations.html – there are over 100 places citing our research, and the list includes 15 languages. But even more surprising was finding that someone had created a Wikipedia entry about the Data Miner Surveys. I made a couple small edits, but then I stopped. The accepted rule in the Wikipedia community is to not edit things that one has a personal interest in. However, I want to encourage any Wikipedia authors out there to go and help update https://en.wikipedia.org/wiki/Rexer%27s_Annual_Data_Miner_Survey.

Ajay- What do you think are the top 3 most insightful charts from your 2013 report?

Karl-  OK, it’s tough for me to pick only 3.  I think that you should pick the three that you think are the most insightful, and then blog about them and the reasons you think they’re important.

 But if you want me to pick 3, then here are three good ones:
— R Usage graph on page 16 
— Algorithm graph on page 36  
— The pair of graphs on page 19 that show that there’s still a lot of room for improvement
Happy new year!
(Ajay- You can see the wonderful report at http://www.rexeranalytics.com/, especially the collection of links in the top right corner of the home page that cite this survey.)

Interview Christian Mladenov CEO StatAce Excellent and Hot #rstats StartUp

Here is an interview with Christian Mladenov, CEO of StatAce, a hot startup in cloud-based data science and statistical computing.

Ajay Ohri (AO)- What is the difference between using R via StatAce and using R via an RStudio Server hosted on Amazon EC2?

Christian Mladenov (CM)- There are a few ways in which I think StatAce is better:

  • You do not need the technical skills to set up a server. You can instead start straight away at the click of a button.

  • You can save the full results for later reference. With an RStudio server you need to manually save and organize the text output and the graphics.

  • We are aiming to develop a visual interface for all the standard stuff. Then you will not need to know R at all.

  • We are developing features for collaboration, so that you can access and track changes to data, scripts and results in a team. With an RStudio server, you manage commits yourself, and Git is not suitable for large data files.

AO- How do you aim to differentiate yourself from other providers of R based software including Revolution, RStudio, Rapporter and even Oracle R Enterprise

CM- We aim to build a scalable, collaborative and easy to use environment. Pretty much everything else in the R ecosystem is lacking one, if not two of these. Most of the GUIs lack a visual way of doing the standard analyses. The ones that have it (e.g. Deducer) have a rather poor usability. Collaboration tools are hardly built in. RStudio has Git integration, but you need to set it up yourself, and you cannot really track large source data in Git.

Revolution Analytics have great technology, but you need to know R and you need to know how to maintain servers for large scale work. It is not very collaborative and can become quite expensive.

Rapporter is great for generating reports, but it is not very interactive – editing templates is a bit cumbersome if you just need to run a few commands. I think it wants to be the place to go to after you have finalized the development of the R code, so that you can share it.  Right now, I also do not see the scalability.

With Oracle R Enterprise you again need to know R. It is targeted at large enterprises and I imagine it is quite expensive, considering it only works with Oracle’s database. For that you need an IT team.

AO- How do you see the space for using R on a cloud?

CM- I think this is an area that has not received enough quality attention – there are some great efforts (e.g. ElasticR), but they are targeted at experienced R users. I see a few factors that facilitate the migration to the cloud:

  • Statisticians collaborate more and more, which means they need to have a place to share data, scripts and results.

  • The number of devices people use is increasing, and now frequently includes a tablet. Having things accessible through the web gives more freedom.

  • More and more data lives on servers. This is both because it is generated there (e.g. click streams) and because it is too big to fit on a user’s PC (e.g. raw DNA data). Using it where it already is prevents slow download/upload.

  • Centralizing data, scripts and results improves compliance (everybody knows where it is), reproducibility and reliability (it is easily backed up).

For me, bringing R to the cloud is a great opportunity.

AO- What are some of the key technical challenges you currently face and are seeking to solve for R-based cloud solutions?

CM- Our main challenge is CPU use, since cloud servers typically have multiple slow cores and R is mostly single-threaded. We have yet to fully address that and are actively following the projects that aim to improve R’s interpreter – pqR, Renjin, Riposte, etc. One option is to move to bare metal servers, but then we will lose a lot of flexibility.
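(Ajay- For readers curious about the single-threaded issue Christian mentions, one common workaround today is to spread independent work across cores with base R's parallel package. The sketch below is purely illustrative and is not StatAce's implementation.)

```r
# Illustrative sketch only (not StatAce code): spreading independent work
# across cores with the parallel package that ships with base R.
library(parallel)

n_cores <- max(1, detectCores() - 1)   # leave one core free
cl <- makeCluster(n_cores)

# Fit the same model on 100 bootstrap resamples of mtcars, in parallel
boot_fit <- function(i) {
  idx <- sample(nrow(mtcars), replace = TRUE)
  coef(lm(mpg ~ wt + hp, data = mtcars[idx, ]))
}
results <- parLapply(cl, 1:100, boot_fit)
stopCluster(cl)

head(do.call(rbind, results))          # bootstrap coefficient estimates
```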

Another challenge is multi-server processing. This is also an area of progress where we do not yet have a stable solution.

AO- What are some of the advantages and disadvantages of being a Europe-based tech startup vis-à-vis a San Francisco-based tech startup?

CM- In Eastern Europe at least, you can live quite cheaply, and therefore you can focus better on the product and the customers. In the US you need to spend a lot of time courting investors.

Eastern Europe also has a lot of technical talent – it is not that difficult or expensive to hire experienced engineers.

The disadvantages are many, and I think they outweigh the advantages:

  • Capital is scarce, especially after the seed stage. This means startups either have to focus on profit, which limits their ability to execute a grander vision, or they need to move to the US, which wastes a lot of time and resources.

  • There is limited access to customers, partners, mentors and advisors. Most of the startup innovation happens in the US and its users prefer to deal with local companies.

  • The environment in Europe is not as supportive in terms of events, media coverage, and even social acceptance. In many countries founders are viewed with a bit of suspicion, and failure frequently means the end of one’s credibility.

AO- What advice would you give to aspiring data scientists

CM- Use open source. R, Julia, Octave and the others are seeing a level of innovation that the commercial solutions just cannot match. They are very flexible and versatile, and if you need something specific, you should learn some Python and do it yourself.

Keep things reproducible, or at some point you will get lost. This includes using a version control system.

Be active in the community. While books are great, sharing and seeking advice will improve your skills much faster.

Focus more on “why” you do something and “what” you want to achieve. Only then get technical about “how” you want to do it. Use a good IDE that facilitates your work and allows you to do the simple things fast. You know, like StatAce 🙂

AO- Describe your career journey from Student to CEO

CM-During my bachelor studies I worked as a software developer and customer intelligence analyst. This gave me a lot of perspective on software and data.

After graduating I got a job where I coordinated processes and led projects. This is where I discovered the importance of listening to customers, planning well in advance, and having good data to base decisions on.

In my master studies, it was my statistics-heavy thesis that made me think “why is there not a place where I can easily use the power of R on a machine with a lot of RAM?” This is when the idea for StatAce was born.


About StatAce-

Bulgarian StatAce is the winner of betapitch | global, which was held in Berlin on 6 July (read more about it here). The team, driven by the lack of software for low student budgets, came up with the idea of building “Google docs for professional statisticians” and eventually took home the first prize of the startup competition.

Interview Joseph Eapen CEO RMInsights #rstats #shiny

Here is an interview with Joseph Eapen, CEO of RMInsights, an exciting data science company that has been using R in its work providing decision support for the entertainment industry (do not confuse them with rminsight, which is almost the same URL but without the s).

I found some of their work in applied data science cool enough to request an interview, and they were gracious enough to respond at short notice.

Joseph Eapen

Ajay Ohri (AO)- What are the innovative steps, products, services and initiatives that you have been executing at rminsights.net?

Joseph Eapen (JE)– Cinema: RMI has launched Cinema Audience Measurement (CAM) with a focus on audience appreciation. As you know, film making is an expensive and time-consuming exercise; to add to it, the audience sits through a movie, with great expectations, hoping it is worth their time, effort and money. Even though box office collections are a yardstick of movie revenues, they are still far from any audience appreciation measure. CAM, through http://screenratings.org, provides a scientific tool that gives you movie appreciation scores and helps you form an opinion, to aid others in choosing a movie to watch this week.

Television: Apart from CAM, RMI focuses on data analytics, especially forecasting techniques using R and Shiny, and has developed a TV Ratings Forecaster as a Service.

Data Collection: RMI designs mobile-based custom surveys using ODK (OpenDataKit – Build, Collect, Aggregate).

AO- What are some of the applications and products that you have been developing using R language and R Shiny Applications

JE- TV Ratings Forecaster as a Service is the product we have developed using R and Shiny. It uses Professor Rob J Hyndman’s forecast package, which provides methods and tools for displaying and analyzing univariate time series forecasts, including exponential smoothing via the Holt-Winters (HW) approach.

The user simply has to upload a CSV file containing historical TV ratings of any TV channel(s), usually 48 time segments for each day, and decide on the forecast horizon; the system then predicts and plots the forecast.

TV Ratings Forecaster as a Service
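(Ajay- To make the workflow concrete, here is a minimal sketch of a Holt-Winters forecast of the kind the service describes, using the forecast package. The file name and column names are hypothetical, not RMI's actual format.)

```r
# Minimal Holt-Winters TV ratings forecast (illustrative sketch only;
# "channel_ratings.csv" and its columns are hypothetical, not RMI's format).
library(forecast)

ratings <- read.csv("channel_ratings.csv")

# 48 time segments per day -> seasonal frequency of 48
y <- ts(ratings$channel_a, frequency = 48)

fit <- HoltWinters(y)              # Holt-Winters exponential smoothing (stats package)
fc  <- forecast(fit, h = 48 * 7)   # forecast one week (7 days x 48 segments) ahead
plot(fc)
```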

AO- What is your vision for research and analytics going forward ?

JE- We are all at an interesting juncture: almost everything we do is captured at a transactional level, and the so-called Big Data will be accessible for analysis. So in future, I see that we will need research to bridge gaps in this transactional data, and a large role in decision making will be played by analytics rather than by research (as we know it now). Both will co-exist, but data mining and analytics will take center stage.

AO- Describe your career journey from your student days to CEO.

JE- After I finished my engineering (Electronics & Telecommunication) in Mumbai, I went to Dubai in search of a job. Within 10 days, I saw a job vacancy in the local newspaper for a Project Engineer at a leading market research company – PARC (a Gallup affiliate). I applied and, after 2 rounds of interviews, landed the job. The company had just signed a JV with AGB Italia (a television audience measurement company) to conduct peoplemeter-based TAM in the Gulf region, starting with the UAE. The JV was called AGB Gulf. During the expansion phase of the project I was sent to Milan to test and initiate the purchase of peoplemeters, and trained at their headquarters in Switzerland. Soon I was promoted to Head of AGB Gulf, and ran it for 10 years. During this time I led the service through a successful audit by the industry.

Following this I was instrumental in the launch of TGI (Single Source from BMRB) in the Gulf and LEVANT markets, again through a JV between PARC and BMRB in UK. The JV was called TGI Arabia.

This was the first 13 years of my career. After that I joined MediaEdge:CIA (a WPP company) in Dubai as Head of Research & Development. After 3 years, I returned to Mumbai and joined aMap (an overnight television ratings agency, competing fiercely with TAM Media Research – a Nielsen/Kantar company) as Director, and was promoted to CEO within 11 months.

After 3 years with aMap, I was appointed CEO of MRUC (an industry body that issues the IRS – the world’s largest readership study, in India).

Post MRUC, I joined RMI as CEO… and the journey continues…

AO- What advice would you give to aspiring data scientists

JE- As much as there will be large amounts of data to mine and a lot of techniques available to make sense of it all, it may get overwhelming, and we should not lose track of the purpose of mining: applying what we learn and understand to enhance our lives and the world around us. It should not be just "exciting to mine" and "good to know" – it should be way beyond that, totally purpose-driven and put to use, aggressively.

AO- What do you think are the criteria that companies should take into account while outsourcing research and analytics (apart from cost)?

JE- Apart from cost, the key factors should be turn-around speed and accountability. Not only should they be fast and meticulous, it also matters how much they can stand behind their work and see that it serves the purpose it was commissioned for in the first place. So you see, it goes beyond ‘just doing a good job’.

About RMInsights 

We have studied human nature and behavior for more than 20 years. Our reputation for delivering relevant, timely, and visionary solutions on what people around the world think and feel is the cornerstone of the organization. We employ many of the world’s leading scientists in management, economics, psychology, and sociology. We study market conditions in local, regional, or national areas to examine potential of a product or service. We help companies understand what products people want, who will buy them, and at what price.

Joseph Eapen
CEO

http://www.rminsights.net/

Interview Dr. Ian Fellows Fellstat.com #rstats Deducer

Here is an interview with Dr Ian Fellows, creator of acclaimed R packages like Deducer, and the Founder and President of Fellstat.com.
Ajay- Describe your involvement with the Deducer Project, and the various plugins associated with it. What has been the usage of and response to Deducer from the R community?
Ian- Deducer is a graphical user interface for data analysis built on R. It sprang out of a disconnect between the toolchain used by myself and the toolchain of the psychologists that I worked with at the University of California, San Diego. They were primarily SPSS users, whereas I liked to use R, especially for anything that was not a standard analysis.
I felt that there was a big gap in the audience that R serves. Not all consumers or producers of statistics can be expected to have the computational background (command-line programming) that R requires. I think it is important to recognize and work with the areas of expertise that statistical users have. I’m not an expert in psychology, and they didn’t expect me to be one. They are not experts in computation, and I don’t think that we should expect them to be in order to be a part of the R toolchain community.
This was the impetus behind Deducer, so it is fundamentally designed to be a familiar experience for users coming from an SPSS background and provides a full implementation of the standard methods in statistics, and data manipulation from descriptives to generalized linear models. Additionally, it has an advanced GUI for creating visualizations which has been well received, and won the John Chambers award for statistical software in 2011.
Uptake of the system is difficult to measure as CRAN does not track package downloads, but from what I can tell there has been a steadily increasing user base. The online manual has been accessed by over 75,000 unique users, with over 400,000 page views. There is a small, active group of developers creating add-on packages supporting various sub-disciplines of statistics. There are 8 packages on CRAN extending/using Deducer, and quite a few more on r-forge.
Ajay- Do you see any potential for Deducer as an enterprise software product (like R Studio et al)
Ian- Like R Studio, Deducer is used in enterprise environments but is not specifically geared towards that environment. I do see potential in that realm, but don’t have any particular plan to make an enterprise version of Deducer.
Ajay- Describe your work in Texas Hold’em Poker. Do you see any potential for R to diversify into casino analytics – which has hitherto been served exclusively by non-open-source analytics vendors?
Ian- As a Statistician, I’m very much interested in problems of inference under uncertainty, especially when the problem space is huge. Creating an Artificial Intelligence that can play (heads-up limit) Texas Hold’em Poker at a high level is a perfect example of this. There is uncertainty created by the random drawing of cards, the problem space is 10^18, and our opponent can adapt to any strategy that we employ.
While high level chess A.I.s have existed for decades, the first viable program to tackle full scale poker was introduced in 2003 by the incomparable Computer Poker Research Group at the University of Alberta. Thus poker represents a significant challenge which can be used as a test bed to break new ground in applied game theory. In 2007 and 2008 I submitted entries to the AAAI’s annual computer poker competition, which pits A.I.s from universities across the world against each other. My program, which was based on an approximate game-theoretic equilibrium calculated using a co-evolutionary process called fictitious play, came in second behind the Alberta team.
Ajay- Describe your work in social media analytics for R. What potential do you see for Social Network Analysis given the current usage of it in business analytics and business intelligence tools for enterprise.
Ian- My dissertation focused on new model classes for social network analysis (http://arxiv.org/pdf/1208.0121v1.pdf and http://arxiv.org/pdf/1303.1219.pdf). R has a great collection of tools for social network analysis in the statnet suite of packages, which represents the forefront of the literature on the statistical modeling of social networks. I think that if the analytics data is small enough for the models to be fit, these tools can represent a qualitative leap in the understanding and prediction of user behavior.
Most uses of social networks in enterprise analytics that I have seen are limited to descriptive statistics (what is a user’s centrality; what is the degree distribution), and the use of these descriptive statistics as fixed predictors in a model. I believe that this approach is an important first step, but ignores the stochastic nature of the network, and the dynamics of tie formation and dissolution. Realistic modeling of the network can lead to more principled, and more accurate predictions of the quantities that enterprise users care about.
The rub is that the Markov Chain Monte Carlo Maximum Likelihood algorithms used to fit modern generative social network models (such as exponential-family random graph models) do not scale well at all. These models are typically limited to fitting networks with fewer than 50,000 vertices, which is clearly insufficient for most analytics customers who have networks more on the order of 50,000,000.
This problem is not insoluble though. Part of my ongoing research involves scalable algorithms for fitting social network models.
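(Ajay- For readers unfamiliar with these models, here is a tiny, textbook example of fitting an exponential-family random graph model with the statnet suite's ergm package. It uses the classic Florentine marriage network that ships with the package, and is not Dr Fellows' research code.)

```r
# Textbook ERGM example on the Florentine marriage network (illustrative only).
library(ergm)          # part of the statnet suite; loads the network package
data(florentine)       # flomarriage: marriage ties among Renaissance Florentine families

# Model tie formation with a density (edges) term and a triangle term
fit <- ergm(flomarriage ~ edges + triangle)
summary(fit)

# Simulate from the fitted model to check goodness of fit
plot(gof(fit))
```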
Ajay- You decided to go from your PhD into consulting (www.fellstat.com). What were some of the options you considered in this career choice?
Ian– I’ve been working in the role of a statistical consultant for the last 7 years, starting as an in-house consultant at UCSD after obtaining my MS. Fellows Statistics has been operating for the last 3 years, though not full-time until January of this year. As I had already been consulting, it was a natural progression to transition to consulting full-time once I graduated with my PhD.
This has allowed me to both work on interesting corporate projects, and continue research related to my dissertation via sub-awards from various universities.
Ajay- What does Fellstat.com offer in its consulting practice.
Ian– Fellows Statistics offers personalized analytics services to both corporate and academic clients. We are a boutique company, that can scale from a single statistician to a small team of analysts chosen specifically with the client’s needs in mind. I believe that by being small, we can provide better, close-to-the-ground responsive service to our clients.
As a practice, we live at the intersection of mathematical sophistication, and computational skill, with a hint of UI design thrown into the mix. Corporate clients can expect a diverse range of analytic skills from the development of novel algorithms to the design and presentation of data for a general audience. We’ve worked with Revolution Analytics developing algorithms for their ScaleR product, the Center for Disease Control developing graphical user interfaces set to be deployed for world-wide HIV surveillance, and Prospectus analyzing clinical trial data for retinal surgery. With access to the cutting edge research taking place in the academic community, and the skills to implement them in corporate environments, Fellows Statistics is able to offer clients world-class analytics services.
Ajay- How does big data affect the practice of statistics in business decisions.
Ian– There is a big gap in terms of how the basic practice of statistics is taught in most universities, and the types of analyses that are useful when data sizes become large. Back when I was at UCSD, I remember a researcher there jokingly saying that everything is correlated rho=.2. He was joking, but there is a lot of truth to that statement. As data sizes get larger everything becomes significant if a hypothesis test is done, because the test has the power to detect even trivial relationships.
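(Ajay- A quick simulation makes this point concrete: with a large enough sample, even a trivially small correlation produces a vanishingly small p-value. This is an illustrative sketch, not from the interview.)

```r
# With very large n, a trivial correlation still comes out "significant".
set.seed(1)
n <- 1e6
x <- rnorm(n)
y <- 0.01 * x + rnorm(n)     # true correlation of roughly 0.01
cor.test(x, y)$p.value       # effectively zero, despite a negligible effect size
```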
Ajay- How is the R community including developers coping with the Big Data era? What do you think R can do more for Big Data?
Ian- On the open source side, there has been a lot of movement to improve R’s handling of big data. The bigmemory project and the ff package both serve to extend R’s reach beyond in-memory data structures.  Revolution Analytics also has the ScaleR package, which costs money, but is lightning fast and has an ever growing list of analytic techniques implemented. There are also several packages integrating R with hadoop.
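(Ajay- As a flavor of the out-of-memory structures mentioned above, here is a minimal bigmemory sketch; it is a generic example, not tied to any project discussed here.)

```r
# Generic sketch of a file-backed matrix with bigmemory (illustrative only).
library(bigmemory)

# A 10-million-row matrix stored on disk rather than held entirely in RAM
x <- filebacked.big.matrix(nrow = 1e7, ncol = 5, type = "double",
                           backingfile = "big_x.bin",
                           descriptorfile = "big_x.desc")

x[1:5, ] <- rnorm(25)   # write to a small slice
x[1:3, ]                # read back a few rows without loading the whole matrix
```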
Ajay- Describe your research into data visualization including word cloud and other packages. What do you think of Shiny, D3.Js and online data visualization?
Ian- I recently had the opportunity to delve into d3.js for a client project, and absolutely love it. Combined with Shiny, d3 and R, one can very quickly create a web visualization of an R modeling technique. One limitation of d3 is that it doesn’t work well with Internet Explorer 6-8. Once these browsers finally leave the ecosystem, I expect an explosion of sites using d3.
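(Ajay- As a flavor of how quickly R and Shiny can put a model behind a web page, here is a minimal Shiny sketch. It uses plain R graphics rather than d3, and is a generic example rather than one of Dr Fellows' client projects.)

```r
# Minimal Shiny sketch: an interactive view of a simple R model (generic example).
library(shiny)

ui <- fluidPage(
  sliderInput("span", "Loess span", min = 0.4, max = 1, value = 0.75),
  plotOutput("fit")
)

server <- function(input, output) {
  output$fit <- renderPlot({
    plot(mpg ~ wt, data = mtcars)
    fit <- loess(mpg ~ wt, data = mtcars, span = input$span)
    grid <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 100)
    lines(grid, predict(fit, newdata = data.frame(wt = grid)))
  })
}

shinyApp(ui, server)
```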
Ajay- Do you think wordcloud is an overused data visualization type and how can it be refined?
Ian- I would say yes, but not for the reasons you would think. A lot of people criticize word clouds because they convey the same information as a bar chart, but with less specificity. With a bar chart you can actually see the frequency, whereas you only get a relative idea with word clouds based on the size of the word.
I think this is both an absolutely correct statement, and misses the point completely. Visualizations are about communicating with the reader. If your readers are statisticians, then they will happily consume the bar chart, following the bar heights to their point on the y-axis to find the frequencies. A statistician will spend time with a graph, will mull it over, and consider what deeper truths are found there. Statisticians are weird though. Most people care as much about how pretty the graph looks as its content. To communicate to these people (i.e. everyone else) it is appropriate and right to sacrifice statistical specificity to design considerations. After all, if the user stops reading you haven’t conveyed anything.
But back to the question… I would say that they are over used because they represent a very superficial analysis of a text or corpus. The word counts do convey an aspect of a text, but not a very nuanced one. The next step in looking at a corpus of texts would be to ask how are they different and how are they the same. The wordcloud package has the comparison and commonality word clouds, which attempt to extend the basic word cloud to answer these questions (see: http://blog.fellstat.com/?p=101).
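(Ajay- For readers who want to try the extensions mentioned above, here is a minimal sketch of the comparison and commonality clouds from the wordcloud package. The term counts are made up purely for illustration.)

```r
# Minimal sketch of comparison and commonality clouds (wordcloud package).
# The term frequencies below are invented purely for illustration.
library(wordcloud)

terms <- matrix(c(40, 10,  5,
                  25, 30, 12,
                   5, 35, 20,
                  20, 22, 18,
                  15,  2, 30),
                ncol = 3, byrow = TRUE,
                dimnames = list(c("model", "data", "survey", "analysis", "regression"),
                                c("Corpus A", "Corpus B", "Corpus C")))

comparison.cloud(terms)    # words sized by how much their use differs between corpora
commonality.cloud(terms)   # words sized by their shared frequency across corpora
```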
About-

Dr. Ian Fellows is a professional statistician based out of the University of California, Los Angeles. His research interests range over many sub-disciplines of statistics. His work in statistical visualization won the prestigious John Chambers Award in 2011, and in 2007-2008 his Texas Hold’em AI programs were ranked second in the world.

Applied data analysis has been a passion for him, and he is accustomed to providing accurate, timely analysis for a wide range of projects, and assisting in the interpretation and communication of statistical results. He can be contacted at info@fellstat.com

Interview Jeroen Ooms OpenCPU #rstats

Below is an interview with Jeroen Ooms, a pioneer in R and web development. Jeroen contributes to R by developing packages and web applications for multiple projects.


Ajay- What are you working on these days?
Jeroen- My research revolves around challenges and opportunities of using R in embedded applications and scalable systems. After developing numerous web applications, I started the OpenCPU project about 1.5 years ago, as a first attempt at a complete framework for proper integration of R in web services. As I work on this, I run into challenges that shape my research, and sometimes become projects in their own right. For example, the RAppArmor package provides the security framework for OpenCPU, but can be used for other purposes as well. RAppArmor interfaces to some methods in the Linux kernel related to setting security and resource limits. The github page contains the source code, installation instructions, video demos, and a draft of a paper for the Journal of Statistical Software. Another example of a problem that appeared in OpenCPU is that applications that used to work were breaking unexpectedly later on due to changes in dependency packages on CRAN. This is actually a general problem that affects almost all R users, as it compromises the reliability of CRAN packages and the reproducibility of results. In a paper (forthcoming in The R Journal), this problem is discussed in more detail and directions for improvement are suggested. A preprint of the paper is available on arXiv: http://arxiv.org/abs/1303.2140.

I am also working on software not directly related to R. For example, in project Mobilize we teach high school students in Los Angeles the basics of collecting and analyzing data. They use mobile devices to upload surveys with questions, photos, gps, etc using the ohmage software. Within Mobilize and Ohmage, I am in charge of developing web applications that help students to visualize the data they collaboratively collected. One public demo with actual data collected by students about snacking behavior is available at: http://jeroenooms.github.com/snack. The application allows students to explore their data, by filtering, zooming, browsing, comparing etc. It helps students and teachers to access and learn from their data, without complicated tools or programming. This approach would easily generalize to other fields, like medical data or BI. The great thing about this application is that it is fully client side; the backend is simply a CSV file. So it is very easy to deploy and maintain.

Ajay- What’s your take on the difference between OpenCPU and RevoDeployR?
Jeroen- RevoDeployR and OpenCPU both provide a system for development of R web applications, but in a fairly different context. OpenCPU is open source and written completely in R, whereas RevoDeployR is proprietary and written in Java. I think Revolution focusses more on a complete solution in a corporate environment. It integrates with the Revolution Enterprise suite and their other big data products, and has built-in functionality for authentication, managing privileges, server administration, support for MS Windows, etc. OpenCPU on the other hand is much smaller and should be seen as just a computational backend, analogous to a database backend. It exposes a clean HTTP api to call R functions to be embedded in larger systems, but is not a complete end-product in itself.

OpenCPU is designed to make it easy for a statistician to expose statistical functionality that will be used by web developers who do not need to understand or learn R. One interesting example is how we use OpenCPU inside OpenMHealth, a project that designs an architecture for mobile applications in the health domain. Part of the architecture are so-called “Data Processing Units”, aka DPU’s. These are simple, modular I/O units that do various sorts of data processing, similar to unix tools, but then over HTTPS. For example, the mobility dpu is used to calculate distances between gps coordinates via a simple http call, which OpenCPU maps to the corresponding R function implementing the haversine formula.
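(Ajay- The calling pattern is easy to try from R itself. Below is a hedged sketch against OpenCPU's public demo server, assuming it is reachable; stats::rnorm simply stands in for a real DPU function such as the mobility distance calculation.)

```r
# Sketch of the OpenCPU HTTP API pattern: POST /ocpu/library/{package}/R/{function}.
# Assumptions: the public demo server is online and its API is unchanged;
# stats::rnorm is only a stand-in for a real DPU function.
library(httr)

res <- POST("https://cloud.opencpu.org/ocpu/library/stats/R/rnorm/json",
            body = list(n = 5), encode = "form")
stop_for_status(res)
content(res, "text")   # JSON array of five random draws
```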

Ajay- What are your views on Shiny by RStudio?
Jeroen- RStudio seems very promising. Like Revolution, they deliver a more full featured product than any of my projects. However, RStudio is completely open source, which is great because it allows anyone to leverage the software and make it part of their projects. I think this is one of the reasons why the product has gotten a lot of traction in the community, which has in turn provided RStudio with great feedback to further improve the product. It illustrates how open source can be a win-win situation. I am currently developing a package to run OpenCPU inside RStudio, which will make developing and running OpenCPU apps much easier.

Ajay- Are you still developing excellent RApache web apps (which IMHO could be used for visualization like business intelligence tools)?
Jeroen–   The OpenCPU framework was a result of those webapps (including ggplot2 for graphical exploratory analysis, lme4 for online random effects modeling, stockplot for stock predictions and irttool.com, an R web application for online IRT analysis). I started developing some of those apps a couple of years ago, and realized that I was repeating a large share of the infrastructure for each application. Based on those experiences I extracted a general purpose framework. Once the framework is done, I’ll go back to developing applications 🙂

Ajay- You have helped build web apps, OpenCPU, RAppArmor, Ohmage, Snack, and mobility apps. What is your thesis topic on?
Jeroen- My thesis revolves around all of the technical and social challenges of moving statistical computing beyond the academic and private labs, into more public, accessible and social places. Currently statistics is still mostly done manually by specialists using software to load data, perform some analysis, and produce results that end up in a report or presentation. There are great opportunities to leverage the open source analysis and visualization methods that R has to offer as part of open source stacks, services, systems and applications. However, several problems need to be addressed before this can actually be put in production. I hope my doctoral research will contribute to taking a step in that direction.

Ajay- R is RAM constrained but the cloud offers lots of RAM. Do you see R increasing in usage on the cloud? why or why not?
Jeroen-   Statistical computing can greatly benefit from the resources that the cloud has to offer. Software like OpenCPU, RStudio, Shiny and RevoDeployR all provide some approach of moving computation to centralized servers. This is only the beginning. Statisticians, researchers and analysts will continue to increasingly share and publish data, code and results on social cloud-based computing platforms. This will address some of the hardware challenges, but also contribute towards reproducible research and further socialize data analysis, i.e. improve learning, collaboration and integration.

That said, the cloud is not going to solve all problems. You mention the need for more memory, but that is only one direction to scale in. Some of the issues we need to address are more fundamental and require new algorithms, different paradigms, or a cultural change. There are many exciting efforts going on that are at least as relevant as big hardware. Gelman’s mc-stan implements a new MC method that makes Bayesian inference easier and faster while supporting more complex models. This is going to make advanced Bayesian methods more accessible to applied researchers, i.e. scale in terms of complexity and applicability. Also Javascript is rapidly becoming more interesting. Performance of Google’s javascript engine V8 outruns any other scripting language at this point, and the huge Javascript community provides countless excellent software libraries. For example D3 is a graphics library that is about to surpass R in terms of functionality, reliability and user base. The snack viz that I developed for Mobilize is based largely on D3. Finally, Julia is another young language for technical computing with lots of activity and very smart people behind it. These developments are just as important for the future of statistical computing as big data solutions.

About-
You can read more on Jeroen and his work at  http://jeroenooms.github.com/ and reach out to him here http://www.linkedin.com/in/datajeroen

Interview Pranay Agrawal Co-Founder Fractal Analytics

Here is an interview with Pranay Agrawal, Executive Vice President- Global Client Development, Fractal Analytics – one of India’s leading analytics services providers and one of the pioneers in analytics services delivery.

Ajay- Describe Fractal Analytics’ journey as a startup to a pioneer in the Predictive Analytics Services industry. What were some of the key turning points in the field of analytics that you have noticed during these times?


Pranay- In 2000, Fractal Analytics started as a pure-play analytics services company in India with a focus on financial services. Five years later, we spread our operations to the United States and opened new verticals. Today, we have the widest global footprint among analytics providers, with experience handling data and a deep understanding of consumer behavior in over 150 countries. We have matured from an analytics services organization to a productized analytics services firm, specializing in the consumer goods, retail, financial services, insurance and technology verticals.
We are at the forefront of a massive inflection point with Big Data analytics at the center. We have witnessed the transformation of analytics within our clients from a cost center to the most critical division that drives competitive advantage. Advances are quickly converging in computer science, artificial intelligence, machine learning and game theory, changing the way analytics is consumed by B2B and B2C companies. Companies that use analytics well are poised to excel in innovation, customer engagement and business performance.

Ajay- What are analytical tools that you use at Fractal Analytics? Are there any trends in analytical software usage that you have observed?

Pranay- We are tool-agnostic, serving our clients with whatever platforms they need to ensure they can quickly and effectively operationalize the results we deliver. We use R, SAS, SPSS, SpotFire, Tableau, Xcelsius, WebFOCUS, MicroStrategy and QlikView. We are seeing an increase in the adoption of open source platforms such as R, and of specialized dashboard tools like Tableau/QlikView, plus an entire spectrum of emerging tools that support Hadoop and NoSQL data structures for processing, managing and extracting information from Big Data.

Ajay- What are Fractal Analytics plans for Big Data Analytics?

Pranay- We see our clients being overwhelmed by the increasing complexity of the data. While they are all excited by the possibilities of Big Data, the on-the-ground struggle to realize its full potential continues. The analytics paradigm is changing in the context of Big Data. Our solutions focus on making things super-simple for our clients while offering the analytics sophistication that Big Data makes possible.
Let’s take our Customer Genomics solution for retailers as an example. Retailers are collecting information about shopper behavior through every transaction. Retailers want to transform their business to make it more customer-centric but do not know how to go about it. Our Customer Genomics solution uses advanced machine learning algorithms to label every shopper across more than 80 different dimensions. Retailers use these labels to identify which products they should deep-discount depending on what price-sensitive shoppers buy. Armed with this intelligence, they are transforming the way they plan their assortment, planograms and targeted promotions.

We are also building harmonization engines using Concordia to enable real-time update of Customer Genomics based on every direct, social, or shopping transaction. This will further bridge the gap between marketing actions and consumer behavior to drive loyalty, market share and profitability.

Ajay- What are some of the key things that differentiate Fractal Analytics from the rest of the industry? How are you different?

Pranay- We are one of the pioneering pure-play analytics firms, with over a decade of experience consulting with Fortune 500 companies. What clients most appreciate about working with us includes:

  • Experience managing structured and unstructured Big Data (volume, variety) with a deep understanding of consumer behavior in more than 150 countries
  • Advanced analytics leveraging supervised machine-learning platforms
  • Proprietary products, for example: Concordia for data harmonization, Customer Genomics for consumer insights and personalized marketing, Pincer for pricing optimization, Eavesdrop for social media listening, Medley for assortment optimization in the retail industry, and Known Value Item for retail stores
  • Deep industry expertise enables us to leverage cross-industry knowledge to solve a wide range of marketing problems
  • Lowest attrition rates in the industry and very selective hiring process makes us a great place to work

Ajay- What are some of the initiatives that you have taken to ensure employee satisfaction and happiness?

Pranay- We believe happy employees create happy customers. We are building a great place to work by taking a personal interest in grooming people. Our people are highly engaged as evidenced by 33% new hire referrals and the highest Glassdoor ratings in our industry.
We recognize the accomplishments and contributions made through many programs such as:

  1. FractElite – where peers nominate and defend the best of us
  2. Recognition board – where anyone can write a visible thank you
  3. Value cards – where anyone can acknowledge great role model behavior in one or more values
  4. Townhall – a quarterly all hands where we announce anniversaries and FractElite awards, with an open forum to ask questions
  5. Employee engagement surveys – to measure and report out on satisfaction programs
  6. Open access to managers and leadership team – to ensure we understand and appreciate each person’s unique goals and ambitions, coach for high performance, and laud their success

Ajay- How happy are Fractal Analytics customers quantitatively?  What is your retention rate- and what plans do you have for 2013?

Pranay- As consultants, delivering value with great service is critical to our growth, which has nearly doubled in the last year. Most of our clients have been with us for over five years and we are typically considered a strategic partner.
We conduct client satisfaction surveys during and after each project to measure our performance and identify opportunities to serve our clients better. In 2013, we will continue partnering with our clients to define additional process improvements from applying best practice in engagement management to building more advanced analytics and automated services to put high-impact decisions into our clients’ hands faster.

About

Pranay Agrawal- Pranay co-founded Fractal Analytics in 2000 and heads client engagement worldwide. He has an MBA from the Indian Institute of Management (IIM) Ahmedabad and a Bachelor’s in Accounting from Bangalore University, and is a Certified Financial Risk Manager from GARP. He is also available online at http://www.linkedin.com/in/pranayfractal

Fractal Analytics is a provider of predictive analytics and decision sciences to the financial services, insurance, consumer goods, retail, technology, pharma and telecommunication industries. Fractal Analytics helps companies compete on analytics and understand, predict and influence consumer behavior. Over 20 Fortune 500 financial services, consumer packaged goods, retail and insurance companies partner with Fractal to make better data-driven decisions and institutionalize analytics inside their organizations.

Fractal sets up analytical centers of excellence for its clients to tackle tough big data challenges, improve decision management, help understand, predict & influence consumer behavior, increase marketing effectiveness, reduce risk and optimize business results.