
Category Archives: Interviews

Interview Vivian Zhang co-founder SupStat

Here is an interview with Vivian Zhang, CTO and co-founder of SupStat, an interesting startup in the R ecosystem. In this interview Vivian talks about the startup journey, helping spread R in China and New York City, and managing Meetups, conferences and a training business with balance and excellence.


DecisionStats (DS)- Describe the story behind the creation of SupStat Inc and the journey so far, along with milestones and turning points. What is your vision for SupStat, what do you want it to accomplish, and how?

Vivian Zhang (VZ)-


SupStat was born in 2012 out of the collaboration of 60+ individuals (statisticians, computer engineers, mathematicians, professors, graduate students and talented data geniuses) who met through a well-known non-profit organization in China, Capital of Statistics. The SupStat team met through various collaborations on R packages and analytic work. In 2013, SupStat became involved in the New York City data science community through hosting the NYC Open Data Meetup, and soon began offering formal courses through the NYC Data Science Academy. SupStat offers consulting services in the areas of R development, data visualization, and big data solutions. We are experienced with many technologies and languages including R, Python, Hadoop, Spark, Node.js, etc. Courses offered include Data Science with R (Beginner, Intermediate), Data Science with Python (Beginner, Intermediate), and Hadoop (Beginner, Intermediate), as well as many targeted courses on R packages and data visualization tools.

Allen and I, the two co-founders, have been passionate about Data Mining since a young age (we talked about it back in 1997). With industry experience as Chief Data scientist/Senior Analyst and a spirit of entrepreneurship, we started the firm by gathering all the top R/Hadoop/D3.js programmers we knew.

Milestones of SupStat:

June 2012, Established in Beijing

July 2012,  Offered R intensive Bootcamp in Beijing to 50+ college students

June 2013, Established in NYC

Nov 2013,  Launched our NYC training brand: NYC Data Science Academy

Jan 2014,  Became premium partner of Revolution Analytics in China

Feb 2014,  Became training and reseller partner of RStudio in US and China

April 2014, Became Exclusive reseller partner of Transwarp in US

                Started to offer R built-in and professional services for Hadoop/Spark

May 2014, Organized and sponsored R conference in Beijing

                NYC Open Data Meetup had 1800+ members in one year

Jun 2014, Sponsored UCLA R conference (Vivian was panelist for female R programmer talk.)

The major turning point was in November, 2013, when we decided to start our NYC office and launched the NYC Data Science Academy.

Our Mission:

We are committed to helping our clients make distinctive, lasting and substantial improvement in their performance, sales, clients and employee satisfaction by fully utilizing data. We are a value-driven firm. For us this means:

  • Solving the hardest problems

  • Utilizing state-of-the-art data science to help our clients succeed

  • Applying a structured problem-solving approach where all options are considered, researched, and analyzed carefully before recommendations are made

Our Vision: Be a firm that attracts exceptional people to meet the growing demand for data analysis and visualization.

Future goals:

With engaged clients, we want to share the excitement, unlimited potential and methodologies of using data to create business value. We want to be the go-to firm when people think of getting data analytic training, consulting, and big data products.

With top data scientists, we want to be the home for those who want different data challenges all the time. We promote their open data/demo work in the community and  expand the impact of the analytic tools and methodologies they developed. We connect the best ones to build the strongest team.

With new college students and young professionals, we want to help them succeed and be able to handle real-world problems right away through our short-term, intensive training programs and internship programs. Through our rich experience, we have tailored our training program to solve some of the critical problems people face in their workplace.

Through our partnerships we want to spread the best technologies between the US and China. We want to close the gap and bring solutions and offerings to clients we serve. We are at the frontline to pick what is the best product for our clients.

We are glad we have the opportunity to do what we love and are good at, and will continue to enjoy doing it with a growing and energetic team.

DS -What is the state of open source statistical software in China? How have you contributed to R in China and how do you see the future of open source evolve there?

VZ- People love R and embrace R. In May 2014, we helped to organize and sponsor the R conference in Beijing, with 1400 attendees. See our blog post for more details: http://www.r-bloggers.com/the-7th-china-r-conference-in-beijing/

We have helped organize two R conferences in China in the past year, Spring in Beijing and Winter in Shanghai. And we will do a Summer R conference in Guangzhou this year. That’s three R conferences in one year!

DS- Describe some of your work with your partners in helping sell and support R in China and USA

VZ- Revolution Analytics and RStudio are very respected in the R community. We are glad to work and learn from them through collaboration.

With Revolution, we provide proof-of-concept and professional services including analytics and visualization. We also sell Revolution products and licenses in China. With RStudio, we sell RStudio Server Pro and Shiny and promote training programs around those products in NYC. We plan to sell these products in China starting this summer. With Transwarp, we offer the best R analytics and parallelization experience through the Hadoop/Spark ecosystem.

DS- You have done many free workshops in multiple cities. What has been the response so far?

VZ- Let us first talk about what happened in NYC.

I went to a few meetups before I started my own meetup group. Most of the presentations and talks were awesome, but they were not constructed and delivered in a way that let attendees learn and apply the technology right away. Most of the time, those events didn’t offer source code or technical details in the slides.

When I started my own group, my goal was “whatever cool stuff we showed you, we will teach you how to do it.” The majority of the events were designed as hands-on workshops, while we hosted a few high-profile speakers’ talks from time to time (including the chief data scientist for the Obama Campaign).

My workshops cover a wide range of topics, including R, Python, Hadoop, D3.js, data processing, Tableau, location data query, open data, etc. People are super excited and keep saying “oh wow oh wow”, “never thought that I could do it”, “it is pretty cool.” Soon our attendees started giving back to the group by teaching their skills and fun projects, offering free conference rooms, and sponsoring pizzas.

We are glad we have built a community that shares experience and a passion for data science. And I think this is a very unique thing we can do in NYC (since everything is within about a half-hour subway ride). We host events 2-3 times per week and have attracted 1900 members in one year.

In other cities such as Shanghai and Beijing, we do free workshops for college students and scholars every month. We promise to travel to any college within a 24-hour train ride of Beijing. Through partnerships with Capital of Statistics and DataUnion, we hosted entrepreneur sharing events with devoted speeches and lightning talks.

In NYC, we typically see 15 to 150 people per event. U.S. sponsors have included McKinsey & Company, Thoughtworks, and others. Our Beijing monthly tech event sees over 500 attendees and has attracted event co-hosts including Baiyu, Youku and others.

DS- What are some interesting projects of SupStat that you would like to showcase?

VZ- Let me start with one interesting open data project on Citibike data done by our team. The blog post, slides and meetup videos can be found at http://nycdatascience.com/meetup/nyc-open-data-project-ii-my-citibike/

Citibike provides a public bike service. There are many bike stations in NYC. People want to take a bike from a station with at least one available bike. And when they get to the destination, they want to return their bike to a station with at least one available slot. Our goal was to predict where to rent and where to return Citibikes. We showed the complete workflow including data scraping, cleaning, manipulation, processing, modeling, and making algorithms into a product.

We built a program to scrape data and save it to our database automatically. Using this data, we applied models from time series theory and machine learning to predict the number of bikes at every station. Based on the models, we built a website for the Citibike system. This application helps Citibike users plan their trips better. We also showed a few tricks, such as how to set up cron jobs on Linux, Windows and Mac machines, and how to get around RAM limitations on servers with PostgreSQL.
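To make the prediction step concrete, here is a minimal sketch in Python. It is not the team's actual model (the interview does not give details). It is a simple hour-of-day average baseline, and the station IDs, sample data and function names are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def fit_hourly_baseline(observations):
    """Average the bikes seen at each (station, hour-of-day) pair.

    observations: iterable of (station_id, hour_of_day, bikes_available),
    e.g. rows collected by a periodic scraping job (assumed format).
    """
    buckets = defaultdict(list)
    for station_id, hour, bikes in observations:
        buckets[(station_id, hour)].append(bikes)
    return {key: mean(values) for key, values in buckets.items()}


def predict_bikes(model, station_id, hour, default=0.0):
    """Predicted bikes available; falls back to `default` when no history exists."""
    return model.get((station_id, hour), default)


def best_station_to_rent(model, station_ids, hour):
    """Pick the candidate station with the most predicted bikes at that hour."""
    return max(station_ids, key=lambda s: predict_bikes(model, s, hour))


# Hypothetical scraped snapshots: (station, hour of day, bikes available).
history = [
    ("A", 8, 2), ("A", 8, 4), ("A", 18, 10),
    ("B", 8, 9), ("B", 8, 7), ("B", 18, 1),
]
model = fit_hourly_baseline(history)
print(best_station_to_rent(model, ["A", "B"], 8))
```

A real system would likely replace the per-hour average with a proper time series or machine learning model, but the surrounding workflow (collect snapshots, fit, query per station and hour) stays the same.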

We’ve done other projects in China using R to solve problems ranging from Telecommunications data caching to Cardiovascular disease prevention. Each of these projects has required a unique combination of statistical knowledge and data science tools, with R being the backbone of the solution. The commercial cases can be found at our website: http://www.supstat.com/consulting/


SupStat is a statistical consulting company specialized in statistical computing and graphics using state-of-the-art technologies.

VIVIAN S. ZHANG Co-founder & CTO, NYC, Beijing and Shanghai Office

Vivian is a data scientist who has been devoted to the analytics industry and the development and use of data technologies for several years. She obtained expertise in data analysis and data management using various statistical analytical tools and programming languages as a Senior Analyst and Biostatistician at Memorial Sloan-Kettering Cancer Center and Scientific Programmer at Brown University. She is the co-founder of SupStat, the NYC Data Science Academy, and the NYC Open Data Meetup. She likes to portray herself as a programmer, data-holic, and visualization evangelist.

You can read more about SupStat at http://www.supstat.com/team/

Writing on APIs for Programmable Web

I have been writing freelance on APIs for Programmable Web. Here is an updated list of the articles; many of these would be of interest to analytics users. Note: some of these are interviews and they are in bold. Note to regular readers: I keep updating this list, and at each update bring it to the front page, then allow newer blog postings to slide it down!

Scoreoid Aims to Gamify the World Using APIs January 27th, 2014

Plot.ly’s Plot to Visualize More Data January 22nd, 2014

LumenData’s Acquisition of Algorithms.io is a Win-Win January 8th, 2014

Yactraq API Sees Huge Growth in 2013  January 6th, 2014

Scrape.it Describes a Better Way to Extract Data December 20th, 2013

Exclusive Interview: App Store Analytics API December 4th, 2013

APIs Enter 3d Printing Industry November 29th, 2013

PW Interview: José Luis Martinez of Textalytics November 6th, 2013

PW Interview Simon Chan PredictionIO November 5th, 2013

PW Interview: Scott Gimpel Founder and CEO FantasyData.com October 23rd, 2013

PW Interview Brandon Levy, cofounder and CEO of Stitch Labs October 8th, 2013

PW Interview: Jolo Balbin Co-Founder Text Teaser  September 18th, 2013

PW Interview: Bob Bickel CoFounder Redline13 July 29th, 2013

PW Interview: Brandon Wirtz CTO Stremor.com   July 4th, 2013

PW Interview: Andy Bartley, CEO Algorithms.io  June 4th, 2013

PW Interview: Francisco J Martin, CEO BigML.com 2013/05/30

PW Interview: Tal Rotbart Founder- CTO, SpringSense 2013/05/28

PW Interview: Jeh Daruwala CEO Yactraq API, Behavorial Targeting for videos 2013/05/13

PW Interview: Michael Schonfeld of Dwolla API on Innovation Meeting the Payment Web  2013/05/02

PW Interview: Stephen Balaban of Lamda Labs on the Face Recognition API  2013/04/29

PW Interview: Amber Feng, Stripe API, The Payment Web 2013/04/24

PW Interview: Greg Lamp and Austin Ogilvie of Yhat on Shipping Predictive Models via API   2013/04/22

Google Mirror API documentation is open for developers   2013/04/18

PW Interview: Ricky Robinett, Ordr.in API, Ordering Food meets API    2013/04/16

PW Interview: Jacob Perkins, Text Processing API, NLP meets API   2013/04/10

Amazon EC2 On Demand Windows Instances -Prices reduced by 20%  2013/04/08

Amazon S3 API Requests prices slashed by half  2013/04/02

PW Interview: Stuart Battersby, Chatterbox API, Machine Learning meets Social 2013/04/02

PW Interview: Karthik Ram, rOpenSci, Wrapping all science API 2013/03/20

Viralheat Human Intent API- To buy or not to buy 2013/03/13

Interview Tammer Kamel CEO and Founder Quandl 2013/03/07

YHatHQ API: Calling Hosted Statistical Models 2013/03/04

Quandl API: A Wikipedia for Numerical Data 2013/02/25

Amazon Redshift API is out of limited preview and available! 2013/02/18

Windows Azure Media Services REST API 2013/02/14

Data Science Toolkit Wraps Many Data Services in One API 2013/02/11

Diving into Codeacademy’s API Lessons 2013/01/31

Google APIs finetuning Cloud Storage JSON API 2013/01/29

Ergast API Puts Car Racing Fans in the Driver’s Seat 2012/12/05
Springer APIs- Fostering Innovation via API Contests 2012/11/20
Statistically programming the web – Shiny,HttR and RevoDeploy API 2012/11/19
Google Cloud SQL API- Bigger ,Faster and now Free 2012/11/12
A Look at the Web’s Most Popular API -Google Maps API 2012/10/09
Cloud Storage APIs for the next generation Enterprise 2012/09/26
Last.fm API: Sultan of Musical APIs 2012/09/12
Socrata Data API: Keeping Government Open 2012/08/29
BigML API Gets Bigger 2012/08/22
Bing APIs: the Empire Strikes Back 2012/08/15
Google Cloud SQL: Relational Database on the Cloud 2012/08/13
Google BigQuery API Makes Big Data Analytics Easy 2012/08/05
Your Store in The Cloud -Google Cloud Storage API 2012/08/01
Predict the future with Google Prediction API 2012/07/30
The Romney vs Obama API 2012/07/27

Interview: Linkurious aims to simplify graph databases

Here is an interview with a really interesting startup, Linkurious, and its co-founders Sebastien Heymann (also co-founder of Gephi) and Jean Villedieu. They are hoping to make graph databases easier to use and thus spur their adoption.

Decisionstats (DS)- How did you come to set up your startup?

Linkurious (L) -A lot of businesses are struggling to understand the connections within their data. Who are the persons connected to this financial transaction? What happens to the telecommunication network if this antenna fails? Who is the most influential person in this community? There are a lot of questions that involve a deep understanding of graphs. Most business intelligence and data visualization tools are not adapted for these questions because they have a hard time handling queries about connections and because their interface is not suited for network visualization.
I noticed this because I co-founded a graph visualization software called Gephi a few years ago. It quickly became a reference and the software was downloaded 250k times last year. It really helped people understand the connections in their data in a new way.
In 2013, this success inspired me to found Linkurious. The idea is to provide a solution that’s easy to use to democratize graph visualization.

What does it mean?
We want to help people understand the connections in their data. Linkurious is really easy to use and optimized for the exploration of graphs.
You can install it in minutes. Then, it gives you a search interface through which you can query the data. What’s special about our software is that the result of your search is represented as a graph that you can explore dynamically. Contrary to Gephi or other graph visualization tools, Linkurious only shows you a limited subset of your data and not the whole graph. The goal here is to focus on what the user is looking for and help them find an answer faster.
In order to do that, Linkurious also comes with the ability to filter nodes or color them according to their properties. This way, it’s much faster to understand the data.

DS- How do you support packages from Python , and R and other languages like Julia? What is Linkurious based on?

L- Linkurious is largely based on a stack of open-source technologies. We rely on Neo4j, the leading graph database, to store and access the data. Neo4j can handle really large datasets, which means that our users can access the information much faster than with a traditional SQL database. Neo4j also comes with a query language that allows “smart search”, locating nodes and relationships based on rules like “what’s the shortest path between these 2 nodes?” or “who among the close network of this person has been to London and loves sushi”. That’s the kind of thing that Facebook delivers via Graph Search, and it’s exciting to see these technologies applied in the business world.
We also use Node.js, Sigma.js and Elasticsearch.
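As an illustration of the “shortest path between these 2 nodes” kind of question, here is a minimal breadth-first search sketch in plain Python. It is not Neo4j's query language or API, and the small social graph and the names in it are made up for the example.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Return one shortest path (fewest hops) from start to goal, or None.

    graph: dict mapping each node to an iterable of its neighbors.
    """
    if start == goal:
        return [start]
    parents = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in parents:
                continue
            parents[neighbor] = node
            if neighbor == goal:
                # Walk back through the parents to rebuild the path.
                path = [goal]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


# Hypothetical undirected "who knows whom" graph.
graph = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice", "dave"],
    "dave": ["bob", "carol", "erin"],
    "erin": ["dave"],
}
print(shortest_path(graph, "alice", "erin"))
```

A graph database answers the same question declaratively and at much larger scale; the sketch only shows what the query is asking for.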

DS- Name a few case studies where enterprises have used graphical analysis for great benefit.

L- There really are a lot of use cases for graph visualization and we are learning about new ones almost every day. There are well-known applications that are connected to security. For example, graph databases are great for identifying suspicious patterns across a variety of data sources. People using false identities to defraud banks tend to share addresses, phone numbers or names. Without graphs, it’s hard to see how they are connected and they tend to remain undetected until it’s too late. Graph visualization can be triggered by alert systems. Then, analysts can investigate the data and decide whether the alert should be escalated or not.
In the telecom industry, you can use graphs to map your network, identify weak links, and assess the potential impact of a failure (i.e. impact analysis). Graph visualization helps you understand this information and better manage the network.
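The fraud example above, where identities are linked by shared addresses or phone numbers, can be sketched as a connected-components problem. The following Python is a minimal illustration using union-find, not how Linkurious or any particular graph database implements it; the account names and attribute values are invented.

```python
from collections import defaultdict


def find(parent, x):
    """Find the representative of x's group, shortening the chain as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def linked_groups(accounts):
    """Group accounts that are connected through shared attribute values.

    accounts: dict mapping account name -> set of attributes
    (addresses, phone numbers, ...). Returns only groups of 2+ accounts.
    """
    parent = {name: name for name in accounts}
    first_owner = {}  # attribute value -> first account seen with it
    for name, attributes in accounts.items():
        for value in attributes:
            if value in first_owner:
                # Same value seen before: merge the two accounts' groups.
                parent[find(parent, name)] = find(parent, first_owner[value])
            else:
                first_owner[value] = name
    groups = defaultdict(set)
    for name in accounts:
        groups[find(parent, name)].add(name)
    return [members for members in groups.values() if len(members) > 1]


# Hypothetical data: acct1 and acct2 share an address, acct2 and acct3 a phone.
accounts = {
    "acct1": {"12 Elm St", "555-0101"},
    "acct2": {"12 Elm St", "555-0199"},
    "acct3": {"555-0199"},
    "acct4": {"9 Oak Ave"},
}
print(linked_groups(accounts))
```

Note that acct1 and acct3 share nothing directly; they end up in the same group only through acct2, which is exactly the kind of indirect connection that is hard to spot without a graph view.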

We also have clients in the logistics, health or consulting industry. Every data oriented industry needs data visualization tools, and graphs offer powerful ways to ask new questions and reveal unforeseen information.

DS- What are some of the challenges with creating, sustaining and maintaining a cutting-edge technology startup in Europe and France?

L- There are a lot of challenges with creating and sustaining a startup. I think the bigger ones are not necessarily location-related. The main issue is to build something people want. It’s certainly been our biggest challenge. We’ve used a lean startup approach to ship a prototype of our product as fast as we could. The first version of Linkurious was buggy and didn’t get much interest from customers. But we did get feedback from a few people who really liked it. Since then, we’ve been focusing on them to develop our vision of Linkurious. We are pleased with the results. I think we are on the right path, but it’s really a journey.
As for the more location-related challenges, I think France usually gets a bad rap for not being start-up friendly. Our experience has been quite the contrary. There are administrative annoyances, but we also enjoy generous benefits, access to great engineers and a burgeoning startup ecosystem!


The mission of Linkurio.us is to help users access and navigate graph databases in a simple manner so they can make sense of their data.

Some of their interesting solutions are here.

Interview Anne Milley JMP

 An interview with noted analytics thought leader Anne Milley from JMP. Anne talks of statistics, cloud computing, culture of JMP, globalization and analytics in general.

DecisionStats(DS) How was 2013 as a year for statistics in general and JMP in particular?  

Anne Milley-  (AM) I’d say the first-ever  International Year of Statistics (Statistics2013) was a great success! We hope to carry some of that momentum into 2014. We are fans of the UK’s 10-year GetStats campaign—they are in the third year, and it seems to be going really well. JMP had a very good year as well, with worldwide double-digit growth again. We are pleased to have launched version 11 of JMP and JMP Pro last year at our annual Discovery Summit user conference.

DS-  Any cloud computing plans for JMP?

AM- We are exploring options, but with memory and storage still so incredibly cheap on the desktop, the responsiveness of local, in-memory computing on Macs or Windows operating systems remains compelling. John Sall said it best in a blog post he wrote in December.  It is our intention to have a public cloud offering in 2014.

DS- Describe the company culture and environment in the JMP division. Any global plans?

AM- John Sall’s passion to bring interactive, intuitive data visualization and analysis on the desktop continues. There is a strong commitment in the JMP division to speeding the statistical discovery process and making it fun. It’s a powerfully motivating factor to work in an environment where that passion and purpose are shared, and where we get to interact with customers who are also passionate users of JMP, many of whom use JMP and SAS together.

While a majority of JMP personnel are in Cary, North Carolina, almost half the staff are contributing from other states and countries. JMP is sold everywhere we have SAS offices (in 59 countries). JMP has localized versions in seven languages, and we keep getting requests for more.

DS- You have been a SAS Institute veteran for 15 years now. What are some of the ups and downs you remember as milestones in the field of analytics?

AM- The most exciting milestone is that analytics has been getting more attention in the last few years, thanks to a combination of factors. Analytics is a very inclusive term (statistics, optimization, data mining, machine learning, data science, etc.), but statistics is the main discipline we draw on when we are trying to make informed decisions in the face of uncertainty. In the early days of data mining, there was a tension between statisticians and data miners/machine learners, but we now have a richer set of methods (with more solid theoretic underpinnings) with which to analyze data and make better decisions. We have better ways to automate parts of the model-building process as well, which is important with ever-wider data. In the early days of data mining, I remember many reacting with “Why spend so much time dredging through opportunistically collected data, when statistics has so much more to offer, like design of experiments?” There is still some merit to that, and maybe we will see the pendulum swing back to doing more with information-rich data.

DS- What are your top three forecasts for analytics technology in 2014?

AM- My perspective may be different than others on what’s trending in analytics technology, but as we try to do more with more data, here are my top three picks:

  • We will continue to innovate new ways to visualize data and statistical output to capitalize on our high visual bandwidth. (Examples of some of our recent innovations can be found on the JMP Blog.)

  • We will continue to see innovative ways to create more analytic bandwidth and democratize analytics—for example, more quickly build and deploy analytic applications and interactive visualizations for others to use.

  • We will see more integration with commonly used analytical tools and infrastructure to help analysts be more productive.

DS-  How do you maintain work-life balance?

AM- I enjoy what I do and the great people I work with; that is part of what motivates me each day and is added to the long list of things for which I’m grateful. Outside of work, I enjoy spending time with family, regular exercise, organic gardening and other creative pursuits.

DS- As a senior technology management person working for the past 15 years, do you think technology is a better employer for women than it was in the 1990s? What steps can be taken to improve this?

AM- I certainly see more support for women in technology with various women-in-technology organizations and programs around the world. And I also see more encouragement for girls and young women to get more exposure to science, technology, engineering, math, and statistics and consider the career options knowledge of these areas could bring. But there is more to do. I would like to add statistics to the STEM list explicitly since many still consider statistics a branch of math and don’t appreciate that statistics is the science/language of science. (Florence Nightingale said that statistics is “the most important science in the whole world.”) This year, we will see the first Women in Statistics Conference “enticing, elevating, and empowering careers of women in statistics.” There are several organizations and programs out there advocating for women in science, engineering, statistics and math, which is great. The resources such organizations provide for networking, mentoring, career development and making role models more visible are important in raising awareness on what the impediments are and how to overcome them. We should all read Sheryl Sandberg’s re-release of Lean In for Graduates (due out in April). Thank you for asking this question!


Anne oversees product management and analytic strategy in JMP Product Marketing. She is a contributing faculty member for the International Institute of Analytics.

2013 Thank You Note

I would like to write a thank you note to some of the people who helped make Decisionstats.com possible. We had a total of 150,644 views this year. For that, I have to thank you, dear readers, for putting up with me. It is now our seventh year.

Jan     Feb     Mar     Apr     May     Jun     Jul     Aug     Sep     Oct     Nov     Dec     Total
13,940  12,153  12,948  13,371  12,778  12,085  12,894  11,934  9,914   14,764  12,907  10,956  150,644

I would like to thank Chris (of Mashape) for helping me with some of the interviews I wrote here. I did 26 interviews this year for Programmable Web and a total of 30+ articles, including the interviews, in 2013.

Of course, we have now reached 116 excellent interviews on Decisionstats.com alone (see http://goo.gl/V6UsCG). I would like to thank each one of the interviewees who took precious time to fill out the questions.

Sponsors- I would like to thank Dr Eric Siegel (individually as an author and as founder chair of www.pawcon.com), Nadja and Ingo (for Rapid-Miner), Dr Jonathan (for Datamind), Chris M (for Statace.com), Gergely (Author) and many more during all these six years who have kept us afloat and the servers warm in these days of cold reflection, including Gregory (of KDNuggets.com) and erstwhile AsterData founders.

Training Partners- I would like to thank Lovleen Bhatia (of Edureka) for giving me the opportunity to make http://www.edureka.in/r-for-analytics, which now has 1721 learners as per http://www.edureka.in/

I would also like to say a special thank you to Jigsaw Academy for giving me the opportunity to create the first affordable and quality R course in Asia: http://analyticstraining.com/2013/jigsaw-completes-training-of-300-students-on-r/

These training courses, including those by Datamind and Coursera, remain a formidable and affordable alternative to many others catching up in the analytics education game in India (an issue I wrote about here).

I would also like to thank each and every one of my students (past and present) and everyone in the #rstats and SAS-L communities, including people who may have been left out.

Thank you sir, for helping me and Decisionstats.com !

I wish each one of you a very happy and joyous New Year and a great and prosperous 2014!

Karl Rexer Interview on the state of Analytics

To cap off a wonderful year, we have decided to interview Karl Rexer, founder of http://www.rexeranalytics.com/ and of the data mining survey that is considered the industry benchmark for the state of analytics.

Ajay: Describe the history behind the survey, how you came up with the idea, and which other players you think survey the data mining and statistical software market apart from you.

 Karl: Since the early 2000s I’ve been involved on the organizing and review committees for several data mining conferences and workshops. Early in the 2000s, in the hallways at these conferences I heard many analytic practitioners discussing and comparing their algorithms, data sources, challenges, tools, etc. Since we were already conducting online surveys for several of our clients, and my network of data miners is pretty large, I realized that we could easily do a survey of data miners, and share the results with the data mining community. I saw that the gap was there (and the interest), and we could help fill it. It was a way to give back to the data mining community, and also to raise awareness in the marketplace for my company, Rexer Analytics. So in 2007 we launched the first Data Miner Survey. In the first year, 314 data miners participated, and it’s just grown from there. In each of the last two surveys, over 1200 people participated. The interest we’ve seen in our research summary reports has also been astounding – we get thousands of requests for the summary reports each year. Overall, this just confirms what we originally thought: both inside the industry and beyond, people are hungry for information about data mining.

Are there other surveys and reviews of analytic professionals and the analytic marketplace? Sure. And there’s room for a variety of methodologies and perspectives. Forrester and Gartner produce several reports that cover the analytic marketplace – they largely focus on software evaluations and IT trends. There are also surveys of CIOs and IT professionals that sometimes cover analytic topics. James Taylor (Decision Management Solutions) conducted an interesting study this year of Predictive Analytics in the Cloud. And of course, there are also the KDnuggets single-question polls that provide a pulse on people’s views of topical issues.

Ajay: Over the years- what broad trends have you seen in the survey in terms of paradigms- name your top 5 insights over these years

Karl: Well, I can’t think of a fifth one, but I’ve got four key findings and trends we’ve seen over the years we’ve been doing the Data Miner Surveys:

  1. The dramatic rise of open-source data mining tools, especially R. Since 2010, R has been the most-used data mining tool. And in 2013, 70% of data miners report using R. R is frequently used along with other tools, but we also see an increasing number of data miners selecting R as their primary tool.
  2. Data miners consistently report that regression, decision trees, and cluster analysis are the key algorithms they turn to. Each of the surveys, from 2007 through 2013, has shown this same core triad of algorithms.
  3. The challenges data miners face are also consistent: Across multiple years, the #1 challenge data miners report has been “dirty data.” The other top challenges are “explaining data mining to others” and “difficult access to data”. In response to the 2010 survey, data miners described their best practices in overcoming these three key challenges. A summary of their ideas is available on our website here: http://www.rexeranalytics.com/Overcoming_Challenges.html. And three linked “challenge” pages contain almost 200 verbatim best practice ideas collected from survey respondents.
  4. We also see that there is excitement among analytic professionals, high job satisfaction, and room for more and better analytics. People report that the number of analytic projects is increasing, and the size of analytic teams is increasing too. But still there’s room for much wider and more sophisticated use of analytics – only a minority of data miners consider their companies to be analytically sophisticated.

Ajay: What percentage of people are now doing analytics on the cloud, on mobile or tablet, versus desktop?

Karl: In the past few years we’ve seen a doubling in the percent of people who report doing some of their analytics using cloud environments. It’s still the minority of data miners, but it’s grown from 7% in 2010 to 10% in 2011, and 19% reporting using cloud environments in 2013.

Ajay: Your survey is free. How does it help your consulting practice?

Karl: Our main motivation for doing the Data Miner Survey is to contribute to the data mining community. We don’t want to charge a fee for the summary reports, because we want to get the information into as many people’s hands as possible. And we want people to feel free to send the report on to their friends and colleagues.

However, the Data Miner Survey does also help Rexer Analytics. It helps to raise the visibility of our company. It increases the traffic and links to our website, and therefore helps our Google rankings. And it is a great conversation starter.

Ajay: Name some statistics on how popular your survey has become over time, in terms of people filling out the survey and people reading it.

Karl: In 2007 when we launched the first Data Miner Survey, 314 data miners participated, and it’s grown nicely from there. In each of the last two surveys, over 1200 people participated. The interest we’ve seen in our research summary reports has also been growing at a dramatic rate – recently we’ve been getting thousands of requests for the summary reports each year. Additionally, we have been unveiling the highlights of the surveys with a presentation at the Fall Predictive Analytics World conferences, and it is always a popular talk.

But the most gratifying aspects about the expanded interest in our Data Miner Survey are two things:

  1. The great conversations that the Data Miner Survey has initiated. I have wonderful conversations with people by phone, email and at conferences and at colleges about the findings, the trends, and about all the great ideas people have for new and exciting ways that they want to apply analytics in their domains – everything from human resource planning to cancer research, and customer retention to fraud detection. And many people have contributed ideas for new questions or topics that we have incorporated into the survey.
  2. Seeing that people in the data mining community find the survey results useful. Many students and young people entering the field have told us the summary reports provide a great overview of the field and emerging trends. And many software vendors have told us that the survey helps them better understand the needs and preferences of hands-on data mining practitioners. I’m often surprised to see new people and places that are reading and appreciating our survey. We get emails from all corners of the globe, asking questions about the survey, or asking to share it with others. Sometime last year after receiving a question from an academic researcher in Asia, I decided to check Google Scholar to see who is citing the Data Miner Survey in their books and published papers. The list was long. And the list of online news stories, blogs and other mentions of the Data Miner Survey was even longer. I started a list of citations, with links back to the places that are citing the Data Miner Survey – you can look at the list here: http://www.rexeranalytics.com/Data_Miner_Survey_Citations.html – there are over 100 places citing our research, and the list includes 15 languages. But even more surprising was finding that someone had created a Wikipedia entry about the Data Miner Surveys. I made a couple small edits, but then I stopped. The accepted rule in the Wikipedia community is to not edit things that one has a personal interest in. However, I want to encourage any Wikipedia authors out there to go and help update https://en.wikipedia.org/wiki/Rexer%27s_Annual_Data_Miner_Survey.

Ajay: What do you think are the top 3 insightful charts from your 2013 Report?

Karl: OK, it’s tough for me to pick only 3. I think that you should pick the three that you think are the most insightful, and then blog about them and the reasons you think they’re important.

 But if you want me to pick 3, then here are three good ones:
— R Usage graph on page 16 
— Algorithm graph on page 36  
— The pair of graphs on page 19 that show that there’s still a lot of room for improvement
Happy new year!
(Ajay- You can see the wonderful report at http://www.rexeranalytics.com/, especially the collection of links in the top right corner of the home page that cite this survey.)

Interview Christian Mladenov CEO StatAce Excellent and Hot #rstats StartUp

Here is an interview with Christian Mladenov, CEO of StatAce, a hot startup in cloud-based data science and statistical computing.

Ajay Ohri (AO)- What is the difference between using R through StatAce and using R through RStudio on an RStudio server hosted on Amazon EC2?

Christian Mladenov (CM)- There are a few ways in which I think StatAce is better:

  • You do not need the technical skills to set up a server. You can instead start straight away at the click of a button.

  • You can save the full results for later reference. With an RStudio server you need to manually save and organize the text output and the graphics.

  • We are aiming to develop a visual interface for all the standard stuff. Then you will not need to know R at all.

  • We are developing features for collaboration, so that you can access and track changes to data, scripts and results in a team. With an RStudio server, you manage commits yourself, and Git is not suitable for large data files.

AO- How do you aim to differentiate yourself from other providers of R-based software, including Revolution, RStudio, Rapporter and even Oracle R Enterprise?

CM- We aim to build a scalable, collaborative and easy-to-use environment. Pretty much everything else in the R ecosystem is lacking one, if not two, of these. Most of the GUIs lack a visual way of doing the standard analyses. The ones that have it (e.g. Deducer) have rather poor usability. Collaboration tools are hardly built in. RStudio has Git integration, but you need to set it up yourself, and you cannot really track large source data in Git.

Revolution Analytics have great technology, but you need to know R and you need to know how to maintain servers for large scale work. It is not very collaborative and can become quite expensive.

Rapporter is great for generating reports, but it is not very interactive – editing templates is a bit cumbersome if you just need to run a few commands. I think it wants to be the place to go to after you have finalized the development of the R code, so that you can share it.  Right now, I also do not see the scalability.

With Oracle R Enterprise you again need to know R. It is targeted at large enterprises and I imagine it is quite expensive, considering it only works with Oracle’s database. For that you need an IT team.

AO- How do you see the space for using R on a cloud?

CM- I think this is an area that has not received enough quality attention – there are some great efforts (e.g. ElasticR), but they are targeted at experienced R users. I see a few factors that facilitate the migration to the cloud:

  • Statisticians collaborate more and more, which means they need to have a place to share data, scripts and results.

  • The number of devices people use is increasing, and now frequently includes a tablet. Having things accessible through the web gives more freedom.

  • More and more data lives on servers. This is both because it is generated there (e.g. click streams) and because it is too big to fit on a user’s PC (e.g. raw DNA data). Using it where it already is prevents slow download/upload.

  • Centralizing data, scripts and results improves compliance (everybody knows where it is), reproducibility and reliability (it is easily backed up).

For me, bringing R to the cloud is a great opportunity.

AO- What are some of the key technical challenges you currently face and are seeking to solve for R-based cloud solutions?

CM- Our main challenge is CPU use, since cloud servers typically have multiple slow cores and R is mostly single-threaded. We have yet to fully address that and are actively following the projects that aim to improve R’s interpreter – pqR, Renjin, Riposte, etc. One option is to move to bare metal servers, but then we will lose a lot of flexibility.

Another challenge is multi-server processing. This is also an area of progress where we do not yet have a stable solution.

AO- What are some of the advantages and disadvantages of being a Europe-based tech startup vis-à-vis a San Francisco-based tech startup?

CM- In Eastern Europe at least, you can live quite cheaply, so you can better focus on the product and the customers. In the US you need to spend a lot of time courting investors.

Eastern Europe also has a lot of technical talent – it is not that difficult or expensive to hire experienced engineers.

The disadvantages are many, and I think they out-weigh the advantages:

  • Capital is scarce, especially after the seed stage. This means startups either have to focus on profit, which limits their ability to execute a grander vision, or they need to move to the US, which wastes a lot of time and resources.

  • There is limited access to customers, partners, mentors and advisors. Most of the startup innovation happens in the US and its users prefer to deal with local companies.

  • The environment in Europe is not as supportive in terms of events, media coverage, and even social acceptance. In many countries founders are viewed with a bit of suspicion, and failure frequently means the end to one’s credibility.

AO- What advice would you give to aspiring data scientists?

CM- Use open source. R, Julia, Octave and the others are seeing a level of innovation that the commercial solutions just cannot match. They are very flexible and versatile, and if you need something specific, you should learn some Python and do it yourself.

Keep things reproducible, or at some point you will get lost. This includes a version control system.

Be active in the community. While books are great, sharing and seeking advice will improve your skills much faster.

Focus more on “why” you do something and “what” you want to achieve. Only then get technical about “how” you want to do it. Use a good IDE that facilitates your work and allows you to do the simple things fast. You know, like StatAce :)

AO- Describe your career journey from Student to CEO

CM- During my bachelor’s studies I worked as a software developer and customer intelligence analyst. This gave me a lot of perspective on software and data.

After graduating I got a job where I coordinated processes and led projects. This is where I discovered the importance of listening to customers, planning well in advance, and having good data to base decisions on.

In my master’s studies, it was my statistics-heavy thesis that made me think “why is there not a place where I can easily use the power of R on a machine with a lot of RAM?” This is when the idea for StatAce was born.


About StatAce-

Bulgarian StatAce is the winner of betapitch | global, which was held in Berlin on 6 July (read more about it here). The team, driven by the lack of software suited to low student budgets, came up with the idea of building “Google docs for professional statisticians” and eventually took home the first prize of the startup competition.

