
Category Archives: Analytics

Using Windows Azure Machine Learning as a service with R #rstats

A brief tutorial I wrote after experimenting with the service at manage.windowsazure.com.

Interview Louis Bajuk-Yorgan TIBCO Enterprise Runtime for R (TERR) #rstats

Here is an interview with Louis Bajuk-Yorgan of TIBCO. TIBCO, formerly the leading commercial vendor of S-PLUS (the precursor of the R language), makes a commercial enterprise version of R called TIBCO Enterprise Runtime for R (TERR). Louis also presented recently at useR! 2014: http://user2014.stat.ucla.edu/abstracts/talks/54_Bajuk-Yorgan.pdf


DecisionStats(DS)- How is TERR different from Revolution Analytics or Oracle R? How is it similar?

Louis Bajuk-Yorgan (Lou)- TERR is unique, in that it is the only commercially-developed alternative R interpreter. Unlike other vendors, who modify and extend the open source R engine, we developed TERR from the ground up, leveraging our 20+ years of experience with the closely-related S-PLUS engine.
Because of this, we were able to architect TERR to be faster, more scalable, and handle memory much more efficiently than the open source R engine. Other vendors are constrained by the limitations of the open source R engine, especially around memory management.
Another important difference is that TERR can be licensed to customers and partners for tight integration into their software, which delivers a better experience for their customers. Other vendors typically integrate loosely with open source R, keeping R at arm’s length to protect their IP from the risk of contamination by R’s GPL license. They often force customers to download, install and configure R separately, making for a much more difficult customer experience.
Finally, TIBCO provides full support for the TERR engine, giving large enterprise customers the confidence to use it in their production environments. TERR is integrated in several TIBCO products, including Spotfire and Streambase, enabling customers to take models developed in TERR and quickly integrate them into BI and real-time applications.

DS- How much of R is TERR compatible with?

Lou- We regularly test TERR with a wide variety of R packages, and extend TERR's R coverage over time. We are currently compatible with ~1800 CRAN packages, as well as many Bioconductor packages. The full list of compatible CRAN packages is available at the TERR Community site at tibcommunity.com.

DS- Describe TIBCO Cloud Compute Grid. What are its applications for data science?

Lou- TIBCO Cloud Compute Grid leverages the TIBCO GridServer architecture, which has been used by major Wall Street firms to run massively-parallel applications across tens of thousands of individual nodes. TIBCO CCG brings this robust platform to the cloud, enabling anyone to run massively-parallel jobs on their Amazon EC2 account. The platform is ideal for Big Computation jobs, such as Monte Carlo simulation and risk calculations. More information can be found at the TIBCO Cloud Marketplace at https://marketplace.cloud.tibco.com/.

DS- What advantages does TIBCO's rich history with the S project give it for the R project?

Lou- Our 20+ years of experience with S-PLUS gave us a unique knowledge of the commercial applications of the S/R language, deep experience with architecting, extending and maintaining a commercial S language engine, strong ties to the R community and a rich trove of algorithms we could apply on developing the TERR engine.

DS- Describe some benchmarks of TERR against open source R.

Lou- While the speed of individual operations will vary, overall TERR is roughly 2-10x faster than open source R when applied to small data sets, and 10-100x faster when applied to larger data sets. This is because TERR's efficient memory management enables it to handle larger data more reliably, and to stay more linear in performance as data sizes increase.
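A reader who wants to probe this kind of claim on their own hardware can time one script under both engines. The operations below are illustrative assumptions on my part, not TIBCO's actual benchmark suite; the idea is simply to run the identical file under open source R and under TERR and compare elapsed times at each data size.

```r
# Hypothetical micro-benchmark: time a vector sort and a grouped
# aggregation at increasing data sizes. Run the same script under
# each interpreter and compare the elapsed times.
sizes <- c(1e4, 1e5, 1e6)
timings <- sapply(sizes, function(n) {
  x <- rnorm(n)
  d <- data.frame(g = sample(letters, n, replace = TRUE), x = x)
  system.time({
    sort(x)                 # vector operation
    tapply(d$x, d$g, mean)  # grouped aggregation
  })["elapsed"]
})
names(timings) <- format(sizes, scientific = TRUE)
print(timings)
```

Linearity (or lack of it) shows up in how the elapsed time grows as the size multiplies by 10.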

DS-  TERR is not open source. Why is that?

Lou- While open sourcing TERR is an option we continue to consider, we've decided to initially focus our energy and time on building the best S/R language engine possible. Running a successful, vibrant open source project is a significant undertaking to do well, and if we choose to do so, we will invest accordingly.
Instead, for now we've decided to make a Developer Edition of TERR freely available, so that the R community at large can still benefit from our work on TERR. The Developer Edition is available at tap.tibco.com.

DS- What is TIBCO like as a company to work for, for potential data scientists?

Lou- Great! I've worked in this space for nearly 18 years, in large part because I get the opportunity to work with customers in many different industries (such as Life Sciences, Financial Services, Energy, Consumer Packaged Goods, etc.) who are trying to solve valuable and interesting problems.
We have an entire team of data scientists, called the Industry Analytics Group, who work on these sorts of problems for our customers, and we are always looking for more Data Scientists to join that team.

DS- How is TIBCO giving back to the R community globally? What are its plans for the community?

Lou- As mentioned above, we make a free Developer Edition of TERR available. In addition, we've been sponsors of useR! for several years, we contribute feedback to the R Core team as we develop TERR, and we often open source packages that we develop for TERR so that they can be used with open source R as well. This has included packages ported from S-PLUS (such as sjdbc) and new packages (such as tibbrConnector).

DS- As a six-time attendee of useR!, describe the evolution of the R ecosystem as you have observed it.

Lou- It has been fascinating to see how the R community has grown and evolved over the years. The useR! conference at UCLA this year was the largest ever (700+ attendees), with more commercial sponsors than ever before (including enterprise heavyweights like TIBCO, Teradata and Oracle, smaller analytic vendors like RStudio, Revolution and Alteryx, and new companies like plot.ly). What really struck me, however, was the nature of the attendees. There were far more attendees from commercial companies this year, many of whom were R users. More so than in the past, there were many people who simply wanted to learn about R.
Lou Bajuk-Yorgan leads Predictive Analytics product strategy at TIBCO Spotfire, including the development of the new TIBCO Enterprise Runtime for R. With a background in Physics and Atmospheric Sciences, Lou was a Research Scientist at NASA JPL before focusing on analytics and BI software 16 years ago. An avid cyclist, runner and gamer, Lou frequently speaks and tweets (@LouBajuk) about the importance of Predictive Analytics for the most valuable business challenges.

Interview Jan de Leeuw Founder JSS

Here is an interview with one of the greatest statisticians and educators of this generation, Prof Jan de Leeuw. In this exclusive and freewheeling interview, Prof de Leeuw talks about the evolution of technology, education and statistics, and generously shares nuggets of knowledge of interest to present and future statisticians.


DecisionStats(DS)- You have described the UCLA Dept of Statistics as your magnum opus. Name a couple of turning points in your career which helped in this creation.

Jan de Leeuw (JDL) -From about 1980 until 1987 I was head of the Department of Data Theory at Leiden University. Our work there produced a large number of dissertations which we published using our own publishing company. I also became president of the Psychometric Society in 1987. These developments resulted in an invitation from UCLA to apply for the position of director of the interdepartmental program in social statistics, with a joint appointment in Mathematics and Psychology. I accepted the offer in 1987. The program eventually morphed into the new Department of Statistics in 1998.

DS- Describe your work with the Gifi software and nonlinear multivariate analysis.

JDL- I started working on NLMVA and MDS in 1968, while I was a graduate student researcher in the new Department of Data Theory. Joe Kruskal and Doug Carroll invited me to spend a year at Bell Labs in Murray Hill in 1973-1974. At that time I also started working with Forrest Young and his student Yoshio Takane. This led to the sequence of “alternating least squares” papers, mainly in Psychometrika. After I returned to Leiden we set up a group of young researchers, supported by NWO (the Dutch equivalent of the NSF) and by SPSS, to develop a series of Fortran programs for NLMVA and MDS.
In 1980 the group had grown to about 10-15 people, and we gave a successful postgraduate course on the “Gifi methods”, which eventually became the 1990 Gifi book. By the time I left Leiden most people in the group had gone on to do other things, although I continued to work in the area with some graduate students from Leiden and Los Angeles. Then around 2010 I worked with Patrick Mair, a visiting scholar at UCLA, to produce the R packages smacof, anacor, homals, aspect, and isotone. Also see https://www.youtube.com/watch?v=u733Mf7jX24
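For readers curious about these packages, here is a minimal sketch using smacof, one of the packages mentioned above: multidimensional scaling by majorization, applied to R's built-in eurodist road distances. The call follows the package's documented interface, but treat it as an illustrative sketch rather than a tutorial.

```r
# MDS with the smacof package on the built-in eurodist data
# (road distances between 21 European cities).
library(smacof)                       # install.packages("smacof") if needed

fit <- smacofSym(eurodist, ndim = 2)  # symmetric MDS, 2-D configuration
head(fit$conf)                        # fitted city coordinates
fit$stress                            # stress-1 badness-of-fit measure
plot(fit)                             # map-like configuration plot
```

Low stress indicates the two-dimensional configuration reproduces the distances well.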

DS- You have presided over almost five decades of changes in statistics. Can you describe the effect of changes in computing and statistical languages over the years, and some lessons from these changes?

JDL- I started in 1968 with PL/I. Card decks had to be flown to Paris to be compiled and executed on the IBM/360 mainframes. Around the same time APL came up and satisfied my personal development needs, although of course APL code was difficult to communicate. It was even difficult to understand your own code after a week. We had APL symbol balls on the Selectric typewriters and APL symbols on the character terminals. The basic model was there: you develop in an interpreted language (APL) and then for production you use a compiled language (FORTRAN). Over the years APL was replaced by XLISP and then by R. Fortran was largely replaced by C; I never switched to C++ or Java. We discouraged our students from using SAS or SPSS or MATLAB. UCLA Statistics promoted XLISP-STAT for quite a long time, but eventually we had to give it up. See http://www.stat.ucla.edu/~deleeuw/janspubs/2005/articles/deleeuw_A_05.pdf.

(In 1998 the UCLA Department of Statistics, which had been one of the major users of Lisp-Stat, and one of the main producers of Lisp-Stat code, decided to switch to S/R. This paper discusses why this decision was made, and what the pros and the cons were. )

Of course the WWW came up in the early nineties and we used a lot of CGI and PHP to write instructional software for browsers.

Generally, there has never been a computational environment like R: so integrated with statistical practice and development, and so enormous, accessible and democratic. I must admit I personally still prefer to use R as originally intended: as a convenient wrapper around, and command line interface for, compiled libraries and code. But it is also great for rapid prototyping, and in that role it has changed the face of statistics.
The fact that you cannot really propose statistical computations without providing R references and in many cases R code has contributed a great deal to reproducibility and open access.

DS- Do Big Data and Cloud Computing, in this era of data deluge, require a new focus on creativity in statistics, or just better industrial application of statistical computing over naive models?
JDL- I am not really active in Big Data and Cloud Computing, mostly because I am more of a developer than a data analyst. That is of course a luxury.
The data deluge has been there for a long time (sensors in the road surface, satellites, weather stations, air pollution monitors, EEGs, MRIs), but until fairly recently there were no tools, in both hardware and software, to attack these data sets. Of course big data sets have changed the face of statistics once again, because in the context of big data the emphasis on optimality and precise models becomes laughable. What I see in the area is a lot of computer science, a lot of fads, a lot of ad-hoc work, and not much of a general rational approach. That may be unavoidable.
DS- What is your biggest failure in professional life?
JDL- I decided in 1963 to major in psychology, mainly because I wanted to discover big truths about myself. About a year later I discovered that psychology and philosophy do not produce big truths, and that my self was not a very interesting object of study anyway. I switched to physics for a while, and minored in math, but by that time I already had a research assistant job, was developing software, and was not interested any more in going to lectures and doing experiments. In a sense I dropped out. It worked out fairly well, but it sometimes gives rise to imposter syndrome.
DS- If you had to do it all over again, what are the things you would really look forward to doing?

JDL- I really don’t know how to answer this. A life cannot be corrected, repeated, or relived.

DS- What motivated you to start the Journal of Statistical Software and push for open access?
JDL- That's basically in the useR! 2014 presentation. See http://gifi.stat.ucla.edu/janspubs/2014/notes/deleeuw_mullen_U_14.pdf
DS- How can we make departments of Statistics and departments of Computer Science work closely together for a better industry-relevant syllabus, especially in data mining, business analytics and statistical computing?
JDL- That's hard. The cultures are very different: CS is so much more aggressive and self-assured, as well as having more powerful tools and better trained students. We have tried joint appointments but they do not work very well. There are some interdisciplinary programs, but generally CS dominates and provides the basic keywords, such as neural nets, machine learning, data mining, cloud computing and so on. One problem is that in many universities statistics is the department that teaches the introductory statistics courses, and not much more. Statistics is forever struggling to define itself, to fight silly battles about foundations, and to try to control the centrifugal forces that do statistics outside statistics departments.
DS- What are some of the work habits that have helped you be more productive in your writing and research?
JDL- Well, if one is not brilliant, one has to be a workaholic. It's hard on the family. I decided around 1975 that my main calling was to gather and organize groups of researchers with varying skills and interests, and not to publish as much as possible. That helped.
Jan de Leeuw (born December 19, 1945) is a Dutch statistician, and Distinguished Professor of Statistics and Founding Chair of the Department of Statistics, University of California, Los Angeles. He is known as the founding editor and editor-in-chief of the Journal of Statistical Software, as well as editor-in-chief of the Journal of Multivariate Analysis.

Interview Antonio Piccolboni Big Data Analytics RHadoop #rstats

Here is an interview with Antonio Piccolboni, a consultant on big data analytics who has most notably worked on the RHadoop project for Revolution Analytics. Here he tells us about writing better code and the projects he has been involved with.
DecisionStats(DS)- Describe your career journey from being a computer science student to one of the principal creators of RHadoop. What motivated you, and what challenges did you overcome? What were the turning points? (You have 3500+ citations. What are most of those citations regarding?)

Antonio (AP)- I completed my undergrad in CS in Italy. I liked research and industry didn’t seem so exciting back then, both because of the lack of a local industry and the Microsoft monopoly, so I entered the PhD program.
After a couple of false starts I focused on bioinformatics. I was very fortunate to get involved in an international collaboration and that paved the way for a move to the United States. I wanted to work in the US as an academic, but for a variety of reasons that didn’t work out.
Instead I briefly joined a new proteomics department in a mass spectrometry company, then a research group doing transcriptomics, also in industry, but largely grant-funded. That’s the period when I accumulated most of my citations.
After several years there, I realized that bioinformatics was not offering the opportunities I was hoping for, and that I was missing out on great changes that were happening in the computer industry, in particular Hadoop. So after much deliberation I took the plunge and worked first for a web ratings company and then a social network, where I took the role of what is now called a “data scientist”, using the statistical skills that I acquired during the first part of my career. After taking a year off to work on my own idea I became a freelancer, Revolution Analytics became one of my clients, and I got involved in RHadoop.
As you can see there were several turning points. It seems to me one needs to seek a balance of determination and flexibility, both mental and financial, to explore different options, while trying to make the most of each experience. Also, be at least aware of what happens outside your specialty area. Finally, the mandatory statistical warning: any generalizations from a single career are questionable at best.


DS-What are the top five things you have learnt for better research productivity and code output in your distinguished career as a computer scientist?
AP-1. Keep your code short. Conciseness in code seems to correlate with a variety of desirable properties, like testability and maintainability. There are several aspects to it and I have a linkblog about this (asceticprogrammer.info). If I had said “simple”, different people would have understood different things, but when you say “short” it’s clear and actionable, albeit not universally accepted.
2. Test your code. Since proving code correct is unfeasible for the vast majority of projects, development is more like an experimental science, where you assemble programs and then corroborate that they have the desired properties via experiments. Testing can take many forms, but no testing at all is not an option.
3. Many seem to think that programming is an abstract activity somewhere in between mathematics and machines. I think a developer's constituency is people, be they the millions using a social network or the handful using a specialized API. So I try to understand how people interact with my work, what they try to achieve, what their background is, and so forth.
4. Programming is a difficult activity, meaning that failure happens even to the best and brightest. Learning to take risk into account and mitigate it is very important.
5. Programs are dynamic artifacts. For each line of code, one may not only ask if it is correct but for how long, as assumptions shift, or how often it will be executed. For a feature, one could wonder how many will use it, and how many additional lines of code will be necessary to maintain it.
6. Bonus statistical suggestion: check the assumptions. Academic statistics has an emphasis on theorems and optimality, bemoaned already by Tukey over sixty years ago. Theorems are great guides for data analysis, but rely on assumptions being met, and, when they are not, consequences can be unpredictable. When you apply the most straightforward, run of the mill test or estimator, you are responsible for checking the assumptions, or otherwise validating the results. “It looked like a normal distribution” won’t cut it when things go wrong.
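As a concrete illustration of this last point (my example, not Antonio's): before reaching for a run-of-the-mill two-sample t-test, it costs one line to look at the normality assumption instead of taking it on faith.

```r
# Illustrative data: exponential draws are clearly non-normal.
set.seed(1)
a <- rexp(40)
b <- rexp(40) + 0.5

shapiro.test(a)       # a small p-value flags the normality violation
qqnorm(a); qqline(a)  # visual check: "it looked like a normal distribution" won't cut it
wilcox.test(a, b)     # rank-based alternative with weaker assumptions
t.test(a, b)          # the naive test, for comparison
```

With a sample this skewed, the rank-based test is the safer choice; the point is that the check happens before the test, not after things go wrong.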


DS-Describe the RHadoop project, especially the newer plyrmr package. How was the journey to create it?
AP-Hadoop is for data and R is for statistics, to use slogans, so it’s natural to ask the question of how to combine them, and RHadoop is one possible answer.
We selected a few important components of Hadoop and provided an R API. plyrmr is an offshoot of rmr, which is an API to the mapreduce system. While rmr has enjoyed some success, we received feedback that a simplified API would enable even more people to directly access and analyze the data. Again based on feedback, we decided to focus on structured data, equivalent to an R data frame. We tried to reduce the role of user-defined functions as parameters to be fed into the API, and when custom functions are needed they are simpler. Grouping and regrouping the data is fundamental to mapreduce. While in rmr the programmer has to process two data structures, one for the data itself and the other describing the grouping, plyrmr uses a very familiar SQL-like “group” function.
Finally, we added a layer of delayed evaluation that allows certain optimizations to be performed automatically and encourages reuse by reducing the cost of abstraction. We found enough commonalities with the popular package plyr that we decided to use it as a model, hence the tribute in the name. This lowers the cognitive burden for a typical user.
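plyrmr itself runs against Hadoop, but the grouped-summary idiom it borrows is easy to show locally with plyr, the package it is named after; a plyrmr program expresses the same intent with its SQL-like group function over a Hadoop-backed data set, with evaluation delayed. A local sketch:

```r
library(plyr)  # install.packages("plyr") if needed

# A small data frame standing in for a (much larger) Hadoop data set.
df <- data.frame(g = rep(c("a", "b"), each = 5), x = 1:10)

# Group by g and summarize each group, plyr-style.
ddply(df, .(g), summarize, mean_x = mean(x), n = length(x))
#   g mean_x n
# 1 a      3 5
# 2 b      8 5
```

The appeal of the model is exactly this: the grouping and the per-group computation are stated declaratively, so the same intent can be re-targeted at a distributed backend.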


DS-Hue is an example of making it easier for users to work with Hadoop; so are sandboxes and video trainings. How can we make it easier to create better interfaces to software like RHadoop et al?
AP- It's always a trade-off between power and ease of use. However, I believe that the ability to express analyses in a repeatable and communicable way is fundamental to science, necessary to business, and one of the key elements in the success of R. I haven't seen a point-and-click GUI that satisfies these requirements yet, though it's not inconceivable. For me, the most fruitful effort is still on languages and APIs. While some people write their own algorithms, the typical data analyst needs a large repertoire of algorithms that can be applied to specific problems. I see a lot of straightforward adaptations of sequential algorithms, or parallel algorithms that work at smaller scales, and I think that's the wrong direction. Extreme data sizes call for algorithms that work within stricter memory, work and communication constraints than before. On the other hand, the abundance of data, at least in some cases, offers the option of using less powerful or efficient statistics. It's a trade-off whose exploration has just started.


DS-What do you do to maintain work-life balance and manage your time?
AP- I think becoming a freelancer affords me a flexibility that employed work generally lacks. I can put in more or fewer hours depending on competing priorities and can move them around other needs, like being with family in the morning or going for a bike ride while it’s sunny.  I am not sure I manage my time incredibly well, but I try to keep track of where I spend it at least by broad categories, whether I am billing it to a client or not. “If you can not measure it, you can not improve it”, a quote usually attributed to Lord Kelvin.

DS- What do you think is the future of R as enterprise and research software in terms of computing on mobile, desktop and cloud, and how do you see things evolving from here?

AP- One of the most interesting things happening right now is the development of different R interpreters. A successful language needs at least two viable implementations, in my opinion. None of the alternatives is ready for prime time at the moment, but work is ongoing. Some implementations are experimental but demonstrate technological advances that can then be incorporated into the other interpreters. The main challenge is transitioning the language and the community to the world of parallel and distributed programming, which is a hardware-imposed priority. RHadoop is meant to help with that, for the largest data sets. Collaboration and publishing on the web are being addressed by many valuable tools, and it looks to me like the solutions already exist and it's more a problem of adoption. For the enterprise, there are companies offering training, consulting, licensing, centralized deployments, database APIs, you name it. It would be interesting to see touch interfaces applied to interactive data visualization, but while there is progress on the latter, touch on desktop is limited to a single platform and R doesn't run on mobile, so I don't see it as an imminent development.


Antonio Piccolboni is an experienced data scientist (FlowingdataRadar on this emerging role) with industrial and academic backgrounds, currently working as an independent consultant on big data analytics. His clients include Revolution Analytics. His other recent work is on social network analysis (hi5) and web analytics (Quantcast). You can contact him via http://piccolboni.info/about.html or his LinkedIn profile.

Interview Vivian Zhang co-founder SupStat

Here is an interview with Vivian Zhang, CTO and co-founder of SupStat, an interesting startup in the R ecosystem. In this interview Vivian talks about the startup journey, helping spread R in China and New York City, and managing Meetups, conferences and a training business with balance and excellence.


DecisionStats (DS)- Describe the story behind the creation of SupStat Inc and the journey so far, along with milestones and turning points. What is your vision for SupStat, and what do you want it to accomplish, and how?

Vivian Zhang(VZ) -


SupStat was born in 2012 out of the collaboration of 60+ individuals (statisticians, computer engineers, mathematicians, professors, graduate students and talented data geniuses) who met through Capital of Statistics, a well-known non-profit organization in China. The SupStat team met through various collaborations on R packages and analytic work. In 2013, SupStat became involved in the New York City data science community by hosting the NYC Open Data Meetup, and soon began offering formal courses through the NYC Data Science Academy. SupStat offers consulting services in R development, data visualization, and big data solutions. We are experienced with many technologies and languages, including R, Python, Hadoop, Spark, Node.js, etc. Courses offered include Data Science with R (Beginner, Intermediate), Data Science with Python (Beginner, Intermediate), and Hadoop (Beginner, Intermediate), as well as many targeted courses on R packages and data visualization tools.

Allen and I, the two co-founders, have been passionate about data mining since a young age (we talked about it back in 1997). With industry experience as Chief Data Scientist/Senior Analyst and a spirit of entrepreneurship, we started the firm by gathering all the top R/Hadoop/D3.js programmers we knew.

Milestones of SupStat:

June 2012: Established in Beijing

July 2012: Offered an intensive R bootcamp in Beijing to 50+ college students

June 2013: Established in NYC

Nov 2013: Launched our NYC training brand, NYC Data Science Academy

Jan 2014: Became a premium partner of Revolution Analytics in China

Feb 2014: Became a training and reseller partner of RStudio in the US and China

April 2014: Became the exclusive reseller partner of Transwarp in the US; started to offer R built-in and professional services for Hadoop/Spark

May 2014: Organized and sponsored the R conference in Beijing; NYC Open Data Meetup reached 1800+ members in one year

Jun 2014: Sponsored the UCLA R conference (Vivian was a panelist in the female R programmers talk)

The major turning point was in November 2013, when we decided to open our NYC office and launched the NYC Data Science Academy.

Our Mission:

We are committed to helping our clients make distinctive, lasting and substantial improvement in their performance, sales, clients and employee satisfaction by fully utilizing data. We are a value-driven firm. For us this means:

  • Solving the hardest problems

  • Utilizing state-of-the-art data science to help our clients succeed

  • Applying a structured problem-solving approach where all options are considered, researched, and analyzed carefully before recommendations are made

Our Vision: Be a firm that attracts exceptional people to meet the growing demand for data analysis and visualization.

Future goals:

With engaged clients, we want to share the excitement, unlimited potential and methodologies of using data to create business value. We want to be the go-to firm when people think of getting data analytic training, consulting, and big data products.

With top data scientists, we want to be the home for those who want different data challenges all the time. We promote their open data/demo work in the community and  expand the impact of the analytic tools and methodologies they developed. We connect the best ones to build the strongest team.

With new college students and young professionals, we want to help them succeed and be able to handle real-world problems right away through our short-term, intensive training programs and internship programs. Through our rich experience, we have tailored our training program to solve some of the critical problems people face in their workplace.

Through our partnerships we want to spread the best technologies between the US and China. We want to close the gap and bring solutions and offerings to the clients we serve. We are on the front line, picking the best products for our clients.

We are glad we have the opportunity to do what we love and are good at, and will continue to enjoy doing it with a growing and energetic team.

DS -What is the state of open source statistical software in China? How have you contributed to R in China and how do you see the future of open source evolve there?

VZ- People love R and embrace R. In May 2014, we helped to organize and sponsor the R conference in Beijing, with 1400 attendees. See our blog post for more details: http://www.r-bloggers.com/the-7th-china-r-conference-in-beijing/

We have helped organize two R conferences in China in the past year, Spring in Beijing and Winter in Shanghai. And we will do a Summer R conference in Guangzhou this year. That’s three R conferences in one year!

DS- Describe some of your work with your partners in helping to sell and support R in China and the USA.

VZ- Revolution Analytics and RStudio are very respected in the R community. We are glad to work and learn from them through collaboration.

With Revolution, we provide proof-of-concept and professional services, including analytics and visualization. We also sell Revolution products and licenses in China. With RStudio, we sell RStudio Server Pro and Shiny and promote training programs around those products in NYC. We plan to sell these products in China starting this summer. With Transwarp, we offer the best R analytic and parallel computing experience through the Hadoop/Spark ecosystem.

DS- You have done many free workshops in multiple cities. What has been the response so far?

VZ- Let us first talk about what happened in NYC.

I went to a few meetups before I started my own meetup group. Most of the presentations and talks were awesome, but they were not structured and delivered in a way that attendees could learn and apply the technology right away. Most of the time, those events didn't offer source code or technical details in the slides.

When I started my own group, my goal was “whatever cool stuff we show you, we will teach you how to do it.” The majority of the events were designed as hands-on workshops, while we hosted a few high-profile speakers' talks from time to time (including the chief data scientist for the Obama campaign).

My workshops cover a wide range of topics, including R, Python, Hadoop, D3.js, data processing, Tableau, location data queries, and open data. People are super excited and keep saying “oh wow”, “never thought that I could do it”, and “it is pretty cool.” Soon our attendees started giving back to the group by teaching their own skills and fun projects, offering free conference rooms, and sponsoring pizzas.

We are glad we have built a community that shares experience and passion for data science. And I think this is a very unique thing we can do in NYC (since almost everything is within about half an hour by subway). We host events 2-3 times per week and have attracted 1900 members in one year.

In other cities such as Shanghai and Beijing, we run free workshops for college students and scholars every month. We commit to visiting any college within 24 hours by train from Beijing. Through partnerships with Capital of Statistics and DataUnion, we have hosted entrepreneur sharing events featuring full talks and lightning talks.

In NYC, we typically see 15 to 150 people per event. U.S. sponsors have included McKinsey & Company, Thoughtworks, and others. Our monthly tech event in Beijing sees over 500 attendees and draws event co-hosts including Baiyu, Youku and others.

DS- What are some interesting projects of Supstat that you would like to showcase.

VZ- Let me start with one interesting open data project on Citibike data done by our team. The blog post, slides and meetup videos can be found at http://nycdatascience.com/meetup/nyc-open-data-project-ii-my-citibike/

Citibike provides a public bike-share service with many stations across NYC. Riders want to take a bike from a station with at least one available bike, and when they reach their destination, they want to return it to a station with at least one open dock. Our goal was to predict where to rent and where to return Citibikes. We showed the complete workflow: data scraping, cleaning, manipulation, processing, modeling, and turning the algorithms into a product.

We built a program to scrape the data and save it to our database automatically. Using this data, we applied models from time series theory and machine learning to predict bike availability at every station. Based on these models, we built a website for the Citibike system that helps riders plan their trips better. We also showed a few tricks, such as how to set up cron jobs on Linux, Windows and Mac machines, and how to work around RAM limitations on servers with PostgreSQL.
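The scrape-and-store step described above can be sketched briefly. This is not SupStat's actual code: the JSON field names (`stationBeanList`, `availableBikes`, `availableDocks`, `executionTime`) and the table schema are assumptions modeled on the public Citibike station feed of that era, and SQLite stands in for the PostgreSQL database the team actually used:

```python
import sqlite3

def station_rows(feed):
    """Flatten one feed snapshot into (station_id, bikes, docks, timestamp) rows.

    The field names here are hypothetical, modeled on the public
    Citibike station-status JSON feed.
    """
    ts = feed["executionTime"]
    return [
        (s["id"], s["availableBikes"], s["availableDocks"], ts)
        for s in feed["stationBeanList"]
    ]

def store_snapshot(conn, feed):
    """Append one snapshot to a local table (SQLite as a stand-in for PostgreSQL)."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS station_status (
               station_id INTEGER, bikes INTEGER, docks INTEGER, ts TEXT)"""
    )
    conn.executemany(
        "INSERT INTO station_status VALUES (?, ?, ?, ?)", station_rows(feed)
    )
    conn.commit()

# A crontab entry like the following (Linux/Mac) would run the scraper
# every 10 minutes; on Windows, Task Scheduler plays the same role:
#   */10 * * * * /usr/bin/python /path/to/scrape.py
```

Snapshots accumulated this way form the per-station time series that the forecasting models would then consume.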

We’ve done other projects in China using R to solve problems ranging from telecommunications data caching to cardiovascular disease prevention. Each of these projects has required a unique combination of statistical knowledge and data science tools, with R as the backbone of the solution. The commercial cases can be found at our website: http://www.supstat.com/consulting/


SupStat is a statistical consulting company specializing in statistical computing and graphics using state-of-the-art technologies.

VIVIAN S. ZHANG Co-founder & CTO, NYC, Beijing and Shanghai Office

Vivian is a data scientist who has been devoted to the analytics industry and the development and use of data technologies for several years. She developed expertise in data analysis and data management with various statistical tools and programming languages as a Senior Analyst and Biostatistician at Memorial Sloan-Kettering Cancer Center and as a Scientific Programmer at Brown University. She is a co-founder of SupStat, the NYC Data Science Academy, and the NYC Open Data meetup. She likes to portray herself as a programmer, data-holic, and visualization evangelist.

You can read more about SupStat at http://www.supstat.com/team/

Latest Interview – RapidMiner CEO Ingo Mierswa

Here is an interview I did with Ingo Mierswa, the CEO of RapidMiner. Ingo, something of a prodigy with multilingual abilities and a stellar academic and business record, talks about navigating the journey of an open source startup.


Popularized by Michael (Monty) Widenius, one of the founders of MySQL and an investor in RapidMiner, business source is a commercial software license model that offers many of the benefits of open source, but with a built-in time delay before users can access new versions of our products.




10 for 10 – Packt lowers cost of books for students and researchers alike

The high cost of textbooks and science books is an open scandal. Despite this, publishers are barely profitable, and the ecosystem is ripe for disruption.

Packt is one such player. I have reviewed many books for them (in return I get ebooks and print books, some of which I give to my students).

Now they have an intriguing offer.

As you are aware, this month Packt is celebrating 10 years of success, with over 2,000 titles in its library. To celebrate this huge milestone, we have come up with an exciting opportunity for collaboration which you might be interested in.

Packt is offering all of its eBooks and videos at just $10 each. This campaign is specifically aimed at thanking all our customers for their support and opening up our comprehensive range of titles for just $10 each. This promotion covers every title, and customers can stock up on as many copies as they like until July 5th. I hope you find this a great opportunity to explore what’s new and maintain your personal and professional development.

Interested? See http://www.packtpub.com/10years

Disclosure: The author was offered 2 free ebooks as part of this social media campaign. Books are one thing he is willing to blog for ;)

