Here is an interview with Eduardo Ariño de la Rubia, VP of Product & Data Scientist in Residence at Domino Data Lab. Eduardo weighs in on issues concerning data science and shares his experiences.
Ajay (A) – How does Domino Data Lab give a data scientist an advantage?
Eduardo (E) – Domino Data Lab’s enterprise data science platform makes data scientists more productive and helps teams collaborate better. For individual data scientists, Domino is a feature-rich platform that helps them manage the analytics environment, provides scalable compute resources to run multiple complex tasks in parallel, and makes it easy to share and productize analytic models. For teams, the Domino platform supports substantially better collaboration by making all the work people are doing viewable and reproducible. Domino provides a central analytics hub where all work is saved and hosted. The result is faster progress for individuals and better results from teams.
A- What languages and platforms do you currently support?
E- Domino is an open platform that runs on Mac, Windows, or Linux. We’ll run any code that can be run on a Linux system. We have first-class support for R, Python, MATLAB, SAS, and Julia.
E- Domino was designed from the ground up to be an enterprise collaboration and data science platform. It’s a full-featured platform in use at some of the largest research organizations in the world today.
A- What is your experience of Python versus other languages in the field of data science?
E- That’s the opening salvo of a religious war, and though I should know better than to involve myself, I will try to navigate it. First and foremost, I think it’s important to note that the two “most common” open source languages used by data scientists today, Python and R, have fundamentally hit feature parity in their maturity. While it’s true that for some particular algorithm, or for some rarely trodden use case, one language and environment may have an edge over the other, I believe that for the average data scientist, language comes down to choice.
That being said, my personal experience is slightly more nuanced. My background is primarily computer science, and having spent many years thinking about programming first and data analysis second has shaped the way I approach a problem. I find that if I am doing the “exploratory analysis” or “feature engineering” phase of a data science project in a language that has roots in “typical programming”, it will often make me approach the problem less like a data scientist and more like a programmer. When I should be thinking in terms of set or vectorized operations, or about whether I’m violating some constraint, I’m instead building a data structure to make an operation O(n log n) so that I can use a for loop when I shouldn’t.
This isn’t an indictment of any language, nor is it a statement that there’s a fundamental benefit to thinking one way or another about a problem. It is, however, a testament to the fact that when challenged, people will often fall back on their most familiar skill set and begin to treat every problem as a nail to be hammered. If I had come to Python *as* a data scientist first, it is possible this nuance would never have surfaced; however, I learned Python before pandas, scikit-learn, and the DS revolution, so those neurons are quite trained up. I learned R, on the other hand, purely as an endeavor in data science, and as such I don’t find myself falling back on “programmer’s habits” when I hit a wall in R. I take a step back and usually find a way to work around it within the idiomatic approaches.
To summarize, my experience is that language wars accomplish very little, and that most of the modern data science languages are up to the task. Just beware of the mental baggage that you bring with you on the journey.
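The contrast Eduardo draws between “programmer’s habits” and set-oriented thinking can be illustrated with a small sketch. It is not taken from the interview: the function names and the sample customer IDs are hypothetical, and it uses only the Python standard library. Both functions find the IDs present in two lists. The first builds a sorted index so that a for loop can binary-search it; the second simply states the answer as a set intersection.

```python
from bisect import bisect_left


def shared_ids_loop(left, right):
    """Loop-first habit: build a sorted index, then search it item by item."""
    index = sorted(set(right))  # O(n log n) helper structure built for the loop
    found = set()
    for item in left:
        pos = bisect_left(index, item)  # binary search for this one item
        if pos < len(index) and index[pos] == item:
            found.add(item)
    return sorted(found)


def shared_ids_set(left, right):
    """Set-first habit: describe the result as an intersection of two sets."""
    return sorted(set(left) & set(right))


# Hypothetical sample data: customer IDs seen in two different months.
january = [101, 102, 103, 104, 102]
february = [103, 104, 105, 103]

print(shared_ids_loop(january, february))
print(shared_ids_set(january, february))
```

Both versions return the same IDs; the second is shorter and reads as a statement about the data rather than a procedure for walking through it.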
A- What do you feel about polyglots (multiple languages) in data science (like R, Python, Julia) and software like Beaker and Jupyter that enable multiple languages?
E- Data science is a polyglot endeavor. At the very least, you usually have some data manipulation language (such as SQL) and some language for your analysis (R or Python). Often you have many more languages. For the data engineering pipeline I often reach for Perl (it’s still an amazing language for the transformation of text data), and sometimes I have a bit of code that must run very quickly, so I reach for C or C++. I think that multiple languages are a reality. Domino supports, out of the box, fundamentally every language that will run on Linux. If your feature pipeline involves some sed/awk, we understand. If you need a bit of Rcpp, we’re right there with you. If you want to output some amazing d3.js visualizations to summarize the data, we’re happy to provide the framework for you to host it on. Real world data is messy, and being a polyglot is a natural adaptation to that reality.
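As a minimal sketch of the two-language baseline mentioned above (SQL for data manipulation, a second language for analysis), the example below uses Python’s built-in `sqlite3` and `statistics` modules. The table name, column names, and timing values are invented for illustration and do not come from the interview or from Domino.

```python
import sqlite3
import statistics


def median_by_model(conn):
    """Filter rows in SQL, then summarize them in Python."""
    # SQL handles the data manipulation: drop rows with missing timings.
    rows = conn.execute(
        "SELECT model, seconds FROM runs WHERE seconds IS NOT NULL"
    ).fetchall()

    # Python handles the analysis: group by model and take the median.
    grouped = {}
    for model, seconds in rows:
        grouped.setdefault(model, []).append(seconds)
    return {model: statistics.median(values) for model, values in grouped.items()}


# Hypothetical in-memory table of model training times.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (model TEXT, seconds REAL)")
conn.executemany(
    "INSERT INTO runs VALUES (?, ?)",
    [("glm", 1.0), ("glm", 3.0), ("glm", None),
     ("tree", 2.0), ("tree", 6.0), ("tree", 4.0)],
)

print(median_by_model(conn))
```

Each language does the part it is suited to: the filter is declared in SQL, and the summary statistic is computed in the analysis language.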
Domino makes data scientists more productive and facilitates collaborative, reproducible, reusable analysis. The platform runs on premises or in the cloud. Its customers come from a wide range of industries, including government, insurance, advanced manufacturing, and pharmaceuticals. It is backed by Zetta Venture Partners, Bloomberg Beta, and In-Q-Tel.
You can have a look at their very interesting data science platform at Domino Data Lab.