As part of my research for Python for R Users: A Data Science Approach (Wiley 2016), here is an interview with Radim Řehůřek, CEO of RaRe Consulting and creator of gensim, a Python package for topic modeling and natural language processing.
Decision Stats (DS)- Describe your work on the Python package gensim. How did you write it, and what were the key turning points in the journey? What were some of the key design decisions you made in creating gensim? How is gensim useful to businesses for text mining or natural language processing (any links to examples of usage)?
Radim Řehůřek (RaRe)-Gensim was born out of frustration with existing software. We were developing a search engine for an academic library back in 2009, and wanted to include this “hot new functionality of semantic search”. All implementations I could find were either arcane FORTRAN (yes, all caps!) or insanely fragile academic code. Good luck debugging and customizing that…
I ended up redesigning these algorithms to be streamed (online), so that we could run them on large out-of-core datasets. This became gensim, as well as the core of my PhD thesis.
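The streaming idea can be sketched in plain Python: a corpus is simply an iterable that yields one document at a time, so memory use stays constant no matter how large the dataset is. This is an illustrative sketch of the pattern, not gensim's actual API; the tokenizer and sample documents here are made up.

```python
# Sketch of the out-of-core streaming pattern: documents are produced
# one at a time by a generator, so nothing forces the whole corpus
# into memory at once. (Illustrative only; gensim's real corpus
# classes are richer than this.)

def stream_documents(lines):
    """Yield one tokenized document at a time instead of loading all."""
    for line in lines:
        yield line.lower().split()

def build_vocabulary(docs):
    """Single streamed pass over the corpus: map tokens to integer ids."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

# In practice `corpus` would be a file or database cursor of millions
# of documents; a small in-memory list stands in for it here.
corpus = ["Human machine interface", "Graph of trees", "Human trees survey"]
vocab = build_vocabulary(stream_documents(corpus))
print(len(vocab))  # → 7 unique tokens
```

The same generator-based design lets downstream algorithms (LSA, LDA and so on) make repeated streamed passes over data that never fits in RAM.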
Looking back, focusing on data streaming and picking Python were incredibly lucky choices. Both concepts have gained a lot of momentum in the modern data science world, and gensim along with them. It’s just a happy marriage of Python’s “ease of use” and commercial “need to process large datasets quickly”.
Gensim has been applied across industries — apart from the obvious ones (media, marketing, e-commerce), there have been some imaginative uses of topic modeling in biogenetics or literary sciences. Gensim’s also being taught in several universities across the world as a machine learning tool. A few “on-record” testimonials are at its project page.
DS- Have you used other languages, like R or Java, besides Python? What has been your experience using them versus Python for machine learning, text mining and data mining, especially in production systems?
RaRe- Python has a lot going for it in this nascent, prototype-driven field of machine learning. People claim it’s slow, but you can whip it to run faster than optimized C, if you know what you’re doing 🙂
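One common way of "whipping" Python into speed is pushing the inner loop out of the interpreter and into NumPy's compiled core; a minimal sketch (the specific functions and sizes are chosen for illustration, and actual speedups depend on the workload):

```python
import numpy as np

# Pure-Python inner loop: the interpreter dispatches one operation
# per element, which is where the "Python is slow" reputation comes from.
def dot_loop(xs, ys):
    total = 0.0
    for x, y in zip(xs, ys):
        total += x * y
    return total

x = np.arange(1_000_000, dtype=np.float64)
y = np.ones(1_000_000)

slow = dot_loop(x, y)      # interpreter loop over a million elements
fast = float(x @ y)        # same arithmetic inside NumPy's C internals

print(slow == fast)  # → True (identical result, orders of magnitude faster)
```

Vectorizing like this, or dropping into Cython/C extensions for the genuinely hot paths, is how gensim keeps pure-Python ergonomics with near-native throughput.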
In my opinion, Python’s main disadvantage coincides (as is often the case) with its main advantage — the dynamic duck typing. Its suitability for production is questionable, except maybe for fast-pivoting startups. Without herculean efforts in unit testing and ad-hoc tools for static analysis, it’s easy to get lost in large codebases. By the time the solution is clearly scoped, well defined and unlikely to change (ha!) I’d consider the JVM world for production.
My PyData Italy keynote, “Does Python stand a chance in today’s world of data science?”, covered this topic in a bit more depth.
DS- You have worked as an academic, as a freelance consultant, and now run a startup, across multiple locations. What are some of the key challenges you have faced in this journey?
RaRe- I’d say the transition from an academic mindset to a commercial one was a major challenge. It’s underestimated by many fresh graduates. Tinkering with details, hacking, exciting irrelevant detours are all fine, but the consulting business is much more about a pragmatic listen-to-what-the-client-actually-needs and then get-it-done. Preferably in a straightforward, efficient manner.
There’s other stuff that comes with running a business: understanding intellectual property, legalese, cross-country and cross-continent accounting, managing employees, managing clients, marketing… It’s exciting for sure and a lot of hard, novel work, but you kind of expect that, no surprise there.
By the way I’m in the process of writing a series of articles about “the life of a data science consultant” (to appear on our site soon), following the wave of interest after my BerlinBuzzwords talk on the topic.
DS- What are your favourite algorithms, in terms of how you use them?
RaRe- Funnily enough, I’m a fan of simple, well-understood algorithms.
Linear classifiers are one example; linear scan in place of search is another. Compared to the academic cutting edge these are ridiculous fossils. But what you’ll often find out in real-world projects is that by the time the business problem is sufficiently well defined, implementation scoped, integrations with other systems understood and the whole pipeline working, the few percent gained by a more complex algorithm are the least of your concern.
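"Linear scan in place of search" just means checking every candidate directly instead of building and maintaining an index structure; a minimal sketch, with made-up sample points, of why the "fossil" is attractive:

```python
# Brute-force nearest-neighbour by linear scan: no index, no tuning,
# trivially correct and easy to debug. For small-to-medium datasets
# this is often fast enough that a fancier search structure never
# pays for its added complexity.

def nearest(query, points):
    """Return the point closest to `query` by squared Euclidean distance."""
    def sqdist(p):
        return sum((a - b) ** 2 for a, b in zip(query, p))
    return min(points, key=sqdist)

points = [(0.0, 0.0), (1.0, 1.0), (5.0, 2.0)]
print(nearest((0.9, 1.2), points))  # → (1.0, 1.0)
```

The whole "model" fits in a dozen lines, which is exactly the interpretability and maintainability argument: there is nothing here that can silently go stale or subtly misbehave in production.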
You’ll mostly hear about startups that live on the cutting edge of AI, where deep learning makes or breaks their business model. But there are gazillions of businesses that don’t need that. Having a clearly understood, interpretable, efficient and integrated predictive model that works is a massive win, and already enough work as is. Most effort goes into business analysis in order to solve the right problem using a manageable process, not pushing the theoretical envelope of life, universe and everything.
There was a great talk on “Linear Models for Data Science” by Brad Klingenberg (of StitchFix) recently, which made a good case for simpler models.
DS- What are your views on Python leveraging multiple cores? What do you think about cloud computing? Why is parallel processing of algorithms not more common in other packages as well?
RaRe- Higher connectivity and larger computing clusters are the future, no doubt about it.
We’re slowly coming out of an age where every single distributed system that actually worked was something of an art piece. Always NIH-heavy, finely tuned to its particular big-data use case by necessity, while touting completely generic universality for PR reasons.
But I think we’re not far off an age where it will be truly easier to use one of these frameworks than roll your own. The current generation of general-purpose distributed systems (such as Spark) is already getting some parts right. They’re still too raw and hard to manage (debug, integrate) to be practically useful for the mainstream, but we’re getting there, it’s a wave.
What does this mean for Python? Who knows, but its pragmatic no-nonsense culture has good potential for producing a useful solution too, though the current distributed ecosystems favour the JVM world heavily. In the short term there’s some effort in cross-language interoperability; in the long term, evolution tends to cull dead branches and favour the uncompromising.
DS- What is the best thing you like about coding in Python? And the worst?
RaRe- I can only speak for the PyData subset of the (many) Python communities:
Pro: pragmatic mindset codified in the Zen of Python; experienced full-stack developers; duck typing; fast iteration and prototyping cycles, Python makes you think before you write (by virtue of its no-debugger REPL culture) 🙂
Con: duck typing; lack of enterprise maturity: deployment, packaging, maintenance, marketing. Continuum.io are doing great work in this area to keep Python alive.