DecisionStats Interview Radim Řehůřek Gensim #python

As part of my research for Python for R Users: A Data Science Approach (Wiley, 2016), here is an interview with Radim Řehůřek, CEO of RaRe Consulting and creator of gensim, a Python package for topic modelling.
Decision Stats (DS)- Describe your work on the Python package gensim. How did you write it, and what were the key turning points in the journey? What are some of the key design decisions you made in creating gensim? How is gensim useful to businesses for text mining or natural language processing (any links to examples of usage)?
Radim Řehůřek (RaRe)-Gensim was born out of frustration with existing software. We were developing a search engine for an academic library back in 2009, and wanted to include this “hot new functionality of semantic search”. All implementations I could find were either arcane FORTRAN (yes, all caps!) or insanely fragile academic code. Good luck debugging and customizing that…
I ended up redesigning these algorithms to be streamed (online), so that we could run them on large out-of-core datasets. This became gensim, as well as the core of my PhD thesis.
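The streaming idea can be illustrated in plain Python. This is a simplified sketch of the pattern, not gensim's actual code: a corpus is any object that yields one tokenized document at a time, so memory use stays constant no matter how large the dataset grows.

```python
from collections import Counter

class StreamedCorpus:
    """Yield one tokenized document at a time; the full dataset never
    needs to fit in memory (the pattern gensim corpora follow)."""
    def __init__(self, documents):
        # In a real setting this would read lazily from disk or a database.
        self.documents = documents

    def __iter__(self):
        for doc in self.documents:
            yield doc.lower().split()

# Build word counts with a single streaming pass over the corpus.
corpus = StreamedCorpus([
    "Human machine interface for computer applications",
    "A survey of user opinion of computer system response",
])
counts = Counter(token for doc in corpus for token in doc)
```

Because the corpus is re-iterable rather than an in-memory list, the same object can feed several algorithm passes over an out-of-core dataset.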
Looking back, focusing on data streaming and picking Python were incredibly lucky choices. Both concepts have gained a lot of momentum in the modern data science world, and gensim along with them. It’s just a happy marriage of Python’s “ease of use” and commercial “need to process large datasets quickly”.
Gensim has been applied across industries — apart from the obvious ones (media, marketing, e-commerce), there have been some imaginative uses of topic modeling in biogenetics or literary sciences. Gensim’s also being taught in several universities across the world as a machine learning tool. A few “on-record” testimonials are at its project page.
DS- Have you used other languages like R or Java besides Python? What has been your experience in using them versus Python for machine learning, text mining and data mining, especially in production systems?
RaRe- I only started with Python around 2007 or 2008. Before it was all Java (does anyone remember Weka?), C, C++, bash, assembly, C#… later we used Javascript, Go and god knows how many others. But just as anyone who’s been around for a while realizes, development is more about proper design and architecture, rather than a particular language choice.
Python has a lot going for it in this nascent, prototype-driven field of machine learning. People claim it’s slow, but you can whip it to run faster than optimized C, if you know what you’re doing 🙂
In my opinion, Python’s main disadvantage coincides (as is often the case) with its main advantage — the dynamic duck typing. Its suitability for production is questionable, except maybe for fast-pivoting startups. Without herculean efforts in unit testing and ad-hoc tools for static analysis, it’s easy to get lost in large codebases. By the time the solution is clearly scoped, well defined and unlikely to change (ha!) I’d consider the JVM world for production.
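The double-edged nature of duck typing can be seen in a toy example (a hypothetical function, not from the interview): the same function happily accepts anything with a `.read()` method, but a wrong argument fails only at runtime, which is exactly what makes large Python codebases hard to keep safe without heavy testing.

```python
from io import StringIO

def word_count(source):
    # Duck typing: works for any object with a .read() method --
    # a file, a socket wrapper, a StringIO buffer, ...
    return len(source.read().split())

print(word_count(StringIO("duck typing cuts both ways")))

# The flip side: a type error surfaces only when this line executes,
# not at compile time as it would on the JVM.
try:
    word_count(42)
except AttributeError as e:
    print("runtime failure:", e)
```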
Examples in my PyData Italy keynote “Does Python stand a chance in today’s world of data science” covered this topic in a bit more depth.
DS- You have worked as an academic, as a freelance consultant and now at a startup, across multiple locations. What are some of the key challenges you faced in this journey?
RaRe- I’d say the transition from an academic mindset to a commercial one was a major challenge. It’s underestimated by many fresh graduates. Tinkering with details, hacking, exciting irrelevant detours are all fine, but the consulting business is much more about a pragmatic listen-to-what-the-client-actually-needs and then get-it-done. Preferably in a straightforward, efficient manner.
There’s other stuff that comes with running a business: understanding intellectual property, legalese, cross-country and cross-continent accounting, managing employees, managing clients, marketing… It’s exciting for sure and a lot of hard, novel work, but you kind of expect that, no surprise there.
By the way I’m in the process of writing a series of articles about “the life of a data science consultant” (to appear on our site soon), following the wave of interest after my BerlinBuzzwords talk on the topic.
DS- What are your favourite algorithms, in terms of how you use them?
RaRe- Funnily enough, I’m a fan of simple, well-understood algorithms.
Linear classifiers are one example; linear scan in place of search is another. Compared to the academic cutting edge these are ridiculous fossils. But what you’ll often find out in real-world projects is that by the time the business problem is sufficiently well defined, implementation scoped, integrations with other systems understood and the whole pipeline working, the few percent gained by a more complex algorithm are the least of your concern.
You’ll mostly hear about startups that live on the cutting edge of AI, where deep learning makes or breaks their business model. But there are gazillions of businesses that don’t need that. Having a clearly understood, interpretable, efficient and integrated predictive model that works is a massive win, and already enough work as is. Most effort goes into business analysis in order to solve the right problem using a manageable process, not pushing the theoretical envelope of life, universe and everything.
There was a great talk on “Linear Models for Data Science” by Brad Klingenberg (of StitchFix) recently, which made a good case for simpler models.
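To make the "simple models" point concrete, here is a minimal linear classifier, a perceptron, in plain Python. This is a toy sketch with made-up data, not anything from the interview or the talk:

```python
def train_perceptron(samples, labels, epochs=20, lr=0.1):
    """Train a linear classifier sign(w.x + b) with error-driven updates."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
            if pred != y:  # misclassified: nudge the boundary toward x
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Toy linearly separable data.
X = [(2, 1), (3, 2), (-1, -2), (-2, -1)]
y = [1, 1, -1, -1]
w, b = train_perceptron(X, y)
```

The entire model is a weight vector and a bias: easy to inspect, explain and integrate, which is exactly the trade-off described above.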
DS- What are your views on Python leveraging multiple cores? What do you think about cloud computing? Why is parallel processing of algorithms not more common in other packages?
RaRe- Higher connectivity and larger computing clusters are the future, no doubt about it.
We’re slowly coming out of an age where every single distributed system that actually worked was something of an art piece. Always NIH-heavy, finely tuned to its particular big-data use case by necessity, while touting completely generic universality for PR reasons.
But I think we’re not far off an age where it will be truly easier to use one of these frameworks than roll your own. The current generation of general-purpose distributed systems (such as Spark) is already getting some parts right. They’re still too raw and hard to manage (debug, integrate) to be practically useful for the mainstream, but we’re getting there, it’s a wave.
What does this mean for Python? Who knows, but its pragmatic no-nonsense culture has good potential for producing a useful solution too, though the current distributed ecosystems favour the JVM world heavily. In the short term there is some effort in cross-language interoperability; in the long term, evolution tends to cull dead branches and favour the uncompromising.
DS- What is the best thing you like about coding in Python? and the worst?
 
RaRe- I can only speak for the PyData subset of the (many) Python communities:
Pro: pragmatic mindset codified in the Zen of Python; experienced full-stack developers; duck typing; fast iteration and prototyping cycles, Python makes you think before you write (by virtue of its no-debugger REPL culture) 🙂
Con: duck typing; lack of enterprise maturity: deployment, packaging, maintenance, marketing. Continuum.io are doing great work in this area to keep Python alive.
About

Radim Řehůřek, Ph.D. is a senior software developer and entrepreneur with a passion for machine learning, natural language processing and text analysis. He is the creator of gensim, a Python library widely used for “Topic Modelling for Humans”.

Article from PA Times: How hiring is the same as approving a mortgage

In this article, http://www.predictiveanalyticsworld.com/patimes/hiring-approving-mortgages-its-the-same-thing/6715/, veteran industry expert Greta Roberts, CEO of Talent Analytics, makes the case for proactive, preventive analytics rather than reactive post-mortem analytics:

To stay in business and be profitable, lenders need to predict which borrower candidates are a good risk before extending an offer. Once the offer has been extended, all the company can do is restructure their mortgage, coach, cajole, support, train and hopefully manage the borrower to keep them from completely defaulting.

and

Two Problems with Most “Predictive” HR Systems and Approaches

  • Most approaches “predict” flight risk or performance for current employees only – when it’s too late.

  • Most approaches don’t have highly predictive job candidate data – and never consider augmenting their candidate datasets so they can predict pre-hire.

DecisionStats Interview Scott Draves Beaker Notebook

As part of my research for Python for R Users: A Data Science Approach (Wiley, 2016), here is an interview with Scott Draves, an awesome software artist and developer at Beaker Notebook. Beaker Notebook allows you to use multiple languages together in the same interface seamlessly (like Python, R, JavaScript and Scala).

Ajay Ohri (AO)- What inspired you to make Beaker Notebook? What are some of the design decisions you took? How does it compare to Jupyter Notebook, and what do you see as the product roadmap ahead for Beaker (with current limitations, if any)?

Scott Draves (S) – Two Sigma uses a variety of tools. Some have been developed internally over many years, some are open source like Linux, Java, R, and IPython, and some are commercial such as MATLAB and Excel.  Beaker is inspired by all these systems, and many more. It’s a new synthesis on new infrastructure.  The design favors ease of use and high quality. Beaker is about working automatically with one click, and also having total programmability.

Jupyter (which was called IPython when we started) is definitely one of our inspirations.  In fact Beaker is compatible with it, and when you run Python in Beaker, it’s talking to your existing IPython backend.  Beaker uses nginx as a reverse proxy to make a collection of backends (one for each language, plus Beaker’s core server) appear as a single application.

Our roadmap is published on the wiki: https://github.com/twosigma/beaker-notebook/wiki/Roadmap


AO- To pass objects from Python to R I need rpy2. How does Beaker simplify this process? For example, if I want to use auto.arima from the forecast package on a pandas time series, how would I do it?

S- Beaker‘s autotranslation is simpler because it focuses on the data.  That means your R and Python code co-exist in independent cells, each in its native syntax, but they can communicate via the Beaker object that is reflected to exist in all languages.  By contrast with rpy2, you access R through a Python syntax.  For example instead of

robjects.StrVector(['abc', 'def'])

in Beaker you can just say

c('abc', 'def')

As for auto.arima, let me first note that by coincidence, the #1 Google hit for [auto.arima] is a web site that uses a Flame as its banner, i.e. it was made with an algorithm I open-sourced in the early 90s (see below).

But anyway, I took the example from the bottom of that page and made it work with a random Pandas data frame.  Here’s the Beaker notebook: https://pub.beakernotebook.com/#/publications/56648fcc-2e8e-41a6-aa4a-1249ee39023c
One improvement in the works is replacing beaker::get('df') with beaker$df.

AO- Which industries or businesses would most benefit from the ability to use Python, R, JS, Scala etc. in the same notebook?

 

S- Any industry that works rapidly with data in a quantitative and scientific style benefits.  Autotranslation increases your options and makes experimentation and mash-ups easy.  That would include traditional sciences such as genetics and physics, and also business applications in finance, data mining, and machine learning.  But users come from all over. Beaker is at its heart a general purpose tool for exploring with code and data, so we believe the benefits could be widespread.
AO- How would Beaker Notebook be useful for big data analytics and data science?
S- Beaker is great for exploring data sets with code, visualization, and tables.  And you can turn your research into applications, without recoding, because Beaker notebooks are repeatable, reproducible, and remixable.
AO- Describe your journey in science, including earlier famous projects like Electric Sheep et al. What were the key points, and what keeps you learning new technologies or pivoting to new projects?

S- The long version: my journey started at Brown University, developing fnord, a generative GUI and language for mathematical research and teaching, especially calculus and differential geometry. Think curves and surfaces in 3D with sliders.  I was also working at IRIS; then I worked for Andries Van Dam in his graphics group and for Thomas Banchoff in the Math dept.  Back in 1988-1990 there was a phenomenal network of people to collaborate with and learn from.  That’s when I became interested in Open Source, initially through the Emacs text editor and LISP UI, which was the first project that I ever contributed back to.

I did my PhD research at CMU SCS, CS Dept.  Early on I had some fortuitous internships, one at SGI working on IRIS Explorer and the other in Tokyo at NTT-Data.  It was there, on an unused supercomputer, I generated the first Flames, what became the first Open Source artwork.  Later back in Pittsburgh I developed Bomb, an “interactive visual musical instrument” that got me into making projection installations and eventually VJing.  I was very lucky to have Peter Lee as my teacher and adviser, he helped me find my voice and also gave me plenty of rope.

My research at CMU culminated in a thesis on meta-programming for media processing, i.e. using compilers and types to build low-latency, high-bandwidth systems that are still flexible and allow dynamic experimentation.  The thesis document was generated by a markup language implemented in Scheme that compiled and ran my research code, measured its performance results, generated the graphs, and could produce LaTeX for typesetting, or HTML for the web.  That was published in 1997, all open source.

I graduated and went to San Francisco, and worked at Transmeta along with Linus Torvalds on a virtual microprocessor and another startup doing internet streaming media infrastructure (now it was 1999).  It was this startup/tech environment of the Bay Area, including Burning Man and the VJ scene that gave birth to the Electric Sheep.  It’s been evolving ever since.

AO- How can we further increase the supply of data scientists? How would people in education find Beaker Notebook useful?
S- Improving UIs like Beaker‘s makes it easier for people to get started with data science.  And because Beaker has one UI for multiple languages, students can spend more time on the scientific and statistical concepts, and less time learning a new GUI for each new language.  Being a web application, Beaker can also be delivered as a service (Domino Data Lab does this already), which helps deal with config/install/os problems on uncontrolled student laptops.  So I hope data in education and also data in civic discourse will benefit and expand.
About
Scott Draves is the inventor of Fractal Flames and the leader of the distributed computing project Electric Sheep.  He is currently employed by Two Sigma to develop the Beaker Notebook.

Beaker is a notebook-style development environment for working interactively with large and complex datasets. Its plugin-based architecture allows you to switch between languages.

Related-

 

How to use R and Python together

If you can have 31 flavours of ice cream, why can’t you have at least two flavours of open source data science? R for the data visualization and statistical libraries, Python for machine learning and the production environment. As part of my research for my upcoming book “Python for R Users – A Data Science Approach”, here are some ways to use both Python and R.

  1. rpy2 – a communication channel from Python to R. rpy2 is an interface to R running embedded in a Python process. The project is mature, stable, and widely used. A lucid example of using it is given at A Slug’s Guide to Python: https://sites.google.com/site/aslugsguidetopython/data-analysis/pandas/calling-r-from-python
  2. conda/Jupyter – You can use the R kernel from within Jupyter/IPython. See https://www.continuum.io/conda-for-r and https://www.continuum.io/blog/developer/jupyter-and-conda-r . It uses the R kernel for Jupyter at http://irkernel.github.io/ . Here is a tutorial I wrote in Jupyter, though in Python alone: http://nbviewer.ipython.org/gist/decisionstats/c1684daaeecf62dd4bf4
  3. Beaker Notebook – You can get Beaker from http://beakernotebook.com/ . This is a relatively new kind of software that allows you to mix Python and R within the same notebook (unlike Jupyter, which gives you either a Python or an R kernel). Here is a notebook I created: https://pub.beakernotebook.com/#/publications/5657e715-bdaf-4787-99fc-a0d7f37c3e38 . Beaker even allows JS, Scala and other languages within the same notebook, so it is a hugely promising idea. I also note that they are silver sponsors at http://user2016.org/ through their parent company https://www.twosigma.com/


Using multiple languages in data science is clearly an idea whose time has come. Tools like Jupyter, rpy2 and Beaker can speed up this exciting trend. The customer should dictate the need for data science, the need should dictate the software, and the software should dictate which data scientist to choose or skill up. Right now, we choose data scientists and software first and then try to fit them to the project use case.

Have an amazing 2016 for data science from the DecisionStats team and I hope you liked us in 2015!

 

 

Python for R Users A Data Science Approach

Coming up in the new year is my new book on enabling polyglotism in data science. It is called Python for R Users: A Data Science Approach, published by Wiley (due in 2016).

It will expose the target reader (a data science professional) to the small subset of the Python language that is most pertinent to data science.


What the Internet does for people like me in developing countries

  1. It gives us access to the best of knowledge, teaching, experts for free
  2. It gives us unfettered entertainment: free music on YouTube and TV shows like Game of Thrones, instead of waiting years for our government to approve them
  3. It allows us to criticize our leaders on blogs, Facebook and Twitter without getting censored by corrupt politicians and a corrupt media-government nexus
  4. It allows us to keep in touch via Skype and Facebook with people far away without straining our purse
  5. It allows us to learn a lot without paying a lot

That is just me: an urban citizen in a relatively decent economy. The benefits to underprivileged humans are even greater.

Superpowers

India used to be a Superpower but we declined. China was a superpower then it declined. So did Britain. So did Soviet Russia. The United States remains the aging Rocky Balboa of the superpowers, but you can see some decline in influence compared to when Clinton was President.

What do superpowers do?

  • They invest a lot of money in arms and defence
  • They earn a lot of money from trade so they can invest it in arms
  • They put their own interests ahead of the interests of their neighbours and competitors
  • They pretend to go to war if you hurt a single citizen, but they themselves do not do much when thousands of their citizens are maltreated by pollution, by exploitative working conditions, by small arms and guns, by crime, by inequality

Ultimately I think Switzerland is the only superpower. Their superpower lies in not pretending to be super at all.

During trade and now climate negotiations, the past, the present and the future superpowers collide. The needs of the many are more important than the egos of a few politicians, the brilliance of their advisers and the theatrics of a few.

Does the planet need a CEO? Probably yes, and the United Nations has failed to be a superpower, or any power at all. It is just a conference-holding organization.

The greatest generation that won World War 2 in the West and defeated colonialism in the East was succeeded by the Baby Boomer generation, which just boomed and consumed. The next generation will pay the price of the past few generations. The country that takes the best care of the next generation, building a healthy productive workforce for both economic and defence deployment, will win the race to be the Superbpower. That’s not a typo. Stop being a superpower and start being a superb power.

In the meantime, I would rather see Matt Damon colonize Mars and Rocky Balboa teach boxing to the next generation.