Article from PA Times: How hiring is the same as approving a mortgage

In this article, http://www.predictiveanalyticsworld.com/patimes/hiring-approving-mortgages-its-the-same-thing/6715/, veteran industry expert Greta Roberts, CEO of Talent Analytics, makes the case for proactive, preventive analytics rather than reactive, post-mortem analytics:

To stay in business and be profitable, lenders need to predict which borrower candidates are a good risk before extending an offer. Once the offer has been extended, all the company can do is restructure their mortgage, coach, cajole, support, train and hopefully manage the borrower to keep them from completely defaulting.

and

Two Problems with Most “Predictive” HR Systems and Approaches

  • Most approaches “predict” flight risk or performance for current employees only – when it’s too late.

  • Most approaches don’t have highly predictive job candidate data – and never consider augmenting their candidate datasets so they can predict pre-hire.

DecisionStats Interview: Scott Draves, Beaker Notebook

As part of my research for Python for R Users: A Data Science Approach (Wiley, 2016), here is an interview with Scott Draves, software artist and developer of the Beaker Notebook. Beaker Notebook allows you to use multiple languages (such as Python, R, JavaScript, and Scala) together seamlessly in the same interface.

Ajay Ohri (AO) – What inspired you to make Beaker Notebook? What are some of the design decisions you took? How does it compare to Jupyter Notebook, and what is the product roadmap ahead for Beaker (including current limitations, if any)?

Scott Draves (S) – Two Sigma uses a variety of tools. Some have been developed internally over many years, some are open source like Linux, Java, R, and IPython, and some are commercial such as MATLAB and Excel.  Beaker is inspired by all these systems, and many more. It’s a new synthesis on new infrastructure.  The design favors ease of use and high quality. Beaker is about working automatically with one click, and also having total programmability.

Jupyter (which was called IPython when we started) is definitely one of our inspirations.  In fact, Beaker is compatible with it: when you run Python in Beaker, it's talking to your existing IPython backend.  Beaker uses nginx as a reverse proxy to make a collection of backends (one for each language, plus Beaker's core server) appear as a single application.

Our roadmap is published on the wiki: https://github.com/twosigma/beaker-notebook/wiki/Roadmap


AO- To pass objects from Python to R, I need rpy2. How does Beaker simplify this process? For example, if I want to use auto.arima from the forecast package on a pandas time series, how would I do it?

S- Beaker's autotranslation is simpler because it focuses on the data.  That means your R and Python code co-exist in independent cells, each in its native syntax, but they can communicate via the Beaker object that is reflected into all languages.  By contrast, with rpy2 you access R through Python syntax.  For example, instead of

robjects.StrVector(['abc', 'def'])

in Beaker you can just say

c('abc', 'def')

As for auto.arima, let me first note that by coincidence, the #1 Google hit for [auto.arima] is a web site that uses a Flame as its banner, i.e. it was made with an algorithm I open-sourced in the early 90s (see below).

But anyway, I took the example from the bottom of that page and made it work with a random Pandas data frame.  Here’s the Beaker notebook: https://pub.beakernotebook.com/#/publications/56648fcc-2e8e-41a6-aa4a-1249ee39023c
One improvement in the works is replacing beaker::get('df') with beaker$df.
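The Python half of that example might look like the sketch below (my reconstruction, assuming pandas and numpy are installed; the actual notebook is at the link above, and the Beaker hand-off is only paraphrased in the comments):

```python
import numpy as np
import pandas as pd

# A random monthly time series -- a stand-in for the "random Pandas
# data frame" mentioned above (my reconstruction, not the actual
# notebook). In Beaker you would publish df on the shared beaker
# object; an R cell could then fetch it (the R side appears as
# beaker::get('df') per the interview) and fit a model with
# forecast::auto.arima().
rng = np.random.default_rng(0)
idx = pd.date_range("2015-01-01", periods=36, freq="MS")  # month starts
df = pd.DataFrame({"value": rng.normal(size=36).cumsum()}, index=idx)

print(df.shape)  # -> (36, 1)
```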

AO- Which industries or businesses would most benefit from the ability to use Python, R, JS, Scala, etc. in the same notebook?

 

S- Any industry that works rapidly with data in a quantitative and scientific style benefits.  Autotranslation increases your options and makes experimentation and mash-ups easy.  That would include traditional sciences such as genetics and physics, and also business applications in finance, data mining, and machine learning.  But users come from all over. Beaker is at its heart a general purpose tool for exploring with code and data, so we believe the benefits could be widespread.
AO- How would Beaker Notebook be useful for big data analytics and data science?
S- Beaker is great for exploring data sets with code, visualization, and tables.  And you can turn your research into applications, without recoding, because Beaker notebooks are repeatable, reproducible, and remixable.
AO- Describe your journey in science, including earlier famous projects like Electric Sheep. What were the key points, and what keeps you learning new technologies or pivoting to new projects?

S- The long version: my journey started at Brown University, developing fnord, a generative GUI and language for mathematical research and teaching, especially calculus and differential geometry. Think curves and surfaces in 3D with sliders.  I was also working at IRIS; then I worked for Andries van Dam in his graphics group and for Thomas Banchoff in the Math department.  Back in 1988-1990 there was a phenomenal network of people to collaborate with and learn from.  That's when I became interested in Open Source, initially through the Emacs text editor and its LISP UI, which was the first project I ever contributed back to.

I did my PhD research in the CS Department at CMU SCS.  Early on I had some fortuitous internships, one at SGI working on IRIS Explorer and the other in Tokyo at NTT-Data.  It was there, on an unused supercomputer, that I generated the first Flames, which became the first Open Source artwork.  Later, back in Pittsburgh, I developed Bomb, an “interactive visual musical instrument” that got me into making projection installations and eventually VJing.  I was very lucky to have Peter Lee as my teacher and adviser; he helped me find my voice and also gave me plenty of rope.

My research at CMU culminated in a thesis on meta-programming for media processing, i.e. using compilers and types to build low-latency, high-bandwidth systems that are still flexible and allow dynamic experimentation.  The thesis document was generated by a markup language implemented in Scheme that compiled and ran my research code, measured its performance results, generated the graphs, and could emit LaTeX for typesetting or HTML for the web.  That was published in 1997, all open source.

I graduated and went to San Francisco, where I worked at Transmeta (along with Linus Torvalds) on a virtual microprocessor, and then at another startup doing internet streaming media infrastructure (by now it was 1999).  It was this startup/tech environment of the Bay Area, including Burning Man and the VJ scene, that gave birth to the Electric Sheep.  It's been evolving ever since.

AO- How can we further increase the supply of data scientists? How would people in education find Beaker Notebook useful?
S- Improving UIs like Beaker's makes it easier for people to get started with data science.  And because Beaker has one UI for multiple languages, students can spend more time on the scientific and statistical concepts, and less time learning a new GUI for each new language.  Being a web application, Beaker can also be delivered as a service (Domino Data Lab does this already), which helps deal with config/install/OS problems on uncontrolled student laptops.  So I hope data in education, and also data in civic discourse, will benefit and expand.
About
Scott Draves is the inventor of Fractal Flames[1] and the leader of the distributed computing project Electric Sheep.[2][3]  He is currently employed by Two Sigma to develop the Beaker Notebook.

Beaker is a notebook-style development environment for working interactively with large and complex datasets. Its plugin-based architecture allows you to switch between languages.

Related-

 

How to use R and Python together

If you can have 31 flavours of ice cream, why can't you have at least two flavours of open source data science? R for the data visualization and statistical libraries, Python for machine learning and the production environment. As part of my research for my upcoming book, Python for R Users: A Data Science Approach, here are some ways to use both Python and R.

  1. rpy2 – a communication channel from Python to R. rpy2 is an interface to R running embedded in a Python process. The project is mature, stable, and widely used. A lucid example of using it is given at A Slug’s Guide to Python: https://sites.google.com/site/aslugsguidetopython/data-analysis/pandas/calling-r-from-python
  2. conda/Jupyter – You can use the R kernel from within Jupyter/IPython. See https://www.continuum.io/conda-for-r and https://www.continuum.io/blog/developer/jupyter-and-conda-r. It uses the R kernel for Jupyter from http://irkernel.github.io/. Here is a tutorial I wrote in Jupyter, though in Python alone: http://nbviewer.ipython.org/gist/decisionstats/c1684daaeecf62dd4bf4
  3. Beaker Notebook – See http://beakernotebook.com/. This is a relatively new kind of software that allows you to mix Python and R within the same notebook (unlike Jupyter, which gives you either a Python or an R kernel). Here is a notebook I created: https://pub.beakernotebook.com/#/publications/5657e715-bdaf-4787-99fc-a0d7f37c3e38. Beaker even allows JS, Scala, and other languages within the same notebook, which makes it a hugely promising idea. I also note that they are silver sponsors of http://user2016.org/ through their parent company, https://www.twosigma.com/
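As a baseline for comparison (my illustration, not taken from any of the three projects above): without such tools, you shuttle data between the two languages yourself through a neutral format such as JSON or CSV, which R can read back with a package like jsonlite. The autotranslation in rpy2 and Beaker automates exactly this marshalling:

```python
import json

# Python side: a small "time series" as plain lists, dumped to a file
# that an R session could read back with jsonlite::fromJSON("series.json").
series = {"date": ["2015-01-01", "2015-02-01", "2015-03-01"],
          "value": [10.5, 11.2, 9.8]}

with open("series.json", "w") as f:
    json.dump(series, f)

# Round-trip check on the Python side.
with open("series.json") as f:
    restored = json.load(f)

print(restored["value"])  # -> [10.5, 11.2, 9.8]
```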


Using multiple languages in data science is clearly an idea whose time has come. Tools like Jupyter, rpy2, and Beaker can accelerate this exciting trend.  The customer should dictate the need for data science; the need should dictate the software; and the software should dictate which data scientist to choose or skill up. Right now, we choose data scientists and software first, and then try to fit them to the project use case.

Have an amazing 2016 of data science from the DecisionStats team, and I hope you liked us in 2015!

 

 

Python for R Users: A Data Science Approach

Coming up in the new year is my new book on enabling polyglotism in data science. It is called Python for R Users: A Data Science Approach, from Wiley (due in 2016).

It will expose the target reader (a data science professional) to the small subset of the Python language that is most pertinent to data science.


What the Internet does for people like me in developing countries

  1. It gives us access to the best knowledge, teaching, and experts for free
  2. It gives us unfettered entertainment: free music on YouTube and TV shows like Game of Thrones, instead of waiting years for our government to approve them
  3. It allows us to criticize our leaders on blogs, Facebook, and Twitter without being censored by corrupt politicians and a corrupt media-government nexus
  4. It allows us to keep in touch, via Skype and Facebook, with people far away without straining our purse
  5. It allows us to learn a lot without paying a lot

That is just me: an urban citizen in a relatively decent economy. The benefits to underprivileged humans are even greater.

Superpowers

India used to be a Superpower but we declined. China was a superpower then it declined. So did Britain. So did Soviet Russia. The United States remains the aging Rocky Balboa of the superpowers, but you can see some decline in influence compared to when Clinton was President.

What do superpowers do?

  • They invest a lot of money in arms and defence
  • They earn a lot of money from trade so they can invest it in arms
  • They put their own interests ahead of the interests of their neighbours and competitors
  • They pretend to go to war if you hurt a single citizen, but they themselves do not do much when thousands of their citizens are maltreated by pollution, by exploitative working conditions, by small arms and guns, by crime, and by inequality

Ultimately I think Switzerland is the only superpower. Their superpower lies in not pretending to be super at all.

During trade negotiations, and now climate negotiations, the past, present, and future superpowers collide. The needs of the many are more important than the egos of a few politicians, the brilliance of their advisers, and the theatrics of a few.

Does the planet need a CEO? Probably yes, and the United Nations has failed to be a superpower, or any power at all. It is just a conference-holding organization.

The greatest generation, which won World War 2 in the West and defeated colonialism in the East, was succeeded by the Baby Boomer generation, which just boomed and consumed. The next generation will pay the price of the past few generations. The country that takes the best care of its next generation, ensuring a healthy, productive workforce for both economic and defence deployment, will win the race to be the Superbpower. That's not a typo. Stop being a superpower and start being a superb power.

In the meantime, I would rather see Matt Damon colonize Mars and Rocky Balboa teach boxing to the next generation.

Interview: Maciej Fijalkowski, PyPy

As part of my research for “Python for R Users: A Data Science Approach” (Wiley 2016), I came across PyPy (http://pypy.org/). What is PyPy?

PyPy is a fast, compliant alternative implementation of the Python language (2.7.10 and 3.2.5). It has several advantages and distinct features:

  • Speed: thanks to its Just-in-Time compiler, Python programs often run faster on PyPy.

  • Memory usage: memory-hungry Python programs (several hundreds of MBs or more) might end up taking less space than they do in CPython.

  • Compatibility: PyPy is highly compatible with existing python code. It supports cffi and can run popular python libraries like twisted and django.

  • Stackless: PyPy comes by default with support for stackless mode, providing micro-threads for massive concurrency.
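As an illustration of the speed point (my sketch, not from pypy.org): the kind of tight pure-Python numeric loop below is interpreted instruction by instruction on CPython, while PyPy's JIT compiles it to machine code once it becomes hot. The same file runs unchanged under both interpreters:

```python
import timeit

def dot(xs, ys):
    # Naive dot product -- a tight numeric loop of exactly the kind
    # PyPy's just-in-time compiler accelerates.
    total = 0.0
    for x, y in zip(xs, ys):
        total += x * y
    return total

xs = [float(i) for i in range(100_000)]
ys = [2.0] * 100_000

# Time 10 runs; compare `python dot.py` with `pypy dot.py`.
elapsed = timeit.timeit(lambda: dot(xs, ys), number=10)
print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # -> 32.0
```

No C extension, Cython, or vectorization is involved; that is the niche a fast pure-Python interpreter fills.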

R users might remember the similar debates around alternative R implementations like Renjin and pqR a few years ago. PyPy is an effort that has been around for some time, and it is currently at an interesting phase.

Here is an interview with Maciej Fijalkowski of PyPy


Ajay Ohri- Why did you create PyPy? What need does it serve?

PyPy- I joined PyPy in 2006 or 2007, I don’t even remember, but it was about two years into the project’s existence. Shockingly enough, the very first idea was that there would be a python-in-python for educational purposes only. It later occurred to us that we could use the fact that PyPy is written in a high-level language and apply various transformations to it, including just-in-time compilation. Overall it was a very roundabout way, but we came to the conclusion that this is the right way to provide a high-performance Python virtual machine, after Armin’s experience writing Psyco, which likely only a few people remember.

Ajay Ohri- Describe the current state of PyPy, especially regarding NumPy. Can we use it with pandas, matplotlib, seaborn, scikit-learn, and statsmodels in the near future? What hinders your progress?

PyPy- We are right now in a state of flux. I’m almost inclined to say “talk to us in a few weeks/months”. I will describe the status right now, as well as possible near futures. Right now, we have a custom version of numpy that supports most of the existing numpy and can be used, although it does not pass all the tests. It has very fast array item access routines, so you can write your algorithms directly in Python without looking for custom solutions. It does not, however, provide a C API, and so does not support anything else from the numeric stack.

We’re also considering supporting the original numpy through the CPython C API, which would enable the whole numeric stack, with some caveats. Discussions are ongoing, and I can get back to you once this is resolved.

Our main problem is the CPython C API and the dependency of the entire numeric stack on it. It exposes a lot of CPython internals, like reference counting, the exact layout of lists and strings, etc. We have a layer that provides some compatibility with it, but we need more work to make it more robust and faster. In the case of the C API, the main hindrance is funding; I wrote a blog post detailing the current situation: http://lostinjit.blogspot.co.za/2015/11/python-c-api-pypy-and-road-into-future.html. We would love to support the entire numeric stack, and we will look into ways to make that possible.
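To make the reference-counting point concrete (my illustration, not Maciej's): CPython exposes its reference counts even at the Python level, and C extensions manipulate them directly with Py_INCREF/Py_DECREF. PyPy, which uses a tracing garbage collector instead, has to emulate this behaviour in its C-API compatibility layer:

```python
import sys

x = []
# sys.getrefcount reports CPython's internal reference count for an
# object (the call itself adds one temporary reference). This is the
# kind of implementation detail the CPython C API leaks to extensions.
before = sys.getrefcount(x)
y = x  # bind a second reference to the same list
after = sys.getrefcount(x)
print(after - before)  # on CPython: 1
```

On PyPy the function exists, but its return values are reportedly not meaningful in the same way, which is precisely why extensions that depend on real reference counts are hard to support.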

Ajay Ohri- A faster, more memory-efficient Python: will it be useful for the analysis of large amounts of numeric data?

PyPy- Python owes much of its success to good integration with the C ecosystem. For years we’ve been told that no one needs a fast Python, because whatever needs to be fast is already in C, and we can go away. That has proven to be blatantly false, with projects like Apache Spark embedding Python as a way to do computations. There are also a lot of Python programmers, and it’s a bit unfair to expect them to “write all the performance-critical parts in C” or in any of the other custom languages built around Python, like Cython. I personally think there is a big place for a faster Python, and we’re mostly fulfilling that role, except exactly in the case of integration with numeric libraries, which is absolutely crucial for a lot of people. We need to improve that story if we are to fill in that gap completely, and while predicting the future is hard, we will do our best to support the numeric stack a lot better in the coming months.

Ajay Ohri- What are the day-to-day challenges you face while working on PyPy?

PyPy- That’s a tough question. There is no such thing in IT as “day-to-day challenges with technology”, because if something is really such a hindrance, you can usually automate it away. However, I don’t do only technical work these days; I deal a lot with people asking questions, look at issues, and try to organize money for PyPy. This means it’s very hard to pinpoint what a day-to-day activity is, let alone what its problems are.

The most frequent challenges we face are making sure there is funding for chronically underfunded open source projects, and explaining our unusual architecture to newcomers. The technical issues we try hard to automate away, so if something is a repeating problem, we build more and more infrastructure to deal with it in a systematic manner.

Ajay Ohri- You and your highly skilled team could probably make much more money per hour working on consulting projects for companies. Why devote time to open source tools? How can we get more people to donate money or devote time?

PyPy- It is a very interesting question, probably exceeding the scope of this interview, but I will try to give it a go anyway. I think by now it’s pretty obvious that Open Source is just a better way to make software, at least as far as infrastructure goes. I can’t think of a single proprietary language platform that’s not tied to a specific architecture. Even Microsoft and .NET are moving slowly towards Open Source, and Apple owns so much of its platform that no one else has a say there.

That means that locally, yes, we could very likely make far more money working for some corporation, but globally it’s pretty clear that both our impact and the value we bring are much higher than they would be if we worked for a corporation looking for short-term gains.

Additionally, the problems we are presented with are much more interesting than the ones we would likely encounter in a corporate environment. Funding Open Source is a very tricky question, and I think we need to find answers to it.

Everyone uses Open Source software, directly or indirectly, and the companies that profit from using it make enough money to fund it. How to funnel that money is a problem we are trying to solve on a small scale, but it would be wonderful to see a solution on a bigger scale.

Ajay Ohri- How can we ensure automatic porting of algorithms between languages like Java, Python, and R, rather than manually creating packages? I mean, if we can have Google Translate for human languages, what can we do to enable automatic translation of code between computer languages?

PyPy- It would be very useful, but no one has managed to do it well; maybe that means something. However, it’s quite easy to translate between languages naively, without taking into account best practices, more efficient ways of achieving goals, etc. There is a whole discussion to be had here, but I don’t think I have much insight into this.

About-

PyPy is a replacement for CPython. It is built using the RPython language, which was co-developed with it. The main reason to use it instead of CPython is speed: it generally runs faster.

See more here http://pypy.org/features.html