December 2015 – Page 2 – DECISION STATS

Interview Skipper Seabold Statsmodels #python #rstats

As part of my research for Python for R Users: A Data Science Approach (Wiley 2016), here is an interview with Skipper Seabold, creator of the Python package statsmodels. Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. Since I have been actively playing with this package, I have added some screenshots to show that it is a viable way to build regression models.


Ajay (A)- What prompted you to create the statsmodels package?

Skipper (S) I was casting about for an open source project that I could take on to help further my programming skills during my graduate studies. I asked one of my professors who is involved in the Python community for advice. He urged that I look into the Google Summer of Code program under the SciPy project. One of the potential projects was resurrecting some code that used to be in scipy as scipy.stats.models. Getting involved in this project was a great way to strengthen my understanding of econometrics and statistics during my graduate studies. I raised the issue on the scipy mailing list, found a mentor in my co-lead developer Josef Perktold, and we started working in earnest on the project in 2009.

A- What has been the feedback from users so far?

S- Feedback has generally been pretty good. I think people now see that Python is not only a viable but also a compelling alternative to R for doing statistics and econometric research as well as applied work.

A- What is your roadmap for statsmodels going forward?

S- Our roadmap going forward is not much more than continuing to merge good code contributions, working through our current backlog of pull requests, and continuing to work on consistency of naming and API in the package for a better overall user experience. Each developer mainly works on their own research interests for new functionality, such as state-space modeling, survival modeling, statistical testing, high dimensional models, and models for big data.

There has been some expressed interest in developing a kind of plugin system such that community contributions are easier, a more regular release cycle, and merging some long-standing, large pull requests such as exponential smoothing and panel data models.

A- How do you think statsmodels compares with R packages like car and others from https://cran.r-project.org/web/views/Econometrics.html? What are the advantages, if any, of using Python rather than R for building the model?

S- You could use statsmodels for pretty much any level of applied or pure econometrics research at the moment. We have implementations of discrete choice models, generalized linear models, time-series and state-space models, generalized method of moments, generalized estimating equations, nonparametric models, and support for instrumental variables regression just to pick a few areas of overlap. We provide most of the core components that you are going to find in R. Some of these components may still be more on the experimental side or may be less polished than their R counterparts. Newer functionality could use more user feedback and API design though given that some of these R packages have seen more use, but the implementations are mostly there.

One of the main advantages I see to doing statistical modeling in Python over R is in terms of the community and the experience gained. There’s a huge diversity of backgrounds in the Python community from web developers to computer science researchers to engineers and statisticians. Those doing statistics in Python are able to benefit from this larger Python community. I often see more of a focus on unit testing, API design, and writing maintainable, readable code in Python rather than R. I would also venture to say that the Python community is a little friendlier to those new to programming in terms of the people and the language. While the former isn’t strictly true now that we have Stack Overflow, the R mailing lists have the reputation of being very unforgiving places. As far as the latter, things like the prevalent generic-function object-oriented style and features like non-standard evaluation are really nice for an experienced R user, but they can be a little opaque and daunting for beginners in my opinion.

That said, I don’t really see R and Python as competitors. I’m an R user and think that the R language provides a wonderful environment for doing interactive statistical computing. There are also some awesome tools like RStudio and Shiny. When it comes down to it both R and Python are most often wrappers around C, C++, and Fortran code and the interactive computing language that you use is largely a matter of personal preference.

Example 1 – Statsmodels in action on diamonds dataset

A- How well is statsmodels integrated with pandas, scikit-learn and other Python packages?

S- Like any scientific computing package in Python, statsmodels relies heavily on numpy and scipy to implement most of the core statistical computations.

Statsmodels integrates well with pandas. I was both an early user and contributor to the pandas project. We have had for years a system for statsmodels such that if a user supplies data structures from pandas to statsmodels, then all relevant information will be preserved and users will get back pandas data structures as results.

Statsmodels also leverages the patsy project to provide a formula framework inspired by that of S and R.
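To make the pandas and formula integration concrete, here is a minimal sketch. The data is synthetic (a made-up stand-in for a diamonds-style table, not the real dataset), and the column names and coefficients are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for a diamonds-style dataset (all values are made up).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "carat": rng.uniform(0.2, 2.0, n),
    "cut": rng.choice(["Fair", "Good", "Ideal"], n),
})
df["price"] = 500 + 4000 * df["carat"] + rng.normal(0, 300, n)

# R-style formula; C() marks a categorical term.
results = smf.ols("price ~ carat + C(cut)", data=df).fit()

# Because the input was a DataFrame, the estimates come back as pandas
# objects labelled with the model term names.
print(type(results.params))
print(results.params)
print(results.summary())
```

The fitted values and residuals (results.fittedvalues, results.resid) are likewise returned as pandas Series that keep the index of the input DataFrame.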

Statsmodels is also used by other projects such as seaborn to provide the number-crunching for the statistical visualizations provided.

As far as scikit-learn, though I am a heavy user of the package, so far statsmodels has not integrated well with it out of the box. We do not implement the scikit-learn API, though I have some proof of concept code that turns the statistical estimators in statsmodels into scikit-learn estimators.

We are certainly open to hearing about use cases that tighter integration would enable, but the packages often have different focuses. Scikit-learn focuses more on things like feature selection and prediction. Statsmodels is more focused on model inference and statistical tests. We are interested in continuing to explore possible integrations with the scikit-learn developers.

A- How effective is statsmodels for creating propensity models, or say logit models for the financial industry or others? Which industry do you see using Pythonic statistical modeling the most?

S- I have used statsmodels to do propensity score matching and we have some utility code for this, but it hasn’t been a major focus for the project. Much of the driving force for statsmodels has been the research needs of the developers given our time constraints. This is an area we’d be happy to have contributions in.

All of the core, traditional classification algorithms are implemented in statsmodels with proper post-estimation results that you would expect from a statistical package.
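As a rough illustration of what post-estimation results look like for a logit model, here is a small sketch on synthetic data. The variable names (tenure, balance, responded) and the data-generating process are hypothetical, and the predicted probabilities at the end are only the first step of a propensity-score workflow, not a full matching procedure.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data: a binary outcome driven by two hypothetical predictors.
rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "tenure": rng.uniform(0, 10, n),
    "balance": rng.normal(0, 1, n),
})
linear = -1.0 + 0.3 * df["tenure"] - 0.8 * df["balance"]
prob = 1 / (1 + np.exp(-linear))
df["responded"] = (rng.uniform(size=n) < prob).astype(int)

# Fit the logit model; disp=0 silences the optimizer output.
logit_res = smf.logit("responded ~ tenure + balance", data=df).fit(disp=0)

print(logit_res.summary())                # coefficients, std errors, z-tests
print(logit_res.conf_int())               # confidence intervals
print(logit_res.get_margeff().summary())  # average marginal effects
print(logit_res.pred_table())             # 2x2 table at a 0.5 threshold

# Predicted probabilities, usable as propensity scores.
scores = logit_res.predict(df)
print(scores.head())
```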

Example 2 – Statsmodels in action on Boston dataset outliers

As far as particular industries, it’s not often clear where the project is being used outside of academics. Most of our core contributors are from academia, as far as I know. I think there is certainly some use of the time-series modeling capabilities in finance, and I know people are using logistic regression for classification and inference. I work as a data scientist, and I see many data scientists using the package in a variety of projects from marketing to churn modeling and forecasting. We’re always interested to hear from people in industry about how they’re using statsmodels or looking for contributions that could make the project work better for their use cases.

About-

Skipper Seabold is a data scientist at Civis Analytics.

Before joining Civis, Skipper was a software engineer and data scientist at DataPad, Inc. He is in the final stages of a PhD in economics from American University in Washington, DC. He is the creator of the statsmodels package in Python.

Interview Chris Kiehl Gooey #Python making GUIs in Python

Here is an interview with Chris Kiehl, developer of the Python package Gooey. Gooey promises to turn (almost) any Python console program into a GUI application with one line.

Ajay (A) What was your motivation for making Gooey?

Chris (C)- Gooey came about after getting frustrated with the impedance mismatch between how I like to write and interact with software as a developer, and how the rest of the world interacts with software as consumers. As much as I love my glorious command line, delivering an application that first requires me to explain what a CLI even is feels a little embarrassing. Gooey was my solution to this. It let me build as complex a program as I wanted, all while using a familiar tool chain, and with none of the complexity that comes with traditional desktop application development. When it was time to ship, I’d attach the Gooey decorator and get the UI side for free.
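A minimal sketch of what attaching the decorator looks like, assuming the gooey package (and its wxPython dependency) is installed. The script itself, its arguments, and the fallback decorator are hypothetical and only there so the example still runs as a plain command-line program when Gooey is missing.

```python
import argparse

try:
    from gooey import Gooey  # third-party; requires wxPython
except ImportError:
    # Hypothetical fallback: a no-op decorator so the script still
    # works as an ordinary CLI when Gooey is not installed.
    def Gooey(func):
        return func


def build_parser():
    """Build an ordinary argparse parser; nothing here is GUI-specific."""
    parser = argparse.ArgumentParser(description="Summarize a CSV file")
    parser.add_argument("csv_path", help="Path to the input CSV file")
    parser.add_argument("--rows", type=int, default=5,
                        help="Number of rows to preview")
    return parser


@Gooey  # the one added line: the argparse options become form fields
def main():
    args = build_parser().parse_args()
    print(f"Would preview {args.rows} rows of {args.csv_path}")
```

Calling main() under the usual if __name__ == "__main__": guard launches the generated window when Gooey is available, and the normal command-line interface otherwise.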

A- Where can Gooey potentially be used in industry?

C- Gooey can be used anywhere where you bump into a mismatch in computer literacy. One of its core strengths is opening up existing CLI tool chains to users that would otherwise be put off by the unfamiliar nature of the command line. With Gooey, you can expose something as complex as video processing with FFMPEG via a very friendly UI with almost negligible development effort.

A- What other packages have you authored or contributed in Python or other languages?

C- My GitHub is a smorgasbord of half-completed projects. I have several tool-chain projects related to Gooey. These range from packagers, to web front ends, to example configs. However, outside of Gooey, I created pyRobot, which is a pure Python Windows automation library. Dropler, a simple HTML5 drag-and-drop plugin for CKEditor. DoNotStarveBackup, a Scala program that backs up your Don’t Starve save file while playing (a program which I love, but others actively hate for being “cheating” (pfft..)). And, one of my favorites: Burrito-Bot. It’s a little program that played (and won!) the game Burrito Bison. This was one of the first big things I wrote when I started programming. I keep it around for time capsule, look-at-how-I-didn’t-know-what-a-for-loop-was sentimental reasons.

A- What attracted you to developing in Python? What are some of the advantages and disadvantages of the language?

C– I initially fell in love with Python for the same reasons everyone else does: it’s beautiful. It’s a language that’s simple enough to learn quickly, but has enough depth to be interesting after years of daily use.

Hands down, one of my favorite things about Python that gives it an edge over other languages is its amazing introspection. At its core, everything is a dictionary. If you poke around hard enough, you can access just about anything. This lets you do extremely interesting things with metaprogramming. In fact, this deep introspection of code is what allows Gooey to bootstrap itself when attached to your source file.
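A small sketch of the kind of introspection described here, using only the standard library. The names greet, Job, and describe_parser are hypothetical, and the last part reads argparse's private _actions attribute purely to show the general idea; it is not a description of how Gooey is actually implemented.

```python
import argparse
import inspect


def greet(name, punctuation="!"):
    """Return a greeting."""
    return f"Hello, {name}{punctuation}"


class Job:
    def __init__(self, title):
        self.title = title


job = Job("analyst")

# Instance attributes live in an ordinary dictionary.
print(vars(job))                 # {'title': 'analyst'}

# Functions are objects too: signature and docstring are readable at runtime.
print(inspect.signature(greet))  # (name, punctuation='!')
print(greet.__doc__)             # Return a greeting.


def describe_parser(parser):
    """List (dest, help) for each argument other than the built-in --help.

    Relies on the private _actions attribute of ArgumentParser.
    """
    return [(a.dest, a.help) for a in parser._actions if a.dest != "help"]


parser = argparse.ArgumentParser()
parser.add_argument("--rows", type=int, help="Rows to preview")
print(describe_parser(parser))   # [('rows', 'Rows to preview')]
```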

Python’s disadvantages vary depending on the space in which you operate. Its concurrency limitations can be extremely frustrating. Granted, you don’t run into them too often, but when you do, it is usually for show-stopping reasons. The related side of that is its asynchronous capabilities. This has gotten better with Python 3, but it’s still pretty clunky if you compare it to the tooling available to a language like Scala.

A- How can we incentivize open source package creators the same way we do for app stores etc.?

C- On an individual level, if I may be super positive, I’d argue that open source development is already so awesome that it almost doesn’t need to be further incentivized. People using, forking, and committing to your project is the reward. That’s not to say it is without some pains, as not everyone on the internet is friendly all the time, but the pleasure of collaborating with people all over the globe on a shared interest is tough to overstate.

Adding a Simple GUI to Your Pandas Script

Install wxPython in Ubuntu

wxPython is a GUI toolkit for the Python programming language. It allows Python programmers to create programs with a robust, highly functional graphical user interface, simply and easily. It is implemented as a Python extension module (native code) that wraps the popular wxWidgets cross platform GUI library, which is written in C++.

At a terminal, enter “lsb_release -a” to print what version of Ubuntu you have.

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 12.04.5 LTS
Release: 12.04
Codename: precise


$ lsb_release -sc

precise

Add the repository key:

sudo apt-key adv --fetch-keys http://repos.codelite.org/CodeLite.asc

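Then add the source. Open the sources list in an editor:

sudo gedit /etc/apt/sources.list

Append this line (replace precise with your own codename if it differs):

deb http://repos.codelite.org/wx3.0.2/ubuntu/ precise universe

Save the file and refresh the package index:

sudo apt-get update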

Now install wxPython:

sudo apt-get install python-wxgtk2.8 python-wxtools wx2.8-i18n

To also pull in the wxWidgets and GTK development packages, use:

sudo apt-get install python-wxgtk2.8 python-wxtools wx2.8-i18n libwxgtk2.8-dev libgtk2.0-dev
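A quick way to confirm the installation worked is to try importing the module from Python. This is a small sketch; the helper name wx_version is hypothetical, and the snippet is written to behave the same whether or not wxPython is present.

```python
def wx_version():
    """Return the installed wxPython version string, or None if wx is missing."""
    try:
        import wx
    except ImportError:
        return None
    return wx.version()


version = wx_version()
if version:
    print("wxPython version: %s" % version)
else:
    print("wxPython not found; re-check the apt-get install step")
```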

SAS launches Academy for Data Science

http://www.sas.com/en_us/learn/academy-data-science.html

SAS just launched a very nicely stacked set of two courses for its new data science program. It’s a trifle premium priced and as of now dependent only on its own SAS platform, but the curriculum and the teaching sound very good. SAS has been around for some time, and no one ever had to worry about a job after getting trained in the SAS language.

They are two six-week instructor-led courses, and it seems they are just tweaking details with a soft launch, but it is promising for things to come. Perhaps companies like IBM and SAP et al will follow up on this initiative to CREATE more data scientists as well as UPDATE software in data science 😉

SAS® Certified Big Data Professional

Build on your basic programming knowledge by learning to gather and analyze big data in SAS. This intensive six-week, level-one bootcamp focuses on big data management, data quality and visual data exploration for advanced analytics, and prepares you for the big data certification exams.*


SAS® Certified Data Scientist

Expand your big data certification skill set in our six-week data science bootcamp. This level-two program focuses on analytical modeling, machine learning, model deployment and automation, and critical communication skills. It also prepares you for the data science certification exams.*



Code

I read a chapter from How to Win Friends and Influence People as part of my Holiday reading. It is a remarkably well written book and I am trying to summarize a few key early learnings.

Use lucid examples that people can relate to while writing a book.
Base a book on what works or does not work in real life.
Do not criticize people (Chapter 1)

Since I criticize a lot, that is my New Year resolution: to stop trying to change other people by criticism.

I also started re-reading one of my favorite authors. Hemingway lived, died and wrote by a code of his own. Some learnings from him:

Keep words simple and sentences short
Write a lot
Be passionate
Be honorable

Honor and self-respect seem to be the underlying code for Hemingway.

To cap it off, I watched the documentary Code. I was really horrified at how we hackers have been so busy trying to change the world that we forgot to address some issues in the hacker culture:

We need more ethnic diversity
We need more gender diversity
Diversity brings better creative mix and stable teams

In addition I learnt that balancing funding with creative creation is essential to survival. Well funded creative projects will be better produced than less funded. What is shown more, sells more. (Jo Dikhta hain woh bikta hain)

Well, that’s all the code. But yes, the movie convinced me to try and lift a finger to help bring more women and African-Hispanic coders in my small way. I hope you try something like that too.

The compromises we make

What if the life you were meant to live never existed except as a figment of your own imagination? What if asking yourself rhetorical questions was the only life you were meant to live? Had I not got a pain in my neck, precipitating my getting up, rubbing ointment on it, and writing this post as an exercise in insomniac purging, where would these thoughts go? What if the best ideas that humanity got, individually and in toto, were flushed down the toilet every day because we were too busy compromising for five more minutes of sleep, for five more dollars per hour, for five more years in an unhappy relationship? What if I was supposed to write movie scripts that moved millions to laughter and tears instead of writing books a few hundred would read and posts for a few thousand more?

Ever think about the jobs you took for money? You compromised with your own self, your own satisfaction and your own conscience. Think about the jobs you took for satisfaction, turning down the money. You compromised with your brain, your sense of logic, the little voice in your head saying hey dumb arse, stop being so egoistic. The girl you saw at the cafe whom you felt was your divine soul but never said hello to because you were afraid of making a fool of yourself.

The compromises we make are the unhappiness we choose to live with. The compromises are the choices.

What if this was all there was to it?

Interview Domino Data Lab #datascience

Here is an interview with Eduardo Ariño de la Rubia, VP of Product & Data Scientist in Residence at Domino Data Lab. Here Eduardo weighs in on issues concerning data science and his experiences.

Ajay (A)- How does Domino Data Lab give a data scientist an advantage?

Eduardo (E) – Domino Data Lab’s enterprise data science platform makes data scientists more productive and helps teams collaborate better. For individual data scientists, Domino is a feature rich platform which helps them manage the analytics environment, provides scalable compute resources to run complex and multiple tasks in parallel, and makes it easy to share and productize analytic models. For teams, the Domino platform supports substantially better collaboration by making all the work people are doing viewable and reproducible. Domino provides a central analytics hub where all work is saved and hosted. The result is faster progress for individuals, and better results from teams.

A- What languages and platforms do you currently support?

E- Domino is an open platform that runs on Mac, Windows, or Linux. We’ll run any code that can be run on a Linux system. We have first class support for R, Python, MATLAB, SAS, and Julia.

A- How does Domino compare to PythonAnywhere, Google Cloud Datalab (https://cloud.google.com/datalab/) or other hosted Python solutions?

E- Domino was designed from the ground up to be an enterprise collaboration and data science platform. It’s a full featured platform in use at some of the largest research organizations in the world today.

A- What is your experience of Python versus other languages in the field of data science?

E- That’s the opening salvo of a religious war, and though I should know better than to involve myself, I will try to navigate it. First and foremost, I think it’s important to note that the two “most common” open source languages used by data scientists today, Python and R, have fundamentally hit feature parity in their maturity. While it’s true that for some particular algorithm, for some poorly trod use-case, one language and environment may have an edge over the other, I believe that for the average data scientist, language comes down to choice.

That being said, my personal experience is slightly more nuanced. My background is primarily computer science and as such, having spent many years on programming first and data analysis second, this has formed the way I approach a problem. I find that if I am doing the “exploratory analysis” or “feature engineering” phase of a data science project, and I am using a language which has roots in “typical programming”, oftentimes this will make me approach the solution of the problem less like a data scientist, and more like a programmer. When I should be thinking in terms of set or vectorized operations, when I should be thinking about whether I’m violating some constraint, instead I’m building a data structure to make an operation O(n log n) so that I can use a for loop when I shouldn’t.
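The contrast described above can be sketched with a toy example. The table and column names (store, sales) are made up; the point is only the difference between hand-building a data structure for a loop and expressing the same computation as one grouped, vectorized pandas operation.

```python
import pandas as pd

# Toy data (made up): sales per store.
df = pd.DataFrame({
    "store": ["a", "b", "a", "c", "b", "a"],
    "sales": [10.0, 20.0, 30.0, 40.0, 60.0, 50.0],
})

# Programmer's habit: build dictionaries of running totals, then loop.
totals = {}
counts = {}
for store, sales in zip(df["store"], df["sales"]):
    totals[store] = totals.get(store, 0.0) + sales
    counts[store] = counts.get(store, 0) + 1
loop_means = {store: totals[store] / counts[store] for store in totals}

# Set-based habit: one grouped, vectorized operation.
vectorized_means = df.groupby("store")["sales"].mean()

print(loop_means)
print(vectorized_means.to_dict())
```

Both approaches give the same per-store means; the grouped version states the intent directly and leaves the iteration to pandas.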

This isn’t an indictment of any language, nor is it a statement that there’s a fundamental benefit to thinking one way or another about a problem. It is however a testament to the fact that often when challenged, people will fall back to their most familiar skill set, and begin to treat every problem as a nail to be hammered. If I had come to Python *as* a data scientist first, it is possible this nuance wouldn’t have ever surfaced, however I learned Python before pandas, scikit-learn, and the DS revolution. So those neurons are quite trained up. However, I learned R purely as an endeavor in data science, and as such I don’t find myself falling back on “programmer’s habits” when I hit a wall in R. I take a step back and usually find a way to work around it within the idiomatic approaches.

To summarize, my experience is that language wars accomplish very little, and that most of the modern data science languages are up to the task. Just beware of the mental baggage that you bring with you on the journey.

A- What do you feel about polyglots ( multiple languages ) in data science (like R, Python, Julia) and software like Beaker and Jupyter that enable multiple languages?

E- Data science is a polyglot endeavor. At the very least, you usually have some data manipulation language (such as SQL) and some language for your analysis (R or Python). Often you have many more languages: for the data engineering pipeline I often reach for Perl (it’s still an amazing language for the transformation of text data), and sometimes I have a bit of code that must run very quickly, so I reach for C or C++. I think that multiple languages are a reality. Domino supports, out of the box, fundamentally every language that will run on Linux. If your feature pipeline involves some sed/awk, we understand. If you need a bit of Rcpp, we’re right there with you. If you want to output some amazing d3.js visualizations to summarize the data, we’re happy to provide the framework for you to host it on. Real world data is messy, and being a polyglot is a natural adaptation to that reality.

About-

Eduardo Ariño de la Rubia is VP of Product & Data Scientist in Residence at Domino Data Lab

Domino makes data scientists more productive and facilitates collaborative, reproducible, reusable analysis. The platform runs on premises or in the cloud. Its customers come from a wide range of industries, including government, insurance, advanced manufacturing, and pharmaceuticals. It is backed by Zetta Venture Partners, Bloomberg Beta, and In-Q-Tel.