Interview Skipper Seabold Statsmodels #python #rstats

As part of my research for Python for R Users: A Data Science Approach (Wiley 2016), here is an interview with Skipper Seabold, creator of the statsmodels Python package. Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. Since I have been actively playing with this package, I have added some screenshots to show it is a viable way to build regression models.


Ajay (A)- What prompted you to create Stats Models package?
 
Skipper (S) I was casting about for an open source project that I could take on to help further my programming skills during my graduate studies. I asked one of my professors who is involved in the Python community for advice. He urged that I look into the Google Summer of Code program under the SciPy project. One of the potential projects was resurrecting some code that used to be in scipy as scipy.stats.models. Getting involved in this project was a great way to strengthen my understanding of econometrics and statistics during my graduate studies. I raised the issue on the scipy mailing list, found a mentor in my co-lead developer Josef Perktold, and we started working in earnest on the project in 2009.
A- What has been the feedback from users so far?
 
S- Feedback has generally been pretty good. I think people now see that Python is not only a viable but also a compelling alternative to R for statistical and econometric research as well as applied work.
A- What is your roadmap for statsmodels going forward?
 
S- Our roadmap going forward is not much more than continuing to merge good code contributions, working through our current backlog of pull requests, and continuing to work on consistency of naming and API in the package for a better overall user experience. Each developer mainly works on their own research interests for new functionality, such as state-space modeling, survival modeling, statistical testing, high-dimensional models, and models for big data.
There has been some expressed interest in developing a kind of plugin system so that community contributions are easier, in a more regular release cycle, and in merging some long-standing, large pull requests such as exponential smoothing and panel data models.
A- How do you think statsmodels compares with R packages like car and others from https://cran.r-project.org/web/views/Econometrics.html? What are the advantages, if any, of using Python over R for building models?
 
S- You could use statsmodels for pretty much any level of applied or pure econometrics research at the moment. We have implementations of discrete choice models, generalized linear models, time-series and state-space models, generalized method of moments, generalized estimating equations, nonparametric models, and support for instrumental variables regression, to pick a few areas of overlap. We provide most of the core components that you are going to find in R. Some of these components may still be more on the experimental side or may be less polished than their R counterparts. Newer functionality could use more user feedback and API refinement, given that some of these R packages have seen more use, but the implementations are mostly there.
One of the main advantages I see to doing statistical modeling in Python over R is the community and the experience gained. There’s a huge diversity of backgrounds in the Python community, from web developers to computer science researchers to engineers and statisticians. Those doing statistics in Python are able to benefit from this larger Python community. I often see more of a focus on unit testing, API design, and writing maintainable, readable code in Python than in R. I would also venture to say that the Python community is a little friendlier to those new to programming, in terms of both the people and the language. While the former isn’t strictly true now that we have Stack Overflow, the R mailing lists have a reputation for being very unforgiving places. As for the latter, things like the prevalent generic-function object-oriented style and features like non-standard evaluation are really nice for an experienced R user, but they can be a little opaque and daunting for beginners, in my opinion.
That said, I don’t really see R and Python as competitors. I’m an R user and think that the R language provides a wonderful environment for doing interactive statistical computing. There are also some awesome tools like RStudio and Shiny. When it comes down to it both R and Python are most often wrappers around C, C++, and Fortran code and the interactive computing language that you use is largely a matter of personal preference.
Example 1 – Statsmodels in action on diamonds dataset 
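Since the original screenshot does not reproduce here, below is a minimal, hypothetical sketch of the kind of regression Example 1 shows: an OLS fit through the formula interface on a synthetic DataFrame standing in for the diamonds data (the column names and coefficients are illustrative, not the real dataset):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-in for the diamonds data: price driven by carat and cut.
df = pd.DataFrame({
    "carat": rng.uniform(0.2, 2.5, n),
    "cut": rng.choice(["Fair", "Good", "Ideal"], n),
})
df["price"] = 2000 * df["carat"] + 500 * (df["cut"] == "Ideal") + rng.normal(0, 100, n)

# R-style formula (handled by patsy); categorical terms are expanded automatically.
model = smf.ols("price ~ carat + C(cut)", data=df).fit()
print(model.summary())  # coefficient table, R-squared, diagnostics
```

Because a pandas DataFrame goes in, the fitted parameters come back as a pandas Series indexed by term name, e.g. `model.params["carat"]`.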
 
A- How well is statsmodels integrated with Pandas, sci-kit learn and other Python Packages?
 
S- Like any scientific computing package in Python, statsmodels relies heavily on numpy and scipy to implement most of the core statistical computations.
Statsmodels integrates well with pandas. I was both an early user of and contributor to the pandas project. For years statsmodels has had a system whereby, if a user supplies pandas data structures, all relevant information is preserved and the results come back as pandas data structures.
Statsmodels also leverages the patsy project to provide a formula framework inspired by that of S and R.
Statsmodels is also used by other projects, such as seaborn, to do the number-crunching behind their statistical visualizations.
As for scikit-learn, though I am a heavy user of the package, statsmodels has so far not integrated with it out of the box. We do not implement the scikit-learn API, though I have some proof-of-concept code that turns the statistical estimators in statsmodels into scikit-learn estimators.
We are certainly open to hearing about use cases that tighter integration would enable, but the packages often have different focuses. Scikit-learn focuses more on things like feature selection and prediction. Statsmodels is more focused on model inference and statistical tests. We are interested in continuing to explore possible integrations with the scikit-learn developers.
A- How effective is statsmodels for creating propensity models, or say logit models for the financial industry or others? Which industry do you see using Pythonic statistical modeling the most?
 
S- I have used statsmodels to do propensity score matching and we have some utility code for this, but it hasn’t been a major focus for the project. Much of the driving force for statsmodels has been the research needs of the developers given our time constraints. This is an area we’d be happy to have contributions in.
All of the core, traditional classification algorithms are implemented in statsmodels with proper post-estimation results that you would expect from a statistical package.
Example 2 – Statsmodels in action on Boston dataset outliers
As far as particular industries, it’s not often clear where the project is being used outside of academics. Most of our core contributors are from academia, as far as I know. I think there is certainly some use of the time-series modeling capabilities in finance, and I know people are using logistic regression for classification and inference. I work as a data scientist, and I see many data scientists using the package in a variety of projects from marketing to churn modeling and forecasting. We’re always interested to hear from people in industry about how they’re using statsmodels or looking for contributions that could make the project work better for their use cases.
About-
Skipper Seabold is a data scientist at Civis Analytics.
Before joining Civis, Skipper was a software engineer and data scientist at DataPad, Inc. He is in the final stages of a PhD in economics from American University in Washington, DC. He is the creator of the statsmodels package in Python.

SAS launches Academy for Data Science

http://www.sas.com/en_us/learn/academy-data-science.html

SAS just launched a very nicely stacked set of two courses for its new data science program. It’s a trifle premium priced and as of now dependent only on its own SAS platform, but the curriculum and the teaching sound very good. SAS has been around for some time, and no one ever had to worry about a job after getting trained in the SAS language.

There are two six-week instructor-led courses, and it seems they are just tweaking details with a soft launch, but it is promising for things to come. Perhaps companies like IBM and SAP et al will follow up on this initiative to CREATE more data scientists as well as UPDATE software in data science 😉

Build on your basic programming knowledge by learning to gather and analyze big data in SAS. This intensive six-week, level-one bootcamp focuses on big data management, data quality and visual data exploration for advanced analytics, and prepares you for the big data certification exams.*

Expand your big data certification skill set in our six-week data science bootcamp. This level-two program focuses on analytical modeling, machine learning, model deployment and automation, and critical communication skills. It also prepares you for the data science certification exams.*


 

The compromises we make

What if the life you were meant to live never existed except as a figment of your own imagination? What if asking yourself rhetorical questions was the only life you were meant to live? Had I not got a pain in my neck, precipitating my getting up, rubbing ointment on it, and writing this post as an exercise in insomniac purging, where would these thoughts go? What if the best ideas that humanity got, individually and in toto, were flushed down the toilet every day because we were too busy compromising for five more minutes of sleep, for five more dollars per hour, for five more years with the unhappy relationship? What if I was supposed to write movie scripts that moved millions to laughs and tears instead of writing books a few hundred would read and posts for a few thousand more?

Ever think about the jobs you took for money? You compromised with your own self, your own satisfaction and your own conscience. Think about the jobs you took for satisfaction, turning down the money. You compromised with your brain, your sense of logic, the little voice in your head saying, hey dumb arse, stop being so egoistic. The girl you saw at the cafe whom you felt was your divine soul but never said hello to because you were afraid of making a fool of yourself.

The compromises we make are the unhappiness we chose to live with. The compromises are the choices.

What if this was all there was to it.

Interview Noah Gift #rstats #python #erlang

Here is an interview with Noah Gift, CTO of Sqor Sports. He is a prolific coder in Python, R and Erlang. Since he is an expert coder in both R and Python, I decided to take his views on both. Noah is also the author of the book

Ajay (A)- Describe your journey in science and coding. What interests you and keeps you motivated in writing code?
N- Artificial intelligence motivates me to continue to learn and write code, even after 40.  In addition, functional programming and cool languages like Erlang are a pleasure to use.  Finally, I enjoy problem solving, whether it comes in the form of mastering Brazilian jiu-jitsu, rock climbing or writing code every week.  It is an enjoyable game, and the fact that these types of skills take years to learn makes it very satisfying to make progress day by day, potentially until the day I die.
A- The data science community itself has debated this many times. What are your views on it? How do we decide when to use Python and when to use R, and when to use both (or not)?
N- I think R is best for data science and statistics and for cutting-edge machine learning; this is what I do.  As for Python, it is very tough to beat for writing quick scripts or turning thought into working code.  I wouldn’t necessarily use either language to build scalable systems though.  I think they are both prototyping or “batch” scripting systems.
A- Describe your work in Sports Analytics – what are some of the interesting things about data science in sports
N- I think rating players and teams using the Elo rating is an interesting example of simple math used to make powerful conclusions.  For example, this has been used effectively in MMA and basketball.  Machine learning around movement, say basketball players moving on the court, is also going to be a very interesting data science application in sports.  We will be able to tell when a player should be pulled out of the game for being tired.  Finally, with wearables, we may soon be able to treat athletes the same way we treat machines.  Data science in sports is going to grow exponentially in the near future.
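The Elo update Noah mentions really is simple math; here is a short sketch (the K-factor of 32 is a common default, not something specified in the interview):

```python
def expected_score(rating_a, rating_b):
    """Probability that player A beats player B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a, rating_b, score_a, k=32):
    """Return updated ratings after a game; score_a is 1 (win), 0.5 (draw), or 0 (loss)."""
    ea = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - ea)
    new_b = rating_b + k * ((1 - score_a) - (1 - ea))
    return new_a, new_b

# An underdog (1400) beating a favorite (1600) produces a large rating swing,
# because the result was unexpected under the model.
a, b = update_elo(1400, 1600, 1)
```

Note that the total rating points are conserved: whatever the winner gains, the loser gives up, which is what makes Elo a zero-sum rating scheme.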
A- How do we get younger students and the next generation excited about coding? How do we make sure people from poorer countries also learn coding? Can we teach coding on the mobile?
N- Using simple methods and simple open source languages to solve problems is a good approach.  For example, in Python or Erlang or R, and especially if it is functionally oriented, it is very simple to write code.  The problem I see in motivating and teaching people to program is when needless complexity, like object oriented design, is thrown in.  Keeping it simple and teaching people to write simple functions is the best way to go at first.
A- Describe your methodology for work-life balance. How important are health and balance for programmers and hackers?
N- I train martial arts, especially MMA and BJJ, several times a week and train overall 6 days a week.  All true hackers/programmers should seriously consider being in peak physical condition (almost at the level of a pro athlete) because of the side benefits: clarity of thought, confidence, pain tolerance, endurance, happiness and more.  In addition, taking breaks, including vacations, just as professional athletes take rest days, is very important.  How much work is done in one day, or one week, or one month is nothing compared to what someone does every day for years.  The overall discipline of doing little bits of work over time is a better way to code until the day you die, like I plan to.
About-
Sqor is a sports social network that gives you the latest news and scores as well as unfiltered access to athletes.

Noah Gift is Chief Technical Officer and General Manager of Sqor. In this role, Noah is responsible for general management, product development and technical engineering. Prior to joining Sqor, Noah led Web Engineering at Linden Lab. He has a B.S. in Nutritional Science from Cal Poly S.L.O., a Master’s degree in Computer Information Systems from CSULA, and an MBA from UC Davis. You can read more on him here.

Related-

Some Articles on Python by Noah

Cloud business analytics: Write your own dashboard

Data science in the cloud Investment analysis with IPython and pandas

Linear optimization in Python, Part 1: Solve complex problems in the cloud with Pyomo

Linear optimization in Python, Part 2: Build a scalable architecture in the cloud

Using Python to create UNIX command line tools

 

Guest Blog on KDnuggets: Using Python and R together: 3 main approaches

I just got published on KDnuggets for a guest blog at http://www.kdnuggets.com/2015/12/using-python-r-together.html – I list the reasons for moving to using both Python and R (not just one) and the current technology for doing so. I think the R project could greatly benefit if the huge Python community came closer to using the R language, and Python developers could greatly benefit from using R packages.

An extract is here-

Using Python and R together: 3 main approaches

  1. Both languages borrow from each other. Even seasoned package developers like Hadley Wickham (RStudio) borrow from Beautiful Soup (Python) to make rvest for web scraping. Yhat borrows from sqldf to make pandasql. Rather than reinvent the wheel in the other language, developers can focus on innovation.

  2. The customer does not care which language the code was written in; the customer cares about insights.

 

To read the complete article … see

http://www.kdnuggets.com/2015/12/using-python-r-together.html

Extract from Eric Siegel’s Forthcoming Book

From http://www.predictiveanalyticsworld.com/patimes/a-rogue-liberal-halting-nsa-bulk-data-collection-compromises-intelligence/6882/

 

A National Security Agency (NSA) data gathering facility is seen in Bluffdale, about 25 miles south of Salt Lake City, Utah May 18. Technology holds the power to discover terrorism suspects from data—and yet to also safeguard privacy even with bulk telephone and email data intact, the author argues.

I must disagree with my fellow liberals. The NSA bulk data shutdown scheduled for November 29 is unnecessary and significantly compromises intelligence capabilities. As recent tragic events in Paris and elsewhere turn up the contentious heat on both sides of this issue, I’m keenly aware that mine is not the usual opinion for an avid supporter of Bernie Sanders (who was my hometown mayor in Vermont).

But as a techie, a former Columbia University computer science professor, I’m compelled to break some news: Technology holds the power to discover terrorism suspects from data—and yet to also safeguard privacy even with bulk telephone and email data intact. To be specific, stockpiling data about innocent people in particular is essential for state-of-the-art science that identifies new potential suspects.

I’m not talking about scanning to find perpetrators, the well-known practice of employing vigilant computers to trigger alerts on certain behavior. The system spots a potentially nefarious phone call and notifies a heroic agent—that’s a standard occurrence in intelligence thrillers, and a common topic in casual speculation about what our government is doing. Everyone’s familiar with this concept.

Rather, bulk data takes on a much more difficult, critical problem: precisely defining the alerts in the first place. The actual “intelligence” of an intelligence organization hinges on the patterns it matches against millions of cases—it must develop adept, intricate patterns that flag new potential suspects. Deriving these patterns from data automatically, the function of predictive analytics, is where the scientific rubber hits the road. (Once they’re established, matching the patterns and triggering alerts is relatively trivial, even when applied across millions of cases—that kind of mechanical process is simple for a computer.)

I want your data
It may seem paradoxical, but data about the innocent civilian can serve to identify the criminal. Although the ACLU calls it “mass, suspicionless surveillance,” this data establishes a baseline for the behavior of normal civilians. That is to say, law enforcement needs your data in order to learn from you how non-criminals behave. The more such data available, the more effectively it can do so.

 

This Newsweek article, originally published in Newsweek’s opinion section and excerpted here, resulted from the author’s research for a new extended sidebar on the topic that will appear in the forthcoming Revised and Updated, paperback edition of Eric Siegel’s Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die (coming January 6, 2016).