Interview Damien Farrell Python GUI DataExplore #python #rstats #pydata

Here is an interview of the Dr Damien Farrell creator of an interesting Python GUI with some data science flavors called DataExplore.  Of course R has many Data Analysis GUI like R Commander, Deducer, Rattle which we have all featured on this site before. Hopefully there can be cross pollination of ideas on GUI design for Data Science in Python/ pydata community.

A- What solution does DataExplore provide to data scientists?

D- It’s not really meant for data scientists specifically. It is targeted towards scientists and students who want to do some analysis but cannot yet code. R-studio is the closest comparison. That’s a very good tool and much more comprehensive but it still does require you know the R language. So there is a bit of a learning curve. I was looking to make something that allows you to manipulate data usefully but with minimal coding knowledge. You could see this as an intermediate between a spreadsheet and using something like R-studio or R commander. Ultimately there is no replacement for being able to write your own code but this could serve as a kind of gateway to introduced the concepts involved. It is also a good way to quickly explore and plot your data and could be seen as complimentary to other tools.
A- What were your motivations for making pandastable/DataExplore?
D- Non-computational scientists are sometimes very daunted by the prospect of data analysis. People who work as wet lab scientists in particular often do not see themselves capable of substantial analysis even though they are well able to do it. Nowadays they are presented with a lot of sometimes heterogeneous data and it is intimidating if you cannot code. Obviously advanced analysis requires programming skills that take time to learn but there is no reason that some comprehensive analysis can’t be done using the right tools. Data ‘munging’ is one skill that is not easily accessible to the non programmer and that must be frustrating. Traditionally the focus is on either using a spreadsheet which can be very limited or plotting with commercial tools like prism. More difficult tasks are passed on to the specialists. So my motivation is to provide something that bridges the data manipulation and plotting steps and allows data to be handled more confidently by a ‘non-data analyst’.
A- What got you into data science and python development. Describe your career journey so far
D- I currently work as a postdoctoral researcher in bovine and pathogen genomics though I am not a biologist. I came from outside the field from a computer science and physics background. When I got the chance to do a PhD in a research group doing structural biology I took the opportunity and stayed in biology. I only started using Python about 7 years ago and use it for nearly everything. I suppose I do what  is now called bioinformatics but the term doesn’t tell you very much in my opinion. In any case I find myself doing a lot of general data analysis.
Early on I developed end user tools in Python but they weren’t that successful since it’s so hard to create a user base in a niche area. I thought I would try something more general this time. I started using Pandas a few years ago and find it pretty indispensable now. Since the pydata stack is quite mature and has a large user community I thought using these libraries as a front-end to a desktop application would be an interesting project.
plot_samples
A-What is your roadmap or plans in future for pandastable?
D- pandastable is the name of the library because it’s a widget for Tkinter that provides a graphical view for a pandas dataframe. DataExplore is then the desktop application based around that. This is a work in progress and really a side project. Hopefully there will be some uptake and then it’s up to users to decide what they want out of it. You can only go so far in guessing what people might find useful or even easy to use. There is a plugin system which makes it easy to add arbitrary functionality if you know Python, so that could be one avenue of development. I implemented this tool in the rather old Tkinter GUI toolkit and whilst quite functional it has certain limitations. So updating to use Qt5 might be an option. Although the fashion is for web applications I think there is still plenty of scope for desktop tools.
A- How can we teach data science to more people in easier way to reduce the demand-supply gap for data scientists? 
D- A can’t speak about business, but in science teaching has certainly lagged behind the technology. I don’t know about other fields, but in molecular biology we are now producing huge amounts of data because something like sequencing has developed so rapidly. This is hard to avoid in research. Probably the concepts need to be introduced early on in undergraduate level so that PhD students don’t come to data analysis cold. In biological sciences I think postgraduate programs are slowly adapting to allow training in wet and dry lab disciplines.

 

About

Dr. Damien Farrell is Postdoctoral fellow of School of Veterinary Medicine at University College Dublin Ireland. The download page for the dataexplore app is : http://dmnfarrell.github.io/pandastable/

Related

 

How does cryptography work?

How does cryptography work?

by Jeroen Ooms

https://cran.r-project.org/web/packages/sodium/vignettes/crypto101.html

This page attempts to give a very basic conceptual introduction to cryptographic methods. Before we start the usual disclaimer:

I am not a cryptographer. This document is only for educational purposes. Crypto is hard, you should never trust your home-grown implementation. Unless you’re a cryptographer you will probably overlook some crucial details. Developers should only use the high-level functions that have been implemented by an actual cryptographer.

Now that we got this is out of the way, let’s start hacking :)

The XOR operator

The logical XOR operator outputs true only when both inputs differ (one is true, the other is false). It is sometimes called an invertor because the output of x gets inverted if and only if y is true:

# XOR two (8bit) bytes 'x' and 'y'
x <- as.raw(0x7a)
y <- as.raw(0xe4)
z <- base::xor(x, y)
dput(z)
as.raw(0x9e)
# Show the bits in each byte
cbind(x = rawToBits(x), y = rawToBits(y), z = rawToBits(z))
      x  y  z
[1,] 00 00 00
[2,] 01 00 01
[3,] 00 01 01
[4,] 01 00 01
[5,] 01 00 01
[6,] 01 01 00
[7,] 01 01 00
[8,] 00 01 01

In cryptography we xor a message x with secret random data y. Because each bit in y is randomly true with probability 0.5, the xor output is completely random and uncorrelated to x. This is called perfect secrecy. Only if we know y we can decipher the message x.

# Encrypt message using random one-time-pad
msg <- charToRaw("TTIP is evil")
one_time_pad <- random(length(msg))
ciphertext <- base::xor(msg, one_time_pad)

# It's really encrypted
rawToChar(ciphertext)
[1] "(8\xd7ȉ%\u035f\x81\xbb\023\xa2"
# Decrypt with same pad
rawToChar(base::xor(ciphertext, one_time_pad))
[1] "TTIP is evil"

This method is perfectly secure and forms the basis for most cryptograhpic methods. However the challenge is generating and communicating unique pseudo-random y data every time we want to encrypt something. One-time-pads as in the example are not very practical for large messages. Also we should never re-use a one-time-pad y for encrypting multiple messages, as this compromises the secrecy.

Stream ciphers

A stream cipher generates a unique stream of pseudo-random data based on a secret key and a unique nonce. For a given set of parameters the stream cipher always generates the same stream of data. Sodium implements a few popular stream ciphers:

password <- "My secret passphrase"
key <- hash(charToRaw(password))
nonce <- random(8)
chacha20(size = 20, key, nonce)
 [1] 51 c6 c9 45 c6 13 6b 3d 6f 5c e3 ab 9f 16 f2 46 ce cb 19 f3

Each stream requires a key and a nonce. The key forms the shared secret and should only be known to trusted parties. The nonce is not secret and is stored or sent along with the ciphertext. The purpose of the nonce is to make a random stream unique to protect gainst re-use attacks. This way you can re-use a your key to encrypt multiple messages, as long as you never re-use the same nonce.

salsa20(size = 20, key, nonce)
 [1] df 7d 13 ca ea 7c ff 93 e5 b6 fe b6 6b e2 91 14 ed ae 17 eb

Over the years cryptographers have come up with many more variants. Many stream ciphers are based on a block cipher such as AES: a keyed permutation of fixed length amount of data. The block ciphers get chained in a particular mode of operation which repeatedly applies the cipher’s single-block operation to securely transform amounts of data larger than a block.

We are not going to discuss implementation details, but you could probably come up with something yourself. For example you could use a hash function such sha256 as the block cipher and append counter which is incremented for each block (this is called CTR mode).

# Illustrative example.
sha256_ctr <- function(size, key, nonce){
  n <- ceiling(size/32)
  output <- raw()
  for(i in 1:n){
    counter <- packBits(intToBits(i))
    block <- sha256(c(key, nonce, counter))
    output <- c(output, block)
  }
  return(output[1:size])
}

This allows us to generate an arbitrary length stream from a single secret key:

password <- "My secret passphrase"
key <- hash(charToRaw(password))
nonce <- random(8)
sha256_ctr(50, key, nonce)
 [1] 07 01 96 02 7e c7 37 b4 8c b1 6a ec 4e 2d 56 34 7d 39 13 bc 72 e0 19
[24] ad b3 44 0e 9f 88 bb 3d 26 94 aa 66 01 2e bd 46 55 2c 04 99 1e af a9
[47] 91 cd 53 b4

In practice, you should never write your own ciphers. A lot of research goes into studying the properties of block ciphers under various modes of operation. In the remainder we just use the standard Sodium ciphers: chacha20, salsa20, xsalsa20 or aes128. See sodium documentation for details.

Symmetric encryption

Symmetric encryption means that the same secret key is used for both encryption and decryption. All that is needed to implement symmetric encryption is xor and a stream cipher. For example to encrypt an arbitrary length message using password:

# Encrypt 'message' using 'password'
myfile <- file.path(R.home(), "COPYING")
message <- readBin(myfile, raw(), file.info(myfile)$size)
passwd <- charToRaw("My secret passphrase")

A hash function converts the password to a key of suitable size for the stream cipher, which we use to generate a psuedo random stream of equal length to the message:

# Basic secret key encryption
key <- hash(passwd)
nonce8 <- random(8)
stream <- chacha20(length(message), key, nonce8)
ciphertext <- base::xor(stream, message)

Now the ciphertext is an encrypted version of the message. Only those that know the key and the nonce can re-generate the same keystream in order to xor the ciphertext back into the original message.

# Decrypt with the same key
key <- hash(charToRaw("My secret passphrase"))
stream <- chacha20(length(ciphertext), key, nonce8)
out <- base::xor(ciphertext, stream)

# Print part of the message
cat(substring(rawToChar(out), 1, 120))
            GNU GENERAL PUBLIC LICENSE
               Version 2, June 1991

 Copyright (C) 1989, 1991 Free Software Foundation, Inc.

The Sodium functions data_encrypt and data_decrypt provide a more elaborate implementation of the above. This is what you should use in practice for secret key encryption.

Symmetric encryption can be used for e.g. encrypting local data. However because the same secret is used for both encryption and decryption, it is impractical for communication with other parties. For exchanging secure messages we need public key encryption.

Public-key encryption and Diffie-Hellman

Rather than using a single secret-key, assymetric (public key) encryption requires a keypair, consisting of a public key for encryption and a private-key for decryption. Data that is encrypted using a given public key can only be decrypted using the corresponding private key.

The public key is not confidential and can be shared on e.g. a website or keyserver. This allows anyone to send somebody a secure message by encrypting it with the receivers public key. The encrypted message will only be readable by the owner of the corresponding private key.

# Create keypair
key <- keygen()
pub <- pubkey(key)

# Encrypt message for receiver using his/her public key
msg <- serialize(iris, NULL)
ciphertext <- simple_encrypt(msg, pub)

# Receiver decrypts with his/her private key
out <- simple_decrypt(ciphertext, key)
identical(msg, out)
[1] TRUE

How does this work? Public key encryption makes use of Diffie-Hellman (D-H): a method which allows two parties that have no prior knowledge of each other to jointly establish a shared secret key over an insecure channel. In the most simple case, both parties generate a temporary keypair and exchange their public key over the insecure channel. Then both parties use the D-H function to calculcate the (same) shared secret key by combining their own private key with the other person’s public key:

# Bob generates keypair
bob_key <- keygen()
bob_pubkey <- pubkey(bob_key)

# Alice generates keypair
alice_key <- keygen()
alice_pubkey <- pubkey(alice_key)

# After Bob and Alice exchange pubkey they can both derive the secret
alice_secret <- diffie_hellman(alice_key, bob_pubkey)
bob_secret <- diffie_hellman(bob_key, alice_pubkey)
identical(alice_secret, bob_secret)
[1] TRUE

Once the shared secret has been established, both parties can discard their temporary public/private key and use the shared secret to start encrypting communications with symmetric encryption as discussed earlier. Because the shared secret cannot be calculated using only the public keys, the process is safe from eavesdroppers.

The classical Diffie-Hellman method is based on the discrete logarithm problem with large prime numbers. Sodium uses curve25519, a state-of-the-art D-H function by Daniel Bernsteinan designed for use with the elliptic curve Diffie–Hellman (ECDH) key agreement scheme.

 

 

(Ajay- I really liked this very nice tutorial on cryptography and hope it helps bring more people in the debate. This is just to share this very excellent vignette based on the Sodium package in R)

Interview Skipper Seabold Statsmodels #python #rstats

As part of my research for Python for R Users: A Data Science Approach (Wiley 2016) Here is an interview with Skipper Seabold, creator of statsmodels, Python package. Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. Since I have been playing actively with this package, I have added some screenshots to show it is a viable way to build regression models.

sseabold-e596b56c5e7119b013e4f21c2a7db642

Ajay (A)- What prompted you to create Stats Models package?
 
Skipper (S) I was casting about for an open source project that I could take on to help further my programming skills during my graduate studies. I asked one of my professors who is involved in the Python community for advice. He urged that I look into the Google Summer of Code program under the SciPy project. One of the potential projects was resurrecting some code that used to be in scipy as scipy.stats.models. Getting involved in this project was a great way to strengthen my understanding of econometrics and statistics during my graduate studies. I raised the issue on the scipy mailing list, found a mentor in my co-lead developer Josef Perktold, and we started working in earnest on the project in 2009.
A- What has been the feedback from users so far?
 
S- Feedback has generally been pretty good. I think people now see that Python is a not only viable but also compelling alternative to R for doing statistics and econometric research as well as applied work.
A- What is your roadmap for Stats Models going forward ?
 
S- Our roadmap going forward is not much more than continuing to merge good code contributions, working through our current backlog of pull requests, and contuing to work on consistency of naming and API in the package for a better overall user experience. Each developer mainly works on their own research interests for new functionality, such as state-space modeling, survival modeling, statistical testing, high dimensional models, and models for big data.
There has been some expressed interest in developing a kind of plugin system such that community contributions are easier, a more regular release cycle, and merging some long-standing, large pull requests such as exponential smoothing and panel data models.
A-  How do you think statsmodels compares with R packages like car and others from https://cran.r-project.org/web/views/Econometrics.html . What are the advantages if any of using Python for building the model than R
 
S- You could use statsmodels for pretty much any level of applied or pure econometrics research at the moment. We have implementations of discrete choice models, generalized linear models, time-series and state-space models, generalized method of moments, generalized estimating equations, nonparametric models, and support for instrumental variables regression just to pick a few areas of overlap. We provide most of the core components that you are going to find in R. Some of these components may still be more on the experimental side or may be less polished than their R counterparts. Newer functionality could use more user feedback and API design though given that some of these R packages have seen more use, but the implementations are mostly there.
One of the main advantages I see to doing statistical modeling in Python over R are in terms of the community and the experience gained. There’s a huge diversity of backgrounds in the Python community from web developers to computer science researchers to engineers and statisticians. Those doing statistics in Python are able to benefit from this larger Python community. I often see more of a focus on unit testing, API design, and writing maintainable, readable code in Python rather than R. I would also venture to say that the Python community is a little friendlier to those new to programming in terms of the people and the language. While the former isn’t strictly true now that we have stack overflow, the R mailing lists have the reputation of being very unforgiving places. As far as the latter, things like the prevalent generic-function object-oriented style and features like non-standard evaluation are really nice for an experienced R user, but they can be a little opaque and daunting for beginners in my opinion.
That said, I don’t really see R and Python as competitors. I’m an R user and think that the R language provides a wonderful environment for doing interactive statistical computing. There are also some awesome tools like RStudio and Shiny. When it comes down to it both R and Python are most often wrappers around C, C++, and Fortran code and the interactive computing language that you use is largely a matter of personal preference.
Selection_025
Example 1 – Statsmodels in action on diamonds dataset 
 
A- How well is statsmodels integrated with Pandas, sci-kit learn and other Python Packages?
 
S- Like any scientific computing package in Python, statsmodels relies heavily on numpy and scipy to implement most of the core statistical computations.
Statsmodels integrates well with pandas. I was both an early user and contributor to the pandas project. We have had for years a system for statsmodels such that if a user supplies data structures from pandas to statsmodels, then all relevant information will be preserved and users will get back pandas data structures as results.
Statsmodels also leverages the patsy project to provide a formula framework inspired by that of S and R.
Statsmodels is also used by other projects such as seaborn to provide the number-crunching for the statistical visualizations provided.
As far as scikit-learn, though I am a heavy user of the package, so far statsmodels has not integrated well with it out of the box. We do not implement the scikit-learn API, though I have some proof of concept code that turns the statistical estimators in statsmodels into scikit-learn estimators.
We are certainly open to hearing about use cases that tighter integration would enable, but the packages often have different focuses. Scikit-learn focuses more on things like feature selection and prediction. Statsmodels is more focused on model inference and statistical tests. We are interested in continuing to explore possible integrations with the scikit-learn developers.
A- How effective is Stats Models for creating propensity models, or say logit models for financial industry or others. Which industry do you see using Pythonic statistical modeling the most.
 
S- I have used statsmodels to do propensity score matching and we have some utility code for this, but it hasn’t been a major focus for the project. Much of the driving force for statsmodels has been the research needs of the developers given our time constraints. This is an area we’d be happy to have contributions in.
All of the core, traditional classification algorithms are implemented in statsmodels with proper post-estimation results that you would expect from a statistical package.
Selection_024
Example 2 – Statsmodels in action on Boston dataset outliers
As far as particular industries, it’s not often clear where the project is being used outside of academics. Most of our core contributors are from academia, as far as I know. I think there is certainly some use of the time-series modeling capabilities in finance, and I know people are using logistic regression for classification and inference. I work as a data scientist, and I see many data scientists using the package in a variety of projects from marketing to churn modeling and forecasting. We’re always interested to hear from people in industry about how they’re using statsmodels or looking for contributions that could make the project work better for their use cases.
About-
Skipper Seabold is a data scientist at Civis Analytics.
Before joining Civis, Skipper was a software engineer and data scientist at DataPad, Inc. He is in the final stages of a PhD in economics from American University in Washington, DC . He is the creator of statsmodels package in Python.

Interview Chris Kiehl Gooey #Python making GUIs in Python

Here is an interview with Chris Kiehl, developer of Python package Gooey.  Gooey promises to turn (almost) any Python Console Program into a GUI application with one line

f54f97f6-07c5-11e5-9bcb-c3c102920769

Ajay (A) What was your motivation for making Gooey?  

Chris (C)- Gooey came about after getting frustrated with the impedance mismatch between how I like to write and interact with software as a developer, and how the rest of the world interacts with software as consumers. As much as I love my glorious command line, delivering an application that first requires me to explain what a CLI even is feels a little embarrassing. Gooey was my solution to this. It let me build as complex of a program as I wanted, all while using a familiar tool chain, and with none of the complexity that comes with traditional desktop application development. When it was time to ship, I’d attach the Gooey decorator and get the UI side for free

A- Where can Gooey can be used potentially in industry? 

C- Gooey can be used anywhere where you bump into a mismatch  in computer literacy. One of its core strengths is opening up existing CLI tool chains to users that would otherwise be put off by the unfamiliar nature of the command line. With Gooey, you can expose something as complex as video processing with FFMPEG via a very friendly UI with almost negligible development effort.

A- What other packages have you authored or contributed in Python or other languages?

C- My Github is a smorgasbord  of half-completed projects. I have several tool-chain projects related to Gooey. These range from packagers, to web front ends, to example configs. However, outside of Gooey, I created pyRobot, which is a pure Python windows automation library. Dropler, a simple html5 drag-and-drop plugin for CKEditor. DoNotStarveBackup, a Scala program that backs up your Don’t Starve save file while playing (a program which I love, but others actively hate for being “cheating” (pfft..)). And, one of my favorites: Burrito-Bot. It’s a little program that played (and won!) the game Burrito Bison. This was one of the first big things I wrote when I started programming. I keep it around for time capsule, look-at-how-I-didn’t-know-what-a-for-loop-was sentimental reasons.

A- What attracted you to developing in Python. What are some of the advantages and disadvantages of the language? 

C– I initially fell in love with Python for the same reasons everyone else does: it’s beautiful. It’s a language that’s simple enough to learn quickly, but has enough depth to be interesting after years of daily use.
Hands down, one of my favorite things about Python that gives it an edge over other languages is it’s amazing introspection. At its core, everything is a dictionary. If you poke around hard enough, you can access just about anything. This lets you do extremely interesting things with meta programming. In fact, this deep introspection of code is what allows Gooey to bootstrap itself when attached to your source file.
Python’s disadvantages vary depending on the space in which you operate. Its concurrency limitations can be extremely frustrating. Granted, you don’t run into them too often, but when you do, it is usually for show stopping reasons. The related side of that is its asynchronous capabilities. This has gotten better with Python3, but it’s still pretty clunky if you compare it to the tooling available to a language like  Scala.

A- How can we incentivize open source package creators the same we do it for app stores etc?

C- On an individual level, if I may be super positive, I’d argue that open source development is already so awesome that it almost doesn’t need to be further incentivized. People using, forking, and commiting to your project is the reward. That’s not to say it is without some pains — not everyone on the internet is friendly all the time, but the pleasure of collaborating with people all over the globe on a shared interest are tough to overstate.
Related-