Interview Mike Bayer SQLAlchemy #pydata #python

Here is an interview with Mike Bayer, the creator of the popular Python package SQLAlchemy.

Ajay (A) – How and why did you create SQLAlchemy?

Mike (M) – SQLAlchemy was at the end of a string of various database abstraction layers I’d written over the course of my career in various languages, including Java, Perl and (badly) in C. Working for web agencies in the 90’s, when there were no tools, or only very bad tools, available for these platforms, we always had to invent things.  So the repetitive parts of writing a CRUD application became apparent: those aspects of querying databases and moving their data in and out of object models that we always end up automating.

Additionally I had a very SQL-intense position in the early 2000’s at Major League Baseball where we spent lots of time writing “eager” queries and loaders, that is trying to load as much of a particular dataset in as few database round trips as possible, so the need for “eager loading” was also a core use case I learned to value.  Other use cases, such as the need to deal with the database in terms of DDL, the need to deal with SQL in terms of intricate SELECT queries with deep use of database-specific features, and the need to relate database rows to in-memory objects in a way that’s agnostic of the SQL which generated those rows, were all things I learned that we have to do all the time.

These were all problems I had spent a lot of time trying and re-trying to solve over and over again, so when I approached doing it in Python for SQLAlchemy, I had a lot of direction in mind already.  I then read Fowler’s “Patterns of Enterprise Application Architecture”, which gave me a lot more ideas for things I thought the ultimate SQL tool should have.

I wrote the Core first and then the ORM on top.   While the first releases came within a year, it took years and years of rewriting, refactoring, learning correct Python idioms and refactoring again, collecting thousands of end-user emails and issues, each of which in some small way led to incremental improvements, as well as quite often totally breaking things for my very early users in the beginning, in order to slowly build up SQLAlchemy as a deeply functional and reliable system without large gaps in capability, code or design quality.

A- What is SQLAlchemy useful for? Name some usage stats on its popularity.

M- It’s useful anytime you want to work with relational databases to the degree that the commands you are sending to your database can benefit from being programmatically automated.  SQLAlchemy is scripting and automation for databases.

The site gets about 2K unique visitors a day and according to PyPI we have 25K downloads a day, though that is a very inaccurate number; PyPI’s stats themselves record more downloads than actually occur, and a single user might be downloading SQLAlchemy a hundred times a day for a multi-server continuous integration environment, for example.   So I really don’t have an exact number of users, but it’s a lot at this point for sure.

A- Describe your career journey. What other Python packages have you created?

M- The career journey was way longer and more drawn out than it is for most people I meet today, meaning I had years and years of programming time under my belt but it still took an inordinately long time for me to be “good” at it from a formal point of view, and I still have gaps in my abilities that most people I work with don’t.

I only did a few years of computer programming in college and I didn’t graduate.

 Eventually I got into programming in the 90’s because it was a thing I could do better than anything else and due to the rising dot-com bubble in places like NYC it was a totally charged job scene that made it easy to build up a career and income.

But in the 90’s it was much harder to get guidance from better coders, at least for me, so while I was always very good at getting a problem solved and writing things that were more elaborate and complex than what a lot of other people did, I suffered from a lack of good mentors and my code was still very much that awful stuff that only remains inside of a corporate server and gets thrown away every few years anyway.   I was obsessed with improving, though.

After I left MLB I decided to get into Python and the first thing I did was port a Perl package I liked called HTML::Mason to Python, and I called it Myghty.

It was an absolutely horrible library from a code quality point of view, because I was an undisciplined Perl programmer who had never written a real unit test.

Then I started SQLAlchemy; early versions of it were equally awful. As I slowly learned Python while rewriting SQLAlchemy over and over, I wrote an all-new Myghty-like template system called Mako, so that nobody would ever have to see Myghty again, and then I published Alembic migrations and dogpile.cache.

Along with all kinds of dinky things, those are the major Python libraries I’ve put out.

A- Is it better or faster to store data within an RDBMS like MySQL and then run queries on it from Python, or is it better to import the data into, say, a Pandas-like object? What is the magnitude of the difference in speed and computation?

M- That’s a really open-ended question that depends a ton on what kind of data one is working with and what the use cases are.   I only have a small amount of experience with numpy/pandas, but it seems like if one is dealing with chunks of scientifically oriented numerical data that is fairly homogeneous in format, where different datasets are related to each other in a mathematical sense, the fluency you get from a tool like Pandas is probably much easier to work with than an RDBMS.

An RDBMS is going to be better if you are instead dealing with data that is more heterogeneous in format, with a larger number of datasets (e.g. tables) which are related to each other in a relational sense (e.g. row identity).

An RDBMS is also the appropriate choice if you need to write or update portions of the data in a transactional way.

As far as speed and computation, that’s kind of an apples-to-oranges comparison.   Pandas starts with the advantage that the data is all in memory, but then what does that imply for datasets that are bigger than typical memory sizes, or cases where the data is otherwise prohibitive to move in and out of memory quickly? Not to mention that relational databases can often get their whole dataset in memory too. But then Pandas can optimize for things like joins in a different way than SQL does, which may or may not provide better performance for some use cases.

I don’t have much experience with Pandas performance, though I did write a tool some years ago that expresses SQLAlchemy relational operations in terms of Pandas (google for CALCHIPAN); most relational operations except for extremely simple SELECTs and a specific subset of joins did not translate very well at all.

So Pandas might be super fast for the certain set of things you need to do, but for the more general case, particularly where the data spans across a relational structure, you might have fewer bottlenecks overall with regular SQL (or maybe not).

A- What makes Python a convenient language to work with data?

M- To start with, it’s a scripting language; there’s no compile step. That’s what first brought me to it – a language with strong OO that was still scripting.

The next is that it’s an incredibly consistent and transparent / non-mysterious system with a terrific syntax; from day one I loved that imported modules were just another Python object like everything else, rather than some weird ephemeral construct hoisted in by the interpreter in some mysterious way (I’m thinking of Perl’s “use” here).

It is strongly typed; none of those “conveniences” we get from something like Perl where it decided that hey, that blank string meant zero, right?

That Python is totally open source too is something we take for granted now.  I’ve worked with Matlab, which has an awful syntax, but we also had to fight all the time with license keys and license managers and whether we could embed it or not; basically, copy-protected commercial software implementing a programming language is not a thing that has any place in the world anymore.

I’ve not seen any language besides Python that is scripting, has very good OO as well as a little bit (but not too much) of functional paradigms mixed in, has strong typing, and a huge emphasis on readability and importantly learnability. I’ve never been that interested in learning to write genius-level cleverness in something like Haskell that nobody understands.

If you’re writing code that nobody understands, be very wary – it might be because you’re just so brilliant, or because your code totally sucks, noting that these two things often overlap heavily.

A- What are the key things that a Python package developer should keep in mind?

M-

Please try to follow as many common conventions as possible.

Use the distutils/setuptools system, have a setup.py file.

Write your docs using Sphinx and publish them on readthedocs.

Make sure you’ve read PEP 8 and are following most or all of it (and if you’re not, rewrite your code ASAP to do so, don’t wait).

Make sure your code runs on Python 2.7 and Python 3.3+ without any translation steps.

Make sure you have a test suite, make sure it runs simply and quickly and is documented for other people to use, and try to get it on continuous integration somewhere.

Make sure you’re writing small tests that each test just one thing; and verify that a test actually tests the thing it targets by ensuring it fails when that feature is intentionally broken.

Maintain your project’s homepage, bugtracker, mailing list, etc. so that people know how to get to you, and try as hard as possible to be responsive and polite.

Always reply to people, even if it’s to say that you’re sorry you really can’t help them.   There is a significant issue with project maintainers that simply don’t reply to emails or bug reports, or just go missing entirely and leave the whole world wondering for months / years if their critical library is something we need to start forking or not.

A- What is your opinion on in-database analytics? How can we extend the principles and philosophy of SQLAlchemy to Big Data databases and tools?

M- I only had a vague notion what this term meant, but reading the Wikipedia page confirmed my notion was the right idea.   The stored procedure vs. app-side debate is a really old one that I’ve been exposed to for a long time.

Traditionally, I’m on the app side of this.  By “traditional” I mean you’re using something like SQL Server or Oracle with an app server. For this decision, life is much easier if you don’t put your business logic on the database side.  With the tools that have been around for the last several decades, the stored procedure route is difficult to travel, because it is resistant to now-essential techniques like using source control, organizing code into modules, libraries and dependencies, and using modern development paradigms such as object-oriented or functional programming.

Critically, it forces us to write much more code than when we place the business logic in the app side and emit straight SQL, because the stored procedure’s data, both incoming and outgoing, still has to be marshaled to and from our application layer, yet this is difficult to automate when dealing with a procedure that has a custom, coarse-grained form of calling signature.

Additionally, SQL abstraction tools that are used to automate the production of SQL strings don’t generally exist in the traditional stored procedure world.  Without tools to automate anything, we get the worst of both worlds; we have to write all our SQL by hand on the database side using a typically arcane language like Transact-SQL or PL/SQL, *and* we have to write all the data-marshaling code totally custom to our stored procedures on the app side.

Instead, using modern tools on the app side like SQLAlchemy, we can express data moving between an object model and relational database tables in a very succinct and declarative way, without losing any of our SQL fluency for those parts where it’s needed.

Non-traditionally, I think the concept of software embedded in the database could be amazing – note I don’t even want to call it “stored procedures” because already, that implies “procedural development”, which is a dev model that reached its pinnacle with Fortran.

A database like PostgreSQL allows Python to run within the database process itself, which means that I could probably get SQLAlchemy itself to run within PostgreSQL.   While I don’t have any time to work on it, I do have a notion of a system where a tool like SQLAlchemy could actually run on both the database side and the app side simultaneously, to produce a Python ORM that actually invokes some portion of its logic on the server.

I would imagine this is already the kind of thing a system like Datomic or Vertica is doing, but I’ve not seen this kind of thing outside of the commercial / JVM-oriented space.

ABOUT

Mike Bayer is the creator of many open source programming libraries for the Python Programming Language, including SQLAlchemy, Alembic Migrations, Mako Templates for Python, and Dogpile Caching.

He blogs at http://techspot.zzzeek.org/

SQLAlchemy is an open source SQL toolkit and object-relational mapper (ORM) for the Python programming language released under the MIT License. It gives application developers the full power and flexibility of SQL.

Star Wars awakens

Star Wars Episode 7 woke up a lot of things in me. Memories of a simpler time, when a good story was a good story and the CGI effects were just additional effects. When ensemble casts were good, down to the actors speaking even the slightest word. When you could actually feel and sigh and feel sad and happy together even though you were strangers sitting in a dark movie hall.

Of course it awakens the Force, and of course the Box Office is on Fire.

While going out to the toilet in the intermission, I almost thought of using the Force to guide, you know. I bet no one reads this shit.

George Lucas has thankfully given the reins to JJ Abrams. JJ Abrams now has the keys to the entire geek kingdom by directing BOTH Star Trek and Star Wars. The new Jedis are good, and the new Rebels are cool. The bad guys, well, there is no Darth, just a Professor Snape lookalike. The new robots are good. Where the hell is Luke, we wonder? The ending left us strangely satisfied and impatient at the same time.

Two more years to watch the next Star Wars movie. Damn. I couldn’t wait to get out and use the Force to write this review and chat up my buddies.


Interview Damien Farrell Python GUI DataExplore #python #rstats #pydata

Here is an interview with Dr Damien Farrell, creator of an interesting Python GUI with some data science flavors called DataExplore.  Of course R has many data analysis GUIs like R Commander, Deducer, and Rattle, which we have all featured on this site before. Hopefully there can be cross-pollination of ideas on GUI design for data science in the Python/pydata community.

A- What solution does DataExplore provide to data scientists?

D- It’s not really meant for data scientists specifically. It is targeted towards scientists and students who want to do some analysis but cannot yet code. RStudio is the closest comparison. That’s a very good tool and much more comprehensive, but it still requires that you know the R language, so there is a bit of a learning curve. I was looking to make something that allows you to manipulate data usefully but with minimal coding knowledge. You could see this as an intermediate between a spreadsheet and using something like RStudio or R Commander. Ultimately there is no replacement for being able to write your own code, but this could serve as a kind of gateway to introduce the concepts involved. It is also a good way to quickly explore and plot your data and could be seen as complementary to other tools.
A- What were your motivations for making pandastable/DataExplore?

D- Non-computational scientists are sometimes very daunted by the prospect of data analysis. People who work as wet-lab scientists in particular often do not see themselves as capable of substantial analysis, even though they are well able to do it. Nowadays they are presented with a lot of sometimes heterogeneous data and it is intimidating if you cannot code. Obviously advanced analysis requires programming skills that take time to learn, but there is no reason that some comprehensive analysis can’t be done using the right tools. Data ‘munging’ is one skill that is not easily accessible to the non-programmer, and that must be frustrating. Traditionally the focus is on either using a spreadsheet, which can be very limited, or plotting with commercial tools like Prism. More difficult tasks are passed on to the specialists. So my motivation is to provide something that bridges the data manipulation and plotting steps and allows data to be handled more confidently by a ‘non-data analyst’.

A- What got you into data science and Python development? Describe your career journey so far.

D- I currently work as a postdoctoral researcher in bovine and pathogen genomics, though I am not a biologist. I came from outside the field, from a computer science and physics background. When I got the chance to do a PhD in a research group doing structural biology I took the opportunity and stayed in biology. I only started using Python about 7 years ago and use it for nearly everything. I suppose I do what is now called bioinformatics, but the term doesn’t tell you very much in my opinion. In any case I find myself doing a lot of general data analysis.

Early on I developed end-user tools in Python but they weren’t that successful, since it’s so hard to create a user base in a niche area. I thought I would try something more general this time. I started using Pandas a few years ago and find it pretty indispensable now. Since the pydata stack is quite mature and has a large user community, I thought using these libraries as a front-end to a desktop application would be an interesting project.
A- What is your roadmap or plans for the future of pandastable?

D- pandastable is the name of the library, because it’s a widget for Tkinter that provides a graphical view of a pandas DataFrame. DataExplore is then the desktop application based around that. This is a work in progress and really a side project. Hopefully there will be some uptake and then it’s up to users to decide what they want out of it. You can only go so far in guessing what people might find useful or even easy to use. There is a plugin system which makes it easy to add arbitrary functionality if you know Python, so that could be one avenue of development. I implemented this tool in the rather old Tkinter GUI toolkit and, whilst quite functional, it has certain limitations. So updating to use Qt5 might be an option. Although the fashion is for web applications, I think there is still plenty of scope for desktop tools.

A- How can we teach data science to more people in an easier way, to reduce the demand-supply gap for data scientists?

D- I can’t speak about business, but in science teaching has certainly lagged behind the technology. I don’t know about other fields, but in molecular biology we are now producing huge amounts of data because something like sequencing has developed so rapidly. This is hard to avoid in research. Probably the concepts need to be introduced early on at undergraduate level so that PhD students don’t come to data analysis cold. In biological sciences I think postgraduate programs are slowly adapting to allow training in both wet and dry lab disciplines.


About

Dr. Damien Farrell is a postdoctoral fellow at the School of Veterinary Medicine, University College Dublin, Ireland. The download page for the DataExplore app is: http://dmnfarrell.github.io/pandastable/


How does cryptography work?

by Jeroen Ooms

https://cran.r-project.org/web/packages/sodium/vignettes/crypto101.html

This page attempts to give a very basic conceptual introduction to cryptographic methods. Before we start, the usual disclaimer:

I am not a cryptographer. This document is only for educational purposes. Crypto is hard, you should never trust your home-grown implementation. Unless you’re a cryptographer you will probably overlook some crucial details. Developers should only use the high-level functions that have been implemented by an actual cryptographer.

Now that we got this out of the way, let’s start hacking 🙂

The XOR operator

The logical XOR operator outputs true only when the two inputs differ (one is true, the other is false). It is sometimes called an inverter because the output of x gets inverted if and only if y is true:

# XOR two (8bit) bytes 'x' and 'y'
x <- as.raw(0x7a)
y <- as.raw(0xe4)
z <- base::xor(x, y)
dput(z)
as.raw(0x9e)
# Show the bits in each byte
cbind(x = rawToBits(x), y = rawToBits(y), z = rawToBits(z))
      x  y  z
[1,] 00 00 00
[2,] 01 00 01
[3,] 00 01 01
[4,] 01 00 01
[5,] 01 00 01
[6,] 01 01 00
[7,] 01 01 00
[8,] 00 01 01

In cryptography we xor a message x with secret random data y. Because each bit in y is randomly true with probability 0.5, the xor output is completely random and uncorrelated to x. This is called perfect secrecy. Only if we know y can we decipher the message x.

# Encrypt message using random one-time-pad
msg <- charToRaw("TTIP is evil")
one_time_pad <- random(length(msg))
ciphertext <- base::xor(msg, one_time_pad)

# It's really encrypted
rawToChar(ciphertext)
[1] "(8\xd7ȉ%\u035f\x81\xbb\023\xa2"
# Decrypt with same pad
rawToChar(base::xor(ciphertext, one_time_pad))
[1] "TTIP is evil"

This method is perfectly secure and forms the basis for most cryptographic methods. However the challenge is generating and communicating unique pseudo-random y data every time we want to encrypt something. One-time pads as in the example are not very practical for large messages. Also, we should never re-use a one-time pad y for encrypting multiple messages, as this compromises the secrecy.
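To see why re-use is fatal: xor-ing two ciphertexts that were encrypted with the same pad cancels the pad out and leaks the xor of the two plaintexts. A minimal sketch (the two messages here are made up and chosen to have equal length for the illustration):

# Re-using a pad: xor of the ciphertexts equals xor of the messages,
# so information about both plaintexts leaks out
m1 <- charToRaw("attack at dawn")
m2 <- charToRaw("retreat now!!!")
pad <- random(length(m1))
c1 <- base::xor(m1, pad)
c2 <- base::xor(m2, pad)
identical(base::xor(c1, c2), base::xor(m1, m2))
[1] TRUE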

Stream ciphers

A stream cipher generates a unique stream of pseudo-random data based on a secret key and a unique nonce. For a given set of parameters the stream cipher always generates the same stream of data. Sodium implements a few popular stream ciphers:

password <- "My secret passphrase"
key <- hash(charToRaw(password))
nonce <- random(8)
chacha20(size = 20, key, nonce)
 [1] 51 c6 c9 45 c6 13 6b 3d 6f 5c e3 ab 9f 16 f2 46 ce cb 19 f3

Each stream requires a key and a nonce. The key forms the shared secret and should only be known to trusted parties. The nonce is not secret and is stored or sent along with the ciphertext. The purpose of the nonce is to make a random stream unique, to protect against re-use attacks. This way you can re-use your key to encrypt multiple messages, as long as you never re-use the same nonce.

salsa20(size = 20, key, nonce)
 [1] df 7d 13 ca ea 7c ff 93 e5 b6 fe b6 6b e2 91 14 ed ae 17 eb
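A quick check of this property, reusing the key and nonce from above: the same key and nonce always reproduce the identical stream, while a fresh nonce yields an unrelated stream under the same key.

# Same key + nonce reproduces the stream; a new nonce does not
identical(chacha20(20, key, nonce), chacha20(20, key, nonce))
[1] TRUE
identical(chacha20(20, key, nonce), chacha20(20, key, random(8)))
[1] FALSE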

Over the years cryptographers have come up with many more variants. Many stream ciphers are based on a block cipher such as AES: a keyed permutation of a fixed-length block of data. The block ciphers get chained in a particular mode of operation, which repeatedly applies the cipher’s single-block operation to securely transform amounts of data larger than a block.

We are not going to discuss implementation details, but you could probably come up with something yourself. For example, you could use a hash function such as sha256 as the block cipher and append a counter which is incremented for each block (this is called CTR mode):

# Illustrative example: build a keystream by hashing
# key + nonce + counter for each 32-byte block (CTR mode)
sha256_ctr <- function(size, key, nonce){
  n <- ceiling(size/32)
  output <- raw()
  for(i in 1:n){
    counter <- packBits(intToBits(i))       # encode block index as raw bytes
    block <- sha256(c(key, nonce, counter)) # one 32-byte keystream block
    output <- c(output, block)
  }
  return(output[1:size])                    # truncate to the requested size
}

This allows us to generate an arbitrary length stream from a single secret key:

password <- "My secret passphrase"
key <- hash(charToRaw(password))
nonce <- random(8)
sha256_ctr(50, key, nonce)
 [1] 07 01 96 02 7e c7 37 b4 8c b1 6a ec 4e 2d 56 34 7d 39 13 bc 72 e0 19
[24] ad b3 44 0e 9f 88 bb 3d 26 94 aa 66 01 2e bd 46 55 2c 04 99 1e af a9
[47] 91 cd 53 b4

In practice, you should never write your own ciphers. A lot of research goes into studying the properties of block ciphers under various modes of operation. In the remainder we just use the standard Sodium ciphers: chacha20, salsa20, xsalsa20 or aes128. See the sodium documentation for details.

Symmetric encryption

Symmetric encryption means that the same secret key is used for both encryption and decryption. All that is needed to implement symmetric encryption is xor and a stream cipher. For example, to encrypt an arbitrary-length message using a password:

# Encrypt 'message' using 'password'
myfile <- file.path(R.home(), "COPYING")
message <- readBin(myfile, raw(), file.info(myfile)$size)
passwd <- charToRaw("My secret passphrase")

A hash function converts the password to a key of suitable size for the stream cipher, which we use to generate a pseudo-random stream of equal length to the message:

# Basic secret key encryption
key <- hash(passwd)
nonce8 <- random(8)
stream <- chacha20(length(message), key, nonce8)
ciphertext <- base::xor(stream, message)

Now the ciphertext is an encrypted version of the message. Only those that know the key and the nonce can re-generate the same keystream in order to xor the ciphertext back into the original message.

# Decrypt with the same key
key <- hash(charToRaw("My secret passphrase"))
stream <- chacha20(length(ciphertext), key, nonce8)
out <- base::xor(ciphertext, stream)

# Print part of the message
cat(substring(rawToChar(out), 1, 120))
            GNU GENERAL PUBLIC LICENSE
               Version 2, June 1991

 Copyright (C) 1989, 1991 Free Software Foundation, Inc.

The Sodium functions data_encrypt and data_decrypt provide a more elaborate implementation of the above. This is what you should use in practice for secret key encryption.
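As a sketch of what that round trip looks like (assuming, per my reading of the sodium docs, that data_encrypt generates a nonce internally and attaches it to the ciphertext as an attribute, which data_decrypt then picks up by default):

# Sketch: round trip with the high-level helpers (assumes the nonce
# is generated internally and carried on the ciphertext object)
key <- hash(charToRaw("My secret passphrase"))
ciphertext <- data_encrypt(message, key)
out <- data_decrypt(ciphertext, key)
identical(message, out)
[1] TRUE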

Symmetric encryption can be used for e.g. encrypting local data. However, because the same secret is used for both encryption and decryption, it is impractical for communication with other parties. For exchanging secure messages we need public-key encryption.

Public-key encryption and Diffie-Hellman

Rather than using a single secret key, asymmetric (public-key) encryption requires a keypair, consisting of a public key for encryption and a private key for decryption. Data that is encrypted using a given public key can only be decrypted using the corresponding private key.

The public key is not confidential and can be shared on e.g. a website or keyserver. This allows anyone to send somebody a secure message by encrypting it with the receiver’s public key. The encrypted message will only be readable by the owner of the corresponding private key.

# Create keypair
key <- keygen()
pub <- pubkey(key)

# Encrypt message for receiver using his/her public key
msg <- serialize(iris, NULL)
ciphertext <- simple_encrypt(msg, pub)

# Receiver decrypts with his/her private key
out <- simple_decrypt(ciphertext, key)
identical(msg, out)
[1] TRUE

How does this work? Public-key encryption makes use of Diffie-Hellman (D-H): a method which allows two parties that have no prior knowledge of each other to jointly establish a shared secret key over an insecure channel. In the simplest case, both parties generate a temporary keypair and exchange their public keys over the insecure channel. Then both parties use the D-H function to calculate the (same) shared secret key by combining their own private key with the other person’s public key:

# Bob generates keypair
bob_key <- keygen()
bob_pubkey <- pubkey(bob_key)

# Alice generates keypair
alice_key <- keygen()
alice_pubkey <- pubkey(alice_key)

# After Bob and Alice exchange pubkey they can both derive the secret
alice_secret <- diffie_hellman(alice_key, bob_pubkey)
bob_secret <- diffie_hellman(bob_key, alice_pubkey)
identical(alice_secret, bob_secret)
[1] TRUE

Once the shared secret has been established, both parties can discard their temporary public/private key and use the shared secret to start encrypting communications with symmetric encryption as discussed earlier. Because the shared secret cannot be calculated using only the public keys, the process is safe from eavesdroppers.
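To make that hand-off concrete, here is a small sketch reusing the functions shown earlier (the message text is made up): Alice derives a symmetric key from the shared secret and encrypts; Bob derives the same key from his copy of the secret and decrypts.

# Alice: derive a key from the shared secret, encrypt with chacha20 + xor
key <- hash(alice_secret)
nonce8 <- random(8)
msg <- charToRaw("Hi Bob, this came over the insecure channel!")
ciphertext <- base::xor(chacha20(length(msg), key, nonce8), msg)

# Bob: derive the same key from his copy of the secret, decrypt
out <- base::xor(chacha20(length(ciphertext), hash(bob_secret), nonce8), ciphertext)
rawToChar(out)
[1] "Hi Bob, this came over the insecure channel!"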

The classical Diffie-Hellman method is based on the discrete logarithm problem with large prime numbers. Sodium uses curve25519, a state-of-the-art D-H function by Daniel Bernstein, designed for use with the elliptic-curve Diffie-Hellman (ECDH) key agreement scheme.


(Ajay- I really liked this very nice tutorial on cryptography and hope it helps bring more people into the debate. This is just to share this very excellent vignette based on the Sodium package in R.)