Interview Mike Bayer SQLAlchemy #pydata #python

Here is an interview with Mike Bayer, the creator of popular Python package SQLAlchemy.

Ajay (A)-How and why did you create SQLAlchemy?

Mike (M) – SQLAlchemy was at the end of a string of various database abstraction layers I’d written over the course of my career in various languages, including Java, Perl and (badly) in C. Working for web agencies in the 90’s when there were no tools, or only very bad tools, available for these platforms, we always had to invent things.  So the parts of repetition in writing a CRUD application, e.g. those aspects of querying databases and moving their data in and out of object models which we always end up automating, became apparent.

Additionally I had a very SQL-intense position in the early 2000’s at Major League Baseball where we spent lots of time writing “eager” queries and loaders, that is trying to load as much of a particular dataset in as few database round trips as possible, so the need for “eager loading” was also a core use case I learned to value.  Other use cases, such as the need to deal with the database in terms of DDL, the need to deal with SQL in terms of intricate SELECT queries with deep use of database-specific features, and the need to relate database rows to in-memory objects in a way that’s agnostic of the SQL which generated those rows, were all things I learned that we have to do all the time.

These were all problems I had spent a lot of time trying and re-trying to solve over and over again, so when I approached doing it in Python for SQLAlchemy, I had a lot of direction in mind already.  I then read Fowler’s “Patterns of Enterprise Application Architecture”, which gave me a lot more ideas for things I thought the ultimate SQL tool should have.

I wrote the Core first and then the ORM on top. While the first releases were within a year, it took years and years of rewriting, refactoring, learning correct Python idioms and refactoring again for each one, collecting thousands of end-user emails and issues each of which in some small way led to incremental improvements, as well as totally breaking things for my very early users quite often in the beginning, in order to slowly build up SQLAlchemy as a deeply functional and reliable system without large gaps in capability, code or design quality.

A- What is SQLAlchemy useful for? Name some usage stats on its popularity.

M- It’s useful anytime you want to work with relational databases to the degree that the commands you are sending to your database can benefit from being programmatically automated.  SQLAlchemy is scripting and automation for databases.

The site gets about 2K unique visitors a day and according to PyPI we have 25K downloads a day, though that is a very inaccurate number; PyPI’s stats themselves record more downloads than actually occur, and a single user might be downloading SQLAlchemy a hundred times a day for a multi-server continuous integration environment, for example. So I really don’t have any number of users, but it’s a lot at this point for sure.

A- Describe your career journey. What other Python packages have you created?

M- The career journey was way longer and more drawn out than it is for most people I meet today, meaning I had years and years of programming time under my belt but it still took an inordinately long time for me to be “good” at it from a formal point of view, and I still have gaps in my abilities that most people I work with don’t.

I only did a few years of computer programming in college and I didn’t graduate.

 Eventually I got into programming in the 90’s because it was a thing I could do better than anything else and due to the rising dot-com bubble in places like NYC it was a totally charged job scene that made it easy to build up a career and income.

But in the 90’s it was much harder to get guidance from better coders, at least for me, so while I was always very good at getting a problem solved and writing things that were more elaborate and complex than what a lot of other people did, I suffered from a lack of good mentors and my code was still very much that awful stuff that only remains inside of a corporate server and gets thrown away every few years anyway.   I was obsessed with improving, though.

After I left MLB I decided to get into Python and the first thing I did was port a Perl package I liked called HTML::Mason to Python, and I called it Myghty.

It was an absolutely horrible library from a code quality point of view, because I was an undisciplined Perl programmer who had never written a real unit test.

Then I started SQLAlchemy; early versions of it were equally awful. Then, as I slowly learned Python while rewriting SQLA over and over, I wrote an all-new Myghty-like template system called Mako, so that nobody would ever have to see Myghty again, and then I published Alembic migrations and dogpile.cache.

Along with all kinds of dinky things those are the major Python libraries I’ve put out.

A- Is it better or faster to store data within an RDBMS like MySQL and then run queries on it from Python, or is it better to import the data, say, into a Pandas-like object? What is the magnitude of the difference in speed and computation?

M- That’s a really open-ended question that depends a ton on what kind of data one is working with and what kinds of use cases one has. I only have a small amount of experience with numpy/pandas, but it seems like if one is dealing with chunks of scientifically oriented numerical data that is fairly homogeneous in format, where different datasets are related to each other in a mathematical sense, the fluency you get from a tool like Pandas is probably much easier to work with than an RDBMS.

An RDBMS is going to be better if you are instead dealing with data that is more heterogeneous in format, with a larger number of datasets (e.g. tables) which are related to each other in a relational sense (e.g. row identity).

RDBMS is also the appropriate choice if you need to write or update portions of the data in a transactional way.

As far as speed and computation, that’s kind of an apples-to-oranges comparison. Pandas starts with the advantage that the data is all in memory, but then what does that imply for datasets that are bigger than typical memory sizes, or in cases where the data is otherwise prohibitive to move in and out of memory quickly? Not to mention that relational databases can often get their whole dataset into memory too. But then Pandas can optimize for things like joins in a different way than SQL does, which may or may not provide better performance for some use cases.
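
To make the comparison concrete, here is a minimal sketch of the same join done both ways; the database, table and column names are hypothetical.

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Hypothetical SQLite database containing "orders" and "customers" tables.
engine = create_engine("sqlite:///example.db")

# Database-side join: the RDBMS plans and executes the join, and only the
# result rows travel back into Python.
with engine.connect() as conn:
    rows = conn.execute(text(
        "SELECT o.id, o.total, c.name "
        "FROM orders o JOIN customers c ON c.id = o.customer_id"
    )).fetchall()

# Pandas-side join: both tables are pulled into memory first, then merged.
orders = pd.read_sql_table("orders", engine)
customers = pd.read_sql_table("customers", engine)
merged = orders.merge(customers, left_on="customer_id", right_on="id",
                      suffixes=("_order", "_customer"))
```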

I don’t have much experience with Pandas performance, though I did write a tool some years ago that expresses SQLAlchemy relational operations in terms of Pandas (google for CALCHIPAN); most relational operations except for extremely simple SELECTs and a specific subset of joins did not translate very well at all.

So Pandas might be super fast for the certain set of things you need to do, but for the more general case, particularly where the data spans across a relational structure, you might have fewer bottlenecks overall with regular SQL (or maybe not).

A- What makes Python a convenient language to work with data?

M- To start with, it’s a scripting language; there’s no compile step. That’s what first brought me to it – a language with strong OO that was still scripting.

The next is that it’s an incredibly consistent and transparent / non-mysterious system with a terrific syntax; from day one I loved that imported modules were just another Python object like everything else, rather than some weird ephemeral construct hoisted in by the interpreter in some mysterious way (I’m thinking of Perl’s “use” here).

It is strongly typed; none of those “conveniences” we get from something like Perl where it decided that hey, that blank string meant zero, right?
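
A tiny illustration of that point:

```python
# Python refuses to guess what a blank string "means" as a number, where
# Perl would silently coerce it to 0.
try:
    total = "" + 1
except TypeError as exc:
    print("Python raises:", exc)

# The conversion has to be explicit instead:
total = int("" or 0) + 1
print(total)  # 1
```
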
That Python is totally open source too is something we take for granted now. I’ve worked with Matlab, which has an awful syntax, but we also had to fight all the time with license keys and license managers and whether we could embed it or not; basically, copy-protected commercial software implementing a programming language is not a thing that has any place in the world anymore.

I’ve not seen any language besides Python that is scripting, has very good OO as well as a little bit (but not too much) of functional paradigms mixed in, has strong typing, and a huge emphasis on readability and importantly learnability. I’ve never been that interested in learning to write genius-level cleverness in something like Haskell that nobody understands.

If you’re writing code that nobody understands, be very wary – it might be because you’re just so brilliant, or because your code totally sucks, noting that these two things often overlap heavily.

A- What are the key things that a Python package developer should keep in mind ?

M-

Please try to follow as many common conventions as possible.

Use the distutils/setuptools system, have a setup.py file.
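
For illustration, a minimal setup.py sketch along those lines; the package name, version and metadata below are placeholders.

```python
from setuptools import setup, find_packages

setup(
    name="mypackage",                # placeholder name
    version="0.1.0",
    description="One-line description of what the package does",
    author="Your Name",
    url="https://example.com/mypackage",
    packages=find_packages(exclude=["tests"]),
    install_requires=[],             # runtime dependencies go here
    classifiers=[
        "Programming Language :: Python :: 2.7",
        "Programming Language :: Python :: 3",
    ],
)
```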

Write your docs using Sphinx and publish them on readthedocs.

Make sure you’ve read pep8 and are following most or all of it (and if you’re not, rewrite your code ASAP to do so, don’t wait).

Make sure your code runs on Python 2.7 and Python 3.3+ without any translation steps.

Make sure you have a test suite, make sure it runs simply and quickly and is documented for other people to use, and try to get it on continuous integration somewhere.

Make sure you’re writing small tests that each test just one thing; and verify that a test actually tests the thing it targets by ensuring it fails when that feature is intentionally broken.
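
As a sketch of that idea with the standard library’s unittest (the slugify function under test is made up for illustration):

```python
import unittest

def slugify(title):
    # Hypothetical function under test.
    return title.strip().lower().replace(" ", "-")

class SlugifyTests(unittest.TestCase):
    # Each test exercises exactly one behaviour, so deliberately breaking
    # that behaviour in slugify() should make exactly that test fail.
    def test_spaces_become_dashes(self):
        self.assertEqual(slugify("Hello World"), "hello-world")

    def test_surrounding_whitespace_is_stripped(self):
        self.assertEqual(slugify("  Hello  "), "hello")

if __name__ == "__main__":
    unittest.main()
```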

Maintain your project’s homepage, bugtracker, mailing list, etc. so that people know how to get to you, and try as hard as possible to be responsive and polite.

Always reply to people, even if it’s to say that you’re sorry you really can’t help them.   There is a significant issue with project maintainers that simply don’t reply to emails or bug reports, or just go missing entirely and leave the whole world wondering for months / years if their critical library is something we need to start forking or not.

A- What is your opinion on in-database analytics? How can we extend the principles and philosophy of SQLAlchemy to Big Data databases and tools?

M- I only had a vague notion what this term meant, but reading the Wikipedia page confirmed my notion was the right idea.   The stored procedure vs. app-side debate is a really old one that I’ve been exposed to for a long time.

Traditionally, I’m on the app-side of this.  By “traditional” I mean you’re using something like a SQL Server or Oracle with an app server. For this decision, life is much easier if you don’t put your business logic on the database side.  With the tools that have been around for the last several decades, the stored procedure route is difficult to travel in, because it is resistant to now-essential techniques like that of using source control, organizing code into modules, libraries and dependencies, and using modern development paradigms such as object-oriented or functional programming.

Critically, it forces us to write much more code than when we place the business logic in the app side and emit straight SQL, because the stored procedure’s data, both incoming and outgoing, still has to be marshaled to and from our application layer, yet this is difficult to automate when dealing with a procedure that has a custom, coarse-grained form of calling signature.

Additionally, SQL abstraction tools that are used to automate the production of SQL strings don’t generally exist in the traditional stored procedure world.  Without tools to automate anything, we get the worst of both worlds; we have to write all our SQL by hand on the database side using a typically arcane language like Transact-SQL or PL/SQL, *and* we have to write all the data-marshaling code totally custom to our stored procedures on the app side.

Instead, using modern tools on the app side like SQLAlchemy, we can express data moving between an object model and relational database tables in a very succinct and declarative way, without losing any of our SQL fluency for those parts where it’s needed.
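
As a minimal sketch of that declarative style (the model and column names here are illustrative, not taken from any particular application):

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship, sessionmaker

Base = declarative_base()

class Customer(Base):
    __tablename__ = "customers"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    orders = relationship("Order", back_populates="customer")

class Order(Base):
    __tablename__ = "orders"
    id = Column(Integer, primary_key=True)
    total = Column(Integer)
    customer_id = Column(Integer, ForeignKey("customers.id"))
    customer = relationship("Customer", back_populates="orders")

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# The mapping, not hand-written marshaling code, moves data between
# objects and rows.
session.add(Customer(name="Acme", orders=[Order(total=100)]))
session.commit()

for customer, order in session.query(Customer, Order).join(Order):
    print(customer.name, order.total)
```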

Non-traditionally, I think the concept of software embedded in the database could be amazing – note I don’t even want to call it “stored procedures” because that already implies “procedural development”, which is a dev model that reached its pinnacle with Fortran.

A database like PostgreSQL allows Python to run within the database process itself, which means that I could probably get SQLAlchemy itself to run within PostgreSQL.   While I don’t have any time to work on it, I do have a notion of a system where a tool like SQLAlchemy could actually run on both the database side and the app side simultaneously, to produce a Python ORM that actually invokes some portion of its logic on the server.

I would imagine this is already the kind of thing a system like Datomic or Vertica is doing, but I’ve not seen this kind of thing outside of the commercial / JVM-oriented space.

ABOUT

Mike Bayer is the creator of many open source programming libraries for the Python programming language, including SQLAlchemy, Alembic Migrations, Mako Templates for Python, and Dogpile Caching.

He blogs at http://techspot.zzzeek.org/

SQLAlchemy is an open source SQL toolkit and object-relational mapper (ORM) for the Python programming language released under the MIT License. It gives application developers the full power and flexibility of SQL.

Interview Damien Farrell Python GUI DataExplore #python #rstats #pydata

Here is an interview with Dr. Damien Farrell, creator of an interesting Python GUI with some data science flavors called DataExplore. Of course R has many data analysis GUIs like R Commander, Deducer and Rattle, which we have all featured on this site before. Hopefully there can be cross-pollination of ideas on GUI design for data science in the Python/pydata community.

A- What solution does DataExplore provide to data scientists?

D- It’s not really meant for data scientists specifically. It is targeted towards scientists and students who want to do some analysis but cannot yet code. RStudio is the closest comparison. That’s a very good tool and much more comprehensive, but it still requires that you know the R language, so there is a bit of a learning curve. I was looking to make something that allows you to manipulate data usefully but with minimal coding knowledge. You could see this as an intermediate between a spreadsheet and using something like RStudio or R Commander. Ultimately there is no replacement for being able to write your own code, but this could serve as a kind of gateway to introduce the concepts involved. It is also a good way to quickly explore and plot your data and could be seen as complementary to other tools.
A- What were your motivations for making pandastable/DataExplore?
D- Non-computational scientists are sometimes very daunted by the prospect of data analysis. People who work as wet lab scientists in particular often do not see themselves capable of substantial analysis even though they are well able to do it. Nowadays they are presented with a lot of sometimes heterogeneous data and it is intimidating if you cannot code. Obviously advanced analysis requires programming skills that take time to learn but there is no reason that some comprehensive analysis can’t be done using the right tools. Data ‘munging’ is one skill that is not easily accessible to the non programmer and that must be frustrating. Traditionally the focus is on either using a spreadsheet which can be very limited or plotting with commercial tools like prism. More difficult tasks are passed on to the specialists. So my motivation is to provide something that bridges the data manipulation and plotting steps and allows data to be handled more confidently by a ‘non-data analyst’.
A- What got you into data science and python development. Describe your career journey so far
D- I currently work as a postdoctoral researcher in bovine and pathogen genomics though I am not a biologist. I came from outside the field from a computer science and physics background. When I got the chance to do a PhD in a research group doing structural biology I took the opportunity and stayed in biology. I only started using Python about 7 years ago and use it for nearly everything. I suppose I do what  is now called bioinformatics but the term doesn’t tell you very much in my opinion. In any case I find myself doing a lot of general data analysis.
Early on I developed end user tools in Python but they weren’t that successful since it’s so hard to create a user base in a niche area. I thought I would try something more general this time. I started using Pandas a few years ago and find it pretty indispensable now. Since the pydata stack is quite mature and has a large user community I thought using these libraries as a front-end to a desktop application would be an interesting project.
A- What is your roadmap or plans for the future of pandastable?
D- pandastable is the name of the library because it’s a widget for Tkinter that provides a graphical view for a pandas dataframe. DataExplore is then the desktop application based around that. This is a work in progress and really a side project. Hopefully there will be some uptake and then it’s up to users to decide what they want out of it. You can only go so far in guessing what people might find useful or even easy to use. There is a plugin system which makes it easy to add arbitrary functionality if you know Python, so that could be one avenue of development. I implemented this tool in the rather old Tkinter GUI toolkit and whilst quite functional it has certain limitations. So updating to use Qt5 might be an option. Although the fashion is for web applications I think there is still plenty of scope for desktop tools.
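
Embedding the widget looks roughly like this (a sketch based on the pandastable documentation; constructor arguments may differ between versions):

```python
import tkinter as tk
import pandas as pd
from pandastable import Table

# A toy DataFrame to display; any pandas DataFrame would do.
df = pd.DataFrame({"gene": ["a", "b", "c"], "expression": [1.2, 3.4, 0.7]})

root = tk.Tk()
frame = tk.Frame(root)
frame.pack(fill="both", expand=True)

# Table is the Tkinter widget that renders the DataFrame.
table = Table(frame, dataframe=df, showtoolbar=True, showstatusbar=True)
table.show()

root.mainloop()
```
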
A- How can we teach data science to more people in an easier way, to reduce the demand-supply gap for data scientists?
D- I can’t speak about business, but in science teaching has certainly lagged behind the technology. I don’t know about other fields, but in molecular biology we are now producing huge amounts of data because something like sequencing has developed so rapidly. This is hard to avoid in research. Probably the concepts need to be introduced early on at undergraduate level so that PhD students don’t come to data analysis cold. In biological sciences I think postgraduate programs are slowly adapting to allow training in wet and dry lab disciplines.

 

About

Dr. Damien Farrell is a postdoctoral fellow at the School of Veterinary Medicine, University College Dublin, Ireland. The download page for the DataExplore app is: http://dmnfarrell.github.io/pandastable/


Interview Chris Kiehl Gooey #Python making GUIs in Python

Here is an interview with Chris Kiehl, developer of the Python package Gooey. Gooey promises to turn (almost) any Python console program into a GUI application with one line.


Ajay (A) What was your motivation for making Gooey?  

Chris (C)- Gooey came about after getting frustrated with the impedance mismatch between how I like to write and interact with software as a developer, and how the rest of the world interacts with software as consumers. As much as I love my glorious command line, delivering an application that first requires me to explain what a CLI even is feels a little embarrassing. Gooey was my solution to this. It let me build as complex a program as I wanted, all while using a familiar tool chain, and with none of the complexity that comes with traditional desktop application development. When it was time to ship, I’d attach the Gooey decorator and get the UI side for free.
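
In practice that looks roughly like the sketch below; the script and its arguments are made up for illustration.

```python
import argparse
from gooey import Gooey

@Gooey  # attaching the decorator to the CLI entry point produces the GUI
def main():
    parser = argparse.ArgumentParser(description="Resize some images")
    parser.add_argument("input_dir", help="Directory of images to process")
    parser.add_argument("--width", type=int, default=800,
                        help="Target width in pixels")
    args = parser.parse_args()
    print("Would resize images in", args.input_dir, "to width", args.width)

if __name__ == "__main__":
    main()
```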

A- Where can Gooey potentially be used in industry?

C- Gooey can be used anywhere you bump into a mismatch in computer literacy. One of its core strengths is opening up existing CLI tool chains to users who would otherwise be put off by the unfamiliar nature of the command line. With Gooey, you can expose something as complex as video processing with FFmpeg via a very friendly UI with almost negligible development effort.

A- What other packages have you authored or contributed to in Python or other languages?

C- My GitHub is a smorgasbord of half-completed projects. I have several tool-chain projects related to Gooey. These range from packagers, to web front ends, to example configs. However, outside of Gooey, I created pyRobot, which is a pure Python Windows automation library; Dropler, a simple HTML5 drag-and-drop plugin for CKEditor; and DoNotStarveBackup, a Scala program that backs up your Don’t Starve save file while playing (a program which I love, but others actively hate for being “cheating” (pfft..)). And, one of my favorites: Burrito-Bot. It’s a little program that played (and won!) the game Burrito Bison. This was one of the first big things I wrote when I started programming. I keep it around for time capsule, look-at-how-I-didn’t-know-what-a-for-loop-was sentimental reasons.

A- What attracted you to developing in Python? What are some of the advantages and disadvantages of the language?

C– I initially fell in love with Python for the same reasons everyone else does: it’s beautiful. It’s a language that’s simple enough to learn quickly, but has enough depth to be interesting after years of daily use.
Hands down, one of my favorite things about Python that gives it an edge over other languages is its amazing introspection. At its core, everything is a dictionary. If you poke around hard enough, you can access just about anything. This lets you do extremely interesting things with metaprogramming. In fact, this deep introspection of code is what allows Gooey to bootstrap itself when attached to your source file.
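
A small illustration of that introspection:

```python
class Job(object):
    def __init__(self, name):
        self.name = name

job = Job("crawl")

print(job.__dict__)              # {'name': 'crawl'}: instance state is a dict
print(Job.__dict__["__init__"])  # the class body is a dict of its members
print(getattr(job, "name"))      # attribute access is just a keyed lookup
```
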
Python’s disadvantages vary depending on the space in which you operate. Its concurrency limitations can be extremely frustrating. Granted, you don’t run into them too often, but when you do, it is usually for show-stopping reasons. The related side of that is its asynchronous capabilities. This has gotten better with Python 3, but it’s still pretty clunky if you compare it to the tooling available in a language like Scala.

A- How can we incentivize open source package creators the same way we do for app stores, etc.?

C- On an individual level, if I may be super positive, I’d argue that open source development is already so awesome that it almost doesn’t need to be further incentivized. People using, forking, and committing to your project is the reward. That’s not to say it is without some pains — not everyone on the internet is friendly all the time, but the pleasure of collaborating with people all over the globe on a shared interest is tough to overstate.

Interview Domino Data Lab #datascience

Here is an interview with Eduardo Ariño de la Rubia, VP of Product & Data Scientist in Residence at Domino Data Lab. Here Eduardo weighs in on issues concerning data science and his experiences.


Ajay (A) How does Domino Data Lab give a data scientist an advantage?

Eduardo (E) – Domino Data Lab’s enterprise data science platform makes data scientists more productive and helps teams collaborate better. For individual data scientists, Domino is a feature rich platform which helps them manage the analytics environment, provides scalable compute resources to run complex and multiple tasks in parallel, and makes it easy to share and productize analytic models.   For teams, the Domino platform supports substantially better collaboration by making all the work people are doing viewable and reproducible.  Domino provides a central analytics hub where all work is saved and hosted. The result is faster progress for individuals, and better results from teams.

 

A- What languages and platforms do you currently support?

E- Domino is an open platform that runs on Mac, Windows, or Linux. We’ll run any code that can be run on a Linux system. We have first-class support for R, Python, MATLAB, SAS, and Julia.
A- How does Domino compare to PythonAnywhere, Google Cloud Datalab (https://cloud.google.com/datalab/), or other hosted Python solutions?
 E- Domino was designed from the ground up to be an enterprise collaboration and data science platform.  It’s a full featured platform in use at some of the largest research organizations in the world today.
A- What is your experience of Python versus other languages in the field of data science?
E- That’s the opening salvo of a religious war, and though I should know better than to involve myself, I will try to navigate it. First and foremost, I think it’s important to note that the two “most common” open source languages used by data scientists today, Python and R, have fundamentally hit feature parity in their maturity. While it’s true that for some particular algorithm, for some poorly trod use-case, one language and environment may have an edge over the other, I believe that for the average data scientist, language comes down to choice.
That being said, my personal experience is slightly more nuanced. My background is primarily computer science and as such, having spent many years thinking about programming first and data analysis second, this has shaped the way I approach a problem. I find that if I am doing the “exploratory analysis” or “feature engineering” phase of a data science project, and I am using a language which has roots in “typical programming”, often times this will make me approach the solution of the problem less like a data scientist and more like a programmer. When I should be thinking in terms of set or vectorized operations, or about whether I’m violating some constraint, instead I’m building a data structure to make an operation O(n log n) so that I can use a for loop when I shouldn’t.
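
A small illustration of the habit being described, with made-up data:

```python
import numpy as np

values = np.random.rand(1000000)

# The programmer's reflex: iterate and accumulate.
total_loop = 0.0
for v in values:
    if v > 0.5:
        total_loop += v

# The data-analyst's reflex: a single vectorized, set-style operation.
total_vec = values[values > 0.5].sum()
```
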
This isn’t an indictment of any language, nor is it a statement that there’s a fundamental benefit to thinking one way or another about a problem. It is, however, a testament to the fact that often, when challenged, people will fall back to their most familiar skill set and begin to treat every problem as a nail to be hammered. If I had come to Python *as* a data scientist first, it is possible this nuance wouldn’t have ever surfaced; however, I learned Python before pandas, scikit-learn, and the DS revolution, so those neurons are quite trained up. However, I learned R purely as an endeavor in data science, and as such I don’t find myself falling back on “programmer’s habits” when I hit a wall in R; I take a step back and usually find a way to work around it within the idiomatic approaches.
To summarize, my experience is that language wars accomplish very little, and that most of the modern data science languages are up to the task. Just beware of the mental baggage that you bring with you on the journey.
A- What do you feel about polyglots ( multiple languages ) in data science  (like R, Python, Julia) and software like Beaker and Jupyter that enable multiple languages?
E- Data science is a polyglot endeavor. At the very least, you usually have some data manipulation language (such as SQL) and some language for your analysis (R or Python). Often you have many more languages: for the data engineering pipeline I often reach for Perl (it’s still an amazing language for the transformation of text data); sometimes I have a bit of code that must run very quickly, and I reach for C or C++; etc. I think that multiple languages are a reality. Domino supports, out of the box, fundamentally every language that will run on Linux. If your feature pipeline involves some sed/awk, we understand. If you need a bit of Rcpp, we’re right there with you. If you want to output some amazing d3.js visualizations to summarize the data, we’re happy to provide the framework for you to host them on. Real world data is messy, and being a polyglot is a natural adaptation to that reality.
About-
Eduardo Ariño de la Rubia is VP of Product & Data Scientist in Residence at Domino Data Lab
Domino makes data scientists more productive and facilitates collaborative, reproducible, reusable analysis. The platform runs on-premise or in the cloud. Its customers come from a wide range of industries, including government, insurance, advanced manufacturing, and pharmaceuticals. It is backed by Zetta Venture Partners, Bloomberg Beta, and In-Q-Tel.
You can have a look at their very interesting data science platform at Domino Data Lab

Interview ScrapingHub #python #webcrawling

Here is an interview with the team behind ScrapingHub.com. Scrapinghub was created by the same team that created Scrapy, a Python-based open source framework for extracting data from websites.


 

 Ajay (A) Describe your journey with Python and Scraping Web Pages.

Shane Evans (Director and Co-founder): I started commercial Python development around 15 years ago and loved it from the start. My first significant experience web scraping with Python was in 2007, when I was building a vertical search engine. Initially, my team started writing Python scripts, but that got problematic very quickly. I wrote a framework to make the job easier, promoting best practices for scraping and avoiding common mistakes. That framework went on to be released as Scrapy. About a year later, in an attempt to improve the efficiency of our spider development, I led the development of a visual scraping tool, which we called autoscraping. That tool was the basis of Portia, Scrapinghub’s visual scraping tool.

In addition to Python scraping frameworks, I have written many crawlers for specific websites and a lot of web crawling infrastructure, including much of our Scrapy Cloud product, nearly entirely in Python.

A- How does Scrapy compare with BeautifulSoup? What are the other technologies that you have used for web scraping?

Denis de Bernardy (Head of Marketing): Scrapy is a web scraping framework; BeautifulSoup is an HTML parser. That is, Scrapy takes care of details such as request throttling, concurrency, retrying URLs that fail, caching locally for development, detecting the correct page encoding, etc. It also offers a shell utility that comes in handy when you need to step through a crawl to debug what’s going on. You won’t necessarily run into all of these issues in your first project if you try using Requests and BeautifulSoup, but you will run into them eventually. Scrapy takes care of all this for you so you can focus on extraction logic instead.
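
To make that concrete, a minimal spider might look like the sketch below; the target site is a public practice site and the CSS selectors are illustrative.

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    # Scrapy handles scheduling, throttling, retries and encoding detection;
    # the spider only declares where to start and how to extract data.
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").extract_first(),
                "author": quote.css("small.author::text").extract_first(),
            }
        next_page = response.css("li.next a::attr(href)").extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
```

Running a file like this with "scrapy runspider quotes_spider.py -o quotes.json" yields structured records rather than raw pages.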

In some sense, Greenspun’s tenth rule comes to mind: “Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.” The same could be said about a lot of web scraping projects and how they relate to Scrapy.

Other technologies are very diverse. To name only a few:

  • Frontera. It allows multiple spiders to share the same crawl frontier – that is, the list of URLs that are slated to get crawled next. We’ve two experimental variations of it to boot: one that dispatches huge crawls across multiple servers, and another that enables you to prioritize crawls and re-crawls so you can go straight for the jugular when you’re looking for something specific – e.g. infosec-related data when you’re monitoring security exploits.
  • Splash. There’s no shortage of websites that require JavaScript for one reason or another – for instance because they’re single-page applications, or because countermeasures are in place to ensure that bots are not, well, crawling your website – it’s a cat-and-mouse type of game, really. Splash is a headless browser that supports JavaScript to work around all of these. It also allows you to take screenshots as you crawl.
  • Crawlera. It’s our smart proxy provider. It works around bot countermeasures by spreading requests across a large pool of IPs. What’s more, it throttles requests through IPs so they don’t get used too often, to avoid them getting blacklisted, and it tracks which IPs are actually banned so you don’t bother trying them to begin with. Crawlera is what allows us and our customers to crawl complex sites with robust bot countermeasures in place.
  • Portia: It’s a visual scraping tool based on Scrapy. It is an open source project that allows users to build web spiders without needing to know any Python or coding at all. The users only need to visually annotate the elements that they want to extract from a web page and then Portia generates the spider code and delivers a working spider ready to be deployed.

A- How does Python compare, say, with R (or RCurl et al.) for scraping websites?

Valdir Stumm (Developer Evangelist): If you are just scraping some simple web pages, they are not that different, as both work with networking libraries (R: RCurl; Python: Requests, urllib) and with HTML/XML processing libraries (R: rvest; Python: BeautifulSoup, lxml).

However, while simple web scraping might be easy with R, web crawling might not be. Tools like RCurl are useful to retrieve some web pages, but if you need to crawl a large collection of URLs (which may be unknown when starting), it may turn out to be difficult to do in a timely manner.

Python, in contrast, has a vast ecosystem with libraries and frameworks to do high performance asynchronous networking and multiprocessing. And, of course, there are frameworks like Scrapy that take care of most of the dirty work like scheduling URLs, handling networking responses, character sets, etc.

A- What are some of the drawbacks of using Python?

Shane Evans: Scrapy takes care of a lot of the lower-level async programming, which is required to get good performance. This code is awkward to write, hard to understand, and a nightmare to debug (say, compared to how nicely you can do it in Erlang).

Performance is an often-mentioned drawback, and by that people usually mean raw compute power compared to, say, Java. However, in practice this is less of a problem for common scraping tasks due to the reliance on libraries written in C. See my Quora answer about fast open source scrapers.

A- How is web scraping useful as a data input source even for traditional companies like banks, telecom, etc.?

Denis de Bernardy: Scraped data gets used in all sorts of useful ways. To name a few:

  • A slew of companies use data to monitor product information. Think price comparison apps for consumers, spotting price changes by your competitors so you can adjust your own prices faster, monitoring that your resellers are not offering your product for a lower price than agreed, sentiment analysis across end-user forums, and so forth.
  • HR departments and companies are more interested in profiles. There’s a wealth of information you can collect across the web that can help you decide if a candidate will be a good fit for the job you’re offering or not. Or which candidates you should be head hunting to begin with.
  • Legal departments and companies use data to gather documents for discovery or due diligence purposes. Think staying up to date with legal developments in your industry or mining laws and jurisprudence that may relate to your legal case.
  • Yet another good example is marketing companies. There’s a slew of startups out there that are trying to automate lead generation and outbound sales. Data helps them pinpoint the right people in the right companies while providing context and sometimes contact details to boot.
  • A last set worth mentioning is governments. National statistics offices, for instance, are seeking to automate computing consumer price indexes. And law enforcement agencies are scraping the dark web to locate and monitor criminals. To great effect, we dare add: not a month goes by without busting a human trafficking ring in the US, and we’re very proud to be providing some of the tools that enables this.

A-  With the rise of social media and consumer generated content, does web scraping offer a privacy dilemma? What are some of the ways we can ensure content is free and fairly open for both developers as well as people?

Denis de Bernardy: An important point to highlight here is that we normally stick to scraping publicly available information.

There are a few exceptions. For instance if a lawyer needs to log in to download detailed legal data information on a poorly designed government site. This is something they or a clerk would normally do manually each day. We just make it simpler. But they’re the exception.

The reasons are practical: if you excessively scrape something that requires you to be logged in or to use an API key, you get detected and shut down rather quickly. The reasons are also legal: if you’re logged in, you’ve de facto accepted terms of services – and they likely disallow automated crawls to begin with.

Another important point is that we honor robots.txt files as a rule of thumb. In practical terms this means that if Google can crawl it, so can we. The difference between what Google does and what we do is, well, we structure the data. Rather than searching across unstructured web pages for a piece of information, you search across profile names, bios, birthdays, etc.

With this out of the way, does web scraping offer a privacy dilemma? Kind of. But is it specific to web scraping to begin with? Web scraping helps you automate collecting this information. You could hire an army of workers on Amazon Turk and achieve the same result. Would you? No. But the fact is, this information is all online. No one would be collecting it if it was not.

Adding to this, the privacy issue raised is not new to web scraping. Or the internet. Or even the past century. Jeremy Bentham was describing the idea of Panopticons in the late 18th century. Michel Foucault was re-popularizing the term in the late 1970s – at least in France. Jacques Attali was worrying about a coming period of hyper-surveillance in a 2006 best-seller. The list could go on. Modern society has a heavy trend towards transparency. Your health insurance company, like it or not, would like to see the data in your wearables. It’s just a matter of time before someone popularizes the idea of publishing that data. And yes, it’ll get picked up – indeed, scraped – on the spot when that happens.

While we’re on the topic, note the data play as an aside. There are only so many ways you can lay out a user’s profile on a web page. With enough examples, you can train an AI to scrape them automatically on any site. Rinse and repeat for company profiles, comments, news articles, and what have you. Then connect the dots: if you leave your blog’s url around when you leave comments online, it will get tied back to you. Automate that too and – yes – online privacy basically flies out the window.

In the end, if it’s publicly available online it’s not exactly private…

On that note, Eric Schmidt’s controversial comment from 2009 might make a lot more sense than meets the eye. It was: “If you have something that you don’t want anyone to know, maybe you shouldn’t be doing it in the first place.” Perhaps you won’t broadcast it on the internet yourself; but someone else might. It may give you shivers, but it might hold true.
The main issue that got raised back then, from my standpoint at least, revolved around how to deal with false or dubious information and claims. We’ve privacy laws and consumer protection laws for those – or at least we do in the European Union. And recent developments on enforcing a right to be forgotten. Those types of laws and developments, if anything, are the proper ways to deal with privacy issues in my opinion. But they also introduce heaps of not-so-simple territoriality problems. For instance, should a random court in Iran be able to ask a US-based company to take a web page offline on privacy grounds? It’s thorny.

About-

Scrapinghub was created by the team behind the Scrapy web crawling framework. It provides web crawling and data processing solutions. Some of its products are:

  • Scrapy Cloud is a bit like Heroku for web scraping.
  • Portia, a visual crawler editor, lets you create and run crawlers without touching a line of code.
  • Frontera is a Crawl Frontier framework.
  • Crawlera, a smart proxy rotator, rotates IPs to deal with IP bans and proxy management.
  • Splash is a headless browser that executes JavaScript for people crawling websites

Users of this platform crawl over 2 billion web pages per month.

Interview Maciej Fijalkowski PyPy

As part of my research for “Python for R Users: A Data Science Approach” (Wiley 2016), I came across PyPy (http://pypy.org/). What is PyPy?

PyPy is a fast, compliant alternative implementation of the Python language (2.7.10 and 3.2.5). It has several advantages and distinct features:

  • Speed: thanks to its Just-in-Time compiler, Python programs often run faster on PyPy.

  • Memory usage: memory-hungry Python programs (several hundreds of MBs or more) might end up taking less space than they do in CPython.

  • Compatibility: PyPy is highly compatible with existing python code. It supports cffi and can run popular python libraries like twisted and django.

  • Stackless: PyPy comes by default with support for stackless mode, providing micro-threads for massive concurrency.

Now R users might remember the debate with Renjin and pqR a few years ago. PyPy is an effort which has been around for some time and they are currently at an interesting phase.

Here is an interview with Maciej Fijalkowski of PyPy


Ajay Ohri- Why did you create PyPy? What need does it serve?

PyPy– I joined PyPy in 2006 or 2007, I don’t even remember, but it was about two years into the project’s existence. Shockingly enough, the very first idea was that there would be a python-in-python for educational purposes only. It later occurred to us that we could use the fact that PyPy is written in a high-level language and apply various transformations to it, including just-in-time compilation. Overall it was a very roundabout way, but we came to the conclusion that this is the right way to provide a high-performance Python virtual machine, after Armin’s experience writing Psyco, which likely only a few people remember.

Ajay Ohri- Describe the current state of PyPy, especially regarding NumPy. Can we use it for Pandas, matplotlib, seaborn, scikit-learn, and statsmodels in the near future? What hinders your progress?

PyPy- We are right now in a state of flux. I’m almost inclined to say “talk to us in a few weeks/months”. I will describe the status right now as well as the possible near futures. Right now, we have a custom version of numpy that supports most of the existing numpy and can be used, although it does not pass all the tests. It has very fast array item access routines, so you can write your algorithms directly in Python without looking into custom solutions. It does not, however, provide a C API and so does not support anything else from the numeric stack.

We’re considering also supporting the original numpy with CPython C API, which will enable the whole numeric stack with some caveats. Currently, there are ongoing discussions and I can get back to you once this is resolved.

Our main problem is the CPython C API and the dependency of the entire numeric stack on that. It exposes a lot of CPython internals, like reference counting, the exact layout of lists and strings etc. We have a layer that provides some sort of compatibility with that, but we need more work in order to make it more robust and faster. In the case of C API the main hindrance is funding – I wrote a blog post detailing the current situation: http://lostinjit.blogspot.co.za/2015/11/python-c-api-pypy-and-road-into-future.html We would love to support the entire numeric stack and we will look into ways that make it possible.

Ajay Ohri- A faster, more memory-efficient Python – will it be useful for analysis of large amounts of numeric data?

PyPy- Python owes much of its success to good integration with the C ecosystem. For years we’ve been told that no one needs a fast Python, because what needs to be fast is already in C and we can go away. That has proven to be blatantly false, with projects like Apache Spark embedding Python as a way to do computations. There are also a lot of Python programmers, and it’s a bit unfair to expect them to “write all the performance critical parts in C” or in any of the other custom languages built around Python, like Cython. I personally think that there is a big place for a faster Python and we’re mostly fulfilling that role, except exactly for the case of integration with numeric libraries, which is absolutely crucial for a lot of people. We need to improve that story if we are to fill that gap completely, and while predicting the future is hard, we will do our best to support the numeric stack a lot better in the coming months.

Ajay Ohri- What are the day-to-day challenges you face while working on PyPy?

PyPy- That’s a tough question. There is no such thing in IT as “day-to-day challenges with technology”, because if something is really such a hindrance, you can usually automate it away. However, I don’t do only technical work these days; I deal a lot with people asking questions, looking at issues, trying to organize money for PyPy, etc. This means that it’s very hard to pinpoint what a day-to-day activity is, let alone what its problems are.

The most repeating challenges that we face are how to make sure there is funding for chronically underfunded open source projects and how to explain our unusual architecture to newcomers. The technical issues we are heavily trying to automate away so if it’s a repeating problem, we are going to have more and more infrastructure to deal with it in a more systematic manner.

Ajay Ohri- You and your highly skilled team could probably make much more money per hour working for companies on consulting projects. Why devote time to open source coding tools? What is the way we can get more people to donate or devote time?

PyPy- It is a very interesting question, probably exceeding the scope of this interview, but I will try to give it a go anyway. I think by now it’s pretty obvious that Open Source is just a better way to make software, at least as far as infrastructure goes. I can’t think of a single proprietary language platform that’s not tied to a specific architecture. Even Microsoft and .NET are moving slowly towards Open Source, with Apple owning so much of the platform that no one has a say there.

That means that locally, yes, we could very likely make far more money working for some corporations, but globally it’s pretty clear that both our impact and the value we bring is much higher than it would be working for a corporation looking for its short term gains.

Additionally, the problems we are presented to work with are much more interesting than the ones we would likely encounter in the corporate environment. Funding Open Source is a very tricky question here and I think we need to find answers to that.

Everyone uses Open Source software, directly or indirectly and there is enough money made by companies profiting from using it to fund it. How to funnel this money is a problem that we’re trying to solve on a small scale, but would be wonderful to see the solution on a bigger scale.

Ajay Ohri- How can we ensure automatic porting of algorithms between languages like Java, Python and R, rather than manually creating packages? I mean, if we can have Google Translate for human languages, what can we do to make automatic translation of code between computer languages?

PyPy- It would be very useful, but no one has managed to do it well; maybe that means something. However, it’s quite easy to translate between languages naively – without taking into account best practices, more efficient ways of achieving goals, etc. There is a whole discussion to be had, but I don’t think I’m going to have much insight into this.

About-

PyPy is a replacement for CPython. It is built using the RPython language that was co-developed with it. The main reason to use it instead of CPython is speed: it generally runs faster.

See more here http://pypy.org/features.html

Things I can change and things I can’t change

In the mind of a weekend hacker run a few thoughts too many:

There are some things I can’t change even though I wish I could change them, because quite clearly I would make a tonne more money.

Trusting the instincts of others to go north when my instincts tell me to go south. Keeping my mouth and my bloggy fingers shut about things I think are dishonest. Hanging out with people I won’t really like or respect any more. Stop writing technical books and start making more products to mmm sell more soap stars.

I can’t change that at all. What can I change? Oh, that. Someday I am going to write a piece of code, nothing big, just a few k lines, to make the change.

I can change the world by writing a beautiful piece of code.