Interview ScrapingHub #python #webcrawling

Here is an interview with the team behind ScrapingHub.com. Scrapinghub was created by the same team that created Scrapy, a Python-based open source framework for extracting data from websites.

Ajay (A)- Describe your journey with Python and scraping web pages.

Shane Evans (Director and Co-founder): I started commercial Python development around 15 years ago and loved it from the start. My first significant experience web scraping with Python was in 2007, when I was building a vertical search engine. Initially, my team started writing Python scripts, but that got problematic very quickly. I wrote a framework to make the job easier, promoting best practices for scraping and avoiding common mistakes. That framework went on to be released as Scrapy. About a year later, in an attempt to improve the efficiency of our spider development, I led the development of a visual scraping tool, which we called autoscraping. That tool was the basis of Portia, Scrapinghub’s visual scraping tool.

In addition to Python scraping frameworks, I have written many crawlers for specific websites and a lot of web crawling infrastructure, including much of our Scrapy Cloud product, nearly entirely in Python.

A- How does Scrapy compare with BeautifulSoup? What are the other technologies that you have used for web scraping?

Denis de Bernardy (Head of Marketing): Scrapy is a web scraping framework; BeautifulSoup is an HTML parser. That is, Scrapy takes care of details such as request throttling, concurrency, retrying URLs that fail, caching locally for development, detecting the correct page encoding, etc. It also offers a shell utility that comes in handy when you need to step through a crawl to debug what’s going on. You won’t necessarily run into all of these issues in your first project if you try using Requests and BeautifulSoup, but you will run into them eventually. Scrapy takes care of all of this for you so you can focus on extraction logic instead.
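
To give a feel for where that extraction logic lives, here is a minimal Scrapy spider sketch (not taken from the interview). It targets quotes.toscrape.com, a demo site commonly used in Scrapy tutorials; the CSS selectors and field names are illustrative assumptions.

```python
# Minimal Scrapy spider sketch; the site, selectors, and field names
# are illustrative, not something discussed in the interview.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]  # demo site, assumed reachable

    def parse(self, response):
        # Scrapy handles scheduling, retries, throttling, and encoding;
        # the spider only says what to extract and which links to follow.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

You could run a file like this with `scrapy runspider quotes_spider.py -o quotes.json` (file name assumed); the retrying, throttling, and encoding handling mentioned above all happen outside this code.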

In some sense, Greenspun’s tenth rule comes to mind: “Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.” The same could be said about a lot of web scraping projects and how they relate to Scrapy.

Other technologies are very diverse. To name only a few:

  • Frontera. It allows multiple spiders to share the same crawl frontier, that is, the list of URLs that are slated to be crawled next. We have two experimental variations of it to boot: one that dispatches huge crawls across multiple servers, and another that lets you prioritize crawls and re-crawls so you can go straight for the jugular when you’re looking for something specific, e.g. infosec-related data when you’re monitoring security exploits.
  • Splash. There’s no shortage of websites that require JavaScript for one reason or another, for instance because they’re single-page applications or because countermeasures are in place to ensure that bots are not, well, crawling the site. It’s a cat and mouse type of game, really. Splash is a headless browser that executes JavaScript to work around all of this. It also lets you take screenshots as you crawl (a minimal usage sketch follows this list).
  • Crawlera. It’s our smart proxy provider. It works around bot countermeasures by spreading requests across a large pool of IPs. What’s more, it throttles requests through each IP so it doesn’t get used too often and end up blacklisted, and it keeps track of which IPs are actually banned so you don’t bother trying them to begin with. Crawlera is what allows us and our customers to crawl complex sites with robust bot countermeasures in place.
  • Portia. It’s a visual scraping tool based on Scrapy. It is an open source project that allows users to build web spiders without needing to know any Python or coding at all. Users only need to visually annotate the elements that they want to extract from a web page; Portia then generates the spider code and delivers a working spider ready to be deployed.
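
As a rough illustration of how Splash fits in (not from the interview): it exposes an HTTP API, so a script can ask a running Splash instance to render a JavaScript-heavy page and return the resulting HTML or a screenshot. The endpoint names below follow Splash’s documented defaults, but the host, port, target URL, and wait time are assumptions for this sketch.

```python
# Sketch of calling a locally running Splash instance over its HTTP API.
# Assumes Splash is listening on localhost:8050 (its documented default);
# adjust the host, port, and target URL for your own setup.
import requests

target = "http://example.com/"  # placeholder URL

# render.html returns the page HTML after JavaScript has executed
resp = requests.get(
    "http://localhost:8050/render.html",
    params={"url": target, "wait": 2},  # give scripts a couple of seconds to settle
)
html = resp.text

# render.png returns a screenshot of the rendered page
shot = requests.get(
    "http://localhost:8050/render.png",
    params={"url": target, "width": 1024},
)
with open("screenshot.png", "wb") as f:
    f.write(shot.content)
```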

A- How does Python compare, say, with R (or RCurl et al) for scraping websites?

Valdir Stumm (Developer Evangelist): If you are just scraping some simple web pages, they are not that different, as both have networking libraries (R: RCurl; Python: Requests, urllib) and HTML/XML processing libraries (R: rvest; Python: BeautifulSoup, lxml).
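
For the simple, single-page case on the Python side, that combination might look like the following sketch (the URL is a placeholder, and the choice of lxml over BeautifulSoup is just for illustration):

```python
# One-page scrape sketch: Requests for networking, lxml for HTML processing.
import requests
from lxml import html

resp = requests.get("http://example.com/")  # placeholder URL
tree = html.fromstring(resp.content)

# Pull the text and href of every link on the page
for link in tree.xpath("//a"):
    print(link.text_content().strip(), link.get("href"))
```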

However, while simple web scraping might be easy with R, web crawling might not be. Tools like RCurl are useful for retrieving some web pages, but if you need to crawl a large collection of URLs (which may be unknown when you start), it can be difficult to do so in a timely manner.

Python, in contrast, has a vast ecosystem of libraries and frameworks for high-performance asynchronous networking and multiprocessing. And, of course, there are frameworks like Scrapy that take care of most of the dirty work, like scheduling URLs, handling network responses, character sets, etc.
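
As a small taste of that ecosystem (an illustrative sketch, not something from the interview), even the standard library’s thread pool combined with Requests is enough to fetch a batch of URLs concurrently; the URLs below are placeholders:

```python
# Fetch several pages concurrently using only the standard library's
# ThreadPoolExecutor plus Requests. URLs are placeholders.
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

urls = [
    "http://example.com/page1",
    "http://example.com/page2",
    "http://example.com/page3",
]


def fetch(url):
    resp = requests.get(url, timeout=10)
    return url, resp.status_code, len(resp.content)


with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(fetch, url) for url in urls]
    for future in as_completed(futures):
        url, status, size = future.result()
        print(url, status, size)
```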

A- What are some of the drawbacks of using Python?

Shane Evans: Scrapy takes care of a lot of the lower-level async programming, which is required to get good performance. This code is awkward to write, hard to understand, and a nightmare to debug (say, compared to how nicely you can do it in Erlang).

Performance is an often-mentioned drawback, and by that people usually mean raw compute power compared to, say, Java. However, in practice this is less of a problem for common scraping tasks, due to the reliance on libraries written in C. See my Quora answer about fast open source scrapers.

A- How is web scraping useful as a data input source even for traditional companies like banks, telecom, etc.?

Denis de Bernardy: Scraped data gets used in all sorts of useful ways. To name a few:

  • A slew of companies use data to monitor product information. Think price comparison apps for consumers, spotting price changes by your competitors so you can adjust your own prices faster, monitoring that your resellers are not offering your product for a lower price than agreed, sentiment analysis across end-user forums, and so forth.
  • HR departments and companies are more interested in profiles. There’s a wealth of information you can collect across the web that can help you decide if a candidate will be a good fit for the job you’re offering or not. Or which candidates you should be head hunting to begin with.
  • Legal departments and companies use data to gather documents for discovery or due diligence purposes. Think staying up to date with legal developments in your industry or mining laws and jurisprudence that may relate to your legal case.
  • Yet another good example is marketing companies. There’s a slew of startups out there that are trying to automate lead generation and outbound sales. Data helps them pinpoint the right people in the right companies while providing context and sometimes contact details to boot.
  • A last set worth mentioning is governments. National statistics offices, for instance, are seeking to automate the computation of consumer price indexes. And law enforcement agencies are scraping the dark web to locate and monitor criminals. To great effect, we dare add: not a month goes by without a human trafficking ring being busted in the US, and we’re very proud to be providing some of the tools that enable this.

A- With the rise of social media and consumer-generated content, does web scraping present a privacy dilemma? What are some of the ways we can ensure content is free and fairly open for both developers and people?

Denis de Bernardy: An important point to highlight here is that we normally stick to scraping publicly available information.

There are a few exceptions. For instance, a lawyer may need to log in to download detailed legal information from a poorly designed government site. This is something they or a clerk would normally do manually each day; we just make it simpler. But such cases are the exception.

The reasons are practical: if you excessively scrape something that requires you to be logged in or to use an API key, you get detected and shut down rather quickly. The reasons are also legal: if you’re logged in, you’ve de facto accepted the terms of service, and they likely disallow automated crawls to begin with.

Another important point is that we honor robots.txt files as a rule of thumb. In practical terms this means that if Google can crawl it, so can we. The difference between what Google does and what we do is, well, that we structure the data: rather than searching across unstructured web pages for a piece of information, you search across profile names, bios, birthdays, etc.
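
Honoring robots.txt is also straightforward to do programmatically. As a small sketch (not from the interview), Python’s standard library can check whether a given user agent is allowed to fetch a URL before the crawler requests it; the domain and user agent string here are placeholders:

```python
# Check robots.txt before fetching, using only the standard library.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")  # placeholder domain
rp.read()

url = "http://example.com/some/profile/page"
if rp.can_fetch("mybot", url):  # "mybot" is a placeholder user agent
    print("allowed to crawl:", url)
else:
    print("disallowed by robots.txt:", url)
```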

With this out of the way, does web scraping offer a privacy dilemma? Kind of. But is it specific to web scraping to begin with? Web scraping helps you automate collecting this information. You could hire an army of workers on Amazon Turk and achieve the same result. Would you? No. But the fact is, this information is all online. No one would be collecting it if it was not.

Adding to this, the privacy issue raised is not new to web scraping. Or the internet. Or even the past century. Jeremy Bentham was describing the idea of Panopticons in the late 18th century. Michel Foucault was re-popularizing the term in the late 1970s – at least in France. Jacques Attali was worrying about a coming period of hyper-surveillance in a 2006 best-seller. The list could go on. Modern society has a heavy trend towards transparency. Your health insurance company, like it or not, would like to see the data in your wearables. It’s just a matter of time before someone popularizes the idea of publishing that data. And yes, it’ll get picked up – indeed, scraped – on the spot when that happens.

While we’re on the topic, an aside on the data play: there are only so many ways you can lay out a user’s profile on a web page. With enough examples, you can train an AI to scrape them automatically on any site. Rinse and repeat for company profiles, comments, news articles, and what have you. Then connect the dots: if you leave your blog’s URL around when you leave comments online, it will get tied back to you. Automate that too and, yes, online privacy basically flies out the window.

In the end, if it’s publicly available online it’s not exactly private…

On that note, Eric Schmidt’s controversial comment from 2009 might make a lot more sense than meets the eye. It was: “If you have something that you don’t want anyone to know, maybe you shouldn’t be doing it in the first place.” Perhaps you won’t broadcast it on the internet yourself; but someone else might. It may give you shivers, but it might hold true.

The main issue that got raised back then, from my standpoint at least, revolved around how to deal with false or dubious information and claims. We have privacy laws and consumer protection laws for those, at least in the European Union, along with recent developments on enforcing a right to be forgotten. Those types of laws and developments, if anything, are the proper way to deal with privacy issues in my opinion. But they also introduce heaps of not-so-simple territoriality problems. For instance, should a random court in Iran be able to ask a US-based company to take a web page offline on privacy grounds? It’s thorny.

About

Scrapinghub was created by the team behind the Scrapy web crawling framework. It provides web crawling and data processing solutions. Some of its products are:

  • Scrapy Cloud is a bit like Heroku for web scraping.
  • Portia, a visual crawler editor, lets you create and run crawlers without touching a line of code.
  • Frontera is a crawl frontier framework.
  • Crawlera, a smart proxy rotator, rotates IPs to deal with IP bans and proxy management.
  • Splash is a headless browser that executes JavaScript for people crawling websites.

Users of this platform crawl over 2 billion web pages per month.

Author: Ajay Ohri

http://about.me/ajayohri
