Interview ScrapingHub #python #webcrawling

Here is an interview with the team behind Scrapinghub.com. Scrapinghub was created by the same team that created Scrapy, a Python-based open source framework for extracting data from websites.


Ajay (A) - Describe your journey with Python and scraping web pages.

Shane Evans (Director and Co-founder): I started commercial Python development around 15 years ago and loved it from the start. My first significant experience web scraping with Python was in 2007, when I was building a vertical search engine. Initially, my team started writing Python scripts, but that got problematic very quickly. I wrote a framework to make the job easier, promoting best practices for scraping and avoiding common mistakes. That framework went on to be released as Scrapy. About a year later, in an attempt to improve the efficiency of our spider development, I led the development of a visual scraping tool, which we called autoscraping. That tool was the basis of Portia, Scrapinghub’s visual scraping tool.

In addition to Python scraping frameworks, I have written many crawlers for specific websites and a lot of web crawling infrastructure, including much of our Scrapy Cloud product, nearly entirely in Python.

A- How does Scrapy compare with BeautifulSoup? What are the other technologies that you have used for web scraping?

Denis de Bernardy (Head of Marketing): Scrapy is a web scraping framework; BeautifulSoup is an HTML parser. That is, Scrapy takes care of details such as request throttling, concurrency, retrying URLs that fail, caching locally for development, detecting the correct page encoding, etc. It also offers a shell utility that comes in handy when you need to step through a crawl to debug what’s going on. You won’t necessarily run into all of these issues in your first project if you try using Requests and BeautifulSoup. But you will run into them eventually. Scrapy takes care of all of this for you so you can focus on extraction logic instead.
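
To make that split concrete, here is a minimal sketch of a Scrapy spider (not from the interview; the target is the public practice site quotes.toscrape.com, and the CSS selectors are assumptions for that site). You write only the extraction logic; scheduling, throttling, retries and encoding detection are handled by the framework.

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Extraction logic is all we write; Scrapy handles scheduling,
        # throttling, retries and encoding detection for us.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").extract_first(),
                "author": quote.css("small.author::text").extract_first(),
            }
        next_page = response.css("li.next a::attr(href)").extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
```

Saved as quotes_spider.py, this can be run without any project setup via `scrapy runspider quotes_spider.py -o quotes.json`.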

In some sense, Greenspun’s tenth rule comes to mind: “Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.” The same could be said about a lot of web scraping projects and how they relate to Scrapy.

Other technologies are very diverse. To name only a few:

  • Frontera. It allows multiple spiders to share the same crawl frontier – that is, the list of URLs that are slated to get crawled next. We’ve two experimental variations of it to boot. One that dispatches huge crawls across multiple servers, and another that enables you to prioritize crawls and re-crawls so you can go straight for the jugular when you’re looking for something specific – e.g. infosec-related data when you’re monitoring security exploits.
  • Splash. There’s no shortage of websites that require JavaScript for one reason or another – for instance because they’re single-page applications, or because countermeasures are in place to ensure that bots are not, well, crawling your website. It’s a cat and mouse type of game, really. Splash is a headless browser that executes JavaScript to work around all of this, and it also lets you take screenshots as you crawl (a minimal usage sketch follows this list).
  • Crawlera. It’s our smart proxy provider. It works around bot countermeasures by spreading requests across a large pool of IPs. What’s more, it throttles requests through each IP so they don’t get used too often and blacklisted, and it tracks which IPs are actually banned so you don’t bother trying them to begin with. Crawlera is what allows us and our customers to crawl complex sites with robust bot countermeasures in place.
  • Portia: It’s a visual scraping tool based on Scrapy. It is an open source project that allows users to build web spiders without needing to know any Python or coding at all. The users only need to visually annotate the elements that they want to extract from a web page and then Portia generates the spider code and delivers a working spider ready to be deployed.
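
As promised above, here is a minimal sketch of the Splash workflow (not from the interview): it asks a locally running Splash instance to render a JavaScript-heavy page via its HTTP API. The host, port and wait time are assumptions for a default local install, and the target URL is a placeholder.

```python
import requests

# Default render endpoint of a local Splash instance (e.g. started via Docker).
SPLASH_RENDER_URL = "http://localhost:8050/render.html"

params = {
    "url": "http://example.com",  # placeholder target page
    "wait": 1.0,                  # give client-side JavaScript time to run
}

response = requests.get(SPLASH_RENDER_URL, params=params)
response.raise_for_status()

rendered_html = response.text  # HTML after JavaScript execution
print(rendered_html[:200])
```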

A- How does Python compare say with R ( or RCurl et al) for scraping websites?

Valdir Stumm (Developer Evangelist): If you are just scraping some simple web pages, they are not that different, as both work with networking libraries (R: RCurl, Python: Requests, urllib) and with HTML/XML processing libraries (R: rvest, Python: BeautifulSoup, lxml).

However, while simple web scraping might be easy with R, web crawling might not be. Tools like RCurl are useful for retrieving some web pages, but if you need to crawl a large collection of URLs (which may be unknown when starting), it may turn out to be difficult to do so in a timely manner.

Python, in contrast, has a vast ecosystem with libraries and frameworks to do high performance asynchronous networking and multiprocessing. And, of course, there are frameworks like Scrapy that take care of most of the dirty work like scheduling URLs, handling networking responses, character sets, etc.
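
As a small illustration of that ecosystem (not part of the interview), here is a minimal sketch of concurrent fetching with asyncio and the third-party aiohttp library; the URLs are placeholders.

```python
import asyncio

import aiohttp


async def fetch(session, url):
    # Download one page and return its size; errors are left unhandled for brevity.
    async with session.get(url) as response:
        body = await response.text()
        return url, len(body)


async def crawl(urls):
    # Issue all requests concurrently from a single process.
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))


if __name__ == "__main__":
    pages = ["https://example.com", "https://example.org"]
    for url, size in asyncio.run(crawl(pages)):
        print(url, size)
```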

A- What are some of the drawbacks of using Python?

Shane Evans: Scrapy takes care of a lot of the lower level async programming, which is required to get good performance. This code is awkward to write, hard to understand, and a nightmare to debug (say, compared to how nicely you can do it in Erlang).

Performance is an often mentioned drawback, and by that people usually mean raw compute power compared to, say, Java. However, in practice this is less of a problem for common scraping tasks due to the reliance on libraries written in C. See my Quora answer about fast open source scrapers.

A- How is web scraping useful as a data input source even for traditional companies like banks, telecom, etc.?

Denis de Bernardy: Scraped data gets used in all sorts of useful ways. To name a few:

  • A slew of companies use data to monitor product information. Think price comparison apps for consumers, spotting price changes by your competitors so you can adjust your own prices faster, monitoring that your resellers are not offering your product for a lower price than agreed, sentiment analysis across end-user forums, and so forth.
  • HR departments and companies are more interested in profiles. There’s a wealth of information you can collect across the web that can help you decide if a candidate will be a good fit for the job you’re offering or not. Or which candidates you should be head hunting to begin with.
  • Legal departments and companies use data to gather documents for discovery or due diligence purposes. Think staying up to date with legal developments in your industry or mining laws and jurisprudence that may relate to your legal case.
  • Yet another good example is marketing companies. There’s a slew of startups out there that are trying to automate lead generation and outbound sales. Data helps them pinpoint the right people in the right companies while providing context and sometimes contact details to boot.
  • A last set worth mentioning is governments. National statistics offices, for instance, are seeking to automate computing consumer price indexes. And law enforcement agencies are scraping the dark web to locate and monitor criminals. To great effect, we dare add: not a month goes by without a human trafficking ring being busted in the US, and we’re very proud to be providing some of the tools that enable this.

A-  With the rise of social media and consumer generated content, does web scraping offer a privacy dilemma? What are some of the ways we can ensure content is free and fairly open for both developers as well as people?

Denis de Bernardy: An important point to highlight here is that we normally stick to scraping publicly available information.

There are a few exceptions, for instance if a lawyer needs to log in to download detailed legal information from a poorly designed government site. This is something they or a clerk would normally do manually each day; we just make it simpler. But such cases are the exception.

The reasons are practical: if you excessively scrape something that requires you to be logged in or to use an API key, you get detected and shut down rather quickly. The reasons are also legal: if you’re logged in, you’ve de facto accepted the terms of service – and they likely disallow automated crawls to begin with.

Another important point is that we honor robots.txt files as a rule of thumb. In practical terms this means that if Google can crawl it, so can we. The difference between what Google does and what we do is, well, we structure the data. Rather than searching across unstructured web pages for a piece of information, you search across profile names, bios, birthdays, etc.

With this out of the way, does web scraping pose a privacy dilemma? Kind of. But is it specific to web scraping to begin with? Web scraping helps you automate collecting this information. You could hire an army of workers on Amazon Mechanical Turk and achieve the same result. Would you? No. But the fact is, this information is all online. No one would be collecting it if it were not.

Adding to this, the privacy issue raised is not new to web scraping. Or the internet. Or even the past century. Jeremy Bentham was describing the idea of Panopticons in the late 18th century. Michel Foucault was re-popularizing the term in the late 1970s – at least in France. Jacques Attali was worrying about a coming period of hyper-surveillance in a 2006 best-seller. The list could go on. Modern society has a heavy trend towards transparency. Your health insurance company, like it or not, would like to see the data in your wearables. It’s just a matter of time before someone popularizes the idea of publishing that data. And yes, it’ll get picked up – indeed, scraped – on the spot when that happens.

While we’re on the topic, note the data play as an aside: there are only so many ways you can lay out a user’s profile on a web page. With enough examples, you can train an AI to scrape them automatically on any site. Rinse and repeat for company profiles, comments, news articles, and what have you. Then connect the dots: if you leave your blog’s URL around when you leave comments online, it will get tied back to you. Automate that too and – yes – online privacy basically flies out the window.

In the end, if it’s publicly available online it’s not exactly private…

On that note, Eric Schmidt’s controversial comment from 2009 might make a lot more sense than meets the eye. It was: “If you have something that you don’t want anyone to know, maybe you shouldn’t be doing it in the first place.” Perhaps you won’t broadcast it on the internet yourself; but someone else might. It may give you shivers, but it might hold true.

The main issue that got raised back then, from my standpoint at least, revolved around how to deal with false or dubious information and claims. We have privacy laws and consumer protection laws for those – or at least we do in the European Union – and recent developments on enforcing a right to be forgotten. Those types of laws and developments, if anything, are the proper way to deal with privacy issues in my opinion. But they also introduce heaps of not-so-simple territoriality problems. For instance, should a random court in Iran be able to ask a US-based company to take a web page offline on privacy grounds? It’s thorny.

About-

Scrapinghub was created by the team behind the Scrapy web crawling framework. It provides web crawling and data processing solutions. Some of its products are:

  • Scrapy Cloud is a bit like Heroku for web scraping.
  • Portia, a visual crawler editor, lets you create and run crawlers without touching a line of code.
  • Frontera is a crawl frontier framework.
  • Crawlera, a smart proxy rotator, rotates IPs to deal with IP bans and proxy management.
  • Splash is a headless browser that executes JavaScript for people crawling websites.

Users of this platform  crawl over 2 billion web pages per month.

Interview Noah Gift #rstats #python #erlang

Here is an interview with Noah Gift, CTO of Sqor Sports. He is a prolific coder in Python, R and Erlang. Since he is an expert coder in both R and Python, I decided to take his views on both. Noah is also the author of the book

Ajay (A) - Describe your journey in science and coding. What interests you and keeps you motivated in writing code?
N- Artificial intelligence motivates me to continue to learn and write code, even after 40. In addition, functional programming and cool languages like Erlang are a pleasure to use. Finally, I enjoy problem solving, whether it comes in the form of mastering Brazilian jiu-jitsu, rock climbing or writing code every week. It is a game that is enjoyable, and the fact that these types of skills take years to learn makes it very satisfying to make progress day by day, potentially until the day I die.
A- The data science community itself has debated the R versus Python question many times. What are your views on it? How do we decide when to use Python, when to use R, and when to use both (or not)?
N- I think R is best for data science and statistics and for cutting edge machine learning; this is what I do. Python is very tough to beat for writing quick scripts or turning thought into working code. I wouldn’t necessarily use either language to build scalable systems though. I think they are both prototyping or “batch” scripting systems.
A- Describe your work in Sports Analytics. What are some of the interesting things about data science in sports?
N- I think rating players and teams using the Elo rating is an interesting example of simple math used to make powerful conclusions. For example, this has been used effectively in MMA and basketball. Machine learning around movement, say basketball players moving on the court, is also going to be a very interesting data science application in sports. We will be able to tell when a player should be pulled out of the game for being tired. Finally, with wearables, we may soon be able to treat athletes the same way we treat machines. Data science in sports is going to grow exponentially in the near future.
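
For readers unfamiliar with it, here is a minimal sketch of the Elo update Noah refers to; the K-factor of 32 and the example ratings are arbitrary illustration values, not from the interview.

```python
def expected_score(rating_a, rating_b):
    # Probability that side A beats side B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32):
    # score_a is 1 for a win by A, 0.5 for a draw, 0 for a loss.
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - expected_a))
    return new_a, new_b


# An upset win by the lower-rated team shifts both ratings noticeably.
print(update_elo(1500, 1700, score_a=1))
```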
A- How do we get younger students and the next generation excited about coding? How do we make sure people from poorer countries also learn coding? Can we teach coding on mobile devices?
N- Using simple methods and simple open source languages to solve problems is a good approach. For example, programming in Python or Erlang or R, especially in a functional style, makes it very simple to write code. The problem I see in motivating and teaching people to program is when needless complexity, like object oriented design, is thrown in. Keeping it simple and teaching people to write simple functions is the best way to go at first.
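
As a trivial illustration of the "simple functions first" point (mine, not Noah’s), a beginner can read this whole program at once, with no classes or framework ceremony.

```python
def average(numbers):
    # Return the arithmetic mean of a list of numbers.
    return sum(numbers) / len(numbers)


print(average([3, 5, 10]))  # prints 6.0
```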
A- Describe your methodology for work-life balance. How important are health and balance for programmers and hackers?
N- I train martial arts, especially MMA and BJJ, several times a week and train overall 6 days a week. All true hackers/programmers should seriously consider being in peak physical condition (almost at the level of a pro athlete) because of the side benefits: clarity of thought, confidence, pain tolerance, endurance, happiness and more. In addition, taking breaks, including vacations, just as professional athletes take rest days, is very important. How much work is done in one day, or one week, or one month is nothing compared to what someone does every day for years. The overall discipline of doing little bits of work over time is a better way to code until the day you die, like I plan to.
About-
Sqor is a sports social network that gives you the latest news and scores as well as unfiltered access to athletes.

Noah Gift is Chief Technical Officer and General Manager of Sqor. In this role, Noah is responsible for general management, product development and technical engineering. Prior to joining Sqor, Noah led Web Engineering at Linden Lab. He has a B.S. in Nutritional Science from Cal Poly S.L.O., a Master’s degree in Computer Information Systems from CSULA, and an MBA from UC Davis. You can read more on him here.

Related-

Some Articles on Python by Noah

Cloud business analytics: Write your own dashboard

Data science in the cloud: Investment analysis with IPython and pandas

Linear optimization in Python, Part 1: Solve complex problems in the cloud with Pyomo

Linear optimization in Python, Part 2: Build a scalable architecture in the cloud

Using Python to create UNIX command line tools

 

Guest Blog on KDnuggets: Using Python and R together, 3 main approaches

I just got published on KDnuggets for a guest blog at http://www.kdnuggets.com/2015/12/using-python-r-together.html – I list the reasons for moving to using both Python and R (not just one) and the current technology for doing so. I think the R project could greatly benefit if the huge Python community came closer to using the R language, and Python developers could greatly benefit from using R packages.

An extract is here-

Using Python and R together: 3 main approaches

  1. Both languages borrow from each other. Even seasoned package developers like Hadley Wickham (RStudio) borrow from Beautiful Soup (Python) to make rvest for web scraping, and Yhat borrows from sqldf to make pandasql (a quick pandasql sketch follows this list). Rather than reinvent the wheel in the other language, developers can focus on innovation.

  2. The customer does not care which language the code was written in; the customer cares about insights.
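
As promised above, a minimal sketch of the sqldf-inspired approach: querying a pandas DataFrame with SQL via the pandasql package. The DataFrame contents here are made up for illustration.

```python
import pandas as pd
from pandasql import sqldf

df = pd.DataFrame({"lang": ["R", "Python", "R"], "users": [10, 20, 30]})

# SQL over a DataFrame, the way R users query data frames with sqldf.
query = "SELECT lang, SUM(users) AS total FROM df GROUP BY lang"
print(sqldf(query, locals()))
```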

 

To read the complete article … see

http://www.kdnuggets.com/2015/12/using-python-r-together.html

How geeks can help defeat terrorists part 1

There is a lot of money in defeating terrorism by using analytics, much more than in internet ads alone.

I personally don’t think you can make the world safer by collecting data from my Facebook, Twitter, WordPress or Gmail account – indeed, precious resources are diverted into signal intelligence (sigint) when they could have been used in human intelligence (humint). But signal intelligence interests the lobbyists of the military-industrial complex more than humint does, and post Manning and Snowden it would be questionable to increase the number of analysts without thorough screening. There are no easy solutions to this, unfortunately, as the attacks in Paris and California show the limits of signal intelligence.

If you collect data from the Internet in bulk, the terrorists will adapt, now that everyone, terrorist and civilian alike, knows data is being collected. Surprise as a key element has been lost due to Snowden.

Perhaps of greater utility would be linking law enforcement databases across the world, better interfaces for querying and analyzing huge datasets, and automated alerts for reminding global law enforcement when they fail to follow up. Better analytics is needed, not more data. That’s just old news and lazy data analytics.

Interface design is key to solving Big Data problems. As a decision maker, I have a huge pile of intelligence reports to read. What kind of data visualisation extracts signal from noise and delivers it in a timely, automated manner? Right now they are all just documents upon documents.

That’s a big problem to solve in Big Data analytics.

Note from Ajay – These are the author’s personal views.

 

Extract from Eric Siegel’s Forthcoming Book

From http://www.predictiveanalyticsworld.com/patimes/a-rogue-liberal-halting-nsa-bulk-data-collection-compromises-intelligence/6882/

 

A National Security Agency (NSA) data gathering facility is seen in Bluffdale, about 25 miles south of Salt Lake City, Utah May 18. Technology holds the power to discover terrorism suspects from data—and yet to also safeguard privacy even with bulk telephone and email data intact, the author argues.

I must disagree with my fellow liberals. The NSA bulk data shutdown scheduled for November 29 is unnecessary and significantly compromises intelligence capabilities. As recent tragic events in Paris and elsewhere turn up the contentious heat on both sides of this issue, I’m keenly aware that mine is not the usual opinion for an avid supporter of Bernie Sanders (who was my hometown mayor in Vermont).

But as a techie, a former Columbia University computer science professor, I’m compelled to break some news: Technology holds the power to discover terrorism suspects from data—and yet to also safeguard privacy even with bulk telephone and email data intact. To be specific, stockpiling data about innocent people in particular is essential for state-of-the-art science that identifies new potential suspects.

I’m not talking about scanning to find perpetrators, the well-known practice of employing vigilant computers to trigger alerts on certain behavior. The system spots a potentially nefarious phone call and notifies a heroic agent—that’s a standard occurrence in intelligence thrillers, and a common topic in casual speculation about what our government is doing. Everyone’s familiar with this concept.

Rather, bulk data takes on a much more difficult, critical problem: precisely defining the alerts in the first place. The actual “intelligence” of an intelligence organization hinges on the patterns it matches against millions of cases—it must develop adept, intricate patterns that flag new potential suspects. Deriving these patterns from data automatically, the function of predictive analytics, is where the scientific rubber hits the road. (Once they’re established, matching the patterns and triggering alerts is relatively trivial, even when applied across millions of cases—that kind of mechanical process is simple for a computer.)

I want your data

It may seem paradoxical, but data about the innocent civilian can serve to identify the criminal. Although the ACLU calls it “mass, suspicionless surveillance,” this data establishes a baseline for the behavior of normal civilians. That is to say, law enforcement needs your data in order to learn from you how non-criminals behave. The more such data available, the more effectively it can do so.
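
As a toy illustration of the baseline idea (mine, not from the book): ordinary users’ data defines what “normal” looks like, and a value is flagged only when it falls far outside that baseline. The feature and threshold below are made up.

```python
import statistics

# Calls per day observed for ordinary, "innocent" users: the baseline.
normal_calls_per_day = [3, 5, 4, 6, 2, 5, 4, 3, 5, 4]
baseline_mean = statistics.mean(normal_calls_per_day)
baseline_stdev = statistics.stdev(normal_calls_per_day)


def is_anomalous(value, threshold=3.0):
    # Flag values more than `threshold` standard deviations from the baseline.
    return abs(value - baseline_mean) / baseline_stdev > threshold


print(is_anomalous(5))   # False: indistinguishable from the civilian baseline
print(is_anomalous(40))  # True: far outside it, worth a second look
```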

 

This article, originally published in Newsweek’s opinion section and excerpted here, resulted from the author’s research for a new extended sidebar on the topic that will appear in the forthcoming revised and updated paperback edition of Eric Siegel’s Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die (coming January 6, 2016).

DecisionStats Interview Radim Řehůřek Gensim #python

As part of my research for Python for R Users: A Data Science Approach (Wiley 2016), here is an interview with Radim Řehůřek, CEO of RaRe Consulting and creator of gensim, a Python package.
Decision Stats (DS)- Describe your work on the Python package gensim. How did you write it, and what were the key turning points in the journey? What are some of the key design points you used in creating gensim? How is gensim useful to businesses for text mining or natural language processing (any links to examples of usage)?
Radim Řehůřek (RaRe)- Gensim was born out of frustration with existing software. We were developing a search engine for an academic library back in 2009, and wanted to include this “hot new functionality of semantic search”. All implementations I could find were either arcane FORTRAN (yes, all caps!) or insanely fragile academic code. Good luck debugging and customizing that…
I ended up redesigning these algorithms to be streamed (online), so that we could run them on large out-of-core datasets. This became gensim, as well as the core of my PhD thesis.
Looking back, focusing on data streaming and picking Python were incredibly lucky choices. Both concepts have gained a lot of momentum in the modern data science world, and gensim along with them. It’s just a happy marriage of Python’s “ease of use” and commercial “need to process large datasets quickly”.
Gensim has been applied across industries — apart from the obvious ones (media, marketing, e-commerce), there have been some imaginative uses of topic modeling in biogenetics or literary sciences. Gensim’s also being taught in several universities across the world as a machine learning tool. A few “on-record” testimonials are at its project page.
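
To show the streaming design Radim describes, here is a minimal sketch using gensim’s public API. The three example documents and the two-topic LSI model are placeholders; in real use the corpus class would read documents from disk rather than from a list.

```python
from gensim import corpora, models

documents = [
    "human machine interface for lab computer applications",
    "a survey of user opinion of computer system response time",
    "graph of trees and minors in graph theory",
]


class StreamedCorpus:
    """Yields bag-of-words vectors one document at a time (out-of-core friendly)."""

    def __init__(self, dictionary, texts):
        self.dictionary = dictionary
        self.texts = texts

    def __iter__(self):
        for text in self.texts:
            yield self.dictionary.doc2bow(text.lower().split())


tokenized = [doc.lower().split() for doc in documents]
dictionary = corpora.Dictionary(tokenized)
corpus = StreamedCorpus(dictionary, documents)

# The model consumes the stream; the full corpus never sits in memory at once.
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
for bow in corpus:
    print(lsi[bow])
```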
DS- Have you used other languages like R or Java besides Python? What has been your experience in using them versus Python for machine learning, text mining and data mining, especially in production systems?
RaRe- I only started with Python around 2007 or 2008. Before it was all Java (does anyone remember Weka?), C, C++, bash, assembly, C#… later we used Javascript, Go and god knows how many others. But just as anyone who’s been around for a while realizes, development is more about proper design and architecture, rather than a particular language choice.
Python has a lot going for it in this nascent, prototype-driven field of machine learning. People claim it’s slow, but you can whip it to run faster than optimized C, if you know what you’re doing 🙂
In my opinion, Python’s main disadvantage coincides (as is often the case) with its main advantage — the dynamic duck typing. Its suitability for production is questionable, except maybe for fast-pivoting startups. Without herculean efforts in unit testing and ad-hoc tools for static analysis, it’s easy to get lost in large codebases. By the time the solution is clearly scoped, well defined and unlikely to change (ha!) I’d consider the JVM world for production.
Examples in my PyData Italy keynote “Does Python stand a chance in today’s world of data science” covered this topic in a bit more depth.
DS- You have worked as an academic, as a freelance consultant and now a startup across multiple locations. What are some of the key challenges you faced in this journey
RaRe- I’d say the transition from an academic mindset to a commercial one was a major challenge. It’s underestimated by many fresh graduates. Tinkering with details, hacking, exciting irrelevant detours are all fine, but the consulting business is much more about a pragmatic listen-to-what-the-client-actually-needs and then get-it-done. Preferably in a straightforward, efficient manner.
There’s other stuff that comes with running a business: understanding intellectual property, legalese, cross-country and cross-continent accounting, managing employees, managing clients, marketing… It’s exciting for sure and a lot of hard, novel work, but you kind of expect that, no surprise there.
By the way I’m in the process of writing a series of articles about “the life of a data science consultant” (to appear on our site soon), following the wave of interest after my BerlinBuzzwords talk on the topic.
DS- What are your favourite algorithms in terms of how you use them?
RaRe- Funnily enough, I’m a fan of simple, well-understood algorithms.
RaRe- Linear classifiers are one example; linear scan in place of search is another. Compared to the academic cutting edge these are ridiculous fossils. But what you’ll often find out in real-world projects is that by the time the business problem is sufficiently well defined, the implementation scoped, the integrations with other systems understood and the whole pipeline working, the few percent gained by a more complex algorithm are the least of your concerns.
You’ll mostly hear about startups that live on the cutting edge of AI, where deep learning makes or breaks their business model. But there are gazillions of businesses that don’t need that. Having a clearly understood, interpretable, efficient and integrated predictive model that works is a massive win, and already enough work as is. Most effort goes into business analysis in order to solve the right problem using a manageable process, not pushing the theoretical envelope of life, universe and everything.
There was a great talk on “Linear Models for Data Science” by Brad Klingenberg (of StitchFix) recently, which made a good case for simpler models.
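
For illustration (toy data, not from the interview), a simple, interpretable linear classifier of the kind Radim describes takes only a few lines with scikit-learn, and its coefficients can be read directly.

```python
from sklearn.linear_model import LogisticRegression

# Toy features: [pages_visited, minutes_on_site]; label: converted or not.
X = [[1, 2], [2, 1], [8, 15], [9, 12], [3, 2], [7, 14]]
y = [0, 0, 1, 1, 0, 1]

model = LogisticRegression()
model.fit(X, y)

# Interpretability is the point: each coefficient maps to a named feature.
print("coefficients:", model.coef_)
print("prediction for [6, 10]:", model.predict([[6, 10]]))
```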
DS- What are your views on Python leveraging multiple cores? What do you think about cloud computing? Why is parallel processing of algorithms not more common in other packages as well?
 
RaRe- Higher connectivity and larger computing clusters are the future, no doubt about it.
We’re slowly coming out of an age where every single distributed system that actually worked was something of an art piece. Always NIH-heavy, finely tuned to its particular big-data use case by necessity, while touting completely generic universality for PR reasons.
But I think we’re not far off an age where it will be truly easier to use one of these frameworks than roll your own. The current generation of general-purpose distributed systems (such as Spark) is already getting some parts right. They’re still too raw and hard to manage (debug, integrate) to be practically useful for the mainstream, but we’re getting there, it’s a wave.
What does this mean for Python? Who knows, but its pragmatic no-nonsense culture has good potential for producing a useful solution too, though the current distributed ecosystems favour the JVM world heavily. In the short term there’s some effort in cross-language interoperability; in the long term, evolution tends to cull dead branches and favour the uncompromising.
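
On the multiple-cores question, here is a minimal sketch (mine, not Radim’s) of the standard-library route: CPU-bound work farmed out to a pool of worker processes, sidestepping the GIL.

```python
from multiprocessing import Pool


def cpu_bound(n):
    # A deliberately heavy computation that keeps one core busy.
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    with Pool(processes=4) as pool:
        results = pool.map(cpu_bound, [10**6, 2 * 10**6, 3 * 10**6, 4 * 10**6])
    print(results)
```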
DS- What is the best thing you like about coding in Python? And the worst?
 
RaRe- I can only speak for the PyData subset of the (many) Python communities:
Pro: pragmatic mindset codified in the Zen of Python; experienced full-stack developers; duck typing; fast iteration and prototyping cycles, Python makes you think before you write (by virtue of its no-debugger REPL culture) 🙂
Con: duck typing; lack of enterprise maturity: deployment, packaging maintenance, marketing. Continuum.io are doing great work in this area to keep Python alive.
About

Radim Řehůřek, Ph.D. is a senior software developer and entrepreneur with a passion for machine learning, natural language processing and text analysis. He is the creator of gensim, a Python library widely used for topic modelling (“Topic Modelling for Humans”).

Article from PA Times: How hiring is the same as approving a mortgage

In this article, http://www.predictiveanalyticsworld.com/patimes/hiring-approving-mortgages-its-the-same-thing/6715/, veteran industry expert Greta Roberts, CEO of Talent Analytics, argues the case for proactive, preventive analytics rather than reactive post-mortem analytics:

To stay in business and be profitable, lenders need to predict which borrower candidates are a good risk before extending an offer. Once the offer has been extended, all the company can do is restructure their mortgage, coach, cajole, support, train and hopefully manage the borrower to keep them from completely defaulting.

and

Two Problems with Most “Predictive” HR Systems and Approaches

  • Most approaches “predict” flight risk or performance for current employees only – when it’s too late.

  • Most approaches don’t have highly predictive job candidate data – and never consider augmenting their candidate datasets so they can predict pre-hire.
