
Using Google Docs for Web Scraping

While trying to scrape some data from a website, I chanced upon the importXML function in Google Docs spreadsheets. It is pretty neat: it imports the XML or HTML of a webpage and lets you pull out just the parts you want with an XPath query.

 

Here is an example: using the importXML function, I parsed all the links in the Google search results for “analytics consultant in India”.

The importXML function works as follows (from the support page):

Functions:

=importXML("URL","query")

  • URL – the URL of the XML or HTML file
  • query – the XPath query to run on the data given at the URL. For example, "//a/@href" returns a list of the href attributes of all <a> tags in the document (i.e. all of the URLs the document links to). For more information about XPath, please visit http://www.w3schools.com/xpath/
  • Example: =importXML("www.google.com", "//a/@href"). This returns all of the href attributes (the link URLs) in all the <a> tags on the www.google.com home page
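
To make it clearer what the "//a/@href" query selects, here is a rough Python sketch that collects the same thing (the href attribute of every <a> tag) using only the standard library. This is not how Google Docs implements importXML; the names HrefCollector and extract_hrefs and the sample HTML are made up for illustration, and the sketch works on an inline string rather than fetching a live page.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag, roughly like the XPath //a/@href."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html_text):
    """Return all link URLs found in html_text, in document order."""
    collector = HrefCollector()
    collector.feed(html_text)
    collector.close()
    return collector.hrefs


# Hypothetical sample page, standing in for a downloaded search results page
SAMPLE_HTML = """
<html><body>
  <a href="http://example.com/one">One</a>
  <p>No link here</p>
  <a name="anchor-only">No href</a>
  <a href="http://example.com/two">Two</a>
</body></html>
"""

if __name__ == "__main__":
    for url in extract_hrefs(SAMPLE_HTML):
        print(url)
```

Tags without an href (like the named anchor in the sample) are skipped, which matches what the attribute query does: it only returns href values that actually exist.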

 

You can see the resulting spreadsheet here:

http://spreadsheets.google.com/pub?key=pS9vSxWuwOllXHdueY0TDdg

The same spreadsheet can also be shown inside a page using the Embed function.

 

