Web Crawling Automation

Apart from the various ways you can use Perl or other scripting languages for automated web crawling, this is a relatively low-technology solution for people who want to download web pages or web data. Some people also call it web scraping.

 

The first method is to use the RCurl package (from the R-help archives).

The R-help list archive is also found here: http://www.nabble.com/R-help-f13820.html

 

> library(RCurl)
> my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2"
> getURL(my.url)

A variation is the following line of code, which tells curl to follow HTTP redirects:

getURL(my.url, followlocation = TRUE)

To see the information being sent from R and received by R from the server, add verbose = TRUE:

getURL(my.url, verbose = TRUE)
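
Building on the getURL calls above, one possible way to download several pages in one go is sketched below. The helper name fetch_pages, its pause argument and the swappable fetch function are assumptions made for illustration; they are not part of RCurl. The default fetcher needs the RCurl package to be installed.

```r
# Hypothetical helper (not part of RCurl): fetch several pages and
# collect the raw HTML in a list named by URL.
# `fetch` is any function that takes a URL and returns a character string;
# by default it is RCurl::getURL.
fetch_pages <- function(urls, fetch = RCurl::getURL, pause = 1) {
  pages <- list()
  for (u in urls) {
    pages[[u]] <- tryCatch(
      fetch(u),
      error = function(e) {
        # Keep going if one page fails; record NA for that URL.
        warning("Could not fetch ", u, ": ", conditionMessage(e))
        NA_character_
      }
    )
    Sys.sleep(pause)  # wait between requests so the server is not flooded
  }
  pages
}
```

Calling fetch_pages(my.url) should return a list with one element named after the URL. To follow redirects, pass fetch = function(u) RCurl::getURL(u, followlocation = TRUE).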

The second method is to use the RDCOMClient package in R, which drives Internet Explorer through COM on Windows:

> library(RDCOMClient)
> my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2"
> ie <- COMCreate("InternetExplorer.Application")
> txt <- list()
> ie$Navigate(my.url)
NULL
> while(ie[["Busy"]]) Sys.sleep(1)
> txt[[my.url]] <- ie[["document"]][["body"]][["innerText"]]
> txt
$`http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2`

[1] "Skip to article Try Electronic Edition Log …

The third way (a personal favorite) is to use the Firefox add-on iMacros from www.iopus.com, if you need to extract huge amounts of data and copy and paste it into text and Excel files. The add-on works almost the same way as the Record Macro feature in Excel, with the difference that it records all the clicks, downloads, URLs, etc. from the browser.

It can even automate website testing, and data entry tasks.

While the Firefox add-on is free, the Internet Explorer version costs 49 USD.

Happy Republic Day

India celebrates the 59th year of being a Republic (it took us 3 years to write, debate and finalize our constitution, until 1950, when we finally got the Constitution and a republican democracy with the President as a figurehead).

59 years,

One billion people,

300 million slum dogs,

many billions of software exports,

more billions of oil imports,

one cricket world cup,

one high-tech unmanned mission to the Moon from us and

many low tech manned terrorist strikes inflicted on us…

later-

The Indian Republic still stands as the only democracy in its neighborhood with substantial secular rights to minority religions and viewpoints.

May the republic still shine- freely.

Amen.

Tweet-Updated-Using Twitter for better Marketing

A relatively late entrant to the www.twitter.com phenomenon, I started uploading my blog posts on my Twitter account. Here are some insights which I saw in action; maybe they are common knowledge, but here goes:

1) Twitter automatically converts links into www.tinyurl.com links, so it shortens even the longest link that you have.

2) Uploading your address book, including anyone who ever wrote an email to you as part of a discussion or reading group, takes a tiny amount of time. Then click follow all (or at least those matching a particular profile, here analytics and data) and you are off.

3) Twitter manners seem to consider it customary to follow people who are following you. Thus an audience or initial leads are assured. For the rest, content is king.

4) Reading tweets (or Twitter messages) is a great break, as it gives you real-time insight into what is happening in your domain, or among people who share your profession or personal profile. However, writing personal tweets takes time, and a healthy dose of self-love.

5) Twitter is free. And there are enough Twitter tools to ensure it gets updated from your RSS feed automatically, so it is one more tool to ensure publicity for yourself or your organization.

6) Search for people giving or receiving the same services as you provide, to maximize the response from your target audience.

7) Link up your Facebook account and your Yahoo Instant Messenger with Twitter, using applications built exactly for this.

No, LinkedIn does not have a Twitter app, but that should change soon.

 

8) Watch out for useless spam from people whom you don’t know well. Spamming, or just being reported, leads to suspended accounts and much useless grief.

Happy twittering with tweets on www.twitter.com ( ..what a tongue twister !!)

 

And an update from my favorite tech blog http://bits.blogs.nytimes.com/

Starbucks dishes out updates on special offers and nutritional and store information using Twitter. The online retailer Zappos, Comcast and Southwest Airlines have also created official accounts on Twitter to interact with consumers and respond directly to complaints.

Bank of America’s Twitter stream is maintained by David Knapp, a representative in Phoenix.

And why is http://bits.blogs.nytimes.com/ my favorite?

It shows that blogs with a better command of English than of technology make better reading than blogs with a superb grasp of technology but not of English.

 

In case you want to say hi/ tweet/shout ……..this is where, my twitter sit ’er

http://twitter.com/decisionstats

R in a CorpoRate Environment

Any concerns about using R in a corporate environment, especially for compliance reasons, can be mitigated by reading the following documents.

R: Regulatory Compliance and Validation Issues A Guidance Document for the Use of R in Regulated Clinical Trial Environments

and

Keeling & Pavur’s "A comparative study of the reliability of nine statistical software packages", Computational Statistics & Data Analysis, Vol. 51, pp. 3811-3831, May 1, 2007.

 

Thanks to Bob for pointing this out on the R-Help list.


Slum Dogs Come

Young slum dogs chipping away,

writing code, plugging away.

Take the place under shiny sun someday,

Slum puppies won’t go away.

 

You let them in,

They are hungry for more, they stay,

Nobody ever gave them a break on the way,

Grew up fast, slum childhood wasn’t child’s play.

 

 

Still here they are firing away,

Full steam ahead, and

Damn, no torpedoes to dissuade.

Before you could pause, object

Cut them short saying Boy hey.

 

Slum dog walks away,

In his teeth, the shiny bone of the day.

Blood on his fur, it’s there

Long enough to stay.

 

The dog’s seen much worse,

Much tougher days.

His brain the only weapon,

he chooses to play.

 

Brain red hot, it keeps firing away.

That dog won’t roll down, play dead, no way.

Been through much pain already this way,

Now numb, The Slum Dogs come here to stay.

Give yourself a Tax Rebate: Google Docs and other stuff you already knew

If I remember correctly, the last time the US government mailed checks to many people, the tax rebate was as low as $300. You can save yourself much more than that by doing the following:

1) Switch to Ubuntu Linux at http://www.ubuntu.com/products/WhatIsUbuntu/desktopedition

2) Use only Google Docs from http://docs.google.com (keep data securely online) and OpenOffice (which comes with Ubuntu above, or from http://download.openoffice.org/)

3) Use a trusted antivirus solution from AVG (http://free.avg.com/). Hesitant? Well, it happens to be the most downloaded software on CNET’s Download.com.

4) Insist on this free software with your IT department and at your store, even if your new laptop or PC comes bundled with other software. Those costs are embedded within your hardware costs.

5) Start using more Amazon EC2 if you are a large data user at the office.

6) Use R for analytics work instead of the hugely expensive closed-source analytical programs. Here is an easy-to-learn GUI: http://www.rattle.togaware.com. See the book on that in the right sidebar or at www.rforsasandspssusers.com.

Chances are you just saved yourself more than $1000 per head by doing this. If you used options 5 and 6, the savings could be even more substantial, running into tens of thousands of dollars.

If you have to CHOOSE between saving costs, maybe saving your job or even your subordinates’ jobs, OR making Bill Gates richer so he can give YOUR money away to charity, what would you choose? The time is RIGHT NOW.

The declining relevance of LinkedIn

I can still remember two years ago, when a friend and erstwhile client from the United States sent me a link to www.LinkedIn.com. While social media is all the rage today, back in early 2006 LinkedIn redefined social networking, from chatting with teenagers to actual value delivered to customers. Over time both my network and LinkedIn grew: my network now has 6200 members, Decision Stats on LinkedIn has 570+ members, and LinkedIn has 30 million people and a reported 1 billion dollar valuation.

 

Yet life on LinkedIn has been slowly losing its interest, to the point where it is now just a directory service of contacts.


Some reasons for the declining relevance of LinkedIn are:

1) Average user interface updates: While www.Facebook.com successfully moved 70 million users to a new look, the UI at LI leaves some things to be desired. Glitches include a slower than promised rollout of third-party applications, plenty of bugs in the way you update your status and remove connections, and the new, cluttered home page.

2) Thrust on groups rather than content (Questions and Answers): Q and A at LI was a great interactive feature, as people posted and answered interesting questions. Its focus has been reduced by the group discussion features, which are a halfway effort between free-for-all discussions and a group newsletter. Many successful LI groups made the transition to being full communities in their own right, mostly using www.ning.com. LI was also unable to capture the whole value chain of engaged communities, because groups have no newsletter function and group owners cannot customize much.

3) Top-down user limits: Limits of at most 50 groups and at most 3000 invites meant that LI was slowly punishing active users more than controlling spam. The Open Networkers movement (people who network openly with everyone) was neither predicted nor monetized well by LI.

4) Other reasons for the decline are the inability to monetize recruiters fully (they exist and flourish thanks to LI’s inability to channel them into a paying medium), the inability to cut down on spam (which now exists in much bigger volumes due to the bigger user base), and the refusal to create connection-specific privacy (as in Facebook, which lets you set levels of privacy display for your connections).

LI has been a pioneer not only in professional networking but also in using non-ad pricing strategies to keep a steady cash flow. Some new features like LinkedIn Polls are promising, and hopefully the next generation of third-party applications will make the site interesting again.

So there is hope it will get its act together again. However, in a very competitive online ad market, time and speed of reaction are critical. LI does have the first-mover advantage, but it can lose relevance just as Lycos and Yahoo did if it changes more slowly than users want. The current recession is an opportunity for communities like LI to tap into the recruiting market and also to focus on owning, creating, or at least enabling relevant content for users to read and share.