A year ago while working as a virtual research assistant to Dr Vincent Granville( of Analyticbridge.com and who signed my recommendation form for University of Tennessee) I helped download almost 22000 records of almost all the statisticians and economists of the world. This included databases like American Statistical Association and Royal Society ( ASA, ACME, RS etc).
After joining University of Tennessee, i sent a sample of code and database with me by email to two professors ( one a fellow of ASA and the other an expert into internet protocols to make it an academic paper except they did not know any journal or professor who knew stuff on data scraping 😦 )
I am publishing this now in the hope they would have plugged the gap before someone gets that kind of database and exploits for spamming or commercial mal use.
The weak link was once you were in the database using a valid login and password, you can use automated HTML capture to basically do a lot of data scraping using the iMacro macro or Firefox Plugin. Since the login were done on Christmas Eve and during year end- this also used the fact that admins were likely to overlook into analytical logs ( if they had software like clicky or were preserving logs).
Here is the code that was used for scraping the whole database for ASA ( Note the scraping was not used by me- it was sent to Dr Granville and this was an academic research project).
See complete code here- http://docs.google.com/View?id=dcvss358_335dg2xmdcp
1) Use Firefox Browser ( or Download from http://www.mozilla.com/en-US/firefox/ )
2) Install IMacros from https://addons.mozilla.org/en-US/firefox/addon/3863
3) Use the following code, paste in a notepad file and save as “macro1.iim”.
VERSION BUILD=6111213 RECORDER=FX
Note the ‘ prefix denotes commented out code
‘AUTOMATED ENTRY INTO WEBSITE IN CORRECT POSITION
TAB T=1
‘URL GOTO=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
‘TAG POS=1:TEXT FORM=NAME:frmLogin ATTR=NAME:txtUser CONTENT=USERNAME
‘SET !ENCRYPTION NO
‘TAG POS=1:PASSWORD FORM=NAME:frmLogin ATTR=NAME:txtPassword CONTENT=USERPASSWORD
‘TAG POS=1:SUBMIT FORM=NAME:frmLogin ATTR=NAME:btnSubmit&&VALUE:Login
‘TAG POS=1 ATTR=ID:el34
‘ENTER FORM INPUTS
‘TAG POS=1 FORM=NAME:frmSearch ATTR=NAME:txtState CONTENT=%CA
‘TAG POS=1:TEXT FORM=NAME:frmSearch ATTR=NAME:txtName CONTENT=b
‘TAG POS=1:SUBMIT FORM=NAME:frmSearch ATTR=NAME:btnSubmit&&VALUE:Submit
SET !ERRORIGNORE YES
SET !EXTRACT_TEST_POPUP NO
SET !LOOP 1
SET !ERRORIGNORE YES
SET !EXTRACT_TEST_POPUP NO
TAG POS=1 ATTR=TXT:Name
TAG POS=R{{!LOOP}} ATTR=HREF:* EXTRACT=HREF
SET !VAR1 {{!EXTRACT}}
‘PROMPT {{!EXTRACT}}
URL GOTO={{!VAR1}}
TAG POS=1 ATTR=TXT:Name
TAG POS=R1 ATTR=TXT:* EXTRACT=TXT
TAG POS=1 ATTR=TXT:Email
TAG POS=R1 ATTR=TXT:* EXTRACT=TXT
‘PROMPT {{!EXTRACT}}
BACK
SAVEAS FOLDER=* FILE=*
4) The code should be run after logging in and after giving inputs for name (use wild card of a single alphabet say a) and state from drop down
5) Click submit to get number of records
6)Click on the IOpus Macro button next to address bar in Firefox and load the macro file above
7) Run macro ( Click on run loop button from 1 to X where X is number of records returned in step5.
Repeat Steps 4 to 7 till a single State ( which is the group by variable here ) is complete.
8) Go to C:\Documents and Settings\admin\My Documents\iMacros\Downloads (Check this from IMacros settings and options in your installation)
9) Rename the file index as “state.csv”
10) Open CSV file
11) Use the following Office 2003 Macro to clean the file
Sub Macro1()
‘
‘ Macro1 Macro
‘ Macro recorded 12/22/2008 by ajay
‘
‘ Keyboard Shortcut: Ctrl+q
‘
Cells.Select
Selection.Replace What:=”#NEWLINE#”, Replacement:=””, LookAt:=xlPart, _
SearchOrder:=xlByRows, MatchCase:=False, SearchFormat:=False, _
ReplaceFormat:=False
Columns(“B:B”).Select
Selection.TextToColumns Destination:=Range(“B1”), DataType:=xlDelimited, _
TextQualifier:=xlDoubleQuote, ConsecutiveDelimiter:=True, Tab:=True, _
Semicolon:=False, Comma:=False, Space:=False, Other:=False, FieldInfo _
:=Array(Array(1, 9), Array(2, 1)), TrailingMinusNumbers:=True
Columns(“C:C”).Select
Selection.TextToColumns Destination:=Range(“C1”), DataType:=xlDelimited, _
TextQualifier:=xlDoubleQuote, ConsecutiveDelimiter:=True, Tab:=True, _
Semicolon:=False, Comma:=False, Space:=False, Other:=False, FieldInfo _
:=Array(Array(1, 9), Array(2, 1)), TrailingMinusNumbers:=True
Columns(“B:B”).ColumnWidth = 23.71
Columns(“A:A”).EntireColumn.AutoFit
ActiveWindow.SmallScroll Down:=9
ActiveWorkbook.Save
End Sub
12) In case you have Office 2007 Use The Record Macro feature to create your unique Macro in your personal Macro Workbook, basically replacing all #NEWFILE# with space (using Ctrl+H) and using Text to columns for column 2 and column 3, with type delimited,next, treat successive delimiters as one (check box),next,do not import first column (BY selecting that column”)
13) To append lots of files into 1 file use the following R Commands
Download R from www.r-project.org
>setwd(“C:\\Documents and Settings\\admin\\My Documents\\iMacros\\Downloads”)
Note this is the same folder as in Step 8 above
>list.files(path = “.”, pattern = NULL, all.files = FALSE, full.names = FALSE,
+ recursive = FALSE, ignore.case = FALSE)
The R output is something like below
> list.files(path = “.”, pattern = NULL, all.files = FALSE, full.names = FALSE, + recursive = FALSE, ignore.case = FALSE) [1] “Automation Robot – Documents – Office Live Workspace” “Book1.xls” [3] “cala.csv” “calb.csv” [5] “calc.csv” “cald.csv” [7] “cale.csv” “calf.csv” [9] “calg.csv” “calh.csv” [11] “cali.csv” “calj.csv” [13] “calk.csv” “call.csv” [15] “calm.csv” “caln.csv” [17] “calo.csv” “calp.csv” [19] “calq.csv” “calr.csv” [21] “cals.csv” “calt.csv” [23] “calu.csv” “calv.csv” [25] “calw.csv” “calx.csv” [27] “caly.csv” “calz.csv” [29] “cola.csv” “colac.csv” [31] “colad.csv” “colae.csv” [33] “colaf.csv” “colag.csv” [35] “coloa.csv” “colob.csv” [37] “index” “login” > file.append(“coloa.csv”,”colob.csv”) [1] TRUE > file.append(“coloa.csv”,”colac.csv”) [1] TRUE > file.append(“coloa.csv”,”colad.csv”) [1] TRUE > file.append(“coloa.csv”,”colae.csv”) [1] TRUE > file.append(“coloa.csv”,”colaf.csv”) [1] TRUE > file.append(“coloa.csv”,”colag.csv”) [1] TRUE > file.append(“cala.csv”,”calb.csv”) [1] TRUE > file.append(“cala.csv”,”calc.csv”) [1] TRUE > file.append(“cala.csv”,”cald.csv”) [1] TRUE > file.append(“cala.csv”,”cale.csv”) [1] TRUE > file.append(“cala.csv”,”calf.csv”) [1] TRUE > file.append(“cala.csv”,”calg.csv”) [1] TRUE > file.append(“cala.csv”,”calh.csv”) [1] TRUE > file.append(“cala.csv”,”cali.csv”) [1] TRUE > file.append(“cala.csv”,”calj.csv”) [1] TRUE > file.append(“cala.csv”,”calk.csv”) [1] TRUE > file.append(“cala.csv”,”call.csv”) [1] TRUE > file.append(“cala.csv”,”calm.csv”) [1] TRUE > file.append(“cala.csv”,”caln.csv”) [1] TRUE > file.append(“cala.csv”,”calo.csv”) [1] TRUE > file.append(“cala.csv”,”calp.csv”) [1] TRUE > file.append(“cala.csv”,”calq.csv”) [1] TRUE > file.append(“cala.csv”,”calr.csv”) [1] TRUE > file.append(“cala.csv”,”cals.csv”) [1] TRUE > file.append(“cala.csv”,”calt.csv”) [1] TRUE > file.append(“cala.csv”,”calu.csv”) [1] TRUE > file.append(“cala.csv”,”calv.csv”) [1] TRUE > file.append(“cala.csv”,”calw.csv”) [1] TRUE > file.append(“cala.csv”,”calx.csv”) [1] TRUE > file.append(“cala.csv”,”caly.csv”) [1] TRUE > file.append(“cala.csv”,”calz.csv”) [1] TRUE
ACTUAL EXECUTION TIME REVISED MACRO
This uses multiple tabs ( using TAB T=1 and TAB T=2) to switch between Tabs. Thus you can search for a big name in Tab 1 , while Tab 2 consists of the details of the table components ( here Name and Email positioned relatively)
Execution of Loop is by the Loop Button on IMacros
VERSION BUILD=6111213 RECORDER=FX TAB T=1 SET !LOOP This sets Initial value of Loop to start from Value=1 SET !ERRORIGNORE YES Setting Errors to be Ignored ( Like in cases when Email is not present ) and thus resume the rest of code SET !EXTRACT_TEST_POPUP NO Setting Popups to be disabled. Note Popups are useful while creating the code, but reduce execution time. TAG POS=1 ATTR=TXT:Name TAG POS=R{{!LOOP}} ATTR=HREF:* EXTRACT=HREF Note here the extratced value takes position of the link (HREF) positioned at (R1) Row 1(from Loop) using the reference from Text ( In Strong) Name SET !VAR1 {{!EXTRACT}} Passing Value of Extract to the new variable var2. TAB T=2 Creating a new tab in Firefox within same window URL GOTO={{!VAR1}} Going to the new URL (which is the link of the table constituent – referenced by its name) TAG POS=1 ATTR=TXT:Name TAG POS=R1 ATTR=TXT:* EXTRACT=TXT Extracting Name TAG POS=1 ATTR=TXT:Email TAG POS=R1 ATTR=TXT:* EXTRACT=TXT Extracting Email ‘ONDIALOG POS=1 BUTTON=OK CONTENT= Commented out section- Used when Firefox gives a message to resubmit the data TAB T=1 Back to Tab 1 or where Form Inputs Search are present ‘BACK Commented out , instead of using back in same tab, we are moving across tabs to avoid submitting the search again and again SAVEAS FOLDER=* FILE=* Downloading the data into default folder, default format(File)Back to same Steps (Click here)
If you are interested in knowing more you can see the Google Docs
http://docs.google.com/View?id=dcvss358_335dg2xmdcp
cough
and I am also writing my thesis right now