Home » Internet » Weak Security in Internet Databases for Statisticians

Weak Security in Internet Databases for Statisticians

Software

R in the Cloud

Train in R

A year ago while working as a virtual research assistant to Dr Vincent Granville( of Analyticbridge.com and who signed my recommendation form for University of Tennessee) I helped download almost 22000 records of almost all the statisticians and economists of the world. This included databases like American Statistical Association and Royal Society ( ASA, ACME, RS etc).

After joining University of Tennessee, i sent a sample of code and database with me by email  to two professors ( one a fellow of ASA and the other an expert into internet protocols to make it an academic paper except they did not know any journal or professor who knew stuff on data scraping :( )

I am publishing this now in the hope they would have plugged the gap before someone gets that kind of database and exploits for spamming or commercial mal use.

The weak link was once you were in the database using a valid login and password, you can use automated HTML capture to basically do a lot of data scraping using the iMacro macro or Firefox Plugin. Since the login were done on Christmas Eve and during year end- this also used the fact that admins were likely to overlook into analytical logs ( if they had software like clicky or were preserving logs).

Here is the code that was used for scraping the whole database for ASA ( Note the scraping was not used by me- it was sent to Dr Granville and this was an academic research project).

See complete code here- http://docs.google.com/View?id=dcvss358_335dg2xmdcp

1) Use Firefox Browser  ( or Download from  http://www.mozilla.com/en-US/firefox/ )

2) Install  IMacros from https://addons.mozilla.org/en-US/firefox/addon/3863

3) Use the following code, paste in a notepad file and save as “macro1.iim”.

VERSION BUILD=6111213 RECORDER=FX

Note the ‘ prefix denotes commented out code

‘AUTOMATED ENTRY INTO WEBSITE IN CORRECT POSITION

TAB T=1

‘URL GOTO=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

‘TAG POS=1:TEXT FORM=NAME:frmLogin ATTR=NAME:txtUser CONTENT=USERNAME

‘SET !ENCRYPTION NO

‘TAG POS=1:PASSWORD FORM=NAME:frmLogin ATTR=NAME:txtPassword CONTENT=USERPASSWORD

‘TAG POS=1:SUBMIT FORM=NAME:frmLogin ATTR=NAME:btnSubmit&&VALUE:Login

‘TAG POS=1 ATTR=ID:el34

 

‘ENTER FORM INPUTS

‘TAG POS=1 FORM=NAME:frmSearch ATTR=NAME:txtState CONTENT=%CA

‘TAG POS=1:TEXT FORM=NAME:frmSearch ATTR=NAME:txtName CONTENT=b

‘TAG POS=1:SUBMIT FORM=NAME:frmSearch ATTR=NAME:btnSubmit&&VALUE:Submit

‘END FORM INPUTS

SET !ERRORIGNORE YES

SET !EXTRACT_TEST_POPUP NO

SET !LOOP 1

SET !ERRORIGNORE YES

SET !EXTRACT_TEST_POPUP NO

TAG POS=1 ATTR=TXT:Name

TAG POS=R{{!LOOP}} ATTR=HREF:* EXTRACT=HREF

SET !VAR1 {{!EXTRACT}}

‘PROMPT {{!EXTRACT}}

URL GOTO={{!VAR1}}

TAG POS=1 ATTR=TXT:Name

TAG POS=R1 ATTR=TXT:* EXTRACT=TXT

TAG POS=1 ATTR=TXT:Email

TAG POS=R1 ATTR=TXT:* EXTRACT=TXT

‘PROMPT {{!EXTRACT}}

 

BACK

SAVEAS FOLDER=* FILE=*

4) The code should be run after logging in and after giving inputs for name (use wild card of a single alphabet say a)  and state  from drop down

5) Click submit to get number of records

6)Click on the IOpus Macro button next to address bar in Firefox  and load the macro file above

7) Run macro ( Click on run loop button from 1 to X where X is number of records returned in step5.

Repeat Steps 4 to 7 till a single State ( which is the group by variable here ) is complete.

8) Go to  C:\Documents and Settings\admin\My Documents\iMacros\Downloads (Check this from IMacros settings and options in your installation)

9) Rename the file index as “state.csv”

10) Open CSV file

11) Use the following Office 2003 Macro to clean the file

Sub Macro1()

‘ Macro1 Macro

‘ Macro recorded 12/22/2008 by ajay

‘ Keyboard Shortcut: Ctrl+q

Cells.Select

Selection.Replace What:=”#NEWLINE#”, Replacement:=””, LookAt:=xlPart, _

SearchOrder:=xlByRows, MatchCase:=False, SearchFormat:=False, _

ReplaceFormat:=False

Columns(“B:B”).Select

Selection.TextToColumns Destination:=Range(“B1″), DataType:=xlDelimited, _

TextQualifier:=xlDoubleQuote, ConsecutiveDelimiter:=True, Tab:=True, _

Semicolon:=False, Comma:=False, Space:=False, Other:=False, FieldInfo _

:=Array(Array(1, 9), Array(2, 1)), TrailingMinusNumbers:=True

Columns(“C:C”).Select

Selection.TextToColumns Destination:=Range(“C1″), DataType:=xlDelimited, _

TextQualifier:=xlDoubleQuote, ConsecutiveDelimiter:=True, Tab:=True, _

Semicolon:=False, Comma:=False, Space:=False, Other:=False, FieldInfo _

:=Array(Array(1, 9), Array(2, 1)), TrailingMinusNumbers:=True

Columns(“B:B”).ColumnWidth = 23.71

Columns(“A:A”).EntireColumn.AutoFit

ActiveWindow.SmallScroll Down:=9

ActiveWorkbook.Save

End Sub

 

12) In case you have Office 2007 Use The Record Macro feature to create your unique Macro in your personal Macro Workbook, basically replacing all #NEWFILE# with space (using Ctrl+H) and using Text to columns for column 2 and column 3, with type delimited,next, treat successive delimiters as one (check box),next,do not import first column (BY selecting that column”)

13) To append lots of files into 1 file use the following R Commands

 

Download R from www.r-project.org

 

>setwd(“C:\\Documents and Settings\\admin\\My Documents\\iMacros\\Downloads”)

Note this is the same folder as in Step 8 above

>list.files(path = “.”, pattern = NULL, all.files = FALSE, full.names = FALSE,

+     recursive = FALSE, ignore.case = FALSE)

 

The R output is something like below

 

 

> list.files(path = “.”, pattern = NULL, all.files = FALSE, full.names = FALSE,  +     recursive = FALSE, ignore.case = FALSE)  [1] “Automation Robot – Documents – Office Live Workspace” “Book1.xls”                                             [3] “cala.csv”                                             “calb.csv”                                              [5] “calc.csv”                                             “cald.csv”                                              [7] “cale.csv”                                             “calf.csv”                                              [9] “calg.csv”                                             “calh.csv”                                             [11] “cali.csv”                                             “calj.csv”                                             [13] “calk.csv”                                             “call.csv”                                             [15] “calm.csv”                                             “caln.csv”                                             [17] “calo.csv”                                             “calp.csv”                                             [19] “calq.csv”                                             “calr.csv”                                             [21] “cals.csv”                                             “calt.csv”                                             [23] “calu.csv”                                             “calv.csv”                                             [25] “calw.csv”                                             “calx.csv”                                             [27] “caly.csv”                                             “calz.csv”                                             [29] “cola.csv”                                             “colac.csv”                                            [31] “colad.csv”                                            “colae.csv”                                            [33] “colaf.csv”                                            “colag.csv”                                            [35] “coloa.csv”                                            “colob.csv”                                            [37] “index”                                                “login”                                                > file.append(“coloa.csv”,”colob.csv”) [1] TRUE > file.append(“coloa.csv”,”colac.csv”) [1] TRUE > file.append(“coloa.csv”,”colad.csv”) [1] TRUE > file.append(“coloa.csv”,”colae.csv”) [1] TRUE > file.append(“coloa.csv”,”colaf.csv”) [1] TRUE > file.append(“coloa.csv”,”colag.csv”) [1] TRUE > file.append(“cala.csv”,”calb.csv”) [1] TRUE > file.append(“cala.csv”,”calc.csv”) [1] TRUE > file.append(“cala.csv”,”cald.csv”) [1] TRUE > file.append(“cala.csv”,”cale.csv”) [1] TRUE > file.append(“cala.csv”,”calf.csv”) [1] TRUE > file.append(“cala.csv”,”calg.csv”) [1] TRUE > file.append(“cala.csv”,”calh.csv”) [1] TRUE > file.append(“cala.csv”,”cali.csv”) [1] TRUE > file.append(“cala.csv”,”calj.csv”) [1] TRUE > file.append(“cala.csv”,”calk.csv”) [1] TRUE > file.append(“cala.csv”,”call.csv”) [1] TRUE > file.append(“cala.csv”,”calm.csv”) [1] TRUE > file.append(“cala.csv”,”caln.csv”) [1] TRUE > file.append(“cala.csv”,”calo.csv”) [1] TRUE > file.append(“cala.csv”,”calp.csv”) [1] TRUE > file.append(“cala.csv”,”calq.csv”) [1] TRUE > file.append(“cala.csv”,”calr.csv”) [1] TRUE > file.append(“cala.csv”,”cals.csv”) [1] TRUE > file.append(“cala.csv”,”calt.csv”) [1] TRUE > file.append(“cala.csv”,”calu.csv”) [1] TRUE > file.append(“cala.csv”,”calv.csv”) [1] TRUE > file.append(“cala.csv”,”calw.csv”) [1] TRUE > file.append(“cala.csv”,”calx.csv”) [1] TRUE > file.append(“cala.csv”,”caly.csv”) [1] TRUE > file.append(“cala.csv”,”calz.csv”) [1] TRUE

ACTUAL EXECUTION TIME REVISED MACRO

 

This uses multiple tabs ( using TAB T=1 and TAB T=2) to switch between Tabs. Thus you can search for a big name in Tab 1 , while Tab 2 consists of the details of the table components ( here Name and Email positioned relatively)

 

Execution of Loop is by the Loop Button on IMacros

 

 

VERSION BUILD=6111213 RECORDER=FX TAB T=1 SET !LOOP  This sets Initial value of Loop to start from Value=1 SET !ERRORIGNORE YES Setting Errors to be Ignored ( Like in cases when Email is not present ) and thus resume the rest of code SET !EXTRACT_TEST_POPUP NO Setting Popups to be disabled. Note Popups are useful while creating the code, but reduce execution time. TAG POS=1 ATTR=TXT:Name TAG POS=R{{!LOOP}} ATTR=HREF:* EXTRACT=HREF Note here the extratced value takes position of the link (HREF) positioned at (R1) Row 1(from Loop) using the reference from Text ( In Strong) Name SET !VAR1 {{!EXTRACT}} Passing Value of Extract to the new variable var2. TAB T=2 Creating a new tab in Firefox within same window  URL GOTO={{!VAR1}} Going to the new URL (which is the link of the table constituent – referenced by its name) TAG POS=1 ATTR=TXT:Name TAG POS=R1 ATTR=TXT:* EXTRACT=TXT Extracting Name TAG POS=1 ATTR=TXT:Email TAG POS=R1 ATTR=TXT:* EXTRACT=TXT Extracting Email ‘ONDIALOG POS=1 BUTTON=OK CONTENT= Commented out section- Used when Firefox gives a message to resubmit the data TAB T=1 Back to Tab 1 or where Form Inputs Search are present ‘BACK Commented out , instead of using back in same tab, we are moving across tabs to avoid submitting the search again and again SAVEAS FOLDER=* FILE=* Downloading the data into default folder, default format(File)Back to same Steps (Click here)

If you are interested in knowing more you can see the Google Docs

http://docs.google.com/View?id=dcvss358_335dg2xmdcp

 

 


2 Comments

  1. Ajay says:

    and I am also writing my thesis right now

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Conferences

Predictive Analytics- The Book

Books

Follow

Get every new post delivered to your Inbox.

Join 845 other followers