Google stuck on Gears

Google has launched Android, its mobile operating system, but forgot to include support for Gears in its own browser, Chrome. At least, if you can support Internet Explorer and Firefox for Gears, surely you can add Gears support for Chrome. Maybe with an ad or two 😉. Since Al Gore invented the internet and sits as a consultant for the California boys, maybe he can also advise them on the antitrust investigations with Apple (cough).

Redlining in Internet Access and notes on Regression Models

This is the definition of redlining. Citation: the AD FREE Wikipedia:

Redlining is the practice of denying, or increasing the cost of, services such as banking, insurance, access to jobs,[2] access to health care,[3] or even supermarkets[4] to residents in certain, often racially determined,[5] areas. The term "redlining" was coined in the late 1960s by community activists in Chicago.[citation needed] It describes the practice of marking a red line on a map to delineate the area where banks would not invest; later the term was applied to discrimination against a particular group of people (usually by race or sex) no matter the geography.

As of today, redlining in financial services is outlawed by the Fair Credit Lending Act, which prohibits using variables in regression models that end up redlining districts. However, as late as 2005, redlining was used in auto insurance through suitably disguised zip9 variables (I carried data for 55 million American citizens and 88 million accounts for a major North American automotive insurance provider as part of an offshoring contract from Atlanta, GA in 2005).

It exists today through informal arrangements between internet service providers who carve up territories and districts. Internet access redlining is still not illegal. This is especially true in Austin (I traveled there as a consultant last year) and in Knoxville, Tennessee, where I still study as a grad student.

Neither are suitably proprietary insurance and health care claim denial models used for minimizing litigation risk. Litigation risk minimization is the next level of retail logistic regression modeling, just as predictive modeling was used by political consultants during elections.
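To see how a "suitably disguised" geographic variable can stand in for a protected attribute, here is a minimal sketch on purely synthetic data (all numbers are illustrative assumptions, not taken from any real insurer):

```python
import random

random.seed(1)

# Synthetic population: zip_flag leaks the protected attribute 80% of the time.
population = []
for _ in range(10_000):
    protected = random.random() < 0.5
    zip_flag = protected if random.random() < 0.8 else not protected
    population.append((protected, zip_flag))

def denial_rate(group):
    # a "model" that denies service purely on the zip-derived flag
    members = [p for p in population if p[0] == group]
    return sum(1 for p in members if p[1]) / len(members)

# the geographic proxy reproduces disparate denial rates (roughly 0.8 vs 0.2)
print(denial_rate(True), denial_rate(False))
```

The point of the sketch: no protected variable ever enters the "model", yet denial rates differ sharply by group, which is exactly why regulators look at outcomes and not just inputs.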

Open Source Webinar with AsterData

Learn how to make money from open source databases, some business intelligence, and more business analytics in this webinar here.

FCC Disclaimer (even though it is one day before the rules for bloggers come into effect):

AsterData is an advertiser on this blog. See the ad on right.

MapReduce was published by Google in 2004 as a way to do big data crunching faster.
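As a minimal sketch of the MapReduce idea (plain single-machine Python, no relation to Google's actual distributed implementation): map each record to key/value pairs, shuffle the pairs by key, then reduce all values sharing a key.

```python
from collections import defaultdict

def map_phase(records):
    # map: emit (word, 1) for every word in every record
    for record in records:
        for word in record.split():
            yield (word, 1)

def reduce_phase(pairs):
    # shuffle: group values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # reduce: sum the counts for each word
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(map_phase(["big data", "big crunching"]))
print(counts)  # {'big': 2, 'data': 1, 'crunching': 1}
```

The real system's contribution was running the map and reduce phases in parallel across thousands of machines with fault tolerance; the programming model itself is this simple.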

Google is not an advertiser nor a partner on this site. They are busy with mobile phones and advertising (like the TV series Mad Men).

And yes, Sergey Brin needs to finish his PhD too.

Ponder This: IBM Research


Ponder This Challenge:

 

What is the minimal number, X, of yes/no questions needed to find the smallest (but more than 1*) divisor of a number between 2 and 166 (inclusive)?

We are asking for the exact answer in two cases:

In the worst case, i.e., what is the smallest number X for which we can guarantee finding it in no more than X questions?

On average, i.e., assuming that the number was chosen in uniform distribution from 2 to 166 and we want to minimize the expected number of questions.

* For example, the smallest divisor of 105 is 3, and of 103 is 103.

Update (11/05): You should find the exact divisor without knowing the number, and answering “prime” is not a valid answer.
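Without giving away IBM's official answer, a lower bound for the worst case is easy to sketch: the smallest divisor greater than 1 is always the smallest prime factor, so counting the distinct possible answers gives an information-theoretic floor of ⌈log2(outcomes)⌉ yes/no questions (a rough bound only, not the solution):

```python
import math

def smallest_prime_factor(n):
    # trial division up to sqrt(n); if nothing divides, n is prime
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return n

# the footnote's examples: 105 -> 3, 103 -> 103
print(smallest_prime_factor(105), smallest_prime_factor(103))

# distinct possible answers for numbers in [2, 166]
outcomes = {smallest_prime_factor(n) for n in range(2, 167)}
print(len(outcomes))                        # 38 (the primes up to 166)
print(math.ceil(math.log2(len(outcomes))))  # so at least 6 questions in the worst case
```

Each yes/no question splits the remaining possibilities in at most two, so 38 distinct answers cannot be resolved in fewer than 6 questions; whether 6 is actually achievable is the puzzle.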

Citation-

http://domino.research.ibm.com/Comm/wwwr_ponder.nsf/pages/index.html

A maths challenge from the boys in Blue above. Also in employment news, the parent company of SPSS is opening a centre of advanced analytics right here in Washington, D.C.

WASHINGTON – 10 Nov 2009: IBM (NYSE: IBM) today announced the opening of the sixth in a network of analytics solution centers – this one dedicated to helping federal agencies and other public sector organizations extract actionable insights from their data.

The new IBM Analytics Solution Center in Washington, D.C., will draw on the expertise of more than 400 IBM professionals. These will include IBM researchers, experts in advanced software platforms, and consultants with deep industry knowledge in areas such as transportation, social services, public safety, customs and border management, revenue management, defense, logistics, healthcare and education. IBM also plans to add an additional 100 professionals, through retraining or new hiring, as demand grows.

SAS with the GUI Enterprise Guide (Updated)

Here is a slideshow I made using Google Docs (which is good, except the PDF version is much worse than Slideshare). It is on the latest R GUI, called AwkWard. It is based on the webpage here:

http://docs.google.com/View?id=dcvss358_1015frg4k8gj

In my last post on WPS, R and SAS, I had briefly shown a screenshot of SAS Enterprise Guide with a single comment on how it could do with an upgrade in its GUI. Well, it seems the upgrade has been available since March 2009, but probably not applied, since no one noticed even once in the Fall semester here in Tennessee (including people from the University who read this blog 🙂). Actually, the upgrade was made to local machines, but there is also a cloud version, which did not get the upgrade, where we can use Citrix Server to run analytics in the browser.

Here is a revised update of SAS Enterprise Guide 4.2

SAS Enterprise Guide is a Windows interface to SAS that allows for SAS programming *and* point-and-click tasks for reporting, graphs, analytics, and data filter/query/manipulation. SAS Enterprise Guide can work with SAS on your local machine, and it can connect to SAS servers on Windows, Unix/Linux, and the mainframe.

It doesn’t have decision tree support; that’s provided by a more specialized application for data mining called SAS Enterprise Miner.

And you can easily extend SAS Enterprise Guide with your own tasks. See http://support.sas.com/eguide. You do not need SAS/Toolkit. You can use off-the-shelf development tools for Microsoft .NET, including the freely available express editions of Microsoft Visual C# or Visual Basic .NET.

With credit to Chris from SAS for forwarding me the correct document and answers.

PS-
It would be great if the SAS User Conference archives used Slideshare or Google Docs (PDFs are so 90s) for displaying the documents at sascommunity.org (which took the twitter id @sascommunity after two months of requests, threats and friendly pleas from me, only to not use it actively except for one Tip of the Day tweet, sigh).

Weak Security in Internet Databases for Statisticians

A year ago, while working as a virtual research assistant to Dr Vincent Granville (of Analyticbridge.com, who signed my recommendation form for the University of Tennessee), I helped download almost 22,000 records of almost all the statisticians and economists of the world. This included databases like the American Statistical Association and the Royal Society (ASA, ACME, RS, etc.).

After joining the University of Tennessee, I sent a sample of the code and database by email to two professors (one a fellow of the ASA, the other an expert in internet protocols) to make it an academic paper, except they did not know any journal or professor who knew about data scraping 😦.

I am publishing this now in the hope that the gap has been plugged before someone builds that kind of database and exploits it for spamming or commercial misuse.

The weak link was that once you were in the database with a valid login and password, you could use automated HTML capture to do a lot of data scraping using the iMacros Firefox plugin. Since the logins were done on Christmas Eve and during year end, this also exploited the fact that admins were likely to overlook the analytics logs (if they had software like Clicky, or were preserving logs at all).

Here is the code that was used for scraping the whole database for the ASA (note: the scraping was not used by me; it was sent to Dr Granville, and this was an academic research project).

See complete code here- http://docs.google.com/View?id=dcvss358_335dg2xmdcp

1) Use Firefox Browser  ( or Download from  http://www.mozilla.com/en-US/firefox/ )

2) Install  IMacros from https://addons.mozilla.org/en-US/firefox/addon/3863

3) Use the following code: paste it into a notepad file and save as "macro1.iim".

VERSION BUILD=6111213 RECORDER=FX

Note: the ' prefix denotes commented-out code.

'AUTOMATED ENTRY INTO WEBSITE IN CORRECT POSITION
TAB T=1
'URL GOTO=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
'TAG POS=1:TEXT FORM=NAME:frmLogin ATTR=NAME:txtUser CONTENT=USERNAME
'SET !ENCRYPTION NO
'TAG POS=1:PASSWORD FORM=NAME:frmLogin ATTR=NAME:txtPassword CONTENT=USERPASSWORD
'TAG POS=1:SUBMIT FORM=NAME:frmLogin ATTR=NAME:btnSubmit&&VALUE:Login
'TAG POS=1 ATTR=ID:el34

'ENTER FORM INPUTS
'TAG POS=1 FORM=NAME:frmSearch ATTR=NAME:txtState CONTENT=%CA
'TAG POS=1:TEXT FORM=NAME:frmSearch ATTR=NAME:txtName CONTENT=b
'TAG POS=1:SUBMIT FORM=NAME:frmSearch ATTR=NAME:btnSubmit&&VALUE:Submit
'END FORM INPUTS

SET !ERRORIGNORE YES
SET !EXTRACT_TEST_POPUP NO
SET !LOOP 1
TAG POS=1 ATTR=TXT:Name
TAG POS=R{{!LOOP}} ATTR=HREF:* EXTRACT=HREF
SET !VAR1 {{!EXTRACT}}
'PROMPT {{!EXTRACT}}
URL GOTO={{!VAR1}}
TAG POS=1 ATTR=TXT:Name
TAG POS=R1 ATTR=TXT:* EXTRACT=TXT
TAG POS=1 ATTR=TXT:Email
TAG POS=R1 ATTR=TXT:* EXTRACT=TXT
'PROMPT {{!EXTRACT}}
BACK
SAVEAS FOLDER=* FILE=*

4) The code should be run after logging in and after giving inputs for name (use a wildcard of a single alphabet, say a) and state from the drop down.

5) Click Submit to get the number of records.

6) Click on the iOpus macro button next to the address bar in Firefox and load the macro file above.

7) Run the macro (click on the run loop button, from 1 to X, where X is the number of records returned in step 5).

Repeat steps 4 to 7 till a single state (which is the group-by variable here) is complete.

8) Go to C:\Documents and Settings\admin\My Documents\iMacros\Downloads (check this from the iMacros settings and options in your installation).

9) Rename the file index as "state.csv".

10) Open the CSV file.

11) Use the following Office 2003 macro to clean the file:

Sub Macro1()
' Macro1 Macro
' Macro recorded 12/22/2008 by ajay
' Keyboard Shortcut: Ctrl+q
    Cells.Select
    Selection.Replace What:="#NEWLINE#", Replacement:="", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=False, SearchFormat:=False, _
        ReplaceFormat:=False
    Columns("B:B").Select
    Selection.TextToColumns Destination:=Range("B1"), DataType:=xlDelimited, _
        TextQualifier:=xlDoubleQuote, ConsecutiveDelimiter:=True, Tab:=True, _
        Semicolon:=False, Comma:=False, Space:=False, Other:=False, _
        FieldInfo:=Array(Array(1, 9), Array(2, 1)), TrailingMinusNumbers:=True
    Columns("C:C").Select
    Selection.TextToColumns Destination:=Range("C1"), DataType:=xlDelimited, _
        TextQualifier:=xlDoubleQuote, ConsecutiveDelimiter:=True, Tab:=True, _
        Semicolon:=False, Comma:=False, Space:=False, Other:=False, _
        FieldInfo:=Array(Array(1, 9), Array(2, 1)), TrailingMinusNumbers:=True
    Columns("B:B").ColumnWidth = 23.71
    Columns("A:A").EntireColumn.AutoFit
    ActiveWindow.SmallScroll Down:=9
    ActiveWorkbook.Save
End Sub

 

12) If you have Office 2007, use the Record Macro feature to create your own macro in your Personal Macro Workbook: replace all #NEWLINE# with a space (using Ctrl+H), then use Text to Columns for column 2 and column 3, with type Delimited, treating successive delimiters as one (check box), and not importing the first column (by selecting that column).
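For those without Excel, the same cleanup step can be sketched in plain Python (assuming, as above, that the scraper left #NEWLINE# markers and tab-separated fields; the sample row below is hypothetical):

```python
def clean_row(row):
    # replace the #NEWLINE# markers with a space, then split tab-separated
    # fields, treating consecutive tabs as one delimiter
    cleaned = [cell.replace("#NEWLINE#", " ") for cell in row]
    out = []
    for cell in cleaned:
        out.extend(part for part in cell.split("\t") if part)
    return out

# hypothetical scraped row
print(clean_row(["John Doe#NEWLINE#", "Name\t\tJohn Doe", "Email\tjdoe@example.com"]))
```

This mirrors what the recorded Excel macro does (Replace plus Text to Columns) without needing Office at all.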

13) To append lots of files into one file, use the following R commands.

 

Download R from www.r-project.org

> setwd("C:\\Documents and Settings\\admin\\My Documents\\iMacros\\Downloads")

Note this is the same folder as in Step 8 above.

> list.files(path = ".", pattern = NULL, all.files = FALSE, full.names = FALSE,
+     recursive = FALSE, ignore.case = FALSE)

 

The R output is something like below:

> list.files(path = ".", pattern = NULL, all.files = FALSE, full.names = FALSE,
+     recursive = FALSE, ignore.case = FALSE)
 [1] "Automation Robot - Documents - Office Live Workspace" "Book1.xls"
 [3] "cala.csv"   "calb.csv"
 [5] "calc.csv"   "cald.csv"
 [7] "cale.csv"   "calf.csv"
 [9] "calg.csv"   "calh.csv"
[11] "cali.csv"   "calj.csv"
[13] "calk.csv"   "call.csv"
[15] "calm.csv"   "caln.csv"
[17] "calo.csv"   "calp.csv"
[19] "calq.csv"   "calr.csv"
[21] "cals.csv"   "calt.csv"
[23] "calu.csv"   "calv.csv"
[25] "calw.csv"   "calx.csv"
[27] "caly.csv"   "calz.csv"
[29] "cola.csv"   "colac.csv"
[31] "colad.csv"  "colae.csv"
[33] "colaf.csv"  "colag.csv"
[35] "coloa.csv"  "colob.csv"
[37] "index"      "login"

> file.append("coloa.csv", "colob.csv")
[1] TRUE
> file.append("coloa.csv", "colac.csv")
[1] TRUE
> file.append("coloa.csv", "colad.csv")
[1] TRUE
> file.append("coloa.csv", "colae.csv")
[1] TRUE
> file.append("coloa.csv", "colaf.csv")
[1] TRUE
> file.append("coloa.csv", "colag.csv")
[1] TRUE
> file.append("cala.csv", "calb.csv")
[1] TRUE
> file.append("cala.csv", "calc.csv")
[1] TRUE
> file.append("cala.csv", "cald.csv")
[1] TRUE
> file.append("cala.csv", "cale.csv")
[1] TRUE
> file.append("cala.csv", "calf.csv")
[1] TRUE
> file.append("cala.csv", "calg.csv")
[1] TRUE
> file.append("cala.csv", "calh.csv")
[1] TRUE
> file.append("cala.csv", "cali.csv")
[1] TRUE
> file.append("cala.csv", "calj.csv")
[1] TRUE
> file.append("cala.csv", "calk.csv")
[1] TRUE
> file.append("cala.csv", "call.csv")
[1] TRUE
> file.append("cala.csv", "calm.csv")
[1] TRUE
> file.append("cala.csv", "caln.csv")
[1] TRUE
> file.append("cala.csv", "calo.csv")
[1] TRUE
> file.append("cala.csv", "calp.csv")
[1] TRUE
> file.append("cala.csv", "calq.csv")
[1] TRUE
> file.append("cala.csv", "calr.csv")
[1] TRUE
> file.append("cala.csv", "cals.csv")
[1] TRUE
> file.append("cala.csv", "calt.csv")
[1] TRUE
> file.append("cala.csv", "calu.csv")
[1] TRUE
> file.append("cala.csv", "calv.csv")
[1] TRUE
> file.append("cala.csv", "calw.csv")
[1] TRUE
> file.append("cala.csv", "calx.csv")
[1] TRUE
> file.append("cala.csv", "caly.csv")
[1] TRUE
> file.append("cala.csv", "calz.csv")
[1] TRUE
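The same per-state append can be sketched in Python, replacing the long run of file.append calls with one glob loop (a hedged alternative; the file names follow the pattern shown above):

```python
import glob
import os
import shutil

def append_files(target, pattern):
    # concatenate every file matching pattern onto target, in sorted order
    for path in sorted(glob.glob(pattern)):
        if os.path.abspath(path) == os.path.abspath(target):
            continue  # never append the target file to itself
        with open(target, "ab") as out, open(path, "rb") as src:
            shutil.copyfileobj(src, out)

# e.g. append cala.csv .. calz.csv into a single California file:
# append_files("cala.csv", "cal?.csv")
```

The sorted glob reproduces the alphabetical append order of the R transcript, and appending in binary mode avoids mangling any odd characters in the scraped rows.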

REVISED MACRO (ACTUAL EXECUTION)

 

This uses multiple tabs (TAB T=1 and TAB T=2) to switch between tabs. Thus you can search for a big name in Tab 1, while Tab 2 contains the details of the table components (here Name and Email, positioned relatively).

 

Execution of the loop is via the Loop button in iMacros.

 

 

VERSION BUILD=6111213 RECORDER=FX
TAB T=1
SET !LOOP 1
This sets the initial value of the loop to start from Value=1.
SET !ERRORIGNORE YES
Setting errors to be ignored (as in cases where Email is not present), so the rest of the code resumes.
SET !EXTRACT_TEST_POPUP NO
Setting popups to be disabled. Note popups are useful while creating the code, but reduce execution time.
TAG POS=1 ATTR=TXT:Name
TAG POS=R{{!LOOP}} ATTR=HREF:* EXTRACT=HREF
Note here the extracted value is the link (HREF) positioned at row R1 (from the loop), using the reference from the text "Name".
SET !VAR1 {{!EXTRACT}}
Passing the value of the extract to the variable VAR1.
TAB T=2
Creating a new tab in Firefox within the same window.
URL GOTO={{!VAR1}}
Going to the new URL (which is the link of the table constituent, referenced by its name).
TAG POS=1 ATTR=TXT:Name
TAG POS=R1 ATTR=TXT:* EXTRACT=TXT
Extracting Name.
TAG POS=1 ATTR=TXT:Email
TAG POS=R1 ATTR=TXT:* EXTRACT=TXT
Extracting Email.
'ONDIALOG POS=1 BUTTON=OK CONTENT=
Commented-out section, used when Firefox gives a message to resubmit the data.
TAB T=1
Back to Tab 1, where the form input search is present.
'BACK
Commented out; instead of going back in the same tab, we move across tabs to avoid submitting the search again and again.
SAVEAS FOLDER=* FILE=*
Downloading the data into the default folder, in the default format.

If you are interested in knowing more, you can see the Google Docs:


http://docs.google.com/View?id=dcvss358_335dg2xmdcp

 

 

The declining market for Telecommunication Churn Models


Users of predictive analytics within the telecom sector can look into an interesting side effect of the iPhone/AT&T agreement. With Google also jumping into the market with its Droid, the new norm in telecom agreements is locked-in contracts for consumers. While this is permitted by the telecom regulators as fair to competition, it also means that there is very little churn within these locked-in contracts. This leads to further savings for the telecom provider, allowing higher profits and even profit sharing through price decreases,

and thus the traditional bugbear of telecom analytics, churn modeling, is slowly losing importance to plain vanilla reporting or better data mining dashboard-like solutions. Lower churn also means lower spending on analytics software to predict churn.

As competition within the 3G mobile market ramps up due to Google's entry and exclusive licensing with partners, the trend of reduced churn due to locked-in customers will likely increase. Even existing mobile providers can offer discounts to lock in customers against switching (especially in mobile markets like India, where I have personally interacted with large players like Bharti, and China, which has an even bigger mobile market).

Ergo, a lower need to buy software that predicts churn.
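A back-of-the-envelope sketch of the argument (all numbers below are illustrative assumptions, not market data): the revenue a churn model can possibly save is capped by how many customers are actually free to leave, so lock-in shrinks the addressable base and, with it, the value of the software.

```python
def addressable_churn_revenue(subscribers, monthly_churn, locked_fraction,
                              monthly_margin, months=12):
    # only unlocked subscribers can churn; this caps what any model can save
    at_risk = subscribers * (1 - locked_fraction) * monthly_churn
    return at_risk * monthly_margin * months

# hypothetical carrier: 1M subscribers, $20/month margin, 3% monthly churn
before = addressable_churn_revenue(1_000_000, 0.03, locked_fraction=0.1,
                                   monthly_margin=20)
after = addressable_churn_revenue(1_000_000, 0.03, locked_fraction=0.7,
                                  monthly_margin=20)
print(before, after)  # the addressable value shrinks as lock-in rises
```

With these made-up numbers, moving from 10% to 70% of subscribers under contract cuts the annual revenue at risk to a third, which is the budget ceiling any churn prediction tool has to justify itself against.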

See the image below from Teradata's churn model.