databases – Page 8 – DECISION STATS

Big Data and R: New Product Release by Revolution Analytics

A press release from the folks at Revolution Analytics, this time claiming to enable terabyte-level analytics with R. Interesting stuff, but the technical details are still awaited.

Revolution Analytics Brings Big Data Analysis to R

The world’s most powerful statistics language can now tackle terabyte-class data sets using Revolution R Enterprise, at a fraction of the cost of legacy analytics products

JSM 2010 – VANCOUVER (August 3, 2010) — Revolution Analytics today introduced ‘Big Data’ analysis to its Revolution R Enterprise software, taking the popular R statistics language to unprecedented new levels of capacity and performance for analyzing very large data sets. For the first time, R users will be able to process, visualize and model terabyte-class data sets in a fraction of the time of legacy products—without employing expensive or specialized hardware.

The new version of Revolution R Enterprise introduces an add-on package called RevoScaleR that provides a new framework for fast and efficient multi-core processing of large data sets. It includes:

The XDF file format, a new binary ‘Big Data’ file format with an interface to the R language that provides high-speed access to arbitrary rows, blocks and columns of data.

A collection of widely-used statistical algorithms optimized for Big Data, including high-performance implementations of Summary Statistics, Linear Regression, Binomial Logistic Regression and Crosstabs, with more to be added in the near future.

Data Reading & Transformation tools that allow users to interactively explore and prepare large data sets for analysis.

Extensibility: expert R users can develop and extend their own statistical algorithms to take advantage of Revolution R Enterprise’s new speed and scalability capabilities.
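The press release gives no technical details, but the general idea of computing summary statistics over data that is read block by block can be sketched as follows. This is a minimal Python illustration, not RevoScaleR's API and not the XDF format: the function names `iter_blocks` and `streaming_summary` are hypothetical, and the in-memory blocks merely stand in for chunks read from a large file.

```python
from typing import Iterable, Iterator, List, Tuple


def iter_blocks(values: Iterable[float], block_size: int) -> Iterator[List[float]]:
    """Yield successive blocks of at most block_size values.

    Hypothetical stand-in for reading a large file one chunk at a time.
    """
    block: List[float] = []
    for v in values:
        block.append(v)
        if len(block) == block_size:
            yield block
            block = []
    if block:
        yield block


def streaming_summary(blocks: Iterable[List[float]]) -> Tuple[int, float, float]:
    """Return (count, mean, sample variance) by merging per-block statistics.

    Only one block is held in memory at a time; the running totals are
    combined with the standard pairwise-merge formula for mean and variance.
    """
    n, mean, m2 = 0, 0.0, 0.0
    for block in blocks:
        if not block:
            continue
        bn = len(block)
        bmean = sum(block) / bn
        bm2 = sum((x - bmean) ** 2 for x in block)
        delta = bmean - mean
        total = n + bn
        mean += delta * bn / total
        m2 += bm2 + delta * delta * n * bn / total
        n = total
    variance = m2 / (n - 1) if n > 1 else float("nan")
    return n, mean, variance
```

For example, `streaming_summary(iter_blocks(range(1, 11), 3))` processes ten values in blocks of three. Because each block's statistics are independent until the merge step, the same structure could in principle be spread across cores, which is presumably the kind of design the release alludes to.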

“The R language’s inherent power and extensibility has driven its explosive adoption as the modern system for predictive analytics,” said Norman H. Nie, president and CEO of Revolution Analytics. “We believe that this new Big Data scalability will help R transition from an amazing research and prototyping tool to a production-ready platform for enterprise applications such as quantitative finance and risk management, social media, bioinformatics and telecommunications data analysis.”

Sage Bionetworks is the nonprofit force behind the open-source collaborative effort, Sage Commons, a place where data and disease models can be shared by scientists to better understand disease biology. David Henderson, Director of Scientific Computing at Sage, commented: “At Sage Bionetworks, we need to analyze genomic databases hundreds of gigabytes in size with R. We’re looking forward to using the high-speed data-analysis features of RevoScaleR to dramatically reduce the times it takes us to process these data sets.”

Take Hadoop and Other Big Data Sources to the Next Level

Revolution R Enterprise fits well within the modern ‘Big Data’ architecture by leveraging popular sources such as Hadoop, NoSQL or key value databases, relational databases and data warehouses. These products can be used to store, regularize and do basic manipulation on very large datasets—while Revolution R Enterprise now provides advanced analytics at unparalleled speed and scale: producing speed on speed.

“Together, Hadoop and R can store and analyze massive, complex data,” said Saptarshi Guha, developer of the popular RHIPE R package that integrates the Hadoop framework with R in an automatically distributed computing environment. “Employing the new capabilities of Revolution R Enterprise, we will be able to go even further and compute Big Data regressions and more.”

Platforms and Availability

The new RevoScaleR package will be delivered as part of Revolution R Enterprise 4.0, which will be available for 32- and 64-bit Microsoft Windows in the next 30 days. Support for Red Hat Enterprise Linux (RHEL 5) is planned for later this year.

On its website (http://www.revolutionanalytics.com/bigdata), Revolution Analytics has published performance and scalability benchmarks for Revolution R Enterprise analyzing a 13.2 gigabyte data set of commercial airline information containing more than 123 million rows and 29 columns.

Additionally, the company will showcase its new Big Data solution in a free webinar on August 25 at 9:00 a.m. Pacific.

Additional Resources

•      Big Data Benchmark whitepaper

•      The Revolution Analytics Roadmap whitepaper

•      Revolutions Blog

•      Download free academic copy of Revolution R Enterprise

•      Visit Inside-R.org for the most comprehensive set of information on R

•      Spread the word: Add a “Download R!” badge on your website

•      Follow @RevolutionR on Twitter

About Revolution Analytics

Revolution Analytics (http://www.revolutionanalytics.com) is the leading commercial provider of software and support for the popular open source R statistics language. Its Revolution R products help make predictive analytics accessible to every type of user and budget. The company is headquartered in Palo Alto, Calif. and backed by North Bridge Venture Partners and Intel Capital.

Media Contact

Chantal Yang
Page One PR, for Revolution Analytics
Tel: +1 415-875-7494

Email:  revolution@pageonepr.com

Weak Security in Internet Databases for Statisticians

A year ago, while working as a virtual research assistant to Dr Vincent Granville (of Analyticbridge.com, who also signed my recommendation form for the University of Tennessee), I helped download almost 22,000 records covering almost all the statisticians and economists of the world. This included databases like those of the American Statistical Association and the Royal Society (ASA, ACME, RS, etc.).

After joining the University of Tennessee, I emailed a sample of the code and the database to two professors (one a fellow of the ASA and the other an expert in internet protocols) to turn it into an academic paper, except they did not know any journal or professor who knew about data scraping 😦

I am publishing this now in the hope that they have plugged the gap before someone gets hold of that kind of database and exploits it for spamming or commercial misuse.

The weak link was that once you were in the database using a valid login and password, you could use automated HTML capture to do a lot of data scraping with the iMacros plugin for Firefox. Since the logins were done on Christmas Eve and during year end, this also relied on the fact that admins were likely to overlook the analytics logs (if they had software like Clicky or were preserving logs at all).

Here is the code that was used for scraping the whole ASA database. (Note that the scraped data was not used by me; it was sent to Dr Granville, and this was an academic research project.)

See complete code here- http://docs.google.com/View?id=dcvss358_335dg2xmdcp

1) Use the Firefox browser (or download it from http://www.mozilla.com/en-US/firefox/)

2) Install iMacros from https://addons.mozilla.org/en-US/firefox/addon/3863

3) Paste the following code into a Notepad file and save it as “macro1.iim”.

VERSION BUILD=6111213 RECORDER=FX
' Note: a leading ' marks a commented-out line
' AUTOMATED ENTRY INTO WEBSITE IN CORRECT POSITION
TAB T=1
'URL GOTO=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
'TAG POS=1:TEXT FORM=NAME:frmLogin ATTR=NAME:txtUser CONTENT=USERNAME
'SET !ENCRYPTION NO
'TAG POS=1:PASSWORD FORM=NAME:frmLogin ATTR=NAME:txtPassword CONTENT=USERPASSWORD
'TAG POS=1:SUBMIT FORM=NAME:frmLogin ATTR=NAME:btnSubmit&&VALUE:Login
'TAG POS=1 ATTR=ID:el34
' ENTER FORM INPUTS
'TAG POS=1 FORM=NAME:frmSearch ATTR=NAME:txtState CONTENT=%CA
'TAG POS=1:TEXT FORM=NAME:frmSearch ATTR=NAME:txtName CONTENT=b
'TAG POS=1:SUBMIT FORM=NAME:frmSearch ATTR=NAME:btnSubmit&&VALUE:Submit
' END FORM INPUTS
SET !ERRORIGNORE YES
SET !EXTRACT_TEST_POPUP NO
SET !LOOP 1
TAG POS=1 ATTR=TXT:Name
TAG POS=R{{!LOOP}} ATTR=HREF:* EXTRACT=HREF
SET !VAR1 {{!EXTRACT}}
'PROMPT {{!EXTRACT}}
URL GOTO={{!VAR1}}
TAG POS=1 ATTR=TXT:Name
TAG POS=R1 ATTR=TXT:* EXTRACT=TXT
TAG POS=1 ATTR=TXT:Email
TAG POS=R1 ATTR=TXT:* EXTRACT=TXT
'PROMPT {{!EXTRACT}}
BACK
SAVEAS FOLDER=* FILE=*

4) Run the code after logging in and after entering the search inputs: a name (use a single-letter wildcard, say a) and a state from the drop-down.

5) Click Submit to get the number of records.

6) Click the iOpus iMacros button next to the address bar in Firefox and load the macro file above.

7) Run the macro (click the Run Loop button, looping from 1 to X, where X is the number of records returned in step 5).

Repeat steps 4 to 7 until a single state (which is the group-by variable here) is complete.

8) Go to C:\Documents and Settings\admin\My Documents\iMacros\Downloads (Check this from IMacros settings and options in your installation)

9) Rename the file index as “state.csv”

10) Open CSV file

11) Use the following Office 2003 Macro to clean the file

Sub Macro1()
'
' Macro1 Macro
' Macro recorded 12/22/2008 by ajay
'
' Keyboard Shortcut: Ctrl+q
'
    Cells.Select
    Selection.Replace What:="#NEWLINE#", Replacement:="", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=False, SearchFormat:=False, _
        ReplaceFormat:=False
    Columns("B:B").Select
    Selection.TextToColumns Destination:=Range("B1"), DataType:=xlDelimited, _
        TextQualifier:=xlDoubleQuote, ConsecutiveDelimiter:=True, Tab:=True, _
        Semicolon:=False, Comma:=False, Space:=False, Other:=False, FieldInfo _
        :=Array(Array(1, 9), Array(2, 1)), TrailingMinusNumbers:=True
    Columns("C:C").Select
    Selection.TextToColumns Destination:=Range("C1"), DataType:=xlDelimited, _
        TextQualifier:=xlDoubleQuote, ConsecutiveDelimiter:=True, Tab:=True, _
        Semicolon:=False, Comma:=False, Space:=False, Other:=False, FieldInfo _
        :=Array(Array(1, 9), Array(2, 1)), TrailingMinusNumbers:=True
    Columns("B:B").ColumnWidth = 23.71
    Columns("A:A").EntireColumn.AutoFit
    ActiveWindow.SmallScroll Down:=9
    ActiveWorkbook.Save
End Sub

12) If you have Office 2007, use the Record Macro feature to create your own macro in your Personal Macro Workbook: replace all #NEWLINE# with a space (using Ctrl+H), then use Text to Columns on column 2 and column 3 with type Delimited, click Next, check “Treat consecutive delimiters as one”, click Next, and choose not to import the first column (by selecting that column).
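For readers without Excel, the cleaning done in steps 11 and 12 can be approximated in a few lines of Python. This is only a sketch of the same idea, under the assumption that each cell holds a label and a value separated by one or more tabs, as the Text to Columns settings in the macro suggest. The function name `clean_cell` and the sample string in the usage note are hypothetical.

```python
from typing import List

NEWLINE_TOKEN = "#NEWLINE#"  # placeholder text left in the extracted file


def clean_cell(cell: str) -> List[str]:
    """Strip the newline placeholder, split on tabs, and drop the first piece.

    Roughly mirrors the recorded macro: remove #NEWLINE#, split on tab
    delimiters (treating consecutive tabs as one), and skip the first
    resulting column.
    """
    text = cell.replace(NEWLINE_TOKEN, "")
    # Discarding empty pieces is how consecutive tabs collapse into one delimiter.
    pieces = [p.strip() for p in text.split("\t") if p.strip()]
    return pieces[1:]
```

Calling `clean_cell("Label\t\tSome value#NEWLINE#")` keeps only the part after the label. The exact layout of the downloaded file depends on the iMacros version and settings, so the splitting rule may need adjusting.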

13) To append lots of files into 1 file, use the following R commands

Download R from www.r-project.org

> setwd("C:\\Documents and Settings\\admin\\My Documents\\iMacros\\Downloads")

Note this is the same folder as in Step 8 above

> list.files(path = ".", pattern = NULL, all.files = FALSE, full.names = FALSE,
+ recursive = FALSE, ignore.case = FALSE)

The R output is something like below

> list.files(path = ".", pattern = NULL, all.files = FALSE, full.names = FALSE,
+ recursive = FALSE, ignore.case = FALSE)
 [1] "Automation Robot – Documents – Office Live Workspace" "Book1.xls"
 [3] "cala.csv" "calb.csv"
 [5] "calc.csv" "cald.csv"
 [7] "cale.csv" "calf.csv"
 [9] "calg.csv" "calh.csv"
[11] "cali.csv" "calj.csv"
[13] "calk.csv" "call.csv"
[15] "calm.csv" "caln.csv"
[17] "calo.csv" "calp.csv"
[19] "calq.csv" "calr.csv"
[21] "cals.csv" "calt.csv"
[23] "calu.csv" "calv.csv"
[25] "calw.csv" "calx.csv"
[27] "caly.csv" "calz.csv"
[29] "cola.csv" "colac.csv"
[31] "colad.csv" "colae.csv"
[33] "colaf.csv" "colag.csv"
[35] "coloa.csv" "colob.csv"
[37] "index" "login"
> file.append("coloa.csv","colob.csv")
[1] TRUE
> file.append("coloa.csv","colac.csv")
[1] TRUE
> file.append("coloa.csv","colad.csv")
[1] TRUE
> file.append("coloa.csv","colae.csv")
[1] TRUE
> file.append("coloa.csv","colaf.csv")
[1] TRUE
> file.append("coloa.csv","colag.csv")
[1] TRUE
> file.append("cala.csv","calb.csv")
[1] TRUE
> file.append("cala.csv","calc.csv")
[1] TRUE
> file.append("cala.csv","cald.csv")
[1] TRUE
> file.append("cala.csv","cale.csv")
[1] TRUE
> file.append("cala.csv","calf.csv")
[1] TRUE
> file.append("cala.csv","calg.csv")
[1] TRUE
> file.append("cala.csv","calh.csv")
[1] TRUE
> file.append("cala.csv","cali.csv")
[1] TRUE
> file.append("cala.csv","calj.csv")
[1] TRUE
> file.append("cala.csv","calk.csv")
[1] TRUE
> file.append("cala.csv","call.csv")
[1] TRUE
> file.append("cala.csv","calm.csv")
[1] TRUE
> file.append("cala.csv","caln.csv")
[1] TRUE
> file.append("cala.csv","calo.csv")
[1] TRUE
> file.append("cala.csv","calp.csv")
[1] TRUE
> file.append("cala.csv","calq.csv")
[1] TRUE
> file.append("cala.csv","calr.csv")
[1] TRUE
> file.append("cala.csv","cals.csv")
[1] TRUE
> file.append("cala.csv","calt.csv")
[1] TRUE
> file.append("cala.csv","calu.csv")
[1] TRUE
> file.append("cala.csv","calv.csv")
[1] TRUE
> file.append("cala.csv","calw.csv")
[1] TRUE
> file.append("cala.csv","calx.csv")
[1] TRUE
> file.append("cala.csv","caly.csv")
[1] TRUE
> file.append("cala.csv","calz.csv")
[1] TRUE
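Typing one file.append call per file gets tedious. The same merge can be sketched as a loop, shown here in Python rather than R. The function name `combine_csv_files` and the output file name are hypothetical, and unlike the session above the sketch writes to a separate output file instead of growing the first input file in place.

```python
from pathlib import Path
from typing import List


def combine_csv_files(folder: Path, prefix: str, output_name: str) -> List[str]:
    """Concatenate every '<prefix>*.csv' file in folder into output_name.

    Returns the source file names in the order they were appended.
    """
    output = folder / output_name
    # Build the source list before creating the output, and never include it.
    sources = sorted(
        p for p in folder.glob(f"{prefix}*.csv") if p.name != output_name
    )
    with output.open("wb") as out:
        for src in sources:
            data = src.read_bytes()
            out.write(data)
            # Add a newline if the file lacks one, so rows from
            # adjacent files do not fuse into a single line.
            if data and not data.endswith(b"\n"):
                out.write(b"\n")
    return [p.name for p in sources]
```

With the folder from step 8, a call such as `combine_csv_files(Path("."), "cal", "combined_cal.csv")` would gather cala.csv through calz.csv in one pass. Writing to a new file is a deliberate choice here: the inputs stay untouched, so the merge can be rerun safely.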

ACTUAL EXECUTION TIME REVISED MACRO

This uses multiple tabs (TAB T=1 and TAB T=2) to switch between tabs. Thus you can search for a big name in Tab 1, while Tab 2 holds the details of the table components (here Name and Email, positioned relatively).

The loop is run with the Loop button in iMacros.

VERSION BUILD=6111213 RECORDER=FX
TAB T=1
SET !LOOP 1
' Sets the initial value of the loop to start from 1.
SET !ERRORIGNORE YES
' Errors are ignored (as in cases where Email is not present) so that the rest of the code resumes.
SET !EXTRACT_TEST_POPUP NO
' Popups are disabled. Note that popups are useful while creating the code, but they slow down execution.
TAG POS=1 ATTR=TXT:Name
TAG POS=R{{!LOOP}} ATTR=HREF:* EXTRACT=HREF
' The extracted value is the link (HREF) at relative row R1 (from the loop), using the text Name (in strong) as the reference.
SET !VAR1 {{!EXTRACT}}
' Passes the value of the extract to the new variable VAR1.
TAB T=2
' Creates a new tab in Firefox within the same window.
URL GOTO={{!VAR1}}
' Goes to the new URL (the link of the table constituent, referenced by its name).
TAG POS=1 ATTR=TXT:Name
TAG POS=R1 ATTR=TXT:* EXTRACT=TXT
' Extracts Name.
TAG POS=1 ATTR=TXT:Email
TAG POS=R1 ATTR=TXT:* EXTRACT=TXT
' Extracts Email.
'ONDIALOG POS=1 BUTTON=OK CONTENT=
' Commented-out section, used when Firefox gives a message to resubmit the data.
TAB T=1
' Back to Tab 1, where the search form inputs are.
'BACK
' Commented out: instead of using back in the same tab, we move across tabs to avoid submitting the search again and again.
SAVEAS FOLDER=* FILE=*
' Downloads the data into the default folder in the default format (file).

If you are interested in knowing more you can see the Google Docs

http://docs.google.com/View?id=dcvss358_335dg2xmdcp

The World of Data as I think

After discussions about my performance at grad school and about WHAT exactly I want to work in, I drew the following curves.

Feel free to draw better circles, and I will include your reference here.

Caution: based on a very ordinary understanding of extraordinary technical things.

THE WORLD OF DATA

AND WHAT I WANT TO DO IN IT

ps- What do you think? Add a comment

“Build a better mousetrap, and the world will beat a path to your door.”- Emerson

Interview SPSS Olivier Jouve

SPSS recently launched a major series of products in its text mining and data mining product portfolio and rebranded data mining to the PASW series. In an exclusive and extensive interview, Olivier Jouve, Vice President, Corporate Development at SPSS Inc, talks of science careers, the recent launches, open source support to R by SPSS, Cloud Computing and Business Intelligence.

Ajay: Describe your career in Science. Are careers in science less lucrative than careers in business development? What advice would you give to people re-skilling in the current recession on learning analytical skills?

Olivier: I have a Master of Science in Geophysics and Master of Science in Computer Sciences, both from Paris VI University. I have always tried to combine science and business development in my career as I like to experience all aspects – from idea to concept to business plan to funding to development to marketing to sales.

There was a study published earlier this year that said two of the three best jobs are related to math and statistics. This is reinforced by three societal forces that are converging – better uses of mathematics to drive decision making, the tremendous growth and storage of data, and especially in this economy, the ability to deliver ROI. With more and more commercial and government organizations realizing the value of Predictive Analytics to solve business problems, being equipped with analytical skills can only enhance your career and provide job security.

Ajay: So SPSS has launched new products within its Predictive Analytics Software (PASW) portfolio – Modeler 13 and Text Analytics 13. Is this old wine in a new bottle? What is new in technical terms? What is new for customers looking to mine textual information?

Olivier: Our two new products, PASW Modeler 13 (formerly Clementine) and PASW Text Analytics 13 (formerly Text Mining for Clementine), extend and automate the power of data mining and text analytics to the business user, while significantly enhancing the productivity, flexibility and performance of the expert analyst.

PASW Modeler 13 data mining workbench has new and enhanced functionality that quickly takes users through the entire data mining process – from data access and preparation to model deployment. Some of the newest features include Automated Data Preparation that conditions data in a single step by automatically detecting and correcting quality errors; Auto Cluster that gives users a simple way to determine the best cluster algorithm for a particular data set; and full integration with PASW Statistics (formerly SPSS Statistics).

With PASW Text Analytics 13, SPSS provides the most complete view of the customer through the combined analysis of text, web and survey data. While other companies only provide the text component, SPSS couples text with existing structured data, permitting more accurate results and better predictive modeling. The new version includes pre-built categories for satisfaction surveys, advanced natural language processing techniques, and it supports more than 30 different languages.

Ajay: SPSS has supported open source platforms – Python and R – before it became fashionable to do so. How has this helped your company?

Olivier: Open source software helps the democratization of the analytics movement and SPSS is keen on supporting that democratization while welcoming open source users (and their creativity) into the analytics framework.

Ajay: What are the differences and similarities between Text Analytics and Search Engines? Can we mix the two as well using APIs?

Olivier: Search Engines are fundamentally top-down in that you know what you are looking for when launching a query. However, Text Analytics is bottom-up, uncovering hidden patterns, relationships and trends locked in unstructured data – including call center notes, open-ended survey responses, blogs and social networks. Now businesses have a way of pulling key concepts and extracting customer sentiments, such as emotional responses, preferences and opinions, and grouping them into categories.

For instance, a call center manager will have a hard time working out why customers are unhappy and churn by running a search engine over millions of call center notes. What would be the query? But by using Text Analytics, that same call center manager will discover the main reasons why customers are unhappy, and be able to predict if they are going to churn.
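The top-down versus bottom-up contrast can be made concrete with a toy example. The Python sketch below is only an illustration of the distinction and has nothing to do with how PASW Text Analytics works internally: `search` needs a query up front, while `top_terms` surfaces frequent terms without being told what to look for. The function names, the stopword list and the sample notes in the usage note are all made up.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# A tiny, hypothetical stopword list; real systems use much richer linguistics.
STOPWORDS = {"a", "about", "again", "an", "and", "on", "the", "to", "is", "of"}


def search(notes: Iterable[str], query: str) -> List[str]:
    """Top-down: return the notes containing a query the user already has in mind."""
    q = query.lower()
    return [note for note in notes if q in note.lower()]


def top_terms(notes: Iterable[str], k: int = 3) -> List[Tuple[str, int]]:
    """Bottom-up: count terms across all notes and return the k most frequent."""
    counts: Counter = Counter()
    for note in notes:
        for word in re.findall(r"[a-z]+", note.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts.most_common(k)
```

Given notes such as "Billing error on invoice, customer upset" and "Customer upset about billing again", `top_terms` brings "billing" and "upset" to the surface even though nobody asked about billing, whereas `search` only helps once someone thinks to query for it.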

Ajay: Why is Text Analytics so important? How will companies use it now and into the future?

Olivier: Actually, the question you should ask is, “Why is unstructured data so important?” Today, more than ever, people love to share their opinions: through the estimated 183 billion emails sent, the 1.6 million blog posts, millions of inquiries captured in call center notes, and thousands of comments on diverse social networking sites and community message boards. And let’s not forget all the data that flows through Twitter. Companies today would be short-sighted to ignore what their customers are saying about their products and services, in their own words. Those opinions – likes and dislikes – are essential nuggets and offer far more insight than demographic or transactional data for reducing customer churn, improving satisfaction, fighting crime, detecting fraud and increasing marketing campaign results.

Ajay: How is SPSS venturing into cloud computing and SaaS?

Olivier: SPSS has been at the origin of the PMML standard to allow organizations to provision their computing power in a very flexible manner – just like provisioning computing power through cloud computing. SPSS strongly believes in the benefits of a cloud computing environment, which is why all of our applications are designed with Service Oriented Architecture components. This enables SPSS to be flexible enough to meet the demands of the market as they change with respect to delivery mode. We are currently analyzing business and technical issues related to SPSS technologies in the cloud, such as the scoring and delivery of analytics. With regard to SaaS, we currently offer hosted services for our PASW Data Collection (formerly Dimensions) survey research suite of products.

Ajay: Do you think business intelligence is an overused term? Why do you think BI and Predictive Analytics failed in mortgage delinquency forecasting and reporting despite the financial sector being a big spender on BI tools?

Olivier: There is a big difference between business intelligence (BI) and Predictive Analytics. Traditional BI technologies focus on what’s happening now or what’s happened in the past by primarily using financial or product data. For organizations to take the most effective action, they need to know and plan for what may happen in the future by using people data – and that’s harnessed through Predictive Analytics.

Another way to look at it: Predictive covers the entire capture, predict and act continuum – from the use of survey research software to capture customer feedback (attitudinal data), to creating models to predict customer behaviors, and then acting on the results to improve business processes. Predictive Analytics, unlike BI, provides the secret ingredient and answers the question, “What will the customer do next?”

That being said, financial institutions didn’t need to use Predictive Analytics to see that some lenders sold mortgages to unqualified individuals likely to default. Predictive Analytics is an incredible application used to detect fraud, waste and abuse. Companies in the financial services industry can focus on mitigating their overall risk by creating better predictive models that not only encompass richer data sets, but also better rules-based automation.

Ajay: What do people do at SPSS to have fun when they are not making complex mathematical algorithms?

Olivier: SPSS employees love our casual, friendly atmosphere, our professional and talented colleagues, and our cool, cutting-edge technology. The fun part comes from doing meaningful work with great people, across different groups and geographies. Of course, being French, I have ensured that my colleagues are fully educated on the best wine and cuisine. And being based in Chicago, there is always a spirited baseball debate between the Cubs and White Sox. However, I have yet to convince anyone that rugby is a better sport.

Biography

Olivier Jouve is Vice President, Corporate Development, at SPSS Inc. He is responsible for defining SPSS strategic directions and growth opportunities through internal development, mergers and acquisitions and/or tactical alliances. As a pioneer in the field of data and text mining for the last 20 years, he has created the foundation of Text Analytics technology for analyzing customer interactions at SPSS. Jouve is a successful serial entrepreneur and has had his works published internationally in the area of Analytical CRM, text mining, search engines, competitive intelligence and knowledge management.