Brief Interview Timo Elliott

Here is a brief interview with Timo Elliott.Timo Elliott is a 19-year veteran of SAP Business Objects.

Ajay- What are the top 5 events in Business Integration and Data Visualization services you saw in 2010 and what are the top three trends you see in these in 2011.


Timo-

Top five events in 2010:

(1) Back to strong market growth. IT spending plummeted last year (BI continued to grow, but more slowly than previous years). This year, organizations reopened their wallets and funded new analytics initiatives — all the signs indicate that BI market growth will be double that of 2009.

(2) The launch of the iPad. Mobile BI has been around for years, but the iPad opened the floodgates of organizations taking a serious look at mobile analytics — and the easy-to-use, executive-friendly iPad dashboards have considerably raised the profile of analytics projects inside organizations.

(3) Data warehousing got exciting again. Decades of incremental improvements (column databases, massively parallel processing, appliances, in-memory processing…) all came together with robust commercial offers that challenged existing data storage and calculation methods. And new “NoSQL” approaches, designed for the new problems of massive amounts of less-structured web data, started moving into the mainstream.

(4) The end of Google Wave, the start of social BI.Google Wave was launched as a rethink of how we could bring together email, instant messaging, and social networks. While Google decided to close down the technology this year, it has left its mark, notably by influencing the future of “social BI”, with several major vendors bringing out commercial products this year.

(5) The start of the big BI merge. While several small independent BI vendors reported strong growth, the major trend of the year was consolidation and integration: the BI megavendors (SAP, Oracle, IBM, Microsoft) increased their market share (sometimes by acquiring smaller vendors, e.g. IBM/SPSS and SAP/Sybase) and integrated analytics with their existing products, blurring the line between BI and other technology areas.

Top three trends next year:

(1) Analytics, reinvented. New DW techniques make it possible to do sub-second, interactive analytics directly against row-level operational data. Now BI processes and interfaces need to be rethought and redesigned to make best use of this — notably by blurring the distinctions between the “design” and “consumption” phases of BI.

(2) Corporate and personal BI come together. The ability to mix corporate and personal data for quick, pragmatic analysis is a common business need. The typical solution to the problem — extracting and combining the data into a local data store (either Excel or a departmental data mart) — pleases users, but introduces duplication and extra costs and makes a mockery of information governance. 2011 will see the rise of systems that let individuals and departments load their data into personal spaces in the corporate environment, allowing pragmatic analytic flexibility without compromising security and governance.

(3) The next generation of business applications. Where are the business applications designed to support what people really do all day, such as implementing this year’s strategy, launching new products, or acquiring another company? 2011 will see the first prototypes of people-focused, flexible, information-centric, and collaborative applications, bringing together the best of business intelligence, “enterprise 2.0”, and existing operational applications.

And one that should happen, but probably won’t:

(4) Intelligence = Information + PEOPLE. Successful analytics isn’t about technology — it’s about people, process, and culture. The biggest trend in 2011 should be organizations spending the majority of their efforts on user adoption rather than technical implementation.                 About- http://timoelliott.com/blog/about

Timo Elliott is a 19-year veteran of SAP BusinessObjects, and has spent the last twenty years working with customers around the world on information strategy.

He works closely with SAP research and innovation centers around the world to evangelize new technology prototypes.

His popular Business Analytics and SAPWeb20 blogs track innovation in analytics and social media, including topics such as augmented corporate reality, collaborative decision-making, and social network analysis.

His PowerPoint Twitter Tools lets presenters see and react to tweets in real time, embedded directly within their slides.

A popular and engaging speaker, Elliott presents regularly to IT and business audiences at international conferences, on subjects such as why BI projects fail and what to do about it, and the intersection of BI and enterprise 2.0.

Prior to Business Objects, Elliott was a computer consultant in Hong Kong and led analytics projects for Shell in New Zealand. He holds a first-class honors degree in Economics with Statistics from Bristol University, England. He blogs on http://timoelliott.com/blog/ (one of the best designed blogs in BI) . You can see more about him personal web site here and photo/sketch blog here. You should follow Timo at http://twitter.com/timoelliott

Art Credit- Timo Elliott

Related Articles

Complex Event Processing- SASE Language

Logo of the anti-RFID campaign by German priva...
Image via Wikipedia

Complex Event Processing (CEP- not to be confused by Circular Probability Error) is defined processing many events happening across all the layers of an organization, identifying the most meaningful events within the event cloud, analyzing their impact, and taking subsequent action in real time.

Software supporting CEP are-

Oracle http://www.oracle.com/us/technologies/soa/service-oriented-architecture-066455.html

Oracle CEP is a Java application server for the development and deployment of high-performance event driven applications. It can detect patterns in the flow of events and message payloads, often based on filtering, correlation, and aggregation across event sources, and includes industry leading temporal and ordering capabilities. It supports ultra-high throughput (1 million/sec++) and microsecond latency.

Tibco is also trying to get into this market (it claims to have a 40 % market share in the public CEP market 😉 though probably they have not measured the DoE and DoD as worthy of market share yet

– see webcast by TIBCO ‘s head here http://www.tibco.com/products/business-optimization/complex-event-processing/default.jsp

and product info here-http://www.tibco.com/products/business-optimization/complex-event-processing/businessevents/default.jsp

TIBCO is the undisputed leader in complex event processing (CEP) software with over 40 percent market share, according to a recent IDC Study.

A good explanation of how social media itself can be used as an analogy for CEP is given in this SAS Global Paper

http://support.sas.com/resources/papers/proceedings10/040-2010.pdf

You can see a report on Predictive Analytics and Data Mining  in q1 2010 also from SAS’s website  at –http://www.sas.com/news/analysts/forresterwave-predictive-analytics-dm-104388-0210.pdf

A very good explanation on architecture involved is given by SAS CTO Keith Collins here on SAS’s Knowledge Exchange site,

http://www.sas.com/knowledge-exchange/risk/four-ways-divide-conquer.html

What it is: Methods 1 through 3 look at historical data and traditional architectures with information stored in the warehouse. In this environment, it often takes months of data cleansing and preparation to get the data ready to analyze. Now, what if you want to make a decision or determine the effect of an action in real time, as a sale is made, for instance, or at a specific step in the manufacturing process. With streaming data architectures, you can look at data in the present and make immediate decisions. The larger flood of data coming from smart phones, online transactions and smart-grid houses will continue to increase the amount of data that you might want to analyze but not keep. Real-time streaming, complex event processing (CEP) and analytics will all come together here to let you decide on the fly which data is worth keeping and which data to analyze in real time and then discard.

When you use it: Radio-frequency identification (RFID) offers a good user case for this type of architecture. RFID tags provide a lot of information, but unless the state of the item changes, you don’t need to keep warehousing the data about that object every day. You only keep data when it moves through the door and out of the warehouse.

The same concept applies to a customer who does the same thing over and over. You don’t need to keep storing data for analysis on a regular pattern, but if they change that pattern, you might want to start paying attention.

Figure  4: Traditional architecture vs. streaming architecture

Figure 4: Traditional architecture vs. streaming architecture

 

In academia  here is something called SASE Language

  • A rich declarative event language
  • Formal semantics of the event language
  • Theorectical underpinnings of CEP
  • An efficient automata-based implementation

http://sase.cs.umass.edu/

and

http://avid.cs.umass.edu/sase/index.php?page=navleft_1col

Financial Services

The query below retrieves the total trading volume of Google stocks in the 4 hour period after some bad news occurred.

PATTERN SEQ(News a, Stock+ b[ ])WHERE   [symbol]    AND	a.type = 'bad'    AND	b[i].symbol = 'GOOG' WITHIN  4 hoursHAVING  b[b.LEN].volume < 80%*b[1].volumeRETURN  sum(b[ ].volume)

The next query reports a one-hour period in which the price of a stock increased from 10 to 20 and its trading volume stayed relatively stable.

PATTERN	SEQ(Stock+ a[])WHERE 	 [symbol]   AND	  a[1].price = 10   AND	  a[i].price > a[i-1].price   AND	  a[a.LEN].price = 20            WITHIN  1 hourHAVING	avg(a[].volume) ≥ a[1].volumeRETURN	a[1].symbol, a[].price

The third query detects a more complex trend: in an hour, the volume of a stock started high, but after a period of price increasing or staying relatively stable, the volume plummeted.

PATTERN SEQ(Stock+ a[], Stock b)WHERE 	 [symbol]   AND	  a[1].volume > 1000   AND	  a[i].price > avg(a[…i-1].price))   AND	  b.volume < 80% * a[a.LEN].volume           WITHIN  1 hourRETURN	a[1].symbol, a[].(price,volume), b.(price,volume)

(note from Ajay-

 

I was not really happy about the depth of resources on CEP available online- there seem to be missing bits and pieces in both open source, academic and corporate information- one reason for this is the obvious military dual use of this technology- like feeds from Satellite, Audio Scans, etc)

PAWCON -This week in London

Watch out for the twitter hash news on PAWCON and the exciting agenda lined up. If your in the City- you may want to just drop in

http://www.predictiveanalyticsworld.com/london/2010/agenda.php#day1-7

Disclaimer- PAWCON has been a blog partner with Decisionstats (since the first PAWCON ). It is vendor neutral and features open source as well proprietary software, as well case studies from academia and Industry for a balanced view.

 

Little birdie told me some exciting product enhancements may be in the works including a not yet announced R plugin 😉 and the latest SAS product using embedded analytics and Dr Elder’s full day data mining workshop.

Citation-

http://www.predictiveanalyticsworld.com/london/2010/agenda.php#day1-7

Monday November 15, 2010
All conference sessions take place in Edward 5-7

8:00am-9:00am

Registration, Coffee and Danish
Room: Albert Suites


9:00am-9:50am

Keynote
Five Ways Predictive Analytics Cuts Enterprise Risk

All business is an exercise in risk management. All organizations would benefit from measuring, tracking and computing risk as a core process, much like insurance companies do.

Predictive analytics does the trick, one customer at a time. This technology is a data-driven means to compute the risk each customer will defect, not respond to an expensive mailer, consume a retention discount even if she were not going to leave in the first place, not be targeted for a telephone solicitation that would have landed a sale, commit fraud, or become a “loss customer” such as a bad debtor or an insurance policy-holder with high claims.

In this keynote session, Dr. Eric Siegel will reveal:

  • Five ways predictive analytics evolves your enterprise to reduce risk
  • Hidden sources of risk across operational functions
  • What every business should learn from insurance companies
  • How advancements have reversed the very meaning of fraud
  • Why “man + machine” teams are greater than the sum of their parts for
  • enterprise decision support

 

Speaker: Eric Siegel, Ph.D., Program Chair, Predictive Analytics World

Top of this page ] [ Agenda overview ]


IBM9:50am-10:10am

Platinum Sponsor Presentation
The Analytical Revolution

The algorithms at the heart of predictive analytics have been around for years – in some cases for decades. But now, as we see predictive analytics move to the mainstream and become a competitive necessity for organisations in all industries, the most crucial challenges are to ensure that results can be delivered to where they can make a direct impact on outcomes and business performance, and that the application of analytics can be scaled to the most demanding enterprise requirements.

This session will look at the obstacles to successfully applying analysis at the enterprise level, and how today’s approaches and technologies can enable the true “industrialisation” of predictive analytics.

Speaker: Colin Shearer, WW Industry Solutions Leader, IBM UK Ltd

Top of this page ] [ Agenda overview ]


Deloitte10:10am-10:20am

Gold Sponsor Presentation
How Predictive Analytics is Driving Business Value

Organisations are increasingly relying on analytics to make key business decisions. Today, technology advances and the increasing need to realise competitive advantage in the market place are driving predictive analytics from the domain of marketers and tactical one-off exercises to the point where analytics are being embedded within core business processes.

During this session, Richard will share some of the focus areas where Deloitte is driving business transformation through predictive analytics, including Workforce, Brand Equity and Reputational Risk, Customer Insight and Network Analytics.

Speaker: Richard Fayers, Senior Manager, Deloitte Analytical Insight

Top of this page ] [ Agenda overview ]


10:20am-10:45am

Break / Exhibits
Room: Albert Suites


10:45am-11:35am
Healthcare
Case Study: Life Line Screening
Taking CRM Global Through Predictive Analytics

While Life Line is successfully executing a US CRM roadmap, they are also beginning this same evolution abroad. They are beginning in the UK where Merkle procured data and built a response model that is pulling responses over 30% higher than competitors. This presentation will give an overview of the US CRM roadmap, and then focus on the beginning of their strategy abroad, focusing on the data procurement they could not get anywhere else but through Merkle and the successful modeling and analytics for the UK.

Speaker: Ozgur Dogan, VP, Quantitative Solutions Group, Merkle Inc.

Speaker: Trish Mathe, Life Line Screening

Top of this page ] [ Agenda overview ]


11:35am-12:25pm
Open Source Analytics; Healthcare
Case Study: A large health care organization
The Rise of Open Source Analytics: Lowering Costs While Improving Patient Care

Rapidminer and R were the number 1 and 2 in this years annual KDNuggets data mining tool usage poll, followed by Knime on place 4 and Weka on place 6. So what’s going on here? Are these open source tools really that good or is their popularity strongly correlated with lower acquisition costs alone? This session answers these questions based on a real world case for a large health care organization and explains the risks & benefits of using open source technology. The final part of the session explains how these tools stack up against their traditional, proprietary counterparts.

Speaker: Jos van Dongen, Associate & Principal, DeltIQ Group

Top of this page ] [ Agenda overview ]


12:25pm-1:25pm

Lunch / Exhibits
Room: Albert Suites


1:25pm-2:15pm
Keynote
Thought Leader:
Case Study: Yahoo! and other large on-line e-businesses
Search Marketing and Predictive Analytics: SEM, SEO and On-line Marketing Case Studies

Search Engine Marketing is a $15B industry in the U.S. growing to double that number over the next 3 years. Worldwide the SEM market was over $50B in 2010. Not only is this a fast growing area of marketing, but it is one that has significant implications for brand and direct marketing and is undergoing rapid change with emerging channels such as mobile and social. What is unique about this area of marketing is a singularly heavy dependence on analytics:

 

  • Large numbers of variables and options
  • Real-time auctions/bids and a need to adjust strategies in real-time
  • Difficult optimization problems on allocating spend across a huge number of keywords
  • Fast-changing competitive terrain and heavy competition on the obvious channels
  • Complicated interactions between various channels and a large choice of search keyword expansion possibilities
  • Profitability and ROI analysis that are complex and often challenging

 

The size of the industry, its growing importance in marketing, its upcoming role in Mobile Advertising, and its uniquely heavy reliance on analytics makes it particularly interesting as an area for predictive analytics applications. In this session, not only will hear about some of the latest strategies and techniques to optimize search, you will hear case studies that illustrate the important role of analytics from industry practitioners.

Speaker: Usama Fayyad, , Ph.D., CEO, Open Insights

Top of this page ] [ Agenda overview ]


SAS2:15pm-2:35pm

Platinum Sponsor Presentation
Creating a Model Factory Using in-Database Analytics

With the ever-increasing number of analytical models required to make fact-based decisions, as well as increasing audit compliance regulations, it is more important than ever that these models can be created, monitored, retuned and deployed as quickly and automatically as possible. This paper, using a case study from a major financial organisation, will show how organisations can build a model factory efficiently using the latest SAS technology that utilizes the power of in-database processing.

Speaker: John Spooner, Analytics Specialist, SAS (UK)

Top of this page ] [ Agenda overview ]


2:35pm-2:45pm

Session Break
Room: Albert Suites


2:45pm-3:35pm

Retail
Case Study: SABMiller
Predictive Analytics & Global Marketing Strategy

Over the last few years SABMiller plc, the second largest brewing company in the world operating in 70 countries, has been systematically segmenting its markets in different countries globally in order optimize their portfolio strategy & align it to their long term country specific growth strategy. This presentation talks about the overall methodology followed and the challenges that had to be overcome both from a technical as well as from a change management stand point in order to successfully implement a standard analytics approach to diverse markets and diverse business positions in a highly global setting.

The session explains how country specific growth strategies were converted to objective variables and consumption occasion segments were created that differentiated the market effectively by their growth potential. In addition to this the presentation will also provide a discussion on issues like:

  • The dilemmas of static vs. dynamic solutions and standardization vs. adaptable solutions
  • Challenges in acceptability, local capability development, overcoming implementation inertia, cost effectiveness, etc
  • The role that business partners at SAB and analytics service partners at AbsolutData together play in providing impactful and actionable solutions

 

Speaker: Anne Stephens, SABMiller plc

Speaker: Titir Pal, AbsolutData

Top of this page ] [ Agenda overview ]


3:35pm-4:25pm

Retail
Case Study: Overtoom Belgium
Increasing Marketing Relevance Through Personalized Targeting

 

Since many years, Overtoom Belgium – a leading B2B retailer and division of the French Manutan group – focuses on an extensive use of CRM. In this presentation, we demonstrate how Overtoom has integrated Predictive Analytics to optimize customer relationships. In this process, they employ analytics to develop answers to the key question: “which product should we offer to which customer via which channel”. We show how Overtoom gained a 10% revenue increase by replacing the existing segmentation scheme with accurate predictive response models. Additionally, we illustrate how Overtoom succeeds to deliver more relevant communications by offering personalized promotional content to every single customer, and how these personalized offers positively impact Overtoom’s conversion rates.

Speaker: Dr. Geert Verstraeten, Python Predictions

Top of this page ] [ Agenda overview ]


4:25pm-4:50pm

Break / Exhibits
Room: Albert Suites


4:50pm-5:40pm
Uplift Modelling:
Case Study: Lloyds TSB General Insurance & US Bank
Uplift Modelling: You Should Not Only Measure But Model Incremental Response

Most marketing analysts understand that measuring the impact of a marketing campaign requires a valid control group so that uplift (incremental response) can be reported. However, it is much less widely understood that the targeting models used almost everywhere do not attempt to optimize that incremental measure. That requires an uplift model.

This session will explain why a switch to uplift modelling is needed, illustrate what can and does go wrong when they are not used and the hugely positive impact they can have when used effectively. It will also discuss a range of approaches to building and assessing uplift models, from simple basic adjustments to existing modelling processes through to full-blown uplift modelling.

The talk will use Lloyds TSB General Insurance & US Bank as a case study and also illustrate real-world results from other companies and sectors.

 

Speaker: Nicholas Radcliffe, Founder and Director, Stochastic Solutions

Top of this page ] [ Agenda overview ]


5:40pm-6:30pm

Consumer services
Case Study: Canadian Automobile Association and other B2C examples
The Diminishing Marginal Returns of Variable Creation in Predictive Analytics Solutions

 

Variable Creation is the key to success in any predictive analytics exercise. Many different approaches are adopted during this process, yet there are diminishing marginal returns as the number of variables increase. Our organization conducted a case study on four existing clients to explore this so-called diminishing impact of variable creation on predictive analytics solutions. Existing predictive analytics solutions were built using our traditional variable creation process. Yet, presuming that we could exponentially increase the number of variables, we wanted to determine if this added significant benefit to the existing solution.

Speaker: Richard Boire, BoireFillerGroup

Top of this page ] [ Agenda overview ]


6:30pm-7:30pm

Reception / Exhibits
Room: Albert Suites


Tuesday November 16, 2010
All conference sessions take place in Edward 5-7

8:00am-9:00am

Registration, Coffee and Danish
Room: Albert Suites


9:00am-9:55am
Keynote
Multiple Case Studies: Anheuser-Busch, Disney, HP, HSBC, Pfizer, and others
The High ROI of Data Mining for Innovative Organizations

Data mining and advanced analytics can enhance your bottom line in three basic ways, by 1) streamlining a process, 2) eliminating the bad, or 3) highlighting the good. In rare situations, a fourth way – creating something new – is possible. But modern organizations are so effective at their core tasks that data mining usually results in an iterative, rather than transformative, improvement. Still, the impact can be dramatic.

Dr. Elder will share the story (problem, solution, and effect) of nine projects conducted over the last decade for some of America’s most innovative agencies and corporations:

    Streamline:

  • Cross-selling for HSBC
  • Image recognition for Anheuser-Busch
  • Biometric identification for Lumidigm (for Disney)
  • Optimal decisioning for Peregrine Systems (now part of Hewlett-Packard)
  • Quick decisions for the Social Security Administration
    Eliminate Bad:

  • Tax fraud detection for the IRS
  • Warranty Fraud detection for Hewlett-Packard
    Highlight Good:

  • Sector trading for WestWind Foundation
  • Drug efficacy discovery for Pharmacia & UpJohn (now Pfizer)

Moderator: Eric Siegel, Program Chair, Predictive Analytics World

Speaker: John Elder, Ph.D., Elder Research, Inc.

Also see Dr. Elder’s full-day workshop

 

Top of this page ] [ Agenda overview ]


9:55am-10:30am

Break / Exhibits
Room: Albert Suites


10:30am-11:20am
Telecommunications
Case Study: Leading Telecommunications Operator
Predictive Analytics and Efficient Fact-based Marketing

The presentation describes what are the major topics and issues when you introduce predictive analytics and how to build a Fact-Based marketing environment. The introduced tools and methodologies proved to be highly efficient in terms of improving the overall direct marketing activity and customer contact operations for the involved companies. Generally, the introduced approaches have great potential for organizations with large customer bases like Mobile Operators, Internet Giants, Media Companies, or Retail Chains.

Main Introduced Solutions:-Automated Serial Production of Predictive Models for Campaign Targeting-Automated Campaign Measurements and Tracking Solutions-Precise Product Added Value Evaluation.

Speaker: Tamer Keshi, Ph.D., Long-term contractor, T-Mobile

Speaker: Beata Kovacs, International Head of CRM Solutions, Deutsche Telekom

Top of this page ] [ Agenda overview ]


11:20am-11:25am

Session Changeover


11:25am-12:15pm
Thought Leader
Nine Laws of Data Mining

Data mining is the predictive core of predictive analytics, a business process that finds useful patterns in data through the use of business knowledge. The industry standard CRISP-DM methodology describes the process, but does not explain why the process takes the form that it does. I present nine “laws of data mining”, useful maxims for data miners, with explanations that reveal the reasons behind the surface properties of the data mining process. The nine laws have implications for predictive analytics applications: how and why it works so well, which ambitions could succeed, and which must fail.

 

Speaker: Tom Khabaza, khabaza.com

 

Top of this page ] [ Agenda overview ]


12:15pm-1:30pm

Lunch / Exhibits
Room: Albert Suites


1:30pm-2:25pm
Expert Panel: Kaboom! Predictive Analytics Hits the Mainstream

Predictive analytics has taken off, across industry sectors and across applications in marketing, fraud detection, credit scoring and beyond. Where exactly are we in the process of crossing the chasm toward pervasive deployment, and how can we ensure progress keeps up the pace and stays on target?

This expert panel will address:

  • How much of predictive analytics’ potential has been fully realized?
  • Where are the outstanding opportunities with greatest potential?
  • What are the greatest challenges faced by the industry in achieving wide scale adoption?
  • How are these challenges best overcome?

 

Panelist: John Elder, Ph.D., Elder Research, Inc.

Panelist: Colin Shearer, WW Industry Solutions Leader, IBM UK Ltd

Panelist: Udo Sglavo, Global Analytic Solutions Manager, SAS

Panel moderator: Eric Siegel, Ph.D., Program Chair, Predictive Analytics World


2:25pm-2:30pm

Session Changeover


2:30pm-3:20pm
Crowdsourcing Data Mining
Case Study: University of Melbourne, Chessmetrics
Prediction Competitions: Far More Than Just a Bit of Fun

Data modelling competitions allow companies and researchers to post a problem and have it scrutinised by the world’s best data scientists. There are an infinite number of techniques that can be applied to any modelling task but it is impossible to know at the outset which will be most effective. By exposing the problem to a wide audience, competitions are a cost effective way to reach the frontier of what is possible from a given dataset. The power of competitions is neatly illustrated by the results of a recent bioinformatics competition hosted by Kaggle. It required participants to pick markers in HIV’s genetic sequence that coincide with changes in the severity of infection. Within a week and a half, the best entry had already outdone the best methods in the scientific literature. This presentation will cover how competitions typically work, some case studies and the types of business modelling challenges that the Kaggle platform can address.

Speaker: Anthony Goldbloom, Kaggle Pty Ltd

Top of this page ] [ Agenda overview ]


3:20pm-3:50pm

Breaks /Exhibits
Room: Albert Suites


3:50pm-4:40pm
Human Resources; e-Commerce
Case Study: Naukri.com, Jeevansathi.com
Increasing Marketing ROI and Efficiency of Candidate-Search with Predictive Analytics

InfoEdge, India’s largest and most profitable online firm with a bouquet of internet properties has been Google’s biggest customer in India. Our team used predictive modeling to double our profits across multiple fronts. For Naukri.com, India’s number 1 job portal, predictive models target jobseekers most relevant to the recruiter. Analytical insights provided a deeper understanding of recruiter behaviour and informed a redesign of this product’s recruiter search functionality. This session will describe how we did it, and also reveal how Jeevansathi.com, India’s 2nd-largest matrimony portal, targets the acquisition of consumers in the market for marriage.

 

Speaker: Suvomoy Sarkar, Chief Analytics Officer, HT Media & Info Edge India (parent company of the two companies above)

 

Top of this page ] [ Agenda overview ]


4:40pm-5:00pm
Closing Remarks

Speaker: Eric Siegel, Ph.D., Program Chair, Predictive Analytics World

Top of this page ] [ Agenda overview ]


Wednesday November 17, 2010

Full-day Workshop
The Best and the Worst of Predictive Analytics:
Predictive Modeling Methods and Common Data Mining Mistakes

Click here for the detailed workshop description

  • Workshop starts at 9:00am
  • First AM Break from 10:00 – 10:15
  • Second AM Break from 11:15 – 11:30
  • Lunch from 12:30 – 1:15pm
  • First PM Break: 2:00 – 2:15
  • Second PM Break: 3:15 – 3:30
  • Workshop ends at 4:30pm

Speaker: John Elder, Ph.D., CEO and Founder, Elder Research, Inc.

 

SAS for Job Interviews

SAS Institute, Solutions
Image via Wikipedia

Yeah. I hope someone wrote a book like that.

Basically,

  1. Libname
  2. Proc Datasets
  3. Proc Import
  4. Proc Contents
  5. Proc Freq
  6. Proc Means
  7. Proc Univariate
  8. Proc Reg
  9. Proc Logistic
  10. Proc Export (to excel where you do the graphs)
  11. ODS
  12. Proc Tabulate

(note – it would be interesting to do a proc freq on all procs say used say on SAS On Demand)

Any thing else is not needed for a entry level job for a fresh grad student or job for a veteran re-trained worker.

Just like society needs science and commerce as twin pillars, analytics needs SAS (great Marketing) and R (great research) for expanding the pie of analytics which is woefully underutilized and stupidly overcapitalized by jazzy-copy-paste-data-from-query- software disguised as “intelligent software”.  R has no certification and no formal training for jobs (as yet) though this should change. SAS looks great (still) for getting jobs for grad students. R looks great (yup) for getting research jobs probably not corporate analytics jobs ?What do you think?

 

Interfaces to R

This is a fairly long post and is a basic collection  of material for a book/paper. It is on interfaces to use R. If you feel I need to add more on a  particular R interface, or if there is an error in this- please feel to contact me on twitter @decisionstats or mail ohri2007 on google mail.

R Interfaces

There are multiple ways to use the R statistical language.

Command Line- The default method is using the command prompt by the installed software on download from http://r-project.org
For windows users there is a simple GUI which has an option for Packages (loading package, installing package, setting CRAN mirror for downloading packages) , Misc (useful for listing all objects loaded in workspace as well as clearing objects to free up memory), and Help Menu.

Using Click and Point- Besides the command prompt, there are many Graphical User Interfaces which enable the analyst to use click and point methods to analyze data without getting into the details of learning complex and at times overwhelming R syntax. R GUIs are very popular both as mode of instruction in academia as well as in actual usage as it cuts down considerably on time taken to adapt to the language. As with all command line and GUI software, for advanced tweaks and techniques, command prompt will come in handy as well.

Advantages and Limitations of using Visual Programming Interfaces to R as compared to Command Line.

 

Advantages Limitations
Faster learning for new programmers Can create junk analysis by clicking menus in GUI
Easier creation of advanced models or graphics Cannot create custom functions unless you use command line
Repeatability of analysis is better Advanced techniques and custom flexibility of data handling R can be done in command line
Syntax is auto-generated Can limit scope and exposure in learning R syntax




A brief list of the notable Graphical User Interfaces is below-

1) R Commander- Basic statistics
2) Rattle- Data Mining
3) Deducer- Graphics (including GGPlot Integration) and also uses JGR (a Jave based  GUI)
4) RKward- Comprehensive R GUI for customizable graphs
5) Red-R – Dataflow programming interface using widgets

1) R Commander- R Commander was primarily created by Professor John Fox of McMaster University to cover the content of a basic statistics course. However it is extensible and many other packages can be added in menu form to it- in the form R Commander Plugins. Quite noticeably it is one of the most widely used R GUI and it also has a script window so you can write R code in combination with the menus.
As you point and click a particular menu item, the corresponding R code is automatically generated in the log window and executed.

It can be found on CRAN at http://cran.r-project.org/web/packages/Rcmdr/index.html



Advantages of Using  R Commander-
1) Useful for beginner in R language to do basic graphs and analysis and building models.
2) Has script window, output window and log window (called messages) in same screen which helps user as code is auto-generated on clicking on menus, and can be customized easily. For example in changing labels and options in Graphs.  Graphical output is shown in seperate window from output window.
3) Extensible for other R packages like qcc (for quality control), Teaching Demos (for training), survival analysis and Design of Experiments (DoE)
4) Easy to understand interface even for first time user.
5) Menu items which are not relevant are automatically greyed out- if there are only two variables, and you try to build a 3D scatterplot graph, that menu would simply not be available and is greyed out.

Comparative Disadvantages of using R Commander-
1) It is basically aimed at a statistical audience( originally students in statistics) and thus the terms as well as menus are accordingly labeled. Hence it is more of a statistical GUI rather than an analytics GUI.
2) Has limited ability to evaluate models from a business analysts perspective (ROC curve is not given as an option) even though it has extensive statistical tests for model evaluation in model sub menu. Indeed creating a Model is treated as a subsection of statistics rather than a separate menu item.
3) It is not suited for projects that do not involve advanced statistical testing and for users not proficient in statistics (particularly hypothesis testing), and for data miners.

Menu items in the R Commander window:
File Menu – For loading script files and saving Script files, Output and Workspace
It is also needed for changing the present working directory and for exiting R.
Edit Menu – For editing scripts and code in the script window.
Data Menu – For creating new dataset, inputting or importing data and manipulating data through variables. Data Import can be from text,comma separated values,clipboard, datasets from SPSS, Stata,Minitab, Excel ,dbase,  Access files or from url.
Data manipulation included deleting rows of data as well as manipulating variables.
Also this menu has the option for merging two datasets by row or columns.
Statistics Menu-This menu has options for descriptive statistics, hypothesis tests, factor analysis and clustering and also for creating models. Note there is a separate menu for evaluating the model so created.
Graphs Menu-It has options for creating various kinds of graphs including box-plot, histogram, line, pie charts and x-y plots.
The first option is color palette- it can be used for customizing the colors. It is recommended you adjust colors based on your need for publication or presentation.
A notable option is 3 D graphs for evaluating 3 variables at a time- this is really good and impressive feature and exposes the user to advanced graphs in R all at few clicks. You may want to dazzle a presentation using this graph.
Also consider scatterplot matrix graphs for graphical display of variables.
Graphical display of R surpasses any other statistical software in appeal as well as ease of creation- using GUI to create graphs can further help the user to get the most of data insights using R at a very minimum effort.
Models Menu-This is somewhat of a labeling peculiarity of R Commander as this menu is only for evaluating models which have been created using the statistics menu-model sub menu.
It includes options for graphical interpretation of model results,residuals,leverage and confidence intervals and adding back residuals to the data set.
Distributions Menu- is for cumulative probabilities, probability density, graphs of distributions, quantiles and features for standard distributions and can be used in lieu of standard statistical tables for the distributions. It has 13 standard statistical continuous distributions and 5 discrete distributions.
Tools Menu- allows you to load other packages and also load R Commander plugins (which are then added to the Interface Menu after the R Commander GUI is restarted). It also contains options sub menu for fine tuning (like opting to send output to R Menu)
Help Menu- Standard documentation and help menu. Essential reading is the short 25 page manual in it called Getting “Started With the R Commander”.

R Commander Plugins- There are twenty extensions to R Commander that greatly enhance it’s appeal -these include basic time series forecasting, survival analysis, qcc and more.

see a complete list at

  1. DoE – http://cran.r-project.org/web/packages/RcmdrPlugin.DoE/RcmdrPlugin.DoE.pdf
  2. doex
  3. EHESampling
  4. epack- http://cran.r-project.org/web/packages/RcmdrPlugin.epack/RcmdrPlugin.epack.pdf
  5. Export- http://cran.r-project.org/web/packages/RcmdrPlugin.Export/RcmdrPlugin.Export.pdf
  6. FactoMineR
  7. HH
  8. IPSUR
  9. MAc- http://cran.r-project.org/web/packages/RcmdrPlugin.MAc/RcmdrPlugin.MAc.pdf
  10. MAd
  11. orloca
  12. PT
  13. qcc- http://cran.r-project.org/web/packages/RcmdrPlugin.qcc/RcmdrPlugin.qcc.pdf and http://cran.r-project.org/web/packages/qcc/qcc.pdf
  14. qual
  15. SensoMineR
  16. SLC
  17. sos
  18. survival-http://cran.r-project.org/web/packages/RcmdrPlugin.survival/RcmdrPlugin.survival.pdf
  19. SurvivalT
  20. Teaching Demos

Note the naming convention for above e plugins is always with a Prefix of “RCmdrPlugin.” followed by the names above
Also on loading a Plugin, it must be already installed locally to be visible in R Commander’s list of load-plugin, and R Commander loads the e-plugin after restarting.Hence it is advisable to load all R Commander plugins in the beginning of the analysis session.

However the notable E Plugins are
1) DoE for Design of Experiments-
Full factorial designs, orthogonal main effects designs, regular and non-regular 2-level fractional
factorial designs, central composite and Box-Behnken designs, latin hypercube samples, and simple D-optimal designs can currently be generated from the GUI. Extensions to cover further latin hypercube designs as well as more advanced D-optimal designs (with blocking) are planned for the future.
2) Survival- This package provides an R Commander plug-in for the survival package, with dialogs for Cox models, parametric survival regression models, estimation of survival curves, and testing for differences in survival curves, along with data-management facilities and a variety of tests, diagnostics and graphs.
3) qcc -GUI for  Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. Multivariate control charts
4) epack- an Rcmdr “plug-in” based on the time series functions. Depends also on packages like , tseries, abind,MASS,xts,forecast. It covers Log-Exceptions garch
and following Models -Arima, garch, HoltWinters
5)Export- The package helps users to graphically export Rcmdr output to LaTeX or HTML code,
via xtable() or Hmisc::latex(). The plug-in was originally intended to facilitate exporting Rcmdr
output to formats other than ASCII text and to provide R novices with an easy-to-use,
easy-to-access reference on exporting R objects to formats suited for printed output. The
package documentation contains several pointers on creating reports, either by using
conventional word processors or LaTeX/LyX.
6) MAc- This is an R-Commander plug-in for the MAc package (Meta-Analysis with
Correlations). This package enables the user to conduct a meta-analysis in a menu-driven,
graphical user interface environment (e.g., SPSS), while having the full statistical capabilities of
R and the MAc package. The MAc package itself contains a variety of useful functions for
conducting a research synthesis with correlational data. One of the unique features of the MAc
package is in its integration of user-friendly functions to complete the majority of statistical steps
involved in a meta-analysis with correlations.
You can read more on R Commander Plugins at http://wp.me/p9q8Y-1Is
—————————————————————————————————————————-
Rattle- R Analytical Tool To Learn Easily (download from http://rattle.togaware.com/)
Rattle is more advanced user Interface than R Commander though not as popular in academia. It has been designed explicitly for data mining and it also has a commercial version for sale by Togaware. Rattle has a Tab and radio button/check box rather than Menu- drop down approach towards the graphical design. Also the Execute button needs to be clicked after checking certain options, just the same as submit button is clicked after writing code. This is different from clicking on a drop down menu.

Advantages of Using Rattle
1) Useful for beginner in R language to do building models,cluster and data mining.
2) Has separate tabs for data entry,summary, visualization,model building,clustering, association and evaluation. The design is intuitive and easy to understand even for non statistical background as the help is conveniently explained as each tab, button is clicked. Also the tabs are placed in a very sequential and logical order.
3) Uses a lot of other R packages to build a complete analytical platform. Very good for correlation graph,clustering as well decision trees.
4) Easy to understand interface even for first time user.
5) Log  for R code is auto generated and time stamp is placed.
6) Complete solution for model building from partitioning datasets randomly for testing,validation to building model, evaluating lift and ROC curve, and exporting PMML output of model for scoring.
7) Has a well documented online help as well as in-software documentation. The help helps explain terms even to non statistical users and is highly useful for business users.

Example Documentation for Hypothesis Testing in Test Tab in Rattle is ”
Distribution of the Data
* Kolomogorov-Smirnov     Non-parametric Are the distributions the same?
* Wilcoxon Signed Rank    Non-parametric Do paired samples have the same distribution?
Location of the Average
* T-test               Parametric     Are the means the same?
* Wilcoxon Rank-Sum    Non-parametric Are the medians the same?
Variation in the Data
* F-test Parametric Are the variances the same?
Correlation
* Correlation    Pearsons Are the values from the paired samples correlated?”

Comparative Disadvantages of using Rattle-
1) It is basically aimed at a data miner.  Hence it is more of a data mining GUI rather than an analytics GUI.
2) Has limited ability to create different types of graphs from a business analysts perspective Numeric variables can be made into Box-Plot, Histogram, Cumulative as well Benford Graphs. While interactivity using GGobi and Lattiticist is involved- the number of graphical options is still lesser than other GUI.
3) It is not suited for projects that involve multiple graphical analysis and which do not have model building or data mining.For example Data Plot is given in clustering tab but not in general Explore tab.
4) Despite the fact that it is meant for data miners, no support to biglm packages, as well as parallel programming is enabled in GUI for bigger datasets, though these can be done by R command line in conjunction with the Rattle GUI. Data m7ining is typically done on bigger datsets.
5) May have some problems installing it as it is dependent on GTK and has a lot of packages as dependencies.

Top Row-
This has the Execute Button (shown as two gears) and which has keyboard shortcut F2. It is used to execute the options in Tabs-and is equivalent of submit code button.
Other buttons include new Projects,Save  and Load projects which are files with extension to .rattle an which store all related information from Rattle.
It also has a button for exporting information in the current Tab as an open office document, and buttons for interrupting current process as well as exiting Rattle.

Data Tab-
It has the following options.
●        Data Type- These are radio buttons between Spreadsheet (and Comma Separated Values), ARFF files (Weka), ODBC (for Database Connections),Library (for Datasets from Packages),R Dataset or R datafile, Corpus (for Text Mining) and Script for generating the data by code.
●        The second row-in Data Tab in Rattle is Detail on Data Type- and its apperance shifts as per the radio button selection of data type in previous step. For Spreadsheet, it will show Path of File, Delimiters, Header Row while for ODBC it will show DSN, Tables, Rows and for Library it will show you a dropdown of all datasets in all R packages installed locally.
●        The third row is a Partition field for splitting dataset in training,testing,validation and it shows ratio. It also specifies a Random seed which can be customized for random partitions which can be replicated. This is very useful as model building requires model to be built and tested on random sub sets of full dataset.
●        The fourth row is used to specify the variable type of inputted data. The variable types are
○        Input: Used for modeling as independent variables
○        Target: Output for modeling or the dependent variable. Target is a categoric variable for classification, numeric for regression and for survival analysis both Time and Status need to be defined
○        Risk: A variable used in the Risk Chart
○        Ident: An identifier for unique observations in the data set like AccountId or Customer Id
○        Ignore: Variables that are to be ignored.
●        In addition the weight calculator can be used to perform mathematical operations on certain variables and identify certain variables as more important than others.

Explore Tab-
Summary Sub-Tab has Summary for brief summary of variables, Describe for detailed summary and Kurtosis and Skewness for comparing them across numeric variables.
Distributions Sub-Tab allows plotting of histograms, box plots, and cumulative plots for numeric variables and for categorical variables Bar Plot and Dot Plot.
It also has Benford Plot for Benford’s Law on probability of distribution of digits.
Correlation Sub-Tab– This displays corelation between variables as a table and also as a very nice plot.
Principal Components Sub-Tab– This is for use with Principal Components Analysis including the SVD (singular value decomposition) and Eigen methods.
Interactive Sub-Tab- Allows interactive data exploration using GGobi and Lattice software. It is a powerful visual tool.

Test Tab-This has options for hypothesis testing of data for two sample tests.
Transform Tab-This has options for rescaling data, missing values treatment, and deleting invalid or missing values.
Cluster Tab-It gives an option to KMeans, Hierarchical and Bi-Cluster clustering methods with automated graphs,plots (including dendogram, discriminant plot and data plot) and cluster results available. It is highly recommended for clustering projects especially for people who are proficient in clustering but not in R.

Associate Tab-It helps in building association rules between categorical variables, which are in the form of “if then”statements. Example. If day is Thursday, and someone buys Milk, there is 80% chance they will buy Diapers. These probabilities are generated from observed frequencies.

Model Tab-The Model tab makes Rattle one of the most advanced data mining tools, as it incorporates decision trees(including boosted models and forest method), linear and logistic regression, SVM,neural net,survival models.
Evaluate Tab-It as functionality for evaluating models including lift,ROC,confusion matrix,cost curve,risk chart,precision, specificity, sensitivity as well as scoring datasets with built model or models. Example – A ROC curve generated by Rattle for Survived Passengers in Titanic (as function of age,class,sex) This shows comparison of various models built.

Log Tab- R Code is automatically generated by Rattle as the respective operation is executed. Also timestamp is done so it helps in reviewing error as well as evaluating speed for code optimization.
—————————————————————————————————————————-
JGR- Deducer- (see http://www.deducer.org/pmwiki/pmwiki.php?n=Main.DeducerManual
JGR is a Java Based GUI. Deducer is recommended for use with JGR.
Deducer has basically been made to implement GGPLOT in a GUI- an advanced graphics package based on Grammer of Graphics and was part of Google Summer of Code project.

It first asks you to either open existing dataset or load a new dataset with just two icons. It has two initial views in Data Viewer- a Data view and Variable view which is quite similar to Base SPSS. The other Deducer options are loaded within the JGR console.

Advantages of Using  Deducer
1.      It has an option for factor as well as reliability analysis which is missing in other graphical user interfaces like R Commander and Rattle.
2.      The plot builder option gives very good graphics -perhaps the best in other GUIs. This includes a color by option which allows you to shade the colors based on variable value. An addition innovation is the form of templates which enables even a user not familiar with data visualization to choose among various graphs and click and drag them to plot builder area.
3.      You can set the Java Gui for R (JGR) menu to automatically load some packages by default using an easy checkbox list.
4.      Even though Deducer is a very young package, it offers a way for building other R GUIs using Java Widgets.
5.      Overall feel is of SPSS (Base GUI) to it’s drop down menu, and selecting variables in the sub menu dialogue by clicking to transfer to other side.SPSS users should be more comfortable at using this.
6.      A surprising thing is it rearranges the help documentation of all R in a very presentable and organized manner
7.      Very convenient to move between two or more datasets using dropdown.
8.      The most convenient GUI for merging two datasets using common variable.

Dis Advantages of Using  Deducer
1.      Not able to save plots as images (only options are .pdf and .eps), you can however copy as image.
2.      Basically a data viualization GUI – it does offer support for regression, descriptive statistics in the menu item Extras- however the menu suggests it is a work in progress.
3.      Website for help is outdated, and help documentation specific to Deducer lacks detail.



Components of Deducer-
Data Menu-Gives options for data manipulation including recoding variables,transform variables (binning, mathematical operation), sort dataset,  transpose dataset ,merge two datasets.
Analysis Menu-Gives options for frequency tables, descriptive statistics,cross tabs, one sample tests (with plots) ,two sample tests (with plots),k sample tests, correlation,linear and logistic models,generalized linear models.
Plot Builder Menu- This allows plots of various kinds to be made in an interactive manner.

Correlation using Deducer.

————————————————————————————————————————–
Red-R – A dataflow user interface for R (see http://red-r.org/

Red R uses dataflow concepts as a user interface rather than menus and tabs. Thus it is more similar to Enterprise Miner or Rapid Miner in design. For repeatable analysis dataflow programming is preferred by some analysts. Red-R is written in Python.


Advantages of using Red-R
1) Dataflow style makes it very convenient to use. It is the only dataflow GUI for R.
2) You can save the data as well as analysis in the same file.
3) User Interface makes it easy to read R code generated, and commit code.
4) For repeatable analysis-like reports or creating models it is very useful as you can replace just one widget and other widget/operations remain the same.
5) Very easy to zoom into data points by double clicking on graphs. Also to change colors and other options in graphs.
6) One minor feature- It asks you to set CRAN location just once and stores it even for next session.
7) Automated bug report submission.

Disadvantages of using Red-R
1) Current version is 1.8 and it needs a lot of improvement for building more modeling types as well as debugging errors.
2) Limited features presently.
———————————————————————————————————————-
RKWard (see http://rkward.sourceforge.net/)

It is primarily a KDE GUI for R, so it can be used on Ubuntu Linux. The windows version is available but has some bugs.

Advantages of using RKWard
1) It is the only R GUI for time series at present.
In addition it seems like the only R GUI explicitly for Item Response Theory (which includes credit response models,logistic models) and plots contains Pareto Charts.
2) It offers a lot of detail in analysis especially in plots(13 types of plots), analysis and  distribution analysis ( 8 Tests of normality,14 continuous and 6 discrete distributions). This detail makes it more suitable for advanced statisticians rather than business analytics users.
3) Output can be easily copied to Office documents.

Disadvantages of using RKWard
1) It does not have stable Windows GUI. Since a graphical user interface is aimed at making interaction easier for users- this is major disadvantage.
2) It has a lot of dependencies so may have some issues in installing.
3) The design categorization of analysis,plots and distributions seems a bit unbalanced considering other tabs are File, Edit, View, Workspace,Run,Settings, Windows,Help.
Some of the other tabs can be collapsed, while the three main tabs of analysis,plots,distributions can be better categorized (especially into modeling and non-modeling analysis).
4) Not many options for data manipulation (like subset or transpose) by the GUI.
5) Lack of detail in documentation as it is still on version 0.5.3 only.

Components-
Analysis, Plots and Distributions are the main components and they are very very extensive, covering perhaps the biggest range of plots,analysis or distribution analysis that can be done.
Thus RKWard is best combined with some other GUI, when doing advanced statistical analysis.

 

GNU General Public License
Image via Wikipedia

GrapherR

GrapheR is a Graphical User Interface created for simple graphs.

Depends: R (>= 2.10.0), tcltk, mgcv
Description: GrapheR is a multiplatform user interface for drawing highly customizable graphs in R. It aims to be a valuable help to quickly draw publishable graphs without any knowledge of R commands. Six kinds of graphs are available: histogram, box-and-whisker plot, bar plot, pie chart, curve and scatter plot.
License: GPL-2
LazyLoad: yes
Packaged: 2011-01-24 17:47:17 UTC; Maxime
Repository: CRAN
Date/Publication: 2011-01-24 18:41:47

More information about GrapheR at CRAN
Path: /cran/newpermanent link

Advantages of using GrapheR

  • It is bi-lingual (English and French) and can import in text and csv files
  • The intention is for even non users of R, to make the simple types of Graphs.
  • The user interface is quite cleanly designed. It is thus aimed as a data visualization GUI, but for a more basic level than Deducer.
  • Easy to rename axis ,graph titles as well use sliders for changing line thickness and color

Disadvantages of using GrapheR

  • Lack of documentation or help. Especially tips on mouseover of some options should be done.
  • Some of the terms like absicca or ordinate axis may not be easily understood by a business user.
  • Default values of color are quite plain (black font on white background).
  • Can flood terminal with lots of repetitive warnings (although use of warnings() function limits it to top 50)
  • Some of axis names can be auto suggested based on which variable s being chosen for that axis.
  • Package name GrapheR refers to a graphical calculator in Mac OS – this can hinder search engine results

Using GrapheR

  • Data Input -Data Input can be customized for CSV and Text files.
  • GrapheR gives information on loaded variables (numeric versus Factors)
  • It asks you to choose the type of Graph 
  • It then asks for usual Graph Inputs (see below). Note colors can be customized (partial window). Also number of graphs per Window can be easily customized 
  • Graph is ready for publication



Related Articles

 

Summary of R GUIs


Using R from other software- Please note that interfaces to R exist from other software as well. These include software from SAS Institute, IBM SPSS, Rapid Miner,Knime  and Oracle.

A brief list is shown below-

1) SAS/IML Interface to R- You can read about the SAS Institute’s SAS/ IML Studio interface to R at http://www.sas.com/technologies/analytics/statistics/iml/index.html
2) Rapid  Miner Extension to R-You can view integration with Rapid Miner’s extension to R here at http://www.youtube.com/watch?v=utKJzXc1Cow
3) IBM SPSS plugin for R-SPSS software has R integration in the form of a plugin. This was one of the earliest third party software offering interaction with R and you can read more at http://www.spss.com/software/statistics/developer/
4) Knime- Konstanz Information Miner also has R integration. You can view this on
http://www.knime.org/downloads/extensions
5) Oracle Data Miner- Oracle has a data mining offering to it’s very popular database software which is integrated with the R language. The R Interface to Oracle Data Mining ( R-ODM) allows R users to access the power of Oracle Data Mining’s in-database functions using the familiar R syntax. http://www.oracle.com/technetwork/database/options/odm/odm-r-integration-089013.html
6) JMP- JMP version 9 is the latest to offer interface to R.  You can read example scripts here at http://blogs.sas.com/jmp/index.php?/archives/298-JMP-Into-R!.html

R Excel- Using R from Microsoft Excel

Microsoft Excel is the most widely used spreadsheet program for data manipulation, entry and graphics. Yet as dataset sizes have increased, Excel’s statistical capabilities have lagged though it’s design has moved ahead in various product versions.

R Excel basically works at adding a .xla plugin to
Excel just like other Plugins. It does so by connecting to R through R packages.

Basically it offers the functionality of R
functions and capabilities to the most widely distributed spreadsheet program. All data summaries, reports and analysis end up in a spreadsheet-

R Excel enables R to be very useful for people not
knowing R. In addition it adds (by option) the menus of R Commander as menus in Excel spreadsheet.


Advantages-
Enables R and Excel to communicate thus tieing an advanced statistical tool to the most widely used business analytics tool.

Disadvantages-
No major disadvatage at all to a business user. For a data statistical user, Microsoft Excel is limited to 100,000 rows, so R data needs to be summarized or reduced.

Graphical capabilities of R are very useful, but to a new user, interactive graphics in Excel may be easier than say using Ggplot ot Ggobi.
You can read more on this at http://rcom.univie.ac.at/ or  the complete Springer Book http://www.springer.com/statistics/computanional+statistics/book/978-1-4419-0051-7

The combination of cloud computing and internet offers a new kind of interaction possible for scientists as well analysts.

Here is a way to use R on an Amazon EC2 machine, thus renting by hour hardware and computing resources which are scaleable to massive levels , whereas the software is free.

Here is how you can connect to Amazon EC2 and run R.
Running R for Cloud Computing.
1) Logging onto Amazon Console http://aws.amazon.com/ec2/
Note you need your Amazon Id (even the same id which you use for buying books).Note we are into Amazon EC2 as shown by the upper tab. Click upper tab to get into the Amazon EC2
2) Choosing the right AMI-On the left margin, you can click AMI -Images. Now you can search for the image-I chose Ubuntu images (linux images are cheaper) and latest Ubuntu Lucid  in the search .You can choose whether you want 32 bit or 64 bit image. 64 bit images will lead to  faster processing of data.Click on launch instance in the upper tab ( near the search feature). A pop up comes up, which shows the 5 step process to launch your computing.
3) Choose the right compute instance- – there are various compute instances and they all are at different multiples of prices or compute units. They differ in terms of RAM memory and number of processors.After choosing the compute instance of your choice (extra large is highlighted)- click on continue-
4) Instance Details-Do not  choose cloudburst monitoring if you are on a budget as it has a extra charge. For critical production it would be advisable to choose cloudburst monitoring once you have become comfortable with handling cloud computing..
5) Add Tag Details- If you are running a lot of instances you need to create your own tags to help you manage them. It is advisable if you are going to run many instances.
6) Create a key pair- A key pair is an added layer of encryption. Click on create new pair and name it (note the name will be handy in coming steps)
7) After clicking and downloading the key pair- you come into security groups. Security groups is just a set of instructions to help keep your data transfer secure. You want to enable access to your cloud instance to certain IP addresses (if you are going to connect from fixed IP address and to certain ports in your computer. It is necessary in security group to enable  SSH using Port 22.
Last step- Review Details and Click Launch
8) On the Left margin click on instances ( you were in Images.>AMI earlier)
It will take some 3-5 minutes to launch an instance. You can see status as pending till then.
9) Pending instance as shown by yellow light-
10) Once the instance is running -it is shown by a green light.
Click on the check box, and on upper tab go to instance actions. Click on connect-
You see a popup with instructions like these-
· Open the SSH client of your choice (e.g., PuTTY, terminal).
·  Locate your private key, nameofkeypair.pem
·  Use chmod to make sure your key file isn’t publicly viewable, ssh won’t work otherwise:
chmod 400 decisionstats.pem
·  Connect to your instance using instance’s public DNS [ec2-75-101-182-203.compute-1.amazonaws.com].
Example
Enter the following command line:
ssh -i decisionstats2.pem root@ec2-75-101-182-203.compute-1.amazonaws.com

Note- If you are using Ubuntu Linux on your desktop/laptop you will need to change the above line to ubuntu@… from root@..

ssh -i yourkeypairname.pem -X ubuntu@ec2-75-101-182-203.compute-1.amazonaws.com

(Note X11 package should be installed for Linux users- Windows Users will use Remote Desktop)

12) Install R Commander on the remote machine (which is running Ubuntu Linux) using the command

sudo apt-get install r-cran-rcmdr


Amcharts- Cool Charts Web Editor

Here is a really good website if you want to create charts for your website. It offers both flash as well as Silverlight charts.

http://extra.amcharts.com/editor/line/

This is an example of a Line Chart. Note since I am on wordpress.com I cant use Javascript so have pasted the screenshot-All you can do is paste the data from csv file, and even the swf file is hosted on their servers.


Windows Azure vs Amazon EC2 (and Google Storage)

Here is a comparison of Windows Azure instances vs Amazon compute instances

Compute Instance Sizes:

Developers have the ability to choose the size of VMs to run their application based on the applications resource requirements. Windows Azure compute instances come in four unique sizes to enable complex applications and workloads.

Compute Instance Size CPU Memory Instance Storage I/O Performance
Small 1.6 GHz 1.75 GB 225 GB Moderate
Medium 2 x 1.6 GHz 3.5 GB 490 GB High
Large 4 x 1.6 GHz 7 GB 1,000 GB High
Extra large 8 x 1.6 GHz 14 GB 2,040 GB High

Standard Rates:

Windows Azure

  • Compute
    • Small instance (default): $0.12 per hour
    • Medium instance: $0.24 per hour
    • Large instance: $0.48 per hour
    • Extra large instance: $0.96 per hour
  • Storage
    • $0.15 per GB stored per month
    • $0.01 per 10,000 storage transactions
  • Content Delivery Network (CDN)
    • $0.15 per GB for data transfers from European and North American locations*
    • $0.20 per GB for data transfers from other locations*
    • $0.01 per 10,000 transactions*

Source –

http://www.microsoft.com/windowsazure/offers/popup/popup.aspx?lang=en&locale=en-US&offer=MS-AZR-0001P

and

http://www.microsoft.com/windowsazure/windowsazure/

Amazon EC2 has more options though——————————-

http://aws.amazon.com/ec2/pricing/

Standard On-Demand Instances Linux/UNIX Usage Windows Usage
Small (Default) $0.085 per hour $0.12 per hour
Large $0.34 per hour $0.48 per hour
Extra Large $0.68 per hour $0.96 per hour
Micro On-Demand Instances Linux/UNIX Usage Windows Usage
Micro $0.02 per hour $0.03 per hour
High-Memory On-Demand Instances
Extra Large $0.50 per hour $0.62 per hour
Double Extra Large $1.00 per hour $1.24 per hour
Quadruple Extra Large $2.00 per hour $2.48 per hour
High-CPU On-Demand Instances
Medium $0.17 per hour $0.29 per hour
Extra Large $0.68 per hour $1.16 per hour
Cluster Compute Instances
Quadruple Extra Large $1.60 per hour N/A*
* Windows is not currently available for Cluster Compute Instances.

http://aws.amazon.com/ec2/instance-types/

Standard Instances

Instances of this family are well suited for most applications.

Small Instance – default*

1.7 GB memory
1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit)
160 GB instance storage (150 GB plus 10 GB root partition)
32-bit platform
I/O Performance: Moderate
API name: m1.small

Large Instance

7.5 GB memory
4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)
850 GB instance storage (2×420 GB plus 10 GB root partition)
64-bit platform
I/O Performance: High
API name: m1.large

Extra Large Instance

15 GB memory
8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
1,690 GB instance storage (4×420 GB plus 10 GB root partition)
64-bit platform
I/O Performance: High
API name: m1.xlarge

Micro Instances

Instances of this family provide a small amount of consistent CPU resources and allow you to burst CPUcapacity when additional cycles are available. They are well suited for lower throughput applications and web sites that consume significant compute cycles periodically.

Micro Instance

613 MB memory
Up to 2 EC2 Compute Units (for short periodic bursts)
EBS storage only
32-bit or 64-bit platform
I/O Performance: Low
API name: t1.micro

High-Memory Instances

Instances of this family offer large memory sizes for high throughput applications, including database and memory caching applications.

High-Memory Extra Large Instance

17.1 GB of memory
6.5 EC2 Compute Units (2 virtual cores with 3.25 EC2 Compute Units each)
420 GB of instance storage
64-bit platform
I/O Performance: Moderate
API name: m2.xlarge

High-Memory Double Extra Large Instance

34.2 GB of memory
13 EC2 Compute Units (4 virtual cores with 3.25 EC2 Compute Units each)
850 GB of instance storage
64-bit platform
I/O Performance: High
API name: m2.2xlarge

High-Memory Quadruple Extra Large Instance

68.4 GB of memory
26 EC2 Compute Units (8 virtual cores with 3.25 EC2 Compute Units each)
1690 GB of instance storage
64-bit platform
I/O Performance: High
API name: m2.4xlarge

High-CPU Instances

Instances of this family have proportionally more CPU resources than memory (RAM) and are well suited for compute-intensive applications.

High-CPU Medium Instance

1.7 GB of memory
5 EC2 Compute Units (2 virtual cores with 2.5 EC2 Compute Units each)
350 GB of instance storage
32-bit platform
I/O Performance: Moderate
API name: c1.medium

High-CPU Extra Large Instance

7 GB of memory
20 EC2 Compute Units (8 virtual cores with 2.5 EC2 Compute Units each)
1690 GB of instance storage
64-bit platform
I/O Performance: High
API name: c1.xlarge

Cluster Compute Instances

Instances of this family provide proportionally high CPU resources with increased network performance and are well suited for High Performance Compute (HPC) applications and other demanding network-bound applications. Learn more about use of this instance type for HPC applications.

Cluster Compute Quadruple Extra Large Instance

23 GB of memory
33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core “Nehalem” architecture)
1690 GB of instance storage
64-bit platform
I/O Performance: Very High (10 Gigabit Ethernet)
API name: cc1.4xlarge

Also http://www.microsoft.com/en-us/sqlazure/default.aspx

offers SQL Databases as a service with a free trial offer

If you are into .Net /SQL big time or too dependent on MS, Azure is a nice option to EC2 http://www.microsoft.com/windowsazure/offers/popup/popup.aspx?lang=en&locale=en-US&offer=COMPARE_PUBLIC

Updated- I just got approved for Google Storage so am adding their info- though they are in Preview (and its free right now) 🙂

https://code.google.com/apis/storage/docs/overview.html

Functionality

Google Storage for Developers offers a rich set of features and capabilities:

Basic Operations

  • Store and access data from anywhere on the Internet.
  • Range-gets for large objects.
  • Manage metadata.

Security and Sharing

  • User authentication using secret keys or Google account.
  • Authenticated downloads from a web browser for Google account holders.
  • Secure access using SSL.
  • Easy, powerful sharing and collaboration via ACLs for individuals and groups.

Performance and scalability

  • Up to 100 gigabytes per object and 1,000 buckets per account during the preview.
  • Strong data consistency—read-after-write consistency for all upload and delete operations.
  • Namespace for your domain—only you can create bucket URIs containing your domain name.
  • Data replicated in multiple data centers across the U.S. and within the same data center.

Tools

  • Web-based storage manager.
  • GSUtil, an open source command line tool.
  • Compatible with many existing cloud storage tools and libraries.

Read the Getting Started Guide to learn more about the service.

Note: Google Storage for Developers does not support Google Apps accounts that use your company domain name at this time.

Back to top

Pricing

Google Storage for Developers pricing is based on usage.

  • Storage—$0.17/gigabyte/month
  • Network
    • Upload data to Google
      • $0.10/gigabyte
    • Download data from Google
      • $0.15/gigabyte for Americas and EMEA
      • $0.30/gigabyte for Asia-Pacific
  • Requests
    • PUT, POST, LIST—$0.01 per 1,000 requests
    • GET, HEAD—$0.01 per 10,000 requests
%d bloggers like this: