Brief Interview with James G Kobielus

Here is a brief one-question interview with James Kobielus, Senior Analyst, Forrester.

Ajay-Describe the five most important events in predictive analytics that you saw in 2010, and the top three trends you expect in 2011.

Jim-

Five most important developments in 2010:

  • Continued emergence of enterprise-grade Hadoop solutions as the core of the future cloud-based platforms for advanced analytics
  • Development of the market for analytic solution appliances that incorporate several key features for advanced analytics: massively parallel EDW appliance, in-database analytics and data management function processing, embedded statistical libraries, prebuilt logical domain models, and integrated modeling and mining tools
  • Integration of advanced analytics into core BI platforms with user-friendly, visual, wizard-driven, tools for quick, exploratory predictive modeling, forecasting, and what-if analysis by nontechnical business users
  • Convergence of predictive analytics, data mining, content analytics, and CEP in integrated tools geared to real-time social media analytics
  • Emergence of CRM and other line-of-business applications that support continuously optimized “next-best action” business processes through embedding of predictive models, orchestration engines, business rules engines, and CEP agility

Three top trends I see in the coming year, above and beyond deepening and adoption of the above-bulleted developments:

  • All-in-memory, massively parallel analytic architectures will begin to gain a foothold in complex EDW environments in support of real-time elastic analytics
  • Further crystallization of a market for general-purpose “recommendation engines” that, operating inline to EDWs, CEP environments, and BPM platforms, enable “next-best action” approaches to emerge from today’s application siloes
  • Incorporation of social network analysis functionality into a wider range of front-office business processes to enable fine-tuned behavioral-based customer segmentation to drive CRM optimization

About –http://www.forrester.com/rb/analyst/james_kobielus

James G. Kobielus
Senior Analyst, Forrester Research

RESEARCH FOCUS

James serves Business Process & Applications professionals. He is a leading expert on data warehousing, predictive analytics, data mining, and complex event processing. In addition to his core coverage areas, James contributes to Forrester’s research in business intelligence, data integration, data quality, and master data management.

PREVIOUS WORK EXPERIENCE

James has a long history in IT research and consulting and has worked for both vendors and research firms. Most recently, he was at Current Analysis, an IT research firm, where he was a principal analyst covering topics ranging from data warehousing to data integration and the Semantic Web. Prior to that position, James was a senior technical systems analyst at Exostar (a hosted supply chain management and eBusiness hub for the aerospace and defense industry). In this capacity, James was responsible for identifying and specifying product/service requirements for federated identity, PKI, and other products. He also worked as an analyst for the Burton Group and was previously employed by LCC International, DynCorp, ADEENA, International Center for Information Technologies, and the North American Telecommunications Association. He is both well versed and experienced in product and market assessments. James is a widely published business/technology author and has spoken at many industry events.

Who searches for this Blog?

Using WP-Stats, I set about answering this question:

What search keywords lead here?

Clearly Michael Jackson is down this year

And R GUI, Data Mining is up.

How does that affect my writing, given that I get almost 250 visitors a day from search engines alone, even assuming I write nothing on this blog from now on?

It doesn’t. I still write whatever code or poem comes to my mind. So it is hurtful when people underestimate the effort that goes into writing and jump to conclusions (especially if I write about a company: I am not on the payroll of that company, just as when I write a poem I am not a full-time poet).

Over to xkcd

All Time (for Decisionstats.Wordpress.com)

Search Views
libre office 818
facebook analytics 806
michael jackson history 240
wps sas lawsuit 180
r gui 168
wps sas 154
wordle.net 118
sas wps 116
decision stats 110
sas wps lawsuit 100
google maps jet ski 94
data mining 88
doug savage 72
hive tutorial 63
spss certification 63
hadley wickham 63
google maps jetski 62
sas sues wps 60
decisionstats 58
donald farmer microsoft 45
libreoffice 44
wps statistics 44
best statistics software 42
r gui ubuntu 41
rstat 37
tamilnadu advanced technical training institute tatti 37

YTD

2009-11-24 to Today

Search Views
libre office 818
facebook analytics 781
wps sas lawsuit 170
r gui 164
wps sas 125
wordle.net 118
sas wps 101
sas wps lawsuit 95
google maps jet ski 94
data mining 86
decision stats 82
doug savage 63
hadley wickham 63
google maps jetski 62
hive tutorial 56
donald farmer microsoft 45

Interview Jamie Nunnelly NISS

An interview with Jamie Nunnelly, Communications Director of National Institute of Statistical Sciences

Ajay– What does NISS do? And What does SAMSI do?

Jamie– The National Institute of Statistical Sciences (NISS) was established in 1990 by the national statistics societies and the Research Triangle universities and organizations, with the mission to identify, catalyze and foster high-impact, cross-disciplinary and cross-sector research involving the statistical sciences.

NISS is dedicated to strengthening and serving the national statistics community, most notably by catalyzing community members’ participation in applied research driven by challenges facing government and industry. NISS also provides career development opportunities for statisticians and scientists, especially those in the formative stages of their careers.

The Institute identifies emerging issues to which members of the statistics community can make key contributions, and then catalyzes the right combinations of researchers from multiple disciplines and sectors to tackle each problem. More than 300 researchers from over 100 institutions have worked on our projects.

The Statistical and Applied Mathematical Sciences Institute (SAMSI) is a partnership of Duke University,  North Carolina State University, The University of North Carolina at Chapel Hill, and NISS in collaboration with the William Kenan Jr. Institute for Engineering, Technology and Science and is part of the Mathematical Sciences Institutes of the NSF.

SAMSI focuses on 1-2 programs of research interest in the statistical and/or applied mathematical area and visitors from around the world are involved with the programs and come from a variety of disciplines in addition to mathematics and statistics.

Many come to SAMSI to attend workshops, and also participate in working groups throughout the academic year. Many of the working groups communicate via WebEx so people can be involved with the research remotely. SAMSI also has a robust education and outreach program to help undergraduate and graduate students learn about cutting edge research in applied mathematics and statistics.

Ajay– What successes have you had in 2010, and what do you need to succeed in 2011? What is planned for 2011?

Jamie– NISS has had a very successful collaboration with the National Agricultural Statistical Service (NASS) over the past two years that was just renewed for the next two years. NISS & NASS had three teams consisting of a faculty researcher in statistics, a NASS researcher, a NISS mentor, a postdoctoral fellow and a graduate student working on statistical modeling and other areas of research for NASS.

NISS is also working on a syndromic surveillance project with Clemson University, Duke University, The University of Georgia, and The University of South Carolina. The group is currently working with some hospitals to test out a model they have been developing to help predict disease outbreak.

SAMSI had a very successful year with two programs ending this past summer, which were the Stochastic Dynamics program and the Space-time Analysis for Environmental Mapping, Epidemiology and Climate Change. Several papers were written and published and many presentations have been made at various conferences around the world regarding the work that was conducted at SAMSI last year.

Next year’s program, uncertainty quantification, is so big that the institute has decided to devote all its time and energy to it. The opening workshop, in addition to the main methodological theme, will be broken down into three areas of interest under this broad umbrella of research: climate change, engineering and renewable energy, and geosciences.

Ajay– Describe your career in science and communication.

Jamie– I have been in communications since 1985, working for large Fortune 500 companies such as General Motors and Tropicana Products. I moved to the Research Triangle region of North Carolina after graduate school and got into economic development and science communications first working for the Research Triangle Regional Partnership in 1994.

From 1996-2005 I was the communications director for the Research Triangle Park, working for the Research Triangle Foundation of NC. I published a quarterly magazine called The Park Guide for awhile, then came to work for NISS and SAMSI in 2008.

I really enjoy working with the mathematicians and statisticians. I always joke that I am the least educated person working here and that is not far from the truth! I am honored to help get the message out about all of the important research that is conducted here each day that is helping to improve the lives of so many people out there.

Ajay– Research Triangle or Silicon Valley– Which is better for tech people and why? Your opinion

Jamie– Both the Silicon Valley and Research Triangle are great regions for tech people to locate, but of course, I have to be biased and choose Research Triangle!

Really any place in the world that you find many universities working together with businesses and government, you have an area that will grow and thrive, because the collaborations help all of us generate new ideas, many of which blossom into new businesses, or new endeavors of research.

The quality of life in places such as the Research Triangle is great because you have people from around the world moving to a place, each bringing his/her culture, food, and uniqueness to this place, and enriching everyone else as a result.

One advantage the Research Triangle has over Silicon Valley is a greater diversity of industries. When the telecommunications industry went bust back in 2001-02, the region took a hit, but the biotechnology industry was still growing, so unemployment rose, but not to the extent that other areas might have experienced.

The latest recession has hit us all very hard, so even this strategy has not made us immune to having high unemployment, but the Research Triangle region has been pegged by experts to be one of the first regions to emerge out of the Great Recession.

The other advantage I think we have is that our cost of living is still much more reasonable than Silicon Valley. It’s still possible to get a nice sized home, some land and not break the bank!

Ajay– How do you manage an active online social media presence, your job and your family? How important is balance in professional life, and when should young professionals realize this?

Jamie– Balance is everything, isn’t it? When I leave the office, I turn off my iPhone and disconnect from Twitter/Facebook etc.

I know that is not recommended by some folks, but I am a one-person communications department and I love my family and friends and feel it’s important to devote time to them as well as to my career.

I think it is very important for young people to establish this early in their careers because if they don’t they will fall victim to working way too many hours and really, who loves you at the end of the day?

Your company may appreciate all you do for them, but if you leave, or you get sick and cannot work for them, you will be replaced. Lee Iacocca, former CEO of Chrysler, said, “No matter what you’ve done for yourself or for humanity, if you can’t look back on having given love and attention to your own family, what have you really accomplished?” I think that is what is really most important in life.

About-

Jamie Nunnelly has been in communications for 25 years. She is currently on the board of directors for Chatham County Economic Development Corporation and Leadership Triangle & is a member of the International Association of Business Communicators and the Public Relations Society of America. She earned a bachelor’s degree in interpersonal and public communications at Bowling Green State University and a master’s degree in mass communications at the University of South Florida.

You can contact Jamie at http://niss.org/content/jamie-nunnelly or on twitter at

SAS for Job Interviews

Yeah. I hope someone wrote a book like that.

Basically,

  1. Libname
  2. Proc Datasets
  3. Proc Import
  4. Proc Contents
  5. Proc Freq
  6. Proc Means
  7. Proc Univariate
  8. Proc Reg
  9. Proc Logistic
  10. Proc Export (to excel where you do the graphs)
  11. ODS
  12. Proc Tabulate

(Note: it would be interesting to run a proc freq on all the procs used on, say, SAS On Demand.)

Anything else is not needed for an entry-level job for a fresh grad student, or a job for a veteran re-trained worker.

Just like society needs science and commerce as twin pillars, analytics needs SAS (great marketing) and R (great research) for expanding the pie of analytics, which is woefully underutilized and stupidly overcapitalized by jazzy copy-paste-data-from-query software disguised as “intelligent software”. R has no certification and no formal training for jobs (as yet), though this should change. SAS looks great (still) for getting jobs for grad students. R looks great (yup) for getting research jobs, though probably not corporate analytics jobs. What do you think?

 

Cloud Computing with R

Here is a short list of resources and material I put together as starting points for R and cloud computing. It’s a bit messy, but overall it should serve quite comprehensively.

Cloud computing is a commonly used expression for a generational change in computing, from desktops and servers to remote, massive, shared computing resources enabled by high bandwidth across the internet.

As per the National Institute of Standards and Technology Definition,
Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.

(Citation: The NIST Definition of Cloud Computing

Authors: Peter Mell and Tim Grance
Version 15, 10-7-09
National Institute of Standards and Technology, Information Technology Laboratory
http://csrc.nist.gov/groups/SNS/cloud-computing/cloud-def-v15.doc)

R is an integrated suite of software facilities for data manipulation, calculation and graphical display.

From http://cran.r-project.org/doc/FAQ/R-FAQ.html#R-Web-Interfaces

R Web Interfaces

Rweb is developed and maintained by Jeff Banfield. The Rweb Home Page provides access to all three versions of Rweb—a simple text entry form that returns output and graphs, a more sophisticated JavaScript version that provides a multiple window environment, and a set of point and click modules that are useful for introductory statistics courses and require no knowledge of the R language. All of the Rweb versions can analyze Web accessible datasets if a URL is provided.
The paper “Rweb: Web-based Statistical Analysis”, providing a detailed explanation of the different versions of Rweb and an overview of how Rweb works, was published in the Journal of Statistical Software (http://www.jstatsoft.org/v04/i01/).

Ulf Bartel has developed R-Online, a simple on-line programming environment for R which intends to make the first steps in statistical programming with R (especially with time series) as easy as possible. There is no need for a local installation since the only requirement for the user is a JavaScript capable browser. See http://osvisions.com/r-online/ for more information.

Rcgi is a CGI WWW interface to R by MJ Ray. It had the ability to use “embedded code”: you could mix user input and code, allowing the HTML author to do anything from load in data sets to enter most of the commands for users without writing CGI scripts. Graphical output was possible in PostScript or GIF formats and the executed code was presented to the user for revision. However, it is not clear if the project is still active.

Currently, a modified version of Rcgi by Mai Zhou (actually, two versions: one with (bitmap) graphics and one without) as well as the original code are available from http://www.ms.uky.edu/~statweb/.

CGI-based web access to R is also provided at http://hermes.sdu.dk/cgi-bin/go/. There are many additional examples of web interfaces to R which basically allow to submit R code to a remote server, see for example the collection of links available from http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/StatCompCourse.

David Firth has written CGIwithR, an R add-on package available from CRAN. It provides some simple extensions to R to facilitate running R scripts through the CGI interface to a web server, and allows submission of data using both GET and POST methods. It is easily installed using Apache under Linux and in principle should run on any platform that supports R and a web server provided that the installer has the necessary security permissions. David’s paper “CGIwithR: Facilities for Processing Web Forms Using R” was published in the Journal of Statistical Software (http://www.jstatsoft.org/v08/i10/). The package is now maintained by Duncan Temple Lang and has a web page at http://www.omegahat.org/CGIwithR/.

Rpad, developed and actively maintained by Tom Short, provides a sophisticated environment which combines some of the features of the previous approaches with quite a bit of JavaScript, allowing for a GUI-like behavior (with sortable tables, clickable graphics, editable output), etc.

Jeff Horner is working on the R/Apache Integration Project which embeds the R interpreter inside Apache 2 (and beyond). A tutorial and presentation are available from the project web page at http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/RApacheProject.

Rserve is a project actively developed by Simon Urbanek. It implements a TCP/IP server which allows other programs to use facilities of R. Clients are available from the web site for Java and C++ (and could be written for other languages that support TCP/IP sockets).

OpenStatServer is being developed by a team lead by Greg Warnes; it aims “to provide clean access to computational modules defined in a variety of computational environments (R, SAS, Matlab, etc) via a single well-defined client interface” and to turn computational services into web services.

Two projects use PHP to provide a web interface to R. R_PHP_Online by Steve Chen (though it is unclear if this project is still active) is somewhat similar to the above Rcgi and Rweb. R-php is actively developed by Alfredo Pontillo and Angelo Mineo and provides both a web interface to R and a set of pre-specified analyses that need no R code input.

webbioc is “an integrated web interface for doing microarray analysis using several of the Bioconductor packages” and is designed to be installed at local sites as a shared computing resource.

Rwui is a web application to create user-friendly web interfaces for R scripts. All code for the web interface is created automatically. There is no need for the user to do any extra scripting or learn any new scripting techniques. Rwui can also be found at http://rwui.cryst.bbk.ac.uk.

Finally, the R.rsp package by Henrik Bengtsson introduces “R Server Pages”. Analogous to Java Server Pages, an R server page is typically HTML with embedded R code that gets evaluated when the page is requested. The package includes an internal cross-platform HTTP server implemented in Tcl, so provides a good framework for including web-based user interfaces in packages. The approach is similar to the use of the brew package with Rapache with the advantage of cross-platform support and easy installation.

Also additional R Cloud Computing Use Cases
http://wwwdev.ebi.ac.uk/Tools/rcloud/

ArrayExpress R/Bioconductor Workbench

Remote access to R/Bioconductor on EBI’s 64-bit Linux Cluster

Start the workbench by downloading the package for your operating system (Macintosh or Windows), or via Java Web Start, and you will get access to an instance of R running on one of EBI’s powerful machines. You can install additional packages, upload your own data, work with graphics and collaborate with colleagues, all as if you are running R locally, but unlimited by your machine’s memory, processor or data storage capacity.

  • Most up-to-date R version built for multicore CPUs
  • Access to all Bioconductor packages
  • Access to our computing infrastructure
  • Fast access to data stored in EBI’s repositories (e.g., public microarray data in ArrayExpress)

Using R Google Docs
http://www.omegahat.org/RGoogleDocs/run.pdf
It uses the XML and RCurl packages and illustrates that it is relatively quick and easy
to use their primitives to interact with Web services.

Using R with Amazon
Citation
http://rgrossman.com/2009/05/17/running-r-on-amazons-ec2/

Amazon’s EC2 is a type of cloud that provides on-demand computing infrastructure built from Amazon Machine Images, or AMIs. In general, these types of cloud provide several benefits:

  • Simple and convenient to use. An AMI contains your applications, libraries, data and all associated configuration settings. You simply access it. You don’t need to configure it. This applies not only to applications like R, but also can include any third-party data that you require.
  • On-demand availability. AMIs are available over the Internet whenever you need them. You can configure the AMIs yourself without involving the service provider. You don’t need to order any hardware and set it up.
  • Elastic access. With elastic access, you can rapidly provision and access the additional resources you need. Again, no human intervention from the service provider is required. This type of elastic capacity can be used to handle surge requirements when you might need many machines for a short time in order to complete a computation.
  • Pay per use. The cost of 1 AMI for 100 hours and 100 AMI for 1 hour is the same. With pay per use pricing, which is sometimes called utility pricing, you simply pay for the resources that you use.
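The pay-per-use point above is just arithmetic, so here is a minimal sketch of it in Python. The hourly rate is a made-up number for illustration, not an actual Amazon price, and `usage_cost` is a hypothetical helper name.

```python
# Hypothetical rate: 10 cents per machine-hour (NOT a real EC2 price).
RATE_CENTS_PER_HOUR = 10


def usage_cost(instances, hours, rate_cents=RATE_CENTS_PER_HOUR):
    """Utility pricing: you pay for machine-hours, however they are arranged."""
    return instances * hours * rate_cents


# One machine for 100 hours versus 100 machines for one hour.
one_for_hundred = usage_cost(instances=1, hours=100)
hundred_for_one = usage_cost(instances=100, hours=1)

print(one_for_hundred, hundred_for_one)
```

Both calls cover the same 100 machine-hours, so the bill is the same, but the second one finishes in roughly a hundredth of the wall-clock time if the computation parallelizes well.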

Connecting to R on Amazon EC2- Detailed tutorials
Ubuntu Linux version
https://decisionstats.com/2010/09/25/running-r-on-amazon-ec2/
and Windows R version
https://decisionstats.com/2010/10/02/running-r-on-amazon-ec2-windows/

Connecting R to Data on Google Storage and Computing on Google Prediction API
https://github.com/onertipaday/predictionapirwrapper
R wrapper for working with Google Prediction API

This package consists in a bunch of functions allowing the user to test Google Prediction API from R.
It requires the user to have access to both Google Storage for Developers and Google Prediction API:
see
http://code.google.com/apis/storage/ and http://code.google.com/apis/predict/ for details.

Example usage:

#This example requires you had previously created a bucket named data_language on your Google Storage and you had uploaded a CSV file named language_id.txt (your data) into this bucket – see for details
library(predictionapirwrapper)

and Elastic R for Cloud Computing
http://user2010.org/tutorials/Chine.html

Abstract

Elastic-R is a new portal built using the Biocep-R platform. It enables statisticians, computational scientists, financial analysts, educators and students to use cloud resources seamlessly; to work with R engines and use their full capabilities from within simple browsers; to collaborate, share and reuse functions, algorithms, user interfaces, R sessions, servers; and to perform elastic distributed computing with any number of virtual machines to solve computationally intensive problems.
Also see Karim Chine’s http://biocep-distrib.r-forge.r-project.org/

R for Salesforce.com

At the point of writing this, there seem to be zero R-based apps on Salesforce.com. This could be a big opportunity for developers, as both Apex and R have similar structures. Developers could write free code in R and charge for their translated version in Apex on Salesforce.com.

Force.com and Salesforce have many (1009) apps at
http://sites.force.com/appexchange/home for cloud computing for
businesses, but very few forecasting and statistical simulation apps.

Example of Monte Carlo based app is here
http://sites.force.com/appexchange/listingDetail?listingId=a0N300000016cT9EAI#

These are like iPhone apps except meant for business purposes (I am
unaware if any university is offering salesforce.com integration
though google apps and amazon related research seems to be on)

Force.com uses a language called Apex; you can see
http://wiki.developerforce.com/index.php/App_Logic and
http://wiki.developerforce.com/index.php/An_Introduction_to_Formulas
Apex is similar to R in that it is object-oriented.

SAS Institute has an existing product for taking in Salesforce.com data.

A new SAS data surveyor is
available to access data from the Customer Relationship Management
(CRM) software vendor Salesforce.com; see
http://support.sas.com/documentation/cdl/en/whatsnew/62580/HTML/default/viewer.htm#datasurveyorwhatsnew902.htm

Personal note: mentioning SAS in an email to an R list is a big no-no in terms of getting a response and love. The same goes for being careless about which R help list to email (like R devel, R packages or R help).

For python based cloud see http://pi-cloud.com

Here comes PySpread- 85,899,345 rows and 14,316,555 columns

What’s new: one more open source analytics package, built like a spreadsheet with the ability to import a million cells.

From http://pyspread.sourceforge.net/index.html

About

Pyspread is a cross-platform Python spreadsheet application. It is based on and written in the programming language Python.

Instead of spreadsheet formulas, Python expressions are entered into the spreadsheet cells. Each expression returns a Python object that can be accessed from other cells. These objects can represent anything including lists or matrices.

Features

In pyspread, cells expect Python expressions and return Python objects. Therefore, complex data types such as lists, trees or matrices can be handled within a single cell. Macros can be used for functions that are too complex for a single expression.

Since Python modules can be easily used without external scripts, arbitrary size rational numbers (via gmpy), fixed point decimal numbers for business calculations (via the decimal module from the standard library) and advanced statistics including plotting functions (via RPy) can be used in the spreadsheet. Everything is directly available from each cell. Just use the grid.

Data can be imported and exported using csv files or the clipboard. Other forms of data exchange are possible using external Python modules.

In order to simplify sparse matrix editing, pyspread features a three dimensional grid that can be sized up to 85,899,345 rows and 14,316,555 columns (64 bit-systems, depends on row height and column width). Note that importing a million cells requires about 500 MB of memory.
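To see why exact number types matter in a spreadsheet cell, here is a small sketch using only the Python standard library. It uses `decimal` as the text describes, but swaps in `fractions.Fraction` for gmpy's rationals purely so the example runs without extra installs; that substitution is mine, not pyspread's.

```python
from decimal import Decimal
from fractions import Fraction

# Binary floats cannot represent 0.1 or 0.2 exactly, so the sum drifts.
float_total = 0.1 + 0.2

# Fixed point decimals keep cents exact, which is what business sums need.
decimal_total = Decimal("0.10") + Decimal("0.20")

# Exact rationals: three thirds really do add up to one.
third = Fraction(1, 3)
exact_sum = third + third + third

print(float_total == 0.3)                 # False
print(decimal_total == Decimal("0.30"))   # True
print(exact_sum == 1)                     # True
```

Any of these expressions could be typed into a cell as-is, since a cell simply holds a Python expression.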

The concept of pyspread allows doing everything from each cell that a Python script can do. This may very well include deleting your hard drive or sending your data via the Internet. Of course this is a non-issue if you sandbox properly or if you only use self developed spreadsheets. Since this is not the case for everyone (see the discussion at lwn.net), a GPG signature based trust model for spreadsheet files has been introduced. It ensures that only your own trusted files are executed on loading. Untrusted files are displayed in safe mode. You can trust a file manually. Inspect carefully.

Requirements

Pyspread runs on Linux, Windows and *nix platforms with GTK+ support. There are reports that it works with MacOS X as well. If you would like to contribute by testing on OS X please contact me.

Dependencies

Highly recommended for full functionality

  • PyMe >=0.8.1, Note for Windows™ users: If you want to use signatures without compiling PyMe try out Gpg4win.
  • gmpy >=1.1.0 and
  • rpy >=1.0.3.

Maturity

Pyspread is in early Beta release. This means that the core functionality is fully implemented but the program needs testing and polish.

and from the wiki

http://sourceforge.net/apps/mediawiki/pyspread/index.php?title=Main_Page

a spreadsheet with more powerful functions and data structures that are accessible inside each cell. Something like Python that empowers you to do things quickly. And yes, it should be free and it should run on Linux as well as on Windows. I looked around and found nothing that suited me. Therefore, I started pyspread.

Concept

  • Each cell accepts any input that works in a Python command line.
  • The inputs are parsed and evaluated by Python’s eval command.
  • The result objects are accessible via a 3D numpy object array.
  • String representations of the result objects are displayed in the cells.
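The four concept points above can be sketched in a few lines of Python. This is a toy model, not pyspread's actual code: the `Grid` class is hypothetical, a plain dict stands in for the 3D numpy object array so the sketch needs no third-party packages, and the name `S` for the grid inside cell expressions is an assumption.

```python
class Grid:
    """Toy spreadsheet: cells store Python expressions, results are Python objects."""

    def __init__(self):
        # (row, column, table) -> expression string typed into the cell
        self.code = {}

    def __setitem__(self, key, expression):
        self.code[key] = expression

    def __getitem__(self, key):
        # Evaluate the cell with eval; the grid is exposed as S so that
        # one cell's expression can read the result object of another cell.
        return eval(self.code[key], {"S": self})

    def display(self, key):
        # What the cell shows on screen: the string form of the result object.
        return str(self[key])


grid = Grid()
grid[0, 0, 0] = "[1, 2, 3]"                    # a whole list in one cell
grid[1, 0, 0] = "sum(S[0, 0, 0])"              # refers to the cell above
grid[2, 0, 0] = "[x * 2 for x in S[0, 0, 0]]"

print(grid.display((1, 0, 0)))
print(grid.display((2, 0, 0)))
```

Because every cell goes through `eval`, a cell can do anything a Python script can do, which is exactly why the trust model for untrusted files described earlier matters.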

Benefits

  • Each cell returns a Python object. This object can be anything including arrays and third party library objects.
  • Generator expressions can be used efficiently for data manipulation.
  • Efficient numpy slicing is used.
  • numpy methods are accessible for the data.

Installation

  1. Download the pyspread tarball or zip and unzip at a convenient place
  2. In case you do not have it already get and install Python, wxpython and numpy
     If you want the examples to work, install gmpy, R and rpy
     Really do check the version requirements that are mentioned on http://pyspread.sf.net
  3. Get install privileges (e.g. become root)
  4. Change into the directory and type
     python setup.py install
     Windows: Replace “python” with your Python interpreter (absolute path)
  5. Become normal user again
  6. Start pyspread by typing
     pyspread
  7. Enjoy

Next on Spreadsheet wishlist-

an MSI bundle / Windows self-installer which has all dependencies bundled in it, linking to PostgreSQL 😉 etc

way to go Mr Martin Manns

mmanns < at > gmx < dot > net

Using R for Time Series in SAS

 


Here is a great paper on time series analysis in R. Specifically, it shows how to call R from Base SAS and bring the R output back into SAS.

SAS Code

/* three methods: */

/* 1. Call R directly - Some errors are not reported to log */
x "'C:\Program Files\R\R-2.12.0\bin\r.exe' --no-save --no-restore <""&rsourcepath\tsdiag.r"">""&rsourcepath\tsdiag.out""";

/* include the R log in the SAS log */
data _null_;
  infile "&rsourcepath\tsdiag.out";
  file log;
  input;
  put 'R LOG: ' _infile_;
run;

/* include the image in the sas output. Specify a file if you are not using autogenerated html output */
ods html;

data _null_;
  file print;
  put "<IMG SRC='" "&rsourcepath\plot.png" "' border='0'>";
  put "<IMG SRC='" "&rsourcepath\acf.png" "' border='0'>";
  put "<IMG SRC='" "&rsourcepath\pacf.png" "' border='0'>";
  put "<IMG SRC='" "&rsourcepath\spect.png" "' border='0'>";
  put "<IMG SRC='" "&rsourcepath\fcst.png" "' border='0'>";
run;

ods html close;

The R code to create a time series plot is quite elegant though-


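library(tseries)
library(forecast)

air <- AirPassengers  # dataset name

ts.plot(air)            # raw series
acf(air)                # autocorrelation function
pacf(air)               # partial autocorrelation function
plot(decompose(air))    # trend, seasonal and random components

# The ARIMA model, chosen from the ACF and PACF graphs
air.fit <- arima(air, order = c(0, 1, 1),
                 seasonal = list(order = c(0, 1, 1), period = 12))
tsdiag(air.fit)         # residual diagnostics

air.forecast <- forecast(air.fit)
plot(air.forecast)      # plot() dispatches to the forecast plotting method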
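A note on what the model specification is doing: the middle 1 in order=c(0,1,1) and in the seasonal order means the series is differenced once at lag 1 and once at lag 12 (the period) before the moving-average terms are fitted. Here is a minimal sketch of just that differencing step, in Python and on a made-up series rather than the AirPassengers data. The `difference` helper and the synthetic numbers are mine, for illustration only.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both values exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Synthetic monthly data: a linear trend plus a pattern repeating every 12 months.
season = [0, 1, 3, 6, 8, 9, 9, 8, 6, 3, 1, 0]
series = [2 * t + season[t % 12] for t in range(36)]

detrended = difference(series, lag=1)       # d = 1 removes the linear trend
stationary = difference(detrended, lag=12)  # D = 1 at period 12 removes the seasonal pattern

print(len(series), len(detrended), len(stationary))
print(stationary)
```

On this perfectly regular toy series nothing is left after both differences. On real data what remains is the noise that the two moving-average terms then model.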

You can download the fascinating paper from the Analytics NCSU Website http://analytics.ncsu.edu/sesug/2008/ST-146.pdf

About the Author-

Sam Croker has a MS in Statistics from the University of South Carolina and has over ten years of experience in analytics.   His research interests are in time series analysis and forecasting with focus on stream-flow analysis.  He is currently using SAS, R and other analytical tools for fraud and abuse detection in Medicare and Medicaid data. He also has experience in analyzing, modeling and forecasting in the finance, marketing, hospitality, retail and pharmaceutical industries.