Journal of Statistical Software

Here is a good open-content journal for people who want to keep track of the latest in statistical software: the Journal of Statistical Software.

Citation: http://www.jstatsoft.org/

Established in 1996, the Journal of Statistical Software publishes articles, book reviews, code snippets, and software reviews on the subject of statistical software and algorithms.  The contents are freely available on-line.  For both articles and code snippets the source code is published along with the paper.

Implementations can use languages such as C, C++, S, Fortran, Java, PHP, Python and Ruby or environments such as Mathematica, MATLAB, R, S-PLUS, SAS, Stata, and XLISP-STAT.

For example, it carries book reviews of A Handbook of Statistical Analyses Using SAS (Third Edition) and Statistics and Data with R: An Applied Approach Through Examples.

It is cutting-edge material for anyone who wants to keep up with the latest fast-moving trends in statistical software, and it offers convenient RSS feeds as well as email announcement alerts.

Note: journals can be ranked using a quantitative index called the Impact Factor.

Citation: http://in-cites.com/research/2007/august_27_2007-2.html

For example, for Statistics & Probability:

In these columns, total citations to a journal’s published papers are divided by the total number of papers that the journal published, producing a citations-per-paper impact score over a five-year period (middle column) and a 26-year period (right-hand column).

Journals Ranked by Impact: Statistics & Probability

Rank  2006 Impact Factor             Impact 2002-06                 Impact 1981-2006
1     Bioinformatics (4.89)          Bioinformatics (9.87)          Econometrica (52.93)
2     Biostatistics (3.01)           J. Royal Stat. Soc. B (6.75)   J. Royal Stat. Soc. B (27.32)
3     Chemom. Intell. Lab. (2.45)    Biostatistics (6.56)           J. Am. Stat. Assoc. (25.11)
4     Econometrica (2.40)            J. Computat. Biology (6.49)    Biometrika (22.75)
5     J. Royal Stat. Soc. B (2.32)   Econometrica (5.82)            Annals of Statistics (21.31)
6     IEEE ACM T Comp. Bi. (2.28)    J. Chemometrics (5.08)         Biometrics (20.32)
7     J. Am. Stat. Assoc. (2.17)     J. Am. Stat. Assoc. (4.95)     Technometrics (17.74)
8     Multivar. Behav. Res. (2.10)   Statistical Science (4.19)     Multivar. Behav. Res. (16.62)
9     J. Computat. Biology (2.00)    Annals of Statistics (3.94)    Bioinformatics (16.37)
10    Annals of Statistics (1.90)    Stat. in Medicine (3.62)       J. Royal Stat. Soc. A (14.46)
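
The citations-per-paper score described before the table is a simple ratio. Here is a minimal Python sketch of that arithmetic. The function name and the counts are hypothetical, chosen only for illustration, and are not taken from the in-cites data.

```python
def impact_score(total_citations: int, total_papers: int) -> float:
    """Citations per paper over a chosen window, rounded to two decimals."""
    if total_papers <= 0:
        raise ValueError("total_papers must be positive")
    return round(total_citations / total_papers, 2)


# Hypothetical counts for one journal over a five-year window.
five_year_score = impact_score(total_citations=1974, total_papers=200)
print(five_year_score)
```

The same function applies to the 26-year column. Only the window over which citations and papers are counted changes.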

High Performance Computing and R

From http://cran.r-project.org/web/views/HighPerformanceComputing.html

The following is an excellent list of resources for high-performance computing with R.

CRAN Task View: High Performance and Parallel Computing

Maintainer: Dirk Eddelbuettel
Contact: Dirk.Eddelbuettel at R-project.org
Version: 2009-06-12

This CRAN task view contains a list of packages, grouped by topic, that are useful for high-performance computing (HPC) with R. In this context, we are defining ‘high-performance computing’ rather loosely as just about anything related to pushing R a little further: using compiled code, parallel computing (in both explicit and implicit modes), working with large objects as well as profiling.

Unless otherwise mentioned, all packages presented with hyperlinks are available from CRAN, the Comprehensive R Archive Network.

Several of the areas discussed in this Task View are undergoing rapid change. Please send suggestions for additions and extensions for this task view to the task view maintainer.

Suggestions and corrections by Achim Zeileis, Markus Schmidberger, Martin Morgan, Max Kuhn, Tomas Radivoyevitch, Jochen Knaus, Tobias Verbeke, Hao Yu, and David Roseberg are gratefully acknowledged.

Parallel computing: Explicit parallelism

  • Several packages provide the communications layer required for parallel computing. The first package in this area was rpvm by Li and Rossini which uses the PVM (Parallel Virtual Machine) standard and libraries. rpvm is no longer actively maintained.
  • In recent years, the alternative MPI (Message Passing Interface) standard has become the de facto standard in parallel computing. It is supported in R via the Rmpi package by Yu. Rmpi is mature yet actively maintained and offers access to numerous functions from the MPI API, as well as a number of R-specific extensions. Rmpi can be used with the LAM/MPI, MPICH / MPICH2, Open MPI, and Deino MPI implementations. It should be noted that LAM/MPI is now in maintenance mode, and new development is focussed on Open MPI.
  • An alternative is provided by the nws (NetWorkSpaces) packages from REvolution Computing. It is the successor to the earlier LindaSpaces approach to parallel computing, and is implemented on top of the Twisted networking toolkit for Python.
  • The snow (Simple Network of Workstations) package by Tierney et al. can use PVM, MPI, NWS as well as direct networking sockets. It provides an abstraction layer by hiding the communications details. The snowFT package provides fault-tolerance extensions to snow.
  • The snowfall package by Knaus provides a more recent alternative to snow. Functions can be used in sequential or parallel mode.
  • The papply package by Currie provided a subset of the Rmpi functionality, but is no longer actively maintained either.
  • The biopara package by Lazar and Schoenfeld offers socket-based parallel execution with some support for load-balancing and fault-tolerance.
  • The taskPR package by Samatova et al. builds on top of LAM/MPI and offers parallel execution of tasks.
  • The Simple Parallel R INTerface (SPRINT) package by Hill et al. ( link , paper ) provides a prototype framework that allows the addition of parallelised functions to R for easy exploitation of HPC systems. Currently only a parallelised correlation calculation is provided.
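
The list above notes that snowfall functions can run in sequential or parallel mode. As a rough illustration of that idea only, here is a Python sketch. It is an analogy and does not show the API of snow, snowfall or any other R package. The names apply_tasks and square are hypothetical, and a thread pool stands in for the cluster workers.

```python
from concurrent.futures import ThreadPoolExecutor


def apply_tasks(func, items, parallel=False, workers=4):
    """Apply func to each item, either sequentially or on a pool of workers."""
    if not parallel:
        return [func(item) for item in items]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so both modes return the same list.
        return list(pool.map(func, items))


def square(x):
    return x * x


sequential = apply_tasks(square, range(8))
parallel = apply_tasks(square, range(8), parallel=True)
```

Because the calling code is identical in both modes, a script can be debugged sequentially and then switched to the pool. For CPU-bound work in Python, a process pool would be the more natural stand-in than threads.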

Parallel computing: Implicit parallelism

  • The pnmath package by Tierney ( link ) uses the OpenMP parallel processing directives of recent compilers (such as gcc 4.2 or later) for implicit parallelism by replacing a number of internal R functions with replacements that can make use of multiple cores, without any explicit requests from the user. The alternate pnmath0 package offers the same functionality using Pthreads for environments in which the newer compilers are not available. Similar functionality is expected to become integrated into R ‘eventually’.
  • The romp package by Jamitzky was presented at useR! 2008 ( slides ) and offers another interface to OpenMP using Fortran. The code is still pre-alpha and available from the Google Code project romp. An R-Forge project romp was initiated but there is no package yet.
  • The fork package by Warnes provides R-equivalents to low-level Unix system functions like fork, signal, wait, kill and exit in order to spawn sub-processes for parallel execution.
  • The multicore package by Urbanek provides a way of running parallel computations in R on machines with multiple cores or CPUs.
  • The R/parallel package by Vera, Jansen and Suppi offers a C++-based master-slave dispatch mechanism for parallel execution ( link ).
  • The RScaLAPACK package by Samatova et al. provides an interface to the ScaLAPACK libraries which can replace the standard BLAS libraries and offer parallel execution of the same BLAS functions.
  • The SPRINT package by Hill adds another parallel framework to R ( link ).
  • The mapReduce package by Brown provides a simple framework for parallel computations following the Google mapReduce approach. It provides a pure R implementation, a syntax following the mapReduce paper and a flexible and parallelizable back end.
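
To make the map/reduce approach mentioned in the last item concrete, here is a minimal single-process Python sketch of the three usual phases (map, group by key, reduce) applied to a word count. It does not reproduce the mapReduce R package's syntax. All function names and the sample documents are hypothetical.

```python
from collections import defaultdict
from functools import reduce


def map_phase(documents):
    """Emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the values of each key into a single count."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


docs = ["R is fast", "R is parallel"]
counts = reduce_phase(shuffle(map_phase(docs)))
```

Each key is reduced independently of the others, which is why the reduce step can be handed to a parallel back end.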

Parallel computing: Grid computing

  • The GridR package by Wegener et al. can be used in a grid computing environment via a web service, via ssh or via Condor or Globus.
  • The multiR package by Grose was presented at useR! 2008 but has not been released. It may offer a snow-style framework on a grid computing platform.
  • The biocep-distrib project by Chine offers a Java-based framework for local, Grid, or Cloud computing. It is under active development.
  • The RHIPE package by Guha provides an interface between R and Hadoop for a Map/Reduce programming framework. ( link )

Parallel computing: Random numbers

  • Random-number generators for parallel computing are available via the rsprng package by Li, and the rlecuyer package by Sevcikova and Rossini.
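
The concern behind these packages is that each parallel worker needs its own random-number stream. The Python sketch below shows only the basic idea of deriving a separate, reproducible generator per worker. It is not the SPRNG or L'Ecuyer algorithm used by rsprng and rlecuyer, and it carries none of their statistical guarantees about stream independence. The names worker_rng and base_seed are hypothetical.

```python
import hashlib
import random


def worker_rng(base_seed: int, worker_id: int) -> random.Random:
    """Derive a separate, reproducible generator for one worker."""
    digest = hashlib.sha256(f"{base_seed}:{worker_id}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


# One generator per worker, all derived from the same base seed.
streams = [worker_rng(2009, worker_id) for worker_id in range(3)]
draws = [[rng.random() for _ in range(2)] for rng in streams]
```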

Parallel computing: Resource managers and batch schedulers

  • Job-scheduling toolkits permit management of parallel computing resources and tasks. The slurm (Simple Linux Utility for Resource Management) set of programs (written by a consortium led by Lawrence Livermore Labs) works well with MPI. ( link )
  • The Condor toolkit ( link ) from the University of Wisconsin-Madison has been used with R as described in this R News article .
  • The sfCluster package by Knaus can be used with snowfall ( link ) but is currently limited to LAM/MPI.
  • The Rsge package by Bode offers an interface to the Sun Grid Engine batch-queuing system.
  • The Rlsf package by Smith et al. offers an interface to the LSF cluster/grid system.

Parallel computing: Applications

  • The caret package by Kuhn can use various frameworks (MPI, NWS, etc.) to parallelize cross-validation and bootstrap characterizations of predictive models.
  • The multtest package by Pollard et al. can use snow, Rmpi or rpvm for resampling-based testing of multiple hypotheses.
  • The maanova package by Wu can use snow and Rmpi for the analysis of micro-array experiments.
  • The pvclust package by Suzuki and Shimodaira can use snow and Rmpi for hierarchical clustering via multiscale bootstraps; and the scaleboot package by Shimodaira can use pvclust, snow and Rmpi for computing approximately unbiased p-values via multiscale bootstraps.
  • The tm package by Feinerer can use snow and Rmpi for parallelized text mining.
  • The varSelRF package by Diaz-Uriarte can use snow and Rmpi for parallelized use of variable selection via random forests; and the ADaCGH package by Diaz-Uriarte and Rueda can use Rmpi and papply for parallelized analysis of array CGH data.
  • The bcp package by Erdman and Emerson for the Bayesian analysis of change points, and the bigmemory package by Kane and Emerson can use nws for parallelized operations.
  • The networksis package by Admiraal and Handcock can use rpvm and snow for parallelized simulation of bipartite graphs via sequential importance sampling.
  • The BARD package by Altman for better automated redistricting, the GAMBoost package by Binder for glm and gam model fitting via boosting using b-splines, the Geneland package by Estoup, Guillot and Santos for structure detection from multilocus genetic data, the Matching package by Sekhon for multivariate and propensity score matching, the STAR package by Pouzat for spike train analysis, the bnlearn package by Scutari for Bayesian network structure learning, the latentnet package by Krivitsky and Handcock for latent position and cluster models, the lga package by Harrington for linear grouping analysis, the peperr package by Porzelius and Binder for parallelised estimation of prediction error, the orloca package by Fernandez-Palacin and Munoz-Marquez for operations research locational analysis, the rgenoud package by Mebane and Sekhon for genetic optimization using derivatives, the affyPara package by Schmidberger, Vicedo and Mansmann for parallel normalization of Affymetrix microarrays, the puma package by Pearson et al. which propagates uncertainty into standard microarray analyses such as differential expression, and the ccems package for combinatorially complex equilibrium model selection can all use snow for parallelized operations using any one of the MPI, PVM, NWS or socket protocols supported by snow.
  • The bugsparallel package uses Rmpi for distributed computing of multiple MCMC chains using WinBUGS.
  • The partDSA package uses nws for generating a piecewise constant estimation list of increasingly complex predictors based on an intensive and comprehensive search over the entire covariate space.

Parallel computing: GPUs

  • The gputools package by Buckner provides several common data-mining algorithms which are implemented using a mixture of nVidia’s CUDA language and cublas library. Given a computer with an nVidia GPU these functions may be substantially more efficient than native R routines.

Large memory and out-of-memory data

  • The biglm package by Lumley uses incremental computations to offer lm() and glm() functionality to data sets stored outside of R’s main memory.
  • The ff package by Adler et al. offers file-based access to data sets that are too large to be loaded into memory, along with a number of higher-level functions.
  • The bigmemory package by Kane and Emerson permits storing large objects such as matrices in memory and uses external pointer objects to refer to them. This permits transparent access from R without bumping against R’s internal memory limits. Several R processes on the same computer can also share big memory objects.
  • A large number of database packages, and database-alike packages (such as sqldf by Grothendieck and data.table by Dowle) are also of potential interest but not reviewed here.
  • The HadoopStreaming package provides a framework for writing map/reduce scripts for use in Hadoop Streaming; it also facilitates operating on data in a streaming fashion which does not require Hadoop.
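
The "incremental computations" mentioned for biglm can be illustrated with a small Python sketch that fits a one-predictor linear model while seeing the data only one chunk at a time. This is a simplified stand-in that accumulates running sums. It is not biglm's own algorithm. The class name and the sample chunks are hypothetical.

```python
class StreamingLinearFit:
    """Simple linear regression y = a + b*x, fitted one chunk at a time."""

    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def update(self, xs, ys):
        """Fold one chunk into the running sums; the chunk can then be discarded."""
        for x, y in zip(xs, ys):
            self.n += 1
            self.sx += x
            self.sy += y
            self.sxx += x * x
            self.sxy += x * y

    def coefficients(self):
        """Return (intercept, slope) from the accumulated sums."""
        denom = self.n * self.sxx - self.sx ** 2
        if denom == 0:
            raise ValueError("need at least two distinct x values")
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return intercept, slope


fit = StreamingLinearFit()
# Two chunks standing in for pieces of a data set too large to load at once.
chunks = [([0, 1], [1, 3]), ([2, 3], [5, 7])]
for xs, ys in chunks:
    fit.update(xs, ys)
intercept, slope = fit.coefficients()
```

Only five running totals are kept in memory, however many rows pass through update().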

Easier interfaces for Compiled code

  • The inline package by Sklyar, Murdoch and Smith eases adding code in C, C++ or Fortran to R. It takes care of the compilation, linking and loading of embedded code segments that are stored as R strings.
  • The Rcpp package by Eddelbuettel offers a number of C++ classes that make transferring R objects to C++ functions (and back) easier, and the RInside package by Eddelbuettel allows easy embedding of R itself into C++ applications for faster and more direct data transfer.
  • The rJava package by Urbanek provides a low-level interface to Java similar to the .Call() interface for C and C++.

Profiling tools

  • The profr package by Wickham can visualize output from the Rprof interface for profiling.
  • The proftools package by Tierney can also be used to analyse profiling output.
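
For readers who have not profiled code before, the workflow behind these tools is to run the code under a profiler and then summarise where the time went. The sketch below shows that workflow with Python's standard cProfile and pstats modules as an analogy. It does not show Rprof, profr or proftools. The function slow_sum is a hypothetical workload.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """A deliberately plain loop to give the profiler something to measure."""
    total = 0
    for i in range(n):
        total += i * i
    return total


profiler = cProfile.Profile()
result = profiler.runcall(slow_sum, 100_000)

# Summarise the five most expensive entries by cumulative time.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
```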

Related links:

  • Slides from Introduction to High-Performance Computing with R tutorial / workshop presentation
    The Age of the Unthinkable: Book Review

    The Age of the Unthinkable is a thought-provoking book written by Joshua Cooper Ramo and published by Little, Brown. Anyone who has been surprised by change, or by the speed of change, in economic, political or technological matters in recent months should take at least one look at this beautiful book, with its remarkable case studies painstakingly culled and gathered from all parts of the world and from many cultures.

    The book has an easy-to-read style, with real-life incidents that we can relate to. It looks at creative innovation as a process analogous to sand particles piling onto one another, with sudden change being the point at which the sand pile has a flattening avalanche. It learns from examples: how highly centralized Communist systems collapsed while highly autonomous organizations like Hezbollah flourished because they kept learning and adapting in the face of a bigger enemy; how creative designers at Nintendo changed the paradigm of video games by using inexpensive video chips to invent the Wii Fit, a previously unthought-of kind of video game that helps people stay fit; and how a little-known company in Brazil cut costs by empowering bottom-up cost cutting rather than top-down cost thinking.

    The book talks about things like mashups and the speed at which change is unleashed on us. Lastly, it offers lessons on how leaders can embrace change and thus help themselves, or else inevitably be changed by external forces.

    Change is a process as sure as death and taxes, and the book compares and contrasts people who change willingly from within with people who are changed from outside. It is an entertaining and informative book. I recommend it for anyone and everyone who has had an “Oh, we are idiots” moment when surprised by rising taxes for bailouts, powerful armies that failed to keep them safe, or big cash-rich corporations that failed to keep them employed.

    For people trained in technology or science, this book will be an eye-opener.

    Statistics: Code and Poetry

    As you can see, my poem on Michael Jackson got more page views than any code I wrote or any interview I did, which reminds me to be humble rather than be humbled :)) More details are on the Advertise page above.

    Hive Tutorial: Cloud Computing

    Here is a nice video tutorial on Hive from Cloudera. I wonder what would happen if they put a real analytical system such as R, SPSS, JMP or SAS, and not just basic analytics and reporting, on top of a big database system like Hadoop (including some text-mined data from legacy company documents).

    Unlike Oracle and other database systems, Hadoop is free now and should remain so for the foreseeable future (as MySQL was before it was acquired by the big fish Sun, which was in turn acquired by the bigger fish Oracle).

    Citation: http://wiki.apache.org/hadoop/Hive

    Hive is a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, ad hoc querying and analysis of large datasets stored in Hadoop files.

    Hive is based on Hadoop which is a batch processing system. Accordingly, this system does not and cannot promise low latencies on queries. The paradigm here is strictly of submitting jobs and being notified when the jobs are completed as opposed to real time queries. As a result it should not be compared with systems like Oracle where analysis is done on a significantly smaller amount of data but the analysis proceeds much more iteratively with the response times between iterations being less than a few minutes. For Hive queries response times for even the smallest jobs can be of the order of 5-10 minutes and for larger jobs this may even run into hours.

    If your input data is small you can execute a query in a short time. For example, if a table has 100 rows you can ‘set mapred.reduce.tasks=1’ and ‘set mapred.map.tasks=1’ and the query time will be ~15 seconds.
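
    As a small illustration of the tip above, the following Python sketch assembles a Hive script that applies the two quoted settings before a query. The function name small_job_script and the table name small_table are hypothetical. Only the two set statements come from the quoted text.

```python
def small_job_script(query: str) -> str:
    """Prefix a query with the single-mapper, single-reducer settings quoted above."""
    settings = ["set mapred.reduce.tasks=1;", "set mapred.map.tasks=1;"]
    # Hive statements are terminated with a semicolon.
    return "\n".join(settings + [query.rstrip(";") + ";"])


# small_table is a hypothetical 100-row table.
script = small_job_script("SELECT COUNT(1) FROM small_table")
print(script)
```

    The resulting text could be saved to a file and run with the Hive command-line client. Actual timings will depend on the cluster.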


    Reactions to the IBM-SPSS takeover

    The business intelligence, business analytics and data mining industry (or, as James Taylor would say, the Decision Management industry) has had some reactions to the IBM-SPSS deal (which was NOT a surprise to many, including me). Really.

    From SAS Institute, Anne Milley

    http://blogs.sas.com/sascom/index.php?/archives/557-Analytics-is-still-our-middle-name.html

    Besides SAS, SPSS was one of the last independent analytic software companies. A colleague says, “It’s the end of the analytics cold war.”

    I’ve been saying all along that analytics is required for success. Yes, data integration, data quality, and query & reporting are important too but, as W. Edwards Deming says, “The object of taking data is to provide a basis for action.”

    The end of the analytics cold war, hmm. We all know what the end of the real Cold War brought us: Google, cloud computing, and other non-technical issues.

    From KXEN, Roger Haddad

    “The price paid for SPSS of four times revenues and 25 times earnings shows just how valuable this sector really is,” says Haddad. “But the deal has also created a tremendous opportunity for the sector’s remaining independent vendors that KXEN is well placed to capitalize on.” “There is no For Sale sign hanging in our window,” continues Haddad. “We launched KXEN in 1998 to democratize the benefits of data mining and predictive analytics, making them practical and affordable across the whole enterprise and not just the exclusive preserve of a few specialists. It’s going to take up to two years for the dust to settle following the IBM

    “Former SPSS partners, systems integrators and distributors will face uncertainty.”

    I think the PE multiple was still low. SPSS was worth more if you count the client base, the active community and the brand itself in the valuation. There are tremendous cross-sell opportunities, and IBM, with its nice research and development, is a good supporter of pure science. Yes, the next two years will see increasing consolidation and more “surprising” news. At 4 times revenues, anyone can be bought in the present market if it is a publicly listed company. 😉

    From the rather subdued voices on the SPSS list, some subjective and non-quantitative “strategic” forecasts.

    http://www.listserv.uga.edu/cgi-bin/wa?A2=ind0907&L=spssx-l&F=&S=&P=36324

    I think the Ancient Chinese said it best “May you live in interesting times”.

    Having worked with some flavors of Cognos and SPSS, I think there could be areas for technical integration in querying and GUI-based forecasting as well, apart from financial mergers and administrative readjustments. I mean, people pull data not just to report it, but also to estimate what comes next.

    This could also spell the end of single-platform analysts. You now need to learn at least two different platforms, such as SAS, SPSS or KXEN, R or Cognos, or Business Objects, to hedge your chances of getting offshored (Note: I worked in offshoring for almost 4 years in India in data analytics).

    Answering what IBM will do with SPSS and its open-source commitment to R, and the consequences for employees, customers, vendors and partners, who have more choices now than ever.

    …. well it depends. Who is John Galt?

    Interview: Karen Lopez, Data Modeling Expert

    Here is an interview with Karen Lopez who has worked in data modeling for almost three decades and is a renowned data management expert in her field.

    Data professionals need to know about the data domain in addition to the data structure domain – Karen Lopez

    Ajay- Describe your career in science. How would you persuade younger students to take more science courses?

    Karen- I’ve always had an interest in science and I attribute that to the great science teachers I had. I studied information systems at Purdue University through a unique program that focuses on systems analysis and computer technologies. I’m one of the few who studied data and process modeling in an undergraduate program 25+ years ago.

    I believe that it is very important that we find a way of attracting more scientists to teach. In both the natural and computer sciences, it’s difficult for institutions to tempt scientists away from professional positions that offer much greater compensation. So I support programs that find ways to make that happen.

    Ajay- If you had to give advice to a young person starting a career in BI in just three points, what would they be?

    Karen- Wow. It’s tough to think of just three things, but these are recommendations that I make often:

    – Remember that every design decision should be made based on cost, benefit, and risk. If you can’t clearly describe these for every side of a decision, then you aren’t doing design; you are guessing.

    – No one besides you is responsible for advancing your skills and keeping an eye on emerging practices. Don’t expect your employer to lay out a career plan that is in your best interest. That’s not their job. Data professionals need to know about the data domain in addition to the data structure domain. The best database or data warehouse design in the world is worse than useless if how the data is processed is wrong. Remember to expand your knowledge about data, not just the data structures and tools.

    – All real-world work involves collaboration and negotiation. There is no one right answer that works for every situation. Building your skills in these areas will pay off significantly.

    Ajay- What do you think is the best way for a technical consultant and a client to be on the same page regarding requirements? Which methodology or template have you used, and which has given you the most success?

    Karen- While I’m a huge fan of modeling (data modeling and other modeling), I still think that giving clients a prototype or mockup of something that looks real to them goes a long way. We need to build tools and competencies to develop these prototypes quickly. It’s a lost art in the data world.

    Ajay- What are the special incentives that make Canada a great place for tech entrepreneurs, rather than, say, going to the United States? (Disclosure: I have family in Canada and study in the US.)

    Karen- I prefer not to think of this as an either-or decision. I immigrated to Canada from the US about 15 years ago, but most of our business is outside of Canada. I have enjoyed special incentives here in Canada for small businesses as well as special programs that allowed me to work in Canada as a technical professional before I moved here permanently.

    Overall, I have found Canadian employers more open to sponsoring foreign workers and it is easier for them to do so than what my US clients experience. Having said that, a significant portion of my work over the last few years has been on global projects where we leverage online collaboration tools to meet our goals. The advent of these tools has made it much easier to work from wherever I am and to work with others regardless of their visa statuses.

    Where a company forms is less tied to where one lives or works these days.

    Ajay- Could you tell us more about the Zachman framework (apart from the wikipedia reference)? A practical example on how you used it on an actual project would be great.

    Karen- Of course the best resource for finding out about the Zachman framework is from John Zachman himself http://www.zachmaninternational.com/index.php/home-article/13 . He offers some excellent courses and does a great deal of public speaking at government and DAMA events. I highly recommend anyone interested in the Framework to hear about it directly from him.

    There are many misunderstandings about John’s intent, such as the myth that he requires big upfront modeling (he doesn’t), that the Framework is a methodology (it isn’t), or that it can only be used to build computer systems (it can be used for more than that).

    I have used the Zachman Framework to develop a joint Business-IT Strategic Information Systems Plan as well as to inventory and track progress of multi-project programs. One interesting use was a paper I authored for the Canadian Information Processing Society (CIPS) on how various educational programs, specializations, and certifications map to the Zachman Framework. I later developed a presentation about this mapping for a Zachman conference.

    For a specific project, the Zachman Framework allows business to understand where their enterprise assets are being managed – and how well they are managed. It’s not an IT thing; it’s an enterprise architecture thing.

    Ajay- What does Karen Lopez do for fun when not at work, traveling, speaking or blogging?

    Karen- Sometimes it seems that’s all I do. I enjoy volunteering for IT-related organizations such as DAMA and CIPS. I participate in the accreditation of college and university educational programs in Canada and abroad. As a member of data-related standards bodies, namely the Association for Retail Technology Standards and the American Dental Association, I help develop industry standard data models. I’ve also been a spokesperson for a CIPS program to encourage girls to take more math and science courses throughout their student careers so that they may have access to great opportunities in the future.

    I like to think of myself as a runner; last year I completed my first half marathon, which I’d never thought was possible. I am studying Hindi and Sanskrit. I’m also addicted to reading and am thankful that some of it I actually get paid to do.

    Biography

    Karen López is a Senior Project Manager at InfoAdvisors, Inc. Karen is a frequent speaker at DAMA conferences and DAMA Chapters. She has 20+ years of experience in project and data management on large, multi-project programs. Karen specializes in the practical application of data management principles. Karen is also the ListMistress and moderator of the InfoAdvisors Discussion Groups at http://www.infoadvisors.com. You can reach her at www.twitter.com/datachick