Making NeW R

Tal G, in his excellent blog piece, talks of “Why R Developers should not be paid”: http://www.r-statistics.com/2010/09/open-source-and-money-why-r-developers-shouldnt-be-paid/

His argument of love is not very original, though: it was first made by these four guys.

I am going to argue that “some” R developers should be paid, while the main focus should remain on volunteer code. These R developers should be paid as per usage of their packages.

Let me expand.

Imagine the following conversation between Ross Ihaka, Norman Nie and Peter Dalgaard.

Norman- Hey Guys, Can you give me some code- I got this new startup.

Ross Ihaka and Peter Dalgaard- Sure dude. Here is 100,000 lines of code, 2000 packages and 2 decades of effort.

Norman- Thanks guys.

Ross Ihaka- Hey, What you gonna do with this code.

Norman- I will better it. Sell it. Finally beat Jim Goodnight and his **** Proc GLM and **** Proc Reg.

Ross- Okay, but what will you give us? Will you give us some code back of what you improve?

Norman – Uh, let me explain this open core …

Peter D- Well how about some royalty?

Norman- Sure, we will throw parties at all conferences, snacks you know at user groups.

Ross – Hmm. That does not sound fair. (walks away in a huff, muttering)- He takes our code, sells it and won't share the code.

Peter D- Doesn't sound fair. I am back to reading Hamlet, the great Dane, and writing the next edition of my book. I am glad I wrote a book- Ross didn't even write that.

Norman- Uh oh. (picks up his phone)- Hey David Smith, we need to write some blog articles pronto – these open source guys, man…

----------

I think that sums up what has been going on in the dynamics of R recently. If Ross Ihaka and R Gentleman had adopted an open core strategy, meaning you can create packages for R but not share the original, where would we all be?

At this point, if he is reading this, David Smith, long-suffering veteran of open source flameouts, is rolling his eyes, while Tal G is wondering if he will publish this on R Bloggers, and if so, when.

Let's bring in another R veteran- Hadley Wickham, who wrote a book on R and also created ggplot. That's the best-quality, most often used graphics package.

In terms of economic utility to the end user, the ggplot package may be as useful as, if not more useful than, the foreach package developed by Revolution Computing/Analytics.

Now http://cran.r-project.org/web/packages/foreach/index.html says that foreach is licensed under http://www.apache.org/licenses/LICENSE-2.0

However, let's come to open core licensing (read about it here: http://alampitt.typepad.com/lampitt_or_leave_it/2008/08/open-core-licen.html ), which is where the debate is. Revolution takes code, enhances it (in my opinion) substantially with new formats like XDF for better efficiency, a web services API, and, coming next year, a GUI (thanks in advance, Dr Nie and guys),

and sells this advanced R code to businesses happy to pay (they are currently paying much more to Dr Goodnight and HIS guys).

Why would any sane customer buy it from Revolution if he could download exactly the same thing from http://r-project.org ?

Hence the business need for Revolution Analytics to have an enhanced R- as they are using a product based software model not software as a service model.

If Revolution gives away the source code of these new enhancements to the R core team, how will the R core team protect the above-mentioned intellectual property, given they have 2 decades of experience of giving away free code, and of going back and forth on just code?

Now Revolution also has a marketing budget, and that's how they sponsor some R Core events, conferences, and after-conference snacks.

How would people decide if they are being too generous or too stingy in their contribution (compared to the formidable generosity of SAS Institute to its employees, stakeholders and even third party analysts)?

Would it not be better if Revolution shifted that aspect of the relationship from its marketing budget to its research and development budget, and came up with some sort of incentive for “SOME” developers? Even researchers need grants, assistantships and scholarships. It could make a transparent royalty formula, say 17.5% of NEW R sales going to an R package developers' pool, which in turn examines usage rates of packages and need/merit before allocation. That would require Revolution to evolve from a startup to a more sophisticated corporate, and R Core could use this the same way as the John M Chambers software award/scholarship.
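To make the royalty arithmetic concrete, here is a minimal sketch in R. Only the 17.5% rate comes from the proposal above; the sales figure, the package names (pkgA, pkgB, pkgC) and the usage counts are hypothetical, and the split is by usage alone (the need/merit review is left out).

```r
# Sketch of a usage-weighted royalty pool. All inputs except the
# 17.5% rate are hypothetical, made-up numbers for illustration.
new_r_sales  <- 1e6      # hypothetical annual sales of the enhanced R
royalty_rate <- 0.175    # the 17.5% suggested in the text

pool <- new_r_sales * royalty_rate

# Hypothetical usage counts (for example, downloads) per package
usage <- c(pkgA = 50000, pkgB = 30000, pkgC = 20000)

# Each package receives a share of the pool proportional to its usage
allocation <- pool * usage / sum(usage)
print(allocation)
```

A real formula would also need an agreed, auditable way of measuring usage, which is the hard part.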

Don't pay all developers; it would be an insult to many of them (say, Prof Harrell, creator of Hmisc) to accept. But can Revolution expand its dev base (and prospects for future employees) by sponsoring even some R scholarships?

And I am sure that if Revolution opens up some more code to the community, they would find the rest of the world and its help useful. If it can't trust people like R Gentleman with some source code... well, he is a board member.

——————————————————————————————–

Now to sum up some technical discussions on NeW R

1) An accepted way of benchmarking efficiencies.

2) Code review and incorporation of efficiencies.

3) Multi threading- Multi core usage are trends to be incorporated.

4) GUIs like R Commander (with its plugins for other packages), Rattle for data mining, or Deducer need a focused effort. This may involve hiring user interface designers (like from Apple 😉) who will work for love AND money (even the Beatles charge royalty for that song).

5) More support for cloud computing initiatives like Biocep and Elastic R, or Amazon AMIs for using cloud computers. Note that efficiency arguments don't matter if you just use a Chrome browser and pay 2 cents an hour for an Amazon instance. Probably R core needs more direct involvement from Google (Cloud OS makers) and Amazon, as well as even Salesforce.com (for creating Force.com apps). Note that even more corporates need to be involved here, as cloud computing does not have any free and open source infrastructure (YET).

_______________________________________________________

Debates will come and go. This is an interesting intellectual debate, and someday the little guys will win the Revolution-

From Hugh M of Gaping Void-

http://www.gapingvoid.com/Moveable_Type/archives/cat_microsoft_blue_monster_series.html

HOW DOES A SOFTWARE COMPANY MAKE MONEY, IF ALL

SOFTWARE IS FREE?

“If something goes wrong with Microsoft, I can phone Microsoft up and have it fixed. With Open Source, I have to rely on the community.”

And the community, as much as we may love it, is unpredictable. It might care about your problem and want to fix it, then again, it may not. Anyone who has ever witnessed something online go “viral”, good or bad, will know what I’m talking about.

and especially-

http://gapingvoid.com/2007/04/16/how-well-does-open-source-currently-meet-the-needs-of-shareholders-and-ceos/

Source-http://gapingvoidgallery.com/

Kind of sums up what open core licensing is all about.

Analytics and Journals

Some good journals for reading on analytics-

1) JSS

http://www.jstatsoft.org/

present research that demonstrates the joint evolution of computational and statistical methods and techniques. Implementations can use languages such as C, C++, S, Fortran, Java, PHP, Python and Ruby or environments such as Mathematica, MATLAB, R, S-PLUS, SAS, Stata, and XLISP-STAT.

There are currently 370 articles, 23 code snippets, 86 book reviews, 4 software reviews, and 7 special volumes in archives

2) R Journal

http://journal.r-project.org/

The Journal

3) Pharma Programming

http://maney.co.uk/index.php/journals/pha/

Pharmaceutical Programming is the official journal of the Pharmaceutical Users Software Exchange (PhUSE), a non-profit membership society with the objective of educating programmers and their managers working in the pharmaceutical industry. Available both in print and online, Pharmaceutical Programming is an international journal with focus on programming in the regulated environment of the pharmaceutical and life sciences industry.

4) SAS Papers – User Groups

http://www.lexjansen.com/

4569 SAS papers presented at SGF/SUGI 1996-2010.
1343 SAS papers presented at PharmaSUG 2000-2010.
1810 SAS papers presented at NESUG 1997-2009.
1191 SAS papers presented at SESUG 1999-2009.
463 SAS papers presented at PhUSE 2005-2009.
787 SAS papers presented at WUSS 2003-2009.
337 SAS papers presented at MWSUG 2001, 2004-2009.
188 SAS papers presented at PNWSUG 2004-2009.
246 SAS papers presented at SCSUG 2003-2007, 2009.
221 SAS papers related to CDISC. Easy access to the CDISC Forum.

5) http://analyticsmagazine.com/

Magazine by http://www.informs.org/

6) Data Mining Journals

Academic Journals

Journals relevant to Data Mining

Applied Intelligence – The International Journal of Artificial Intelligence, Neural Networks, and Complex Problem-Solving Technologies –http://www.kluweronline.com/issn/0924-669X/contents
Data Mining and Knowledge Discovery – http://www.kluweronline.com/issn/1384-5810/
Journal of Intelligent Information Systems – Integrating Artificial Intelligence and Database Technologies –http://www.kluweronline.com/issn/0925-9902
Journal of Intelligent Systems – http://www.brunel.ac.uk/~hssrjis/
Knowledge and Information Systems – http://springerlink.metapress.com/openurl.asp?genre=journal&issn=0219-1377
Machine Learning – http://www.kluweronline.com/issn/0885-6125/
IEEE Transactions on Knowledge and Data Engineering – http://www.computer.org/tkde/
IEEE Transactions on Pattern Analysis and Machine Intelligence – http://www.computer.org/tpami/

Oracle Open World/ RODM package

From the press release, here comes Oracle Open World. They really have an excellent rock concert in that as well.

.NET and Windows @ Oracle Develop and Oracle OpenWorld 2010

Oracle Develop will again feature a .NET track for Oracle developers. Oracle Develop is suited for all levels of .NET developers, from beginner to advanced. It covers introductory Oracle .NET material, new features, deep dive application tuning, and includes three hours of hands-on labs apply what you learned from the sessions.

To register, go to Oracle Develop registration site.

Oracle OpenWorld will include several sessions on using the Oracle Database on Windows and .NET.

Session schedules and locations for Windows and .NET sessions at Oracle Develop and OpenWorld are now available.

Download: 32-bit ODAC 11.2.0.1.2 for Visual Studio 2010 and .NET Framework 4

With ODAC 11.2.0.1.2, developers can connect to Oracle Database versions 9.2 and higher from Visual Studio 2010 and .NET Framework 4. ODAC components support the full framework, as well as the new .NET Framework Client Profile.

Download ODAC 11.2.0.1.2

Oracle’s integration with Visual Studio 2010 video demo

ODAC 11.2.0.1.2 Data Sheet

New Features List

Statement of Direction: Oracle Database and Microsoft Entity Framework

Learn about Oracle’s beta and production plans to support Microsoft Entity Framework with Oracle Database.

Also see http://www.oracle.com/technetwork/articles/datawarehouse/saternos-r-161569.html

for

Data Mining Using the RODM Package

By Casimir Saternos

Some excerpts-

Open R and enter the following command.

> library(RODM)

This command loads the RODM library as well as the dependent RODBC package. The next step is to make a database connection.

> DB <- RODM_open_dbms_connection(dsn="orcl", uid="dm", pwd="dm")

Subsequent commands use the DB object (an instance of the RODBC class) to connect to the database. The DSN specified in the command is the name you used earlier for the Data Source Name during the ODBC connection configuration. You can view the actual R code being executed by the command by simply typing the function name (without parentheses).

> RODM_open_dbms_connection

And say making a Model in Oracle and R-

> numrows <- length(orange_data[,1])
> orange_data.rows <- length(orange_data[,1])
> orange_data.id <- matrix(seq(1, orange_data.rows), nrow=orange_data.rows, ncol=1, dimnames= list(NULL, c("CASE_ID")))
> orange_data <- cbind(orange_data.id, orange_data)
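The CASE_ID step can be tried locally, without an Oracle connection. The sketch below assumes R's built-in Orange data set as a stand-in for the orange_data table used in the article; only the column name CASE_ID is taken from the excerpt.

```r
# Local sketch only: R's built-in Orange data set stands in for the
# orange_data table from the article (an assumption, no database needed).
orange_data <- as.data.frame(Orange)

# Tree is an ordered factor; turn its labels into plain integers
orange_data$Tree <- as.integer(as.character(orange_data$Tree))

# Prepend a sequential case identifier, one per row
orange_data <- cbind(CASE_ID = seq_len(nrow(orange_data)), orange_data)

head(orange_data)
```

Using seq_len(nrow(...)) avoids building the one-column matrix by hand and still gives every row a unique identifier.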

This adjustment to the data frame then needs to be propagated to the database. You can confirm the change using the sqlColumns function, as listed earlier.

> RODM_create_dbms_table(DB, "orange_data")
> sqlColumns(DB, 'orange_data')$COLUMN_NAME

> glm <- RODM_create_glm_model(
database = DB,
data_table_name = "orange_data",
case_id_column_name = "CASE_ID",
target_column_name = "circumference",
model_name = "GLM_MODEL",
mining_function = "regression")

Information about this model can then be obtained by analyzing value returned from the model and stored in the variable named glm.

> glm$model.model_settings
> glm$glm.globals
> glm$glm.coefficients

Once you have a model, you can apply the model to a new set of data. To begin, create or retrieve sample data in the same format as the training data.

> query<-('select 999 case_id, 1 tree, 120 age, 
32 circumference from dual')

> orange_test<-sqlQuery(DB, query)
> RODM_create_dbms_table(DB, "orange_test")

Finally, the model can be applied to the new data set and the results analyzed.

results <- RODM_apply_model(database = DB, 
data_table_name = "orange_test",
model_name = "GLM_MODEL",
supplemental_cols = "circumference")

When your session is complete, you can clean up objects that were created (if you like) and you should close the database connection:

> RODM_drop_model(database=DB,'GLM_MODEL')
> RODM_drop_dbms_table(DB, "orange_test")
> RODM_drop_dbms_table(DB, "orange_data")
> RODM_close_dbms_connection(DB)

See the full article at http://www.oracle.com/technetwork/articles/datawarehouse/saternos-r-161569.html

SAS/Blades/Servers/ GPU Benchmarks

Just checked out cool new series from NVidia servers.

Now though SAS Inc/ Jim Goodnight thinks HP Blade Servers are the cool thing- the GPU takes hardware high performance computing to another level. It would be interesting to see GPU based cloud computers as well – say for the on Demand SAS (free for academics and students) but which has had some complaints of being slow.

See this for SAS and Blade Servers-

http://www.sas.com/success/ncsu_analytics.html

To give users hands-on experience, the program is underpinned by a virtual computing lab (VCL), a remote access service that allows users to reserve a computer configured with a desired set of applications and operating system and then access that computer over the Internet. The lab is powered by an IBM BladeCenter infrastructure, which includes more than 500 blade servers, distributed between two locations. The assignment of the blade servers can be changed to meet shifts in the balance of demand among the various groups of users. Laura Ladrie, MSA Classroom Coordinator and Technical Support Specialist, says, “The virtual computing lab chose IBM hardware because of its quality, reliability and performance. IBM hardware is also energy efficient and lends itself well to high performance/low overhead computing.”

That's interesting, since IBM now competes (as owner of SPSS) and also cooperates with SAS Institute.

And

http://www.theaustralian.com.au/australian-it/the-world-according-to-jim-goodnight-blade-switch-slashes-job-times/story-e6frgakx-1225888236107

You’re effectively turbo-charging through deployment of many processors within the blade servers?

Yes. We’ve got machines with 192 blades on them. One of them has 202 or 203 blades. We’re using Hewlett-Packard blades with 12 CPU cores on each, so it’s a total 2300 CPU cores doing the computation.

Our idea was to give every one of those cores a little piece of work to do, and we came up with a solution. It involved a very small change to the algorithm we were using, and it’s just incredible how fast we can do things now.

I don’t think of it as a grid, I think of it as essentially one computer. Most people will take a blade and make a grid out of it, where everything’s a separate computer running separate jobs.

We just look at it as one big machine that has memory and processors all over the place, so it’s a totally different concept.

GPU servers can be faster than CPU servers, though , Professor G.

Source-

http://www.nvidia.com/object/preconfigured_clusters.html

TESLA GPU COMPUTING SOLUTIONS FOR DATA CENTERS
Supercharge your cluster with the Tesla family of GPU computing solutions. Deploy 1U systems from NVIDIA or hybrid CPU-GPU servers from OEMs that integrate NVIDIA® Tesla™ GPU computing processors.

When compared to the latest quad-core CPU, Tesla 20-series GPU computing processors deliver equivalent performance at 1/20th the power consumption and 1/10th the cost. Each Tesla GPU features hundreds of parallel CUDA cores and is based on the revolutionary NVIDIA® CUDA™ parallel computing architecture with a rich set of developer tools (compilers, profilers, debuggers) for popular programming languages APIs like C, C++, Fortran, and driver APIs like OpenCL and DirectCompute.

NVIDIA’s partners provide turnkey easy-to-deploy Preconfigured Tesla GPU clusters that are customizable to your needs. For 3D cloud computing applications, our partners offer the Tesla RS clusters that are optimized for running RealityServer with iray.

Available Tesla Products for Data Centers:
– Tesla S2050
– Tesla M2050/M2070
– Tesla S1070
– Tesla M1060

Also I liked the hybrid GPU and CPU

And from a paper comparing GPU and CPU using benchmark tests on BLAS, from Debian developer Dirk E’s excellent blog-

http://dirk.eddelbuettel.com/blog/

Usage of accelerated BLAS libraries seems to be shrouded in some mystery, judging from somewhat regularly recurring requests for help on lists such as r-sig-hpc (gmane version), the R list dedicated to High-Performance Computing. Yet it doesn’t have to be; installation can be really simple (on appropriate systems).

Another issue that I felt needed addressing was a comparison between the different alternatives available, quite possibly including GPU computing. So a few weeks ago I sat down and wrote a small package to run, collect, analyse and visualize some benchmarks. That package, called gcbd (more about the name below) is now on CRAN as of this morning. The package both facilitates the data collection for the paper it also contains (in the vignette form common among R packages) and provides code to analyse the data - which is also included as a SQLite database. All this is done in the Debian and Ubuntu context by transparently installing and removing suitable packages providing BLAS implementations: that way we can fully automate data collection over several competing implementations via a single script (which is also included). Contributions of benchmark results are encouraged - that is the idea of the package.

And from his paper on the same-

Analysts are often eager to reap the maximum performance from their computing platforms.

A popular suggestion in recent years has been to consider optimised basic linear algebra subprograms (BLAS). Optimised BLAS libraries have been included with some (commercial) analysis platforms for a decade (Moler 2000), and have also been available for (at least some) Linux distributions for an equally long time (Maguire 1999). Setting BLAS up can be daunting: the R language and environment devotes a detailed discussion to the topic in its Installation and Administration manual (R Development Core Team 2010b, appendix A.3.1). Among the available BLAS implementations, several popular choices have emerged. Atlas (an acronym for Automatically Tuned Linear Algebra System) is popular as it has shown very good performance due to its automated and CPU-specific tuning (Whaley and Dongarra 1999; Whaley and Petitet 2005). It is also licensed in such a way that it permits redistribution leading to fairly wide availability of Atlas. We deploy Atlas in both a single-threaded and a multi-threaded configuration. Another popular BLAS implementation is Goto BLAS which is named after its main developer, Kazushige Goto (Goto and Van De Geijn 2008). While ‘free to use’, its license does not permit redistribution putting the onus of configuration, compilation and installation on the end-user. Lastly, the Intel Math Kernel Library (MKL), a commercial product, also includes an optimised BLAS library. A recent addition to the tool chain of high-performance computing are graphical processing units (GPUs). Originally designed for optimised single-precision arithmetic to accelerate computing as performed by graphics cards, these devices are increasingly used in numerical analysis. Earlier criticism of insufficient floating-point precision or severe performance penalties for double-precision calculation are being addressed by the newest models. Dependence on particular vendors remains a concern with NVidia’s CUDA toolkit (NVidia 2010) currently still the preferred development choice whereas the newer OpenCL standard (Khronos Group 2008) may become a more generic alternative that is independent of hardware vendors. Brodtkorb et al. (2010) provide an excellent recent survey. But what has been lacking is a comparison of the effective performance of these alternatives. This paper works towards answering this question. By analysing performance across five different BLAS implementations - as well as a GPU-based solution - we are able to provide a reasonably broad comparison.

Performance is measured as an end-user would experience it: we record computing times from launching commands in the interactive R environment (R Development Core Team 2010a) to their completion.
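A minimal sketch of that kind of end-user timing, using only base R. This is not the gcbd package's code; the matrix size, the helper name time_op and the choice of operations are assumptions made for illustration.

```r
# Time a few linear algebra operations the way an interactive user would
# experience them: wall-clock time from call to completion.
# The size n and the helper time_op() are hypothetical choices.
set.seed(42)
n <- 200
A <- matrix(rnorm(n * n), n, n)
B <- matrix(rnorm(n * n), n, n)

# Run a zero-argument function several times; return mean elapsed seconds
time_op <- function(op, reps = 3) {
  times <- vapply(seq_len(reps),
                  function(i) system.time(op())[["elapsed"]],
                  numeric(1))
  mean(times)
}

timings <- c(
  matmult   = time_op(function() A %*% B),
  crossprod = time_op(function() crossprod(A)),
  qr        = time_op(function() qr(A)),
  svd       = time_op(function() svd(A))
)
print(timings)
```

Running the same script against R builds linked to different BLAS libraries is what turns these numbers into a comparison; on a single machine they only show which operations dominate.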

And

Basic Linear Algebra Subprograms (BLAS) provide an Application Programming Interface (API) for linear algebra. For a given task such as, say, a multiplication of two conformant matrices, an interface is described via a function declaration, in this case sgemm for single precision and dgemm for double precision. The actual implementation becomes interchangeable thanks to the API definition and can be supplied by different approaches or algorithms. This is one of the fundamental code design features we are using here to benchmark the difference in performance from different implementations.

A second key aspect is the difference between static and shared linking. In static linking, object code is taken from the underlying library and copied into the resulting executable. This has several key implications. First, the executable becomes larger due to the copy of the binary code. Second, it makes it marginally faster as the library code is present and no additional look-up and subsequent redirection has to be performed. The actual amount of this performance penalty is the subject of near-endless debate. We should also note that this usually amounts to only a small load-time penalty combined with a function pointer redirection - the actual computation effort is unchanged as the actual object code is identical. Third, it makes the program more robust as fewer external dependencies are required. However, this last point also has a downside: no changes in the underlying library will be reflected in the binary unless a new build is executed. Shared library builds, on the other hand, result in smaller binaries that may run marginally slower - but which can make use of different libraries without a rebuild.
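The idea of an interchangeable implementation behind a fixed interface can be sketched in plain R. This is an analogy, not the BLAS API itself: gemm_naive, gemm_builtin and benchmark_gemm are hypothetical names, and only the built-in %*% operator is actually handed to whichever BLAS R is linked against.

```r
# Two implementations behind the same interface f(A, B) -> A times B.
# All function names here are hypothetical, chosen for illustration.
gemm_naive <- function(A, B) {
  stopifnot(ncol(A) == nrow(B))
  C <- matrix(0, nrow(A), ncol(B))
  for (i in seq_len(nrow(A)))
    for (j in seq_len(ncol(B)))
      C[i, j] <- sum(A[i, ] * B[, j])
  C
}

# Delegates to R's matrix product, which uses the linked BLAS
gemm_builtin <- function(A, B) A %*% B

# The benchmark harness only knows the interface, not the implementation
benchmark_gemm <- function(gemm, A, B) {
  elapsed <- system.time(C <- gemm(A, B))[["elapsed"]]
  list(result = C, elapsed = elapsed)
}

set.seed(1)
A <- matrix(rnorm(60 * 40), 60, 40)
B <- matrix(rnorm(40 * 50), 40, 50)

naive   <- benchmark_gemm(gemm_naive, A, B)
builtin <- benchmark_gemm(gemm_builtin, A, B)

# Same answer from both, so only the timings differ
print(all.equal(naive$result, builtin$result))
print(c(naive = naive$elapsed, builtin = builtin$elapsed))
```

Swapping the BLAS library under R works the same way one level down: the calling code stays the same while the implementation behind dgemm changes.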


And summing up,

We find reference BLAS to be dominated in all cases. Single-threaded Atlas BLAS improves on the reference BLAS but loses to multi-threaded BLAS. For multi-threaded BLAS we find the Goto BLAS dominate the Intel MKL, with a single exception of the QR decomposition on the xeon-based system which may reveal an error. The development version of Atlas, when compiled in multi-threaded mode is competitive with both Goto BLAS and the MKL. GPU computing is found to be compelling only for very large matrix sizes. Our benchmarking framework in the gcbd package can be employed by others through the R packaging system which could lead to a wider set of benchmark results. These results could be helpful for next-generation systems which may need to make heuristic choices about when to compute on the CPU and when to compute on the GPU.

Source- Dirk E’s paper and blog http://dirk.eddelbuettel.com/papers/gcbd.pdf

Quite appropriately,

hardware solutions at least need to be a part of Revolution Analytics’ thinking as well. SPSS does not have any choice anymore though 😉

It would be interesting to see how the new SAS Cloud Computing/ Server Farm/ Time Sharing facility is benchmarking CPU and GPU for SAS analytics performance – if being done already it would be nice to see a SUGI paper on the same at http://sascommunity.org.

Multi threading needs to be taken care of automatically by statistical software to optimize current local computing (including for New R).
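For contrast, this is roughly what explicit (non-automatic) multi-core code looks like in R today. It is a sketch that assumes the parallel package shipped with current R; the task function one_task and the cap of two cores are hypothetical choices.

```r
# Explicit multi-core sketch using the 'parallel' package (an assumption:
# it ships with current R). Forking is unavailable on Windows, so this
# falls back to a single core there.
library(parallel)

n_cores <- detectCores()
if (is.na(n_cores) || .Platform$OS.type == "windows") n_cores <- 1L
n_cores <- min(2L, n_cores)

# Hypothetical unit of work: sum of singular values of a random matrix.
# The seed is set inside the task so results do not depend on the worker.
one_task <- function(seed) {
  set.seed(seed)
  A <- matrix(rnorm(100 * 100), 100, 100)
  sum(svd(A)$d)
}

seeds <- 1:8
serial_res   <- lapply(seeds, one_task)
parallel_res <- mclapply(seeds, one_task, mc.cores = n_cores)

# Both routes should give the same numbers
print(all.equal(serial_res, parallel_res))
```

The user has to choose the number of cores and restructure the work as independent tasks, which is exactly the burden the point above wants the software to take over.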

Acceptable benchmarks for testing hardware as well as software need to be reinforced and published across vendors, academics and companies.

What do you think?

Professors and Patches: For a Betterrrr R

Professors sometimes throw out provocative statements to ensure intellectual debate. I have had 1500+ hits in less than 2 days (and I am glad I am on wordpress.com; my old beloved server would have crashed).

The remarks from Ross Ihaka, covered before, were posted as a comment at Xian’s blog:

http://xianblog.wordpress.com/2010/09/06/insane/comment-page-4/#comments

Note that most of his remarks are technical, and only a single line refers to Revolution Analytics.

Other senior members of the community (read: professors) are silent, though probably some thought may have been ignited behind the scenes.

Ross Ihaka Says:
September 12, 2010 at 1:23 pm

Since (something like) my name has been taken in vain here, let me
chip in.

I’ve been worried for some time that R isn’t going to provide the base
that we’re going to need for statistical computation in the
future. (It may well be that the future is already upon us.) There
are certainly efficiency problems (speed and memory use), but there
are more fundamental issues too. Some of these were inherited from S
and some are peculiar to R.

One of the worst problems is scoping. Consider the following little
gem.

f =
function() {
    if (runif(1) > .5)
        x = 10
    x
}

The x being returned by this function is randomly local or global.
There are other examples where variables alternate between local and
non-local throughout the body of a function. No sensible language
would allow this. It’s ugly and it makes optimisation really
difficult. This isn’t the only problem, even weirder things happen
because of interactions between scoping and lazy evaluation.

In light of this, I’ve come to the conclusion that rather than
“fixing” R, it would be much more productive to simply start over and
build something better. I think the best you could hope for by fixing
the efficiency problems in R would be to boost performance by a small
multiple, or perhaps as much as an order of magnitude. This probably
isn’t enough to justify the effort (Luke Tierney has been working on R
compilation for over a decade now).

To try to get an idea of how much speedup is possible, a number of us
have been carrying out some experiments to see how much better we
could do with something new. Based on prototyping we’ve been doing at
Auckland, it looks like it should be straightforward to get two orders
of magnitude speedup over R, at least for those computations which are
currently bottle-necked. There are a couple of ways to make this
happen.

First, scalar computations in R are very slow. This is in part because
the R interpreter is very slow, but also because there are no scalar
types. By introducing scalars and using compilation it looks like it's
possible to get a speedup by a factor of several hundred for scalar
computations. This is important because it means that many ghastly
uses of array operations and the apply functions could be replaced by
simple loops. The cost of these improvements is that scope
declarations become mandatory and (optional) type declarations are
necessary to help the compiler.

As a side-effect of compilation and the use of type-hinting it should
be possible to eliminate dispatch overhead for certain (sealed)
classes (scalars and arrays in particular). This won’t bring huge
benefits across the board, but it will mean that you won’t have to do
foreign language calls to get efficiency.

A second big problem is that computations on aggregates (data frames
in particular) run at glacial rates. This is entirely down to
unnecessary copying because of the call-by-value semantics.
Preserving call-by-value semantics while eliminating the extra copying
is hard. The best we can probably do is to take a conservative
approach. R already tries to avoid copying where it can, but fails in
an epic fashion. The alternative is to abandon call-by-value and move
to reference semantics. Again, prototyping indicates that several
hundredfold speedup is possible (for data frames in particular).

The changes in semantics mentioned above mean that the new language
will not be R. However, it won’t be all that far from R and it should
be easy to port R code to the new system, perhaps using some form of
automatic translation.

If we’re smart about building the new system, it should be possible to
make use of multi-cores and parallelism. Adding this to the mix might just
make it possible to get a three order-of-magnitude performance boost
with just a fraction of the memory that R uses. I think it’s something
really worth putting some effort into.

I also think one other change is necessary. The license will need to do a
better job of protecting work donated to the commons than GPL2 seems
to have done. I’m not willing to have any more of my work purloined by
the likes of Revolution Analytics, so I’ll be looking for better
protection from the license (and being a lot more careful about who I
work with).
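
The scoping example in Ross Ihaka's comment above can be run directly. The sketch below adds a global x (an assumption, since the comment does not define one) so that both outcomes are visible, and calls the function repeatedly.

```r
# A global x for the function to fall back on (added for illustration)
x <- "global"

# The function from the comment, using <- for assignment
f <- function() {
  if (runif(1) > .5)
    x <- 10   # sometimes creates a local x ...
  x           # ... otherwise this x is looked up in the global environment
}

set.seed(1)
results <- vapply(1:20, function(i) as.character(f()), character(1))

# Counts how often the local versus the global x was returned
print(table(results))
```

Which x is returned is decided at run time by a random draw, so a compiler cannot know in advance whether x is local, which is the optimisation problem the comment describes.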

The discussion spilled over to Stack Overflow as well

http://stackoverflow.com/questions/3706990/is-r-that-bad-that-it-should-be-rewritten-from-scratch/3710667#3710667

In the past week I’ve been following a discussion where Ross Ihaka wrote (here):

I’ve been worried for some time that R isn’t going to provide the base that we’re going to need for statistical computation in the future. (It may well be that the future is already upon us.) There are certainly efficiency problems (speed and memory use), but there are more fundamental issues too. Some of these were inherited from S and some are peculiar to R.

He then continued explaining. This discussion started from this post, and was then followed by comments here, here, here, here, here, here and maybe some more places I don’t know of.

We all know the problem now.

R can be improved substantially in terms of speed.

For some solutions, here are the patches by Radford Neal:

http://www.cs.toronto.edu/~radford/speed-patches-doc

patch-dollar

Speeds up access to lists, pairlists, and environments using the
$ operator. The speedup comes mainly from avoiding the overhead of
calling DispatchOrEval if there are no complexities, from passing
on the field to extract as a symbol, or a name, or both, as available,
and then converting only as necessary, from simplifying and inlining
the pstrmatch procedure, and from not translating strings multiple
times.

Relevant timing test script: test-dollar.r

This test shows about a 40% decrease in the time needed to extract
elements of lists and environments.

Changes unrelated to speed improvement:

A small error-reporting bug is fixed, illustrated by the following
output with r52822:

> options(warnPartialMatchDollar=TRUE)
> pl <- pairlist(abc=1,def=2)
> pl$ab
[1] 1
Warning message:
In pl$ab : partial match of 'ab' to ''

Some code is changed at the end of R_subset3_dflt because it seems
to be more correct, as discussed in code comments.
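
For readers who want to see what this patch touches, here is a small R sketch (my own, not the test-dollar.r script mentioned above, which I have not reproduced). It shows exact and partial matching with the $ operator and a rough way to time repeated extraction. The loop count n is an arbitrary choice, and timings will differ by machine and R version.

```r
# `$` on a list and a pairlist allows partial matching of names
l  <- list(abc = 1, def = 2)
pl <- pairlist(abc = 1, def = 2)
l$abc   # 1 (exact match)
l$ab    # 1 (partial match on "abc")
pl$ab   # 1 (partial match on "abc")

# `$` on an environment matches names exactly only
e <- new.env()
assign("abc", 1, envir = e)
is.null(e$ab)   # TRUE: no partial matching for environments

# Rough timing of repeated extraction from a list and an environment
n <- 1e5
system.time(for (k in seq_len(n)) l$abc)
system.time(for (k in seq_len(n)) e$abc)
```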

patch-evalList

Speeds up a large number of operations by avoiding allocation of
an extra CONS cell in the procedures for evaluating argument lists.

Relevant timing test scripts: all of them, but will look at test-em.r

On test-em.r, the speedup from this patch is about 5%.

patch-fast-base

Speeds up lookup of symbols defined in the base environment, by
flagging symbols that have a base environment definition recorded
in the global cache. This allows the definition to be retrieved
quickly without looking in the hash table.

Relevant timing test scripts: all of them, but will look at test-em.r

On test-em.r, the speedup from this patch is about 3%.

Issue: This patch uses the "spare" bit for the flag. This bit is
misnamed, since it is already used elsewhere (for closures). It is
possible that one of the "gp" bits should be used instead. The
"gp" bits should really be divided up for faster access, and so that
their present use is apparent in the code.

In case this use of the "spare" bit proves unwise, the patch code is
conditional on FAST_BASE_CACHE_LOOKUP being defined at the start of
envir.c.

patch-fast-spec

Speeds up lookup of function symbols that begin with a character
other than a letter or ".", by allowing fast bypass of non-global
environments that do not contain (and have never contained) symbols
of this sort. Since it is expected that only functions will be
given names of this sort, the check is done only in findFun, though
it could also be done in findVar.

Relevant timing test scripts: all of them, but will look at test-em.r

On test-em.r, the speedup from this patch is about 8%.

Issue: This patch uses the "spare" bit to flag environments known
to not have symbols starting with a special character. See remarks
on patch-fast-base.

In case this use of the "spare" bit proves unwise, the patch code is
conditional on FAST_SPEC_BYPASS being defined at the start of envir.c.

patch-for

Speeds up for loops by not allocating new space for the loop
variable every iteration, unless necessary.

Relevant timing test script: test-for.r

This test shows a speedup of about 5%.

Change unrelated to speed improvement:

Fixes what I consider to be a bug, in which the loop clobbers a
global variable, as demonstrated by the following output with r52822:

> i <- 99
> f <- function () for (i in 1:3) { print(i); if (i==2) rm(i); }
> f()
[1] 1
[1] 2
[1] 3
> print(i)
[1] 3

patch-matprod

Speeds up matrix products, including vector dot products. The
speed issue here is that the R code checks for any NAs, and
does the multiply in the matprod procedure (in array.c) if so,
since BLAS isn't trusted with NAs. If this check takes about
as long as just doing the multiply in matprod, calling a BLAS
routine makes no sense.

Relevant time test script: test-matprod.r

With no external BLAS, this patch speeds up long vector-vector
products by a factor of about six, matrix-vector products by a
factor of about three, and some matrix-matrix products by a
factor of about two.

Issue: The matrix multiply code in matprod uses an LDOUBLE
(long double) variable to accumulate sums, for improved accuracy.
On a SPARC system I tested on, operations on long doubles are
vastly slower than on doubles, so that the patch produces a
large slowdown rather than an improvement. This is also an issue
for the "sum" function, which also uses an LDOUBLE to accumulate
the sum. Perhaps an ordinary double should be used in these
places, or perhaps the configuration script should define LDOUBLE
as double on architectures where long doubles are extraordinarily
slow.

Due to this issue, not defining MATPROD_CAN_BE_DONE_HERE at the
start of array.c will disable this patch.
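
As a user-level illustration of the NA issue described here (a sketch of mine, not part of the patch or of test-matprod.r), the following R lines show a vector dot product computed two ways and how a missing value propagates through the product. The vector values are arbitrary.

```r
x <- c(1, 2, 3)
y <- c(4, 5, 6)

drop(x %*% y)   # 32: dot product via the matrix product operator
sum(x * y)      # 32: the same value computed elementwise

# With a missing value, the check for NAs succeeds and the
# result of the product is missing as well.
x_na <- c(1, NA, 3)
anyNA(x_na)                # TRUE
is.na(drop(x_na %*% y))    # TRUE
```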

patch-parens

Speeds up parentheses by making "(" a special operator whose
argument is not evaluated, thereby bypassing the overhead of
evalList. Also slightly speeds up curly brackets by inlining
a function that is stylistically better inline anyway.

Relevant test script: test-parens.r

In the parens part of test-parens.r, the speedup is about 9%.

patch-protect

Speeds up numerous operations by making PROTECT, UNPROTECT, etc.
be mostly macros in the files in src/main. This takes effect
only for files that include Defn.h after defining the symbol
USE_FAST_PROTECT_MACROS. With these macros, code of the form
v = PROTECT(...) must be replaced by PROTECT(v = ...).

Relevant timing test scripts: all of them, but will look at test-em.r

On test-em.r, the speedup from this patch is about 9%.

patch-save-alloc

Speeds up some binary and unary arithmetic operations by, when
possible, using the space holding one of the operands to hold
the result, rather than allocating new space. Though primarily
a speed improvement, for very long vectors avoiding this allocation
could avoid running out of space.

Relevant test script: test-complex-expr.r

On this test, the speedup is about 5% for scalar operands and about
8% for vector operands.

Issues: There are some tricky issues with attributes, but I think
I got them right. This patch relies on NAMED being set correctly
in the rest of the code. In case it isn't, the patch can be disabled
by not defining AVOID_ALLOC_IF_POSSIBLE at the top of arithmetic.c.

patch-square

Speeds up a^2 when a is a long vector by not checking for the
special case of an exponent of 2 over and over again for every
vector element.

Relevant test script: test-square.r

The time for squaring a long vector is reduced in this test by a
factor of more than five.
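
A quick way to try this yourself, as a hedged sketch rather than the actual test-square.r script: time a^2 against a * a on a long vector. The vector length and repeat count are arbitrary assumptions, and the gap you see depends on the R version you run.

```r
a <- runif(1e6)   # a long numeric vector

# Time repeated squaring with the power operator and with multiplication
system.time(for (k in 1:20) a^2)
system.time(for (k in 1:20) a * a)

# Both forms should give the same values
all.equal(a^2, a * a)   # TRUE
```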

patch-sum-prod

Speeds up the "sum" and "prod" functions by not checking for NA
when na.rm=FALSE, and other detailed code improvements.

Relevant test script: test-sum-prod.r

For sum, the improvement is about a factor of 2.5 when na.rm=FALSE,
and about 10% when na.rm=TRUE.

Issue: See the discussion of patch-matprod regarding LDOUBLE.
There is no change regarding this issue due to this patch, however.
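
For reference, the two cases the patch distinguishes look like this from the R prompt (a minimal sketch of mine with made-up values, not the test-sum-prod.r script):

```r
x <- c(1, 2, NA, 4)

sum(x)                  # NA: with na.rm = FALSE the missing value propagates
sum(x, na.rm = TRUE)    # 7
prod(x, na.rm = TRUE)   # 8

# Rough timing on a long vector; numbers vary by machine and R version
big <- runif(1e6)
system.time(for (k in 1:50) sum(big))
system.time(for (k in 1:50) sum(big, na.rm = TRUE))
```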

patch-transpose

Speeds up the transpose operation (the "t" function) from detailed
code improvements.

Relevant test script: test-transpose.r

The improvement for 200x60 matrices is about a factor of two.
There is little or no improvement for long row or column vectors.

patch-vec-arith

Speeds up arithmetic on vectors of the same length, or when one
vector is of length one. This is done with detailed code improvements.

Relevant test script: test-vec-arith.r

On long vectors, the +, -, and * operators are sped up by about
20% when operands are the same length or one operand is of length one.

Rather mysteriously, when the operands are not length one or the
same length, there is about a 20% increase in time required, though
this may be due to some strange C optimizer peculiarity or some
strange cache effect, since the C code for this is the same as before,
with negligible additional overhead getting to it. Regardless, this
case is much less common than equal lengths or length one.

There is little change for the / operator, which is much slower than
+, -, or *.
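
The three cases discussed above (equal lengths, one operand of length one, and general recycling) can be seen in a few lines of R. This is only an illustration with made-up values, not the test-vec-arith.r script.

```r
a <- c(1, 2, 3, 4, 5, 6)
b <- c(10, 20, 30, 40, 50, 60)

a + b           # same length:      11 22 33 44 55 66
a * 2           # length-one case:   2  4  6  8 10 12
a + c(10, 20)   # general recycling: 11 22 13 24 15 26
```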

patch-vec-subset

Speeds up extraction of subsets of vectors or matrices (eg, v[10:20]
or M[1:10,101:110]). This is done with detailed code improvements.

Relevant test script: test-vec-subset.r

There are lots of tests in this script. The most dramatic improvement
is for extracting many rows and columns of a large array, where the
improvement is by about a factor of four. Extracting many rows from
one column of a matrix is sped up by about 30%.

Changes unrelated to speed improvement:

Fixes two latent bugs where the code incorrectly refers to NA_LOGICAL
when NA_INTEGER is appropriate and where LOGICAL and INTEGER types
are treated as interchangeable. These cause no problems at the moment,
but would if representations were changed.
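
Here is what the extraction forms named above look like in practice. This is a small sketch of mine; the matrix size is an arbitrary assumption and is unrelated to the sizes used in test-vec-subset.r.

```r
v <- seq_len(100)
v[10:20]                 # elements 10 through 20

# A 20 x 200 matrix filled column by column
M <- matrix(seq_len(20 * 200), nrow = 20, ncol = 200)
S <- M[1:10, 101:110]    # many rows and columns at once
dim(S)                   # 10 10
S[1, 1]                  # 2001, i.e. M[1, 101] = (101 - 1) * 20 + 1

col5 <- M[2:11, 5]       # many rows from one column
length(col5)             # 10
```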

patch-subscript

(Formerly part of patch-vec-subset) This patch also speeds up
extraction, and also replacement, of subsets of vectors or
matrices, but focuses on the creation of the indexes rather than
the copy operations. Often avoids a duplication (see below) and
eliminates a second scan of the subscript vector for zero
subscripts, folding it into a previous scan at no additional cost.

Relevant test script: test-vec-subset.r

Speeds up some operations with scalar or short vector indexes by
about 10%. Speeds up subscripting with a longer vector of
positive indexes by about 20%.

Issues: The current code duplicates a vector of indexes when it
seems unnecessary. Duplication is for two reasons: to handle
the situation where the index vector is itself being modified in
a replace operation, and so that any attributes can be removed, which
is helpful only for string subscripts, given how the routine to handle
them returns information via an attribute. Duplication for the
second reason can easily be avoided, so I avoided it. The first
reason for duplication is sometimes valid, but can usually be avoided
by first only doing it if the subscript is to be used for replacement
rather than extraction, and second only doing it if the NAMED field
for the subscript isn't zero.

I also removed two layers of procedure call overhead (passing seven
arguments, so not trivial) that seemed to be doing nothing. Probably
it used to do something, but no longer does, but if instead it is
preparation for some future use, then removing it might be a mistake.
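
The subscripting cases mentioned in this patch note are easy to see at the R level. The sketch below is my own illustration with made-up values: it shows a zero subscript being dropped, a replacement with positive indexes, and the awkward case where the index vector is itself the object being modified, which is why R sometimes has to duplicate it.

```r
x <- c(10, 20, 30, 40, 50)

sub <- x[c(0, 2, 4)]   # the zero subscript is dropped: 20 40

y <- x
y[c(1, 3, 5)] <- 0     # replacement with positive indexes: 0 20 0 40 0

# The index vector is also the target of the replacement.
# R must use the original index values (3, 1, 2) throughout.
i <- c(3, 1, 2)
i[i] <- c(9, 8, 7)
i                      # 8 7 9
```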

In my opinion, software problems are best solved by writing code or patches rather than by discussing them endlessly.

Some other solutions for a BETTERRRR R:
1) Complete Code Design Review
2) Version 3 - Tuneup
3) Better Documentation
4) Suing Revolution Analytics for the code - Hand over da code, pardner

Linux: Who did what and how much?

A report distributed under Creative Commons 3 and available at

The report shows that Canonical, the commercial arm of Ubuntu, has contributed only about one percent of the code to the GNOME desktop for Linux, while Red Hat accounts for 17 percent of the code and Novell developers are responsible for about 11 percent. That prompted some heartburn from Mark Shuttleworth, founder of Canonical and Ubuntu, at http://www.markshuttleworth.com/archives/517

And it would be a very different story if it weren’t for the Mozilla folks and Netscape before them, and GNOME and KDE, and Google and everyone else who have exercised that stack in so many different ways, making it better along the way. There are tens of thousands of people who are not in any way shape or form associated with Ubuntu, who make this story real. Many of them have been working at it for more than a decade – it takes a long time to make an overnight success while Ubuntu has only been on the scene six years. So Ubuntu cannot be credited solely for the delight of its users.

Nevertheless, the Ubuntu Project does bring something unique, special and important to free software: a total commitment to everyday users and use cases, the idea that free software should be “for everyone” both economically and in ease of use, and a willingness to chase down the problems that stand between here and there. I feel that commitment is a gift back to the people who built every one of those packages. If we can bring free software to ten times the audience, we have amplified the value of your generosity by a factor of ten, we have made every hour spent fixing an issue or making something amazing, ten times as valuable. I’m very proud to be spending the time and energy on Ubuntu that I do. Yes, I could do many other things, but I can’t think of another course which would have the same impact on the world.

I recognize that not everybody will feel the same way. Bringing their work to ten times the audience without contributing features might just feel like leeching, or increasing the flow of bug reports 10x. I suppose you could say that no matter how generous we are to downstream users, if upstream is only measuring code, then any generosity other than code won’t be registered. I don’t really know what to do about that – I didn’t found Ubuntu as a vehicle for getting lots of code written, that didn’t seem to me to be what the world needed.

Open source communities work like democracies, with all the noise that comes with them, whereas R&D within corporations follows a stricter hierarchy. Still, for all that, Ubuntu and Android have made Linux mainstream, just as R has made statistical software available to all.

And Ubuntu also has great support for R (particularly the single click R Commander Install and Icon) available at http://packages.ubuntu.com/lucid/math/r-cran-rcmdr

John M. Chambers Statistical Software Award – 2011

Write code, win cash and glory. Deep bow to Father John M. Chambers, inventor of S, for endowing this award for statistical software creation by grads and undergrads.

An effort to be matched by companies like SAS and SPSS, which after all came from grad school work. Now back to the competition: I gotta get my homies from U Tenn in a team (I was a grad student last year, though I am taking this year off due to medico-financial reasons).

John M. Chambers Statistical Software Award – 2011
Statistical Computing Section
American Statistical Association

The Statistical Computing Section of the American Statistical
Association announces the competition for the John M. Chambers
Statistical Software Award. In 1998 the Association for Computing
Machinery presented its Software System Award to John Chambers for the
design and development of S. Dr. Chambers generously donated his award
to the Statistical Computing Section to endow an annual prize for
statistical software written by an undergraduate or graduate student.
The prize carries with it a cash award of $1000, plus a substantial
allowance for travel to the annual Joint Statistical Meetings where
the award will be presented.

Teams of up to 3 people can participate in the competition, with the
cash award being split among team members. The travel allowance will
be given to just one individual in the team, who will be presented the
award at JSM. To be eligible, the team must have designed and
implemented a piece of statistical software. The individual within
the team indicated to receive the travel allowance must have begun the
development while a student, and must either currently be a student,
or have completed all requirements for her/his last degree after
January 1, 2009. To apply for the award, teams must provide the
following materials:

Current CV’s of all team members.

A letter from a faculty mentor at the academic institution of the
individual indicated to receive the travel award. The letter
should confirm that the individual had substantial participation in
the development of the software, certify her/his student status
when the software began to be developed (and either the current
student status or the date of degree completion), and briefly
discuss the importance of the software to statistical practice.

A brief, one to two page description of the software, summarizing
what it does, how it does it, and why it is an important
contribution. If the team member competing for the travel
allowance has continued developing the software after finishing
her/his studies, the description should indicate what was developed
when the individual was a student and what has been added since.

An installable software package with its source code for use by the
award committee. It should be accompanied by enough information to allow
the judges to effectively use and evaluate the software (including
its design considerations.) This information can be provided in a
variety of ways, including but not limited to a user manual (paper
or electronic), a paper, a URL, and online help to the system.

All materials must be in English. We prefer that electronic text be
submitted in Postscript or PDF. The entries will be judged on a
variety of dimensions, including the importance and relevance for
statistical practice of the tasks performed by the software, ease of
use, clarity of description, elegance and availability for use by the
statistical community. Preference will be given to those entries that
are grounded in software design rather than calculation. The decision
of the award committee is final.

All application materials must be received by 5:00pm EST, Monday,
February 21, 2011 at the address below. The winner will be announced
in May and the award will be given at the 2011 Joint Statistical
Meetings.

Information on the competition can also be accessed on the website of
the Statistical Computing Section (www.statcomputing.org or see the
ASA website, www.amstat.org for a pointer), including the names and
contributions of previous winners. Inquiries and application
materials should be emailed or mailed to:

Chambers Software Award
c/o Fei Chen
Avaya Labs
233 Mt Airy Rd.
Basking Ridge, NJ 07920
feic@avaya.com