Tag: ross ihaka
Making NeW R
Tal G, in his excellent blog piece, talks of "Why R Developers should not be paid": http://www.r-statistics.com/2010/09/open-source-and-money-why-r-developers-shouldnt-be-paid/
His argument about love is not very original, though; it was first made by these four guys.
I am going to argue that "some" R developers should be paid, while the main focus should remain volunteer code. These R developers should be paid according to the usage of their packages.
Let me expand.
Imagine the following conversation between Ross Ihaka, Norman Nie and Peter Dalgaard.
Norman- Hey guys, can you give me some code? I got this new startup.
Ross Ihaka and Peter Dalgaard- Sure, dude. Here are 100,000 lines of code, 2000 packages and two decades of effort.
Norman- Thanks guys.
Ross Ihaka- Hey, what are you gonna do with this code?
Norman- I will better it. Sell it. Finally beat Jim Goodnight and his **** Proc GLM and **** Proc Reg.
Ross- Okay, but what will you give us? Will you give us back some of the code you improve?
Norman – Uh, let me explain this open core …
Peter D- Well how about some royalty?
Norman- Sure, we will throw parties at all conferences, snacks you know at user groups.
Ross- Hmm. That does not sound fair. (walks away in a huff, muttering) He takes our code, sells it and won't share the code back.
Peter D- Doesn't sound fair. I am back to reading Hamlet, the great Dane, and writing the next edition of my book. I am glad I wrote a book- Ross didn't even write that.
Norman- Uh oh. (picks up his phone) Hey David Smith, we need to write some blog articles pronto- these open source guys, man…
———–I think that sums up what has been going on in the dynamics of R recently. If Ross Ihaka and R Gentleman had adopted an open core strategy- meaning you could create packages for R but not share the original source- where would we all be?
At this point, if he is reading this, David Smith, long-suffering veteran of open source flameouts, is rolling his eyes, while Tal G is wondering whether he will publish this on R Bloggers, and if so, when.
Let's bring in another R veteran- Hadley Wickham, who wrote a book on R and also created ggplot, the best-quality and most often used graphics package.
In terms of economic utility to the end user, the ggplot package may be as useful as, if not more useful than, the foreach package developed by Revolution Computing/Analytics.
Now http://cran.r-project.org/web/packages/foreach/index.html says that foreach is licensed under http://www.apache.org/licenses/LICENSE-2.0
However, let's come to open core licensing (read about it here: http://alampitt.typepad.com/lampitt_or_leave_it/2008/08/open-core-licen.html ), which is where the debate is. Revolution takes the code and enhances it (in my opinion) substantially- a new XDF format for better efficiency, web services APIs, and, coming next year, a GUI (thanks in advance, Dr Nie and guys)-
and sells this advanced R code to businesses happy to pay (they are currently paying much more to Dr Goodnight and HIS guys).
Why would any sane customer buy it from Revolution if he could download exactly the same thing from http://r-project.org ?
Hence the business need for Revolution Analytics to have an enhanced R: they are using a product-based software model, not a software-as-a-service model.
If Revolution gives the source code of these enhancements back to the R core team, how will the R core team protect that intellectual property- given that they have two decades of experience in giving code away for free, with nothing but code going back and forth?
Now Revolution also has a marketing budget- and that's how they sponsor some R Core events, conferences and after-conference snacks.
How would people decide whether they are being too generous or too stingy in their contribution (compared to the formidable generosity of the SAS Institute to its employees, stakeholders and even third-party analysts)?
Would it not be better if Revolution shifted that aspect of the relationship from its marketing budget to its research and development budget and came up with some sort of incentive for "SOME" developers? Even researchers need grants, assistantships and scholarships. Make a transparent royalty formula- say 17.5% of NEW R sales goes into an R PACKAGE Developers pool, which in turn examines the usage rates of packages and need/merit before allocation. That would require Revolution to evolve from a startup into a more sophisticated corporate, and R Core could use it the same way as the John M Chambers software award/scholarship.
Don't pay all developers- it would be an insult to many of them (say Prof Harrell, creator of Hmisc) to accept- but can Revolution expand its developer base (and its prospects for future employees) by sponsoring even a few R scholarships?
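As a rough illustration of how such a usage-weighted royalty pool might work (the sales figure, package names and usage counts below are entirely hypothetical; only the 17.5% cut comes from the proposal above), a minimal sketch in R:

new_r_sales <- 1e6                                     # hypothetical annual NEW R sales
pool <- 0.175 * new_r_sales                            # 17.5% goes to the developers' pool
usage <- c(pkgA = 50000, pkgB = 30000, pkgC = 20000)   # hypothetical package usage counts
payout <- pool * usage / sum(usage)                    # allocate the pool by relative usage
round(payout)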
And I am sure that if Revolution opened up some more code to the community, the rest of the world would find it useful and help improve it. If it can't trust people like R Gentleman with some source code- well, he is a board member.
——————————————————————————————–
Now to sum up some technical discussions on NeW R
1) An accepted way of benchmarking efficiencies (a minimal timing sketch follows this list).
2) Code review and incorporation of efficiencies.
3) Multi-threading and multi-core usage are trends to be incorporated.
4) GUIs- like R Commander and its e-plugins for other packages, Rattle for data mining, or Deducer- need focused effort. This may involve hiring user interface designers (like from Apple 😉) who will work for love AND money (even the Beatles charge royalties for that song).
5) More support for cloud computing initiatives like Biocep and Elastic-R, or an Amazon AMI for using cloud computers- note that efficiency arguments matter less if you just use a Chrome browser and pay 2 cents an hour for an Amazon instance. R Core probably needs more direct involvement from Google (cloud OS makers) and Amazon, as well as even Salesforce.com (for creating Force.com apps). Even more corporates need to be involved here, as cloud computing does not have any free and open source infrastructure (YET).
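As a rough idea of what an agreed benchmarking convention could start from (everything below is base R; the task being timed is just an illustrative choice), a minimal sketch:

n <- 1e6
x <- runif(n)
loop_sum <- function(v) {                  # interpreted scalar loop
  s <- 0
  for (vi in v) s <- s + vi
  s
}
system.time(for (i in 1:5) loop_sum(x))    # time the loop version
system.time(for (i in 1:5) sum(x))         # time the vectorised built-in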
_______________________________________________________
Debates will come and go. This is an interesting intellectual debate and someday the little guys will win the Revolution-
From Hugh M of Gaping Void-
http://www.gapingvoid.com/Moveable_Type/archives/cat_microsoft_blue_monster_series.html
HOW DOES A SOFTWARE COMPANY MAKE MONEY, IF ALL
SOFTWARE IS FREE?
“If something goes wrong with Microsoft, I can phone Microsoft up and have it fixed. With Open Source, I have to rely on the community.”
And the community, as much as we may love it, is unpredictable. It might care about your problem and want to fix it, then again, it may not. Anyone who has ever witnessed something online go “viral”, good or bad, will know what I’m talking about.
and especially-
Source: http://gapingvoidgallery.com/
Kind of sums up what open core licensing is all about.
Professors and Patches: For a Betterrrr R
Professors sometimes throw out provocative statements to ensure intellectual debate. I have had 1500+ hits in less than 2 days (and I am glad I am on wordpress.com- my old beloved server would have crashed).
The remarks from Ross Ihaka, covered here before, are also at Xian's blog:
http://xianblog.wordpress.com/2010/09/06/insane/comment-page-4/#comments
Note that most of his remarks are technical, and only a single line refers to Revolution Analytics.
Other senior members of the community (read: professors) are silent, though probably some thought has been ignited behind the scenes.
Ross Ihaka Says:
September 12, 2010 at 1:23 pm
Since (something like) my name has been taken in vain here, let me
chip in.
I’ve been worried for some time that R isn’t going to provide the base
that we’re going to need for statistical computation in the
future. (It may well be that the future is already upon us.) There
are certainly efficiency problems (speed and memory use), but there
are more fundamental issues too. Some of these were inherited from S
and some are peculiar to R.
One of the worst problems is scoping. Consider the following little
gem.
f = function() {
    if (runif(1) > .5)
        x = 10
    x
}
The x being returned by this function is randomly local or global.
There are other examples where variables alternate between local and
non-local throughout the body of a function. No sensible language
would allow this. It’s ugly and it makes optimisation really
difficult. This isn’t the only problem, even weirder things happen
because of interactions between scoping and lazy evaluation.
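To make the scoping problem concrete (this illustration is mine, not part of Ihaka's comment): whether f() sees a local or a global x depends on the random draw, and the global fallback errors when no global x exists.

x <- 42   # a global x happens to exist
f()       # returns 10 when the branch assigns a local x,
          # otherwise silently returns the global value 42
rm(x)
f()       # the fallback case is now an error: object 'x' not found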
In light of this, I’ve come to the conclusion that rather than
“fixing” R, it would be much more productive to simply start over and
build something better. I think the best you could hope for by fixing
the efficiency problems in R would be to boost performance by a small
multiple, or perhaps as much as an order of magnitude. This probably
isn’t enough to justify the effort (Luke Tierney has been working on R
compilation for over a decade now).
To try to get an idea of how much speedup is possible, a number of us
have been carrying out some experiments to see how much better we
could do with something new. Based on prototyping we’ve been doing at
Auckland, it looks like it should be straightforward to get two orders
of magnitude speedup over R, at least for those computations which are
currently bottle-necked. There are a couple of ways to make this
happen.
First, scalar computations in R are very slow. This is in part because
the R interpreter is very slow, but also because there are no scalar
types. By introducing scalars and using compilation it looks like it's
possible to get a speedup by a factor of several hundred for scalar
computations. This is important because it means that many ghastly
uses of array operations and the apply functions could be replaced by
simple loops. The cost of these improvements is that scope
declarations become mandatory and (optional) type declarations are
necessary to help the compiler.
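As a sketch of the kind of code being described (my illustration, not Ihaka's): the same scalar-heavy computation written as the usual apply-style workaround and as the plain loop that a compiled R with scalar types would make acceptable.

x <- runif(1e5)
# workaround style: an sapply call standing in for what is conceptually a loop
s1 <- sum(sapply(seq_along(x), function(i) x[i]^2 + sin(x[i])))
# the simple loop: clearer, but slow in today's interpreter
s2 <- 0
for (i in seq_along(x)) s2 <- s2 + x[i]^2 + sin(x[i])
all.equal(s1, s2)   # both give the same result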
As a side-effect of compilation and the use of type-hinting it should
be possible to eliminate dispatch overhead for certain (sealed)
classes (scalars and arrays in particular). This won’t bring huge
benefits across the board, but it will mean that you won’t have to do
foreign language calls to get efficiency.
A second big problem is that computations on aggregates (data frames
in particular) run at glacial rates. This is entirely down to
unnecessary copying because of the call-by-value semantics.
Preserving call-by-value semantics while eliminating the extra copying
is hard. The best we can probably do is to take a conservative
approach. R already tries to avoid copying where it can, but fails in
an epic fashion. The alternative is to abandon call-by-value and move
to reference semantics. Again, prototyping indicates that several
hundredfold speedup is possible (for data frames in particular).
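One small way to watch the copying being described (my illustration; tracemem() reports copies only when R is built with memory profiling, which is the default for the standard binaries):

df <- data.frame(a = rnorm(1e6), b = rnorm(1e6))
tracemem(df)                              # mark the data frame so copies are reported
touch <- function(d) { d$a[1] <- 0; d }   # call-by-value: the caller's df is untouched
df2 <- touch(df)                          # tracemem prints copy messages for this small change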
The changes in semantics mentioned above mean that the new language
will not be R. However, it won’t be all that far from R and it should
be easy to port R code to the new system, perhaps using some form of
automatic translation.
If we’re smart about building the new system, it should be possible to
make use of multi-cores and parallelism. Adding this to the mix might just
make it possible to get a three order-of-magnitude performance boost
with just a fraction of the memory that R uses. I think it’s something
really worth putting some effort into.
I also think one other change is necessary. The license will need to do a
better job of protecting work donated to the commons than GPL2 seems
to have done. I’m not willing to have any more of my work purloined by
the likes of Revolution Analytics, so I’ll be looking for better
protection from the license (and being a lot more careful about who I
work with).
The discussion spilled over to Stack Overflow as well
In the past week I've been following a discussion where Ross Ihaka wrote (here):
I’ve been worried for some time that R isn’t going to provide the base that we’re going to need for statistical computation in the future. (It may well be that the future is already upon us.) There are certainly efficiency problems (speed and memory use), but there are more fundamental issues too. Some of these were inherited from S and some are peculiar to R.
He then continued explaining. This discussion started from this post, and was then followed by comments here, here, here, here, here, here and maybe some more places I don't know of.
We all know the problem now.
R can be improved substantially in terms of speed.
For some solutions, here are the patches by Radford Neal:
http://www.cs.toronto.edu/~radford/speed-patches-doc
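To give a flavour of what the timing test scripts mentioned below measure (a rough sketch of my own, not Radford Neal's actual test-dollar.r or test-for.r):

lst <- list(alpha = 1, beta = 2, gamma = 3)
system.time(for (i in 1:1e6) lst$beta)   # repeated $ extraction: patch-dollar territory
system.time(for (i in 1:1e6) i + 1)      # loop plus scalar arithmetic: patch-for territory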
patch-dollar: Speeds up access to lists, pairlists, and environments using the $ operator. The speedup comes mainly from avoiding the overhead of calling DispatchOrEval if there are no complexities, from passing on the field to extract as a symbol, or a name, or both, as available, and then converting only as necessary, from simplifying and inlining the pstrmatch procedure, and from not translating strings multiple times. Relevant timing test script: test-dollar.r. This test shows about a 40% decrease in the time needed to extract elements of lists and environments. Changes unrelated to speed improvement: a small error-reporting bug is fixed, illustrated by the following output with r52822:

> options(warnPartialMatchDollar=TRUE)
> pl <- pairlist(abc=1,def=2)
> pl$ab
[1] 1
Warning message:
In pl$ab : partial match of 'ab' to ''

Some code is changed at the end of R_subset3_dflt because it seems to be more correct, as discussed in code comments.

patch-evalList: Speeds up a large number of operations by avoiding allocation of an extra CONS cell in the procedures for evaluating argument lists. Relevant timing test scripts: all of them, but will look at test-em.r. On test-em.r, the speedup from this patch is about 5%.

patch-fast-base: Speeds up lookup of symbols defined in the base environment, by flagging symbols that have a base environment definition recorded in the global cache. This allows the definition to be retrieved quickly without looking in the hash table. Relevant timing test scripts: all of them, but will look at test-em.r. On test-em.r, the speedup from this patch is about 3%. Issue: this patch uses the "spare" bit for the flag. This bit is misnamed, since it is already used elsewhere (for closures). It is possible that one of the "gp" bits should be used instead. The "gp" bits should really be divided up for faster access, and so that their present use is apparent in the code. In case this use of the "spare" bit proves unwise, the patch code is conditional on FAST_BASE_CACHE_LOOKUP being defined at the start of envir.r.

patch-fast-spec: Speeds up lookup of function symbols that begin with a character other than a letter or ".", by allowing fast bypass of non-global environments that do not contain (and have never contained) symbols of this sort. Since it is expected that only functions will be given names of this sort, the check is done only in findFun, though it could also be done in findVar. Relevant timing test scripts: all of them, but will look at test-em.r. On test-em.r, the speedup from this patch is about 8%. Issue: this patch uses the "spare" bit to flag environments known to not have symbols starting with a special character. See remarks on patch-fast-base. In case this use of the "spare" bit proves unwise, the patch code is conditional on FAST_SPEC_BYPASS being defined at the start of envir.r.

patch-for: Speeds up for loops by not allocating new space for the loop variable every iteration, unless necessary. Relevant timing test script: test-for.r. This test shows a speedup of about 5%. Change unrelated to speed improvement: fixes what I consider to be a bug, in which the loop clobbers a global variable, as demonstrated by the following output with r52822:

> i <- 99
> f <- function () for (i in 1:3) { print(i); if (i==2) rm(i); }
> f()
[1] 1
[1] 2
[1] 3
> print(i)
[1] 3

patch-matprod: Speeds up matrix products, including vector dot products. The speed issue here is that the R code checks for any NAs, and does the multiply in the matprod procedure (in array.c) if so, since BLAS isn't trusted with NAs. If this check takes about as long as just doing the multiply in matprod, calling a BLAS routine makes no sense. Relevant timing test script: test-matprod.r. With no external BLAS, this patch speeds up long vector-vector products by a factor of about six, matrix-vector products by a factor of about three, and some matrix-matrix products by a factor of about two. Issue: the matrix multiply code in matprod uses an LDOUBLE (long double) variable to accumulate sums, for improved accuracy. On a SPARC system I tested on, operations on long doubles are vastly slower than on doubles, so that the patch produces a large slowdown rather than an improvement. This is also an issue for the "sum" function, which also uses an LDOUBLE to accumulate the sum. Perhaps an ordinary double should be used in these places, or perhaps the configuration script should define LDOUBLE as double on architectures where long doubles are extraordinarily slow. Due to this issue, not defining MATPROD_CAN_BE_DONE_HERE at the start of array.c will disable this patch.

patch-parens: Speeds up parentheses by making "(" a special operator whose argument is not evaluated, thereby bypassing the overhead of evalList. Also slightly speeds up curly brackets by inlining a function that is stylistically better inline anyway. Relevant test script: test-parens.r. In the parens part of test-parens.r, the speedup is about 9%.

patch-protect: Speeds up numerous operations by making PROTECT, UNPROTECT, etc. be mostly macros in the files in src/main. This takes effect only for files that include Defn.h after defining the symbol USE_FAST_PROTECT_MACROS. With these macros, code of the form v = PROTECT(...) must be replaced by PROTECT(v = ...). Relevant timing test scripts: all of them, but will look at test-em.r. On test-em.r, the speedup from this patch is about 9%.

patch-save-alloc: Speeds up some binary and unary arithmetic operations by, when possible, using the space holding one of the operands to hold the result, rather than allocating new space. Though primarily a speed improvement, for very long vectors avoiding this allocation could avoid running out of space. Relevant test script: test-complex-expr.r. On this test, the speedup is about 5% for scalar operands and about 8% for vector operands. Issues: there are some tricky issues with attributes, but I think I got them right. This patch relies on NAMED being set correctly in the rest of the code. In case it isn't, the patch can be disabled by not defining AVOID_ALLOC_IF_POSSIBLE at the top of arithmetic.c.

patch-square: Speeds up a^2 when a is a long vector by not checking for the special case of an exponent of 2 over and over again for every vector element. Relevant test script: test-square.r. The time for squaring a long vector is reduced in this test by a factor of more than five.

patch-sum-prod: Speeds up the "sum" and "prod" functions by not checking for NA when na.rm=FALSE, and other detailed code improvements. Relevant test script: test-sum-prod.r. For sum, the improvement is about a factor of 2.5 when na.rm=FALSE, and about 10% when na.rm=TRUE. Issue: see the discussion of patch-matprod regarding LDOUBLE. There is no change regarding this issue due to this patch, however.

patch-transpose: Speeds up the transpose operation (the "t" function) through detailed code improvements. Relevant test script: test-transpose.r. The improvement for 200x60 matrices is about a factor of two. There is little or no improvement for long row or column vectors.

patch-vec-arith: Speeds up arithmetic on vectors of the same length, or when one vector is of length one. This is done with detailed code improvements. Relevant test script: test-vec-arith.r. On long vectors, the +, -, and * operators are sped up by about 20% when operands are the same length or one operand is of length one. Rather mysteriously, when the operands are not of length one or the same length, there is about a 20% increase in time required, though this may be due to some strange C optimizer peculiarity or some strange cache effect, since the C code for this is the same as before, with negligible additional overhead getting to it. Regardless, this case is much less common than equal lengths or length one. There is little change for the / operator, which is much slower than +, -, or *.

patch-vec-subset: Speeds up extraction of subsets of vectors or matrices (e.g., v[10:20] or M[1:10,101:110]). This is done with detailed code improvements. Relevant test script: test-vec-subset.r. There are lots of tests in this script. The most dramatic improvement is for extracting many rows and columns of a large array, where the improvement is by about a factor of four. Extracting many rows from one column of a matrix is sped up by about 30%. Changes unrelated to speed improvement: fixes two latent bugs where the code incorrectly refers to NA_LOGICAL when NA_INTEGER is appropriate and where LOGICAL and INTEGER types are treated as interchangeable. These cause no problems at the moment, but would if representations were changed.

patch-subscript: (Formerly part of patch-vec-subset.) This patch also speeds up extraction, and also replacement, of subsets of vectors or matrices, but focuses on the creation of the indexes rather than the copy operations. Often avoids a duplication (see below) and eliminates a second scan of the subscript vector for zero subscripts, folding it into a previous scan at no additional cost. Relevant test script: test-vec-subset.r. Speeds up some operations with scalar or short vector indexes by about 10%. Speeds up subscripting with a longer vector of positive indexes by about 20%. Issues: the current code duplicates a vector of indexes when it seems unnecessary. Duplication is done for two reasons: to handle the situation where the index vector is itself being modified in a replace operation, and so that any attributes can be removed, which is helpful only for string subscripts, given how the routine to handle them returns information via an attribute. Duplication for the second reason can easily be avoided, so I avoided it. The first reason for duplication is sometimes valid, but can usually be avoided by first only doing it if the subscript is to be used for replacement rather than extraction, and second only doing it if the NAMED field for the subscript isn't zero. I also removed two layers of procedure call overhead (passing seven arguments, so not trivial) that seemed to be doing nothing. Probably it used to do something, but no longer does; if instead it is preparation for some future use, then removing it might be a mistake.

Software problems, in my opinion, are best solved by writing code or patches rather than by discussing endlessly.

Some other solutions for a BETTERRRR R:
1) Complete Code Design Review
2) Version 3 - Tuneup
3) Better Documentation
4) Suing Revolution Analytics for the code - Hand over da code, pardner