Going Deap : Algols in Python

Logo of PyPy
Image via Wikipedia

Here is an important new step in Python- the established statistical programming language (used to be really pushed by SPSS in pre-IBM days and the rPy package integrates R and Python).

Well the news  ( http://www.kdnuggets.com/2010/10/eap-evolutionary-algorithms-in-python.html ) is the release of Distributed Evolutionary Algorithms in Python. If your understanding of modeling means running regression and iterating it- you may need to read some more.  If you have felt frustrated at lack of parallelization in statistical software as well as your own hardware constraints- well go DEAP (and for corporate types the licensing is

http://www.gnu.org/licenses/lgpl.html ).

http://code.google.com/p/deap/

DEAP

DEAP is intended to be an easy to use distributed evolutionary algorithm library in the Python language. Its two main components are modular and can be used separately. The first module is a Distributed Task Manager (DTM), which is intended to run on cluster of computers. The second part is the Evolutionary Algorithms in Python (EAP) framework.

DTM

DTM is a distributed task manager that is able to spread workload over a buch of computers using a TCP or a MPI connection.

DTM include the following features:

 

EAP

Features

EAP includes the following features:

  • Genetic algorithm using any imaginable representation
    • List, Array, Set, Dictionary, Tree, …
  • Genetic programing using prefix trees
    • Loosely typed, Strongly typed
    • Automatically defined functions (new v0.6)
  • Evolution strategies (including CMA-ES)
  • Multi-objective optimisation (NSGA-II, SPEA-II)
  • Parallelization of the evaluations (and maybe more) (requires python2.6 and preferably python2.7) (new v0.6)
  • Genealogy of an evolution (that is compatible with NetworkX) (new v0.6)
  • Hall of Fame of the best individuals that lived in the population (new v0.5)
  • Milestones that take snapshot of a system regularly (new v0.5)

 

Documentation

See the eap user’s guide for EAP 0.6 documentation.

Requirement

The most basic features of EAP requires Python2.5 (we simply do not offer support for 2.4). In order to use multiprocessing you will need Python2.6 and to be able to combine the toolbox and the multiprocessing module Python2.7 is needed for its support to pickle partial functions.

Projects using EAP

If you want your project listed here, simply send us a link and a brief description and we’ll be glad to add it.

and from the wordpress.com blog (funny how people like code.google.com but not blogger.google.com anymore) at http://deapdev.wordpress.com/

EAP is part of the DEAP project, that also includes some facilities for the automatic distribution and parallelization of tasks over a cluster of computers. The D part of DEAP, called DTM, is under intense development and currently available as an alpha version. DTM currently provides two and a half ways to distribute workload on a cluster or LAN of workstations, based on MPI and TCP communication managers.

This public release (version 0.6) is more complete and simpler than ever. It includes Genetic Algorithms using any imaginable representation, Genetic Programming with strongly and loosely typed trees in addition to automatically defined functions, Evolution Strategies (including Covariance Matrix Adaptation), multiobjective optimization techniques (NSGA-II and SPEA2), easy parallelization of algorithms and much more like milestones, genealogy, etc.

We are impatient to hear your feedback and comments on that system at .

Best,

François-Michel De Rainville
Félix-Antoine Fortin
Marc-André Gardner
Christian Gagné
Marc Parizeau

Laboratoire de vision et systèmes numériques
Département de génie électrique et génie informatique
Université Laval
Quebec City (Quebec), Canada

and if you are new to Python -sigh here are some statistical things (read ad-van-cED analytics using Python) by a slideshare from Visual numerics (pre Rogue Wave acquisition)

Also see,

http://code.google.com/p/deap/wiki/SimpleExample

 

 

 

AsterData gets $30 mill in funding

From the press release, the maker of Map Reduce based BI software gets 30 mill $ as Series C funding. Given the valuation recently by IBM to Netezza, AsterData seems set to cross the Billion Dollar valuation within the next 18-24 months IMO

Aster Data Closes $30 Million Series C Financing

Explosive Growth and Market Leadership Attracts New and Existing Investors

San Carlos, CA – September 22, 2010 – Aster Data, a market leader in big data management and advanced analytics, today announced that it has closed a $30 million Series C round of financing led by both new and existing investors. The company will use the new funding to accelerate growth, scale operations, and expand its global market share in the $20 billion database market – a market that is experiencing rapid growth as a result of both the explosion in data volumes across organizations and the urgent need to deliver a new class of analytics and data-driven applications. The Series C round of funding includes previous investors Sequoia Capital, JAFCO Ventures, Institutional Venture Partners, Cambrian Ventures, as well as an additional new strategic investor.  Also investing in this round is early investor David Cheriton, who previously backed high-growth companies including Google and VMware, and co-founded several successful technology companies.

Today’s Series C funding announcement underscores a year of strong innovation, execution, and overall momentum for the analytic database company. Key milestones include:

Strong sales growth: Since 2008, Aster Data has doubled revenue year-over-year and secured key customers that leverage Aster Data’s platform to address the big data management problem including MySpace, comScore, Barnes & Noble, and Akamai. Like so many organizations today,
Aster Data’s customers are experiencing explosive data growth across their organizations and recognize the need for rich, advanced analytics that give them deeper insights from their data.

Key executive hires: Quentin Gallivan, former CEO of both PivotLink and Postini and EVP of worldwide sales at Verisign, recently joined the company as Chief Executive Officer. In addition, earlier this year, John Calonico, previously at Interwoven, BEA, and Autodesk, joined as Chief Financial Officer; and Nitin Donde, formerly an executive at EMC and 3PAR, joined as Executive Vice President Engineering.  The strength and experience of Aster Data’s management team helps further establish a strong operational foundation for growth in 2010 and beyond.

Industry recognition: Aster Data was positioned in the “Visionaries” Quadrant of Gartner, Inc.’s

Data Warehouse Database Management Systems Magic Quadrant, published 2010 *; was recently named 2011 Tech Pioneer by the World Economic Forum; was named “Company to Watch” in the Information Management category of TechWeb’s Intelligent Enterprise 2010 Editors’ Choice Awards; and was awarded the 2010 San Francisco Business Times Technology and Innovation Award in the Best Product and Services Category.

Product Innovation: Aster Data continues to deliver ground-breaking capabilities to address the big data management and advanced analytics market need. Its recent announcement of
Aster Data nCluster 4.6 includes a column data store, making it the first hybrid row and column MPP DBMS with a unified SQL and MapReduce analytic framework for advanced analytics on large data sets. This year, Aster Data also delivered the most extensive library of pre-packaged MapReduce analytics totaling over 1000 functions, to ease and accelerate delivery of highly advanced analytic applications.

Aster Data’s analytic database, also called a ‘Data-Analytics Server’ is specifically designed to enable organizations to cost effectively store and analyze massive volumes of data. Aster Data leverages the power of commodity, general-purpose hardware, to reduce the cost to scale to support large data volumes and uniquely allows analysis of all data ‘in-database’ enabling richer and faster processing of large data sets. Aster Data’s in-database analytics engine uses the power of MapReduce, a parallel processing framework created by Google.

”The funding we received in our Series C round is a strong endorsement of Aster Data’s market leadership position and the high growth potential of the big data market,” said Quentin Gallivan, Chief Executive Officer, Aster Data. “The Aster Data team has executed exceptionally well to-date and I am excited to have the resources to accelerate the growth of the company as we expand our operations and execute aggressively across all fronts.”

Professors and Patches: For a Betterrrr R

Professors sometime throw out provocative statements to ensure intellectual debate. I have had almost 1500+ hits in less than 2 days ( and I am glad I am on wordpress.com , my old beloved server would have crashed))

The remarks from Ross Ihaka, covered before and also at Xian’s blog at

Note most of his remarks are techie- and only a single line refers to Revlution Analytics.

Other senior members of community (read- professors are silent, though brobably some thought may have been ignited behind scenes)

http://xianblog.wordpress.com/2010/09/06/insane/comment-page-4/#comments

Ross Ihaka Says:
September 12, 2010 at 1:23 pm

Since (something like) my name has been taken in vain here, let me
chip in.

I’ve been worried for some time that R isn’t going to provide the base
that we’re going to need for statistical computation in the
future. (It may well be that the future is already upon us.) There
are certainly efficiency problems (speed and memory use), but there
are more fundamental issues too. Some of these were inherited from S
and some are peculiar to R.

One of the worst problems is scoping. Consider the following little
gem.

f =
function() {
if (runif(1) > .5)
x = 10
x
}

The x being returned by this function is randomly local or global.
There are other examples where variables alternate between local and
non-local throughout the body of a function. No sensible language
would allow this. It’s ugly and it makes optimisation really
difficult. This isn’t the only problem, even weirder things happen
because of interactions between scoping and lazy evaluation.

In light of this, I’ve come to the conclusion that rather than
“fixing” R, it would be much more productive to simply start over and
build something better. I think the best you could hope for by fixing
the efficiency problems in R would be to boost performance by a small
multiple, or perhaps as much as an order of magnitude. This probably
isn’t enough to justify the effort (Luke Tierney has been working on R
compilation for over a decade now).

To try to get an idea of how much speedup is possible, a number of us
have been carrying out some experiments to see how much better we
could do with something new. Based on prototyping we’ve been doing at
Auckland, it looks like it should be straightforward to get two orders
of magnitude speedup over R, at least for those computations which are
currently bottle-necked. There are a couple of ways to make this
happen.

First, scalar computations in R are very slow. This in part because
the R interpreter is very slow, but also because there are a no scalar
types. By introducing scalars and using compilation it looks like its
possible to get a speedup by a factor of several hundred for scalar
computations. This is important because it means that many ghastly
uses of array operations and the apply functions could be replaced by
simple loops. The cost of these improvements is that scope
declarations become mandatory and (optional) type declarations are
necessary to help the compiler.

As a side-effect of compilation and the use of type-hinting it should
be possible to eliminate dispatch overhead for certain (sealed)
classes (scalars and arrays in particular). This won’t bring huge
benefits across the board, but it will mean that you won’t have to do
foreign language calls to get efficiency.

A second big problem is that computations on aggregates (data frames
in particular) run at glacial rates. This is entirely down to
unnecessary copying because of the call-by-value semantics.
Preserving call-by-value semantics while eliminating the extra copying
is hard. The best we can probably do is to take a conservative
approach. R already tries to avoid copying where it can, but fails in
an epic fashion. The alternative is to abandon call-by-value and move
to reference semantics. Again, prototyping indicates that several
hundredfold speedup is possible (for data frames in particular).

The changes in semantics mentioned above mean that the new language
will not be R. However, it won’t be all that far from R and it should
be easy to port R code to the new system, perhaps using some form of
automatic translation.

If we’re smart about building the new system, it should be possible to
make use of multi-cores and parallelism. Adding this to the mix might just
make it possible to get a three order-of-magnitude performance boost
with just a fraction of the memory that R uses. I think it’s something
really worth putting some effort into.

I also think one other change is necessary. The license will need to a
better job of protecting work donated to the commons than GPL2 seems
to have done. I’m not willing to have any more of my work purloined by
the likes of Revolution Analytics, so I’ll be looking for better
protection from the license (and being a lot more careful about who I
work with).

The discussion spilled over to Stack Overflow as well

http://stackoverflow.com/questions/3706990/is-r-that-bad-that-it-should-be-rewritten-from-scratch/3710667#3710667

n the past week I’ve been following a discussion where Ross Ihaka wrote (here ):

I’ve been worried for some time that R isn’t going to provide the base that we’re going to need for statistical computation in the future. (It may well be that the future is already upon us.) There are certainly efficiency problems (speed and memory use), but there are more fundamental issues too. Some of these were inherited from S and some are peculiar to R.

He then continued explaining. This discussion started from this post, and was then followed by commentsherehereherehereherehere and maybe some more places I don’t know of.

We all know the problem now.

R can be improved substantially in terms of speed.

For some solutions, here are the patches by Radford-

http://www.cs.toronto.edu/~radford/speed-patches-doc

patch-dollar

    Speeds up access to lists, pairlists, and environments using the
    $ operator.  The speedup comes mainly from avoiding the overhead of 
    calling DispatchOrEval if there are no complexities, from passing
    on the field to extract as a symbol, or a name, or both, as available,
    and then converting only as necessary, from simplifying and inlining
    the pstrmatch procedure, and from not translating string multiple
    times.  

    Relevant timing test script:  test-dollar.r 

    This test shows about a 40% decrease in the time needed to extract
    elements of lists and environments.

    Changes unrelated to speed improvement:

    A small error-reporting bug is fixed, illustrated by the following
    output with r52822:

    > options(warnPartialMatchDollar=TRUE)
    > pl <- pairlist(abc=1,def=2)
    > pl$ab
    [1] 1
    Warning message:
    In pl$ab : partial match of 'ab' to ''

    Some code is changed at the end of R_subset3_dflt because it seems 
    to be more correct, as discussed in code comments. 

patch-evalList

    Speeds up a large number of operations by avoiding allocation of
    an extra CONS cell in the procedures for evaluating argument lists.

    Relevant timing test scripts:  all of them, but will look at test-em.r 

    On test-em.r, the speedup from this patch is about 5%.

patch-fast-base

    Speeds up lookup of symbols defined in the base environment, by
    flagging symbols that have a base environment definition recorded
    in the global cache.  This allows the definition to be retrieved
    quickly without looking in the hash table.  

    Relevant timing test scripts:  all of them, but will look at test-em.r 

    On test-em.r, the speedup from this patch is about 3%.

    Issue:  This patch uses the "spare" bit for the flag.  This bit is
    misnamed, since it is already used elsewhere (for closures).  It is
    possible that one of the "gp" bits should be used instead.  The
    "gp" bits should really be divided up for faster access, and so that
    their present use is apparent in the code.

    In case this use of the "spare" bit proves unwise, the patch code is 
    conditional on FAST_BASE_CACHE_LOOKUP being defined at the start of
    envir.r.

patch-fast-spec

    Speeds up lookup of function symbols that begin with a character
    other than a letter or ".", by allowing fast bypass of non-global
    environments that do not contain (and have never contained) symbols 
    of this sort.  Since it is expected that only functions will be
    given names of this sort, the check is done only in findFun, though
    it could also be done in findVar.

    Relevant timing test scripts:  all of them, but will look at test-em.r 

    On test-em.r, the speedup from this patch is about 8%.    

    Issue:  This patch uses the "spare" bit to flag environments known
    to not have symbols starting with a special character.  See remarks
    on patch-fast-base.

    In case this use of the "spare" bit proves unwise, the patch code is 
    conditional on FAST_SPEC_BYPASS being defined at the start of envir.r.

patch-for

    Speeds up for loops by not allocating new space for the loop
    variable every iteration, unless necessary.  

    Relevant timing test script:  test-for.r

    This test shows a speedup of about 5%.  

    Change unrelated to speed improvement:

    Fixes what I consider to be a bug, in which the loop clobbers a
    global variable, as demonstrated by the following output with r52822:

    > i <- 99
    > f <- function () for (i in 1:3) { print(i); if (i==2) rm(i); }
    > f()
    [1] 1
    [1] 2
    [1] 3
    > print(i)
    [1] 3

patch-matprod

    Speeds up matrix products, including vector dot products.  The
    speed issue here is that the R code checks for any NAs, and 
    does the multiply in the matprod procedure (in array.c) if so,
    since BLAS isn't trusted with NAs.  If this check takes about
    as long as just doing the multiply in matprod, calling a BLAS
    routine makes no sense.  

    Relevant time test script:  test-matprod.r

    With no external BLAS, this patch speeds up long vector-vector 
    products by a factor of about six, matrix-vector products by a
    factor of about three, and some matrix-matrix products by a 
    factor of about two.

    Issue:  The matrix multiply code in matprod using an LDOUBLE
    (long double) variable to accumulate sums, for improved accuracy.  
    On a SPARC system I tested on, operations on long doubles are 
    vastly slower than on doubles, so that the patch produces a 
    large slowdown rather than an improvement.  This is also an issue 
    for the "sum" function, which also uses an LDOUBLE to accumulate
    the sum.  Perhaps an ordinarly double should be used in these
    places, or perhaps the configuration script should define LDOUBLE 
    as double on architectures where long doubles are extraordinarily 
    slow.

    Due to this issue, not defining MATPROD_CAN_BE_DONE_HERE at the
    start of array.c will disable this patch.

patch-parens

    Speeds up parentheses by making "(" a special operator whose
    argument is not evaluated, thereby bypassing the overhead of
    evalList.  Also slightly speeds up curly brackets by inlining
    a function that is stylistically better inline anyway.

    Relevant test script:  test-parens.r

    In the parens part of test-parens.r, the speedup is about 9%.

patch-protect

    Speeds up numerous operations by making PROTECT, UNPROTECT, etc.
    be mostly macros in the files in src/main.  This takes effect
    only for files that include Defn.h after defining the symbol
    USE_FAST_PROTECT_MACROS.  With these macros, code of the form
    v = PROTECT(...) must be replaced by PROTECT(v = ...).  

    Relevant timing test scripts:  all of them, but will look at test-em.r 

    On test-em.r, the speedup from this patch is about 9%.

patch-save-alloc

    Speeds up some binary and unary arithmetic operations by, when
    possible, using the space holding one of the operands to hold
    the result, rather than allocating new space.  Though primarily
    a speed improvement, for very long vectors avoiding this allocation 
    could avoid running out of space.

    Relevant test script:  test-complex-expr.r

    On this test, the speedup is about 5% for scalar operands and about
    8% for vector operands.

    Issues:  There are some tricky issues with attributes, but I think
    I got them right.  This patch relies on NAMED being set correctly 
    in the rest of the code.  In case it isn't, the patch can be disabled 
    by not defining AVOID_ALLOC_IF_POSSIBLE at the top of arithmetic.c.

patch-square

    Speeds up a^2 when a is a long vector by not checking for the
    special case of an exponent of 2 over and over again for every 
    vector element.

    Relevant test script:  test-square.r

    The time for squaring a long vector is reduced in this test by a
    factor of more than five.

patch-sum-prod

    Speeds up the "sum" and "prod" functions by not checking for NA
    when na.rm=FALSE, and other detailed code improvements.

    Relevant test script:  test-sum-prod.r

    For sum, the improvement is about a factor of 2.5 when na.rm=FALSE,
    and about 10% when na.rm=TRUE.

    Issue:  See the discussion of patch-matprod regarding LDOUBLE.
    There is no change regarding this issue due to this patch, however.

patch-transpose

    Speeds up the transpose operation (the "t" function) from detailed
    code improvements.

    Relevant test script:  test-transpose.r

    The improvement for 200x60 matrices is about a factor of two.
    There is little or no improvement for long row or column vectors.

patch-vec-arith

    Speeds up arithmetic on vectors of the same length, or when on
    vector is of length one.  This is done with detailed code improvements.

    Relevant test script:  test-vec-arith.r

    On long vectors, the +, -, and * operators are sped up by about     
    20% when operands are the same length or one operand is of length one.

    Rather mysteriously, when the operands are not length one or the
    same length, there is about a 20% increase in time required, though
    this may be due to some strange C optimizer peculiarity or some 
    strange cache effect, since the C code for this is the same as before,
    with negligible additional overhead getting to it.  Regardless, this 
    case is much less common than equal lengths or length one.

    There is little change for the / operator, which is much slower than
    +, -, or *.

patch-vec-subset

    Speeds up extraction of subsets of vectors or matrices (eg, v[10:20]
    or M[1:10,101:110]).  This is done with detailed code improvements.

    Relevant test script:  test-vec-subset.r

    There are lots of tests in this script.  The most dramatic improvement
    is for extracting many rows and columns of a large array, where the 
    improvement is by about a factor of four.  Extracting many rows from
    one column of a matrix is sped up by about 30%. 

    Changes unrelated to speed improvement:

    Fixes two latent bugs where the code incorrectly refers to NA_LOGICAL
    when NA_INTEGER is appropriate and where LOGICAL and INTEGER types
    are treated as interchangeable.  These cause no problems at the moment,
    but would if representations were changed.

patch-subscript

    (Formerly part of patch-vec-subset)  This patch also speeds up
    extraction, and also replacement, of subsets of vectors or
    matrices, but focuses on the creation of the indexes rather than
    the copy operations.  Often avoids a duplication (see below) and
    eliminates a second scan of the subscript vector for zero
    subscripts, folding it into a previous scan at no additional cost.

    Relevant test script:  test-vec-subset.r

    Speeds up some operations with scalar or short vector indexes by
    about 10%.  Speeds up subscripting with a longer vector of
    positive indexes by about 20%.

    Issues:  The current code duplicates a vector of indexes when it
    seems unnecessary.  Duplication is for two reasons:  to handle
    the situation where the index vector is itself being modified in
    a replace operation, and so that any attributes can be removed, which 
    is helpful only for string subscripts, given how the routine to handle 
    them returns information via an attribute.  Duplication for the
    second reasons can easily be avoided, so I avoided it.  The first
    reason for duplication is sometimes valid, but can usually be avoided
    by first only doing it if the subscript is to be used for replacement
    rather than extraction, and second only doing it if the NAMED field
    for the subscript isn't zero.

    I also removed two layers of procedure call overhead (passing seven
    arguments, so not trivial) that seemed to be doing nothing.  Probably 
    it used to do something, but no longer does, but if instead it is 
    preparation for some future use, then removing it might be a mistake.

Software problems are best solved by writing code or patches in my opinion rather than discussing endlessly
Some other solutions to a BETTERRRR R
1) Complete Code Design Review
2) Version 3 - Tuneup
3) Better Documentation
4) Suing Revolution Analytics for the code - Hand over da code pardner

The Old Man and the C

At 32, I am probably too old to do this- but I am finally learning C
from here-
http://www.cprogramming.com/

Another tutorial from CS at Stanford is also quite good

This is document #101, Essential C, in the Stanford CS Education Library. This and other educational materials are available for free at http://cslibrary.stanford.edu/. This article is free to be used, reproduced, excerpted, retransmitted, or sold so long as this notice is clearly reproduced at its beginning.

Top ten RRReasons R is bad for you ?

 

This is the original symbol of the Perl progra...
Image via Wikipedia

 

R stands for programming language based out of www.r-project.org

R is bad for you because –

1) It is slower with bigger datasets than SPSS language and SAS language .If you use bigger datasets, then you should either consider more hardware , or try and wait for some of the ODBC connect packages.

2) It needs more time to learn than SAS language .Much more time to learn how to do much more.

3) R programmers are lesser paid than SAS programmers.They prefer it that way.It equates the satisfaction of creating a package in development with a world wide community with the satisfaction of using a package and earning much more money per hour.

4) It forces you to learn the exact details of what you are doing due to its object oriented structure. Thus you either get no answer or get an exact answer. Your customer pays you by the hour not by the correct answers.

5) You can not push a couple of buttons or refer to a list of top ten most commonly used commands to finish the project.

6) It is free. And open for all. It is socialism expressed in code. Some of the packages are built by university professors. It is free.Free is bad. Who pays for the mortgage of the software programmers if all softwares were free ? Who pays for the Friday picnics. Who pays for the Good Night cruises?

7) It is free. Your organization will not commend you for saving them money- they will question why you did not recommend this before. And why did you approve all those packages that expire in 2011.R is fReeeeee. Customers feel good while spending money.The more software budgets you approve the more your salary is. R thReatens all that.

8) It is impossible to install a package you do not need or want. There is no one calling you on the phone to consider one more package or solution. R can make you lonely.

9) R uses mostly Command line. Command line is from the Seventies. Or the Eighties. The GUI’s RCmdr and Rattle are there but still…..

10) R forces you to learn new stuff by the month. You prefer to only earn by the month. Till the day your job got offshored…

Written by a R user in English language

( which fortunately was not copyrighted otherwise we would be paying Britain for each word)

the above post was reprinted by request.