Going Deap : Algols in Python


Here is an important new step in Python, the established statistical programming language (it used to be pushed hard by SPSS in its pre-IBM days, and the rPy package integrates R and Python).

The news ( http://www.kdnuggets.com/2010/10/eap-evolutionary-algorithms-in-python.html ) is the release of Distributed Evolutionary Algorithms in Python (DEAP). If your understanding of modeling means running a regression and iterating on it, you may need to read some more. If you have felt frustrated by the lack of parallelization in statistical software, or by your own hardware constraints, go DEAP (and for corporate types, the licensing is LGPL: http://www.gnu.org/licenses/lgpl.html ).

http://code.google.com/p/deap/

DEAP

DEAP is intended to be an easy-to-use distributed evolutionary algorithm library in the Python language. Its two main components are modular and can be used separately. The first module is a Distributed Task Manager (DTM), which is intended to run on a cluster of computers. The second part is the Evolutionary Algorithms in Python (EAP) framework.

DTM

DTM is a distributed task manager that is able to spread workload over a bunch of computers using a TCP or an MPI connection.

DTM includes the following features:

 

EAP

Features

EAP includes the following features:

  • Genetic algorithm using any imaginable representation
    • List, Array, Set, Dictionary, Tree, …
  • Genetic programming using prefix trees
    • Loosely typed, Strongly typed
    • Automatically defined functions (new v0.6)
  • Evolution strategies (including CMA-ES)
  • Multi-objective optimisation (NSGA-II, SPEA-II)
  • Parallelization of the evaluations (and maybe more) (requires python2.6 and preferably python2.7) (new v0.6)
  • Genealogy of an evolution (that is compatible with NetworkX) (new v0.6)
  • Hall of Fame of the best individuals that lived in the population (new v0.5)
  • Milestones that take snapshots of a system regularly (new v0.5)
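To make the feature list a little more concrete, here is a minimal genetic algorithm sketch in plain Python. It does not use the EAP API at all; every function name here (one_max, tournament, crossover, mutate, evolve) is made up for illustration, and the parameters are arbitrary. It evolves a list of bits toward all ones and keeps a one-slot "hall of fame" holding the best individual seen so far.

```python
import random

def one_max(individual):
    """Fitness: number of 1 bits (the classic OneMax toy problem)."""
    return sum(individual)

def tournament(population, k, rng):
    """Return the fittest of k randomly chosen individuals."""
    return max(rng.sample(population, k), key=one_max)

def crossover(a, b, rng):
    """One-point crossover producing a single child."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(individual, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in individual]

def evolve(n_bits=30, pop_size=40, generations=50, seed=0):
    """Run the GA and return the best individual ever seen."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    hall_of_fame = max(population, key=one_max)
    for _ in range(generations):
        # Build the next generation: select two parents, cross, mutate.
        population = [
            mutate(crossover(tournament(population, 3, rng),
                             tournament(population, 3, rng), rng),
                   0.02, rng)
            for _ in range(pop_size)
        ]
        best = max(population, key=one_max)
        if one_max(best) > one_max(hall_of_fame):
            hall_of_fame = best
    return hall_of_fame

best = evolve()
print(one_max(best), best)
```

The list-of-bits representation is only one choice; swapping in another representation mostly means replacing crossover and mutate.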

 

Documentation

See the eap user’s guide for EAP 0.6 documentation.

Requirement

The most basic features of EAP require Python 2.5 (we simply do not offer support for 2.4). To use multiprocessing you will need Python 2.6, and to combine the toolbox with the multiprocessing module you will need Python 2.7, which supports pickling partial functions.
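Why does pickling partial functions matter? A process pool hands callables to its workers by pickling them, so a function with some arguments pre-bound has to survive a pickle round trip. The sketch below shows just that round trip in modern Python; the bit-string decoder is an illustrative stand-in chosen here, not anything taken from EAP.

```python
import pickle
from functools import partial

# Pre-bind an argument: decode a bit-string genome as a base-2 integer.
decode_bits = partial(int, base=2)

# Serialize and restore the partial, as a process pool would have to do
# before running it in a worker process.
restored = pickle.loads(pickle.dumps(decode_bits))

value = restored("101")
print(value)
```

If the round trip failed, the same partial could not be passed to a pool's map call either.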

Projects using EAP

If you want your project listed here, simply send us a link and a brief description and we’ll be glad to add it.

and from the wordpress.com blog (funny how people like code.google.com but not blogger.google.com anymore) at http://deapdev.wordpress.com/

EAP is part of the DEAP project, that also includes some facilities for the automatic distribution and parallelization of tasks over a cluster of computers. The D part of DEAP, called DTM, is under intense development and currently available as an alpha version. DTM currently provides two and a half ways to distribute workload on a cluster or LAN of workstations, based on MPI and TCP communication managers.

This public release (version 0.6) is more complete and simpler than ever. It includes Genetic Algorithms using any imaginable representation, Genetic Programming with strongly and loosely typed trees in addition to automatically defined functions, Evolution Strategies (including Covariance Matrix Adaptation), multiobjective optimization techniques (NSGA-II and SPEA2), easy parallelization of algorithms and much more like milestones, genealogy, etc.

We are impatient to hear your feedback and comments on that system at .

Best,

François-Michel De Rainville
Félix-Antoine Fortin
Marc-André Gardner
Christian Gagné
Marc Parizeau

Laboratoire de vision et systèmes numériques
Département de génie électrique et génie informatique
Université Laval
Quebec City (Quebec), Canada

And if you are new to Python (sigh), here are some statistical things (read: advanced analytics using Python) in a SlideShare presentation from Visual Numerics (before the Rogue Wave acquisition).

Also see,

http://code.google.com/p/deap/wiki/SimpleExample

 

 

 

Analytics and Journals

Some good journals for reading on analytics-

1) JSS

http://www.jstatsoft.org/

JSS articles present research that demonstrates the joint evolution of computational and statistical methods and techniques. Implementations can use languages such as C, C++, S, Fortran, Java, PHP, Python and Ruby or environments such as Mathematica, MATLAB, R, S-PLUS, SAS, Stata, and XLISP-STAT.

There are currently 370 articles, 23 code snippets, 86 book reviews, 4 software reviews, and 7 special volumes in the archives.

2) R Journal

http://journal.r-project.org/

The R Journal

3) Pharma Programming

http://maney.co.uk/index.php/journals/pha/

Pharmaceutical Programming is the official journal of the Pharmaceutical Users Software Exchange (PhUSE), a non-profit membership society with the objective of educating programmers and their managers working in the pharmaceutical industry. Available both in print and online, Pharmaceutical Programming is an international journal with focus on programming in the regulated environment of the pharmaceutical and life sciences industry.

4) SAS Papers – User Groups

http://www.lexjansen.com/

4569 SAS papers presented at SGF/SUGI 1996-2010.
1343 SAS papers presented at PharmaSUG 2000-2010.
1810 SAS papers presented at NESUG 1997-2009.
1191 SAS papers presented at SESUG 1999-2009.
463 SAS papers presented at PhUSE 2005-2009.
787 SAS papers presented at WUSS 2003-2009.
337 SAS papers presented at MWSUG 2001, 2004-2009.
188 SAS papers presented at PNWSUG 2004-2009.
246 SAS papers presented at SCSUG 2003-2007, 2009.
221 SAS papers related to CDISC.
Easy access to the CDISC Forum.

5) http://analyticsmagazine.com/

Magazine by http://www.informs.org/

6) Data Mining Journals

Academic Journals

Journals relevant to Data Mining

Interview –Jon Peck SPSS


 

I was in the middle of interviewing people, as well as helping the good people in my new role as a community evangelist at Smart Data Collective, when I got a LinkedIn request to join the SDC group from Jon Peck.

SPSS Inc. is a leading worldwide provider of predictive analytics software and solutions. Founded in 1968, SPSS today has more than 250,000 customers worldwide, served by more than 1,200 employees in 60 countries. Jon is a legendary SPSS figure and a great teacher in this field. I asked him for an interview and he readily agreed.

Jon Peck is a Principal Software Engineer and Technical Advisor at SPSS. He has been working with SPSS since 1983, and in the interview he draws on the breadth of his perspective and experience to talk about analytics and SPSS.

Ajay – Describe your career journey from college to today. What advice would you give to young students seeking to be hedge fund managers rather than scientists? What are the basic things that a science education can help students with, in your opinion?

Jon– After graduating from college with a B.A. in math, I earned a Ph. D in Economics, specializing in econometrics, and taught at a top American university for 13 years in the Economics and Statistics Departments and the School of Organization and Management.  Working in an academic environment all that time was a great opportunity to grow intellectually.  I was increasingly drawn to computing and eventually decided to join a statistical software company.  There were only two substantial ones at the time.  After a lot of thought, I joined SPSS as it seemed to be the more interesting place and one where I would be able to work in a wider variety of areas.  That was over 25 years ago!  Now I have some opportunities to teach and speak again as well as working in development, which I enjoy a lot.

I still believe in getting a broad liberal arts education along with as much quantitative training as possible.  Being able to work in very different areas has been a big asset for me.  Most people will have multiple careers, so preparing broadly is the most important career thing you can do.  As for hedge fund jobs if there are any left, I’d say not to be starry-eyed about the money.  If you don’t choose a career that really interests you, you won’t be very successful anyway. Do what you love subject to earning a living.

Math and scientific reasoning skills are preparation for working in many areas as well as being helpful in making the many decisions with quantitative aspects in life.  Math, especially, provides a foundation useful in many areas.  The recently announced program in the UK to improve general understanding of probability illustrates some practical value.

Ajay- What are SPSS’s contributions to open source software? What, if you can disclose them, are the plans for further increasing that involvement?

Jon-  I wish I could talk about SPSS future plans, but I can’t.  However, the company is committed to continuing its efforts in Python and R.  By opening up the SPSS technology with these open source technologies, we are able to expand what we and our users can do.  At the same time, we can make R more attractive through nicer output and simpler syntax and taking away much of the pain.  One of the things I love about this approach is how quickly and easily new things can be produced and distributed this way compared to the traditional development cycle.  I wrote about productivity and Python recently on my blog at insideout.spss.com.

Ajay – How happy is the SPSS developer community with Python? Are there any other languages that you are considering for the future?

Jon- Many in the SPSS user community were more used to packaged procedures than to programming (except in the area of data transformations).  So Python, first, and then R were a shock.  But the benefits are so large that we have had an excellent response to both the Python and R technologies.  Some have mastered the technology and have been very successful and have made contributions back to the SPSS community.  Others are consumers of this technology, especially through our custom dialogs and extension commands that eliminate the need to learn Python or R in order to use programs in these languages.  Python is an outstanding language.  It is easy to get started with it, but it has very sophisticated features.  It has fewer dark corners than any other language I know.  While there are a few other more popular languages, Python popularity has been steadily growing, especially in the scientific and statistical communities.  But we already have support for three high-level languages, and if there is enough demand, we’ll do more.

Some of our partners prefer to use the lower-level C language interfaces we offer.  That’s fine, too.  We’re not Python zealots (well, maybe, I am).  Python, as a scripting language, isn’t as fast as a compiled language.  For many purposes this does not matter, and Python itself is written in C.  I recently wrote a Python module for TURF analysis.  The computations are simple but computationally explosive, so I was worried that it would be too slow to be useful.  It turned out to be pretty fast because of the way I could use some of Python’s built-in data structures and algorithms.  And the popular NumPy and SciPy scientific and numerical libraries are written in C.

Users who would not think of themselves as developers sometimes find that a small Python effort can automate manual work with big time and accuracy improvements.  I got a note recently from a user who said, "I got it to work, and this is FANTASTIC! It will save me a lot of time in my survey analysis work."

Ajay- In what areas is SPSS not a good fit? What areas suit SPSS software the most compared to other solutions?

Jon- SPSS Statistics, the product,  is not a database.  Our strength is in applying analytical methods to data for model building, prediction, and insight.  Although SPSS Statistics is used in a wide variety of areas, we focus first on people data and think of that first when planning and designing new features.  SPSS Statistics and other SPSS products all work well with databases, and we have solutions for deploying analytics into production systems, but we’re not going to do your payroll.  One thing that was a surprise to me a few years ago is that we have a significant number of users who use SPSS Statistics as a basic reporting product but don’t do any inferential statistics.  They find that they can do customized reporting often using the Custom Tables module very quickly.  With Version 17, they can also do fancier and dynamic output formatting without resorting to script writing or manual editing, which is proving very attractive.

Ajay- Are there any plans for SPSS to use a Software as a Service model? Any plans to use advances in remote and cloud computing for SPSS?

Jon- We are certainly looking at cloud computing.  The biggest challenge is being able to put things in the cloud that will be robust and reliable.

Ajay- What are SPSS’s Asia plans? Which country has the maximum penetration of SPSS in terms of usage?

Jon- SPSS, the company, has long been strong in Japan and Taiwan, and Korea is also strong.  China is increasingly important, of course.  We have a large data center in Singapore.  Although India has a strong, long history in statistical methodology, it is a much less well-developed market for us.  We have a presence there, but I don’t know the numbers. (Ajay – SPSS was one of my first experiences with statistical software when I came across it at my business school in 2001. In India SPSS has been very active with academic licensing, and it introduced us to its nice and easy menu-driven features.)

Biography – Jon earned his Ph. D. from Yale University and taught econometrics and statistics there for 13 years before joining SPSS.

Jon joined the SPSS company in 1983 and worked on many aspects of the very first SPSS DOS product, including writing the first C code that SPSS ever shipped. Among the features he has designed are OMS (the Output Management System), the Visual Bander, Define Variable Properties, ALTER TYPE, Unicode support, and the Date and Time Wizard. Jon is the author of many of the modules on Developer Central. He is an active cyclist and hiker.

Jon Peck blogs on  SPSS Inside-Out.

Basic Text Mining: 3 Simple Paths


Text mining, in which you search alphanumeric data for meaningful patterns, is relatively more complex than plain numeric data crunching. The reason is that the human eye can scan only a few hundred rows of data before getting tired, and analytics software algorithms need to be properly programmed or else they miss the relevant text. An example: how many Punjabis live in Delhi (stats needed)? Suppose you have a dataset that has all the names in Delhi, and you want to send an SMS contest (marketing decision) on Lohri (a Punjabi-specific festival).

Text manipulation can be done with the TRIM and LOWER functions in Excel and the corresponding functions in SAS. For mining, use the following options:

1) SAS Basic Text Mining - Using Only Base SAS

In SAS you can use the INDEXW function for text mining.

As per the SAS OnlineDoc:

INDEXW(source, excerpt)

Arguments

source
specifies the character expression to search.
excerpt
specifies the string of characters to search for in the character expression. SAS removes the leading and trailing blanks from excerpt.

The INDEXW function searches source, from left to right, for the first occurrence of excerpt and returns the position in source of the substring’s first character. If the substring is not found in source, INDEXW returns a value of 0. If there are multiple occurrences of the string, INDEXW returns only the position of the first occurrence.
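If you want to try the same idea outside SAS, here is a rough Python sketch that approximates the behavior quoted above. It is not the SAS implementation: the function name indexw is only borrowed, spaces are assumed to be the only word delimiter, and the names in the list are made up for the Punjabi surname example.

```python
def indexw(source, excerpt):
    """Approximate INDEXW: 1-based position of the first whole-word
    match of excerpt in source, or 0 if there is none."""
    word = excerpt.strip()  # drop leading and trailing blanks
    if not word:
        return 0
    start = 0
    while True:
        pos = source.find(word, start)
        if pos == -1:
            return 0
        end = pos + len(word)
        # A whole-word match must be bounded by a space or a string edge.
        before_ok = pos == 0 or source[pos - 1] == " "
        after_ok = end == len(source) or source[end] == " "
        if before_ok and after_ok:
            return pos + 1
        start = pos + 1

names = ["Harpreet Singh Gill", "Anil Kumar", "Singhania Traders"]
hits = [name for name in names if indexw(name, "Singh") > 0]
print(hits)
```

The whole-word check is what keeps "Singhania" from being counted as a match for "Singh".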

2) MS EXCEL

You can use MS Excel for text mining too. I recommend Office 2007 simply because it can handle more rows.

The function in Excel is SEARCH


3) MS ACCESS

In MS Access you can use LIKE queries to create a different table or append a value to certain columns.

Example

Some problems can't be solved with comparisons, e.g. "His name begins with Mc or Mac". In a case like this, wildcards are required, and are represented in SQL with the % sign and the LIKE keyword.

e.g.

SELECT au_lname, city
FROM authors
WHERE au_lname LIKE 'Mc%' OR au_lname LIKE 'Mac%'
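To see the LIKE query run end to end without Access, here is a small sketch using Python's built-in sqlite3 module and an in-memory database. The rows are made-up sample data. Note that this uses SQLite's dialect, where % is the wildcard; as far as I know, Access in its default query mode expects * instead, so treat the pattern syntax as an assumption to check against your own setup.

```python
import sqlite3

# In-memory database with a tiny, made-up authors table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authors (au_lname TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO authors VALUES (?, ?)",
    [
        ("McBadden", "Vacaville"),
        ("MacFeather", "Oakland"),
        ("Ringer", "Salt Lake City"),
        ("Smith", "Lawrence"),
    ],
)

# % matches any run of characters, so this finds names starting with Mc or Mac.
rows = conn.execute(
    "SELECT au_lname, city FROM authors "
    "WHERE au_lname LIKE 'Mc%' OR au_lname LIKE 'Mac%' "
    "ORDER BY au_lname"
).fetchall()
conn.close()

for au_lname, city in rows:
    print(au_lname, city)
```

SQLite's LIKE is case-insensitive for ASCII letters by default, so 'mc%' would match the same rows here.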

UPDATED- The above post is now obsolete. There are easier and better ways to do text mining, including Weka and R.
