SPSS bought by Big Blue

SPSS Inc., maker of the PASW series of analytics software, is being bought by IBM (unless Oracle spikes this deal too). IBM is seeking a play in the rapidly growing analytics market and is also a strategic partner of World Programming (makers of WPS, the SAS language software that is an alternative to Base SAS).

On a personal note- I just entered the University of Tennessee as a statistics student.

Interesting community event by R/Statistical community

Citation-
http://en.oreilly.com/oscon2009/public/schedule/detail/10432

StackOverflow Flash Mob for the R User Community
Moderated by: Michael E. Driscoll
7:00pm Wednesday, 07/22/2009
Location: Ballroom A2

In concert with users online across the country, this session will lead a flashmob to populate StackOverflow with R language content.

R, the open source statistical language, has a notoriously steep learning curve. The same technical questions tend to be asked repeatedly on the R-help mailing lists, to the detriment of both R experts (who tire of repeating themselves) and learners (who often receive a technically correct, but terse, response).

We have developed a list of the 100 most common technical R questions, based on (i) queries sent to the RSeek.org web portal, (ii) an examination of the R-help list archives, and (iii) a survey of members of R Users Groups in San Francisco, LA, and New York City.

In the first hour, participants will pair up to claim a question, formulate it on StackOverflow, and provide a comprehensive answer. In the second hour, participants will rate, review, and comment on the set of submitted questions and answers.

While StackOverflow currently lacks content for the R language, we believe this effort will provide the spark to attract more R users and emerge as a valuable resource for the growing R community.

This is an interesting example of a statistical software community using Twitter for a tech-help event. I hope this trend gets replicated again and again.

Statisticians worldwide, unite in the language of maths!

Please follow @rstatsmob to participate. See you at 7 PM PST!

twitter.com/Rstatsmob

Growing Rapidly: Rapid Miner 4.5

The Europe-based Rapid-I came out with version 4.5 of their data mining tool RapidMiner (formerly known as YALE), with a promising new “Script” operator.

Also, RapidMiner came in first among open source data mining tools in a poll by the industry benchmark site www.kdnuggets.com.

They have a brilliant video here for people who just want a quick look at the new RapidMiner:

http://rapid-i.com/videos/rapidminer_tour_3_4_en.html

Citation-

http://rapid-i.com/content/view/147/1/

New Operators:

  • FormulaExtractor
  • Trend
  • LagSeries
  • VectorLinearRegression
  • ExampleSetMinus
  • ExampleSetIntersect
  • Partition
  • Script
  • ForwardSelection
  • NeuralNetImproved
  • KernelNaiveBayes
  • ExhaustiveSubgroupDiscovery
  • URLExampleSource
  • NonDominatedSorting

More Features:

  • The new Script operator allows for arbitrary user defined operations based on Groovy script combined with a simplified RapidMiner syntax
  • Improved the join operator and added options for left and right outer joins
  • New notification mail mechanism at the end of processes
  • Most file based data input operators now provide an option to skip error lines
  • Most file based example source operators as well as the IOObjectReader and the new URLExampleSource now accept URLs instead of a filename for the input source location

R language on the GPU

Here are some nice articles on using R on graphics processing units (GPUs), mainly made by NVidia. Think of a GPU as customized hardware for specialized computing, which translates to much faster computation. Matlab users can read the webinars here: http://www.nvidia.com/object/webinar.html

A slightly better definition of GPU computing comes from http://www.nvidia.com/object/GPU_Computing.html:

GPU computing is the use of a GPU (graphics processing unit) to do general purpose scientific and engineering computing.
The model for GPU computing is to use a CPU and GPU together in a heterogeneous computing model. The sequential part of the application runs on the CPU and the computationally-intensive part runs on the GPU. From the user’s perspective, the application just runs faster because it is using the high-performance of the GPU to boost performance.

rgpu

Citation:

http://brainarray.mbni.med.umich.edu/brainarray/rgpgpu/

R is the most popular open source statistical environment in the biomedical research community. However, most of the popular R function implementations involve no parallelism and they can only be executed as separate instances on multicore or cluster hardware for large data-parallel analysis tasks. The arrival of modern graphics processing units (GPUs) with user friendly programming tools, such as nVidia’s CUDA toolkit (http://www.nvidia.com/cuda), provides a possibility of increasing the computational efficiency of many common tasks by more than one order of magnitude (http://gpgpu.org/). However, most R users are not trained to program a GPU, a key obstacle for the widespread adoption of GPUs in biomedical research.

The research project at the page mentioned above has developed packages for exactly this need: R on a GPU.

The initial package is hosted on CRAN as gputools, a source package for UNIX and Linux systems. Be sure to set the environment variable CUDA_HOME to the root of your CUDA toolkit installation. Then install the package in the usual R manner. The installation process will automatically make use of nVidia’s nvcc compiler and CUBLAS shared library.
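As a sketch, the installation steps above might look like this on a Linux box (the CUDA path and package version here are assumptions; adjust for your system):

```shell
# Point CUDA_HOME at the root of the CUDA toolkit installation
# (/usr/local/cuda is an assumed default; adjust as needed).
export CUDA_HOME=/usr/local/cuda

# Install gputools in the usual R manner; nvcc and the CUBLAS
# shared library are picked up automatically during compilation.
R CMD INSTALL gputools_0.01.tar.gz
```

From within an R session, install.packages("gputools") should work equally well once CUDA_HOME is set.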

And some figures:

Figure 1 provides performance comparisons between original R functions, assuming a four-thread data-parallel solution on an Intel Core i7 920, and our GPU-enabled R functions on a GTX 295 GPU. The speedup test consisted of testing each of three algorithms with five randomly generated data sets. The Granger causality algorithm was tested with a lag of 2 for 200, 400, 600, 800, and 1000 random variables with 10 observations each. Complete hierarchical clustering was tested with 1000, 2000, 4000, 6000, and 8000 points. Calculation of Kendall’s correlation coefficient was tested with 20, 30, 40, 50, and 60 random variables with 10000 observations each.

Ajay- For hard-core data mining people, customized GPUs for accelerated analytics and data mining sound like fun and common sense. Are there other packages for customization on a GPU? Let me know.


Download

Download the gputools package for R on a Linux platform here: version 0.01.

Training on R

Here is an interesting training course from REvolution Computing.

New Training Course from REvolution Computing
High-Performance Computing with R
July 31, 2009 – Washington, DC – Prior to JSM
Time: 9am – 5pm
$600 commercial delegates, $450 government, $250 academic

Click Here to Register Now!

This one-day course presents an overview of available HPC technologies for the R language that enable faster, scalable analytics taking advantage of multiprocessor capability. This will include a comprehensive overview of REvolution’s recently released R packages foreach and iterators, which make parallel programming easier than ever before for R programmers, as well as other available technologies such as Rmpi, snow and many more. We will demonstrate each technology with simple examples that can be used as starting points for more sophisticated work. The agenda will also cover:

  • Identifying performance problems
  • Profiling R programs
  • Multithreading, using compiled code, GPGPU
  • Multiprocess computing
  • SNOW, MPI, NetWorkSpaces, and more
  • Batch queueing systems
  • Dealing with lots of data

Attendees should have basic familiarity with the R language—we will keep examples elementary but relevant to real-world applications.

This course will be conducted hands-on, classroom style. Computers will not be provided. Registrants are required to bring their own laptops.

For the full agenda, Click Here, or Click Here to Register Now!

Source: www.revolution-computing.com

Disclaimer- I am NOT commercially related to REvolution, I just love R. I do hope the REvolution chaps spend a tiny bit of time improving the user GUI as well, not just working on HPC.

They recently released some new packages free to the CRAN community as well.

They announced the release of three new packages for R designed to allow all R users to more quickly handle large, complex sets of data: iterators, foreach and doMC.

* iterators implements the “iterator” data structure familiar to users of languages like Java, C# and Python, making it easy to program useful sequences – from all the prime numbers to the columns of a matrix or the rows of an external database.

* foreach builds on the “iterators” package to introduce a new way of programming loops in R. Unlike the traditional “for” loop, foreach runs multiple iterations simultaneously, in parallel. This makes loops run faster on a multi-core laptop, and enables distribution of large parallel-processing problems to multiple workstations in a cluster or in the cloud, without additional complicated programming. foreach works with parallel programming backends for R from the open-source and commercial domains.

* doMC is an open source parallel programming backend that enables parallel computation with “foreach” on Unix/Linux machines. It automatically enables foreach and iterator functions to work with the “multicore” package from R Core member Simon Urbanek.

The new packages have been developed by REvolution Computing and released under open source licenses, enabling all existing R users to take advantage of them.
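To make the three packages concrete, here is a minimal sketch of how they fit together (assuming foreach, iterators and doMC are installed from CRAN):

```r
library(foreach)    # the %dopar% looping construct
library(iterators)  # icount() and other iterator constructors
library(doMC)       # multicore backend for Unix/Linux

registerDoMC(cores = 2)  # register a two-core parallel backend

# icount(5) yields 1..5 one value at a time; %dopar% runs the
# iterations in parallel, and .combine = c collects the results
# into a single vector.
squares <- foreach(i = icount(5), .combine = c) %dopar% {
  i^2
}
print(squares)  # the vector 1, 4, 9, 16, 25
```

Swapping doMC for a different backend (for example one based on snow or Rmpi) requires no change to the loop itself, which is the point of the design.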

Citation:

http://www.revolution-computing.com/aboutus/news-room/2009/breakthrough-parallel-packages-and-functions-for-r.php

Facebook Text Analytics

Here is a great presentation on Facebook Analytics using text mining.

Citation- Text Analytics Summit 2009 – Roddy Lindsay – “Social Media, Happiness, Petabytes and LOLs”
and here is a presentation on HIVE and HADOOP
HIVE: Data Warehousing & Analytics on Hadoop

Facebook sure looks like a surprisingly nice analytics company to work for! No wonder they have all but swamped the competition.

R and SAS in Twitter Land

A tale of two languages (set in Twitterland)

Every time I post to the R-help list, if the email contains the three letters S-A-S, I get plenty of e-spanking from senior professors and distinguished Linux people. On the other hand, when I mentioned W-P-S, I got dunked by the Don of SAS Global himself. We geeks are so passionate.

Here is some new stuff on Twitter for the R /Open Source community.

1) I manually made a list of

  1. best R blogs,
  2. R help lists (on Nabble, since Google Groups banned the R-help archive),
  3. Twitter Search for #rstats ( general search word for R)

I then copied the RSS feeds of each of the above.

2) I then went to www.twitterfeed.com (which uses OpenID) and linked a new Twitter account to these RSS feeds.

Screenshot-twitterfeed.com : feed your blog to twitter - Mozilla Firefox

3) I then tweaked the layout and added #rstats before each post to the new R resource http://twitter.com/Rarchive


If you are on Twitter, you can follow it at http://twitter.com/Rarchive and never miss any R news going forward.

PS- I also did the same for SAS at http://twitter.com/sascommunity

UPDATE

#rstats helps with SEO in Google, since Google uses Twitter search as well. The existing best R search engine is http://rseek.org.
In any case, it is too late to change now, since this is more like an automated firehose. Now you can use #rstats along with additional keywords to find more searchable, useful stuff.

NOTE-

http://twitter.com/sas belongs to a guy who is wondering who is trying to hack his Twitter account; you can check the screenshot below.

Screenshot-Sky Sutton (SAS) on Twitter - Mozilla Firefox