BigML meets R #rstats

I have just been checking out the nice new R package created by BigML.com co-founder Justin Donaldson. The package is named bigml, which can be a bit confusing, since R already has many packages whose names start with "big" (including biglm).

The bigml package is available at CRAN http://cran.r-project.org/web/packages/bigml/index.html

I tweaked the code given at http://blog.bigml.com/2012/05/10/r-you-ready-for-bigml/ to include the SSL authentication code from http://www.brocktibert.com/blog/2012/01/19/358/

So it goes:

> library(bigml)
Loading required package: RJSONIO
Loading required package: RCurl
Loading required package: bitops
Loading required package: plyr
> setCredentials("bigml_username", "API_key")

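# download the CA certificate bundle needed for SSL authentication
download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")

# set the curl options
curl <- getCurlHandle()
# cainfo takes a certificate file (capath expects a directory); this points at
# the bundle shipped with RCurl, use cainfo = "cacert.pem" for the download above
# note: ssl.verifypeer = FALSE turns off certificate verification
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem",
                                                 package = "RCurl"),
                            ssl.verifypeer = FALSE))
# only needed behind a proxy; replace 'proxyserver:port' with your own
curlSetOpt(.opts = list(proxy = 'proxyserver:port'), curl = curl)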

> iris.model = quickModel(iris, objective_field = 'Species')

Of course, there are lots more goodies in the package, so read the post yourself at http://blog.bigml.com/2012/05/10/r-you-ready-for-bigml/

Incidentally, the author of this R package (bigml), Justin Donaldson, who goes by the name sudojudo at http://twitter.com/#!/sudojudo, has also recently authored two other R packages: tsne at http://cran.r-project.org/web/packages/tsne/index.html (tsne: T-distributed Stochastic Neighbor Embedding for R (t-SNE), a “pure R” implementation of the t-SNE algorithm) and the GUI toolbar sculpt3d at http://cran.r-project.org/web/packages/sculpt3d/index.html (sculpt3d is a GTK+ toolbar that allows for more interactive control of a dataset inside the RGL plot window. Controls for simple brushing, highlighting, labeling, and mouseMode changes are provided by point-and-click rather than through the R terminal interface).

This, along with the fact that their recently released Python bindings for bigml.com were among the top stories on Hacker News, shows that BigML.com is gaining some traction in bringing cloud computing, better software interfaces and data mining together!

Troubleshooting Rattle Installation- Data Mining R GUI

I find the Rattle GUI very nice, and it makes any data mining task easy. The software is available from http://rattle.togaware.com/

The only issue is that Rattle can be quite difficult to install because of its dependencies on GTK+.

After fiddling for a couple of years, this is what I did:

1) Create a dual-boot OS. I downloaded the Netbook Remix from http://ubuntu.com and set up dual boot, so that I can choose at startup whether to use Windows or Ubuntu Linux in that session. Alternatively, you can download VMware Player (www.vmware.com/products/player/) if you want to run both at once.

2) Install the GTK+ dependencies first, then install the R packages from the Ubuntu repositories.

GTK+ requires:

  1. Libglade
  2. Glib
  3. Cairo
  4. Pango
  5. ATK

If you are a Linux newbie like me who doesn't get the sudo apt-get, tar, cd, make, install rigmarole, scoot over to the Synaptic Package Manager or just the main Ubuntu Software Centre and install these packages one by one.

3) For the R dependencies, you need

  • pmml
  • XML
  • RGtk2

Again, use r-cran as the prefix to these package names when searching, and simply install them (almost as easy as a double-click install on Windows).

see http://packages.ubuntu.com/search?suite=lucid&searchon=names&keywords=r-cran

4) Install Rattle from source

http://rattle.togaware.com/rattle-download.html

Advanced users can download the Rattle source packages directly:

Save these to your hard disk (e.g., to your Desktop) but don’t extract them. Then, on GNU/Linux run the install command shown below. This command is entered into a terminal window:

  • R CMD INSTALL rattle_2.6.0.tar.gz

After installation:

5) Type library(rattle) and rattle.info() to see which R packages need updating for Rattle to function properly.

> library(rattle)
Rattle: Graphical interface for data mining using R.
Version 2.6.0 Copyright (c) 2006-2010 Togaware Pty Ltd.
Type 'rattle()' to shake, rattle, and roll your data.
> rattle.info()
Rattle: version 2.6.0
R: version 2.11.1 (2010-05-31) (Revision 52157)

Sysname: Linux
Release: 2.6.35-23-generic
Version: #41-Ubuntu SMP Wed Nov 24 10:18:49 UTC 2010
Nodename: k1-M725R
Machine: i686
Login: k1ng
User: k1ng

Installed Dependencies
RGtk2: version 2.20.3
pmml: version 1.2.26
colorspace: version 1.0-1
cairoDevice: version 2.14
doBy: version 4.1.2
e1071: version 1.5-24
ellipse: version 0.3-5
foreign: version 0.8-41
gdata: version 2.8.1
gtools: version 2.6.2
gplots: version 2.8.0
gWidgetsRGtk2: version 0.0-69
Hmisc: version 3.8-3
kernlab: version 0.9-12
latticist: version 0.9-43
Matrix: version 0.999375-46
mice: version 2.4
network: version 1.5-1
nnet: version 7.3-1
party: version 0.9-99991
playwith: version 0.9-53
randomForest: version 4.5-36 upgrade available 4.6-2
rggobi: version 2.1.16
survival: version 2.36-2
XML: version 3.2-0
bitops: version 1.0-4.1

Upgrade the packages with:

 > install.packages(c("randomForest"))

Now upgrade whatever packages rattle.info() tells you to upgrade.

This is much simpler and less frustrating than some of the other ways to install Rattle.

If all goes well, you will see the familiar Rattle screen pop up when you type

> rattle()

 

Message from RATTLE

A new release of the R GUI Rattle is making its way to CRAN (currently on the Austrian server).

Latest version 2.5.47 (revision 527) released 13 Nov 2010.

Change Log link for details –

http://cran.r-project.org/web/packages/rattle/index.html

Major changes relate to simplifying the installation of Rattle under the recently released R 2.12.0 on Microsoft Windows 32bit and 64bit.

The major advance for R 2.12.0 is the improved support for 64bit Microsoft Windows and thus support for much larger datasets in memory.

See the new installation steps at http://datamining.togaware.com/survivor/Internet_Connected.html

For Microsoft Windows installations, to upgrade your Rattle installation you may need to remove any old installs of the Gtk+ libraries (using the Uninstall application from the Microsoft Windows Control Panel). Then install the new Gtk2 library:

http://downloads.sourceforge.net/gtk-win/gtk2-runtime-2.22.0-2010-10-21-ash.exe

You can then update Rattle to version 2.5.47:

> install.packages("rattle")

> library(rattle)

> rattle.info()

The output from rattle.info() will include an “install.packages” command that will identify Rattle related packages that have updates available. You can cut-and-paste that command to the R prompt to have those packages updated in your installation.

Source: rattle-users@googlegroups.com

http://rattle.togaware.com/

Here comes PySpread- 85,899,345 rows and 14,316,555 columns

What's new? One more open source analytics package, built like a spreadsheet, with the ability to import a million cells.

From http://pyspread.sourceforge.net/index.html

About

Pyspread is a cross-platform Python spreadsheet application. It is based on and written in the programming language Python.

Instead of spreadsheet formulas, Python expressions are entered into the spreadsheet cells. Each expression returns a Python object that can be accessed from other cells. These objects can represent anything including lists or matrices.

Features

In pyspread, cells expect Python expressions and return Python objects. Therefore, complex data types such as lists, trees or matrices can be handled within a single cell. Macros can be used for functions that are too complex for a single expression.

Since Python modules can be easily used without external scripts, arbitrary size rational numbers (via gmpy), fixed point decimal numbers for business calculations (via the decimal module from the standard library) and advanced statistics including plotting functions (via RPy) can be used in the spreadsheet. Everything is directly available from each cell. Just use the grid.
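
As a rough illustration of what "directly available from each cell" means, the sketch below evaluates two cell-style expressions using only the standard library. It uses fractions.Fraction as a stand-in for gmpy's rationals (an assumption, so that nothing has to be installed), and the eval_cell helper and the cell strings are hypothetical, not part of pyspread.

```python
from decimal import Decimal
from fractions import Fraction

# Names a cell expression may use in this sketch (hypothetical namespace).
CELL_NAMESPACE = {"Decimal": Decimal, "Fraction": Fraction}


def eval_cell(expression):
    """Evaluate one cell-style Python expression (hypothetical helper)."""
    # A fresh copy keeps eval from adding entries to the shared namespace.
    return eval(expression, dict(CELL_NAMESPACE))


# Exact rational arithmetic (Fraction stands in for gmpy here).
third_plus_sixth = eval_cell("Fraction(1, 3) + Fraction(1, 6)")

# Decimal arithmetic for money: 0.10 + 0.20 is exactly 0.30.
invoice_total = eval_cell("Decimal('0.10') + Decimal('0.20')")

print(third_plus_sixth)  # 1/2
print(invoice_total)     # 0.30
```

With plain floats, 0.1 + 0.2 would not compare equal to 0.3, which is why the decimal module matters for business calculations.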

Data can be imported and exported using csv files or the clipboard. Other forms of data exchange are possible using external Python modules.

In order to simplify sparse matrix editing, pyspread features a three-dimensional grid that can be sized up to 85,899,345 rows and 14,316,555 columns (64-bit systems; depends on row height and column width). Note that importing a million cells requires about 500 MB of memory.
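
A quick back-of-the-envelope check of those figures, taking 1 MB as 10^6 bytes (an assumption; the source does not say which convention it uses): 500 MB per million cells works out to roughly 500 bytes per cell, so a grid at the maximum dimensions can only ever be sparsely filled.

```python
# Figures quoted above; MB is taken as 10**6 bytes (an assumption).
MAX_ROWS = 85_899_345
MAX_COLUMNS = 14_316_555
BYTES_PER_MILLION_CELLS = 500 * 10**6

bytes_per_cell = BYTES_PER_MILLION_CELLS // 10**6
cells_per_table = MAX_ROWS * MAX_COLUMNS


def memory_needed_gb(filled_cells):
    """Estimated memory in GB (10**9 bytes) for a number of filled cells."""
    return filled_cells * bytes_per_cell / 10**9


print(bytes_per_cell)                # 500
print(memory_needed_gb(10_000_000))  # 5.0
print(f"{cells_per_table:.2e} cells in one full table")
```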

The concept of pyspread allows doing everything from each cell that a Python script can do. This may very well include deleting your hard drive or sending your data via the Internet. Of course this is a non-issue if you sandbox properly or if you only use self developed spreadsheets. Since this is not the case for everyone (see the discussion at lwn.net), a GPG signature based trust model for spreadsheet files has been introduced. It ensures that only your own trusted files are executed on loading. Untrusted files are displayed in safe mode. You can trust a file manually. Inspect carefully.

Requirements

Pyspread runs on Linux, Windows and *nix platforms with GTK+ support. There are reports that it works with MacOS X as well. If you would like to contribute by testing on OS X please contact me.

Dependencies

Highly recommended for full functionality

  • PyMe >=0.8.1, Note for Windows™ users: If you want to use signatures without compiling PyMe try out Gpg4win.
  • gmpy >=1.1.0 and
  • rpy >=1.0.3.

Maturity

Pyspread is in early Beta release. This means that the core functionality is fully implemented but the program needs testing and polish.

And from the wiki:

http://sourceforge.net/apps/mediawiki/pyspread/index.php?title=Main_Page

a spreadsheet with more powerful functions and data structures that are accessible inside each cell. Something like Python that empowers you to do things quickly. And yes, it should be free and it should run on Linux as well as on Windows. I looked around and found nothing that suited me. Therefore, I started pyspread.

Concept

  • Each cell accepts any input that works in a Python command line.
  • The inputs are parsed and evaluated by Python’s eval command.
  • The result objects are accessible via a 3D numpy object array.
  • String representations of the result objects are displayed in the cells.
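
The four points above can be sketched in a few lines. This only illustrates the idea and is not pyspread's code: the Grid class, its method names, the name S for the result store, and the plain dict keyed by (row, column, table) used in place of the 3D numpy object array are all assumptions, chosen to keep the example dependency-free.

```python
class Grid:
    """Toy spreadsheet: cells store Python source, results are objects."""

    def __init__(self):
        self.code = {}     # (row, col, table) -> expression string
        self.results = {}  # (row, col, table) -> evaluated Python object

    def set_cell(self, key, expression):
        """Store an expression and evaluate it with Python's eval."""
        self.code[key] = expression
        # S exposes earlier results so a cell can refer to other cells.
        self.results[key] = eval(expression, {"S": self.results})

    def display(self, key):
        """Return the string representation shown in the cell."""
        return str(self.results[key])


grid = Grid()
grid.set_cell((0, 0, 0), "[1, 2, 3]")                       # a list in one cell
grid.set_cell((1, 0, 0), "sum(x * x for x in S[0, 0, 0])")  # generator expression
grid.set_cell((2, 0, 0), "S[1, 0, 0] / 2")

print(grid.display((0, 0, 0)))  # [1, 2, 3]
print(grid.display((1, 0, 0)))  # 14
print(grid.display((2, 0, 0)))  # 7.0
```

This toy version evaluates a cell once, when it is set, and does not recompute dependent cells when an input changes. It also shows why untrusted spreadsheets are a concern: anything eval can run, a cell can run.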

Benefits

  • Each cell returns a Python object. This object can be anything including arrays and third party library objects.
  • Generator expressions can be used efficiently for data manipulation.
  • Efficient numpy slicing is used.
  • numpy methods are accessible for the data.

Installation

  1. Download the pyspread tarball or zip and unzip it at a convenient place.
  2. In case you do not have them already, get and install Python, wxpython and numpy.
     If you want the examples to work, install gmpy, R and rpy.
     Really do check the version requirements that are mentioned on http://pyspread.sf.net
  3. Get install privileges (e.g. become root).
  4. Change into the directory and type
     python setup.py install
     Windows: Replace “python” with your Python interpreter (absolute path).
  5. Become a normal user again.
  6. Start pyspread by typing
     pyspread
  7. Enjoy.

Next on my spreadsheet wishlist:

an MSI bundle / Windows self-installer with all dependencies bundled in it, linking to PostgreSQL 😉 etc.

Way to go, Mr Martin Manns!

mmanns < at > gmx < dot > net

Playing with Playwith- R Package for Interactive Data Visualizations

While browsing through the Google Code repositories for R packages,

https://code.google.com/hosting/search?q=label:R

I came across playwith, which is basically a toolkit for creating interactive data visualizations. I then played with ClusterApp (hierarchical), and it really seems promising. Since I am using R 2.12 on a Win 7 (x64) platform, something broke, but overall this seemed like a promising widget for building interactive tools.

playwith is an R package, providing a GTK+ graphical user interface for editing and interacting with R plots.

The playwith package is maintained by Felix Andrews <felix@nfrac.org>

Here is the data visualization, called ClusterApp, that impressed me. There is an obvious synergy between Rattle and playwith (though some bugs do come into play with the new R 2.12 on x64).

https://code.google.com/p/playwith/wiki/ClusterApp