Interview BigML.com

Here is an interview with Charlie Parker, head of large scale online algorithms at http://bigml.com

Ajay- Describe your own personal background in scientific computing, and how you came to be involved with machine learning, cloud computing and BigML.com

Charlie- I am a machine learning Ph.D. from Oregon State University. Francisco Martin (our founder and CEO), Adam Ashenfelter (the lead developer on the tree algorithm), and I were all studying machine learning at OSU around the same time. We all went our separate ways after that.

Francisco started Strands and turned it into a 100+ million dollar company building recommender systems. Adam worked for CleverSet, a probabilistic modeling company that was eventually sold to Cisco, I believe. I worked for several years in the research labs at Eastman Kodak on data mining, text analysis, and computer vision.

When Francisco left Strands to start BigML, he brought in Justin Donaldson, who is a brilliant visualization guy from Indiana, and an ex-Googler named Jose Ortega, who is responsible for most of our data infrastructure. They pulled in Adam and me a few months later. We also have Poul Petersen, a former Strands employee, who manages our herd of servers. He is a wizard and makes everyone else’s life much easier.

Ajay- You use Clojure for the back end of BigML.com. Are there any other languages and packages you are considering? What makes Clojure such a good fit for cloud computing?

Charlie- Clojure is a great language because it offers you all of the benefits of Java (extensive libraries, cross-platform compatibility, easy integration with things like Hadoop, etc.) but has the syntactical elegance of a functional language. This makes our code base small and easy to read as well as powerful.

We’ve had occasional issues with speed, but that just means writing the occasional function or library in Java. As we build towards processing data at the Terabyte level, we’re hoping to create a framework that is language-agnostic to some extent. So if we have some great machine learning code in C, for example, we’ll use Clojure to tie everything together, but the code that does the heavy lifting will still be in C. For the API and Web layers, we use Python and Django, and Justin is a huge fan of HaXe for our visualizations.

Ajay- Current support is for Decision Trees. When can we see SVM, K Means Clustering and Logit Regression?

Charlie- Right now we’re focused on perfecting our infrastructure and giving you new ways to put data in the system, but expect to see more algorithms appearing in the next few months. We want to make sure they are as beautiful and easy to use as the trees are. Without giving too much away, the first new thing we will probably introduce is an ensemble method of some sort (such as Boosting or Bagging). Clustering is a little further away but we’ll get there soon!

Ajay- How can we use the BigML.com API using R and Python?

Charlie- We have a public GitHub repo for the language bindings. https://github.com/bigmlcom/io Right now, there are only bash scripts but that should change very soon. The Python bindings should be there in a matter of days, and the R bindings in probably a week or two. Clojure and Java bindings should follow shortly after that. We’ll have a blog post about it each time we release a new language binding. http://blog.bigml.com/

Ajay- How can we predict large numbers of observations using a Model that has been built and pruned (model scoring)?

Charlie- We are in the process of refactoring our backend right now for better support for batch prediction and model evaluation. This is something that is probably only a few weeks away. Keep your eye on our blog for updates!

Ajay- How can we export models built in BigML.com for scoring data locally?

Charlie- This is as simple as a call to our API. https://bigml.com/developers/models The call gives you a JSON object representing the tree that is roughly equivalent to a PMML-style representation.
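
To make the idea of local scoring concrete, here is a minimal Python sketch of walking such a tree once it has been downloaded. The JSON layout below (the keys field, threshold, left, right and prediction) is a hypothetical stand-in, not BigML's actual schema, so check the API documentation for the real structure before relying on it.

```python
import json

# Hypothetical tree layout for illustration; NOT the actual BigML JSON schema.
MODEL_JSON = """
{
  "field": "petal_length", "threshold": 2.5,
  "left": {"prediction": "setosa"},
  "right": {
    "field": "petal_width", "threshold": 1.7,
    "left": {"prediction": "versicolor"},
    "right": {"prediction": "virginica"}
  }
}
"""


def predict(node, row):
    """Walk the tree until a leaf (a node carrying a 'prediction') is reached."""
    while "prediction" not in node:
        branch = "left" if row[node["field"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["prediction"]


tree = json.loads(MODEL_JSON)
rows = [
    {"petal_length": 1.4, "petal_width": 0.2},
    {"petal_length": 4.7, "petal_width": 1.4},
    {"petal_length": 6.0, "petal_width": 2.5},
]
predictions = [predict(tree, row) for row in rows]
print(predictions)
```

Once the tree is a plain Python dictionary, scoring a large batch of observations locally is just a loop over the rows.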

About-

You can read about Charlie Parker at http://www.linkedin.com/pub/charles-parker/11/85b/4b5 and the rest of the BigML team at

https://bigml.com/team

Comparing Bit Torrent Downloaders

I personally like UTorrent on Windows and KTorrent on Linux.

While I am no expert on this, I like anything that gets the data down faster while maximizing my pipe’s efficiency.

I also prefer torrenting to the sudo apt-get method of downloading software, or the zip/unzip, tar/untar, install/make routine.

Torrenting is a simpler way of sharing applications, but sadly it is not used much by the stats computing community to share downloads.

Also, I think any dashboard or visualization should be sorted (not alphabetically, but numerically or categorically).

SORT THE DASHBOARD - KEEP IT SORTED

So, after sorting, I am partially recreating the data viz from http://en.wikipedia.org/wiki/Comparison_of_BitTorrent_clients

BitTorrent client	Magnet URI	Super-seeding	Embedded tracker	UPnP	NAT Port Mapping Protocol	NAT traversal	DHT	Peer exchange	Encryption	UDP tracker	LPD
µTorrent	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes
BitSpirit	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	No
BitTorrent 6	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes
OneSwarm	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	No
qBittorrent	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes
SoMud	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes
Vuze (formerly Azureus)	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	No
BitComet	Yes	Yes	Separate download	Yes	Yes	Yes	Yes	Yes	Yes	Yes	No
Tixati	Yes	Yes	No	Yes	No	No	Yes	Yes	Yes	Yes	Partial
Aria2	Yes	No	Yes	No	No	No	Yes	Yes	Yes	Yes	Yes
Tribler	Yes	No	Yes	Yes	Yes	No	Yes	Yes	Yes	No	No
Bitflu	Yes	No	No	No	No	No	Yes	Yes	No	Yes	No
Deluge	Yes	No	No	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes
Flush	Yes	No	No	Yes	Yes	No	Yes	Yes	No	No	Yes
KTorrent	Yes	No	No	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Partial
Shareaza	Yes	No	No	Yes	Yes	No	Yes[93]	Yes	No	No	No
Transmission	Yes	No	No	Yes	Yes	Yes	Yes	Yes[94]	Yes	No	Yes
LimeWire	Partial	Yes	Yes	Yes	Yes	No	Yes	Yes	Yes	Yes	No
BitTyrant	No	Yes[citation needed]	Yes	Yes	Yes	Yes	Yes	Yes	Yes	No	No
BitTornado	No	Yes	Yes	Yes	No	No	No	No	Yes	No	No
Torrent Swapper	No	Yes	Yes	Yes	No	No	No	Yes	No	No	No
Localhost	No	Yes	Yes	Yes	No	Yes	Yes	No	No	No	No
Meerkat Bittorrent Client	No	Yes	No	Yes	Yes	Yes	Yes	No	Yes	No	No
rTorrent	No	Yes	No	No	No	No	Yes	Yes	Yes	Yes	No
TorrentFlux	No	Yes	No	Yes	No	No	No	No	Yes	No	No
TorrentVolve	No	Partial	No	Partial	Partial	Partial	Partial	Partial	Partial	Partial	No
Opera	No	No	Yes	No	No	No	No	Yes	No	No	No
BitTorrent 5 / Mainline	No	No	Yes	Yes	Yes	No	Yes	Yes	Yes	No	No
ABC	No	No	Yes	Yes	No	No	No	No	No	No	No
Blog Torrent	No	No	Yes	No	No	No	No	No	No	No	No
MLDonkey	No	No	Yes	Yes	Yes	No	No	No	No	Yes	No
Tomato Torrent	No	No	Yes	No	No	No	Yes	No	No	No	No
Acquisition	No	No	No	No	Yes	No	No	No	No	No	No
Arctic Torrent	No	No	No	No	No	No	No	Yes	No	No	No
BitLet	No	No	No	Yes	No	No	No	No	No	No	No
BitLord	No	No	No	Yes	No	Yes	No	Yes	No	Yes	No
BitThief	No	No	No	No	No	No	No	No	No	No	No
Bits on Wheels	No	No	No	No	No	No	No	No	No	No	No
BTG	No	No	No	Yes	Yes	No	Yes	Yes	Yes	Yes	No
BTPD	No	No	No	No	No	No	No	No	No	No	No
FlashGet	No	No	No	No	No	No	Yes	No	Yes	No	No
Folx	No	No	No	Yes	Yes	No	Yes	Yes	No	Yes	No
Free Download Manager	No	No	No	No	No	No	Yes	Yes	No	No	No
G3 Torrent	No	No	No	No	No	No	No	No	No	No	No
Gnome BitTorrent	No	No	No	No	No	No	No	No	No	No	No
Halite	No	No	No	Yes	Yes	No	Yes	No	Yes	No	No
QTorrent	No	No	No	No	No	No	No	No	No	No	No
Rufus	No	No	No	No	No	No	No	No	No	No	No
SymTorrent	No	No	No	N/A	N/A	N/A	No	No	No	No	No
Tonido Torrent	No	No	No	Yes	Yes	Yes	Yes	No	No	No	No
Torium	No	No	No	Yes	No	No	Yes	No	No	No	No
ZipTorrent	No	No	No	Yes	Yes	No	No	Yes	No	No	No
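
A sort along these lines can be sketched in Python: rank each answer (Yes before Partial before No) and compare rows column by column from left to right. The four rows are copied from the table with only the first four feature columns kept for brevity; the ranking dictionary and function names are my own illustrative choices, not anything Wikipedia uses.

```python
# Rank answers so that "Yes" sorts first, then "Partial", then "No".
RANK = {"Yes": 0, "Partial": 1, "No": 2}

# (client, Magnet URI, Super-seeding, Embedded tracker, UPnP), taken from the table.
rows = [
    ("rTorrent", "No", "Yes", "No", "No"),
    ("KTorrent", "Yes", "No", "No", "Yes"),
    ("qBittorrent", "Yes", "Yes", "Yes", "Yes"),
    ("LimeWire", "Partial", "Yes", "Yes", "Yes"),
]


def sort_key(row):
    """Compare rows feature by feature; unknown answers sort last."""
    return [RANK.get(answer, len(RANK)) for answer in row[1:]]


sorted_rows = sorted(rows, key=sort_key)
for name, *features in sorted_rows:
    print(name.ljust(12), " ".join(f.ljust(8) for f in features))
```

The point is that the order carries information: the most capable clients float to the top, which an alphabetical listing hides.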

PySpread Magic

Just working with PySpread, and worked on a 1 million by 1 million spreadsheet. Python sure looks promising for the way ahead for stat computing. To install it, you need to run

sudo apt-get install python-numpy python-rpy python-scipy python-gmpy wxpython*

then cd to the untarred bz2 file from http://pyspread.sourceforge.net/download.html and run the installer, like

:~/Downloads$ cd pyspread-0.1.2

:~/Downloads/pyspread-0.1.2$ sudo python setup.py install

http://pyspread.sourceforge.net/

by Martin Manns

about

Pyspread is a cross-platform Python spreadsheet application. It is based on and written in the programming language Python.

Instead of spreadsheet formulas, Python expressions are entered into the spreadsheet cells. Each expression returns a Python object that can be accessed from other cells. These objects can represent anything including lists or matrices.

features

Three dimensional grid with up to 85,899,345 rows and 14,316,555 columns (64 bit systems, depends on row height and column width). Note that a million cells require about 500 MB of memory.
Complex data types such as lists, trees or matrices within a single cell.
Macros for functionalities that are too complex for a single Python expression.
Python module access from each cell, which allows:
- Arbitrary size rational numbers (via gmpy),
- Fixed point decimal numbers for business calculations, (via the decimal module from the standard library)
- Advanced statistics including plotting functions (via RPy)
- Much more via <your favourite module>.
CSV import and export
Clipboard access
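
As a small illustration of what module access from a cell buys you, the expressions below are ordinary Python of the kind that could be typed into cells. This sketch uses only the standard library: decimal as named in the feature list, and fractions.Fraction as a stand-in for gmpy's rationals (gmpy itself is not used here, which is my substitution).

```python
from decimal import Decimal
from fractions import Fraction

# Binary floats accumulate rounding error in money-style sums.
float_total = 0.1 + 0.2

# Decimal keeps exact decimal digits, which suits business calculations.
decimal_total = Decimal("0.10") + Decimal("0.20")

# Fraction gives exact rationals of arbitrary size
# (standard-library stand-in for gmpy; an assumption of this sketch).
third = Fraction(1, 3)
rational_total = third + third + third

print(float_total == 0.3)    # False
print(decimal_total)         # 0.30
print(rational_total == 1)   # True
```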

warning

The concept of pyspread allows doing everything from each cell that a Python script can do. This powerful feature has its drawbacks. A spreadsheet may very well delete your hard drive or send your data via the Internet. Of course this is a non-issue if you sandbox properly or if you only use self developed spreadsheets.

Since this is not the case for everyone (see discussion at lwn.net), a GPG signature based trust model for spreadsheet files has been introduced. It ensures that only your own trusted files are executed on loading. Untrusted files are displayed in safe mode. You can approve a file manually. Inspect carefully.

Cloud Computing with R

Here is a short list of resources and material I put together as starting points for R and cloud computing. It’s a bit messy but overall should serve quite comprehensively.

Cloud computing is a commonly used expression for a generational change in computing, from desktops and servers to remote, massive, shared computing resources enabled by high bandwidth across the internet.

As per the National Institute of Standards and Technology Definition,
Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.

(Citation: The NIST Definition of Cloud Computing

Authors: Peter Mell and Tim Grance
Version 15, 10-7-09
National Institute of Standards and Technology, Information Technology Laboratory
http://csrc.nist.gov/groups/SNS/cloud-computing/cloud-def-v15.doc)

R is an integrated suite of software facilities for data manipulation, calculation and graphical display.

From http://cran.r-project.org/doc/FAQ/R-FAQ.html#R-Web-Interfaces

R Web Interfaces

Rweb is developed and maintained by Jeff Banfield. The Rweb Home Page provides access to all three versions of Rweb—a simple text entry form that returns output and graphs, a more sophisticated JavaScript version that provides a multiple window environment, and a set of point and click modules that are useful for introductory statistics courses and require no knowledge of the R language. All of the Rweb versions can analyze Web accessible datasets if a URL is provided.
The paper “Rweb: Web-based Statistical Analysis”, providing a detailed explanation of the different versions of Rweb and an overview of how Rweb works, was published in the Journal of Statistical Software (http://www.jstatsoft.org/v04/i01/).

Ulf Bartel has developed R-Online, a simple on-line programming environment for R which intends to make the first steps in statistical programming with R (especially with time series) as easy as possible. There is no need for a local installation since the only requirement for the user is a JavaScript capable browser. See http://osvisions.com/r-online/ for more information.

Rcgi is a CGI WWW interface to R by MJ Ray. It had the ability to use “embedded code”: you could mix user input and code, allowing the HTML author to do anything from load in data sets to enter most of the commands for users without writing CGI scripts. Graphical output was possible in PostScript or GIF formats and the executed code was presented to the user for revision. However, it is not clear if the project is still active.

Currently, a modified version of Rcgi by Mai Zhou (actually, two versions: one with (bitmap) graphics and one without) as well as the original code are available from http://www.ms.uky.edu/~statweb/.

CGI-based web access to R is also provided at http://hermes.sdu.dk/cgi-bin/go/. There are many additional examples of web interfaces to R which basically allow to submit R code to a remote server, see for example the collection of links available from http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/StatCompCourse.

David Firth has written CGIwithR, an R add-on package available from CRAN. It provides some simple extensions to R to facilitate running R scripts through the CGI interface to a web server, and allows submission of data using both GET and POST methods. It is easily installed using Apache under Linux and in principle should run on any platform that supports R and a web server provided that the installer has the necessary security permissions. David’s paper “CGIwithR: Facilities for Processing Web Forms Using R” was published in the Journal of Statistical Software (http://www.jstatsoft.org/v08/i10/). The package is now maintained by Duncan Temple Lang and has a web page at http://www.omegahat.org/CGIwithR/.

Rpad, developed and actively maintained by Tom Short, provides a sophisticated environment which combines some of the features of the previous approaches with quite a bit of JavaScript, allowing for a GUI-like behavior (with sortable tables, clickable graphics, editable output), etc.
Jeff Horner is working on the R/Apache Integration Project which embeds the R interpreter inside Apache 2 (and beyond). A tutorial and presentation are available from the project web page at http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/RApacheProject.

Rserve is a project actively developed by Simon Urbanek. It implements a TCP/IP server which allows other programs to use facilities of R. Clients are available from the web site for Java and C++ (and could be written for other languages that support TCP/IP sockets).

OpenStatServer is being developed by a team led by Greg Warnes; it aims “to provide clean access to computational modules defined in a variety of computational environments (R, SAS, Matlab, etc) via a single well-defined client interface” and to turn computational services into web services.

Two projects use PHP to provide a web interface to R. R_PHP_Online by Steve Chen (though it is unclear if this project is still active) is somewhat similar to the above Rcgi and Rweb. R-php is actively developed by Alfredo Pontillo and Angelo Mineo and provides both a web interface to R and a set of pre-specified analyses that need no R code input.

webbioc is “an integrated web interface for doing microarray analysis using several of the Bioconductor packages” and is designed to be installed at local sites as a shared computing resource.

Rwui is a web application to create user-friendly web interfaces for R scripts. All code for the web interface is created automatically. There is no need for the user to do any extra scripting or learn any new scripting techniques. Rwui can also be found at http://rwui.cryst.bbk.ac.uk.

Finally, the R.rsp package by Henrik Bengtsson introduces “R Server Pages”. Analogous to Java Server Pages, an R server page is typically HTML with embedded R code that gets evaluated when the page is requested. The package includes an internal cross-platform HTTP server implemented in Tcl, so provides a good framework for including web-based user interfaces in packages. The approach is similar to the use of the brew package with Rapache with the advantage of cross-platform support and easy installation.

Also additional R Cloud Computing Use Cases
http://wwwdev.ebi.ac.uk/Tools/rcloud/

ArrayExpress R/Bioconductor Workbench

Remote access to R/Bioconductor on EBI’s 64-bit Linux Cluster

Start the workbench by downloading the package for your operating system (Macintosh or Windows), or via Java Web Start, and you will get access to an instance of R running on one of EBI’s powerful machines. You can install additional packages, upload your own data, work with graphics and collaborate with colleagues, all as if you are running R locally, but unlimited by your machine’s memory, processor or data storage capacity.

Most up-to-date R version built for multicore CPUs
Access to all Bioconductor packages
Access to our computing infrastructure
Fast access to data stored in EBI’s repositories (e.g., public microarray data in ArrayExpress)

Using R Google Docs
http://www.omegahat.org/RGoogleDocs/run.pdf
It uses the XML and RCurl packages and illustrates that it is relatively quick and easy
to use their primitives to interact with Web services.

Using R with Amazon
Citation
http://rgrossman.com/2009/05/17/running-r-on-amazons-ec2/

Amazon’s EC2 is a type of cloud that provides on-demand computing infrastructure called Amazon Machine Images, or AMIs. In general, these types of cloud provide several benefits:

Simple and convenient to use. An AMI contains your applications, libraries, data and all associated configuration settings. You simply access it. You don’t need to configure it. This applies not only to applications like R, but also can include any third-party data that you require.
On-demand availability. AMIs are available over the Internet whenever you need them. You can configure the AMIs yourself without involving the service provider. You don’t need to order any hardware and set it up.
Elastic access. With elastic access, you can rapidly provision and access the additional resources you need. Again, no human intervention from the service provider is required. This type of elastic capacity can be used to handle surge requirements when you might need many machines for a short time in order to complete a computation.
Pay per use. The cost of 1 AMI for 100 hours and 100 AMIs for 1 hour is the same. With pay per use pricing, which is sometimes called utility pricing, you simply pay for the resources that you use.
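
The pay-per-use arithmetic is easy to check in a few lines of Python. The hourly rate below is a made-up figure for illustration, not an actual Amazon price.

```python
# Hypothetical hourly rate in cents; NOT an actual EC2 price.
HOURLY_RATE_CENTS = 10


def cost_cents(instances, hours, rate=HOURLY_RATE_CENTS):
    """Utility pricing: you pay for the instance-hours you actually use."""
    return instances * hours * rate


one_for_hundred = cost_cents(instances=1, hours=100)
hundred_for_one = cost_cents(instances=100, hours=1)
print(one_for_hundred, hundred_for_one)
```

Both calls bill the same 100 instance-hours, which is why spreading a job over many machines for a short time costs no more than running one machine for a long time.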

Connecting to R on Amazon EC2- Detailed tutorials
Ubuntu Linux version
https://decisionstats.com/2010/09/25/running-r-on-amazon-ec2/
and Windows R version
https://decisionstats.com/2010/10/02/running-r-on-amazon-ec2-windows/

Connecting R to Data on Google Storage and Computing on Google Prediction API
https://github.com/onertipaday/predictionapirwrapper
R wrapper for working with Google Prediction API

This package consists in a bunch of functions allowing the user to test Google Prediction API from R.
It requires the user to have access to both Google Storage for Developers and Google Prediction API:
see http://code.google.com/apis/storage/ and http://code.google.com/apis/predict/ for details.

Example usage:

#This example requires you had previously created a bucket named data_language on your Google Storage and you had uploaded a CSV file named language_id.txt (your data) into this bucket – see for details
library(predictionapirwrapper)

and Elastic R for Cloud Computing
http://user2010.org/tutorials/Chine.html

Abstract

Elastic-R is a new portal built using the Biocep-R platform. It enables statisticians, computational scientists, financial analysts, educators and students to use cloud resources seamlessly; to work with R engines and use their full capabilities from within simple browsers; to collaborate, share and reuse functions, algorithms, user interfaces, R sessions, servers; and to perform elastic distributed computing with any number of virtual machines to solve computationally intensive problems.
Also see Karim Chine’s http://biocep-distrib.r-forge.r-project.org/

R for Salesforce.com

At the point of writing this, there seem to be zero R-based apps on Salesforce.com. This could be a big opportunity for developers, as both Apex and R have similar structures. Developers could write free code in R and charge for their translated version in Apex on Salesforce.com.

Force.com and Salesforce have many (1009) apps at
http://sites.force.com/appexchange/home for cloud computing for
businesses, but very few forecasting and statistical simulation apps.

Example of Monte Carlo based app is here
http://sites.force.com/appexchange/listingDetail?listingId=a0N300000016cT9EAI#

These are like iPhone apps except meant for business purposes (I am
unaware of any university offering Salesforce.com integration,
though research related to Google Apps and Amazon seems to be under way).

Force.com uses a language called Apex and you can see
http://wiki.developerforce.com/index.php/App_Logic and
http://wiki.developerforce.com/index.php/An_Introduction_to_Formulas
Apex is similar to R in that it is object-oriented.

SAS Institute has an existing product for taking in Salesforce.com data.

A new SAS data surveyor is
available to access data from the Customer Relationship Management
(CRM) software vendor Salesforce.com; see
http://support.sas.com/documentation/cdl/en/whatsnew/62580/HTML/default/viewer.htm#datasurveyorwhatsnew902.htm

Personal Note- Mentioning SAS in an email to an R list is a big no-no in terms of getting a response and love. Same for being careless about which R help list to email (like R devel or R packages or R help).

For python based cloud see http://pi-cloud.com

Here comes PySpread- 85,899,345 rows and 14,316,555 columns

What’s new: one more open source analytics package, built like a spreadsheet with the ability to import a million cells.

From http://pyspread.sourceforge.net/index.html

about	Pyspread is a cross-platform Python spreadsheet application. It is based on and written in the programming language Python. Instead of spreadsheet formulas, Python expressions are entered into the spreadsheet cells. Each expression returns a Python object that can be accessed from other cells. These objects can represent anything including lists or matrices.
features	In pyspread, cells expect Python expressions and return Python objects. Therefore, complex data types such as lists, trees or matrices can be handled within a single cell. Macros can be used for functions that are too complex for a single expression. Since Python modules can be easily used without external scripts, arbitrary size rational numbers (via gmpy), fixed point decimal numbers for business calculations (via the decimal module from the standard library) and advanced statistics including plotting functions (via RPy) can be used in the spreadsheet. Everything is directly available from each cell. Just use the grid. Data can be imported and exported using csv files or the clipboard. Other forms of data exchange are possible using external Python modules. In order to simplify sparse matrix editing, pyspread features a three dimensional grid that can be sized up to 85,899,345 rows and 14,316,555 columns (64 bit-systems, depends on row height and column width). Note that importing a million cells requires about 500 MB of memory. The concept of pyspread allows doing everything from each cell that a Python script can do. This may very well include deleting your hard drive or sending your data via the Internet. Of course this is a non-issue if you sandbox properly or if you only use self developed spreadsheets. Since this is not the case for everyone (see the discussion at lwn.net), a GPG signature based trust model for spreadsheet files has been introduced. It ensures that only your own trusted files are executed on loading. Untrusted files are displayed in safe mode. You can trust a file manually. Inspect carefully.
requirements	Pyspread runs on Linux, Windows and *nix platforms with GTK+ support. There are reports that it works with MacOS X as well. If you would like to contribute by testing on OS X please contact me. Dependencies: Python >=2.4 <3.0, numpy >=1.1.0 and wxPython >=2.8.10.1. Highly recommended for full functionality: PyMe >=0.8.1, gmpy >=1.1.0 and rpy >=1.0.3. Note for Windows™ users: If you want to use signatures without compiling PyMe, try out Gpg4win.
maturity	Pyspread is in early Beta release. This means that the core functionality is fully implemented but the program needs testing and polish.

and from the wiki

http://sourceforge.net/apps/mediawiki/pyspread/index.php?title=Main_Page

a spreadsheet with more powerful functions and data structures that are accessible inside each cell. Something like Python that empowers you to do things quickly. And yes, it should be free and it should run on Linux as well as on Windows. I looked around and found nothing that suited me. Therefore, I started pyspread.

Concept

Each cell accepts any input that works in a Python command line.
The inputs are parsed and evaluated by Python’s eval command.
The result objects are accessible via a 3D numpy object array.
String representations of the result objects are displayed in the cells.
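
The concept can be sketched in a few lines of Python. This is not pyspread's code: a plain dictionary keyed by (row, column, table) stands in for the 3D numpy object array, and the Grid class and the name S for the grid are chosen here for illustration.

```python
# Cell source code, keyed by (row, column, table).
code = {
    (0, 0, 0): "[1, 2, 3]",
    (1, 0, 0): "sum(S[0, 0, 0])",
    (2, 0, 0): "S[1, 0, 0] * 7",
}


class Grid:
    """Evaluates cell expressions on demand with Python's eval."""

    def __init__(self, code):
        self.code = code

    def __getitem__(self, key):
        # Each lookup evaluates the cell's expression; the name S lets
        # an expression read the result objects of other cells.
        return eval(self.code[key], {"S": self})


S = Grid(code)
results = {key: S[key] for key in code}                        # result objects
display = {key: str(value) for key, value in results.items()}  # what a cell shows
print(display)
```

Because eval runs whatever a cell contains, a spreadsheet built this way can do anything a script can, which is exactly why the trust model for untrusted files matters.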

Benefits

Each cell returns a Python object. This object can be anything including arrays and third party library objects.
Generator expressions can be used efficiently for data manipulation.
Efficient numpy slicing is used.
numpy methods are accessible for the data.

Installation

Download the pyspread tarball or zip and unzip at a convenient place
In case you do not have it already get and install Python, wxpython and numpy

If you want the examples to work, install gmpy, R and rpy

Really do check the version requirements that are mentioned on http://pyspread.sf.net

Get install privileges (e.g. become root)
Change into the directory and type

python setup.py install

Windows: Replace “python” with your Python interpreter (absolute path)

Become normal user again
Start pyspread by typing

pyspread

Enjoy

R Apache – The next frontier of R Computing

I am currently trying out RApache, one more excellent R product from Vanderbilt’s excellent Dept of Biostatistics and its prodigious coder Jeff Horner.

I really liked the virtual machine idea: you can download a virtual image of Rapache and play with it. A .vmx is easy to create and great to share.

http://rapache.net/vm.html

Basically using R Apache (with an EC2 on backend) can help you create customized dashboards, BI apps, etc all using R’s graphical and statistical capabilities.

What’s R Apache?

As per

http://biostat.mc.vanderbilt.edu/wiki/Main/RapacheWebServicesReport

Rapache embeds the R interpreter inside the Apache 2 web server. By doing this, Rapache realizes the full potential of R and its facilities over the web. R programmers configure Apache by mapping Uniform Resource Locators (URLs) to either R scripts or R functions. The R code relies on CGI variables to read a client request and R’s input/output facilities to write the response.

One advantage to Rapache’s architecture is robust multi-process management by Apache. In contrast to Rserve and RSOAP, Rapache is a pre-fork server utilizing HTTP as the communications protocol. Another advantage is a clear separation, a loose coupling, of R code from client code. With Rserve and RSOAP, the client must send data and R commands to be executed on the server. With Rapache the only client requirement is the ability to communicate via HTTP. Additionally, Rapache gains significant authentication, authorization, and encryption mechanisms by virtue of being embedded in Apache.
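
Since HTTP is the only client requirement, a client in any language can be a few lines. The Python sketch below only builds the request URL; the host, path, and parameter names are hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode, urlunsplit


def build_request_url(host, path, params):
    """Encode parameters into a GET URL; server-side R code would read them as CGI variables."""
    return urlunsplit(("http", host, path, urlencode(params), ""))


# Hypothetical endpoint that the server maps to an R script.
url = build_request_url(
    host="example.org",
    path="/R/plot",
    params={"dataset": "pitches", "pitcher": "Smith, J."},
)
print(url)
# To send it for real: urllib.request.urlopen(url).read()
```

Note that the client sends no R code at all, only parameters, which is the loose coupling described above.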

Existing Demos of Architecture based on R Apache-

http://rweb.stat.ucla.edu/ggplot2/ An interactive web dashboard for plotting graphics based on csv or Google Spreadsheet Data
http://labs.dataspora.com/gameday/ A demo visualization of a web based dashboard system of baseball pitches by pitcher by player
http://data.vanderbilt.edu/rapache/bbplot For baseball results, a demo of a query based web dashboard system with a very good BI feel.

What’s coming next in R Apache?

You can download version 1.1.10 of rApache now. There
are only two significant changes and you don’t have to edit your
apache config or change any code (just recompile rApache and
reinstall):

1) Error reporting should be more informative, both when you
accidentally introduce errors in the Apache config, and when your code
introduces warnings and errors from web requests.

I’ve struggled with this one for awhile, not really knowing what
strategy would be best. Basically, rApache hooks into the R I/O layer
at such a low level that it’s hard to capture all warnings and errors
as they occur and introduce them to the user in a sane manner. In
prior releases, when ROutputErrors was in effect (either the apache
directive or the R function) one would typically see a bunch of grey
boxes with a red outline with a title of RApache Warning/Error!!!.
Unfortunately those grey boxes could contain empty lines, one line of
error, or a few that relate to the lines in previously displayed
boxes. Really a big uninformative mess.

The new approach is to print just one warning box with the title
“Oops!!! <b>rApache</b> has something to tell you. View source and
read the HTML comments at the end.” and then as the title implies you
can read the HTML comment located at the end of the file… after the
closing html. That way, you’re actually reading how R would present
the warnings and errors to you as if you executed the code at the R
command prompt. And if you don’t use ROutputErrors, the warning/error
messages are printed in the Apache log file, just as they were before,
but nicer 😉

2) Code dispatching has changed so please let me know if I’ve
introduced any strange behavior.

This was necessary to enhance error reporting. Prior to this release,
rApache would use R’s C API exclusively to build up the call to your
code that is then passed to R’s evaluation engine. The advantage to
this approach is that it’s much more efficient as there is no parsing
involved, however all information about parse errors, files which
produced errors, etc. were lost. The new approach uses R’s built-in
parse function to build up the call and then passes it off to R. A
slight overhead, but it should be negligible. So, if you feel that
this approach is too slow OR I’ve introduced bugs or strange behavior,
please let me know.

FUTURE PLANS

I’m gaining more experience building Debian/Ubuntu packages each day,
so hopefully by some time in 2011 you can rely on binary releases for
these distributions and not install rApache from source! Fingers
crossed!

Development on the rApache 1.1 branch will be winding down (save bug
fix releases) as I transition to the 1.2 branch. This will involve
taking out a small chunk of code that defines the rApache development
environment (all the CGI variables and the functions such as
setHeader, setCookie, etc) and placing it in its own R package…
unnamed as of yet. This is to facilitate my development of the ralite
R package, a small single user cross-platform web server.

The goal for ralite is to speed up development of R web applications,
take out a bit of friction in the development process by not having to
run the full rApache server. Plus it would allow users to develop in
the rApache environment while on Windows and later deploy on more
capable server environments. The secondary goal for ralite is its use
in other web server environments (nginx and IIS come to mind) as a
persistent per-client process.

And finally, wiki.rapache.net will be the new www.rapache.net once I
translate the manual over… any day now.

From http://biostat.mc.vanderbilt.edu/wiki/Main/JeffreyHorner

Not convinced? Try the demos above.
