Using R and RapidMiner together #rstats

I just came across this interesting corporate blog, and I must confess I really like the design as well as the content. Simafore is an analytics company. The post, of course, was on combining R with RapidMiner.

There are many packages and libraries in R, specifically tailored to handle time series forecasting in the “traditional” manner. RapidMiner integrates really well with R by providing two mechanisms:

  • an interactive console, similar to the native R console and somewhat less sophisticated than RStudio
  • and a more powerful full integration of R capabilities within the RapidMiner process design perspective.

The first option is fairly easy to put to work, assuming you have successfully added the R extension to RapidMiner. But the second option requires some initial planning. The key is to understand how to pass data from RapidMiner to R and back. Once you understand this simple but important aspect, R essentially becomes another powerful “operator” within the vast library of existing RapidMiner operators.

You can read the complete article here: http://www.simafore.com/blog/bid/204923/combining-power-of-r-and-rapidminer-for-time-series-forecasting

Is the non-Chinese Internet America’s cyber colony?

The world’s largest internet companies are either in China or the USA (except for a few in Japan).

http://en.wikipedia.org/wiki/List_of_largest_Internet_companies

China’s strategy has helped it to:

  1. protect its citizens’ data from foreign data collection
  2. safeguard the industrial secrets of its own corporations
  3. create a domestic ecosystem that benefits its own entrepreneurs and investors
  4. create basic infrastructure for resisting cyber warfare (e.g. the Stuxnet virus)

The rest of the non-Chinese world now suffers the following ignominies:

  1. Most of the profits from IPOs and ads flow to US companies
  2. US government arms spy in active collaboration with US companies, including gathering industrial secrets for trade negotiations
  3. There is almost no Internet infrastructure that is neither Chinese nor American
  4. California-led companies have built a monopoly across the rest of the world
  5. Top technical talent has drained away to the USA
  6. The USA has a great ecosystem of investors and tech, to the detriment of all other countries

The Internet is the new opium, and after getting the rest of the world addicted, US lawmakers are abandoning core principles like net neutrality and opting for unrestricted warrantless spying, as well as preparing inroads for cyber attacks.

China is correctly safeguarding the future interests of its citizens, while the rest of the world is now a cyber colony mostly occupied by the USA.

Quo Vadis?

Countries have no friends, only interests

Book Review – Big Data Analytics with R and Hadoop

I have written about Vignesh’s impressive work in R before, including his help in updating the RGoogleAnalytics package for the API changes while at Tatvic. He is quite young and very eager to contribute to open source and knowledge.

This is a timely and impressive book, given that both R and Hadoop are hot topics, have a lot of noise and hoopla around them, and need a straightforward explanation of how to do things using R and Hadoop. It demystifies both R and Hadoop sufficiently that you are not intimidated at the thought of learning multiple languages (R/Java/MapReduce), multiple paradigms (distributed computing and analysis) and multiple installations (R/Hadoop/RHadoop). Suffice it to say that if the future belongs to Big Data and Hadoop, Linux users will have it easier than Windows users.

My main criticism is that everything is written in bullet points, which can hurt readability for the lay reader who is trying to get the big picture. For the technical reader, however, this is a brilliant approach, as everything is neatly laid out as do this, then do that.

The book thus aims to be more of a tutorial, and it has many nice examples too. I wish, however, that it had a few more examples from industry, which would have added more juice. I therefore hope for a companion site that has all the R code and datasets for testing and trying out the business analytics examples.

One wishes the author had written more about the biglm and ff packages, or even the RevoScaleR package. Chapter 5, on data analytics, should have been more elaborate. This could be done with more references: the section on visualizing data is just 2 pages and ignores packages like googleVis or even bigvis. The section on MongoDB and other data types is very useful, but again it is much more technical and much less analytical. For example, when does one typically encounter MongoDB versus other data types, and what are the drawbacks?

This is thus a very practical handbook for the tech-minded, and the ebook is quite affordable (the Indian version is just $3.50).

I recommend this book highly for people who are aiming to implement Big Data analytics in practice. It is not for statisticians or business users, but for people who actually want to set up the whole thing.

Please take a look at http://www.packtpub.com/big-data-analytics-with-r-and-hadoop/book and try it out for less than the price of a (Starbucks!) latte or a movie DVD.

 

R in the cloud – Revolution takes to AWS

Finally, the people at Revolution Analytics have made their software available on AWS. It is an interesting development, and it remains to be seen how other providers of stats software will follow.

http://blog.revolutionanalytics.com/2014/02/revolution-r-enterprise-in-the-amazon-cloud.html

Users now have the opportunity to perform statistical analysis and advanced analytics on data sets they might have stored in Amazon’s cloud-based object store Simple Storage Service (S3) or access data from Amazon’s Relational Database Service (RDS).

The cloud offers many benefits to the user, and the AWS Marketplace is no exception. The ability to spin up pre-installed versions of RRE 7 takes all the guesswork out of deployment and provides for a consistent and reliable experience with the software.  Within minutes a user can gain access to R-based analysis from anywhere he or she has an Internet connection.

The Windows version is accessed via Windows Remote Desktop and leverages RRE DevelopR IDE. The Linux version is browser-based and leverages RStudio Server Pro to provide a multi-user IDE. Both versions are available on instances from 2 – 32 vCPUs and can handle data sets of up to 1 TB for RRE ScaleR analysis. The solution is single-instance only and does not currently offer support for grids or clusters.

 

http://www.revolutionanalytics.com/revolution-r-enterprise-aws-marketplace

Technical Details

  • General, Compute, Memory and Storage instances available, 2-32 vCPUs
  • Instances with attached storage recommended. Long-term storage requires EBS or backup to S3
  • Single-server instances only (no cluster or grid support).
  • Revolution R Enterprise DeployR not included.
  • Tech support forums monitored from Sunday, 5:00 PM PDT to Friday, 5:00 PM PDT. Tech support provided in English to registered users only.

Windows Instances

Platform: Windows Server 2008 R2
Revolution R Enterprise version: 7.0.0 (includes R 3.0.0)
Client Requirements: Windows Remote Desktop to access Revolution R Enterprise DevelopR IDE

Linux Instances

Platform: Red Hat Enterprise Linux 6.4
Revolution R Enterprise version: 7.0.0 (includes R 3.0.0)
Client Requirements: Compatible browser

https://aws.amazon.com/marketplace/pp/B00GHXJZVY/ref=_ptnr_ISV_aws_web

Try one instance of this product for 14 days. There will be no software charges but AWS infrastructure charges still apply. Free Trials will automatically convert to a paid subscription upon expiration.
Hourly Fees (includes Windows 2008 R2 X64)
Total hourly fees will vary by instance type and EC2 region.

EC2 Instance Type                    Software     EC2          Total
Standard Large (m1.large)            $2.50/hr     $0.364/hr    $2.864/hr
Standard XL (m1.xlarge)              $5.00/hr     $0.728/hr    $5.728/hr
High-Memory 2XL (m2.2xlarge)         $5.00/hr     $1.02/hr     $6.02/hr
High-Memory 4XL (m2.4xlarge)         $10.00/hr    $2.04/hr     $12.04/hr
High-CPU XL (c1.xlarge)              $5.00/hr     $0.90/hr     $5.90/hr
High I/O 4XL (hi1.4xlarge)           $20.00/hr    $3.58/hr     $23.58/hr
Cluster Compute 8XL (cc2.8xlarge)    $20.00/hr    $2.97/hr     $22.97/hr

EBS Storage Fees
$0.05 / GB / Month for Standard EBS Storage
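
To get a feel for what these prices mean for an actual analysis session, here is a rough Python sketch of the arithmetic. The prices are copied from the listing above; the function names and the `free_trial` and `ebs_gb` parameters are my own, made up for illustration, and the sketch assumes that the free trial simply zeroes the software fee while the EC2 and EBS charges still apply.

```python
from decimal import Decimal

# (software fee, EC2 fee) in USD per hour, copied from the price list above.
HOURLY_FEES = {
    "m1.large": ("2.50", "0.364"),
    "m1.xlarge": ("5.00", "0.728"),
    "m2.2xlarge": ("5.00", "1.02"),
    "m2.4xlarge": ("10.00", "2.04"),
    "c1.xlarge": ("5.00", "0.90"),
    "hi1.4xlarge": ("20.00", "3.58"),
    "cc2.8xlarge": ("20.00", "2.97"),
}

# Standard EBS storage, USD per GB per month (from the listing above).
EBS_PER_GB_MONTH = Decimal("0.05")


def total_hourly_fee(instance_type, free_trial=False):
    """Software fee plus EC2 fee for one hour (hypothetical helper)."""
    software, ec2 = (Decimal(fee) for fee in HOURLY_FEES[instance_type])
    if free_trial:
        # Assumption: the trial waives only the software fee.
        software = Decimal("0")
    return software + ec2


def estimate_cost(instance_type, hours, ebs_gb=0, ebs_months=0, free_trial=False):
    """Rough bill: compute hours plus optional EBS storage (hypothetical helper)."""
    compute = total_hourly_fee(instance_type, free_trial) * hours
    storage = EBS_PER_GB_MONTH * ebs_gb * ebs_months
    return compute + storage


if __name__ == "__main__":
    # Ten hours on the smallest instance, with 100 GB of EBS kept for a month.
    print(estimate_cost("m1.large", hours=10, ebs_gb=100, ebs_months=1))
```

Decimal is used instead of float so that the sums match the listed totals exactly. Real bills will differ by EC2 region, as the listing notes.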

 

Big Data Evil Empire

  1. Much more progress has been made in the storage, querying and analysis of huge amounts of personally identifiable information than in encrypting such information.
  2. Big Data has as much dual use for governments and corporations as uranium has for building bombs or power plants.
  3. There is as much lucre and potential revenue in encrypted data streams in the cloud era as there was in anti-virus software in the PC era.
  4. Tracking citizens totally is evil. The total cost of such programs is not justified by the terrorism plots thwarted through Big Data’s cyber spying. At best, I can understand governments spying on citizens of other countries to gain advantages in trade.
  5. The American dominance of cyber spying and Big Data threatens to unravel and undermine America’s credibility as the de facto leader of the Internet. It proves that China’s vision of a walled-off Internet makes sense, and that is a dangerous precedent which could lead to the break-up of the Internet along national boundaries of electronic firewalls.

Writing on APIs for Programmable Web

I have been writing freelance on APIs for Programmable Web. Here is an updated list of the articles; many of these would be of interest to analytics users. Note: some of these are interviews, and they are in bold. Note to regular readers: I keep updating this list, and at each update I bring it to the front page, then allow newer blog posts to push it down!

Scoreoid Aims to Gamify the World Using APIs January 27th, 2014

Plot.ly’s Plot to Visualize More Data January 22nd, 2014

LumenData’s Acquisition of Algorithms.io is a Win-Win January 8th, 2014

Yactraq API Sees Huge Growth in 2013  January 6th, 2014

Scrape.it Describes a Better Way to Extract Data December 20th, 2013

Exclusive Interview: App Store Analytics API December 4th, 2013

APIs Enter 3d Printing Industry November 29th, 2013

PW Interview: José Luis Martinez of Textalytics November 6th, 2013

PW Interview: Simon Chan, PredictionIO November 5th, 2013

PW Interview: Scott Gimpel Founder and CEO FantasyData.com October 23rd, 2013

PW Interview: Brandon Levy, co-founder and CEO of Stitch Labs October 8th, 2013

PW Interview: Jolo Balbin Co-Founder Text Teaser  September 18th, 2013

PW Interview: Bob Bickel CoFounder Redline13 July 29th, 2013

PW Interview: Brandon Wirtz CTO Stremor.com July 4th, 2013

PW Interview: Andy Bartley, CEO Algorithms.io  June 4th, 2013

PW Interview: Francisco J Martin, CEO BigML.com 2013/05/30

PW Interview: Tal Rotbart Founder- CTO, SpringSense 2013/05/28

PW Interview: Jeh Daruwala CEO Yactraq API, Behavioral Targeting for videos 2013/05/13

PW Interview: Michael Schonfeld of Dwolla API on Innovation Meeting the Payment Web  2013/05/02

PW Interview: Stephen Balaban of Lambda Labs on the Face Recognition API 2013/04/29

PW Interview: Amber Feng, Stripe API, The Payment Web 2013/04/24

PW Interview: Greg Lamp and Austin Ogilvie of Yhat on Shipping Predictive Models via API   2013/04/22

Google Mirror API documentation is open for developers   2013/04/18

PW Interview: Ricky Robinett, Ordr.in API, Ordering Food meets API    2013/04/16

PW Interview: Jacob Perkins, Text Processing API, NLP meets API   2013/04/10

Amazon EC2 On Demand Windows Instances -Prices reduced by 20%  2013/04/08

Amazon S3 API Requests prices slashed by half  2013/04/02

PW Interview: Stuart Battersby, Chatterbox API, Machine Learning meets Social 2013/04/02

PW Interview: Karthik Ram, rOpenSci, Wrapping all science API 2013/03/20

Viralheat Human Intent API- To buy or not to buy 2013/03/13

Interview Tammer Kamel CEO and Founder Quandl 2013/03/07

YHatHQ API: Calling Hosted Statistical Models 2013/03/04

Quandl API: A Wikipedia for Numerical Data 2013/02/25

Amazon Redshift API is out of limited preview and available! 2013/02/18

Windows Azure Media Services REST API 2013/02/14

Data Science Toolkit Wraps Many Data Services in One API 2013/02/11

Diving into Codecademy’s API Lessons 2013/01/31

Google APIs finetuning Cloud Storage JSON API 2013/01/29

2012
Ergast API Puts Car Racing Fans in the Driver’s Seat 2012/12/05
Springer APIs- Fostering Innovation via API Contests 2012/11/20
Statistically programming the web – Shiny,HttR and RevoDeploy API 2012/11/19
Google Cloud SQL API- Bigger ,Faster and now Free 2012/11/12
A Look at the Web’s Most Popular API -Google Maps API 2012/10/09
Cloud Storage APIs for the next generation Enterprise 2012/09/26
Last.fm API: Sultan of Musical APIs 2012/09/12
Socrata Data API: Keeping Government Open 2012/08/29
BigML API Gets Bigger 2012/08/22
Bing APIs: the Empire Strikes Back 2012/08/15
Google Cloud SQL: Relational Database on the Cloud 2012/08/13
Google BigQuery API Makes Big Data Analytics Easy 2012/08/05
Your Store in The Cloud -Google Cloud Storage API 2012/08/01
Predict the future with Google Prediction API 2012/07/30
The Romney vs Obama API 2012/07/27

Interview: Linkurious aims to simplify graph databases

Here is an interview with a really interesting startup, Linkurious, and its co-founders Sebastien Heymann (also co-founder of Gephi) and Jean Villedieu. They are hoping to make graph databases easier to use and thus spur on their usage.

Decisionstats (DS)- How did you come to set up your startup?

Linkurious (L) -A lot of businesses are struggling to understand the connections within their data. Who are the persons connected to this financial transaction? What happens to the telecommunication network if this antenna fails? Who is the most influential person in this community? There are a lot of questions that involve a deep understanding of graphs. Most business intelligence and data visualization tools are not adapted for these questions because they have a hard time handling queries about connections and because their interface is not suited for network visualization.
I noticed this because I co-founded a graph visualization software called Gephi a few years ago. It quickly became a reference and the software was downloaded 250k times last year. It really helped people understand the connections in their data in a new way.
In 2013, this success inspired me to found Linkurious. The idea is to provide a solution that’s easy to use to democratize graph visualization.

What does it mean?
We want to help people understand the connections in their data. Linkurious is really easy to use and optimized for the exploration of graphs.
You can install it in minutes. Then, it gives you a search interface through which you can query the data. What’s special about our software is that the result of your search is represented as a graph that you can explore dynamically. Contrary to Gephi or other graph visualization tools, Linkurious only shows you a limited subset of your data and not the whole graph. The goal here is to focus on what the user is looking for and help him find an answer faster.
In order to do that, Linkurious also comes with the ability to filter nodes or color them according to their properties. This way, it’s much faster to understand the data.

DS- How do you support packages from Python, R and other languages like Julia? What is Linkurious based on?

L- Linkurious is largely based on a stack of open-source technologies. We rely on Neo4j, the leading graph database, to store and access the data. Neo4j can handle really large datasets, which means that our users can access the information much faster than with a traditional SQL database. Neo4j also comes with a query language that allows “smart search”, locating nodes and relationships based on rules like “what’s the shortest path between these 2 nodes?” or “who among the close network of this person has been to London and loves sushi”. That’s the kind of thing that Facebook delivers via Graph Search, and it’s exciting to see these technologies applied in the business world.
We also use Node.js, Sigma.js and Elasticsearch.
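
For readers who have not worked with graphs before, the “shortest path between these 2 nodes” question can be sketched in a few lines of plain Python. This is not Neo4j’s query language and not how Linkurious does it; it is a simple breadth-first search over a made-up toy network, with every name in it hypothetical.

```python
from collections import deque

# Toy undirected network (hypothetical data): each person maps to their contacts.
GRAPH = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice", "dave"],
    "dave": ["bob", "carol", "erin"],
    "erin": ["dave"],
    "frank": [],
}


def shortest_path(graph, start, goal):
    """Return one shortest path from start to goal as a list, or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # node -> node we reached it from
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour in previous:
                continue  # already reached by an equal or shorter route
            previous[neighbour] = node
            if neighbour == goal:
                # Walk the back-pointers to rebuild the path.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None  # the two nodes are not connected


if __name__ == "__main__":
    print(shortest_path(GRAPH, "alice", "erin"))
```

Breadth-first search only finds the path with the fewest hops; a graph database would typically also let you weight or filter the relationships along the way.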

DS- Name a few case studies where enterprises have used graphical analysis for great benefit.

L- There really are a lot of use cases for graph visualization, and we are learning about more almost every day. There are well-known applications that are connected to security. For example, graph databases are great for identifying suspicious patterns across a variety of data sources. People using false identities to defraud banks tend to share addresses, phone numbers or names. Without graphs, it’s hard to see how they are connected, and they tend to remain undetected until it’s too late. Graph visualization can be triggered by alert systems. Then, analysts can investigate the data and decide whether the alert should be escalated or not.
In the telecom industry, you can use graphs to map your network, identify weak links and assess the potential impact of a failure (i.e. impact analysis). Graph visualization helps you understand this information and better manage the network.

We also have clients in the logistics, health or consulting industry. Every data oriented industry needs data visualization tools, and graphs offer powerful ways to ask new questions and reveal unforeseen information.
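
The fraud example in this answer, where false identities end up sharing addresses or phone numbers, can be illustrated with a small Python sketch. It links accounts that share any attribute value and then collects the connected groups. The account data and the function name are invented for illustration, and real systems would do this inside a graph database rather than in memory.

```python
from collections import defaultdict

# Hypothetical account records; in practice these would come from several data sources.
ACCOUNTS = {
    "A1": {"phone": "555-0101", "address": "1 Main St"},
    "A2": {"phone": "555-0101", "address": "9 Oak Ave"},
    "A3": {"phone": "555-0199", "address": "9 Oak Ave"},
    "A4": {"phone": "555-0150", "address": "4 Elm Rd"},
}


def find_linked_groups(accounts):
    """Return sorted groups of accounts connected through shared attribute values."""
    # Index accounts by each (field, value) pair they use.
    by_value = defaultdict(set)
    for account_id, attributes in accounts.items():
        for field, value in attributes.items():
            by_value[(field, value)].add(account_id)

    # Two accounts are neighbours if they share at least one value.
    neighbours = defaultdict(set)
    for ids in by_value.values():
        for account_id in ids:
            neighbours[account_id] |= ids - {account_id}

    # Collect connected components with a depth-first walk.
    seen = set()
    groups = []
    for account_id in accounts:
        if account_id in seen:
            continue
        component = set()
        stack = [account_id]
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbours[current] - component)
        seen |= component
        if len(component) > 1:  # isolated accounts are not interesting here
            groups.append(sorted(component))
    return groups


if __name__ == "__main__":
    print(find_linked_groups(ACCOUNTS))
```

Note that A1 and A3 share nothing directly; they only end up in the same group through A2, which is exactly the kind of indirect connection that is hard to spot in a table.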

DS- What are some of the challenges with creating, sustaining and maintaining a cutting-edge technology startup in Europe and France?

L- There are a lot of challenges with creating and sustaining a startup. I think the bigger ones are not necessarily location-related. The main issue is to build something people want. It’s certainly been our biggest challenge. We’ve used a lean startup approach to ship a prototype of our product as fast as we could. The first version of Linkurious was buggy and didn’t get much interest from customers. But we did get feedback from a few people who really liked it. Since then, we’ve been focusing on them to develop our vision of Linkurious. We are pleased with the results. I think we are on the right path, but it’s really a journey.
As for the more location-related challenges, I think France usually gets a bad rep for not being start-up friendly. Our experience has been quite the contrary. There are administrative annoyances, but we also enjoy generous benefits, access to great engineers and a burgeoning startup ecosystem!

About

The mission of Linkurio.us is to help users access and navigate graph databases in a simple manner so they can make sense of their data.

Some of their interesting solutions are here.