Talking on Big Data Analytics

I am going  being sponsored to a Government of India sponsored talk on Big Data Analytics at Bangalore on Friday the 13 th of July. If you are in Bangalore, India you may drop in for a dekko. Schedule and Abstracts (i am on page 7 out 9) .

Your tax payer money is hard at work- (hassi majak only if you are a desi. hassi to fassi.)

13 July 2012 (9.30 – 11.00 & 11.30 – 1.00)
Big Data Big Analytics
The talk will showcase using open source technologies in statistical computing for big data, namely the R programming language and its use cases in big data analysis. It will review case studies using the Amazon Cloud, custom packages in R for Big Data, tools like Revolution Analytics RevoScaleR package, as well as the newly launched SAP Hana used with R. We will also review Oracle R Enterprise. In addition we will show some case studies using BigML.com (using Clojure) , and approaches using PiCloud. In addition it will showcase some of Google APIs for Big Data Analysis.

Lastly we will talk on social media analysis ,national security use cases (i.e. cyber war) and privacy hazards of big data analytics.

Schedule

View more presentations from Ajay Ohri.
Abstracts

View more documents from Ajay Ohri.

 

Cyber Cold War

I try to write on cyber conflict without getting into the politics of why someone is hacking someone else. I always get beaten by someone in the comments thread when I write on politics.

But recent events have forced me to update my usual “how-to” cyber conflict to “why” cyber conflict. This is because of a terrorist attack in my hometown Delhi.

(updated-

http://www.nytimes.com/2012/02/14/world/middleeast/israeli-embassy-officials-attacked-in-india-and-georgia.html?_r=1&hp

Iran allegedly tried  (as per Israel) to assassinate the wife of Israeli Defence Attache in Delhi using a magnetic bomb, India as she went to school to pick up her kids, somebody else put a grenade in Israeli embassy car in Georgia which was found in time. 

Based on reports , initial work suggests the bomb was much more sophisticated than local terrorists, but the terrorists seemed to have some local recce work done.

India has 0 history of antisemitism but this is the second time Israelis have been targeted since 26/11 Mumbai attacks. India buys 12 % of oil annually from Iran (and refuses to join the oil embargo called by US and Europe)

Cyber Conflict is less painful than conflict, which is inevitable as long as mankind exists. Also the Western hemisphere needs a moon shot (cyber conflict could be the Sputnik like moment) and with declining and aging populations but better technology, Western Hemisphere govts need cyber conflict as they are running out of humans to fight their wars. Eastern govt. are even more obnoxious in using children for conflict propaganda, and corruption.

Last week CIA.gov website went down

This week Iranian govt is allegedly blocking https traffic on eve of Annual Revolution Day (what a coincidence!)

 

Some resources to help Internet users in Iran (or maybe this could be a dummy test for the big one – hacking the great firewall of China)

News from Hacker News-

http://news.ycombinator.com/item?id=3575029

 

I’m writing this to report the serious troubles we have regarding accessing Internet in Iran at the moment. Since Thursday Iranian government has shutted down the https protocol which has caused almost all google services (gmail, and google.com itself) to become inaccessible. Almost all websites that reply on Google APIs (like wolfram alpha) won’t work. Accessing to any website that replies on https (just imaging how many websites use this protocol, from Arch Wiki to bank websites). Also accessing many proxies is also impossible. There are almost no official reports on this and with many websites and my email accounts restricted I can just confirm this based on my own and friends experience. I have just found one report here:

Iran Shut Down Gmail , Google , Yahoo and sites using “Https” Protocol

The reason for this horrible shutdown is that the Iranian regime celebrates 1979 Islamic revolution tomorrow.

I just wanted to let you guys know about this. If you have any solution regarding bypassing this restriction please help!

 

The boys at Tor think they can help-

but its not so elegant, as I prefer creating a  batch file rather than explain coding to newbies. 

this is still getting to better and easier interfaces

https://www.torproject.org/projects/obfsproxy-instructions.html.en

Obfsproxy Instructions

client torrc

Step 1: Install dependencies, obfsproxy, and Tor

 

You will need a C compiler (gcc), the autoconf and autotools build system, the git revision control system, pkg-config andlibtoollibevent-2 and its headers, and the development headers of OpenSSL.

On Debian testing or Ubuntu oneiric, you could do:
# apt-get install autoconf autotools-dev gcc git pkg-config libtool libevent-2.0-5 libevent-dev libevent-openssl-2.0-5 libssl-dev

If you’re on a more stable Linux, you can either try our experimental backport libevent2 debs or build libevent2 from source.

Clone obfsproxy from its git repository:
$ git clone https://git.torproject.org/obfsproxy.git
The above command should create and populate a directory named ‘obfsproxy’ in your current directory.

Compile obfsproxy:
$ cd obfsproxy
$ ./autogen.sh && ./configure && make

Optionally, as root install obfsproxy in your system:
# make install

If you prefer not to install obfsproxy as root, you can instead just modify the Transport lines in your torrc file (explained below) to point to your obfsproxy binary.

You will need Tor 0.2.3.11-alpha or later.


Step 2a: If you’re the client…

 

First, you need to learn the address of a bridge that supports obfsproxy. If you don’t know any, try asking a friend to set one up for you. Then the appropriate lines to your tor configuration file:

UseBridges 1
Bridge obfs2 128.31.0.34:1051
ClientTransportPlugin obfs2 exec /usr/local/bin/obfsproxy --managed

Don’t forget to replace 128.31.0.34:1051 with the IP address and port that the bridge’s obfsproxy is listening on.
 Congratulations! Your traffic should now be obfuscated by obfsproxy. You are done! You can now start using Tor.

For old fashioned tunnel creation under Seas of English Channel-

http://dag.wieers.com/howto/ssh-http-tunneling/

Tunneling SSH over HTTP(S)
This document explains how to set up an Apache server and SSH client to allow tunneling SSH over HTTP(S). This can be useful on restricted networks that either firewall everything except HTTP traffic (tcp/80,tcp/443) or require users to use a local (HTTP) proxy.
A lot of people asked why doing it like this if you can just make sshd listen on port 443. Well, that might work if your environment is not hardened like I have seen at several companies, but this setup has a few advantages.

  • You can proxy to anywhere (see the Proxy directive in Apache) based on names
  • You can proxy to any port you like (see the AllowCONNECT directive in Apache)
  • It works even when there is a layer-7 protocol firewall
  • If you enable proxytunnel ssl support, it is indistinguishable from real SSL traffic
  • You can come up with nice hostnames like ‘downloads.yourdomain.com’ and ‘pictures.yourdomain.com’ and for normal users these will look like normal websites when visited.
  • There are many possibilities for doing authentication further along the path
  • You can do proxy-bouncing to the n-th degree to mask where you’re coming from or going to (however this requires more changes to proxytunnel, currently I only added support for one remote proxy)
  • You do not have to dedicate an IP-address for sshd, you can still run an HTTPS site

Related-

http://opensourceandhackystuff.blogspot.in/2012/02/captive-portal-security-part-1.html

and some crypto for young people

http://users.telenet.be/d.rijmenants/en/onetimepad.htm

 

Me- What am I doing about it? I am just writing poems on hacking at http://poemsforkush.com

2011 Analytics Recap

Events in the field of data that impacted us in 2011

1) Oracle unveiled plans for R Enterprise. This is one of the strongest statements of its focus on in-database analytics. Oracle also unveiled plans for a Public Cloud

2) SAS Institute released version 9.3 , a major analytics software in industry use.

3) IBM acquired many companies in analytics and high tech. Again.However the expected benefits from Cognos-SPSS integration are yet to show a spectacular change in market share.

2011 Selected acquisitions

Emptoris Inc. December 2011

Cúram Software Ltd. December 2011

DemandTec December 2011

Platform Computing October 2011

 Q1 Labs October 2011

Algorithmics September 2011

 i2 August 2011

Tririga March 2011

 

4) SAP promised a lot with SAP HANA- again no major oohs and ahs in terms of market share fluctuations within analytics.

http://www.sap.com/india/news-reader/index.epx?articleID=17619

5) Amazon continued to lower prices of cloud computing and offer more options.

http://aws.amazon.com/about-aws/whats-new/2011/12/21/amazon-elastic-mapreduce-announces-support-for-cc2-8xlarge-instances/

6) Google continues to dilly -dally with its analytics and cloud based APIs. I do not expect all the APIs in the Google APIs suit to survive and be viable in the enterprise software space.  This includes Google Cloud Storage, Cloud SQL, Prediction API at https://code.google.com/apis/console/b/0/ Some of the location based , translation based APIs may have interesting spin offs that may be very very commercially lucrative.

7) Microsoft -did- hmm- I forgot. Except for its investment in Revolution Analytics round 1 many seasons ago- very little excitement has come from MS plans in data mining- The plugins for cloud based data mining from Excel remain promising yet , while Azure remains a stealth mode starter.

8) Revolution Analytics promised us a GUI and didnt deliver (till yet 🙂 ) . But it did reveal a much better Enterprise software Revolution R 5.0 is one of the strongest enterprise software in the R /Stat Computing space and R’s memory handling problem is now an issue of perception than actual stuff thanks to newer advances in how it is used.

9) More conferences, more books and more news on analytics startups in 2011. Big Data analytics remained a strong buzzword. Expect more from this space including creative uses of Hadoop based infrastructure.

10) Data privacy issues continue to hamper and impede effective analytics usage. So does rational and balanced regulation in some of the most advanced economies. We expect more regulation and better guidelines in 2012.

Interview Zach Goldberg, Google Prediction API

Here is an interview with Zach Goldberg, who is the product manager of Google Prediction API, the next generation machine learning analytics-as-an-api service state of the art cloud computing model building browser app.
Ajay- Describe your journey in science and technology from high school to your current job at Google.

Zach- First, thanks so much for the opportunity to do this interview Ajay!  My personal journey started in college where I worked at a startup named Invite Media.   From there I transferred to the Associate Product Manager (APM) program at Google.  The APM program is a two year rotational program.  I did my first year working in display advertising.  After that I rotated to work on the Prediction API.

Ajay- How does the Google Prediction API help an average business analytics customer who is already using enterprise software , servers to generate his business forecasts. How does Google Prediction API fit in or complement other APIs in the Google API suite.

Zach- The Google Prediction API is a cloud based machine learning API.  We offer the ability for anybody to sign up and within a few minutes have their data uploaded to the cloud, a model built and an API to make predictions from anywhere. Traditionally the task of implementing predictive analytics inside an application required a fair amount of domain knowledge; you had to know a fair bit about machine learning to make it work.  With the Google Prediction API you only need to know how to use an online REST API to get started.

You can learn more about how we help businesses by watching our video and going to our project website.

Ajay-  What are the additional use cases of Google Prediction API that you think traditional enterprise software in business analytics ignore, or are not so strong on.  What use cases would you suggest NOT using Google Prediction API for an enterprise.

Zach- We are living in a world that is changing rapidly thanks to technology.  Storing, accessing, and managing information is much easier and more affordable than it was even a few years ago.  That creates exciting opportunities for companies, and we hope the Prediction API will help them derive value from their data.

The Prediction API focuses on providing predictive solutions to two types of problems: regression and classification. Businesses facing problems where there is sufficient data to describe an underlying pattern in either of these two areas can expect to derive value from using the Prediction API.

Ajay- What are your separate incentives to teach about Google APIs  to academic or researchers in universities globally.

Zach- I’d refer you to our university relations page

Google thrives on academic curiosity. While we do significant in-house research and engineering, we also maintain strong relations with leading academic institutions world-wide pursuing research in areas of common interest. As part of our mission to build the most advanced and usable methods for information access, we support university research, technological innovation and the teaching and learning experience through a variety of programs.

Ajay- What is the biggest challenge you face while communicating about Google Prediction API to traditional users of enterprise software.

Zach- Businesses often expect that implementing predictive analytics is going to be very expensive and require a lot of resources.  Many have already begun investing heavily in this area.  Quite often we’re faced with surprise, and even skepticism, when they see the simplicity of the Google Prediction API.  We work really hard to provide a very powerful solution and take care of the complexity of building high quality models behind the scenes so businesses can focus more on building their business and less on machine learning.

 

 

Google Cloud SQL

Another xing bang API from the boyz in Mountain View. (entry by invite only) But it is free and you can test your stuff on a MySQL db =10 GB

Database as a service ? (Maybe)— while Amazon was building fires (and Fire)

—————————————————————–

https://code.google.com/apis/sql/index.html

What is Google Cloud SQL?

Google Cloud SQL is a web service that provides a highly available, fully-managed, hosted SQL storage solution for your App Engine applications.

What are the benefits of using Google Cloud SQL?

You can access a familiar, highly available SQL database from your App Engine applications, without having to worry about provisioning, management, and integration with other Google services.

How much does Google Cloud SQL cost?

We will not be billing for this service in 2011. We will give you at least 30 days’ advance notice before we begin billing in the future. Other services such as Google App Engine, Google Cloud Storage etc. that you use with Google Cloud SQL may have their own payment terms, and you need to pay for them. Please consult their documentation for details.

Currently you are limited to the three instance sizes. What if I need to store more data or need better performance?

In the Limited Preview period, we only have three sizes available. If you have specific needs, we would like to hear from you on our google-cloud-sqldiscussion board.

When is Google Cloud SQL be out of Limited Preview?

We are working hard to make the service generally available.We don’t have a firm date that we can announce right now.

Do you support all the features of MySQL?

In general, Google Cloud SQL supports all the features of MySQL. The following are lists of all the unsupported features and notable differences that Google Cloud SQL has from MySQL.

Unsupported Features:

  • User defined functions
  • MySql replication

Unsupported MySQL statements:

  • LOAD DATA INFILE
  • SELECT ... INTO OUTFILE
  • SELECT ... INTO DUMPFILE
  • INSTALL PLUGIN .. SONAME ...
  • UNINSTALL PLUGIN
  • CREATE FUNCTION ... SONAME ...

Unsupported SQL Functions:

  • LOAD_FILE()

Notable Differences:

  • If you want to import databases with binary data into your Google Cloud SQL instance, you must use the --hex-blob option with mysqldump.Although this is not a required flag when you are using a local MySQL server instance and the MySQL command line, it is required if you want to import any databases with binary data into your Google Cloud SQL instance. For more information, see Importing Data.
How large a database can I use with Google Cloud SQL?
Currently, in this limited preview period, your database instance must be no larger than 10GB.
How can I be notified when there are any changes to Google Cloud SQL?
You can sign up for the sql-announcements forum where we post announcements and news about the Google Cloud SQL.
How can I cancel my Google Cloud SQL account?
To remove all data from your Google Cloud SQL account and disable the service:

  1. Delete all your data. You can remove your tables, databases, and indexes using the drop command. For more information, see SQL DROP statement.
  2. Deactivate the Google Cloud SQL by visiting the Services pane and clicking the On button next to Google Cloud SQL. The button changes from Onto Off.
How do I report a bug, request a feature, or ask a question?
You can report bugs and request a feature on our project page.You can ask a question in our discussion forum.

Getting Started

Can I use languages other than Java or Python?
Only Java and Python are supported for Google Cloud SQL.
Can I use Google Cloud SQL outside of Google App Engine?
The Limited Preview is primarily focused on giving Google App Engine customers the ability to use a familiar relational database environment. Currently, you cannot access Google Cloud SQL from outside Google App Engine.
What database engine are we using in the Google Cloud SQL?
MySql Version 5.1.59
Do I need to install a local version of MySQL to use the Development Server?
Yes.

Managing Your Instances

Do I need to use the Google APIs Console to use Google Cloud SQL?
Yes. For basic tasks like granting access control to applications, creating instances, and deleting instances, you need to use the Google APIs Console.
Can I import or export specific databases?
No, currently it is not possible to export specific databases. You can only export your entire instance.
Do I need a Google Cloud Storage account to import or export my instances?
Yes, you need to sign up for a Google Cloud Storage account or have access to a Google Cloud Storage account to import or export your instances. For more information, see Importing and Exporting Data.
If I delete my instance, can I reuse the instance name?
Yes, but not right away. The instance name is reserved for up to two months before it can be reused.

Tools & Resources

Can I use Django with Google Cloud SQL?
No, currently Google Cloud SQL is not compatible with Django.
What is the best tool to use for interacting with my instance?
There are a variety of tools available for Google Cloud SQL. For executing simple statements, you can use the SQL prompt. For executing more complicated tasks, you might want to use the command line tool. If you want to use a tool with a graphical interface, the SQuirrel SQL Client provides an interface you can use to interact with your instance.

Common Technical Questions

Should I use InnoDB for my tables?
Yes. InnoDB is the default storage engine in MySQL 5.5 and is also the recommended storage engine for Google Cloud SQL. If you do not need any features that require MyISAM, you should use InnoDB. You can convert your existing tables using the following SQL command, replacing tablename with the name of the table to convert:

ALTER tablename ENGINE = InnoDB;

If you have a mysqldump file where all your tables are in MyISAM format, you can convert them by piping the file through a sed script:

mysqldump --databases database_name [-u username -p  password] --hex-blob database_name | sed 's/ENGINE=MyISAM/ENGINE=InnoDB/g' > database_file.sql

Warning: You should not do this if your mysqldump file contains the mysql schema. Those files must remain in MyISAM.

Are there any size or QPS limits?
Yes, the following limits apply to Google Cloud SQL:

Resource Limits from External Requests Limits from Google App Engine
Queries Per Second (QPS) 5 QPS No limit
Maximum Request Size 16 MB
Maximum Response Size 16 MB

Google App Engine Limits

Google App Engine applications are also subject to additional Google App Engine quotas and limits. Requests from Google App Engine applications to Google Cloud SQL are subject to the following time limits:

  • All database requests must finish within the HTTP request timer, around 60 seconds.
  • Offline requests like cron tasks have a time limit of 10 minutes.
  • Backend requests to Google Cloud SQL have a time limit of 10 minutes.

App Engine-specific quotas and access limits are discussed on the Google App Engine Quotas page.

Should I use Google Cloud SQL with my non-High Replication App Engine application?
We recommend that you use Google Cloud SQL with High Replication App Engine applications. While you can use use Google Cloud SQL with applications that do not use high replication, doing so might impact performance.
Source-
https://code.google.com/apis/sql/faq.html#supportmysqlfeatures