Data Frame in Python

Exploring some Python Packages and R packages to move /work with both Python and R without melting your brain or exceeding your project deadline

—————————————

If you liked the data.frame structure in R, you have some way to work with them at a faster processing speed in Python.

Here are three packages that enable you to do so-

(1) pydataframe http://code.google.com/p/pydataframe/

An implemention of an almost R like DataFrame object. (install via Pypi/Pip: “pip install pydataframe”)

Usage:

        u = DataFrame( { "Field1": [1, 2, 3],
                        "Field2": ['abc', 'def', 'hgi']},
                        optional:
                         ['Field1', 'Field2']
                         ["rowOne", "rowTwo", "thirdRow"])

A DataFrame is basically a table with rows and columns.

Columns are named, rows are numbered (but can be named) and can be easily selected and calculated upon. Internally, columns are stored as 1d numpy arrays. If you set row names, they’re converted into a dictionary for fast access. There is a rich subselection/slicing API, see help(DataFrame.get_item) (it also works for setting values). Please note that any slice get’s you another DataFrame, to access individual entries use get_row(), get_column(), get_value().

DataFrames also understand basic arithmetic and you can either add (multiply,…) a constant value, or another DataFrame of the same size / with the same column names, like this:

#multiply every value in ColumnA that is smaller than 5 by 6.
my_df[my_df[:,'ColumnA'] < 5, 'ColumnA'] *= 6

#you always need to specify both row and column selectors, use : to mean everything
my_df[:, 'ColumnB'] = my_df[:,'ColumnA'] + my_df[:, 'ColumnC']

#let's take every row that starts with Shu in ColumnA and replace it with a new list (comprehension)
select = my_df.where(lambda row: row['ColumnA'].startswith('Shu'))
my_df[select, 'ColumnA'] = [row['ColumnA'].replace('Shu', 'Sha') for row in my_df[select,:].iter_rows()]

Dataframes talk directly to R via rpy2 (rpy2 is not a prerequiste for the library!)

 

(2) pandas http://pandas.pydata.org/

Library Highlights

  • A fast and efficient DataFrame object for data manipulation with integrated indexing;
  • Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
  • Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
  • Flexible reshaping and pivoting of data sets;
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
  • Columns can be inserted and deleted from data structures for size mutability;
  • Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
  • High performance merging and joining of data sets;
  • Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
  • Time series-functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;
  • The library has been ruthlessly optimized for performance, with critical code paths compiled to C;
  • Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.

Why not R?

First of all, we love open source R! It is the most widely-used open source environment for statistical modeling and graphics, and it provided some early inspiration for pandas features. R users will be pleased to find this library adopts some of the best concepts of R, like the foundational DataFrame (one user familiar with R has described pandas as “R data.frame on steroids”). But pandas also seeks to solve some frustrations common to R users:

  • R has barebones data alignment and indexing functionality, leaving much work to the user. pandas makes it easy and intuitive to work with messy, irregularly indexed data, like time series data. pandas also provides rich tools, like hierarchical indexing, not found in R;
  • R is not well-suited to general purpose programming and system development. pandas enables you to do large-scale data processing seamlessly when developing your production applications;
  • Hybrid systems connecting R to a low-productivity systems language like Java, C++, or C# suffer from significantly reduced agility and maintainability, and you’re still stuck developing the system components in a low-productivity language;
  • The “copyleft” GPL license of R can create concerns for commercial software vendors who want to distribute R with their software under another license. Python and pandas use more permissive licenses.

(3) datamatrix http://pypi.python.org/pypi/datamatrix/0.8

datamatrix 0.8

A Pythonic implementation of R’s data.frame structure.

Latest Version: 0.9

This module allows access to comma- or other delimiter separated files as if they were tables, using a dictionary-like syntax. DataMatrix objects can be manipulated, rows and columns added and removed, or even transposed

—————————————————————–

Modeling in Python

Continue reading “Data Frame in Python”

Cyber Cold War

I try to write on cyber conflict without getting into the politics of why someone is hacking someone else. I always get beaten by someone in the comments thread when I write on politics.

But recent events have forced me to update my usual “how-to” cyber conflict to “why” cyber conflict. This is because of a terrorist attack in my hometown Delhi.

(updated-

http://www.nytimes.com/2012/02/14/world/middleeast/israeli-embassy-officials-attacked-in-india-and-georgia.html?_r=1&hp

Iran allegedly tried  (as per Israel) to assassinate the wife of Israeli Defence Attache in Delhi using a magnetic bomb, India as she went to school to pick up her kids, somebody else put a grenade in Israeli embassy car in Georgia which was found in time. 

Based on reports , initial work suggests the bomb was much more sophisticated than local terrorists, but the terrorists seemed to have some local recce work done.

India has 0 history of antisemitism but this is the second time Israelis have been targeted since 26/11 Mumbai attacks. India buys 12 % of oil annually from Iran (and refuses to join the oil embargo called by US and Europe)

Cyber Conflict is less painful than conflict, which is inevitable as long as mankind exists. Also the Western hemisphere needs a moon shot (cyber conflict could be the Sputnik like moment) and with declining and aging populations but better technology, Western Hemisphere govts need cyber conflict as they are running out of humans to fight their wars. Eastern govt. are even more obnoxious in using children for conflict propaganda, and corruption.

Last week CIA.gov website went down

This week Iranian govt is allegedly blocking https traffic on eve of Annual Revolution Day (what a coincidence!)

 

Some resources to help Internet users in Iran (or maybe this could be a dummy test for the big one – hacking the great firewall of China)

News from Hacker News-

http://news.ycombinator.com/item?id=3575029

 

I’m writing this to report the serious troubles we have regarding accessing Internet in Iran at the moment. Since Thursday Iranian government has shutted down the https protocol which has caused almost all google services (gmail, and google.com itself) to become inaccessible. Almost all websites that reply on Google APIs (like wolfram alpha) won’t work. Accessing to any website that replies on https (just imaging how many websites use this protocol, from Arch Wiki to bank websites). Also accessing many proxies is also impossible. There are almost no official reports on this and with many websites and my email accounts restricted I can just confirm this based on my own and friends experience. I have just found one report here:

Iran Shut Down Gmail , Google , Yahoo and sites using “Https” Protocol

The reason for this horrible shutdown is that the Iranian regime celebrates 1979 Islamic revolution tomorrow.

I just wanted to let you guys know about this. If you have any solution regarding bypassing this restriction please help!

 

The boys at Tor think they can help-

but its not so elegant, as I prefer creating a  batch file rather than explain coding to newbies. 

this is still getting to better and easier interfaces

https://www.torproject.org/projects/obfsproxy-instructions.html.en

Obfsproxy Instructions

client torrc

Step 1: Install dependencies, obfsproxy, and Tor

 

You will need a C compiler (gcc), the autoconf and autotools build system, the git revision control system, pkg-config andlibtoollibevent-2 and its headers, and the development headers of OpenSSL.

On Debian testing or Ubuntu oneiric, you could do:
# apt-get install autoconf autotools-dev gcc git pkg-config libtool libevent-2.0-5 libevent-dev libevent-openssl-2.0-5 libssl-dev

If you’re on a more stable Linux, you can either try our experimental backport libevent2 debs or build libevent2 from source.

Clone obfsproxy from its git repository:
$ git clone https://git.torproject.org/obfsproxy.git
The above command should create and populate a directory named ‘obfsproxy’ in your current directory.

Compile obfsproxy:
$ cd obfsproxy
$ ./autogen.sh && ./configure && make

Optionally, as root install obfsproxy in your system:
# make install

If you prefer not to install obfsproxy as root, you can instead just modify the Transport lines in your torrc file (explained below) to point to your obfsproxy binary.

You will need Tor 0.2.3.11-alpha or later.


Step 2a: If you’re the client…

 

First, you need to learn the address of a bridge that supports obfsproxy. If you don’t know any, try asking a friend to set one up for you. Then the appropriate lines to your tor configuration file:

UseBridges 1
Bridge obfs2 128.31.0.34:1051
ClientTransportPlugin obfs2 exec /usr/local/bin/obfsproxy --managed

Don’t forget to replace 128.31.0.34:1051 with the IP address and port that the bridge’s obfsproxy is listening on.
 Congratulations! Your traffic should now be obfuscated by obfsproxy. You are done! You can now start using Tor.

For old fashioned tunnel creation under Seas of English Channel-

http://dag.wieers.com/howto/ssh-http-tunneling/

Tunneling SSH over HTTP(S)
This document explains how to set up an Apache server and SSH client to allow tunneling SSH over HTTP(S). This can be useful on restricted networks that either firewall everything except HTTP traffic (tcp/80,tcp/443) or require users to use a local (HTTP) proxy.
A lot of people asked why doing it like this if you can just make sshd listen on port 443. Well, that might work if your environment is not hardened like I have seen at several companies, but this setup has a few advantages.

  • You can proxy to anywhere (see the Proxy directive in Apache) based on names
  • You can proxy to any port you like (see the AllowCONNECT directive in Apache)
  • It works even when there is a layer-7 protocol firewall
  • If you enable proxytunnel ssl support, it is indistinguishable from real SSL traffic
  • You can come up with nice hostnames like ‘downloads.yourdomain.com’ and ‘pictures.yourdomain.com’ and for normal users these will look like normal websites when visited.
  • There are many possibilities for doing authentication further along the path
  • You can do proxy-bouncing to the n-th degree to mask where you’re coming from or going to (however this requires more changes to proxytunnel, currently I only added support for one remote proxy)
  • You do not have to dedicate an IP-address for sshd, you can still run an HTTPS site

Related-

http://opensourceandhackystuff.blogspot.in/2012/02/captive-portal-security-part-1.html

and some crypto for young people

http://users.telenet.be/d.rijmenants/en/onetimepad.htm

 

Me- What am I doing about it? I am just writing poems on hacking at http://poemsforkush.com

Calling #Rstats lovers and bloggers – to work together on “The R Programming wikibook”

so you think u like R, huh. Well it is time to pay it forward.

Message from a dear R blogger, Tal G from Tel Aviv (creator of R-bloggers.com and SAS-X.com)

———————————————————————————————————-
Calling R lovers and bloggers – to work together on “The R Programming wikibook”
Posted: 20 Jun 2011 07:05 AM PDT

This post is a call for both R community members and R-bloggers, to come and help make The R Programming wikibook be amazing:

Dear R community member – please consider giving a visit to The R Programming wikibook. If you wish to contribute your knowledge and editing skills to the project, then you could learn how to write in wiki-markup here, and how to edit a wikibook here (you can even use R syntax highlighting in the wikibook). You could take information into the site from the (soon to be) growing list of available R resources for harvesting.

Dear R blogger, you can help The R Programming wikibook by doing the following:

Write to your readers about the project and invite them to join.
Add your blog’s R content as an available resource for other editors to use for the wikibook. Here is how to do that:
First, make a clear indication on your blog that your content is licensed under cc-by-sa copyrights (*see what it means at the end of the post). You can do this by adding it to the footer of your blog, or by writing a post that clearly states that this is the case (what a great opportunity to write to your readers about the project…).
Next, go and add a link, to where all of your R content is located on your site, to the resource page (also with a link to the license post, if you wrote one). For example, since I write about other things besides R, I would give a link to my R category page, and will also give a link to this post. If you do not know how to add it to the wiki, just e-mail me about it (tal.galili@gmail.com).
If you are an R blogger, besides living up to the spirit of the R community, you will benefit from joining this project in that every time someone will use your content on the wikibook, they will add your post as a resource. In the long run, this is likely to help visitors of the site get to know about you and strengthen your site’s SEO ranking. Which reminds me, if you write about this, I always appreciate a link back to my blog

* Having a cc-by-sa copyrights means that you will agree that anyone may copy, distribute, display, and make derivative works based on your content, only if they give the author (you) the credits in the manner specified by you. And also that the user may distribute derivative works only under a license identical to the license that governs the original work.

———-

Three more points:

1) This post is a result of being contacted by Paul (a.k.a: PAC2), asking if I could help promote “The R Programming wikibook” among R-bloggers and their readers. Paul has made many contributions to the book so far. So thank you Paul for both reaching out and helping all of us with your work on this free open source project.

2) I should also mention that the R wiki exists and is open for contribution. And naturally, every thing that will help the R wikibook will help the R wiki as well.

3) Copyright notice: I hereby release all of the writing material content that is categoriesed in the R category page, under the cc-by-sa copyrights (date: 20.06.2011). Now it’s your turn!

———-

List of R bloggers who have joined: (This list will get updated as this “group writing” project will progress)

R-statistics blog (that’s Tal…)
Decisionstats.com (That’s me)
……………………………………………………………………………….
3) Copyright notice: I hereby release all of the writing material content of this website, under the cc-by-sa copyrights (date: 21.06.2011). Now it’s your turn!

https://decisionstats.com/privacy-3/

Content Licensing-
This website has all content licensed under
http://creativecommons.org/licenses/by-sa/3.0/
You are free:
to Share — to copy, distribute and transmit the work
to Remix — to adapt the work

What is a White Paper?

Christine and Jimmy Wales
Image via Wikipedia

As per Jimmy Wales and his merry band at Wiki (pedia not leaky-ah)- The emphasis is mine

What is the best white paper you have read in the past 15 years.

Categories are-

  • Business benefits: Makes a business case for a certain technology or methodology.
  • Technical: Describes how a certain technology works.
  • Hybrid: Combines business benefits with technical details in a single document.
  • Policy: Makes a case for a certain political solution to a societal or economic challenge.
——————————————————————————————————————————————————



white paper is an authoritative report or guide that helps solve a problem. White papers are used to educate readers and help people make decisions, and are often requested and used in politics, policy, business, and technical fields. In commercial use, the term has also come to refer to documents used by businesses as a marketing or sales tool. Policy makers frequently request white papers from universities or academic personnel to inform policy developments with expert opinions or relevant research.

Government white papers

In the Commonwealth of Nations, “white paper” is an informal name for a parliamentary paper enunciating government policy; in the United Kingdom these are mostly issued as “Command papers“. White papers are issued by the government and lay out policy, or proposed action, on a topic of current concern. Although a white paper may on occasion be a consultation as to the details of new legislation, it does signify a clear intention on the part of a government to pass new law. White Papers are a “…. tool of participatory democracy … not [an] unalterable policy commitment.[1] “White Papers have tried to perform the dual role of presenting firm government policies while at the same time inviting opinions upon them.” [2]

In Canada, a white paper “is considered to be a policy document, approved by Cabinet, tabled in the House of Commons and made available to the general public.”[3] A Canadian author notes that the “provision of policy information through the use of white and green papers can help to create an awareness of policy issues among parliamentarians and the public and to encourage an exchange of information and analysis. They can also serve as educational techniques”.[4]

“White Papers are used as a means of presenting government policy preferences prior to the introduction of legislation”; as such, the “publication of a White Paper serves to test the climate of public opinion regarding a controversial policy issue and enables the government to gauge its probable impact”.[5]

By contrast, green papers, which are issued much more frequently, are more open ended. These green papers, also known as consultation documents, may merely propose a strategy to be implemented in the details of other legislation or they may set out proposals on which the government wishes to obtain public views and opinion.

White papers published by the European Commission are documents containing proposals for European Union action in a specific area. They sometimes follow a green paper released to launch a public consultation process.

For examples see the following:

 Commercial white papers

Since the early 1990s, the term white paper has also come to refer to documents used by businesses and so-called think tanks as marketing or sales tools. White papers of this sort argue that the benefits of a particular technologyproduct or policy are superior for solving a specific problem.

These types of white papers are almost always marketing communications documents designed to promote a specific company’s or group’s solutions or products. As a marketing tool, these papers will highlight information favorable to the company authorizing or sponsoring the paper. Such white papers are often used to generate sales leads, establish thought leadership, make a business case, or to educate customers or voters.

There are four main types of commercial white papers:

  • Business benefits: Makes a business case for a certain technology or methodology.
  • Technical: Describes how a certain technology works.
  • Hybrid: Combines business benefits with technical details in a single document.
  • Policy: Makes a case for a certain political solution to a societal or economic challenge.

Resources

  • Stelzner, Michael (2007). Writing White Papers: How to capture readers and keep them engaged. Poway, California: WhitePaperSource Publishing. pp. 214. ISBN 9780977716937.
  • Bly, Robert W. (2006). The White Paper Marketing Handbook. Florence, Kentucky: South-Western Educational Publishing. pp. 256. ISBN 9780324300826.
  • Kantor, Jonathan (2009). Crafting White Paper 2.0: Designing Information for Today’s Time and Attention Challenged Business Reader. Denver,Colorado: Lulu Publishing. pp. 167.ISBN 9780557163243.

High Performance Analytics

Marry Big Data Analytics to High Performance Computing, and you get the buzzword of this season- High Performance Analytics.

It basically consists of Parallelized code to run in parallel on custom hardware, in -database analytics for speed, and cloud computing /high performance computing environments. On an operational level, it consists of software (as in analytics) partnering with software (as in databases, Map reduce, Hadoop) plus some hardware (HP or IBM mostly). It is considered a high margin , highly profitable, business with small number of deals compared to say desktop licenses.

As per HPC Wire- which is a great tool/newsletter to keep updated on HPC , SAS Institute has been busy on this front partnering with EMC Greenplum and TeraData (who also acquired  SAS Partner AsterData to gain a much needed foot in the MR/SQL space) Continue reading “High Performance Analytics”

Tale of Two Wiki Appeals

Mirror, Mirror on the Wall-

Whos the Prettiest Wiki of them All

(click on the image you like)


Reputation on Social Networks

Law of Diminishing Marginal Utility
Image via Wikipedia

Classical Economics talks of the value of utlity, diminishing marginal utility if the same things is repeated again and again (like spam in an online community)

StackOverflow has a great way of measuring reputation – and thus allows intangible benefits /awards -similar to wikipedia badges , reddit karma. Utility is also auto generated like @klout  on twitter or lists memberships and other sucessful open source communities online including Ubuntu forums have ways to create ah hierarchies even in class less utopian classes.

Basically it then acts as the motivating game as the mostly boy population try to race on numbers.

 

in Stack Overflow- you can get buddies to upvote you and basically act as a role playing game too.

—–From http://stackoverflow.com/faq#reputation

To gain reputation, post good questions and useful answers. Your peers will vote on your posts, and those votes will cause you to gain (or, in rare cases, lose) reputation:

answer is voted up +10
question is voted up +5
answer is accepted +15 (+2 to acceptor)
post is voted down -2 (-1 to voter)

A maximum of 30 votes can be cast per user per day, and you can earn a maximum of 200 reputation per day (although accepted answers and bounty awards are immune to this limit). Also, please note that votes for any posts marked “community wiki” do not generate reputation.

Amass enough reputation points and Stack Overflow will allow you to go beyond simply asking and answering questions:

15 Vote up
15 Flag offensive
50 Leave comments
100 Edit community wiki posts
125 Vote down (costs 1 rep)
200 Reduced advertising