Data Frame in Python

Exploring some Python Packages and R packages to move /work with both Python and R without melting your brain or exceeding your project deadline

—————————————

If you liked the data.frame structure in R, you have some way to work with them at a faster processing speed in Python.

Here are three packages that enable you to do so-

(1) pydataframe http://code.google.com/p/pydataframe/

An implemention of an almost R like DataFrame object. (install via Pypi/Pip: “pip install pydataframe”)

Usage:

        u = DataFrame( { "Field1": [1, 2, 3],
                        "Field2": ['abc', 'def', 'hgi']},
                        optional:
                         ['Field1', 'Field2']
                         ["rowOne", "rowTwo", "thirdRow"])

A DataFrame is basically a table with rows and columns.

Columns are named, rows are numbered (but can be named) and can be easily selected and calculated upon. Internally, columns are stored as 1d numpy arrays. If you set row names, they’re converted into a dictionary for fast access. There is a rich subselection/slicing API, see help(DataFrame.get_item) (it also works for setting values). Please note that any slice get’s you another DataFrame, to access individual entries use get_row(), get_column(), get_value().

DataFrames also understand basic arithmetic and you can either add (multiply,…) a constant value, or another DataFrame of the same size / with the same column names, like this:

#multiply every value in ColumnA that is smaller than 5 by 6.
my_df[my_df[:,'ColumnA'] < 5, 'ColumnA'] *= 6

#you always need to specify both row and column selectors, use : to mean everything
my_df[:, 'ColumnB'] = my_df[:,'ColumnA'] + my_df[:, 'ColumnC']

#let's take every row that starts with Shu in ColumnA and replace it with a new list (comprehension)
select = my_df.where(lambda row: row['ColumnA'].startswith('Shu'))
my_df[select, 'ColumnA'] = [row['ColumnA'].replace('Shu', 'Sha') for row in my_df[select,:].iter_rows()]

Dataframes talk directly to R via rpy2 (rpy2 is not a prerequiste for the library!)

 

(2) pandas http://pandas.pydata.org/

Library Highlights

  • A fast and efficient DataFrame object for data manipulation with integrated indexing;
  • Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
  • Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
  • Flexible reshaping and pivoting of data sets;
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
  • Columns can be inserted and deleted from data structures for size mutability;
  • Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
  • High performance merging and joining of data sets;
  • Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
  • Time series-functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;
  • The library has been ruthlessly optimized for performance, with critical code paths compiled to C;
  • Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.

Why not R?

First of all, we love open source R! It is the most widely-used open source environment for statistical modeling and graphics, and it provided some early inspiration for pandas features. R users will be pleased to find this library adopts some of the best concepts of R, like the foundational DataFrame (one user familiar with R has described pandas as “R data.frame on steroids”). But pandas also seeks to solve some frustrations common to R users:

  • R has barebones data alignment and indexing functionality, leaving much work to the user. pandas makes it easy and intuitive to work with messy, irregularly indexed data, like time series data. pandas also provides rich tools, like hierarchical indexing, not found in R;
  • R is not well-suited to general purpose programming and system development. pandas enables you to do large-scale data processing seamlessly when developing your production applications;
  • Hybrid systems connecting R to a low-productivity systems language like Java, C++, or C# suffer from significantly reduced agility and maintainability, and you’re still stuck developing the system components in a low-productivity language;
  • The “copyleft” GPL license of R can create concerns for commercial software vendors who want to distribute R with their software under another license. Python and pandas use more permissive licenses.

(3) datamatrix http://pypi.python.org/pypi/datamatrix/0.8

datamatrix 0.8

A Pythonic implementation of R’s data.frame structure.

Latest Version: 0.9

This module allows access to comma- or other delimiter separated files as if they were tables, using a dictionary-like syntax. DataMatrix objects can be manipulated, rows and columns added and removed, or even transposed

—————————————————————–

Modeling in Python

Continue reading “Data Frame in Python”

New Amazon Instance: High I/O for NoSQL

Latest from the Amazon Cloud-

hi1.4xlarge instances come with eight virtual cores that can deliver 35 EC2 Compute Units (ECUs) of CPU performance, 60.5 GiB of RAM, and 2 TiB of storage capacity across two SSD-based storage volumes. Customers using hi1.4xlarge instances for their applications can expect over 120,000 4 KB random write IOPS, and as many as 85,000 random write IOPS (depending on active LBA span). These instances are available on a 10 Gbps network, with the ability to launch instances into cluster placement groups for low-latency, full-bisection bandwidth networking.

High I/O instances are currently available in three Availability Zones in US East (N. Virginia) and two Availability Zones in EU West (Ireland) regions. Other regions will be supported in the coming months. You can launch hi1.4xlarge instances as On Demand instances starting at $3.10/hour, and purchase them as Reserved Instances

http://aws.amazon.com/ec2/instance-types/

High I/O Instances

Instances of this family provide very high instance storage I/O performance and are ideally suited for many high performance database workloads. Example applications include NoSQL databases like Cassandra and MongoDB. High I/O instances are backed by Solid State Drives (SSD), and also provide high levels of CPU, memory and network performance.

High I/O Quadruple Extra Large Instance

60.5 GB of memory
35 EC2 Compute Units (8 virtual cores with 4.4 EC2 Compute Units each)
2 SSD-based volumes each with 1024 GB of instance storage
64-bit platform
I/O Performance: Very High (10 Gigabit Ethernet)
Storage I/O Performance: Very High*
API name: hi1.4xlarge

*Using Linux paravirtual (PV) AMIs, High I/O Quadruple Extra Large instances can deliver more than 120,000 4 KB random read IOPS and between 10,000 and 85,000 4 KB random write IOPS (depending on active logical block addressing span) to applications. For hardware virtual machines (HVM) and Windows AMIs, performance is approximately 90,000 4 KB random read IOPS and between 9,000 and 75,000 4 KB random write IOPS. The maximum sequential throughput on all AMI types (Linux PV, Linux HVM, and Windows) per second is approximately 2 GB read and 1.1 GB write.

Doing RFM Analysis in R


RFM is a method used for analyzing customer behavior and defining market segments. It is commonly used in database marketing and direct marketing and has received particular attention in retail.


RFM stands for


  • Recency – How recently did the customer purchase?
  • Frequency – How often do they purchase?
  • Monetary Value – How much do they spend?

To create an RFM analysis, one creates categories for each attribute. For instance, the Recency attribute might be broken into three categories: customers with purchases within the last 90 days; between 91 and 365 days; and longer than 365 days. Such categories may be arrived at by applying business rules, or using a data mining technique, such as CHAID, to find meaningful breaks.

from-http://en.wikipedia.org/wiki/RFM

If you are new to RFM or need more step by step help, please read here

https://decisionstats.com/2010/10/03/ibm-spss-19-marketing-analytics-and-rfm/

and here is R code- note for direct marketing you need to compute Monetization based on response rates (based on offer date) as well



##Creating Random Sales Data of the format CustomerId (unique to each customer), Sales.Date,Purchase.Value

sales=data.frame(sample(1000:1999,replace=T,size=10000),abs(round(rnorm(10000,28,13))))

names(sales)=c("CustomerId","Sales Value")

sales.dates <- as.Date("2010/1/1") + 700*sort(stats::runif(10000))

#generating random dates

sales=cbind(sales,sales.dates)

str(sales)

sales$recency=round(as.numeric(difftime(Sys.Date(),sales[,3],units="days")) )

library(gregmisc)

##if you have existing sales data you need to just shape it in this format

rename.vars(sales, from="Sales Value", to="Purchase.Value")#Renaming Variable Names

## Creating Total Sales(Monetization),Frequency, Last Purchase date for each customer

salesM=aggregate(sales[,2],list(sales$CustomerId),sum)

names(salesM)=c("CustomerId","Monetization")

salesF=aggregate(sales[,2],list(sales$CustomerId),length)

names(salesF)=c("CustomerId","Frequency")

salesR=aggregate(sales[,4],list(sales$CustomerId),min)

names(salesR)=c("CustomerId","Recency")

##Merging R,F,M

test1=merge(salesF,salesR,"CustomerId")

salesRFM=merge(salesM,test1,"CustomerId")

##Creating R,F,M levels 

salesRFM$rankR=cut(salesRFM$Recency, 5,labels=F) #rankR 1 is very recent while rankR 5 is least recent

salesRFM$rankF=cut(salesRFM$Frequency, 5,labels=F)#rankF 1 is least frequent while rankF 5 is most frequent

salesRFM$rankM=cut(salesRFM$Monetization, 5,labels=F)#rankM 1 is lowest sales while rankM 5 is highest sales

##Looking at RFM tables
table(salesRFM[,5:6])
table(salesRFM[,6:7])
table(salesRFM[,5:7])

Code Highlighted by Pretty R at inside-R.org

Note-you can also use quantile function instead of cut function. This changes cut to equal length instead of equal interval. or  see other methods for finding breaks for categories.

 

Random Sampling a Dataset in R

A common example in business  analytics data is to take a random sample of a very large dataset, to test your analytics code. Note most business analytics datasets are data.frame ( records as rows and variables as columns)  in structure or database bound.This is partly due to a legacy of traditional analytics software.

Here is how we do it in R-

• Refering to parts of data.frame rather than whole dataset.

Using square brackets to reference variable columns and rows

The notation dataset[i,k] refers to element in the ith row and jth column.

The notation dataset[i,] refers to all elements in the ith row .or a record for a data.frame

The notation dataset[,j] refers to all elements in the jth column- or a variable for a data.frame.

For a data.frame dataset

> nrow(dataset) #This gives number of rows

> ncol(dataset) #This gives number of columns

An example for corelation between only a few variables in a data.frame.

> cor(dataset1[,4:6])

Splitting a dataset into test and control.

ts.test=dataset2[1:200] #First 200 rows

ts.control=dataset2[201:275] #Next 75 rows

• Sampling

Random sampling enables us to work on a smaller size of the whole dataset.

use sample to create a random permutation of the vector x.

Suppose we want to take a 5% sample of a data frame with no replacement.

Let us create a dataset ajay of random numbers

ajay=matrix( round(rnorm(200, 5,15)), ncol=10)

#This is the kind of code line that frightens most MBAs!!

Note we use the round function to round off values.

ajay=as.data.frame(ajay)

 nrow(ajay)

[1] 20

> ncol(ajay)

[1] 10

This is a typical business data scenario when we want to select only a few records to do our analysis (or test our code), but have all the columns for those records. Let  us assume we want to sample only 5% of the whole data so we can run our code on it

Then the number of rows in the new object will be 0.05*nrow(ajay).That will be the size of the sample.

The new object can be referenced to choose only a sample of all rows in original object using the size parameter.

We also use the replace=FALSE or F , to not the same row again and again. The new_rows is thus a 5% sample of the existing rows.

Then using the square backets and ajay[new_rows,] to get-

b=ajay[sample(nrow(ajay),replace=F,size=0.05*nrow(ajay)),]

 

You can change the percentage from 5 % to whatever you want accordingly.

On a Hiatus

No blogging (except for interviews)

No poetry (unless I get really inspired and my scrapbook fills up)

No random internet browsing (except search for research)

Hell no, Facebook

No TV

No Movies

No goofing off

No wasting time using creative juice stewing as an excuse

Write the book

Write the book

Write the book, dammit

Internet Encryption Algols are flawed- too little too late!

Some news from a paper I am reading- not surprised that RSA has a problem .

http://eprint.iacr.org/2012/064.pdf

Abstract. We performed a sanity check of public keys collected on the web. Our main goal was to test the validity of the assumption that di erent random choices are made each time keys are generated.We found that the vast majority of public keys work as intended. A more disconcerting fi nding is that two out of every one thousand RSA moduli that we collected off er no security.

 

Our conclusion is that the validity of the assumption is questionable and that generating keys in the real world for multiple-secrets” cryptosystems such as RSA is signi cantly riskier than for single-secret” ones such as ElGamal or (EC)DSA which are based on Die-Hellman.

Keywords: Sanity check, RSA, 99.8% security, ElGamal, DSA, ECDSA, (batch) factoring, discrete logarithm, Euclidean algorithm, seeding random number generators, K9.

and

 

99.8% Security. More seriously, we stumbled upon 12720 di erent 1024-bit RSA moduli that o ffer no security. Their secret keys are accessible to anyone who takes the trouble to redo our work. Assuming access to the public key collection, this is straightforward compared to more

traditional ways to retrieve RSA secret keys (cf. [5,15]). Information on the a ected X.509 certi cates and PGP keys is given in the full version of this paper, cf. below. Overall, over the data we collected 1024-bit RSA provides 99.8% security at best (but see Appendix A).

 

However no algol is perfect and even Elliptic Based Crypto ( see http://en.wikipedia.org/wiki/Elliptic_curve_cryptography#Fast_reduction_.28NIST_curves.29 )has a flaw called Shor http://en.wikipedia.org/wiki/Shor%27s_algorithm

Funny thing is ECC is now used for Open DNS


http://dnscurve.org/crypto.html

The DNSCurve project adds link-level public-key protection to DNS packets. This page discusses the cryptographic tools used in DNSCurve.

ELLIPTIC-CURVE CRYPTOGRAPHY

DNSCurve uses elliptic-curve cryptography, not RSA.

RSA is somewhat older than elliptic-curve cryptography: RSA was introduced in 1977, while elliptic-curve cryptography was introduced in 1985. However, RSA has shown many more weaknesses than elliptic-curve cryptography. RSA’s effective security level was dramatically reduced by the linear sieve in the late 1970s, by the quadratic sieve and ECM in the 1980s, and by the number-field sieve in the 1990s. For comparison, a few attacks have been developed against some rare elliptic curves having special algebraic structures, and the amount of computer power available to attackers has predictably increased, but typical elliptic curves require just as much computer power to break today as they required twenty years ago.

IEEE P1363 standardized elliptic-curve cryptography in the late 1990s, including a stringent list of security criteria for elliptic curves. NIST used the IEEE P1363 criteria to select fifteen specific elliptic curves at five different security levels. In 2005, NSA issued a new “Suite B” standard, recommending the NIST elliptic curves (at two specific security levels) for all public-key cryptography and withdrawing previous recommendations of RSA.

Some specific types of elliptic-curve cryptography are patented, but DNSCurve does not use any of those types of elliptic-curve cryptography.

No wonder college kids are hacking defense databases easily nowadays!!

Some Ways Anonymous Could Disrupt the Internet if SOPA is passed

This is a piece of science fiction. I wrote while reading Isaac Assimov’s advice to writers in GOLD, while on a beach in Anjuna.

1) Identify senators, lobbyists, senior executives of companies advocating for SOPA. Go for selective targeting of these people than massive Denial of Service Attacks.

This could also include election fund raising websites in the United States.

2) Create hacking tools with simple interfaces to probe commonly known software errors, to enable wider audience including the Occupy Movement students to participate in hacking. thus making hacking more democratic. What are the top 25 errors as per  http://cwe.mitre.org/cwss/

http://www.decisionstats.com/top-25-most-dangerous-software-errors/ ?

 

Easy interface tools to check vulnerabilities would be the next generation to flooding tools like HOIC, LOIC – Massive DDOS atttacks make good press coverage but not so good technically

3) Disrupt digital payment mechanisms for selected targets (in step1) using tools developed in Step 2, and introduce random noise errors in payment transfers.

4) Help create a better secure internet by embedding Tor within Chromium with all tools for anonymity embedded for easy usage – a more secure peer to peer browser (like a mashup of Opera , tor and chromium).

or maybe embed bit torrents within a browser.

5) Disrupt media companies and cloud computing based companies like iTunes, Spotify or Google Music, just like virus, ant i viruses disrupted the desktop model of computing. After that offer solutions to the problems like companies of anti virus software did for decades.

6) Hacking websites is fine fun, but hacking internet databases and massively parallel data scrapers can help disrupt some of the status quo.

This applies to databases that offer data for sale, like credit bureaus etc. Making this kind of data public will eliminate data middlemen.

7) Use cross border, cross country regulatory arbitrage for better risk control of hacker attacks.

8) recruiting among universities using easy to use hacking tools to expand the pool of dedicated hacker armies.

9) using operations like those targeting child pornography to increase political acceptability of the hacker sub culture. Refrain from overtly negative and unimaginative bad Press Relations

10) If you cant convince  them to pass SOPA, confuse them 😉 Use bots for random clicks on ads to confuse internet commerce.