Home » Posts tagged 'statistics'
Tag Archives: statistics
Data Frame in Python
Exploring some Python Packages and R packages to move /work with both Python and R without melting your brain or exceeding your project deadline
—————————————
If you liked the data.frame structure in R, you have some way to work with them at a faster processing speed in Python.
Here are three packages that enable you to do so-
(1) pydataframe http://code.google.com/p/pydataframe/
An implemention of an almost R like DataFrame object. (install via Pypi/Pip: “pip install pydataframe”)
Usage:
u = DataFrame( { "Field1": [1, 2, 3], "Field2": ['abc', 'def', 'hgi']}, optional: ['Field1', 'Field2'] ["rowOne", "rowTwo", "thirdRow"])
A DataFrame is basically a table with rows and columns.
Columns are named, rows are numbered (but can be named) and can be easily selected and calculated upon. Internally, columns are stored as 1d numpy arrays. If you set row names, they’re converted into a dictionary for fast access. There is a rich subselection/slicing API, see help(DataFrame.get_item) (it also works for setting values). Please note that any slice get’s you another DataFrame, to access individual entries use get_row(), get_column(), get_value().
DataFrames also understand basic arithmetic and you can either add (multiply,…) a constant value, or another DataFrame of the same size / with the same column names, like this:
#multiply every value in ColumnA that is smaller than 5 by 6.
my_df[my_df[:,'ColumnA'] < 5, 'ColumnA'] *= 6
#you always need to specify both row and column selectors, use : to mean everything
my_df[:, 'ColumnB'] = my_df[:,'ColumnA'] + my_df[:, 'ColumnC']
#let's take every row that starts with Shu in ColumnA and replace it with a new list (comprehension)
select = my_df.where(lambda row: row['ColumnA'].startswith('Shu'))
my_df[select, 'ColumnA'] = [row['ColumnA'].replace('Shu', 'Sha') for row in my_df[select,:].iter_rows()]
Dataframes talk directly to R via rpy2 (rpy2 is not a prerequiste for the library!)
(2) pandas http://pandas.pydata.org/
Library Highlights
- A fast and efficient DataFrame object for data manipulation with integrated indexing;
- Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
- Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
- Flexible reshaping and pivoting of data sets;
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
- Columns can be inserted and deleted from data structures for size mutability;
- Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
- High performance merging and joining of data sets;
- Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
- Time series-functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;
- The library has been ruthlessly optimized for performance, with critical code paths compiled to C;
- Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.
Why not R?
First of all, we love open source R! It is the most widely-used open source environment for statistical modeling and graphics, and it provided some early inspiration for pandas features. R users will be pleased to find this library adopts some of the best concepts of R, like the foundational DataFrame (one user familiar with R has described pandas as “R data.frame on steroids”). But pandas also seeks to solve some frustrations common to R users:
- R has barebones data alignment and indexing functionality, leaving much work to the user. pandas makes it easy and intuitive to work with messy, irregularly indexed data, like time series data. pandas also provides rich tools, like hierarchical indexing, not found in R;
- R is not well-suited to general purpose programming and system development. pandas enables you to do large-scale data processing seamlessly when developing your production applications;
- Hybrid systems connecting R to a low-productivity systems language like Java, C++, or C# suffer from significantly reduced agility and maintainability, and you’re still stuck developing the system components in a low-productivity language;
- The “copyleft” GPL license of R can create concerns for commercial software vendors who want to distribute R with their software under another license. Python and pandas use more permissive licenses.
(3) datamatrix http://pypi.python.org/pypi/datamatrix/0.8
datamatrix 0.8
A Pythonic implementation of R’s data.frame structure.
Latest Version: 0.9
This module allows access to comma- or other delimiter separated files as if they were tables, using a dictionary-like syntax. DataMatrix objects can be manipulated, rows and columns added and removed, or even transposed
—————————————————————–
Modeling in Python
JMP Student Edition
I really liked the initiatives at JMP/Academic. Not only they offer the software bundled with a textbook, which is both good common sense as well as business sense given how fast students can get confused
(Rant 1 Bundling with textbooks is something I think is Revolution Analytics should think of doing instead of just offering the academic version for free downloading- it would be interesting to see the penetration of R academic market with Revolution’s version and the open source version with the existing strategy)
From http://www.jmp.com/academic/textbooks.shtml
Major publishers of introductory statistics textbooks offer a 12-month license to JMP Student Edition, a streamlined version of JMP, with their textbooks.
and a glance through this http://www.jmp.com/academic/pdf/jmp_se_comparison.pdf shows it is a credible and not extremely whittled down version which would be just dishonest.
And I loved this Reference Card at http://www.jmp.com/academic/pdf/jmp10_se_quick_guide.pdf
Oracle, SAP- Hana, Revolution Analytics and even SAS/STAT itself can make more reference cards like this- elegant solutions for students and new learners!
More- creative-rants Honestly why do corporate sites use PDFs anymore when they can use Instapaper , or any of these SlideShare/Scribd formats to show information in a better way without diverting the user from the main webpage.
But I digress, back to JMP
Resources for Faculty Using JMP® Student Edition
Faculty who select a JMP Student Edition bundle for their courses may be eligible for additional resources, including course materials and training.
Special JMP® Student Edition for AP Statistics
JMP Student Edition is available in a convenient five-year license for qualified Advanced Placement statistics programs.
Try and have a look yourself at http://www.jmp.com/academic/student.shtml
Interview Rob J Hyndman Forecasting Expert #rstats
Here is an interview with Prof Rob J Hyndman who has created many time series forecasting methods and authored books as well as R packages on the same.
Probably the biggest impact I’ve had is in helping the Australian government forecast the national health budget. In 2001 and 2002, they had underestimated health expenditure by nearly $1 billion in each year which is a lot of money to have to find, even for a national government. I was invited to assist them in developing a new forecasting method, which I did. The new method has forecast errors of the order of plus or minus $50 million which is much more manageable. The method I developed for them was the basis of the ETS models discussed in my 2008 book on exponential smoothing (www.exponentialsmoothing.net)
Interview John Myles White , Machine Learning for Hackers
Here is an interview with one of the younger researchers and rock stars of the R Project, John Myles White, co-author of Machine Learning for Hackers.
Ajay- What inspired you guys to write Machine Learning for Hackers. What has been the public response to the book. Are you planning to write a second edition or a next book?
John-We decided to write Machine Learning for Hackers because there were so many people interested in learning more about Machine Learning who found the standard textbooks a little difficult to understand, either because they lacked the mathematical background expected of readers or because it wasn’t clear how to translate the mathematical definitions in those books into usable programs. Most Machine Learning books are written for audiences who will not only be using Machine Learning techniques in their applied work, but also actively inventing new Machine Learning algorithms. The amount of information needed to do both can be daunting, because, as one friend pointed out, it’s similar to insisting that everyone learn how to build a compiler before they can start to program. For most people, it’s better to let them try out programming and get a taste for it before you teach them about the nuts and bolts of compiler design. If they like programming, they can delve into the details later.
Ajay- What are the key things that a potential reader can learn from this book?
John- We cover most of the nuts and bolts of introductory statistics in our book: summary statistics, regression and classification using linear and logistic regression, PCA and k-Nearest Neighbors. We also cover topics that are less well known, but are as important: density plots vs. histograms, regularization, cross-validation, MDS, social network analysis and SVM’s. I hope a reader walks away from the book having a feel for what different basic algorithms do and why they work for some problems and not others. I also hope we do just a little to shift a future generation of modeling culture towards regularization and cross-validation.
Ajay- Describe your journey as a science student up till your Phd. What are you current research interests and what initiatives have you done with them?
John-As an undergraduate I studied math and neuroscience. I then took some time off and came back to do a Ph.D. in psychology, focusing on mathematical modeling of both the brain and behavior. There’s a rich tradition of machine learning and statistics in psychology, so I got increasingly interested in ML methods during my years as a grad student. I’m about to finish my Ph.D. this year. My research interests all fall under one heading: decision theory. I want to understand both how people make decisions (which is what psychology teaches us) and how they should make decisions (which is what statistics and ML teach us). My thesis is focused on how people make decisions when there are both short-term and long-term consequences to be considered. For non-psychologists, the classic example is probably the explore-exploit dilemma. I’ve been working to import more of the main ideas from stats and ML into psychology for modeling how real people handle that trade-off. For psychologists, the classic example is the Marshmallow experiment. Most of my research work has focused on the latter: what makes us patient and how can we measure patience?
Ajay- How can academia and private sector solve the shortage of trained data scientists (assuming there is one)?
John- There’s definitely a shortage of trained data scientists: most companies are finding it difficult to hire someone with the real chops needed to do useful work with Big Data. The skill set required to be useful at a company like Facebook or Twitter is much more advanced than many people realize, so I think it will be some time until there are undergraduates coming out with the right stuff. But there’s huge demand, so I’m sure the market will clear sooner or later.
(TIL he has played in several rock bands!)
Who made Who in #Rstats
While Bob M, my old mentor and fellow TN man maintains the website http://r4stats.com/ how popular R is across various forums, I am interested in who within R community of 3 million (give or take a few) is contributing more. I am very sure by 2014, we can have a new fork of R called Hadley R, in which all packages would be made by Hadley Wickham and you wont need anything else.
But jokes apart, since I didnt have the time to
1) scrape CRAN for all package authors
2) scrape for lines of code across all packages
3) allocate lines of code (itself a dubious software productivity metric) to various authors of R packages-
OR
1) scraping the entire and 2011′s R help list
2) determine who is the most frequent r question and answer user (ala SAS-L’s annual MVP and rookie of the year awards)
I did the following to atleast who is talking about R across easily scrapable Q and A websites
Stack Overflow still rules over all.
http://stackoverflow.com/tags/r/topusers shows the statistics on who made whom in R on Stack Overflow
All in all, initial ardour seems to have slowed for #Rstats on Stack Overflow ? or is it just summer?
No the answer- credit to Rob J Hyndman is most(?) activity is shifting to Stats Exchange
http://stats.stackexchange.com/tags/r/topusers
You could also paste this in Notepad and some graphs on Average Score / Answer or even make a social network graph if you had the time.
Do NOT (Go/Bi) search for Stack Overflow API or web scraping stack overflow- it gives you all the answers on the website but 0 answers on how to scrape these websites.
I have added a new website called Meta Optimize to this list based on Tal G’s interview of Joseph Turian, at http://www.r-statistics.com/2010/07/statistical-analysis-qa-website-did-stackoverflow-just-lose-it-to-metaoptimize-and-is-it-good-or-bad/
http://metaoptimize.com/qa/tags/r/?sort=hottest
There are only 17 questions tagged R but it seems a lot of views is being generated.
I also decided to add views from Quora since it is Q and A site (and one which I really like)
http://www.quora.com/R-software
Again very few questions but lot many followers
Revolution R Enterprise 6.0 launched!
Just got the email-more software is good news!
Revolution R Enterprise 6.0 for 32-bit and 64-bit Windows and 64-bit Red Hat Enterprise Linux (RHEL 5.x and RHEL 6.x) features an updated release of the RevoScaleR package that provides fast, scalable data management and data analysis: the same code scales from data frames to local, high-performance .xdf files to data distributed across a Windows HPC Server cluster or IBM Platform Computing LSF cluster. RevoScaleR also allows distribution of the execution of essentially any R function across cores and nodes, delivering the results back to the user.
Detailed information on what’s new in 6.0 and known issues:
http://www.revolutionanalytics.com/doc/README_RevoEnt_Windows_6.0.0.pdf
and from the manual-lots of function goodies for Big Data
- IBM Platform LSF Cluster support [Linux only]. The new RevoScaleR function, RxLsfCluster, allows you to create a distributed compute context for the Platform LSF workload manager.
- Azure Burst support added for Microsoft HPC Server [Windows only]. The new RevoScaleR function, RxAzureBurst, allows you to create a distributed compute context to have computations performed in the cloud using Azure Burst
- The rxExec function allows distributed execution of essentially any R function across cores and nodes, delivering the results back to the user.
- functions RxLocalParallel and RxLocalSeq allow you to create compute context objects for local parallel and local sequential computation, respectively.
- RxForeachDoPar allows you to create a compute context using the currently registered foreach parallel backend (doParallel, doSNOW, doMC, etc.). To execute rxExec calls, simply register the parallel backend as usual, then set your compute context as follows: rxSetComputeContext(RxForeachDoPar())
- rxSetComputeContext and rxGetComputeContext simplify management of compute contexts.
- rxGlm, provides a fast, scalable, distributable implementation of generalized linear models. This expands the list of full-featured high performance analytics functions already available: summary statistics (rxSummary), cubes and cross tabs (rxCube,rxCrossTabs), linear models (rxLinMod), covariance and correlation matrices (rxCovCor),
binomial logistic regression (rxLogit), and k-means clustering (rxKmeans)example: a Tweedie family with 1 million observations and 78 estimated coefficients (categorical data)
took 17 seconds with rxGlm compared with 377 seconds for glm on a quadcore laptopand easier working with R’s big brother SAS language
RevoScaleR high-performance analysis functions will now conveniently work directly with a variety of external data sources (delimited and fixed format text files, SAS files, SPSS files, and ODBC data connections). New functions are provided to create data source objects to represent these data sources (RxTextData, RxOdbcData, RxSasData, and RxSpssData), which in turn can be specified for the ‘data’ argument for these RevoScaleR analysis functions: rxHistogram, rxSummary, rxCube, rxCrossTabs, rxLinMod, rxCovCor, rxLogit, and rxGlm.
example,
you can analyze a SAS file directly as follows:
# Create a SAS data source with information about variables and # rows to read in each chunk
sasDataFile <- file.path(rxGetOption(“sampleDataDir”),”claims.sas7bdat”)
sasDS <- RxSasData(sasDataFile, stringsAsFactors = TRUE,colClasses = c(RowNum = “integer”),rowsPerRead = 50)# Compute and draw a histogram directly from the SAS file
rxHistogram( ~cost|type, data = sasDS)
# Compute summary statistics
rxSummary(~., data = sasDS)
# Estimate a linear model
linModObj <- rxLinMod(cost~age + car_age + type, data = sasDS)
summary(linModObj)
# Import a subset into a data frame for further inspection
subData <- rxImport(inData = sasDS, rowSelection = cost > 400,
varsToKeep = c(“cost”, “age”, “type”))
subData
The installation instructions and instructions for getting started with Revolution R Enterprise & RevoDeployR for Windows: http://www.revolutionanalytics.com/downloads/instructions/windows.php
Cricinfo StatsGuru Database for Statistical and Graphical Analysis
Data from the ESPN Cricinfo website is available from the STATSGURU website.
The url is of the form-
http://stats.espncricinfo.com/ci/engine/stats/index.html?
class=1;team=6;template=results;type=batting
If you break down this URL to get more statistics on cricket, you can choose the following parameters.
class
1=Test
2=ODI
3=T20I
11=Test+ODI+T20I
team
1=England
2=Australia
3=South America
4-West Indies
5=New Zealand
6=India ,7=Pakistan and 8=Sri Lanka
type
batting
bowling
fielding
allround
fow
official
team
aggregate
ESPN Terms of Use are here-you may need to check this before trying any web scraping.
http://www.espncricinfo.com/ci/content/site/company/terms_use.html
However ESPN has unleashed the API (including both free and premium)for Developers at http://developer.espn.com/docs.
and especially these sports http://developer.espn.com/docs/headlines#parameters
| /sports | News across all sports/sections | ||||
| /sports/baseball/mlb | Major League Baseball (MLB) | ||||
| /sports/basketball/mens-college-basketball | NCAA Men’s College Basketball | ||||
| /sports/basketball/nba | National Basketball Association (NBA) | ||||
| /sports/basketball/wnba | Women’s National Basketball Association (WNBA) | ||||
| /sports/basketball/womens-college-basketball | NCAA Women’s College Basketball | ||||
| /sports/boxing | Boxing | ||||
| /sports/football/college-football | NCAA College Football | ||||
| /sports/football/nfl | National Football League (NFL) | ||||
| /sports/golf | Golf | ||||
| /sports/hockey/nhl | National Hockey League (NHL) | ||||
| /sports/horse-racing | Horse Racing | ||||
| /sports/mma | Mixed Martial Arts | ||||
| /sports/racing | Auto Racing | ||||
| /sports/racing/nascar | NASCAR Racing | ||||
| /sports/soccer | Professional soccer (US focus) | ||||
| /sports/tennis | Tennis |
I wonder when this can be enabled for Cricket as well (including APIs free,academic,premium,partner ).
(Note you can use R packages XML , RCurl , rjson, to get data from the web among others).
Plotting is best done using ggplot2 http://had.co.nz/ggplot2/ or d3.js at http://mbostock.github.com/d3/, and the current status of cricket graphics can surely look a change- they are mostly a single radial plot of shots played /runs scored or a combined barplot/line graph.




