Home » Posts tagged 'open'
Tag Archives: open
Data Frame in Python
Exploring some Python Packages and R packages to move /work with both Python and R without melting your brain or exceeding your project deadline
—————————————
If you liked the data.frame structure in R, you have some way to work with them at a faster processing speed in Python.
Here are three packages that enable you to do so-
(1) pydataframe http://code.google.com/p/pydataframe/
An implemention of an almost R like DataFrame object. (install via Pypi/Pip: “pip install pydataframe”)
Usage:
u = DataFrame( { "Field1": [1, 2, 3], "Field2": ['abc', 'def', 'hgi']}, optional: ['Field1', 'Field2'] ["rowOne", "rowTwo", "thirdRow"])
A DataFrame is basically a table with rows and columns.
Columns are named, rows are numbered (but can be named) and can be easily selected and calculated upon. Internally, columns are stored as 1d numpy arrays. If you set row names, they’re converted into a dictionary for fast access. There is a rich subselection/slicing API, see help(DataFrame.get_item) (it also works for setting values). Please note that any slice get’s you another DataFrame, to access individual entries use get_row(), get_column(), get_value().
DataFrames also understand basic arithmetic and you can either add (multiply,…) a constant value, or another DataFrame of the same size / with the same column names, like this:
#multiply every value in ColumnA that is smaller than 5 by 6.
my_df[my_df[:,'ColumnA'] < 5, 'ColumnA'] *= 6
#you always need to specify both row and column selectors, use : to mean everything
my_df[:, 'ColumnB'] = my_df[:,'ColumnA'] + my_df[:, 'ColumnC']
#let's take every row that starts with Shu in ColumnA and replace it with a new list (comprehension)
select = my_df.where(lambda row: row['ColumnA'].startswith('Shu'))
my_df[select, 'ColumnA'] = [row['ColumnA'].replace('Shu', 'Sha') for row in my_df[select,:].iter_rows()]
Dataframes talk directly to R via rpy2 (rpy2 is not a prerequiste for the library!)
(2) pandas http://pandas.pydata.org/
Library Highlights
- A fast and efficient DataFrame object for data manipulation with integrated indexing;
- Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
- Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
- Flexible reshaping and pivoting of data sets;
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
- Columns can be inserted and deleted from data structures for size mutability;
- Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
- High performance merging and joining of data sets;
- Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
- Time series-functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;
- The library has been ruthlessly optimized for performance, with critical code paths compiled to C;
- Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.
Why not R?
First of all, we love open source R! It is the most widely-used open source environment for statistical modeling and graphics, and it provided some early inspiration for pandas features. R users will be pleased to find this library adopts some of the best concepts of R, like the foundational DataFrame (one user familiar with R has described pandas as “R data.frame on steroids”). But pandas also seeks to solve some frustrations common to R users:
- R has barebones data alignment and indexing functionality, leaving much work to the user. pandas makes it easy and intuitive to work with messy, irregularly indexed data, like time series data. pandas also provides rich tools, like hierarchical indexing, not found in R;
- R is not well-suited to general purpose programming and system development. pandas enables you to do large-scale data processing seamlessly when developing your production applications;
- Hybrid systems connecting R to a low-productivity systems language like Java, C++, or C# suffer from significantly reduced agility and maintainability, and you’re still stuck developing the system components in a low-productivity language;
- The “copyleft” GPL license of R can create concerns for commercial software vendors who want to distribute R with their software under another license. Python and pandas use more permissive licenses.
(3) datamatrix http://pypi.python.org/pypi/datamatrix/0.8
datamatrix 0.8
A Pythonic implementation of R’s data.frame structure.
Latest Version: 0.9
This module allows access to comma- or other delimiter separated files as if they were tables, using a dictionary-like syntax. DataMatrix objects can be manipulated, rows and columns added and removed, or even transposed
—————————————————————–
Modeling in Python
JMP Student Edition
I really liked the initiatives at JMP/Academic. Not only they offer the software bundled with a textbook, which is both good common sense as well as business sense given how fast students can get confused
(Rant 1 Bundling with textbooks is something I think is Revolution Analytics should think of doing instead of just offering the academic version for free downloading- it would be interesting to see the penetration of R academic market with Revolution’s version and the open source version with the existing strategy)
From http://www.jmp.com/academic/textbooks.shtml
Major publishers of introductory statistics textbooks offer a 12-month license to JMP Student Edition, a streamlined version of JMP, with their textbooks.
and a glance through this http://www.jmp.com/academic/pdf/jmp_se_comparison.pdf shows it is a credible and not extremely whittled down version which would be just dishonest.
And I loved this Reference Card at http://www.jmp.com/academic/pdf/jmp10_se_quick_guide.pdf
Oracle, SAP- Hana, Revolution Analytics and even SAS/STAT itself can make more reference cards like this- elegant solutions for students and new learners!
More- creative-rants Honestly why do corporate sites use PDFs anymore when they can use Instapaper , or any of these SlideShare/Scribd formats to show information in a better way without diverting the user from the main webpage.
But I digress, back to JMP
Resources for Faculty Using JMP® Student Edition
Faculty who select a JMP Student Edition bundle for their courses may be eligible for additional resources, including course materials and training.
Special JMP® Student Edition for AP Statistics
JMP Student Edition is available in a convenient five-year license for qualified Advanced Placement statistics programs.
Try and have a look yourself at http://www.jmp.com/academic/student.shtml
Obfuscate using Rapid Miner
ob·fus·cate/ˈäbfəˌskāt/
| Verb: |
|
A nice geeky function in Rapid Miner is the Obfuscator
This operator can be used to anonymize your data. It is possible to save the obfuscating map into a file which can be used to remap the old values and names. Please use the operator Deobfuscator for this
Click screenshot to enlarge-
RapidMiner is free for download here (its open source)
http://rapid-i.com/content/view/26/201/
Using Rapid Miner and R for Sports Analytics #rstats
Ajay- Why did you choose Rapid Miner and R? What were the other software alternatives you considered and discarded?
Analyst- We considered most of the other major players in statistics/data mining or enterprise BI. However, we found that the value proposition for an open source solution was too compelling to justify the premium pricing that the commercial solutions would have required. The widespread adoption of R and the variety of packages and algorithms available for it, made it an easy choice. We liked RapidMiner as a way to design structured, repeatable processes, and the ability to optimize learner parameters in a systematic way. It also handled large data sets better than R on 32-bit Windows did. The GUI, particularly when 5.0 was released, made it more usable than R for analysts who weren’t experienced programmers.
Ajay- What analytics do you do think Rapid Miner and R are best suited for?
Analyst- We use RM+R mainly for sports analysis so far, rather than for more traditional business applications. It has been quite suitable for that, and I can easily see how it would be used for other types of applications.
Ajay- Any experiences as an enterprise customer? How was the installation process? How good is the enterprise level support?
Analyst- Rapid-I has been one of the most responsive tech companies I’ve dealt with, either in my current role or with previous employers. They are small enough to be able to respond quickly to requests, and in more than one case, have fixed a problem, or added a small feature we needed within a matter of days. In other cases, we have contracted with them to add larger pieces of specific functionality we needed at reasonable consulting rates. Those features are added to the mainline product, and become fully supported through regular channels. The longer consulting projects have typically had a turnaround of just a few weeks.
Ajay- What challenges if any did you face in executing a pure open source analytics bundle ?
Analyst- As Rapid-I is a smaller company based in Europe, the availability of training and consulting in the USA isn’t as extensive as for the major enterprise software players, and the time zone differences sometimes slow down the communications cycle. There were times where we were the first customer to attempt a specific integration point in our technical environment, and with no prior experiences to fall back on, we had to work with Rapid-I to figure out how to do it. Compared to the what traditional software vendors provide, both R and RM tend to have sparse, terse, occasionally incomplete documentation. The situation is getting better, but still lags behind what the traditional enterprise software vendors provide.
Ajay- What are the things you can do in R ,and what are the things you prefer to do in Rapid Miner (comparison for technical synergies)
Analyst- Our experience has been that RM is superior to R at writing and maintaining structured processes, better at handling larger amounts of data, and more flexible at fine-tuning model parameters automatically. The biggest limitation we’ve had with RM compared to R is that R has a larger library of user-contributed packages for additional data mining algorithms. Sometimes we opted to use R because RM hadn’t yet implemented a specific algorithm. The introduction the R extension has allowed us to combine the strengths of both tools in a very logical and productive way.
In particular, extending RapidMiner with R helped address RM’s weakness in the breadth of algorithms, because it brings the entire R ecosystem into RM (similar to how Rapid-I implemented much of the Weka library early on in RM’s development). Further, because the R user community releases packages that implement new techniques faster than the enterprise vendors can, this helps turn a potential weakness into a potential strength. However, R packages tend to be of varying quality, and are more prone to go stale due to lack of support/bug fixes. This depends heavily on the package’s maintainer and its prevalence of use in the R community. So when RapidMiner has a learner with a native implementation, it’s usually better to use it than the R equivalent.
RCOMM 2012 goes live in August
An awesome conference by an awesome software Rapid Miner remains one of the leading enterprise grade open source software , that can help you do a lot of things including flow driven data modeling ,web mining ,web crawling etc which even other software cant.
Presentations include:
- Mining Machine 2 Machine Data (Katharina Morik, TU Dortmund University)
- Handling Big Data (Andras Benczur, MTA SZTAKI)
- Introduction of RapidAnalytics at Telenor (Telenor and United Consult)
- and more
Here is a list of complete program
Program
Time
|
Tuesday
|
Wednesday
|
Thursday
|
Friday
|
09:00 – 10:30 |
Introductory Speech Ingo Mierswa (Rapid-I)Resource-aware Data Mining or M2M Mining (Invited Talk) Katharina Morik (TU Dortmund University)
Data Analysis
NeurophRM: Integration of the Neuroph framework into RapidMiner |
To be announced (Invited Talk) Andras Benczur Recommender Systems
Extending RapidMiner with Recommender Systems Algorithms Implementation of User Based Collaborative Filtering in RapidMiner |
Parallel Training / Workshop Session
Advanced Data Mining and Data Transformations or |
|
10:30 – 11:00 |
Coffee Break |
Coffee Break |
Coffee Break |
|
11:00 – 12:30 |
Data Analysis
Nearest-Neighbor and Clustering based Anomaly Detection Algorithms for RapidMiner Customers’ LifeStyle Targeting on Big Data using Rapid Miner Robust GPGPU Plugin Development for RapidMiner |
Extensions
Optimization Plugin For RapidMiner
Image Mining Extension – Year After Incorporating R Plots into RapidMiner Reports |
||
12:30 – 13:30 |
Lunch |
Lunch |
Lunch |
|
13:30 – 15:30 |
Parallel Training / Workshop Session
Basic Data Mining and Data Transformations or |
Applications
Introduction of RapidAnalyticy Enterprise Edition at Telenor Hungary
Application of RapidMiner in Steel Industry Research and Development A Comparison of Data-driven Models for Forecast River Flow Portfolio Optimization Using Local Linear Regression Ensembles in Rapid Miner |
Extensions
An Octave Extension for RapidMiner
Unstructured Data
Processing Data Streams with the RapidMiner Streams-Plugin Automated Creation of Corpuses for the Needs of Sentiment Analysis
Demonstration: News from the Rapid-I Labs This short session demonstrates the latest developments from the Rapid-I lab and will let you how you can build powerful analysis processes and routines by using those RapidMiner tools. |
Certification Exam |
15:30 – 16:00 |
Coffee Break |
Coffee Break |
Coffee Break |
|
16:00 – 18:00 |
Book Presentation and Game Show
Data Mining for the Masses: A New Textbook on Data Mining for Everyone Matthew North presents his new book “Data Mining for the Masses” introducing data mining to a broader audience and making use of RapidMiner for practical data mining problems.
Game Show |
User Support
Get some Coffee for free – Writing Operators with RapidMiner Beans Meta-Modeling Execution Times of RapidMiner operators Conference day ends at ca. 17:00. |
||
19:30 |
Social Event (Conference Dinner) |
Social Event (Visit of Bar District) |
and you should have a look at https://rapid-i.com/rcomm2012f/index.php?option=com_content&view=article&id=65
Conference is in Budapest, Hungary,Europe.
( Disclaimer- Rapid Miner is an advertising sponsor of Decisionstats.com in case you didnot notice the two banner sized ads.)
New Free Online Book by Rob Hyndman on Forecasting using #Rstats
From the creator of some of the most widely used packages for time series in the R programming language comes a brand new book, and its online!
This time the book is free, will be updated and 7 chapters are ready (to read!)
. If you do forecasting professionally, now is the time to suggest your own use cases to be featured as the book gets ready by end- 2012. The book is intended as a replacement for Makridakis, Wheelwright and Hyndman (Wiley 1998).
The book is written for three audiences:
(1) people finding themselves doing forecasting in business when they may not have had any formal training in the area;
(2) undergraduate students studying business;
(3) MBA students doing a forecasting elective.
The book is different from other forecasting textbooks in several ways.
- It is free and online, making it accessible to a wide audience.
- It is continuously updated. You don’t have to wait until the next edition for errors to be removed or new methods to be discussed. We will update the book frequently.
- There are dozens of real data examples taken from our own consulting practice. We have worked with hundreds of businesses and organizations helping them with forecasting issues, and this experience has contributed directly to many of the examples given here, as well as guiding our general philosophy of forecasting.
- We emphasise graphical methods more than most forecasters. We use graphs to explore the data, analyse the validity of the models fitted and present the forecasting results.
A print version and a downloadable e-version of the book will be available to purchase on Amazon, but not until a few more chapters are written.
Contents
(Ajay-Support the open textbook movement!)
If you’ve found this book helpful, please consider helping to fund free, open and online textbooks. (Donations via PayPal.)
RevoDeployR and commercial BI using R and R based cloud computing using Open CPU
Revolution Analytics has of course had RevoDeployR, and in a webinar strive to bring it back to center spotlight.
BI is a good lucrative market, and visualization is a strength in R, so it is matter of time before we have more R based BI solutions. I really liked the two slides below for explaining RevoDeployR better to newbies like me (and many others!)
Integrating R into 3rd party and Web applications using RevoDeployR
Please click here to download the PDF.
Here are some additional links that may be of interest to you:
- RevoDeployR web page: http://www.revolutionanalytics.com/products/enterprise-deployment.php
- RevoDeployR data sheet: http://www.revolutionanalytics.com/products/pdf/RevoDeployR.pdf
- RevoDeployR whitepaper: http://www.revolutionanalytics.com/why-revolution-r/whitepapers/DeployR_White_Paper.pdf
( I still think someone should make a commercial version of Jeroen Oom’s web interfaces and Jeff Horner’s web infrastructure (see below) for making customized Business Intelligence (BI) /Data Visualization solutions , UCLA and Vanderbilt are not exactly Stanford when it comes to deploying great academic solutions in the startup-tech world). I kind of think Google or someone at Revolution should atleast dekko OpenCPU as a credible cloud solution in R.
I still cant figure out whether Revolution Analytics has a cloud computing strategy and Google seems to be working mysteriously as usual in broadening access to the Google Compute Cloud to the rest of R Community.
Open CPU provides a free and open platform for statistical computing in the cloud. It is meant as an open, social analysis environment where people can share and run R functions and objects. For more details, visit the websit: www.opencpu.org
and esp see
https://public.opencpu.org/userapps/opencpu/opencpu.demo/runcode/
Jeff Horner’s
Jerooen Oom’s
-
/webapps
- /stockplot
- /lme4
- /ggplot2
- /puberty plot
- /IRT tool






