Just a single click on a check mark enables tweeting from every one of your blog posts (similar to a Tweetmeme button).
Year: 2010
Indian Offshoring IPOs dismal performance
Using Yahoo Finance, I plotted the past three years of stock prices for Indian offshorers (Genpact, WNS, EXL) in comparison with Indian software companies (Infosys, Wipro, TCS, Sify) and the market index.
The following insights emerge:
1) Indian Software companies have constantly created wealth.
2) Indian offshoring companies have constantly lost market value, perhaps because they were able to set their IPO prices much higher by creating hype.
3) You are much better off investing in the Indian stock market or a blue chip Indian software company than taking part in an Indian offshorer’s IPO.
4) SIFY lost the most value, and its founder CEO is now in jail for fraud. The fraud was that he added phantom employees and phantom revenue to boost the balance sheet. The auditors from PwC (who were jailed) included a board member of Indian Chartered Accountants, and Satyam (SIFY) had won awards for corporate governance. It makes sense to do rigorous cash flow due diligence on this side of the pond.
5) I own no stock in any of these companies (not surprisingly) but do have a portfolio of mutual funds (index).
So the next time you are promised the moon by an Indian IPO- KPO, remember to do the math 😉
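For anyone who wants to do that math, here is a minimal sketch of the comparison in Python. It rebases each price series to 100 at the start so that stocks and an index can be read on one scale. The series names and prices below are made up purely for illustration (they are not real market data), and the helper names `rebase` and `total_return` are my own.

```python
def rebase(prices, base=100.0):
    """Scale a price series so that its first value equals `base`."""
    first = prices[0]
    if first <= 0:
        raise ValueError("first price must be positive")
    return [base * p / first for p in prices]


def total_return(prices):
    """Fractional change from the first to the last price."""
    return prices[-1] / prices[0] - 1.0


# Hypothetical closing prices, invented for illustration only.
series = {
    "IPO_STOCK": [20.0, 15.0, 10.0],
    "INDEX": [50.0, 55.0, 60.0],
}

rebased = {name: rebase(p) for name, p in series.items()}
for name, p in series.items():
    print(name, rebased[name], f"{total_return(p):+.0%}")
```

Rebasing matters because raw prices are not comparable across tickers; only the relative change from a common starting date is.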
Mapreduce Book
Here is a new book on learning MapReduce and it has a free downloadable version as well.
Data-Intensive Text Processing with MapReduce
Jimmy Lin and Chris Dyer
ABSTRACT
Our world is being revolutionized by data-driven methods: access to large amounts of data has generated new insights and opened exciting new opportunities in commerce, science, and computing applications. Processing the enormous quantities of data necessary for these advances requires large clusters, making distributed computing paradigms more crucial than ever. MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. The programming model provides an easy-to-understand abstraction for designing scalable algorithms, while the execution framework transparently handles many system-level details, ranging from scheduling to synchronization to fault tolerance. This book focuses on MapReduce algorithm design, with an emphasis on text processing algorithms common in natural language processing, information retrieval, and machine learning. We introduce the notion of MapReduce design patterns, which represent general reusable solutions to commonly occurring problems across a variety of problem domains. This book not only intends to help the reader “think in MapReduce”, but also discusses limitations of the programming model as well.
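To give a flavour of the programming model the abstract describes, here is a minimal single-machine sketch of the usual word-count example in plain Python. It is not taken from the book and does not use Hadoop; the function names (`map_phase`, `shuffle`, `reduce_phase`) are my own, and the `shuffle` step only imitates the grouping that a real execution framework would do across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(doc_id, text):
    """Mapper: emit an intermediate (word, 1) pair for every token in one document."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group intermediate values by key (done by the framework in a real system)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reducer: sum all the counts emitted for one word."""
    return word, sum(counts)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a dict of {doc_id: text}."""
    intermediate = chain.from_iterable(
        map_phase(doc_id, text) for doc_id, text in documents.items()
    )
    grouped = shuffle(intermediate)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


docs = {"d1": "the quick brown fox", "d2": "the lazy dog and the fox"}
counts = word_count(docs)
print(counts)
```

The programmer writes only the mapper and the reducer; scheduling, grouping by key and fault tolerance are what the framework is meant to handle.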
You can download the book here
This book is part of the Morgan & Claypool Synthesis Lectures on Human Language Technologies. If you’re at a university, your institution may already subscribe to the series, in which case you can access the electronic version directly without cost (see this page for a list of institutional subscribers). Otherwise, to purchase:
- Electronic and print copies from Morgan & Claypool (publisher’s site)
- Print copies from Amazon.com
Quite explicitly, this book focuses on MapReduce algorithm design, not Hadoop programming. Tom White’s Hadoop: The Definitive Guide is a great resource for learning Hadoop.
Want to be notified of updates? Interested in MapReduce algorithm design? Follow @lintool on Twitter here!
Review: Once upon a time in Mumbaai
This is a 70s-era Bollywood movie with two fine actors pitted against each other in a classic genre: the stylish mafia drama. It has an ensemble supporting cast and pretty images of a classic, not-so-crowded Bombay (as it was called), and it actually draws inspiration from real-life gangsters. With fine music and good action as well, this movie can be worth your time.
IPSUR – A Free R Textbook
Here is a free R textbook called IPSUR:
http://ipsur.r-forge.r-project.org/book/index.php
IPSUR stands for Introduction to Probability and Statistics Using R, ISBN: 978-0-557-24979-4, which is a textbook written for an undergraduate course in probability and statistics. The approximate prerequisites are two or three semesters of calculus and some linear algebra in a few places. Attendees of the class include mathematics, engineering, and computer science majors.
IPSUR is FREE, in the GNU sense of the word. Hard copies are available for purchase here from Lulu and will be available (coming soon) from the other standard online retailers worldwide. The price of the book is exactly the manufacturing cost plus the retailers’ markup. You may be able to get it even cheaper by downloading an electronic copy and printing it yourself, but if you elect this route then be sure to get the publisher-quality PDF from the Downloads page. And double check the price. It was cheaper for my students to buy a perfect-bound paperback from Lulu and have it shipped to their door than it was to upload the PDF to Fed-Ex Kinkos and Xerox a coil-bound copy (and on top of that go pick it up at the store).
If you are going to buy from anywhere other than Lulu then be sure to check the time-stamp on the copyright page. There is a 6 to 8 week delay from Lulu to Amazon and you may not be getting the absolute latest version available.
Refer to the Installation page for instructions to install an electronic copy of IPSUR on your personal computer. See the Feedback page for guidance about questions or comments you may have about IPSUR.
Also see http://ipsur.r-forge.r-project.org/rcmdrplugin/index.php for the R Cmdr Plugin
This plugin for the R Commander accompanies the text Introduction to Probability and Statistics Using R by G. Jay Kerns. The plugin contributes functions unique to the book as well as specific configuration and functionality to R Commander, the pioneering work by John Fox of McMaster University.
RcmdrPlugin.IPSUR’s primary goal is to provide a user-friendly graphical user interface (GUI) to the open-source and freely available R statistical computing environment. RcmdrPlugin.IPSUR is equipped to handle many of the statistical analyses and graphical displays usually encountered by upper division undergraduate mathematics, statistics, and engineering majors. Available features are comparable to many expensive commercial packages such as Minitab, SPSS, and JMP-IN.
Since the audience of RcmdrPlugin.IPSUR is slightly different than Rcmdr’s, certain functionality has been added and selected error-checks have been disabled to permit the student to explore alternative regions of the statistical landscape. The resulting benefit of increased flexibility is balanced by somewhat increased vulnerability to syntax errors and misuse; the instructor should keep this and the academic audience in mind when using RcmdrPlugin.IPSUR in the classroom.
GNU PSPP- The Open Source SPSS
If you are an SPSS user (for statistics, not data mining), you can also try out GNU PSPP, which is the open source equivalent and quite eerily impressive in performance. It is available at http://www.gnu.org/software/pspp/ or http://pspp.awardspace.com/ and you can also read more at http://en.wikipedia.org/wiki/PSPP
PSPP is a program for statistical analysis of sampled data. It is a Free replacement for the proprietary program SPSS, and appears very similar to it with a few exceptions.
The most important of these exceptions are that there are no “time bombs”; your copy of PSPP will not “expire” or deliberately stop working in the future. Neither are there any artificial limits on the number of cases or variables which you can use. There are no additional packages to purchase in order to get “advanced” functions; all functionality that PSPP currently supports is in the core package.
PSPP can perform descriptive statistics, T-tests, linear regression and non-parametric tests. Its backend is designed to perform its analyses as fast as possible, regardless of the size of the input data. You can use PSPP with its graphical interface or the more traditional syntax commands.
A brief list of some of the features of PSPP follows:
- Supports over 1 billion cases.
- Supports over 1 billion variables.
- Syntax and data files are compatible with SPSS.
- Choice of terminal or graphical user interface.
- Choice of text, postscript or html output formats.
- Inter-operates with Gnumeric, OpenOffice.Org and other free software.
- Easy data import from spreadsheets, text files and database sources.
- Fast statistical procedures, even on very large data sets.
- No license fees.
- No expiration period.
- No unethical “end user license agreements”.
- Fully indexed user manual.
- Free Software; licensed under GPLv3 or later.
- Cross platform; Runs on many different computers and many different operating systems.
PSPP is particularly aimed at statisticians, social scientists and students requiring fast convenient analysis of sampled data.
and
Features
This software provides a basic set of capabilities: frequencies, cross-tabs, comparison of means (T-tests and one-way ANOVA), linear regression, reliability (Cronbach’s Alpha, not failure or Weibull), re-ordering data, non-parametric tests, factor analysis and more.
At the user’s choice, statistical output and graphics are done in ascii, pdf, postscript or html formats. A limited range of statistical graphs can be produced, such as histograms, pie-charts and np-charts.
PSPP can import Gnumeric, OpenDocument and Excel spreadsheets, Postgres databases, comma-separated values and ASCII files. It can export files in the SPSS ‘portable’ and ‘system’ file formats and to ASCII files. Some of the libraries used by PSPP can be accessed programmatically; PSPP-Perl provides an interface to the libraries used by PSPP.
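As an aside on the reliability statistic listed among the features above: Cronbach’s alpha for k items is k/(k-1) times (1 minus the sum of the item variances divided by the variance of the respondents’ total scores). Below is a minimal sketch of that textbook formula in plain Python. It is not PSPP code and says nothing about how PSPP computes it internally; the function name `cronbach_alpha` and the scores are made up for illustration.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for k items.

    `items` is a list of k lists; each inner list holds one item's scores
    for the same n respondents, in the same order.
    """
    k = len(items)
    if k < 2:
        raise ValueError("need at least two items")
    # Sum of the sample variances of the individual items.
    item_variance_sum = sum(variance(scores) for scores in items)
    # Sample variance of each respondent's total score across all items.
    totals = [sum(respondent) for respondent in zip(*items)]
    total_variance = variance(totals)
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical questionnaire: 3 items answered by 5 respondents (invented scores).
scores = [
    [3, 4, 5, 2, 4],
    [2, 4, 5, 3, 4],
    [3, 5, 4, 2, 5],
]
print(round(cronbach_alpha(scores), 3))
```

Items that move together push alpha towards 1, which is why it is read as a measure of internal consistency.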
Origins
The PSPP project (originally called “Fiasco”) is a free, open-source alternative to the proprietary statistics package SPSS. SPSS is closed-source and includes a restrictive licence and digital rights management. The author of PSPP considered this ethically unacceptable, and decided to write a program which might with time become functionally identical to SPSS, except that there would be no licence expiry, and everyone would be permitted to copy, modify and share the program.
Release history
- 0.7.5 June 2010 http://pspp.awardspace.com/
- 0.6.2 October 2009
- 0.6.1 October 2008
- 0.6.0 June 2008
- 0.4.0.1 August 2007
- 0.4.0 August 2005
- 0.3.0 April 2004
- 0.2.4 January 2000
- 0.1.0 August 1998
Third Party Reviews
In the book “SPSS For Dummies”, the author discusses PSPP under the heading of “Ten Useful Things You Can Find on the Internet” [1]. In 2006, the South African Statistical Association presented a conference which included an analysis of how PSPP can be used as a free replacement for SPSS [2].
Citation-
Please send FSF & GNU inquiries to gnu@gnu.org. There are also other ways to contact the FSF. Please send broken links and other corrections (or suggestions) to bug-gnu-pspp@gnu.org.
Copyright © 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007 Free Software Foundation, Inc., 51 Franklin St – Suite 330, Boston, MA 02110, USA – Verbatim copying and distribution of this entire article are permitted worldwide, without royalty, in any medium, provided this notice, and the copyright notice, are preserved.
Q&A with David Smith, Revolution Analytics.
Here’s a group of questions and answers that David Smith of Revolution Analytics was kind enough to provide after the launch of RevoScaleR, the new R package which integrates Hadoop and R.






