Home » Uncategorized

Category Archives: Uncategorized

Scribd Analytics

I really liked Scribd Analytics feature (and that we have racked up 3200 reads in 11 months for Poets and Hackers). I think Google Docs /Drive should really incorporate more Scribd like document sharing features including social (now that Slideshare is off the market for a relatively cheap $119m) and turn on analytics by default

I really liked the heatmap on the document feature in the second screenshot.

Anyways nice to see someone out there cares for Poets &….



Interview John Myles White , Machine Learning for Hackers

Here is an interview with one of the younger researchers  and rock stars of the R Project, John Myles White,  co-author of Machine Learning for Hackers.

Ajay- What inspired you guys to write Machine Learning for Hackers. What has been the public response to the book. Are you planning to write a second edition or a next book?

John-We decided to write Machine Learning for Hackers because there were so many people interested in learning more about Machine Learning who found the standard textbooks a little difficult to understand, either because they lacked the mathematical background expected of readers or because it wasn’t clear how to translate the mathematical definitions in those books into usable programs. Most Machine Learning books are written for audiences who will not only be using Machine Learning techniques in their applied work, but also actively inventing new Machine Learning algorithms. The amount of information needed to do both can be daunting, because, as one friend pointed out, it’s similar to insisting that everyone learn how to build a compiler before they can start to program. For most people, it’s better to let them try out programming and get a taste for it before you teach them about the nuts and bolts of compiler design. If they like programming, they can delve into the details later.

We once said that Machine Learning for Hackers  is supposed to be a chemistry set for Machine Learning and I still think that’s the right description: it’s meant to get readers excited about Machine Learning and hopefully expose them to enough ideas and tools that they can start to explore on their own more effectively. It’s like a warmup for standard academic books like Bishop’s.
The public response to the book has been phenomenal. It’s been amazing to see how many people have bought the book and how many people have told us they found it helpful. Even friends with substantial expertise in statistics have said they’ve found a few nuggets of new information in the book, especially regarding text analysis and social network analysis — topics that Drew and I spend a lot of time thinking about, but are not thoroughly covered in standard statistics and Machine Learning  undergraduate curricula.
I hope we write a second edition. It was our first book and we learned a ton about how to write at length from the experience. I’m about to announce later this week that I’m writing a second book, which will be a very short eBook for O’Reilly. Stay tuned for details.

Ajay-  What are the key things that a potential reader can learn from this book?

John- We cover most of the nuts and bolts of introductory statistics in our book: summary statistics, regression and classification using linear and logistic regression, PCA and k-Nearest Neighbors. We also cover topics that are less well known, but are as important: density plots vs. histograms, regularization, cross-validation, MDS, social network analysis and SVM’s. I hope a reader walks away from the book having a feel for what different basic algorithms do and why they work for some problems and not others. I also hope we do just a little to shift a future generation of modeling culture towards regularization and cross-validation.

Ajay- Describe your journey as a science student up till your Phd. What are you current research interests and what initiatives have you done with them?

John-As an undergraduate I studied math and neuroscience. I then took some time off and came back to do a Ph.D. in psychology, focusing on mathematical modeling of both the brain and behavior. There’s a rich tradition of machine learning and statistics in psychology, so I got increasingly interested in ML methods during my years as a grad student. I’m about to finish my Ph.D. this year. My research interests all fall under one heading: decision theory. I want to understand both how people make decisions (which is what psychology teaches us) and how they should make decisions (which is what statistics and ML teach us). My thesis is focused on how people make decisions when there are both short-term and long-term consequences to be considered. For non-psychologists, the classic example is probably the explore-exploit dilemma. I’ve been working to import more of the main ideas from stats and ML into psychology for modeling how real people handle that trade-off. For psychologists, the classic example is the Marshmallow experiment. Most of my research work has focused on the latter: what makes us patient and how can we measure patience?

Ajay- How can academia and private sector solve the shortage of trained data scientists (assuming there is one)?

John- There’s definitely a shortage of trained data scientists: most companies are finding it difficult to hire someone with the real chops needed to do useful work with Big Data. The skill set required to be useful at a company like Facebook or Twitter is much more advanced than many people realize, so I think it will be some time until there are undergraduates coming out with the right stuff. But there’s huge demand, so I’m sure the market will clear sooner or later.

The changes that are required in academia to prepare students for this kind of work are pretty numerous, but the most obvious required change is that quantitative people need to be learning how to program properly, which is rare in academia, even in many CS departments. Writing one-off programs that no one will ever have to reuse and that only work on toy data sets doesn’t prepare you for working with huge amounts of messy data that exhibit shifting patterns. If you need to learn how to program seriously before you can do useful work, you’re not very valuable to companies who need employees that can hit the ground running. The companies that have done best in building up data teams, like LinkedIn, have learned to train people as they come in since the proper training isn’t typically available outside those companies.
Of course, on the flipside, the people who do know how to program well need to start learning more about theory and need to start to have a better grasp of basic mathematical models like linear and logistic regressions. Lots of CS students seem not to enjoy their theory classes, but theory really does prepare you for thinking about what you can learn from data. You may not use automata theory if you work at Foursquare, but you will need to be able to reason carefully and analytically. Doing math is just like lifting weights: if you’re not good at it right now, you just need to dig in and get yourself in shape.
John Myles White is a Phd Student in  Ph.D. student in the Princeton Psychology Department, where he studies human decision-making both theoretically and experimentally. Along with the political scientist Drew Conway, he is  the author of a book published by O’Reilly Media entitled “Machine Learning for Hackers”, which is meant to introduce experienced programmers to the machine learning toolkit. He is also working with Mark Hansenon a book for laypeople about exploratory data analysis.John is the lead maintainer for several R packages, including ProjectTemplate and log4r.

(TIL he has played in several rock bands!)

You can read more in his own words at his blog at http://www.johnmyleswhite.com/about/
He can be contacted via social media at Google Plus at https://plus.google.com/109658960610931658914 or twitter at twitter.com/johnmyleswhite/

Decisionstats.com is back from a dDOS

  1. Servers were okay, it was the DNS server that got swamped.
  2. I am sorry for the downtime- hopefully you didnt even notice
  3. I have faced challenges like domain name hijacking, sql injection , malicious WP plugins and thats why shifted to a professional hosting. I stand by my vendors and their professional judgement, moving away would mean the hackers won.
  4. This was very clever to swamp the DNS provider- my compliments to the tech talent behind this.
  5. You would think that every webmaster would have a back up plan in case his site went dDOS, but surprisingly even corporate websites dont have a back up (under attack) plan


Protected: Converting SAS language code to Java

This content is password protected. To view it please enter your password below:

Who made Libre Office






513 individuals contributed to OpenOffice.org (and whose contributions were imported into LibreOffice) or LibreOffice until 2011-11-11 09:02:38.

Developers committing code since 2010-09-28

Ruediger Timm
Commits: 89832
Joined: 2000-10-10
Kurt Zenker
Commits: 32763
Joined: 2000-09-25
Oliver Bolte
Commits: 31795
Joined: 2000-09-19
Vladimir Glazunov
Commits: 30289
Joined: 2000-12-04
Jens-Heiner Rechtien [hr]
Commits: 29314
Joined: 2000-09-18
Ivo Hinkelmann
Commits: 10228
Joined: 2002-09-09
Caolán McNamara
Commits: 5952
Joined: 2000-10-10
Frank Schoenheit [fs]
Commits: 5019
Joined: 2000-09-19
Hans-Joachim Lankenau
Commits: 3077
Joined: 2000-09-19
Ocke Janssen [oj]
Commits: 2861
Joined: 2000-09-20
Mathias Bauer
Commits: 2606
Joined: 2000-09-20
Oliver Specht
Commits: 2458
Joined: 2000-09-21
Philipp Lohmann [pl]
Commits: 2132
Joined: 2000-09-21
Tor Lillqvist
Commits: 2035
Joined: 2010-03-23
Stephan Bergmann
Commits: 1993
Joined: 2000-10-04
Christian Lippka ORACLE
Commits: 1811
Joined: 2000-09-25

We do not distinguish between commits that were imported from the OOo code base and those that went directly into the LibreOffice code base as:
a) it is technically not possible to distinguish between commits that go directly into the LibreOffice code base and commits that were merged in from the OpenOffice.org code base, and
b) contributers to the OOo code base should also be credited for the excellent work they do.

Do note that LibreOffice is divided into 20 git repositories. Pushing a change into all repositories will be counted as 20 commits as there is no way to distinguish this from 20 separate commits.

Total contributions to the TDF Wiki

1223 individuals contributed:

Quantitative Modeling for Arbitrage Positions in Ad KeyWords Internet Marketing

Assume you treat an ad keyword as an equity stock. There are slight differences in the cost for advertising for that keyword across various locations (Zurich vs Delhi) and various channels (Facebook vs Google) . You get revenue if your website ranks naturally in organic search for the keyword, and you have to pay costs for getting traffic to your website for that keyword.
An arbitrage position is defined as a riskless profit when cost of keyword is less than revenue from keyword. We take examples of Adsense  and Adwords primarily.
There are primarily two types of economic curves on the foundation of which commerce of the  internet  resides-
1) Cost Curve- Cost of Advertising to drive traffic into the website  (Google Adwords, Twitter Ads, Facebook , LinkedIn ads)
2) Revenue Curve - Revenue from ads clicked by the incoming traffic on website (like Adsense, LinkAds, Banner Ads, Ad Sharing Programs , In Game Ads)
The cost and revenue curves are primarily dependent on two things
1) Type of KeyWord-Also subdependent on
a) Location of Prospective Customer, and
b) Net Present Value of Good and Service to be eventually purchased
For example , keyword for targeting sales of enterprise “business intelligence software” should ideally be costing say X times as much as keywords for “flower shop for birthdays” where X is the multiple of the expected payoffs from sales of business intelligence software divided by expected payoff from sales of flowers (say in Location, Daytona Beach ,Florida or Austin, Texas)
2) Traffic Volume – Also sub-dependent on Time Series and
a) Seasonality -Annual Shoppping Cycle
b) Cyclicality- Macro economic shifts in time series
The cost and revenue curves are not linear and ideally should be continuous in a definitive exponential or polynomial manner, but in actual reality they may have sharp inflections , due to location, time, as well as web traffic volume thresholds
Type of Keyword – For example ,keywords for targeting sales for Eminem Albums may shoot up in a non linear manner after the musician dies.
The third and not so publicly known component of both the cost and revenue curves is factoring in internet industry dynamics , including relative market share of internet advertising platforms, as well as percentage splits between content creator and ad providing platforms.
For example, based on internet advertising spend, people belive that the internet advertising is currently heading for a duo-poly with Google and Facebook are the top two players, while Microsoft/Skype/Yahoo and LinkedIn/Twitter offer niche options, but primarily depend on price setting from Google/Bing/Facebook.
It is difficut to quantify  the elasticity and efficiency of market curves as most literature and research on this is by in-house corporate teams , or advisors or mentors or consultants to the primary leaders in a kind of incesteous fraternal hold on public academic research on this.
It is recommended that-
1) a balance be found in the need for corporate secrecy to protest shareholder value /stakeholder value maximization versus the need for data liberation for innovation and grow the internet ad pie faster-
2) Cost and Revenue Curves between different keywords, time,location, service providers, be studied by quants for hedging inetrent ad inventory or /and choose arbitrage positions This kind of analysis is done for groups of stocks and commodities in the financial world, but as commerce grows on the internet this may need more specific and independent quants.
3) attention be made to how cost and revenue curves mature as per level of sophistication of underlying economy like Brazil, Russia, China, Korea, US, Sweden may be in different stages of internet ad market evolution.
For example-
A study in cost and revenue curves for certain keywords across domains across various ad providers across various locations from 2003-2008 can help academia and research (much more than top ten lists of popular terms like non quantitative reports) as well as ensure that current algorithmic wightings are not inadvertently given away.
Part 2- of this series will explore the ways to create third party re-sellers of keywords and measuring impacts of search and ad engine optimization based on keywords.

LibreOffice – Extensions and Templates

Just an announcement from The Document Foundation (which has notable supporters including Google etc at http://www.documentfoundation.org/supporters/)

With both Google Docs and Libre Office – it seems like a flank attack on Office productivity software (from the cloud and from the PC/tablet ground)- however Microsoft’s Sharepoint is much better in collobration compared to the Google Docs and it has huge number of templates (more than the 38 extensions and 13 templates right now at the links below (just like WordPress has huge number of themes compared to Blogger)

Anyways, check out- it is an interesting start


Extension Releases 

Extensions for all program modules
Gallery Contents for all program modules
Language Tools for all program modules
Dictionaries of different languages for all program modules


and http://templates.libreoffice.org/

Template Releases

Accounting -Templates
Business POS-Templates
Business Shipping-Templates
Elementary/Secondary School-Templates
To Do List-Templates



Get every new post delivered to your Inbox.

Join 735 other followers