CloudNumbers.com – #Rstats gets real in the cloud

I came across Cloudnumbers.com. Awesome name, I didn't know groovy domain names existed anymore.

What is cloudnumbers.com? The website, which looks like the salesforce.com website in style and design,

says-

Things are still very raw here, but it's an awesome concept. With 68 GB of memory, I am sure R can blow everything else out of the water.

Probably the competition needs to, ahem, launch that private cloud soon, before they lose the momentum.

And you get 2 GB of storage, 2 GB of traffic and 10 hours of computation per month for free! I think this German startup has hit the nail on the head, and it will be interesting to see what the future holds.

 

Check out http://cloudnumbers.com/product yourself and/or see the video

https://www.youtube.com/v/0ZNEpR_ElV0?version=3&hl=en_US&hd=1

 

Spam Analysis Akismet-WPStats-Blogging

Here is a brief dataset I put together after one hour of cutting and pasting from WordPress.com’s creative data style formats. It shows spam, comments, traffic, and number of posts written monthly.

Clearly, monthly traffic is directly related to the number of posts I write (suppose Traffic = A + B * Posts).

But spam is showing discontinuous growth, especially after a big month (in which Reddit helped).

Akismet had some missing historical values (which is curious).

So what can we do with this dataframe in R or any other statistical software?

Spam Analysis

Month   Spam detected  Traffic excluding spam  Posts written  Traffic/Post  Spam/Post  Spam/Traffic  Ham detected  Missed spam  False positives
Feb-11  1848           5079                    18             282.17        102.6667   36.39%        4.00          6.00         0.0%
Jan-11  3724           10238                   35             292.51        106.4      36.37%        0.00          3.00         0.0%
Dec-10  3676           10345                   35             295.57        105.0286   35.53%        8.00          6.00         0.0%
Nov-10  3680           11723                   71             165.11        51.83099   31.39%        24.00         3.00         0.0%
Oct-10  2292           16430                   71             231.41        32.28169   13.95%        24.00         18.00        0.0%
Sep-10  0              17913                   63             284.33        0          0.00%         0.00          0.00         0.0%
Aug-10  0              5403                    17             317.82        0          0.00%         0.00          0.00         0.0%
Jul-10  2              5041                    10             504.1         0.2        0.04%         0.00          0.00         0.0%
Jun-10  5              4271                    11             388.27        0.454545   0.12%         10.00         1.00         0.0%
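
One possible start on that question is sketched below, in Python rather than R and using only the month, spam, traffic and posts columns of the table. It recomputes the per-post ratios and fits the A + B * Posts line by ordinary least squares. The variable names and the choice of a plain least-squares fit are assumptions made for illustration, and the zero spam counts for Sep-10 and Aug-10 may be missing Akismet values rather than true zeros.

```python
# Monthly figures copied from the table: (month, spam, traffic, posts).
rows = [
    ("Feb-11", 1848, 5079, 18),
    ("Jan-11", 3724, 10238, 35),
    ("Dec-10", 3676, 10345, 35),
    ("Nov-10", 3680, 11723, 71),
    ("Oct-10", 2292, 16430, 71),
    ("Sep-10", 0, 17913, 63),
    ("Aug-10", 0, 5403, 17),
    ("Jul-10", 2, 5041, 10),
    ("Jun-10", 5, 4271, 11),
]


def fit_line(xs, ys):
    """Ordinary least squares for ys = a + b * xs; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


posts = [r[3] for r in rows]
traffic = [r[2] for r in rows]
a, b = fit_line(posts, traffic)  # Traffic = a + b * Posts

# The derived columns of the table, recomputed from the raw counts.
ratios = {
    month: {
        "traffic_per_post": traf / n_posts,
        "spam_per_post": spam / n_posts,
        "spam_share_of_traffic": spam / traf,
    }
    for month, spam, traf, n_posts in rows
}

print(f"Traffic ~ {a:.0f} + {b:.0f} * Posts")
for month, r in ratios.items():
    print(month, round(r["traffic_per_post"], 2), f"{r['spam_share_of_traffic']:.2%}")
```

With only nine months of data, the fitted slope is a rough indication of extra monthly traffic per extra post, not a reliable estimate.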

Interview Ajay Ohri Decisionstats.com with DMR

From-

http://www.dataminingblog.com/data-mining-research-interview-ajay-ohri/

Here is the winner of the Data Mining Research People Award 2010: Ajay Ohri! Thanks to Ajay for giving some time to answer Data Mining Research questions. And all the best to his blog, Decision Stats!

Data Mining Research (DMR): Could you please introduce yourself to the readers of Data Mining Research?

Ajay Ohri (AO): I am a business consultant and writer based out of Delhi, India. I have been working in and around the field of business analytics since 2004, and have worked with some very good and big companies, primarily in financial analytics and outsourced analytics. Since 2007, I have been writing my blog at http://decisionstats.com which now has almost 10,000 views monthly.

All in all, I write about data, and my hobby is also writing (poetry). Both my hobby and my profession stem from my education (a master's in business and a bachelor's in mechanical engineering).

My research interests in data mining are interfaces (simpler interfaces to enable better data mining), education (making data mining less complex and accessible to more people and students), and time series and regression (specifically ARIMAX).
In business, my research interests are software marketing strategies (open source, software as a service, advertising supported versus traditional licensing) and the creation of technology and entrepreneurial hubs (like Palo Alto and Research Triangle, or Bangalore, India).

DMR: I know you have worked with both SAS and R. Could you give your opinion about these two data mining tools?

AO: As per my understanding, SAS stands for the SAS language, SAS Institute and the SAS software platform. The terms are used interchangeably by people in industry and academia, but there have been some branding issues on this.
I have not worked much with SAS Enterprise Miner, probably because I could not afford it as a business consultant, and the organizations I worked with did not have a budget for Enterprise Miner.
I have worked alone and in teams with Base SAS, SAS Stat, SAS Access, and SAS ETS, and JMP. I also worked with SAS BI, but as a user to extract information.
You could say my use of the SAS platform was mostly in predictive analytics and reporting, but I have a couple of projects under my belt for knowledge discovery and data mining, and pattern analysis. Again, some of my SAS experience is a bit dated, from almost a year ago.

I really like specific parts of the SAS platform, as in the interface design of JMP (which is better than Enterprise Guide or Base SAS) and Proc Sort in Base SAS. I guess sequential processing of data makes SAS way faster, though with computing evolving from desktops and servers to even cheaper time-shared cloud computers, I am not sure how long Base SAS and SAS Stat can hold this unique selling proposition.

I dislike the clutter in SAS Stat output; it confuses me with too much information. I also dislike the shoddy graphics in the rendered output of the SAS graphical engine. It's shoddy coding work in SAS/Graph, and if JMP can give better graphics, why is legacy source code preventing the SAS platform from doing a better job of it?

I sometimes think the best part of SAS is actually the code written by Goodnight and Sall in the 1970s; the latest procs don’t impress me much.

SAS as a company is something I admire, especially for its way of treating employees globally, but it is strange to see the rest of the tech industry not following it. Also, I don’t like the over-aggression and the SAS versus Rest of the Analytics/Data Mining World mentality that I sometimes pick up when I deal with industry thought leaders.

My wishlist is SAS Enterprise Miner, JMP, and Base SAS in a completely new web interface priced at per-hour rates, but I guess I am a bit sentimental here: most data miners I know from the early 2000s did start with SAS as their first bread-earning software. Also, I think SAS needs to be better priced in Business Intelligence; it seems quite cheap in BI compared to Cognos/IBM but expensive in analytical licensing.

If you are a new stats or business student, chances are you know much more R than SAS today. The shift in education at least has been very rapid, and I guess R is also more of a platform than an analytics or data mining software.

I like a lot of things in R, from graphics to better data mining packages and the modular design of the software, but above all I like the can-do, kick-ass spirit of the R community. Lots of young people collaborating with lots of young to old professors, and the energy is infectious. Everybody is a CEO in R’s world. The latest data mining algorithms will probably start in R, published in journals.

Which is better for data mining, SAS or R? It depends on your data and your deadline. The golden rule of management and business is: it depends.

Also I have worked with a lot of KXEN, SQL, SPSS.

DMR: Can you tell us more about Decision Stats? You had traffic of 120,000 for 2010. How did you reach such success?

AO: I don’t think 120,000 is a success. It's not a failure. It just happened: the more I wrote, the more people read. In 2007-2008 I used to obsess over traffic. I tried SEO, comments, back linking, and I did some black hat experimental stuff. Some of it worked, some didn’t.

In the end, I started asking questions and interviewing people. To my surprise, senior management is almost always more candid, frank and honest about their views, while middle managers, public relations and marketing folks can be defensive.

Social media helped a bit: Twitter, LinkedIn and Facebook really helped my network of friends, who I suppose acted as informal ambassadors to spread the word.
Again, I was constrained more by necessity than by choice: by my middle-class finances (I also had a baby son in 2007, and my current laptop still has some broken keys :) ), by my inability to afford traveling to conferences, and by my location, as Delhi isn’t really a tech hub.

The more questions I asked around the internet, the more people responded, and I wrote it all down.

I guess I just was lucky to meet a lot of nice people on the internet who took time to mentor and educate me.

I tried building other websites but didn’t succeed, so I guess I really don’t know. I am not a smart coder, not very clever at writing, but I do try to be honest.

Basic economics says pricing is proportional to demand and inversely proportional to supply. Honest and candid opinions have infinite demand and an uncertain supply.

DMR: There is a rumor about a R book you plan to publish in 2011 :-) Can you confirm the rumor and tell us more?

AO: I just signed a contract with Springer for “R for Business Analytics”. R is a great software, and there are lots of books for statistically trained people, but I felt like writing a book for MBAs and existing analytics users on how to easily transition to R for analytics.

Like any language there are tricks and tweaks in R, and with a focus on code editors, IDE, GUI, web interfaces, R’s famous learning curve can be bent a bit.

Making analytics beautiful, and simpler to use is always a passion for me. With 3000 packages, R can be used for a lot more things and a lot more simply than is commonly understood.
The target audience however is business analysts- or people working in corporate environments.

Brief Bio-
Ajay Ohri has been working in the field of analytics since 2004, when it was still a nascent, emerging industry in India. He has worked with the top two Indian outsourcers listed on NYSE, and with Citigroup on cross-sell analytics, where he helped sell an extra 50,000 credit cards. He was one of the very first independent data mining consultants in India working on analytics products and domestic Indian market analytics. He regularly writes on analytics topics on his web site www.decisionstats.com and is currently working on open source analytical tools like R besides analytical software like SPSS and SAS.

How to balance your online advertising and your offline conscience

I recently found an interesting example of a website that both makes a lot of money and yet is much more efficient than any free or non-profit site. It is called ECOSIA.

If you want to see a website that balances its administrative costs and has a transparent way to make the world better, this is a great example.

  http://ecosia.org/how.php

  HOW IT WORKS
  • You search with Ecosia.
  • Perhaps you click on an interesting sponsored link.
  • The sponsoring company pays Bing or Yahoo for the click.
  • Bing or Yahoo gives the bigger chunk of that money to Ecosia.
  • Ecosia donates at least 80% of this income to support WWF’s work in the Amazon.
  • If you like what we’re doing, help us spread the word!

  Key facts about the park:

    • World’s largest tropical forest reserve (38,867 square kilometers, or about the size of Switzerland)
    • Home to about 14% of all amphibian species and roughly 54% of all bird species in the Amazon – not to mention large populations of at least eight threatened species, including the jaguar
    • Includes part of the Guiana Shield containing 25% of world’s remaining tropical rainforests – 80 to 90% of which are still pristine
    • Holds the last major unpolluted water reserves in the Neotropics, containing approximately 20% of all of the Earth’s water
    • One of the last tropical regions on Earth vastly unaltered by humans
    • Significant contributor to climatic regulation via heat absorption and carbon storage

     

    http://ecosia.org/statistics.php

    They claim to have donated 141,529.42 EUR !!!

    http://static.ecosia.org/files/donations.pdf

    Well, suppose you are the web admin of a very popular website like Wikipedia.

    One way to meet server costs is to say openly: hey, I need to balance my costs, so I need some money.

    The other way is to use online advertising.

    I started mine with Google AdSense.

    Cost per mille (CPM, the price per thousand impressions) advertising gives you a very low conversion compared to contacting an ad sponsor directly.

    But it's a great data experiment, as you can monitor:

    which companies are likely to be advertised on your site (assume Google knows more about their algorithms than you will)

    which formats (banner, text or Flash) have what kind of conversion rates

    what the expected payoff rates are from various keywords or companies (business intelligence software, predictive analytics software and statistical computing software are similar but have different expected returns, if you remember your eco class)

     

    NOW, based on the above data, you know the minimum baseline to expect from a private advertiser rather than from a public, crowd-sourced search engine one (like Google or Bing).

    Let's say you have 100,000 views monthly, and assume one out of 1,000 page views will lead to a click. Say the advertiser will pay you $1 for every click (= 1,000 impressions).

    Then your expected revenue is $100. But if your clicks are priced at $2.50 each and your click-through rate is now 3 out of 1,000 impressions (both very moderate increases that can be done by basic placement optimization of ad type, graphics etc.), your new revenue is $750.

    Be a good Samaritan: you decide to share some of this with your audience, like 4 Amazon books per month (or 1 free Amazon book per week). That gives you a cost of $200, and leaves you with some $550.
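
    The arithmetic above can be checked with a few lines of Python. This is only a sketch under the post's own assumptions (100,000 monthly views, flat click-through rates and per-click prices); the function name monthly_ad_revenue is made up for illustration.

```python
def monthly_ad_revenue(page_views, clicks_per_1000_views, price_per_click):
    """Expected monthly revenue from a simple pay-per-click arrangement."""
    clicks = page_views * clicks_per_1000_views / 1000
    return clicks * price_per_click


baseline = monthly_ad_revenue(100_000, 1, 1.00)    # 1 click per 1,000 views at $1
optimized = monthly_ad_revenue(100_000, 3, 2.50)   # 3 clicks per 1,000 views at $2.50
left_after_books = optimized - 200                 # after the $200 of Amazon books

print(baseline, optimized, left_after_books)  # 100.0 750.0 550.0
```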

    Wait! It doesn't end there. Adam Smith’s invisible hand moves on.

    You say, hmm, let me put $100 towards an annual paper writing contest of $1,000, donate $200 to One Laptop per Child (or to Amazon rain forests, or to Haiti, etc.), pay $100 for upgraded server hosting, and put $350 into online advertising, say $200 for search engines and $150 for Facebook.

    Woah!

    Month 1 should see more people visiting you for the first time. If you have a good return rate (returning visitors as a %) and a low bounce rate (visits of less than 5 secs), your traffic should see at least a 20% jump in new arrivals and 5-10% in long term arrivals. Ignoring bounces, within three months you will have one of the following:

    1) An interesting case study on statistics on online and social media advertising, tangible motivations for increasing community response, and some good data for study

    2) Hopefully, better cost management of your server expenses

    3) Very hopefully, a positive cash flow

    You could even set a percentage and share the monthly (or, better, annual) numbers with your readers and advertisers.

    Go ahead, change the world!

    The key paradigms here are:

    sharing your traffic and revenue openly with everyone

    donating to a suitable cause

    helping increase awareness of the suitable cause

    basing donations on fixed percentages rather than absolute numbers, to ensure your site and cause are sustained for years.

    The Year 2010

    My annual traffic to this blog was almost 99,000. Add in additional views on networking sites plus the 400-plus RSS readers, and I can say traffic was 120,000 for 2010. Nice. Thanks for reading, and I hope it was worth your time. (This is a long post and will take almost 440 secs to read, but the summary has just been given.)

    My intent is either to inform you, give you something useful, or at least something interesting.

    See below-

    Year  Jan    Feb    Mar    Apr    May    Jun
    2010  6,311  4,701  4,922  5,463  6,493  4,271

    Year  Jul    Aug    Sep     Oct     Nov     Dec     Total
    2010  5,041  5,403  17,913  16,430  11,723  10,096  98,767

     

     

    Sandro Saitta from http://www.dataminingblog.com/ just named me for an award on his blog (but my surname is ohRi, and Sandro left me without an R. What would I be without R :)) ).

    Aw! I am touched. Google for “Data Mining Blog” and you will see Sandro is as good as it gets in data mining writing.

    DMR People Award 2010
    There are a lot of active people in the field of data mining. You can discuss with them on forums. You can read their blogs. You can also meet them in events such as PAW or KDD. Among the people I follow on a regular basis, I have elected:

    Ajay Ori

    He has been very active in 2010, especially on his blog. Good work Ajay and continue sharing your experience with us!

    What did I write in 2010? Stuff.

    What did you read on this blog? Well, that's the top posts list.

    2009-12-31 to Today

    Title Views
    Home page 21,150
    Top 10 Graphical User Interfaces in Statistical Software 6,237
    Wealth = function (numeracy, memory recall) 2,014
    Matlab-Mathematica-R and GPU Computing 1,946
    The Top Statistical Softwares (GUI) 1,405
    About DecisionStats 1,352
    Using Facebook Analytics (Updated) 1,313
    Test drive a Chrome notebook. 1,170
    Top ten RRReasons R is bad for you ? 1,157
    Libre Office 1,151
    Interview Hadley Wickham R Project Data Visualization Guru 1,007
    Using Red R- R with a Visual Interface 854
    SAS Institute files first lawsuit against WPS- Episode 1 790
    Interview Professor John Fox Creator R Commander 764
    R Package Creating 754
    Windows Azure vs Amazon EC2 (and Google Storage) 726
    Norman Nie: R GUI and More 716
    Startups for Geeks 682
    Google Maps – Jet Ski across Pacific Ocean 670
    Not so AWkward after all: R GUI RKWard 579
    Red R 1.8- Pretty GUI 570
    Parallel Programming using R in Windows 569
    R is an epic fail or is it just overhyped 559
    Enterprise Linux rises rapidly:New Report 537
    Rapid Miner- R Extension 518
    Creating a Blog Aggregator for free 504
    So which software is the best analytical software? Sigh- It depends 473
    Revolution R for Linux 465
    John Sall sets JMP 9 free to tango with R 460

    So how do people come here –

    Well, I guess I owe Tal G for almost 9,000 views. (Incidentally, I withdrew my blog from the R-Bloggers and Analyticbridge blogs, due to SEO keyword reasons and some spam I was getting; see below.)

    http://r-bloggers.com is still the cat’s whiskers and I read it a lot.

    I still don't know who linked my blog to a free sex movie site with 400 views, but I have a few suspects.

    2009-12-31 to Today

    Referrer Views
    r-bloggers.com 9,131
    Reddit 3,829
    rattle.togaware.com 1,500
    Twitter 1,254
    Google Reader 1,215
    linkedin.com 717
    freesexmovie.irwanaf.com 422
    analyticbridge.com 341
    Google 327
    coolavenues.com 322
    Facebook 317
    kdnuggets.com 298
    dataminingblog.com 278
    en.wordpress.com 185
    google.co.in 151
    xianblog.wordpress.com 130
    inside-r.org 124
    decisionstats.com 119
    ifreestores.com 117
    bits.blogs.nytimes.com 108

    Still reading this post? Gosh, let me sell you some advertising. It is only $100 a month (yes, it's a recession).

    Advertisers are treated on a first in, last out (FILO) basis.

    I have been told I am obsessed with SEO, but I don't care much for search engines apart from Google. And yes, SEO is an interesting science (they should really rename it GEO, or Google Engine Optimization).

    Apparently Hadley Wickham and Donald Farmer are big keywords for me, so I should be more respectful, I guess.

    Search Terms for 365 days ending 2010-12-31 (Summarized)

    2009-12-31 to Today

    Search Views
    libre office 925
    facebook analytics 798
    test drive a chrome notebook 467
    test drive a chrome notebook. 215
    r gui 203
    data mining 163
    wps sas lawsuit 158
    wordle.net 133
    wps sas 123
    google maps jet ski 123
    test drive chrome notebook 96
    sas wps 89
    sas wps lawsuit 85
    chrome notebook test drive 83
    decision stats 83
    best statistics software 74
    hadley wickham 72
    google maps jetski 72
    libreoffice 70
    doug savage 65
    hive tutorial 58
    funny india 56
    spss certification 52
    donald farmer microsoft 51
    best statistical software 49

    What about outgoing links? Apparently I need to find a way to ask Google to pay me for the free advertising I gave their Chrome notebook launch. But since their search engine and browser are free to me, I guess we are even steven.

    Clicks for 365 days ending 2010-12-31 (Summarized)

    2009-12-31 to Today

    URL Clicks
    rattle.togaware.com 378
    facebook.com/Decisionstats 355
    rapid-i.com/content/view/182/196 319
    services.google.com/fb/forms/cr48basic 313
    red-r.org 228
    decisionstats.wordpress.com/2010/05/07/the-top-statistical-softwares-gui 199
    teamwpc.co.uk/products/wps 162
    r4stats.com/popularity 148
    r-statistics.com/2010/04/r-and-the-google-summer-of-code-2010-accepted-students-and-projects 138
    socserv.mcmaster.ca/jfox/Misc/Rcmdr 138
    spss.com/certification 116
    learnr.wordpress.com 114
    dudeofdata.com/decisionstats 108
    r-project.org 107
    documentfoundation.org/faq 104
    goo.gl/maps/UISY 100
    inside-r.org/download 96
    en.wikibooks.org/wiki/R_Programming 92
    nytimes.com/external/readwriteweb/2010/12/07/07readwriteweb-report-google-offering-chrome-notebook-test-11919.html 92
    sourceforge.net/apps/mediawiki/rkward/index.php?title=Main_Page 92
    analyticdroid.togaware.com 88
    yeroon.net/ggplot2 87

    So in 2010,

    SAS remained top daddy in business analytics,

    R made revolutionary strides in terms of new packages,

    JMP launched a new version,

    SPSS got integrated with Cognos,

    Oracle sued Google and did build a great Data Mining GUI,

    Libre Office gave you a non-Oracle Open Office (or an even more open office).

    2011 looks like a fun year. Have safe partying.

    2011 Forecast-ying

    I had recently asked some friends from my Twitter lists for their take on 2011. At least 3 of them responded with an answer, 1 said they were still on it, and 1 claimed a recent office event.

    Anyway, I take note of the view of forecasting from

    http://www.uiah.fi/projekti/metodi/190.htm

    The most primitive method of forecasting is guessing. The result may be rated acceptable if the person making the guess is an expert in the matter.

    Ajay- People will forecast in end 2010 and 2011. Many of them will get forecasts wrong, some very wrong, but by Dec 2011 most of them will be writing forecasts on 2012. Almost no one will get called on by irate users or readers (hey, you got 4 out of 7 wrong in last year's forecast!). It just won't happen. People thrive on hope. So does marketing. In 2011, and before.

    and some forecasts from Tom Davenport’s The International Institute for Analytics (IIA) at

    http://iianalytics.com/2010/12/2011-predictions-for-the-analytics-industry/

    Regulatory and privacy constraints will continue to hamper growth of marketing analytics.

    (I wonder how privacy and analytics can coexist in peace forever. One view is that model building can use anonymized data: suppose your IP address was anonymized using a standard secret formula, like the Coca-Cola recipe. Then whatever model does get built would not be of concern to you individually, as your privacy is protected by the anonymization formula.)
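
    One way to picture such a "secret formula" is a keyed hash. The sketch below is an illustration only, not anything the IIA or any vendor describes: it uses Python's standard hmac module, and the key and the sample addresses (taken from the reserved documentation ranges) are hypothetical. Strictly speaking this is pseudonymization: the same address always maps to the same token, so models can still be built on it, but whoever holds the key can re-link tokens to addresses.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be generated randomly and stored
# away from the data, since anyone holding it can re-link addresses to tokens.
SECRET_KEY = b"keep-this-formula-secret"


def pseudonymize_ip(ip_address: str, key: bytes = SECRET_KEY) -> str:
    """Map an IP address to a stable token using a keyed hash (HMAC-SHA256)."""
    return hmac.new(key, ip_address.encode("utf-8"), hashlib.sha256).hexdigest()


# Sample visits; the first and third come from the same address.
visits = ["203.0.113.7", "198.51.100.23", "203.0.113.7"]
tokens = [pseudonymize_ip(ip) for ip in visits]

# Repeat visitors stay linkable to each other without exposing the address.
print(tokens[0] == tokens[2], tokens[0] == tokens[1])
```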

    Anyway- back to the question I asked-

    What are the top 5 events in your industry (events as in things that occurred, not conferences) and what are the top 3 trends in 2011?

    I define my industry as being online technology writing- research (with a heavy skew on stat computing)

    My top 5 events for 2010 were-

    1) Consolidation- The Big 5 software providers in BI and Analytics bought more, sued more, and consolidated more. The valuations rose, and rose, leading to even more smaller players entering. Thus consolidation proved an oxymoron, as the total number of influential AND disruptive players grew.

     

    2) Cloudy Computing- Computing shifted away from the desktop, but to the mobile, and more to the tablet than to the cloud. An iPad front end with an Amazon EC2 back end: yup, it happened.

    3) Open Source grew louder- Yes, it got more clients, and more revenue. Did it get more market share? That depends on whether you define market share by revenues or by users.

    Both Open Source and Closed Source had a good year. The pie grew faster and bigger, so no one minded as long as their slices grew bigger.

    4) We didn't see that coming –

    Technology continued to surprise with events (that's what we love, the surprises!).

    Revolution Analytics broke through R’s Big Data barrier, Tableau Software created a big buzz, and Wikileaks and Chinese firewalls gave technology an entirely new dimension (though not a universally popular one).

    People fought wars on emails and servers and social media. Unfortunately, the ones fighting real wars in 2009 continued to fight them in 2010 too.

    5) Money-

    SAP, SAS, IBM, Oracle, Google and Microsoft made more money than ever before. Only Facebook got a movie made about itself. Venture capitalists pumped money into promising startups, really as if in a hurry to park money before tax cuts expired in some countries.

     

    2011 Top Three Forecasts

    1) Surprises- Expect to get surprised at least 10% of the time in business events. As the internet grows, the communication cycle shortens, the hype cycle amplifies buzz, and more unstructured data is created (especially for marketing analytics), leading to enhanced volatility.

    2) Growth- Yes, we predict technology will grow faster than the automobile industry. Game changers may happen in the form of Chrome OS (really, it's Linux, guys) and customer adaptability to new USER INTERFACES. Design will matter much more in technology on your phone, on your desktop and on your internet. Packaging sells.

    False Top Trend 3) I will write a book on business analytics in 2011. Yes, it is true, and I am working with a publisher. No, it is not really going to be a top 3 event for anyone except me, the publisher and the lucky guys who read it.

    3) Creating technology and technically enabling creativity will converge at an accelerated rate. Use of widgets, GUIs, snippets and IDEs will ensure creative left brains can code more easily, and right brains can design faster and better, thanks to a global supply chain of techie and artsy professionals.