#rstats Review of my book R for Cloud Computing in JSS Journal of Statistical Software

A review of R for Cloud Computing is out in the Journal of Statistical Software.


This is a lively book on a timely topic – or rather, a pair of topics, as the book is as much about R as it is on cloud computing. It should prove useful for those interested in the confluence of the two subject areas.


The book features a number of interviews with prominent figures in data science. Though arguably a bit out of place, I believe that most readers will find them interesting and worth inclusion. This book should be of interest to anyone who is new to data storage and analysis in the cloud, especially with R, and even veteran users will find something new here and there.

And areas where the author needs to work much, much harder:

The book aims to provide step-by-step instructions for painlessly and quickly getting the novice user into the cloud. It does succeed in this for the most part, but any such effort will not be 100% painless after all. Readers who lack background in the cloud may feel overwhelmed at times at the beginning, given all the possible choices and myriad terms. In fact, some terms seem to be undefined, and there is no index (though there is a good bibliography). The figures are inline rather than referenced via numbers, and in some cases they are rather distant from the associated text. The font size in the figures may be too small for comfortable reading for some people.

Read the full review here http://www.jstatsoft.org/v66/b04/paper

and get a look at the full book here http://www.springer.com/book/9781493917013



Many thanks to Dr Matloff for the encouragement.

I may have been forced to drop out of U Tennessee Knoxville MS Stats on health grounds in 2010 but I get by with hard work and chutzpah.


Trying to improve the supply of Data Scientists without ripping off young people

In a previous post, I said that many corporates are trying to benefit from the demand for data science as applied to their sector or company, but not many are doing enough to improve the supply of data scientists.


Anecdotally, many students in India and the USA have argued that many training companies charge exorbitant amounts and make misguided promises, essentially teaching tools and techniques but not the essential analytical mindset for slicing and dicing data, nor enough to strike a balance between the three skills a data scientist needs: statistics, programming and business perspective.

Added to this, many people building tools for data scientists have not worked in data science consulting themselves but are addicted to one platform or product due to commercial or intellectual compulsions.


Here is what I think could be a supply-side solution to the problem of unmet demand for data scientists hindering the actual benefits of data science to humanity, regardless of commercial or social sector.

  1. Build up a pool of curated best practice training
  2. Get them validated and verified across different business sectors by industry experts
  3. Add hardware or cloud training to software training
  4. Offer them on accessible platforms like mobile, tablet and web
  5. Offer them in accessible languages like Spanish, Swahili, Chinese and Arabic as well
  6. Gamify some of the content to make it interesting; basically, start creating data science hackers at an earlier age than just postgraduate students
  7. Tie up with industry to offer internships that are fair, balanced and demand equal commitment
  8. Tie in soft skill training for better professionalism
  9. Offer all this for free, but use the data generated to improve it, not only through human intervention but also through computer-adaptive training and testing
  10. Monetize only after you reach a huge scale, not prematurely
  11. Make it interactive using videos, webinars and 15 minutes of weekly personalized help on Skype from support, but capture data continuously to drive engagement metrics

Do you want to just make money on the (uncertain) demand for data science, or do you want to make more money on the supply side of data science too?


How to get an internship in data science

Many students want to get an internship in data science.

Here is a list of free resources and THINGS TO DO to help you prepare BEFORE the interview



http://swirlstats.com/ (do the basic course and the regression course in swirlstats)
4. Write one blog post a week of at least 400 words on what you learnt in the week before

Social Media Networking for Data Scientists Top Groups On LinkedIn Facebook Twitter and Google Plus

A list of places on the Internet where you want to hang out if you want to build your name and fame, as well as read and share content for data science and big data analytics.

Do you want to add a list or group? Just put it in the comments on the DecisionStats.com post.

(list compiled by our Data Science Intern, Prerna Sahay)


Facebook Groups:

  1. Analytics, data mining, predictive modeling, big data (https://www.facebook.com/groups/data.analytics/ )
  2. Apache Hadoop (https://www.facebook.com/groups/158386177549436/ )
  3. Apache Hadoop Ecosystem (https://www.facebook.com/groups/hadoop.group/ )
  4. Big Data (https://www.facebook.com/groups/BigDataisonline/ )
  5. Big Data Analytics using R (https://www.facebook.com/groups/434352233255448/ )
  6. Big Data Analytics with R and Hadoop (https://www.facebook.com/groups/rhadoop/ )
  7. Big Data Hadoop NOSQL Hive Hbase (https://www.facebook.com/groups/bigdatahadoop/ )
  8. Big Data Learnings (https://www.facebook.com/groups/bigdatalearnings/ )
  9. Big Data Malaysia (https://www.facebook.com/groups/bigdatamy/ )
  10. Big Data, Data Science, Data Mining & Statistics (https://www.facebook.com/groups/bigdatastatistics/ )
  11. BigData/Hadoop Expert (https://www.facebook.com/groups/BigDataExpert/ )
  12. Chennai Hadoop and Big Data User Group (https://www.facebook.com/groups/chennaihadoop/ )
  13. Data Mining / Machine Learning / AI (https://www.facebook.com/groups/machinelearningforum/ )
  14. Data Mining / Big Data (https://www.facebook.com/groups/dataminingsocialnetworks/ )
  15. Hadoop Administrators (https://www.facebook.com/groups/hadoop.admins/ )
  16. Hadoop Developers India (https://www.facebook.com/groups/423391947699826/ )
  17. Hadoop in Action (https://www.facebook.com/groups/haddopinaction/ )
  18. Hadoop Jobs (https://www.facebook.com/groups/hadoopjobs/ )
  19. Hadoop Material (https://www.facebook.com/groups/416616701771842/ )
  20. Hadoop User Group (https://www.facebook.com/groups/hadoopcrunch/ )
  21. Tackling the Challenges of Big Data (https://www.facebook.com/groups/tcobd/?ref=browser )
  22. MapReduce (https://www.facebook.com/groups/mapreducegroup/ )
  23. Tableau Software User Group (https://www.facebook.com/groups/181682408543566/ )
  24. Spotfire Group (https://www.facebook.com/groups/766623530030197/ )
  25. Hadoop Big Data- The next Big Thing (https://www.facebook.com/groups/hadoop.big.data/ )
  26. Coursera (https://www.facebook.com/groups/CourseraConnections/?ref=browser )

LinkedIn Groups:

  1. KDnuggets Analytics, Data Mining and Data Science (https://www.linkedin.com/grp/home?gid=54257 )
  2. Cloud Computing (https://www.linkedin.com/grp/home?gid=61513 )
  3. Quantitative Analysis Professional (https://www.linkedin.com/grp/home?gid=71149 )
  4. Online Data Visualisation (https://www.linkedin.com/grp/home?gid=3707334 )
  5. Big Data and analytics (https://www.linkedin.com/grp/home?gid=4332669 )
  6. Data Scientists (https://www.linkedin.com/grp/home?gid=2013423 )
  7. Predictive Analytics Network (PAN) (https://www.linkedin.com/grp/home?gid=1849479 )
  8. Data Mining Pioneers (https://www.linkedin.com/groups/Data-Mining-Pioneers-64585/about )
  9. Big Data | Analytics | Strategy | Finance | Innovation (https://www.linkedin.com/grp/home?gid=1814785 )
  10. Business Intelligence and Analytics (3527380) (https://www.linkedin.com/grp/home?gid=3527380 )
  11. Indian Internet of Things (IIoT) (https://www.linkedin.com/grp/home?gid=8198420 )
  12. Big Data , Analytics , Business Intelligence & Visualization Experts Community (https://www.linkedin.com/grp/home?gid=23006 )
  13. Data Science , Big Data and Analytics Executives (https://www.linkedin.com/groups/Data-Science-Big-Data-Analytics-5074372/about )
  14. People Learning R (https://www.linkedin.com/grp/home?gid=5150073 )
  15. Predictive Analytics World (https://www.linkedin.com/grp/home?gid=1005097 )
  16. Springer Network (https://www.linkedin.com/groups/Springer-Network-29206/about )
  17. India Analytics Network (https://www.linkedin.com/grp/home?gid=1858675 )
  18. RDataMining: R and Data Mining (https://www.linkedin.com/grp/home?gid=4066593 )
  19. Business Intelligence
  20. R/Finance (https://www.linkedin.com/grp/home?gid=155029 )
  21. Big Data, Analytics and Data Science Training (https://www.linkedin.com/grp/home?gid=4989164 )
  22. R Developers and Users Group (https://www.linkedin.com/grp/home?gid=3740742 )
  23. Information Security Community (https://www.linkedin.com/grp/home?gid=38412 )
  24. Predictive Model Markup Language (PMML) (https://www.linkedin.com/grp/home?gid=2328634 )
  25. The R Project for Statistical Computing (https://www.linkedin.com/grp/home?gid=77616 )
  26. Spotfire User Group – SFUG for Spotfire Analytics Developers , Enthusiasts and Practioners (https://www.linkedin.com/grp/home?gid=3984312 )
  27. Spotfire Developers , Consultants and Partners (https://www.linkedin.com/groups?gid=2480584 )
  28. Spotfire Enthusiasts (https://www.linkedin.com/grp/home?gid=3752855 )
  29. Machine Learning Connection (https://www.linkedin.com/groups?gid=70219)
  30. SAS Analytics & BI (https://www.linkedin.com/groups?gid=130238 )

Twitter Hashtags:

  1. #rstats (https://twitter.com/search?q=%23rstats )
  2. #datascience (https://twitter.com/hashtag/datascience?src=rela )
  3. #bigdata (https://twitter.com/hashtag/bigdata?src=rela )
  4. #iot (https://twitter.com/hashtag/iot?src=rela )
  5. #analytics (https://twitter.com/hashtag/analytics?src=rela )
  6. #internetofthings (https://twitter.com/hashtag/internetofthings?src=rela )
  7. #tableau (https://twitter.com/search?q=%23tableau)
  8. #dataviz (https://twitter.com/hashtag/dataviz?src=rela )
  9. #machinelearning (https://twitter.com/hashtag/machinelearning?src=rela )
  10. #spotfire (https://twitter.com/search?q=%23spotfire)
  11. @tibco (https://twitter.com/search?q=%40tibco )
  12. #businessintelligence (https://twitter.com/search?q=%23businessintelligence )
  13. #deeplearning (https://twitter.com/search?q=%23deeplearning )
  14. #ai (https://twitter.com/hashtag/ai?src=rela )
  15. #hadoop (https://twitter.com/search?q=%23hadoop )
  16. #cloud (https://twitter.com/search?q=%23cloud )
  17. #python (https://twitter.com/search?q=%23python )
  18. #django (https://twitter.com/search?q=%23django )
  19. #statistics (https://twitter.com/search?q=%23statistics )

Google Plus communities:

  1. Data Science – Data , Knowledge, Action (https://plus.google.com/u/0/communities/104673320232127474190 )
  2. Big Data – Big Questions? Big Data = Big Answers (https://plus.google.com/u/0/communities/118194042397414247987 )
  3. Data Science – making big data small (https://plus.google.com/u/0/communities/113253740387558560113 )
  4. Machine Learning – The beauty of the artificial mind (https://plus.google.com/u/0/communities/101342316728284418850 )
  5. Hadoop – Articles , discussion and learning (https://plus.google.com/communities/105735667520214958344 )
  6. Google Analytics – The largest GA user community (https://plus.google.com/communities/114481059214254340537 )
  7. Statistics and R – interested in R and statistics? Join us! (https://plus.google.com/communities/117681470673972651781 )
  8. Python – Unofficial Python Community (https://plus.google.com/communities/103393744324769547228 )
  9. Machine learning, IR, Mining, Big Data – ML, IR, KDD, Big Data Mining, Search, Social Networks (https://plus.google.com/communities/112064568745102322361 )
  10. Machine Learning – Academia, Industry and anyone who has an interest on ML and Data (https://plus.google.com/communities/107785538899595981479 )
  11. Big Data – Big Data, Analytics and Data Science (https://plus.google.com/communities/107156514183161811383 )
  12. Big Data professionals – This G+ page is for everyone involved in the development of applications using Big Data. Innovate together with Big Data professionals! (https://plus.google.com/communities/101646309652442505961 )
  13. Big Data. Artificial Intelligence. Bi. – Internet of Things / IoT / M2M / Machine Learning ▪ Crypto Currencies ▪ Bitcoin ▪ Artificial Intelligence ▪ Digital Currencies (https://plus.google.com/communities/114206007718004250940 )
  14. Big Data – Exploring how big data is changing the world (https://plus.google.com/communities/109707855685220573696 )
  15. Big Data R&D (https://plus.google.com/communities/103487294531677099010 )
  16. Big Data, Economy & Technology – Checking innovations and using Data to build efficient decision platforms and amazing visualizations. (https://plus.google.com/communities/104041697322064738236 )
  17. Business Intelligence – Big Data, Data Visualization, Actionable Insights, Software, Tools & Solutions (https://plus.google.com/communities/110061721590251903650 )

Top five unethical companies in India

Not including political parties or state governments, here is a list of the top 5 unethical companies in India in terms of controversies:

  1. Company 1 - can't say because they bought ads on my media channel
  2. Company 2 - censored because they are friends of the government
  3. Company 3 - threatened with goondas as well as lawsuits
  4. Company 4 - included only so we can play with its stock price
  5. Company 5 - a company on its way down with no friends to help

That is the way the financial media reports news on companies in India. This is because, unlike in the United States, our SEBI (equivalent to the SEC) does not investigate insider trading with the same zeal as Preet Bharara does.


Sorry for the spam.






The Supply Side of Data Science

People all over tell me how big the demand for data science is, and how much of a shortage of data scientists they see.

[Image: screenshot, 2015-08-24]


A new survey by MIT (sponsored by SAS) points to this looming gap between the demand for and supply of data scientists. (Side note: I am still surprised that companies insist on registration, in this era of OpenID, for downloading white papers like these.)

The Sloan paper is very nice and points to this; the image above is from it. You can look here

Companies like IBM, Oracle, SAP, HP, SAS, Revolution Analytics, RStudio, Cloudera and Continuum Analytics are focusing more on capturing the demand for data science, as it is very lucrative. They do so by putting enough resources into marketing to help explain their offerings, and by sponsoring thought leaders and white papers. Training remains a back-end activity, considered non-critical to a software vendor in data science. Quite disappointingly, these trainings are often expensive and lack customization for international audiences. Why not capture your training on videos and sell them for $20, dear people?

But here lies the catch: if you train data scientists on your platform early on, you own them for life.

Perhaps software vendors can focus on their core competencies of data science demand satisfaction and invest in training collateral too.


Some thoughts on this:

  • People need a human touch. Not everything can be automated via apps, videos and quizzes. That is partly why Coursera has a low pass rate.
  • Demand for data science teachers is even harder to meet than demand for plain data scientists
  • If you train people on your platform, they champion that software wherever they go
  • Increasingly, people want to be trained in multiple software packages to hedge risk to their career.
  • Independent cross-platform trainers are even fewer than trainers who can train in one language or data science platform
  • Most training, including MOOCs, tends to be in English. This leaves out a big chunk of humanity who could have helped create the necessary data scientists, including Chinese-, Arabic- and Spanish-speaking people
  • Governments have helped improve literacy but are ignorant of the data science skill shortage, partly because governments find it even tougher to attract people skilled enough to make data science policy.
  • The country with the best and largest number of data scientists would win the race in the next few decades, or at least have a superb edge in innovation
  • Ask not what you can get from data science, ask what you can do to make more copies of yourself as a superb data scientist. This goes out to the data science celebrities
  • Machine learning continues to be woefully under-taught in colleges, especially in Asia (and, I suspect, in the USA)
  • Many, many universities struggle to keep professors with tenure for life updated on the skills and new languages pertinent to data science
  • Some parts of the data science ecosystem remain prone to corruption and self-centred tactics, including influencing data science writers or analysts. The sum of many local optima (vendors in software or training education) is not a global optimum (for the industry, country, humanity)

Everybody wants to use data science but nobody wants to help create more data scientists. Do you agree or disagree?