Why does the American government need better data science than it currently gets?

  1. Government spends a lot of money tackling the toughest, least profitable problems.
  2. Government has trouble recruiting the best hackers, computer scientists and statisticians (the data science community), as they generally earn far more in the private sector for far easier problems (which ad do I want them to click?).
  3. Private companies in the USA can also outsource or hire H-1B visa workers for analytical needs, while the US government has to rely on US-citizen data scientists even for small, non-sensitive work such as calculating subsidies for factory farms for the Department of Agriculture.
  4. Meanwhile, the budget for IT digitization, electronic government and data science is quite small.
  5. Government has a lot more bureaucracy and is slow to get things done, which is a big turn-off for companies trying to become new data science vendors, thus leaving a big hand to pricey players like AH BEE HUMM.
  6. http://www.youtube.com/watch?v=25QyCxVkXwQ
  7. In election season, data scientists are in even shorter supply, as they work on analyzing and calculating the odds of winning states, or even work in teams for candidates (political parties in the US pay everyone working for them by cheque).
  8. Information security is one more area where the government lacks an adequate recruitment strategy.
  9. Hackers have a general aversion to working for any government (for less salary) unless they are given equity (here, successes like In-Q-Tel (https://www.iqt.org/) can be replicated not just for intelligence but for other departments as well by startup funds in hacker
  10. Software interfaces need to be updated for better data visualization and analytical communication across departments.
  11. More money can be invested in training existing federal employees in analytics, an analytical way of thinking, or even the basics of data science.

One more note: the US government can repair its relationship with the hacker activist community, even through small courtesies and track 2 diplomacy. That can help not just with business-as-usual data science (like where rain is going to fall in Florida and Louisiana, for the Department of Agriculture) but also with special areas of mutual concern (identifying hateful events through crowd-sourced intelligence across public social media data).

Much ado about nothing

P-values have now become controversial. P here does not stand for President Trump but this.

After 150 Years, the ASA Says No to p-values

https://matloff.wordpress.com/2016/03/07/after-150-years-the-asa-says-no-to-p-values/

Sadly, the concept of p-values and significance testing forms the very core of statistics. A number of us have been pointing out for decades that p-values are at best underinformative and often misleading. Almost all statisticians agree on this, yet they all continue to use it and, worse, teach it. I recall a few years ago, when Frank Harrell and I suggested that R place less emphasis on p-values in its output, there was solid pushback. One can’t blame the pusherbackers, though, as the use of p-values is so completely entrenched that R would not be serving its users well with such a radical move.


The American Statistical Association (ASA) has released a “Statement on Statistical Significance and P-Values” with six principles underlying the proper use and interpretation of the p-value [http://amstat.tandfonline.com/doi/abs/10.1080/00031305.2016.1154108#.Vt2XIOaE2MN]. The ASA releases this guidance on p-values to improve the conduct and interpretation of quantitative science and inform the growing emphasis on reproducibility of science research. The statement also notes that the increased quantification of scientific research and a proliferation of large, complex data sets has expanded the scope for statistics and the importance of appropriately chosen techniques, properly conducted analyses, and correct interpretation.

 

I personally think Big Data needs Bigger Thinking among statisticians about a new era of inference.
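To make the Big Data point concrete, here is a minimal sketch of why a p-value on its own is underinformative. It assumes a simple two-sided one-sample z-test with a known standard deviation; the function name `z_test_summary` and the numbers (a mean of 100.1 against a null of 100, sd 15) are made up for illustration and are not from the ASA statement. The same tiny difference is tested at two sample sizes, and the effect size and confidence interval are reported next to the p-value.

```python
from math import sqrt
from statistics import NormalDist


def z_test_summary(sample_mean, null_mean, sd, n, level=0.95):
    """Two-sided one-sample z-test (known sd), plus effect size and CI.

    Illustrative helper only; all names here are hypothetical.
    """
    std_err = sd / sqrt(n)
    z = (sample_mean - null_mean) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    effect_size = (sample_mean - null_mean) / sd  # standardized difference
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    ci = (sample_mean - z_crit * std_err, sample_mean + z_crit * std_err)
    return {"z": z, "p_value": p_value, "effect_size": effect_size, "ci": ci}


# Same tiny difference, two very different sample sizes.
small = z_test_summary(100.1, 100.0, 15.0, n=100)
large = z_test_summary(100.1, 100.0, 15.0, n=1_000_000)

for label, r in (("n=100", small), ("n=1,000,000", large)):
    print(f"{label}: p={r['p_value']:.4g}, d={r['effect_size']:.4f}, "
          f"CI=({r['ci'][0]:.3f}, {r['ci'][1]:.3f})")
```

The effect size is identical in both runs; only the sample size changes, and with it the p-value. With enough data almost any difference becomes "significant", which is why the estimate and its interval say more than the p-value alone.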

However as always these guys are the best

Algorithm to deal with a broken heart

  • Abort (A): Terminate the operation/program and return to the system command prompt.[2] In hindsight this was not a good idea as the program would not do any cleanup (such as completing writing of other files). “Abort” was necessary because early DOS did not implement “Fail”. It may have remained necessary for poorly written software for which “Fail” would have caused a loop that would have repeatedly invoked the critical error handler with no other way to exit.
  • Retry (R): DOS would attempt the operation again.[2] “Retry” made sense if the user could rectify the problem. To continue the example above, if the user simply forgot to close the drive latch, they could close it, retry, and the system would continue where it left off.
  • Ignore (I) (older versions of DOS): Return success status to the calling program/routine, despite the failure of the operation.[2] For instance, a disk read error could be ignored and DOS would return whatever data was in the read buffer, which might contain some of the correct data from the disk. Attempting to use results after an “Ignore” was an undefined behavior.[2] “Ignore” did not appear in cases where it was impossible for the data to be used; for instance, a missing disk could not be ignored because that would require DOS to construct and return some kind of file descriptor that worked in further “read” calls. This is not available if DOS cannot read any sector from the first sector of a floppy disk or a partition of a hard disk to the last sector of the root directory.
  • Fail (F) (DOS 3.3 and later): Return failure status to the calling program/routine.[2] “Fail” returned an error code to the program, similar to other errors such as file not found. The program could then gracefully recover from the problem.

from https://en.wikipedia.org/wiki/Abort,_Retry,_Fail%3F

In the lines above replace DOS with LOVER, and you have the algorithm
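For readers who never met the prompt, the four choices can be sketched as a small Python handler. This is only an illustration of the behaviour described in the quoted list, not how DOS implemented it; `Choice`, `Aborted`, `run_with_prompt` and `read_disk` are hypothetical names, and `OSError` stands in for the critical error.

```python
from enum import Enum


class Choice(Enum):
    ABORT = "A"
    RETRY = "R"
    IGNORE = "I"
    FAIL = "F"


class Aborted(Exception):
    """Raised on Abort: the operation is terminated with no cleanup."""


def run_with_prompt(operation, choose, stale_value=None):
    """Run operation(); on OSError ask choose(error) what to do next.

    Returns a (succeeded, value) pair to the calling routine.
    """
    while True:
        try:
            return True, operation()
        except OSError as error:
            choice = choose(error)
            if choice is Choice.RETRY:
                continue                   # attempt the operation again
            if choice is Choice.IGNORE:
                return True, stale_value   # report success despite the failure
            if choice is Choice.FAIL:
                return False, None         # report failure to the caller
            raise Aborted(str(error)) from error  # Abort: give up entirely


# The drive-latch example: the first read fails, the retry succeeds.
attempts = {"count": 0}


def read_disk():
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise OSError("drive not ready")
    return "file contents"


result = run_with_prompt(read_disk, lambda error: Choice.RETRY)
print(result)
```

Only Fail hands the caller an honest error it can recover from; Ignore returns whatever was lying around and calls it success, which is the undefined behaviour the list warns about.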

Interview Questions for Budding Data Scientists for Edureka Blog

I created a list of questions and answers I have seen in data science interviews. Since everyone claims to be an expert in data science, let me assure you that I am still obediently learning new things, with bemused humility, after 12 years.

Interview Questions for Budding Data Scientists

Background: I have been working in analytics since February 2004. From 2004 to 2007 I worked only in the SAS language. From 2008 onwards I started working with both R and SAS. From 2013 I started working with Python. Since around 2009 we have had the term Big Data, thanks to Hadoop, and since 2013 we have had the term data science. What used to be called just analytics is now called data science, with the added variants Big Data Analytics and Business Analytics referring to the same thing. For techniques in building models, we have used the terms predictive analytics, data mining and machine learning to mean roughly the same thing (though they might actually be different). I have an MBA for business training and have written two books on R so far. I am currently writing my third book, this time on Python, for Wiley.
During this journey I have taught, trained and mentored hundreds of budding data scientists and also interviewed a few of them, while giving a few interviews myself. Based on this experience, here are a few questions to help you clear an entry-level interview for data science roles and become a data scientist. Since data science itself is an intersection of business perspective, statistics and coding, I have labeled the questions in corresponding sections. This will not give you a sure-shot chance of clearing an interview, but learning a few of these questions and answers will definitely help increase the probability of you clearing the interview process.

http://www.edureka.co/blog/top-data-science-interview-questions-for-budding-data-scientists

I helped create Edureka’s R course in 2013 (after I did the same for Jigsaw in 2011 and before I did it for Collabera in 2015). You can see a video of the initial class here, which has gotten 100,000 views.

Edureka remains one of the true believers in customer-centered online education that does not fool young people out of too much money: a mass-market, mass-course approach, but with actual teacher-student interaction rather than the cold robotic automation of MOOCs.


Related:

Jigsaw Completes training of 300 students on R

http://analyticstraining.com/author/ajay-ohri/

http://www.collaberatact.com/online-training-courses/analytics-with-r-certification/

PYTHON FOR R USERS: Come September

I am writing a new book on a language that is new to me (Python) for a new publisher (Wiley).

 

This book is the first of its kind to provide a reference that enables students and practitioners to easily learn to code in Python if they are familiar with R and vice versa, even if they are beginners in the second language. It also provides a detailed introduction and overview of each language to the reader who might be unfamiliar with the other. While R has better statistical and graphical tools, Python has good machine learning tools and proves to be more useful software for the analysis of Big Data. A unique feature of this book is how it provides a command-by-command translation between R and Python for many mathematical, visualization and machine learning techniques. The intended audience is statistical practitioners and data scientists trying to learn one of R or Python or both, as well as students that are familiar with one of the languages.

http://www.amazon.co.uk/Python-R-Users-Ajay-Ohri/dp/1119126762
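To give a flavour of what a command-by-command translation looks like, here is a small sketch that pairs a few base R commands (in the comments) with Python equivalents. The pairs are chosen here for illustration and are not taken from the book; only the Python standard library is used, and the sample vector is made up.

```python
import statistics

x = [4, 8, 15, 16, 23, 42]       # R: x <- c(4, 8, 15, 16, 23, 42)

n = len(x)                       # R: length(x)
avg = statistics.mean(x)         # R: mean(x)
med = statistics.median(x)       # R: median(x)
sd = statistics.stdev(x)         # R: sd(x)    (both divide by n - 1)
first = x[0]                     # R: x[1]     (R indexes from 1, Python from 0)
big = [v for v in x if v > 10]   # R: x[x > 10]
odds = list(range(1, 10, 2))     # R: seq(1, 9, by = 2)
doubled = [v * 2 for v in x]     # R: x * 2    (R vectors are vectorised, lists are not)

print(n, avg, med, round(sd, 3), first, big, odds, doubled)
```

The two habits that trip up most R users show up even in this short list: Python counts from zero, and a plain Python list does not do element-wise arithmetic the way an R vector does.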