Year: 2016
Because it is May 4
Big Data and Data Science should help Google detect copyright infringement and terror-motivation videos on YouTube
Hemingway versus ISIS’s cyber army
What the world might need (in this writer’s much-derided opinion) is a global set of volunteers to help find the ISIS cyber army. That army uses the internet for motivation, through videos and pictures on Instagram and YouTube; for recruitment of lone-wolf sleepers among carefully selected demographics in both Western and Eastern locations; for cyber retaliation targeted at law enforcement and military personnel whose details are culled from social media; for political propaganda and fundraising; and for real-time communication, both with potential terrorists and internally, using encryption as well as unorthodox methods of communicating.
Crowdsourced cyber intelligence could be incentivized, if only to prevent members of Anonymous from being recruited with bitcoins by North Korea or ISIS. Think of the volunteers who tried to fight Russia’s fall to communism, or the volunteers who fought against the Fascists in Spain. The West’s half-hearted efforts in both these conflicts led to much bigger conflicts later on. So does the free world need volunteers in the cyber terror fight against ISIS? Unfortunately, the same people with the particular set of skills that can help the FBI encrypt or decrypt phones are people who have been aggressively prosecuted in the past. There is no cyber witness program, and indeed no effort by the counter-terrorism infrastructure to reach out to the hacker-activist cyber infrastructure. This persists amid mutual suspicions of tax money wastage and cyber criminality. A house divided against itself will fall, in the real world and on the Internet.
Well what about Hemingway? From my favorite website
https://en.wikipedia.org/wiki/Ernest_Hemingway#Spanish_Civil_War
Spanish Civil War
Hemingway (center) with Dutch filmmaker Joris Ivens and German writer Ludwig Renn (serving as an International Brigades officer) in Spain during the Spanish Civil War, 1937.
In 1937, Hemingway agreed to report on the Spanish Civil War for the North American Newspaper Alliance (NANA),[86] arriving in Spain in March with Dutch filmmaker Joris Ivens.[87] Ivens, who was filming The Spanish Earth, wanted Hemingway to replace John Dos Passos as screenwriter, since Dos Passos had left the project when his friend José Robles was arrested and later executed.[88] The incident changed Dos Passos’ opinion of the leftist republicans, creating a rift between him and Hemingway, who later spread a rumor that Dos Passos left Spain out of cowardice.
Late in 1937, while in Madrid with Martha, Hemingway wrote his only play, The Fifth Column, as the city was being bombarded.
The Spanish Civil War took place from 1936 to 1939 and was fought between the Republicans, who were loyal to the democratic, left-leaning Second Spanish Republic, and the Nationalists, a falangist group led by General Francisco Franco. The Nationalists won, and Franco then ruled Spain for the next 36 years, from April 1939 until his death in November 1975.
The Spanish Civil War seized the fears and hopes of the world, including not just diplomats and politicians, but intellectuals, religious leaders, and labor unions, as well. Opinion divided three ways. The right and the Catholics supported the Nationalists as a way to stop the expansion of Bolshevism. On the left, including labor unions, students and intellectuals, the war represented a necessary battle to stop the spread of fascism. Antiwar and pacifist sentiment was strong in many countries.
Early Bird Prices for PAW Chicago
Early bird prices for passes to Predictive Analytics World for Business in Chicago – June 20-23 – end Friday, May 6th. Be sure to register for your pass at the best rate available before the early bird deadline flies away.
Early Bird Prices:
All Access Pass: $3,450
Two-Day Pass: $1,700
All Access Combo Pass: $3,740
Two-Day Combo Pass: $1,990
Regular Prices:
All Access Pass: $3,850
Two-Day Pass: $2,100
All Access Combo Pass: $4,040
Two-Day Combo Pass: $2,290
When you register by Friday, May 6th, enjoy early bird rates that keep $300-400 in your pocket.
Some terms for budding data scientists
Quandl just came up with a list of seven deadly sins for data scientists. Their site provides a wide collection of data that would be beneficial for anyone looking to become successful in the analytical field. I interviewed their founder some time back.
I would add lack of reading as the biggest sin, and lack of writing/blogging as a big sin too. I guess that would be covered in Sloth.

Some terms that a data scientist should not be slothful about learning
Overfitting-
Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.
What it leads to: the model explains your existing data fine but won’t work on fresh data.
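To make the "fine on existing data, poor on fresh data" symptom concrete, here is a minimal sketch. It is an illustration only, not a method from the sources quoted here: the data are simulated (a straight line plus noise), and the "excessively complex" model is a hypothetical memorizer that stores every training point and answers with the nearest one, so it has as many parameters as observations.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares straight line (two parameters)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_memorizer(xs, ys):
    """Stores every training point and predicts the y of the nearest stored x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def mse(model, xs, ys):
    """Mean squared prediction error of a model on a data set."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def make_data(rng, n):
    """Simulated data (an assumption for this demo): y = 2x + Gaussian noise."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys

rng = random.Random(0)
train_x, train_y = make_data(rng, 200)
test_x, test_y = make_data(rng, 1000)

models = {
    "line": fit_line(train_x, train_y),
    "memorizer": fit_memorizer(train_x, train_y),
}
results = {
    name: {"train": mse(m, train_x, train_y), "test": mse(m, test_x, test_y)}
    for name, m in models.items()
}
for name, errors in results.items():
    print(name, errors)
```

The memorizer reproduces the training data perfectly because it has fitted the noise, and that is exactly why its error on the fresh test set should come out worse than the simple line's.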
Lift–
Lift is a measure of the performance of a targeting model (association rule) at predicting or classifying cases as having an enhanced response (with respect to the population as a whole), measured against a random choice targeting model. A targeting model is doing a good job if the response within the target is much better than the average for the population as a whole. Lift is simply the ratio of these values: target response divided by average response.
For example, suppose a population has an average response rate of 5%, but a certain model (or rule) has identified a segment with a response rate of 20%. Then that segment would have a lift of 4.0 (20%/5%).
Typically, the modeller seeks to divide the population into quantiles, and rank the quantiles by lift.
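The lift arithmetic and the quantile ranking described above can be sketched in a few lines. The function names and the ten-record toy data set are made up for illustration; only the 20%/5% figures come from the example in the text.

```python
def lift(segment_response_rate, average_response_rate):
    """Lift = target response divided by average response."""
    return segment_response_rate / average_response_rate

def lift_by_quantile(scores, outcomes, n_groups):
    """Rank records by model score (best first), cut them into n_groups
    roughly equal quantiles, and return the lift of each quantile."""
    ranked = sorted(zip(scores, outcomes), key=lambda p: p[0], reverse=True)
    overall_rate = sum(outcomes) / len(outcomes)
    n = len(ranked)
    lifts = []
    for g in range(n_groups):
        group = ranked[g * n // n_groups:(g + 1) * n // n_groups]
        group_rate = sum(outcome for _, outcome in group) / len(group)
        lifts.append(group_rate / overall_rate)
    return lifts

# The example from the text: a 20% segment against a 5% population average.
example_lift = lift(0.20, 0.05)
print(example_lift)

# Hypothetical toy data: model scores and whether each person responded (1/0).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
outcomes = [1, 1, 1, 0, 1, 0, 0, 0, 1, 0]
halves = lift_by_quantile(scores, outcomes, n_groups=2)
print(halves)
```

A lift above 1 in the top quantile and below 1 in the bottom one is what a useful targeting model should show; a model no better than random choice would give lifts close to 1 everywhere.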
Hosmer-Lemeshow Goodness-of-Fit Test
The Hosmer–Lemeshow test is a statistical test for goodness of fit for logistic regression models. It is used frequently in risk prediction models. The test assesses whether or not the observed event rates match expected event rates in subgroups of the model population. The Hosmer–Lemeshow test specifically identifies subgroups as the deciles of fitted risk values. Models for which expected and observed event rates in subgroups are similar are called well calibrated.
First, the observations are sorted in increasing order of their estimated event probability. The event is the response level specified in the response variable option EVENT=, or the response level that is not specified in the REF= option, or, if neither of these options was specified, then the event is the response level identified in the “Response Profiles” table as “Ordered Value 1”. The observations are then divided into approximately 10 groups according to the following scheme.
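As a rough sketch of the idea, the statistic can be computed by sorting observations by fitted probability, cutting them into groups, and comparing observed with expected event counts in each group. This is a simplified illustration that assumes equal-sized groups; it does not reproduce the exact SAS grouping scheme referred to above, and the function name and toy data are made up.

```python
def hosmer_lemeshow_statistic(probs, outcomes, n_groups=10):
    """Sum over groups of (observed - expected)^2 / (n * p_bar * (1 - p_bar)),
    where p_bar is the mean fitted probability in the group."""
    ranked = sorted(zip(probs, outcomes), key=lambda p: p[0])
    n = len(ranked)
    statistic = 0.0
    for g in range(n_groups):
        group = ranked[g * n // n_groups:(g + 1) * n // n_groups]
        if not group:
            continue  # more groups requested than observations
        size = len(group)
        observed = sum(y for _, y in group)   # events actually seen
        expected = sum(p for p, _ in group)   # events the model expects
        mean_p = expected / size
        variance = size * mean_p * (1.0 - mean_p)
        if variance > 0.0:
            statistic += (observed - expected) ** 2 / variance
    return statistic

# Toy data: four people at 25% fitted risk and four at 75% fitted risk.
probs = [0.25] * 4 + [0.75] * 4
well_calibrated = [1, 0, 0, 0, 1, 1, 1, 0]   # 1 and 3 events, as expected
less_calibrated = [1, 1, 0, 0, 1, 1, 1, 0]   # 2 events where 1 was expected

h_good = hosmer_lemeshow_statistic(probs, well_calibrated, n_groups=2)
h_bad = hosmer_lemeshow_statistic(probs, less_calibrated, n_groups=2)
print(h_good, h_bad)
```

In practice the statistic is compared against a chi-squared distribution with (number of groups minus 2) degrees of freedom; a large value suggests the model is poorly calibrated.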
Bayes Theorem
Bayes’ theorem is stated mathematically as the following equation:[2]
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)},$$
where A and B are events.
- P(A) and P(B) are the probabilities of A and B without regard to each other.
- P(A | B), a conditional probability, is the probability of observing event A given that B is true.
- P(B | A) is the probability of observing event B given that A is true.
Examples
Cancer at age 65
Suppose we want to know an individual’s probability of having cancer, but we know nothing about them. Despite not knowing anything about that person, a probability can be assigned based on the general prevalence of cancer. For the sake of this example, suppose it is 1%. This is known as the base rate or prior probability of having cancer. “Prior” refers to the time before being informed about the particular case at hand.
Next, suppose we find out that person is 65 years old. If we assume that cancer and age are related, this new piece of information can be used to better assess that person’s risk of having cancer. More precisely, we’d like to know the probability that a person has cancer when it is known that they are 65 years old. This quantity is known as the current probability, where “current” refers to the theorised situation upon finding out information about the particular case at hand.
In order to apply knowledge of that person’s age in conjunction with Bayes’ Theorem, two additional pieces of information are needed. Note, however, that the additional information is not specific to that person. The needed information is as follows:
- The probability of being 65 years old. Suppose it is 0.2%
- The probability that a person with cancer is 65 years old. Suppose it is 0.5%. Note that this is greater than the previous value. This reflects that people with cancer are disproportionately 65 years old.
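Knowing this, along with the base rate, we can calculate that a person who is age 65 has a probability of having cancer equal to
$$P(\text{cancer} \mid 65) = \frac{P(65 \mid \text{cancer})\,P(\text{cancer})}{P(65)} = \frac{0.5\% \times 1\%}{0.2\%} = 2.5\%.$$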
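The same calculation can be checked numerically. This is a minimal sketch using only the three supposed figures from the example (1% base rate, 0.5% of cancer patients aged 65, 0.2% of people aged 65); the function name is made up for illustration.

```python
def bayes_posterior(prior, likelihood, evidence):
    """P(A | B) = P(B | A) * P(A) / P(B)."""
    return likelihood * prior / evidence

p_cancer = 0.01            # base rate (prior) of cancer
p_65_given_cancer = 0.005  # probability that a person with cancer is 65
p_65 = 0.002               # probability of being 65

p_cancer_given_65 = bayes_posterior(p_cancer, p_65_given_cancer, p_65)
print(f"{p_cancer_given_65:.1%}")
```

Learning the person's age moves the estimate from the 1% base rate to about 2.5%, because 65-year-olds are over-represented among people with cancer.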

Gradient Descent for Machine Learning
Stochastic gradient descent (often shortened to SGD) is a stochastic approximation of the gradient descent optimization method for minimizing an objective function.
Both statistical estimation and machine learning consider the problem of minimizing an objective function that has the form of a sum:
$$Q(w) = \sum_{i=1}^{n} Q_i(w),$$
where the parameter $w$ which minimizes $Q(w)$ is to be estimated. Each summand function $Q_i$ is typically associated with the $i$-th observation in the data set (used for training).
In classical statistics, sum-minimization problems arise in least squares and in maximum-likelihood estimation (for independent observations).
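For the least-squares case, a minimal sketch of stochastic gradient descent looks like this. Each summand is taken to be the squared error of one observation for a straight-line model, and the parameters are updated using one observation at a time. The learning rate, epoch count, function name, and the noise-free toy data are all assumptions chosen for illustration.

```python
import random

def sgd_least_squares(xs, ys, learning_rate=0.1, epochs=200, seed=0):
    """Fit y = slope * x + intercept by stepping along the gradient of one
    summand (slope * x_i + intercept - y_i)^2 at a time."""
    rng = random.Random(seed)
    slope, intercept = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the observations in a fresh random order
        for i in order:
            error = slope * xs[i] + intercept - ys[i]
            # Gradient of the i-th summand with respect to slope and intercept.
            slope -= learning_rate * 2.0 * error * xs[i]
            intercept -= learning_rate * 2.0 * error
    return slope, intercept

# Toy data generated from y = 3x + 1 with no noise.
xs = [i / 10 for i in range(11)]
ys = [3.0 * x + 1.0 for x in xs]
slope, intercept = sgd_least_squares(xs, ys)
print(slope, intercept)
```

Ordinary (batch) gradient descent would sum the gradient over all observations before each step; the stochastic version trades that exactness for many cheap updates, which matters when the data set is large.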
https://www.coursera.org/learn/machine-learning/lecture/kCvQc/gradient-descent-for-linear-regression

Source
https://en.wikipedia.org/wiki/Lift_(data_mining)
http://www.newyorker.com/culture/culture-desk/remembering-prince
https://en.wikipedia.org/wiki/Bayes%27_theorem
https://en.wikipedia.org/wiki/Overfitting
https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/statug_logistic_sect039.htm
https://en.wikipedia.org/wiki/Hosmer%E2%80%93Lemeshow_test
How Youtube and Google search is a bigger problem for copyright than actual video content
A typical argument put forward by Google when confronted by music producers or artists (like Prince or Metallica) is that there are too many YouTube videos to take down. But the real problem is that search enables copyright infringement. Rather than playing whack-a-mole with the content one item at a time, just make it tougher to search for copyrighted material that is available for free.

This can be easily prevented by following basic principles of web analytics:
- identify the web searches which lead to copyright infringement (for music or movies)
- put up a warning message that the search query leads to copyrighted material which the user does not have access to
- create a separate section for subscribers that gives users access to that material for money, and share the money with content creators (some streaming service)
This applies even to my books (see https://www.google.com/search?q=free+download+r+for+business+analytics&oq=free+download+of+r+for+busine&aqs=chrome.1.69i57j0.11735j0j7&sourceid=chrome ), where I had to struggle to get search results purged one by one for “Free Download of R for Business Analytics”.
Rather than purge the actual content, why not purge the search query? Such a search query is clearly malicious.
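As a rough sketch of what flagging the query rather than the content could look like: the snippet below marks a search as suspect when it combines a piracy-style phrase with a protected title. Everything in it is hypothetical (the phrase list, the title list, the function name); it is not how Google or YouTube actually handle searches.

```python
# Hypothetical phrases that suggest the user wants an unlicensed free copy.
PIRACY_HINTS = ("free download", "torrent", "full album free")

def flag_query(query, protected_titles):
    """Return True when the query pairs a piracy-style phrase with a
    protected title, so a warning could be shown instead of results."""
    normalized = " ".join(query.lower().split())
    has_hint = any(hint in normalized for hint in PIRACY_HINTS)
    has_title = any(title.lower() in normalized for title in protected_titles)
    return has_hint and has_title

# Hypothetical catalogue of protected works.
protected = ["R for Business Analytics"]

flagged = flag_query("Free Download of R for Business Analytics", protected)
allowed = flag_query("buy R for Business Analytics", protected)
print(flagged, allowed)
```

A flagged query is where the warning message, or the pointer to a paid subscriber section, would be shown.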
Apparently Google can take a stand against guns and YouTube can do it against porn, but they are not willing to do so for music! You are welcome, musicians!
(This article was inspired by Prince)
http://blogs.wsj.com/law/2016/04/21/the-prince-of-copyright-enforcement/
Choices and Ethics for Data Scientists in the new age
- Should I cooperate with my company/the government on technology to help find terrorists, knowing that the same company/government will sell this technology to some regimes/allies to help find political dissidents?
- Should I help with statistics to get more people to click more ads and buy more products, or should I help with statistics to help local government do more with less corruption?
- Should I help save the planet by donating time and ideas for climate change, or should I help save my job by donating time and ideas for product changes?
- Should I donate my time and energy to create and train more data scientists, or should I use the shortage of data scientists to get a better salary?
- Should I be hypocritical about open source when it suits my career, so I can push my startup and its investors to a profitable exit and sell out to companies that are not open source?
- Should I use open source for free forever and give talks for free only when it helps boost my career or my company’s products?
- Should I help create very profitable data science for helping kill more people with fewer bombs, or should I help create less profitable data science for helping find water and resources for more people?




