## Some terms for budding data scientists

Quandl just came up with a list of seven deadly sins for data scientists. Their site provides a wide collection of data that would benefit anyone looking to succeed in the analytical field. I interviewed their founder some time back.

I would add lack of reading as the biggest sin, and lack of writing/blogging as a big sin too. I guess that would be covered under Sloth.

Some terms that a data scientist should not be slothful about learning:

Overfitting

Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.

What it leads to: the model explains your existing data fine but won't work on fresh data.
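To make this concrete, here is a minimal sketch in plain Python. The data (a straight line plus Gaussian noise), the sample sizes, and the two models are my own assumptions for illustration: a simple least-squares line versus a "memorizer" (1-nearest-neighbour) that effectively has one parameter per observation.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_memorizer(xs, ys):
    """1-nearest-neighbour: predicts the y of the closest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def mse(model, xs, ys):
    """Mean squared error of a model on a data set."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def make_data(rng, n):
    # Assumed toy relationship: y = 2x + Gaussian noise with sd = 1
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys

rng = random.Random(42)
train_x, train_y = make_data(rng, 300)
test_x, test_y = make_data(rng, 300)   # "fresh" data from the same process

line = fit_line(train_x, train_y)
memo = fit_memorizer(train_x, train_y)

for name, model in [("line", line), ("memorizer", memo)]:
    print(name,
          "train MSE:", round(mse(model, train_x, train_y), 3),
          "test MSE:", round(mse(model, test_x, test_y), 3))
```

The memorizer reproduces the training data perfectly because it has learned the noise, and that is exactly why it should do worse than the plain line on the fresh sample.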

Lift

Lift is a measure of the performance of a targeting model (association rule) at predicting or classifying cases as having an enhanced response (with respect to the population as a whole), measured against a random choice targeting model. A targeting model is doing a good job if the response within the target is much better than the average for the population as a whole. Lift is simply the ratio of these values: target response divided by average response.

For example, suppose a population has an average response rate of 5%, but a certain model (or rule) has identified a segment with a response rate of 20%. Then that segment would have a lift of 4.0 (20%/5%).

Typically, the modeller seeks to divide the population into quantiles, and rank the quantiles by lift.
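Here is a small sketch of both ideas in Python: the basic ratio, and lift per quantile after ranking by model score. The function names and the toy scores and responses are made up for illustration.

```python
def lift(target_rate, base_rate):
    """Lift = response rate in the targeted segment / overall response rate."""
    return target_rate / base_rate

def lift_by_quantile(scores, responses, n_groups=10):
    """Rank cases by model score (highest first), cut them into n_groups
    roughly equal groups, and return the lift of each group.
    Assumes len(scores) >= n_groups and at least one positive response."""
    ranked = sorted(zip(scores, responses), key=lambda p: p[0], reverse=True)
    base_rate = sum(responses) / len(responses)
    size = len(ranked) // n_groups
    lifts = []
    for g in range(n_groups):
        start = g * size
        # The last group absorbs any leftover cases
        end = (g + 1) * size if g < n_groups - 1 else len(ranked)
        group = ranked[start:end]
        group_rate = sum(r for _, r in group) / len(group)
        lifts.append(group_rate / base_rate)
    return lifts

# The example from the text: a 20% segment against a 5% population average
print(lift(0.20, 0.05))

# Hypothetical scores and 0/1 responses, split into two halves
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
responses = [1, 1, 1, 0, 1, 0, 0, 0]
print(lift_by_quantile(scores, responses, n_groups=2))
```

A model that ranks well should show lift above 1 in the top groups and below 1 in the bottom ones.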

Hosmer-Lemeshow Goodness-of-Fit Test

The Hosmer–Lemeshow test is a statistical test for goodness of fit for logistic regression models. It is used frequently in risk prediction models. The test assesses whether or not the observed event rates match expected event rates in subgroups of the model population. The Hosmer–Lemeshow test specifically identifies subgroups as the deciles of fitted risk values. Models for which expected and observed event rates in subgroups are similar are called well calibrated.

In SAS PROC LOGISTIC, the observations are first sorted in increasing order of their estimated event probability. The event is the response level specified in the response variable option EVENT=, or the response level that is not specified in the REF= option, or, if neither of these options was specified, then the event is the response level identified in the “Response Profiles” table as “Ordered Value 1”. The observations are then divided into approximately 10 groups.
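As a rough sketch of the statistic itself (not the exact SAS grouping rules), the Python below sorts cases by fitted probability, cuts them into equal-size groups, and sums (observed - expected)² / (n × p̄ × (1 - p̄)) over the groups, where p̄ is the mean fitted probability in a group. The equal-size split and the toy numbers are my assumptions. The result is typically compared against a chi-square distribution with (number of groups - 2) degrees of freedom.

```python
def hosmer_lemeshow(probs, outcomes, n_groups=10):
    """Hosmer-Lemeshow chi-square statistic.
    probs: fitted event probabilities; outcomes: observed 0/1 events.
    Assumes every group's mean probability is strictly between 0 and 1."""
    ranked = sorted(zip(probs, outcomes), key=lambda p: p[0])
    n = len(ranked)
    stat = 0.0
    for g in range(n_groups):
        # Simple equal-size split (an assumption, not the SAS scheme)
        group = ranked[g * n // n_groups:(g + 1) * n // n_groups]
        size = len(group)
        expected = sum(p for p, _ in group)   # expected number of events
        observed = sum(y for _, y in group)   # observed number of events
        mean_p = expected / size
        stat += (observed - expected) ** 2 / (size * mean_p * (1 - mean_p))
    return stat

# Hypothetical fitted risks: five low-risk and five high-risk cases
probs = [0.2] * 5 + [0.8] * 5

well_calibrated = [1, 0, 0, 0, 0, 1, 1, 1, 1, 0]   # 1 of 5 and 4 of 5 events
poorly_calibrated = [1, 1, 1, 0, 0, 1, 1, 1, 1, 0]  # 3 of 5 and 4 of 5 events

print(hosmer_lemeshow(probs, well_calibrated, n_groups=2))
print(hosmer_lemeshow(probs, poorly_calibrated, n_groups=2))
```

A statistic near zero means observed and expected event counts agree in every group; a large value signals poor calibration.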

Bayes’ Theorem

Bayes’ theorem is stated mathematically as the following equation:

$P(A|B) = \frac{P(B | A) \, P(A)}{P(B)},$

where A and B are events.

• P(A) and P(B) are the probabilities of A and B without regard to each other.
• P(A | B), a conditional probability, is the probability of observing event A given that B is true.
• P(B | A) is the probability of observing event B given that A is true.

## Examples

### Cancer at age 65

Suppose we want to know an individual’s probability of having cancer, but we know nothing about them. Despite not knowing anything about that person, a probability can be assigned based on the general prevalence of cancer. For the sake of this example, suppose it is 1%. This is known as the base rate or prior probability of having cancer. “Prior” refers to the time before being informed about the particular case at hand.

Next, suppose we find out that person is 65 years old. If we assume that cancer and age are related, this new piece of information can be used to better assess that person’s risk of having cancer. More precisely, we’d like to know the probability that a person has cancer when it is known that they are 65 years old. This quantity is known as the current probability (more commonly called the posterior probability), where “current” refers to the theorised situation upon finding out information about the particular case at hand.

In order to apply knowledge of that person’s age in conjunction with Bayes’ Theorem, two additional pieces of information are needed. Note, however, that the additional information is not specific to that person. The needed information is as follows:

1. The probability of being 65 years old. Suppose it is 0.2%.
2. The probability that a person with cancer is 65 years old. Suppose it is 0.5%. Note that this is greater than the previous value. This reflects that people with cancer are disproportionately 65 years old.

Knowing this, along with the base rate, we can calculate that a person who is age 65 has a probability of having cancer equal to

$(0.5\% \times 1\%) \div 0.2\% = 2.5\%$
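The same arithmetic as a tiny Python sketch, with A = "has cancer" and B = "is 65 years old". The function and variable names are mine; the numbers are the ones assumed in the example.

```python
def bayes_posterior(p_b_given_a, p_a, p_b):
    """P(A | B) = P(B | A) * P(A) / P(B); requires P(B) > 0."""
    return p_b_given_a * p_a / p_b

p_cancer = 0.01                # P(A): base rate of cancer (1%)
p_age65 = 0.002                # P(B): probability of being 65 (0.2%)
p_age65_given_cancer = 0.005   # P(B | A): share of cancer patients aged 65 (0.5%)

p_cancer_given_age65 = bayes_posterior(p_age65_given_cancer, p_cancer, p_age65)
print(f"{p_cancer_given_age65:.1%}")  # prints 2.5%
```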

Stochastic Gradient Descent

Stochastic gradient descent (often shortened to SGD) is a stochastic approximation of the gradient descent optimization method for minimizing an objective function.

Both statistical estimation and machine learning consider the problem of minimizing an objective function that has the form of a sum:

$Q(w) = \sum_{i=1}^n Q_i(w),$

where the parameter $w$ which minimizes $Q(w)$ is to be estimated. Each summand function $Q_i$ is typically associated with the $i$-th observation in the data set (used for training).

In classical statistics, sum-minimization problems arise in least squares and in maximum-likelihood estimation (for independent observations).
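A minimal sketch of SGD on a least-squares problem, in plain Python. Here each summand is $Q_i(w) = (w x_i - y_i)^2$, so its gradient is $2 x_i (w x_i - y_i)$. The one-parameter model, the noise-free toy data, the learning rate, and the epoch count are all assumptions chosen to keep the example short.

```python
import random

def sgd_least_squares(xs, ys, learning_rate=0.1, epochs=200, seed=0):
    """Minimise Q(w) = sum_i (w * x_i - y_i)^2 one observation at a time."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)          # visit observations in random order
        for i in indices:
            # Gradient of the single summand Q_i at the current w
            gradient_i = 2 * xs[i] * (w * xs[i] - ys[i])
            w -= learning_rate * gradient_i
    return w

# Noise-free toy data generated from y = 3x (an assumption for illustration)
xs = [i / 10 for i in range(1, 11)]
ys = [3 * x for x in xs]

w_hat = sgd_least_squares(xs, ys)
print(round(w_hat, 4))
```

Plain gradient descent would compute the gradient of the full sum before each update; SGD instead updates after every single observation, which is cheaper per step on large data sets.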

Source

https://en.wikipedia.org/wiki/Lift_(data_mining)

http://www.newyorker.com/culture/culture-desk/remembering-prince

https://en.wikipedia.org/wiki/Bayes%27_theorem

https://en.wikipedia.org/wiki/Overfitting

https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/statug_logistic_sect039.htm

https://en.wikipedia.org/wiki/Hosmer%E2%80%93Lemeshow_test

A typical argument made by Google when confronted by music producers or artists (like Prince or Metallica) is that there are too many YouTube videos to take down. But the problem is that search enables copyright infringement. Make it tougher to search for copyrighted material that is available for free, rather than playing whack-a-mole with the content one item at a time.

This can be easily prevented by following basic principles of web analytics:

1. Identify the web searches that lead to copyright infringement (for music or movies).
2. Show a warning message that the search query leads to copyrighted material the user does not have access to.
3. Create a separate section for subscribers that gives users paid access to it, and share the money with content creators (some streaming service).

Rather than purging the actual content, why not purge the search query when that query is clearly malicious?

Apparently Google can take a stand against guns and YouTube can do it against porn, but they are not willing to do so for music! You are welcome, musicians!

## Choices and Ethics for Data Scientists in the new age

1. Should I cooperate with my company/the government on technology to help find terrorists, knowing that the same company/government will sell this technology to some regimes/allies to help find political dissidents?
2. Should I help with statistics to get more people to click more ads and buy more products, or should I help with statistics to help local government do more with less corruption?
3. Should I help save the planet by donating time and ideas for climate change, or should I help save my job by donating time and ideas for product changes?
4. Should I donate my time and energy to create and train more data scientists, or should I use the shortage of data scientists to get a better salary?
5. Should I be hypocritical about open source when it suits my career, so I can push my startup and its investors to a profitable exit and sell out to companies that are not open source?
6. Should I use open source for free forever and give talks for free only when it helps boost my career or my company’s products?
7. Should I help create very profitable data science for helping kill more people with fewer bombs, or should I help create less profitable data science for helping find water and resources for more people?

## How to hack online dating

2. Create your response. Don’t send it immediately. Use the draft folder.
3. Finally, write a response that is not LAZY, not CREEPY, and actually shows you are interested in meeting the person.

Hi, I’m insert name here

start 1 a message explaining why I feel we have something in common

2 what those things are

3 funny or cheesy pick up line

4 question to make you see my profile

end regards, insert name here

PS: I am real, and REALLY interested (or some other thing from her online profile!)

## Predictive Analytics Available for Everyone by DMWay

Message from Blog Partner follows

Predictive Analytics Available for Everyone
Featured in Forbes.com, April 2016
DMWAY is highlighted for its ability to empower organizations to streamline predictive modeling and improve their competitive edge!

Understanding your data is more important than ever as a means of differentiation. Industries as diverse as Fintech, Ecommerce, Marketing, Digital Advertising, Utilities, Health Care, and Communication services are all investing in these new transformative techniques. However, building predictive analytics models can still be time-consuming, costly and risky. DMWAY transforms the way predictive analytics is perceived by giving everyone the ability to build better predictive models in hours, accessible and affordable for both large and small organizations.

Forbes Quote
“DMWAY is a good example of how automation is best discussed as human augmentation rather than human replacement, as it facilitates analyst-machine collaboration. The human race may indeed go places when data scientists-both of the highly skilled and of the “citizen” varieties-are supplied with tools that increase their productivity and the accuracy of models that drive decisions”

Gil Press