Lie factor in Gun Deaths Visualization

Edward Tufte introduced the lie factor in his seminal book. In the image below, note how Columbine appears higher than Virginia Tech thanks to the dotted line, even though it had 50% fewer casualties.

http://www.infovis-wiki.net/index.php/Lie_Factor

The “Lie Factor” is a value to describe the relation between the size of effect shown in a graphic and the size of effect shown in the data.

Edward Tufte, a professor at Yale University, defined the “Lie Factor” in his 1983 book “The Visual Display of Quantitative Information”.

He states the principle that:

“The representation of numbers, as physically measured on the surface of the graphic itself, should be directly proportional to the quantities represented.”

Image from http://www.huffingtonpost.com/entry/what-will-happen-to-the-las-vegas-shooters-suite-at-mandalay-bay_us_59d6721ae4b0f6eed34ef753?section=us_politics
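
As a rough sketch of the definition above, the lie factor can be computed as the relative change shown in the graphic divided by the relative change in the data. The function names and the numbers here are made up for illustration; they are not measurements taken from the chart in this post.

```python
def effect_size(first, second):
    """Relative change from the first value to the second."""
    if first == 0:
        raise ValueError("first value must be non-zero")
    return abs(second - first) / abs(first)


def lie_factor(graphic_first, graphic_second, data_first, data_second):
    """Size of effect shown in the graphic / size of effect in the data."""
    return effect_size(graphic_first, graphic_second) / effect_size(data_first, data_second)


# Hypothetical example: the data doubles (10 -> 20),
# but the bar drawn for it grows from 1 cm to 4 cm.
exaggerated = lie_factor(1.0, 4.0, 10, 20)

# An honest chart scales the bar in proportion to the data (2 cm -> 4 cm).
honest = lie_factor(2.0, 4.0, 10, 20)

print(exaggerated, honest)
```

A lie factor close to 1 means the graphic is faithful to the data; in this made-up case the first chart overstates the change threefold.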

What exactly does a Data Scientist do as a job?

Got an interesting query on LinkedIn:

  • What exactly does a Data Scientist do as a job?

  • What are the roles of a data scientist?

    I have been doing something related to data science for almost 14 years, since before data science became a term on Wikipedia (https://en.wikipedia.org/wiki/Data_science), so here are my views.

    A data scientist is simply a person who can:

      • write code
          • in R, Python, Java, SQL, Hadoop (Pig, HQL, MR), etc.
          • for data storage, querying, summarization, and visualization
          • efficiently and in time (fast results)
          • on databases, in the cloud, or on servers

      • understand enough statistics to derive insights from data, so the business can make decisions

    The job involves coding, presenting insights, and gathering requirements like a consultant. So you need the following:

    ability to write complex SQL queries

    ability to move, create, and delete files from the command prompt in Linux

    ability to code in Python, R, and SAS

    ability to do machine learning (in R with the caret/party/e1071 packages, in Python with scikit-learn, in Spark with MLlib, and in SAS Enterprise Miner)

    ability to learn new languages and tools quickly (Hadoop, Hive, PySpark)

    ability to do analysis on small data using statistics (R/Python/SAS) and on big data

    ability to make presentations on insights to senior management

    > Lots of roles for a single term: data scientist

Be a data scientist online with economical courses from Udemy

Sorry if this sounds like an ad (it isn't), but I was blown away by these courses at Rs 450 (about 8 USD) each. Apparently 100,000 students have taken the course, which I find dubious, since I don't see anywhere near 100,000 data scientists when we try to hire. Still, it is a good economical package!

  1. https://lnkd.in/faCJ4mS

  2. https://lnkd.in/fbkWzEY

  3. https://lnkd.in/fv7Dama

  4. https://lnkd.in/f4a76cp

  5. https://lnkd.in/fmfrngn

  6. https://lnkd.in/fgZyyVS

Basic Data Analysis using Iris and PySpark

Collecting pyspark
  Downloading pyspark-2.2.0.post0.tar.gz (188.3MB)
Collecting py4j==0.10.4 (from pyspark)
  Downloading py4j-0.10.4-py2.py3-none-any.whl (186kB)
Building wheels for collected packages: pyspark
  Running setup.py bdist_wheel for pyspark: started
  Running setup.py bdist_wheel for pyspark: finished with status 'done'
  Stored in directory: C:\Users\Dell\AppData\Local\pip\Cache\wheels\5f\0b\b3\5cb16b15d28dcc32f8e7ec91a044829642874bb7586f6e6cbe
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.4 pyspark-2.2.0
In [3]:
from pyspark import SparkContext,SparkConf
sc=SparkContext()
In [4]:
import os
In [5]:
os.getcwd()
Out[5]:
'C:\\Users\\Dell'
In [6]:
os.chdir('C:\\Users\\Dell\\Desktop')
In [8]:
os.listdir()
Out[8]:
['desktop.ini',
 'dump 2582017',
 'Fusion Church.html',
 'Fusion Church_files',
 'iris.csv',
 'KOG',
 'NF22997109906610.ETicket.pdf',
 'R Packages',
 'Telegram.lnk',
 'twitter_share.jpg',
 'winutils.exe',
 '~$avel Reimbursements.docx',
 '~$thonajay.docx']
In [10]:
#load data
data=sc.textFile('C:\\Users\\Dell\\Desktop\\iris.csv')
In [11]:
type(data)
Out[11]:
pyspark.rdd.RDD
In [12]:
# top(1) returns the largest element (string ordering here), not the first line of the file
data.top(1)
Out[12]:
['7.9,3.8,6.4,2,"virginica"']
In [13]:
data.first()
Out[13]:
'"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"'
In [14]:
from pyspark.sql import SparkSession
In [16]:
spark= SparkSession.builder \
    .master("local") \
    .appName("Data Exploration") \
    .getOrCreate()
In [17]:
#load data as Spark DataFrame (no inferSchema option, so every column is read as a string)
data2=spark.read.format("csv") \
    .option("header","true") \
    .option("mode","DROPMALFORMED") \
    .load('C:\\Users\\Dell\\Desktop\\iris.csv')
In [18]:
type(data2)
Out[18]:
pyspark.sql.dataframe.DataFrame
In [19]:
data2.printSchema()
root
 |-- Sepal.Length: string (nullable = true)
 |-- Sepal.Width: string (nullable = true)
 |-- Petal.Length: string (nullable = true)
 |-- Petal.Width: string (nullable = true)
 |-- Species: string (nullable = true)
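
Every column comes back as a string because a CSV file is just text until something converts it. The following is a plain-Python illustration of that point (standard library only, not Spark), using the header and the first two rows as they appear in this notebook's output; the variable names are only for this sketch.

```python
import csv
import io

# Header and first two data rows of iris.csv, as shown in this notebook
raw = '''"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
5.1,3.5,1.4,0.2,"setosa"
4.9,3,1.4,0.2,"setosa"
'''

rows = list(csv.DictReader(io.StringIO(raw)))

# Straight out of the reader, every field is text
print(type(rows[0]["Sepal.Length"]))

# Convert the measurement columns explicitly; leave Species as text
numeric_cols = [c for c in rows[0] if c != "Species"]
typed = [
    {c: (float(r[c]) if c in numeric_cols else r[c]) for c in r}
    for r in rows
]
print(typed[0])
```

The same explicit conversion is what the CAST expressions later in this notebook do on the Spark side.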

In [25]:
data2.columns
Out[25]:
['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species']
In [28]:
data2.schema.names
Out[28]:
['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species']
In [27]:
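# oldColumns is not defined in the cells shown here; assuming it held the original column names
oldColumns=data2.schema.names
newColumns=['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Species']
In [30]:
from functools import reduce
In [32]:
# rename the columns one by one, replacing the dots with underscores
data2 = reduce(lambda df, idx: df.withColumnRenamed(oldColumns[idx], newColumns[idx]), range(len(oldColumns)), data2)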
data2.printSchema()
data2.show()
root
 |-- Sepal_Length: string (nullable = true)
 |-- Sepal_Width: string (nullable = true)
 |-- Petal_Length: string (nullable = true)
 |-- Petal_Width: string (nullable = true)
 |-- Species: string (nullable = true)

+------------+-----------+------------+-----------+-------+
|Sepal_Length|Sepal_Width|Petal_Length|Petal_Width|Species|
+------------+-----------+------------+-----------+-------+
|         5.1|        3.5|         1.4|        0.2| setosa|
|         4.9|          3|         1.4|        0.2| setosa|
|         4.7|        3.2|         1.3|        0.2| setosa|
|         4.6|        3.1|         1.5|        0.2| setosa|
|           5|        3.6|         1.4|        0.2| setosa|
|         5.4|        3.9|         1.7|        0.4| setosa|
|         4.6|        3.4|         1.4|        0.3| setosa|
|           5|        3.4|         1.5|        0.2| setosa|
|         4.4|        2.9|         1.4|        0.2| setosa|
|         4.9|        3.1|         1.5|        0.1| setosa|
|         5.4|        3.7|         1.5|        0.2| setosa|
|         4.8|        3.4|         1.6|        0.2| setosa|
|         4.8|          3|         1.4|        0.1| setosa|
|         4.3|          3|         1.1|        0.1| setosa|
|         5.8|          4|         1.2|        0.2| setosa|
|         5.7|        4.4|         1.5|        0.4| setosa|
|         5.4|        3.9|         1.3|        0.4| setosa|
|         5.1|        3.5|         1.4|        0.3| setosa|
|         5.7|        3.8|         1.7|        0.3| setosa|
|         5.1|        3.8|         1.5|        0.3| setosa|
+------------+-----------+------------+-----------+-------+
only showing top 20 rows

In [33]:
data2.dtypes
Out[33]:
[('Sepal_Length', 'string'),
 ('Sepal_Width', 'string'),
 ('Petal_Length', 'string'),
 ('Petal_Width', 'string'),
 ('Species', 'string')]
In [35]:
data3 = data2.select('Sepal_Length', 'Sepal_Width', 'Species')
data3.cache()
data3.count()
Out[35]:
150
In [36]:
data3.show()
+------------+-----------+-------+
|Sepal_Length|Sepal_Width|Species|
+------------+-----------+-------+
|         5.1|        3.5| setosa|
|         4.9|          3| setosa|
|         4.7|        3.2| setosa|
|         4.6|        3.1| setosa|
|           5|        3.6| setosa|
|         5.4|        3.9| setosa|
|         4.6|        3.4| setosa|
|           5|        3.4| setosa|
|         4.4|        2.9| setosa|
|         4.9|        3.1| setosa|
|         5.4|        3.7| setosa|
|         4.8|        3.4| setosa|
|         4.8|          3| setosa|
|         4.3|          3| setosa|
|         5.8|          4| setosa|
|         5.7|        4.4| setosa|
|         5.4|        3.9| setosa|
|         5.1|        3.5| setosa|
|         5.7|        3.8| setosa|
|         5.1|        3.8| setosa|
+------------+-----------+-------+
only showing top 20 rows

In [37]:
data3.limit(5)
Out[37]:
DataFrame[Sepal_Length: string, Sepal_Width: string, Species: string]
In [50]:
data3.limit(5).show()
+------------+-----------+-------+
|Sepal_Length|Sepal_Width|Species|
+------------+-----------+-------+
|         5.1|        3.5| setosa|
|         4.9|          3| setosa|
|         4.7|        3.2| setosa|
|         4.6|        3.1| setosa|
|           5|        3.6| setosa|
+------------+-----------+-------+

In [45]:
data3.limit(5).limit(2).show()
+------------+-----------+-------+
|Sepal_Length|Sepal_Width|Species|
+------------+-----------+-------+
|         5.1|        3.5| setosa|
|         4.9|          3| setosa|
+------------+-----------+-------+

In [61]:
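# note: CAST ... AS INT truncates the decimals (5.1 becomes 5), so the averages below are lower than the true means; CAST ... AS DOUBLE would keep them
data4=data2.selectExpr('CAST(Sepal_Length AS INT) AS Sepal_Length')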
In [62]:
data4
Out[62]:
DataFrame[Sepal_Length: int]
In [63]:
# import only what is used; "import *" would shadow Python builtins such as sum and max
from pyspark.sql.functions import mean
In [65]:
data4.select('Sepal_Length').agg(mean('Sepal_Length')).show()
+-----------------+
|avg(Sepal_Length)|
+-----------------+
|5.386666666666667|
+-----------------+
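
The CAST ... AS INT in the cells above drops the decimal part of each value before averaging, which pulls the mean down. Here is a small plain-Python sketch of the effect (not Spark), using only the Sepal_Length values from the first five rows shown earlier, so the numbers will not match the full-column average.

```python
# Sepal_Length strings from the first five rows shown above
raw = ["5.1", "4.9", "4.7", "4.6", "5"]

# Truncating like an integer cast drops the decimal part.
# (Python's int("5.1") raises ValueError, so go through float first.)
as_int = [int(float(v)) for v in raw]
as_float = [float(v) for v in raw]

mean_int = sum(as_int) / len(as_int)
mean_float = sum(as_float) / len(as_float)

print(as_int)
print(mean_int, round(mean_float, 2))
```

On these five rows the truncated mean is noticeably lower than the mean of the original decimals, which is why casting to a floating-point type is the safer choice for measurements like these.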

In [66]:
data5=data2.selectExpr(
    'CAST(Sepal_Length AS INT) AS Sepal_Length',
    'CAST(Petal_Width AS INT) AS Petal_Width',
    'CAST(Sepal_Width AS INT) AS Sepal_Width',
    'CAST(Petal_Length AS INT) AS Petal_Length',
    'Species')
In [67]:
data5
Out[67]:
DataFrame[Sepal_Length: int, Petal_Width: int, Sepal_Width: int, Petal_Length: int, Species: string]
In [68]:
data5.columns
Out[68]:
['Sepal_Length', 'Petal_Width', 'Sepal_Width', 'Petal_Length', 'Species']
In [76]:
data5.select('Sepal_Length','Species').groupBy('Species').agg(mean("Sepal_Length")).show()
+----------+-----------------+
|   Species|avg(Sepal_Length)|
+----------+-----------------+
| virginica|             6.08|
|versicolor|             5.48|
|    setosa|              4.6|
+----------+-----------------+
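
For readers new to groupBy, here is a plain-Python sketch (not Spark) of what grouping by Species and averaging computes: bucket the rows by key, then average each bucket. It uses only three rows that appear earlier in this notebook, so the numbers will not match the Spark table above.

```python
from collections import defaultdict

# (Sepal_Length, Species) pairs taken from rows shown earlier in this notebook
rows = [(5.1, "setosa"), (4.9, "setosa"), (7.9, "virginica")]

# Step 1: bucket the values by species
groups = defaultdict(list)
for length, species in rows:
    groups[species].append(length)

# Step 2: average each bucket
means = {species: sum(vals) / len(vals) for species, vals in groups.items()}
print(means)
```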

Why do people go to America

Why do people go to America? Moving involves tremendous emotional, financial and physical expenditure. It is a very different and unique culture, full of surprises. Your savings will vanish in the expensive dollar economy, and you will have to adjust to a car-driven, credit-history-driven existence.

Why move to America?

The answer lies in the small notion of the American dream:

  1. Freedom of speech
  2. Freedom of worship
  3. Freedom from want
  4. Freedom from fear

I can criticize, mock, and insult the American president when in America, a freedom I am denied in most other countries on the planet.

I can worship anyone and anything, openly, loudly and not be afraid.

I can earn a good living even as a blue-collar worker and have unlimited opportunities to start my own business or startup with the utmost ease.

But above all,

I am free from fear when I am in America. The FBI needs a warrant and the CIA won't hurt me. That is not true of other countries.

Yet with threats of deportation, there is no more freedom from fear, even for children. Hate crimes against minorities are at an unprecedented high (by American standards), and even the press is now mocked (reversing traditional American politics, or even politics in other democracies). The Putinization of America is under way and all we do is twiddle on Twitter.

So why move for the American dream?

As someone said, to dream the American dream, you first need to go to sleep.

Amen