What exactly does a Data Scientist do as a job?

I got an interesting query on LinkedIn:

  • What exactly does a data scientist do as a job?

  • What are the roles of a data scientist?

    I have been doing something related to data science for almost 14 years, since before data science became a term on Wikipedia (https://en.wikipedia.org/wiki/Data_science), so here are my views.

    A data scientist is simply a person who can

      write code
        = in R, Python, Java, SQL, Hadoop (Pig, HQL, MR), etc.
        = for data storage, querying, summarization, visualization
        = how: efficiently and in time (fast results?)
        = where: on databases, on cloud, on servers

      and understand enough statistics

      to derive insights from data

      so the business can make decisions

    It involves coding, it involves presenting insights, and it involves gathering requirements like a consultant. So you need the following:

    ability to write complex SQL queries

    ability to move, create, and delete files on the command prompt in Linux

    ability to code in Python, in R, and in SAS

    ability to do machine learning (in R with the caret/party/e1071 packages, in Python with scikit-learn, in Spark with MLlib, and in SAS Enterprise Miner)

    ability to learn new languages and tools quickly (Hadoop, Hive, PySpark)

    ability to do analysis on small data using statistics (R/Python/SAS) and on big data

    ability to make presentations on insights to senior management

    > Lots of roles for a single term: data scientist

Be a data scientist online with economical courses from Udemy

Sorry if it sounds like an ad (it isn't), but I was just blown away by these courses for Rs 450 (about 8 USD) each. Apparently 100,000 students have taken the course, which I find dubious since I don't really see 100,000 data scientists when we try to hire. Still, it is a good economical package!

  1. https://lnkd.in/faCJ4mS

  2. https://lnkd.in/fbkWzEY

  3. https://lnkd.in/fv7Dama

  4. https://lnkd.in/f4a76cp

  5. https://lnkd.in/fmfrngn

  6. https://lnkd.in/fgZyyVS

Basic Data Analysis using Iris and PySpark

Collecting pyspark
  Downloading pyspark-2.2.0.post0.tar.gz (188.3MB)
Collecting py4j==0.10.4 (from pyspark)
  Downloading py4j-0.10.4-py2.py3-none-any.whl (186kB)
Building wheels for collected packages: pyspark
  Running setup.py bdist_wheel for pyspark: started
  Running setup.py bdist_wheel for pyspark: finished with status 'done'
  Stored in directory: C:\Users\Dell\AppData\Local\pip\Cache\wheels\5f\0b\b3\5cb16b15d28dcc32f8e7ec91a044829642874bb7586f6e6cbe
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.4 pyspark-2.2.0
In [3]:
from pyspark import SparkContext,SparkConf
sc=SparkContext()
In [4]:
import os
In [5]:
os.getcwd()
Out[5]:
'C:\\Users\\Dell'
In [6]:
os.chdir('C:\\Users\\Dell\\Desktop')
In [8]:
os.listdir()
Out[8]:
['desktop.ini',
 'dump 2582017',
 'Fusion Church.html',
 'Fusion Church_files',
 'iris.csv',
 'KOG',
 'NF22997109906610.ETicket.pdf',
 'R Packages',
 'Telegram.lnk',
 'twitter_share.jpg',
 'winutils.exe',
 '~$avel Reimbursements.docx',
 '~$thonajay.docx']
In [10]:
#load data
data=sc.textFile('C:\\Users\\Dell\\Desktop\\iris.csv')
In [11]:
type(data)
Out[11]:
pyspark.rdd.RDD
In [12]:
data.top(1)
Out[12]:
['7.9,3.8,6.4,2,"virginica"']
In [13]:
data.first()
Out[13]:
'"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"'
In [14]:
from pyspark.sql import SparkSession
In [16]:
spark= SparkSession.builder \
    .master("local") \
    .appName("Data Exploration") \
    .getOrCreate()
In [17]:
#load data as Spark DataFrame
data2=spark.read.format("csv") \
    .option("header","true") \
    .option("mode","DROPMALFORMED") \
    .load('C:\\Users\\Dell\\Desktop\\iris.csv')
In [18]:
type(data2)
Out[18]:
pyspark.sql.dataframe.DataFrame
In [19]:
data2.printSchema()
root
 |-- Sepal.Length: string (nullable = true)
 |-- Sepal.Width: string (nullable = true)
 |-- Petal.Length: string (nullable = true)
 |-- Petal.Width: string (nullable = true)
 |-- Species: string (nullable = true)
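
Every column comes back as a string because the CSV reader does not guess types unless asked. A minimal sketch of one way to ask, assuming the same SparkSession as above: the helper name read_csv_typed is made up for illustration, while "inferSchema" is a standard option of Spark's CSV reader.

```python
def read_csv_typed(spark, path):
    """Read a CSV with a header row and let Spark infer column types.

    `spark` is an existing SparkSession; `path` points to the CSV file.
    The helper name is hypothetical, the reader options are Spark's own.
    """
    return (
        spark.read.format("csv")
        .option("header", "true")       # first line holds the column names
        .option("inferSchema", "true")  # extra pass over the data to guess types
        .load(path)
    )

# Usage with the session from the notebook (needs a running Spark session):
# data_typed = read_csv_typed(spark, 'C:\\Users\\Dell\\Desktop\\iris.csv')
# data_typed.printSchema()
```

With inferred types the numeric columns should no longer need a manual CAST before aggregation, at the cost of one extra read of the file.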

In [25]:
data2.columns
Out[25]:
['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species']
In [28]:
data2.schema.names
Out[28]:
['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species']
In [27]:
oldColumns=data2.schema.names  # original names; the reduce() call below needs them
newColumns=['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Species']
In [30]:
from functools import reduce
In [32]:
data2 = reduce(lambda data2, idx: data2.withColumnRenamed(oldColumns[idx], newColumns[idx]), range(len(oldColumns)), data2)
data2.printSchema()
data2.show()
root
 |-- Sepal_Length: string (nullable = true)
 |-- Sepal_Width: string (nullable = true)
 |-- Petal_Length: string (nullable = true)
 |-- Petal_Width: string (nullable = true)
 |-- Species: string (nullable = true)

+------------+-----------+------------+-----------+-------+
|Sepal_Length|Sepal_Width|Petal_Length|Petal_Width|Species|
+------------+-----------+------------+-----------+-------+
|         5.1|        3.5|         1.4|        0.2| setosa|
|         4.9|          3|         1.4|        0.2| setosa|
|         4.7|        3.2|         1.3|        0.2| setosa|
|         4.6|        3.1|         1.5|        0.2| setosa|
|           5|        3.6|         1.4|        0.2| setosa|
|         5.4|        3.9|         1.7|        0.4| setosa|
|         4.6|        3.4|         1.4|        0.3| setosa|
|           5|        3.4|         1.5|        0.2| setosa|
|         4.4|        2.9|         1.4|        0.2| setosa|
|         4.9|        3.1|         1.5|        0.1| setosa|
|         5.4|        3.7|         1.5|        0.2| setosa|
|         4.8|        3.4|         1.6|        0.2| setosa|
|         4.8|          3|         1.4|        0.1| setosa|
|         4.3|          3|         1.1|        0.1| setosa|
|         5.8|          4|         1.2|        0.2| setosa|
|         5.7|        4.4|         1.5|        0.4| setosa|
|         5.4|        3.9|         1.3|        0.4| setosa|
|         5.1|        3.5|         1.4|        0.3| setosa|
|         5.7|        3.8|         1.7|        0.3| setosa|
|         5.1|        3.8|         1.5|        0.3| setosa|
+------------+-----------+------------+-----------+-------+
only showing top 20 rows
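
The reduce over withColumnRenamed works, but when every column is renamed at once there is a shorter route. A small sketch, not the approach used in the notebook: DataFrame.toDF(*names) is a real PySpark method, while underscore_names and rename_all are illustrative names of my own.

```python
def underscore_names(names):
    """Replace dots with underscores, e.g. 'Sepal.Length' -> 'Sepal_Length'."""
    return [name.replace(".", "_") for name in names]


def rename_all(df, new_names):
    """Rename every column of a Spark DataFrame positionally via toDF."""
    if len(new_names) != len(df.columns):
        raise ValueError(
            "expected %d names, got %d" % (len(df.columns), len(new_names))
        )
    return df.toDF(*new_names)

# Usage with the DataFrame from the notebook (needs a running Spark session):
# data2 = rename_all(data2, underscore_names(data2.columns))
```

Dots in column names are worth removing early, since Spark treats a dot as a field accessor unless the name is wrapped in backticks.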

In [33]:
data2.dtypes
Out[33]:
[('Sepal_Length', 'string'),
 ('Sepal_Width', 'string'),
 ('Petal_Length', 'string'),
 ('Petal_Width', 'string'),
 ('Species', 'string')]
In [35]:
data3 = data2.select('Sepal_Length', 'Sepal_Width', 'Species')
data3.cache()
data3.count()
Out[35]:
150
In [36]:
data3.show()
+------------+-----------+-------+
|Sepal_Length|Sepal_Width|Species|
+------------+-----------+-------+
|         5.1|        3.5| setosa|
|         4.9|          3| setosa|
|         4.7|        3.2| setosa|
|         4.6|        3.1| setosa|
|           5|        3.6| setosa|
|         5.4|        3.9| setosa|
|         4.6|        3.4| setosa|
|           5|        3.4| setosa|
|         4.4|        2.9| setosa|
|         4.9|        3.1| setosa|
|         5.4|        3.7| setosa|
|         4.8|        3.4| setosa|
|         4.8|          3| setosa|
|         4.3|          3| setosa|
|         5.8|          4| setosa|
|         5.7|        4.4| setosa|
|         5.4|        3.9| setosa|
|         5.1|        3.5| setosa|
|         5.7|        3.8| setosa|
|         5.1|        3.8| setosa|
+------------+-----------+-------+
only showing top 20 rows

In [37]:
data3.limit(5)
Out[37]:
DataFrame[Sepal_Length: string, Sepal_Width: string, Species: string]
In [50]:
data3.limit(5).show()
+------------+-----------+-------+
|Sepal_Length|Sepal_Width|Species|
+------------+-----------+-------+
|         5.1|        3.5| setosa|
|         4.9|          3| setosa|
|         4.7|        3.2| setosa|
|         4.6|        3.1| setosa|
|           5|        3.6| setosa|
+------------+-----------+-------+

In [45]:
data3.limit(5).limit(2).show()
+------------+-----------+-------+
|Sepal_Length|Sepal_Width|Species|
+------------+-----------+-------+
|         5.1|        3.5| setosa|
|         4.9|          3| setosa|
+------------+-----------+-------+

In [61]:
# CAST ... AS INT drops the decimal part (5.1 becomes 5); CAST ... AS DOUBLE would keep it
data4=data2.selectExpr('CAST(Sepal_Length AS INT) AS Sepal_Length')
In [62]:
data4
Out[62]:
DataFrame[Sepal_Length: int]
In [63]:
from pyspark.sql.functions import mean
In [65]:
data4.select('Sepal_Length').agg(mean('Sepal_Length')).show()
+-----------------+
|avg(Sepal_Length)|
+-----------------+
|5.386666666666667|
+-----------------+

In [66]:
data5=data2.selectExpr('CAST(Sepal_Length AS INT) AS Sepal_Length','CAST(Petal_Width AS INT) AS Petal_Width','CAST(Sepal_Width AS INT) AS Sepal_Width','CAST(Petal_Length AS INT) AS Petal_Length','Species')
In [67]:
data5
Out[67]:
DataFrame[Sepal_Length: int, Petal_Width: int, Sepal_Width: int, Petal_Length: int, Species: string]
In [68]:
data5.columns
Out[68]:
['Sepal_Length', 'Petal_Width', 'Sepal_Width', 'Petal_Length', 'Species']
In [76]:
data5.select('Sepal_Length','Species').groupBy('Species').agg(mean("Sepal_Length")).show()
+----------+-----------------+
|   Species|avg(Sepal_Length)|
+----------+-----------------+
| virginica|             6.08|
|versicolor|             5.48|
|    setosa|              4.6|
+----------+-----------------+
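
A word of caution on the averages above: the columns were cast to INT, so every measurement lost its decimal part before the mean was taken, and the means come out lower than the real ones. The plain-Python sketch below mimics that effect on the first five Sepal_Length values shown earlier. It is only an illustration of the arithmetic, not Spark code; in Spark the likely fix is to write DOUBLE instead of INT in the selectExpr strings.

```python
# First five Sepal_Length values as they appear in the table above (strings).
raw = ["5.1", "4.9", "4.7", "4.6", "5"]

as_int = [int(float(v)) for v in raw]  # mimics CAST(... AS INT): 5, 4, 4, 4, 5
as_double = [float(v) for v in raw]    # mimics CAST(... AS DOUBLE)

mean_int = sum(as_int) / len(as_int)
mean_double = sum(as_double) / len(as_double)

print(mean_int)               # 4.4
print(round(mean_double, 2))  # 4.86
```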

Why do people go to America

Why do people go to America? Moving involves tremendous emotional, financial and physical expenditure. It is a very different and unique culture full of surprises, your savings will vanish in the expensive dollar economy, and you will have to adjust to a car-driven, credit-history-driven existence.

Why move to America?

The answer lies in the small notion of the American dream:

  1. Freedom of speech
  2. Freedom of worship
  3. Freedom from want
  4. Freedom from fear

I can criticize, mock, and insult the American president when in America, a freedom I am denied in most other countries on the planet.

I can worship anyone and anything, openly, loudly and not be afraid.

I can earn a good living even as a blue-collar worker and have unlimited opportunities to start my own business or startup with the utmost ease.

But above all,

I am free from fear when I am in America. The FBI needs a warrant and the CIA won't hurt me. That is not true of other countries.

Yet with threats of deportation, there is no more freedom from fear, even for children. Hate crimes against minorities are at an unprecedented high (by American standards), and even the press is now mocked (reversing traditional American politics, or even politics in other democracies). The Putinization of America is under way and all we do is twiddle on Twitter.

So why move for the American dream?

As someone said, to dream the American dream, you first need to go to sleep.

Amen

Importing data from csv file using PySpark

There are two ways to import the csv file: one as an RDD and the other as a Spark DataFrame (preferred). MLlib is built around RDDs while ML is generally built around DataFrames. https://spark.apache.org/docs/latest/mllib-clustering.html and https://spark.apache.org/docs/latest/ml-clustering.html

!pip install pyspark

from pyspark import SparkContext, SparkConf
sc = SparkContext()

A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs and broadcast variables on that cluster.  https://spark.apache.org/docs/latest/rdd-programming-guide.html#overview

To create a SparkContext you first need to build a SparkConf object that contains information about your application. Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one.

A SparkConf holds the configuration for a Spark application and is used to set various Spark parameters as key-value pairs.
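
As a rough sketch of how the two fit together, the helper below builds a SparkConf and hands it to SparkContext.getOrCreate, which reuses an already active context instead of failing. The helper name get_spark_context, the application name and the "local[*]" master are my own assumptions for illustration; SparkConf.setAppName, SparkConf.setMaster and SparkContext.getOrCreate are existing PySpark calls.

```python
def get_spark_context(app_name="Data import", master="local[*]"):
    """Return the active SparkContext, or create one from a SparkConf."""
    # Imported here so that defining the helper does not itself require PySpark.
    from pyspark import SparkConf, SparkContext

    # Key-value settings: the application name and where to run it.
    conf = SparkConf().setAppName(app_name).setMaster(master)

    # Reuses an already active context if there is one, otherwise creates it.
    return SparkContext.getOrCreate(conf=conf)

# Usage (starts a local Spark JVM, so PySpark and Java must be installed):
# sc = get_spark_context()
# ...
# sc.stop()  # stop it before creating a context with a different configuration
```

Note that if a context is already active, getOrCreate returns that one, so the new settings only take effect after stop().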

dir(SparkContext)

['PACKAGE_EXTENSIONS',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_active_spark_context',
 '_dictToJavaMap',
 '_do_init',
 '_ensure_initialized',
 '_gateway',
 '_getJavaStorageLevel',
 '_initialize_context',
 '_jvm',
 '_lock',
 '_next_accum_id',
 '_python_includes',
 '_repr_html_',
 'accumulator',
 'addFile',
 'addPyFile',
 'applicationId',
 'binaryFiles',
 'binaryRecords',
 'broadcast',
 'cancelAllJobs',
 'cancelJobGroup',
 'defaultMinPartitions',
 'defaultParallelism',
 'dump_profiles',
 'emptyRDD',
 'getConf',
 'getLocalProperty',
 'getOrCreate',
 'hadoopFile',
 'hadoopRDD',
 'newAPIHadoopFile',
 'newAPIHadoopRDD',
 'parallelize',
 'pickleFile',
 'range',
 'runJob',
 'sequenceFile',
 'setCheckpointDir',
 'setJobGroup',
 'setLocalProperty',
 'setLogLevel',
 'setSystemProperty',
 'show_profiles',
 'sparkUser',
 'startTime',
 'statusTracker',
 'stop',
 'textFile',
 'uiWebUrl',
 'union',
 'version',
 'wholeTextFiles']

# Loads data.
data = sc.textFile("C:/Users/Ajay/Desktop/test/new_sample.csv")

type(data)

pyspark.rdd.RDD

# Loads data. Be careful of indentations and whitespace

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("Data cleaning") \
    .getOrCreate()

dataframe2 = spark.read.format("csv").option("header","true").option("mode","DROPMALFORMED").load("C:/Users/Ajay/Desktop/test/new_sample.csv")

type(dataframe2)

pyspark.sql.dataframe.DataFrame

dataframe2.printSchema()  # same as str(dataframe) in R and dataframe.info() in Pandas

Random Sample of RDD in Spark

1) To get a random sample of your RDD (named data), say with 100000 rows, and keep about 20% of the values

data.sample(False,0.2,None).collect()

where data.sample takes the parameters

?data.sample
Signature: data.sample(withReplacement, fraction, seed=None)

and .collect() returns the sampled rows to the driver as a list

2) takeSample, when I specify the sample by size (say 100)

data.takeSample(False,100)

data.takeSample(withReplacement, num, seed=None)
Docstring:
Return a fixed-size sampled subset of this RDD.
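
To make the difference between the two calls concrete, here is a plain-Python stand-in (not Spark's actual implementation, and the function names are my own): sampling by fraction keeps each row independently with the given probability, so the result size is only approximately fraction times the number of rows, while sampling by size always returns exactly the number asked for.

```python
import random


def sample_by_fraction(rows, fraction, seed=None):
    """Keep each row with probability `fraction` (roughly like data.sample)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


def sample_by_size(rows, num, seed=None):
    """Return exactly `num` rows without replacement (roughly like data.takeSample)."""
    rng = random.Random(seed)
    return rng.sample(rows, min(num, len(rows)))


rows = list(range(100000))
by_fraction = sample_by_fraction(rows, 0.2, seed=42)
by_size = sample_by_size(rows, 100, seed=42)

print(len(by_fraction))  # close to 20000, but usually not exactly 20000
print(len(by_size))      # exactly 100
```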

 

 

Simple, isn't it?