Importing data from a CSV file using PySpark

There are two ways to import a CSV file: as an RDD or as a Spark DataFrame (preferred). MLlib is built around RDDs, while ML is generally built around DataFrames. See https://spark.apache.org/docs/latest/mllib-clustering.html and https://spark.apache.org/docs/latest/ml-clustering.html

!pip install pyspark

from pyspark import SparkContext, SparkConf
sc = SparkContext()

A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs and broadcast variables on that cluster. https://spark.apache.org/docs/latest/rdd-programming-guide.html#overview

To create a SparkContext you first need to build a SparkConf object that contains information about your application. Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one.

Configuration for a Spark application. Used to set various Spark parameters as key-value pairs.
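
For example, a minimal sketch of creating the context from an explicit SparkConf (the app name and master URL below are illustrative placeholders; any already-running SparkContext must be stopped first):

from pyspark import SparkConf, SparkContext

# Set application-level parameters as key-value pairs on a SparkConf
conf = SparkConf().setAppName("csv-import-demo").setMaster("local[*]")
sc = SparkContext(conf=conf)

# ... work with sc ...
# sc.stop()  # stop the active SparkContext before creating a new one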

dir(SparkContext)

['PACKAGE_EXTENSIONS',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_active_spark_context',
 '_dictToJavaMap',
 '_do_init',
 '_ensure_initialized',
 '_gateway',
 '_getJavaStorageLevel',
 '_initialize_context',
 '_jvm',
 '_lock',
 '_next_accum_id',
 '_python_includes',
 '_repr_html_',
 'accumulator',
 'addFile',
 'addPyFile',
 'applicationId',
 'binaryFiles',
 'binaryRecords',
 'broadcast',
 'cancelAllJobs',
 'cancelJobGroup',
 'defaultMinPartitions',
 'defaultParallelism',
 'dump_profiles',
 'emptyRDD',
 'getConf',
 'getLocalProperty',
 'getOrCreate',
 'hadoopFile',
 'hadoopRDD',
 'newAPIHadoopFile',
 'newAPIHadoopRDD',
 'parallelize',
 'pickleFile',
 'range',
 'runJob',
 'sequenceFile',
 'setCheckpointDir',
 'setJobGroup',
 'setLocalProperty',
 'setLogLevel',
 'setSystemProperty',
 'show_profiles',
 'sparkUser',
 'startTime',
 'statusTracker',
 'stop',
 'textFile',
 'uiWebUrl',
 'union',
 'version',
 'wholeTextFiles']

# Loads data.
data = sc.textFile("C:/Users/Ajay/Desktop/test/new_sample.csv")

type(data)

pyspark.rdd.RDD
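
A quick sketch of working with this textFile RDD, assuming a comma-separated file with a header row and no quoted fields containing commas:

header = data.first()   # the first line is the header
rows = data.filter(lambda line: line != header).map(lambda line: line.split(","))   # split each remaining line into fields

rows.take(5)   # inspect the first few parsed rows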
# Loads data as a Spark DataFrame. Be careful of indentation and whitespace in the multi-line builder chain below.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("Data cleaning") \
    .getOrCreate()

dataframe2 = spark.read.format("csv").option("header", "true").option("mode", "DROPMALFORMED").load("C:/Users/Ajay/Desktop/test/new_sample.csv")

type(dataframe2)

pyspark.sql.dataframe.DataFrame

dataframe2.printSchema()   # similar to str(dataframe) in R or dataframe.info() in pandas
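
If you also want Spark to infer column types instead of reading every column as a string, a small variation of the read above (assuming Spark 2.x or later, where spark.read.csv is available):

dataframe3 = spark.read.option("header", "true").option("inferSchema", "true").csv("C:/Users/Ajay/Desktop/test/new_sample.csv")

dataframe3.printSchema()
dataframe3.show(5)   # peek at the first five rows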

Random Sample of an RDD in Spark

1) To get a random sample of your RDD (named data), say with 100,000 rows, keeping roughly 2% of the values:

data.sample(False,0.02,None).collect()

where data.sample takes the parameters

?data.sample
Signature: data.sample(withReplacement, fraction, seed=None)

and .collect() brings the sampled rows back to the driver as a list

2) Use takeSample when you want to specify the sample by size (say 100 rows)

data.takeSample(False,100)

data.takeSample(withReplacement, num, seed=None)
Docstring:
Return a fixed-size sampled subset of this RDD.
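
Putting both together, a short sketch with a fixed seed so the samples are reproducible (the seed value 42 is arbitrary):

pct_sample = data.sample(False, 0.02, 42).collect()   # roughly 2% of the rows, collected to the driver
fixed_sample = data.takeSample(False, 100, 42)        # exactly 100 rows, already returned as a Python list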

 

 

Simple, isn't it?

Column names of an RDD in Spark using PySpark

Just use any of the following to look at the first row, which holds the column names if your CSV has a header:

data.take(1) or data.first() or data.top(1)

where data is the name of your RDD

To get the total number of rows, use data.count()
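
As a short sketch, assuming the first line of the CSV is a header row:

columns = data.first().split(",")   # list of column names from the header line
num_rows = data.count() - 1         # row count excluding the header line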

 

 

Source https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PySpark_Cheat_Sheet_Python.pdf

 

Random Sample with Hive and Download results

Random Selection

select * from data_base.table_name
where rand() <=0.01
distribute by rand()
sort by rand()
limit 100000;
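
If the same table is visible to Spark (for example, a SparkSession created with Hive support enabled), roughly the same sampling idea can be run from PySpark; the table name is the placeholder from the query above and the output path is illustrative:

sample_df = spark.sql("""
    SELECT * FROM data_base.table_name
    WHERE rand() <= 0.01
    DISTRIBUTE BY rand()
    SORT BY rand()
    LIMIT 100000
""")

sample_df.write.csv("C:/Users/Ajay/Desktop/test/hive_sample", header=True)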

Download Manually

Run the Hive Query.

When it has finished, scroll down to the results pane and use the download icon (fourth from the top).

Download from Hive Programmatically

Use MobaXterm to connect to the server.

Use vi/vim to put the query in a .hql file. Use i to insert text and :wq to save and exit.

Use nohup to run the .hql file and redirect its output to a log file.

[ajayuser@server~]$ mkdir ajay

[ajayuser@server~]$ cd ajay

[ajayuser@serverajay]$ ls

[ajayuser@serverajay]$ vi agesex.hql

[ajayuser@serverajay]$ mv agesex.hql customer_demo.hql

[ajayuser@serverajay]$ ls

customer_demo.hql

[ajayuser@serverajay]$ nohup hive -f customer_demo.hql >>  log_cust.${date}.log;

[ajayuser@serverajay]$ nohup: ignoring input and redirecting stderr to stdout

 

To check progress

[ajayuser@serverajay]$ tail -f log_cust.${date}.log

The transformation trend of data science

  1. Data science (Python, R) is incomplete without Big Data (Hadoop, Spark) software
  2. SAS continues to be the most profitable stack in data science because of much better customer support (than, say, the support available for configuring Big Data analytics on other stacks)
  3. Cloud computing is increasingly an option for companies that are scaling up, but it is not an option for sensitive data (telecom, banking), and there is enough juice left in Moore's law and server systems
  4. Data scientists (stats + coding in R/Py/SAS + business) and data engineers (Linux + Hadoop + Spark) are increasingly expected to have cross-domain skills from each other
  5. Enterprises are at a massive inflection point for digital transformation: apps and websites to get data, cloud to process data, Hadoop/Spark/Kafka to store data, and Py/R/SAS to analyze data in a parallel processing environment
  6. BI and data visualization will continue to be relevant simply because of huge data and limited human cognition. So will traditional statisticians, for designing test and control experiments
  7. Data science will move from tools to insights, requiring much shorter cycle times from data ingestion to data analysis to business action

These are my personal views only.