Importing data from a CSV file using PySpark

There are two ways to import a CSV file: as an RDD or as a Spark DataFrame (preferred). MLlib is built around RDDs, while ML is generally built around DataFrames. See https://spark.apache.org/docs/latest/mllib-clustering.html and https://spark.apache.org/docs/latest/ml-clustering.html

!pip install pyspark

from pyspark import SparkContext, SparkConf
sc = SparkContext()

A SparkContext represents the connection to a Spark cluster and can be used to create RDDs and broadcast variables on that cluster. https://spark.apache.org/docs/latest/rdd-programming-guide.html#overview

To create a SparkContext you first need to build a SparkConf object that contains information about your application. Only one SparkContext may be active per JVM; you must stop() the active SparkContext before creating a new one.

A SparkConf holds the configuration for a Spark application and is used to set various Spark parameters as key-value pairs.
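As a minimal sketch, a SparkConf can be built first and passed to the constructor (an alternative to the bare SparkContext() call above; the appName and master values here are just placeholders):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("csv-import-demo").setMaster("local[*]")  # placeholder name and master
sc = SparkContext(conf=conf)  # only one SparkContext per JVM; stop() any existing one first
print(sc.version)  # Spark version the context is connected to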

dir(SparkContext)

['PACKAGE_EXTENSIONS',
'__class__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__enter__',
'__eq__',
'__exit__',
'__format__',
'__ge__',
'__getattribute__',
'__getnewargs__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__le__',
'__lt__',
'__module__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__setattr__',
'__sizeof__',
'__str__',
'__subclasshook__',
'__weakref__',
'_active_spark_context',
'_dictToJavaMap',
'_do_init',
'_ensure_initialized',
'_gateway',
'_getJavaStorageLevel',
'_initialize_context',
'_jvm',
'_lock',
'_next_accum_id',
'_python_includes',
'_repr_html_',
'accumulator',
'addFile',
'addPyFile',
'applicationId',
'binaryFiles',
'binaryRecords',
'broadcast',
'cancelAllJobs',
'cancelJobGroup',
'defaultMinPartitions',
'defaultParallelism',
'dump_profiles',
'emptyRDD',
'getConf',
'getLocalProperty',
'getOrCreate',
'hadoopFile',
'hadoopRDD',
'newAPIHadoopFile',
'newAPIHadoopRDD',
'parallelize',
'pickleFile',
'range',
'runJob',
'sequenceFile',
'setCheckpointDir',
'setJobGroup',
'setLocalProperty',
'setLogLevel',
'setSystemProperty',
'show_profiles',
'sparkUser',
'startTime',
'statusTracker',
'stop',
'textFile',
'uiWebUrl',
'union',
'version',
'wholeTextFiles']

# Loads data.
data = sc.textFile("C:/Users/Ajay/Desktop/test/new_sample.csv")

type(data)

pyspark.rdd.RDD
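Since textFile() returns an RDD of raw strings, each line still has to be parsed by hand. A minimal sketch, assuming the file is comma-delimited and has a header row:

header = data.first()  # assume the first line holds the column names
rows = data.filter(lambda line: line != header).map(lambda line: line.split(","))  # drop header, split fields
rows.take(5)  # inspect the first five parsed records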
# Load the data as a Spark DataFrame instead. Be careful with indentation and whitespace in the builder chain below.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("Data cleaning") \
    .getOrCreate()

dataframe2 = spark.read.format("csv").option("header", "true").option("mode", "DROPMALFORMED").load("C:/Users/Ajay/Desktop/test/new_sample.csv")

type(dataframe2)

pyspark.sql.dataframe.DataFrame
dataframe2.printSchema()

printSchema() shows the column names and types, comparable to str(dataframe) in R or dataframe.info() in Pandas.
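With the options above all columns are read as strings. As an alternative sketch, spark.read.csv() can infer column types directly (same file path as before):

dataframe3 = spark.read.csv("C:/Users/Ajay/Desktop/test/new_sample.csv", header=True, inferSchema=True, mode="DROPMALFORMED")
dataframe3.printSchema()  # column names with inferred types
dataframe3.show(5)  # preview the first five rows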

Author: Ajay Ohri

http://about.me/ajayohri
