There are two ways to import the CSV file: as an RDD or as a Spark DataFrame (the preferred option). MLlib is built around RDDs, while ML is generally built around DataFrames. See https://spark.apache.org/docs/latest/mllib-clustering.html and https://spark.apache.org/docs/latest/ml-clustering.html
!pip install pyspark
from pyspark import SparkContext, SparkConf
sc = SparkContext()
A SparkContext represents the connection to a Spark cluster and can be used to create RDDs and broadcast variables on that cluster. https://spark.apache.org/docs/latest/rdd-programming-guide.html#overview
To create a SparkContext you first need to build a SparkConf object that contains information about your application. Only one SparkContext may be active per JVM; you must stop() the active SparkContext before creating a new one.
A SparkConf holds the configuration for a Spark application and is used to set various Spark parameters as key-value pairs.
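For example, here is a minimal sketch of building a SparkConf and getting a SparkContext from it (the app name and the local[*] master are illustrative choices, not values from the original run):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("csv-import-demo").setMaster("local[*]")  # illustrative settings
sc = SparkContext.getOrCreate(conf=conf)  # returns the already-active context if one exists, so the one-per-JVM rule is respected
print(sc.version)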
dir(SparkContext)
['PACKAGE_EXTENSIONS',
'__class__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__enter__',
'__eq__',
'__exit__',
'__format__',
'__ge__',
'__getattribute__',
'__getnewargs__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__le__',
'__lt__',
'__module__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__setattr__',
'__sizeof__',
'__str__',
'__subclasshook__',
'__weakref__',
'_active_spark_context',
'_dictToJavaMap',
'_do_init',
'_ensure_initialized',
'_gateway',
'_getJavaStorageLevel',
'_initialize_context',
'_jvm',
'_lock',
'_next_accum_id',
'_python_includes',
'_repr_html_',
'accumulator',
'addFile',
'addPyFile',
'applicationId',
'binaryFiles',
'binaryRecords',
'broadcast',
'cancelAllJobs',
'cancelJobGroup',
'defaultMinPartitions',
'defaultParallelism',
'dump_profiles',
'emptyRDD',
'getConf',
'getLocalProperty',
'getOrCreate',
'hadoopFile',
'hadoopRDD',
'newAPIHadoopFile',
'newAPIHadoopRDD',
'parallelize',
'pickleFile',
'range',
'runJob',
'sequenceFile',
'setCheckpointDir',
'setJobGroup',
'setLocalProperty',
'setLogLevel',
'setSystemProperty',
'show_profiles',
'sparkUser',
'startTime',
'statusTracker',
'stop',
'textFile',
'uiWebUrl',
'union',
'version',
'wholeTextFiles']
# Loads data.
data = sc.textFile("C:/Users/Ajay/Desktop/test/new_sample.csv")
type(data)
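As a quick sanity check on the RDD (a sketch only; the comma delimiter and the presence of a header row in new_sample.csv are assumptions), each line can be split into fields:

header = data.first()                                # assumed header row
rows = data.filter(lambda line: line != header) \
           .map(lambda line: line.split(","))        # assumed comma-delimited fields
print(rows.take(3))                                  # peek at the first three parsed rows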
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master("local") \
    .appName("Data cleaning") \
    .getOrCreate()
dataframe2 = spark.read.format("csv") \
    .option("header", "true") \
    .option("mode", "DROPMALFORMED") \
    .load("C:/Users/Ajay/Desktop/test/new_sample.csv")
type(dataframe2)
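From here the DataFrame can be inspected directly; a minimal sketch (the column names simply come from whatever header row new_sample.csv carries):

dataframe2.printSchema()     # column names taken from the CSV header
dataframe2.show(5)           # preview the first five rows
print(dataframe2.count())    # rows remaining after DROPMALFORMED skips malformed records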