Bar Chart in Python

In Seaborn a bar chart can be created with the sns.countplot method, passing it the data; it draws one bar per category, with the bar height equal to the number of rows in that category. For plotting an aggregated value per category instead, use sns.barplot.

https://towardsdatascience.com/introduction-to-data-visualization-in-python-89a54c97fbed

https://seaborn.pydata.org/generated/seaborn.barplot.html
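A minimal sketch of sns.countplot on made-up data (the DataFrame and column name here are hypothetical; the Agg backend is set so no display is needed):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, safe on servers
import pandas as pd
import seaborn as sns

# Toy data: one categorical column whose values we want to count
df = pd.DataFrame({"fruit": ["apple", "apple", "banana", "apple", "banana", "pear"]})

# One bar per category; bar height = number of rows in that category
ax = sns.countplot(data=df, x="fruit")
ax.figure.savefig("fruit_counts.png")
```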

Split and Substring in Hive QL

Suppose you have a variable like AccountID

split(trim(AccountID), '-')[0]

trim removes leading and trailing spaces,

split(..., '-') splits the string into multiple parts on the '-' delimiter,

and [0] gives the first part of the split string ([1] gives the second part, etc.).
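The same trim-then-split logic can be sketched in plain Python (the AccountID value below is made up for illustration):

```python
# Hypothetical AccountID with stray spaces around it
account_id = "  ACME-2024-001  "

# strip() plays the role of Hive's trim(), split("-") of split(..., '-')
parts = account_id.strip().split("-")

first = parts[0]   # like split(trim(AccountID), '-')[0]
second = parts[1]  # like [1], the second part
```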

Creating Buckets in Pandas using Query

It is quite easy to create buckets based on one column using query. Note that you can also use pd.cut or pd.qcut for this.

 

bucket1 = df.query('Column_1 >= 0 and Column_1 < 0.25')

bucket2 = df.query('Column_1 >= 0.25 and Column_1 < 0.5')

bucket3 = df.query('Column_1 >= 0.5 and Column_1 < 0.75')

bucket4 = df.query('Column_1 >= 0.75 and Column_1 <= 1')
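The four query() calls can also be collapsed into a single pd.cut call; a sketch on toy data, assuming Column_1 lies in [0, 1] (note pd.cut's default intervals are right-closed, so boundary handling differs slightly from the >=/< conditions above):

```python
import pandas as pd

# Toy data, assumed to lie in [0, 1]
df = pd.DataFrame({"Column_1": [0.1, 0.3, 0.6, 0.9]})

# One labelled bucket column instead of four separate DataFrames
df["bucket"] = pd.cut(
    df["Column_1"],
    bins=[0, 0.25, 0.5, 0.75, 1.0],
    labels=["bucket1", "bucket2", "bucket3", "bucket4"],
    include_lowest=True,  # so a value of exactly 0 falls into the first bucket
)
```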

Load Multiple CSV Files in PySpark

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("Data Exploration") \
    .getOrCreate()

# load all matching CSV files as one Spark DataFrame
data2 = spark.read.format("csv") \
    .option("header", "true") \
    .option("mode", "DROPMALFORMED") \
    .load('/home/Desktop//input/*.csv')

Convert Many Columns to Float in PySpark

from pyspark.sql.functions import col

# cast every column of the DataFrame to float
for col_name in data7.columns:
    data7 = data7.withColumn(col_name, col(col_name).cast('float'))

 

data7.printSchema()

How to run PySpark through Jupyter notebook via Docker

docker run -it -p 8888:8888 jupyter/pyspark-notebook

Install Docker first.

 

Source: https://levelup.gitconnected.com/using-docker-and-pyspark-134cd4cab867