Category: Analytics
Bar Chart in Python
In Seaborn, a bar chart of category counts can be created with the sns.countplot function by passing it the data.
https://towardsdatascience.com/introduction-to-data-visualization-in-python-89a54c97fbed
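A minimal sketch of the countplot call (the DataFrame and the 'city' column are hypothetical example data, not from the article):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# hypothetical example data; 'city' is an assumed column name
df = pd.DataFrame({'city': ['NY', 'SF', 'NY', 'LA', 'SF', 'NY']})

# countplot draws one bar per category; bar height = number of rows
ax = sns.countplot(x='city', data=df)
ax.set_title('Row count per city')
plt.show()
```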
Split and Substring in Hive QL
Suppose you have a variable like AccountID
split(trim(AccountID), '-')[0]
trim removes leading and trailing spaces,
split breaks the string into multiple parts based on the '-' delimiter,
and [0] gives the first part of the split string ([1] gives the second part, etc.).
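For comparison, the same trim-split-index transformation can be sketched in pandas (the AccountID values and the DataFrame are hypothetical):

```python
import pandas as pd

# hypothetical account IDs with stray whitespace
df = pd.DataFrame({'AccountID': ['  ACME-001  ', 'GLOBEX-042']})

# trim -> str.strip(); split('-')[0] -> str.split('-').str[0]
df['prefix'] = df['AccountID'].str.strip().str.split('-').str[0]
print(df['prefix'].tolist())  # → ['ACME', 'GLOBEX']
```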
Creating Buckets in Pandas using Query
It is quite easy to create buckets based on one column using query(). Note that you can also use pd.cut or pd.qcut for this.
bucket1 = df.query('Column_1 >= 0 and Column_1 < 0.25')
bucket2 = df.query('Column_1 >= 0.25 and Column_1 < 0.5')
bucket3 = df.query('Column_1 >= 0.5 and Column_1 < 0.75')
bucket4 = df.query('Column_1 >= 0.75 and Column_1 <= 1')
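As noted, pd.cut (fixed-width bins) or pd.qcut (quantile bins) can label every row in one call; a sketch with a hypothetical Column_1. Note that pd.cut's bins are right-closed by default, which is the opposite edge convention from the query() version above:

```python
import pandas as pd

# hypothetical data in [0, 1]
df = pd.DataFrame({'Column_1': [0.1, 0.3, 0.6, 0.9, 0.5]})

# four fixed-width buckets; include_lowest keeps 0 itself in the first bin
df['bucket'] = pd.cut(df['Column_1'], bins=[0, 0.25, 0.5, 0.75, 1.0],
                      labels=['b1', 'b2', 'b3', 'b4'], include_lowest=True)
print(df)
```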
Load Multiple CSV Files in PySpark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("Data Exploration") \
    .getOrCreate()

# load data as a Spark DataFrame; DROPMALFORMED skips rows that fail to parse
data2 = spark.read.format("csv") \
    .option("header", "true") \
    .option("mode", "DROPMALFORMED") \
    .load('/home/Desktop//input/*.csv')
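Outside Spark, the same glob-many-CSVs pattern can be sketched with pandas (the temp directory and file names here are just for illustration):

```python
import glob
import os
import tempfile

import pandas as pd

# create two small CSV files to stand in for the input directory
tmpdir = tempfile.mkdtemp()
pd.DataFrame({'a': [1, 2]}).to_csv(os.path.join(tmpdir, 'part1.csv'), index=False)
pd.DataFrame({'a': [3]}).to_csv(os.path.join(tmpdir, 'part2.csv'), index=False)

# read every *.csv in the directory and stack them into one DataFrame
paths = sorted(glob.glob(os.path.join(tmpdir, '*.csv')))
data = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)
print(len(data))  # → 3
```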
Convert Many Columns to Float in PySpark
from pyspark.sql.functions import col

for col_name in data7.columns:
    data7 = data7.withColumn(col_name, col(col_name).cast('float'))
data7.printSchema()
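A pandas analogue of the same loop, casting every column to float (data7 here is a hypothetical frame of numeric strings):

```python
import pandas as pd

# hypothetical frame where all columns arrive as strings
data7 = pd.DataFrame({'x': ['1', '2'], 'y': ['3.5', '4.5']})

# cast each column to float, one at a time, mirroring the PySpark loop
for col_name in data7.columns:
    data7[col_name] = data7[col_name].astype('float')
print(data7.dtypes)
```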
How to run PySpark through Jupyter notebook via Docker
docker run -it -p 8888:8888 jupyter/pyspark-notebook
Install Docker first, then open the Jupyter URL printed in the container logs.
Source: https://levelup.gitconnected.com/using-docker-and-pyspark-134cd4cab867