A project is more than just a Kaggle dataset
A few criteria define a good data science project:
- Learnability – What did you learn from the project?
- Capability – What capabilities were showcased in the project?
- Difficulty – How difficult or easy was the project?
- Potential hireability – How likely are you to be hired based on that project?
- Creativity – What creative approaches did you bring to the solution?
A few datasets I like purely for teaching purposes: iris, Boston housing, mtcars, Titanic, German Credit, and MNIST handwritten digits.
Saving Dataframe as a table
- ModelData2 = ModelData.toPandas() # converts a Spark DataFrame to a pandas DataFrame
- table_model = spark.createDataFrame(ModelData2) # converts back to a Spark DataFrame
- table_model.write.saveAsTable('LIBRARYPATH.model_data') # saves it as a table
And, selecting a subset of columns:
new_df = transformed_chrn2[['Var1', 'Var2', 'Var3', 'Var4', 'Var5']]
table_df = spark.createDataFrame(new_df)
table_df.write.saveAsTable('directory_name.table_name')
Sources:
https://stackoverflow.com/questions/30664008/how-to-save-dataframe-directly-to-hive
https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-connect-to-sql-database
https://docs.microsoft.com/en-us/azure/databricks/getting-started/spark/dataframes
Bar chart in Python
In Seaborn, a bar chart of category counts can be created with the sns.countplot function, passing it the data.
https://towardsdatascience.com/introduction-to-data-visualization-in-python-89a54c97fbed
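A minimal sketch of countplot, using a made-up categorical column (the column name and values are illustrative):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical data: a small categorical column
df = pd.DataFrame({"species": ["setosa", "setosa", "versicolor", "virginica"]})

# countplot draws one bar per category; bar height equals the row count
ax = sns.countplot(x="species", data=df)
plt.savefig("species_counts.png")
```

Since there are three distinct categories, the axes object ends up with three bars.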
Split and Substring in Hive QL
Suppose you have a variable such as AccountID:
split(trim(AccountID), '-')[0]
trim removes leading and trailing spaces,
split breaks the string into parts on the '-' delimiter,
and [0] returns the first part ([1] gives the second part, etc.).
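The same trim-then-split behaviour can be sketched in plain Python (the AccountID value here is hypothetical):

```python
# Hypothetical AccountID with surrounding spaces, mirroring the Hive example
account_id = "  ACC-12345-US  "

# trim -> strip(), split on '-' -> split('-'), [0]/[1] -> first/second part
parts = account_id.strip().split("-")
first_part = parts[0]    # "ACC"
second_part = parts[1]   # "12345"
```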
Creating Buckets in Pandas using Query
It is quite easy to create buckets from one column using query(). Note you can also use pd.qcut.
bucket1 = df.query('Column_1 >= 0 and Column_1 < 0.25')
bucket2 = df.query('Column_1 >= 0.25 and Column_1 < 0.5')
bucket3 = df.query('Column_1 >= 0.5 and Column_1 < 0.75')
bucket4 = df.query('Column_1 >= 0.75 and Column_1 <= 1')
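A minimal sketch of both approaches on made-up data (the column name and values are illustrative), showing that pd.cut can label every row with its bucket in one step instead of four query() calls:

```python
import pandas as pd

# Illustrative data: a score column in [0, 1]
df = pd.DataFrame({"Column_1": [0.1, 0.2, 0.3, 0.6, 0.8, 0.95]})

# Bucketing with query(), as above
bucket1 = df.query("Column_1 >= 0 and Column_1 < 0.25")
bucket4 = df.query("Column_1 >= 0.75 and Column_1 <= 1")

# pd.cut assigns each row a bucket label based on the same bin edges
df["bucket"] = pd.cut(
    df["Column_1"],
    bins=[0, 0.25, 0.5, 0.75, 1],
    labels=["b1", "b2", "b3", "b4"],
    include_lowest=True,
)
```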
