A project is more than just a Kaggle dataset

A few criteria that define a good data science project:

  • Learnability: What did you learn in the project?
  • Capability: What capabilities were showcased in the project?
  • Difficulty: How difficult or easy was the project?
  • Potential Hireability: How likely are you to be hired based on that project?
  • Ability: What creative approaches did you bring to the solution?


A few datasets I like purely for teaching purposes: iris, Boston, mtcars, Titanic, German Credit and MNIST handwriting.

Saving a DataFrame as a table

  • ModelData2 = ModelData.toPandas()  # converts the Spark DataFrame to a pandas DataFrame
  • table_model = spark.createDataFrame(ModelData2)  # creates a Spark DataFrame from the pandas DataFrame
  • table_model.write.saveAsTable('LIBRARYPATH.model_data')  # saves the Spark DataFrame as a table


new_df = transformed_chrn2[['Var1', 'Var2', 'Var3', 'Var4', 'Var5']]

table_df = spark.createDataFrame(new_df)
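
The select, convert and save steps can be wrapped in one small helper. This is only a sketch: the function name save_columns_as_table and its parameters are made up for illustration, and it assumes that spark is an existing SparkSession and that you have write access to the target library. The mode argument is passed to DataFrameWriter.mode so that the behaviour when the table already exists is explicit.

```python
import pandas as pd


def save_columns_as_table(spark, pandas_df, columns, table_name, mode="errorifexists"):
    """Keep only `columns` of a pandas DataFrame and save them as a Spark table.

    `spark` is assumed to be an existing SparkSession (hypothetical helper).
    `mode` can be "errorifexists", "overwrite", "append" or "ignore".
    """
    # Select the columns of interest in pandas first
    subset = pandas_df[columns]
    # Convert the pandas DataFrame to a Spark DataFrame
    spark_df = spark.createDataFrame(subset)
    # Save as a table; `mode` controls what happens if the table already exists
    spark_df.write.mode(mode).saveAsTable(table_name)
    return spark_df
```

A call could then look like save_columns_as_table(spark, transformed_chrn2, ['Var1', 'Var2'], 'LIBRARYPATH.model_data'), adding mode='overwrite' if the table should be replaced.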

Creating Buckets in Pandas using Query

It is quite easy to create buckets based on one column using query. Note that you can also use qcut, which buckets by quantiles rather than by fixed cut points.


bucket1 = df.query('Column_1 >= 0 and Column_1 < 0.25')

bucket2 = df.query('Column_1 >= 0.25 and Column_1 < 0.5')

bucket3 = df.query('Column_1 >= 0.5 and Column_1 < 0.75')

bucket4 = df.query('Column_1 >= 0.75 and Column_1 <= 1')
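
If you would rather label every row in one pass than run four queries, pd.cut with the same cut points is one alternative. The sketch below uses a made-up sample DataFrame and made-up label names. Because right=False makes every interval closed on the left and open on the right, the last edge is nudged just above 1 so that a value of exactly 1 still lands in the fourth bucket, as it does with query. The qcut line is included only to show the quantile-based variant.

```python
import numpy as np
import pandas as pd

# Made-up sample data for illustration
df = pd.DataFrame({"Column_1": [0.0, 0.1, 0.25, 0.4, 0.5, 0.74, 0.75, 1.0]})

# Same boundaries as the four query() calls: [0, 0.25), [0.25, 0.5), [0.5, 0.75), [0.75, 1]
# The last edge sits one float step above 1 so that 1.0 itself is included.
edges = [0, 0.25, 0.5, 0.75, np.nextafter(1.0, 2.0)]
labels = ["bucket1", "bucket2", "bucket3", "bucket4"]
df["bucket"] = pd.cut(df["Column_1"], bins=edges, labels=labels, right=False)

# Quantile-based alternative: four buckets with roughly equal row counts
df["quartile"] = pd.qcut(df["Column_1"], q=4, labels=["q1", "q2", "q3", "q4"])

# One DataFrame per fixed-width bucket, keyed by label
buckets = {name: group for name, group in df.groupby("bucket", observed=True)}
```

Here buckets["bucket2"] holds the same rows as the bucket2 query above. Values outside the 0 to 1 range get a missing bucket label instead of being silently dropped.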