A project is more than just a Kaggle dataset

A few criteria that define a good data science project:

  • Learnability: What did you learn in the project?
  • Capability: What capabilities did the project showcase?
  • Difficulty: How difficult or easy was the project?
  • Potential Hireability: How likely are you to be hired based on that project?
  • Ability: What creative approaches did you bring to the solution?

 

A few datasets I liked purely for teaching purposes: iris, Boston, mtcars, Titanic, German Credit, and MNIST handwriting.

Saving a DataFrame as a table

  • ModelData2 = ModelData.toPandas()  # converts Spark DF to pandas DF
  • table_model = spark.createDataFrame(ModelData2)  # creates Spark DF
  • table_model.write.saveAsTable('LIBRARYPATH.model_data')  # saves as table

And, to save only selected columns of a pandas DataFrame:

new_df = transformed_chrn2[['Var1', 'Var2', 'Var3', 'Var4', 'Var5']]

table_df = spark.createDataFrame(new_df)

table_df.write.saveAsTable('directory_name.table_name')
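The two snippets above follow the same pattern, so they can be wrapped in one small helper. This is only a sketch under some assumptions: the function name save_columns_as_table is hypothetical, spark is assumed to be an existing SparkSession (for example the one a notebook provides), and the mode argument is passed to the writer so that "append" adds rows to an existing table instead of failing. If the data is already a Spark DataFrame, the toPandas round trip should not be needed, since the Spark DataFrame can be written with .write.saveAsTable directly.

```python
import pandas as pd


def save_columns_as_table(spark, pandas_df, columns, table_name, mode="overwrite"):
    """Write selected columns of a pandas DataFrame to a Spark table.

    Hypothetical helper for illustration. `spark` is an existing
    SparkSession; `mode` is handed to DataFrameWriter.mode and is
    typically "overwrite", "append", "ignore", or "error".
    """
    subset = pandas_df[list(columns)]         # keep only the requested columns
    spark_df = spark.createDataFrame(subset)  # pandas DF -> Spark DF
    spark_df.write.mode(mode).saveAsTable(table_name)
    return spark_df
```

A call would look like save_columns_as_table(spark, transformed_chrn2, ['Var1', 'Var2'], 'directory_name.table_name', mode='append').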

SOURCES

https://stackoverflow.com/questions/30664008/how-to-save-dataframe-directly-to-hive

https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-connect-to-sql-database

https://docs.microsoft.com/en-us/azure/databricks/getting-started/spark/dataframes

https://stackoverflow.com/questions/53212396/pyspark-saveastable-how-to-insert-new-data-to-existing-table

Creating Buckets in Pandas using Query

It is quite easy to create buckets based on one column using query. Note that you can also use cut (for fixed bin edges) or qcut (for quantile-based bins).

 

bucket1 = df.query('Column_1 >= 0 and Column_1 < 0.25')

bucket2 = df.query('Column_1 >= 0.25 and Column_1 < 0.5')

bucket3 = df.query('Column_1 >= 0.5 and Column_1 < 0.75')

bucket4 = df.query('Column_1 >= 0.75 and Column_1 <= 1')
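For comparison, here is a minimal sketch of the same four buckets built with cut, plus the qcut alternative. The sample values and the column names bucket and quartile are made up for illustration. One detail to watch: with right=False the bins are closed on the left, so the last bin would exclude exactly 1.0; the sketch nudges the top edge to the next float above 1.0 so it matches the "<= 1" filter in bucket4.

```python
import numpy as np
import pandas as pd

# Illustrative data only; Column_1 is assumed to lie in [0, 1].
df = pd.DataFrame({"Column_1": [0.0, 0.1, 0.25, 0.49, 0.5, 0.74, 0.75, 1.0]})

# Left-closed bins [0, 0.25), [0.25, 0.5), [0.5, 0.75), [0.75, 1].
# The top edge is the next float above 1.0 so that 1.0 itself is kept.
edges = [0.0, 0.25, 0.5, 0.75, np.nextafter(1.0, 2.0)]
labels = ["bucket1", "bucket2", "bucket3", "bucket4"]
df["bucket"] = pd.cut(df["Column_1"], bins=edges, labels=labels, right=False)

# Quantile-based alternative: four bins with roughly equal row counts,
# returned as integer codes 0..3 instead of interval labels.
df["quartile"] = pd.qcut(df["Column_1"], q=4, labels=False)

# One sub-DataFrame per bucket, comparable to bucket1..bucket4 above.
buckets = {name: group for name, group in df.groupby("bucket", observed=True)}
```

The query version is easier to read for a handful of buckets; cut keeps the bucket as a column on the original DataFrame, which tends to be more convenient for groupby and aggregation.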