A project is more than just a Kaggle dataset
A few criteria define a good data science project:
- Learnability – What did you learn from the project?
- Capability – What capabilities were showcased in the project?
- Difficulty – How difficult or easy was the project?
- Potential hireability – How likely are you to be hired based on that project?
- Creativity – What creative approaches did you bring to the solution?
A few datasets I like purely for teaching purposes: iris, Boston housing, mtcars, Titanic, German Credit, and MNIST handwritten digits.
Saving Dataframe as a table
- ModelData2 = ModelData.toPandas() # converts a Spark DataFrame to a pandas DataFrame
- table_model = spark.createDataFrame(ModelData2) # converts back to a Spark DataFrame
- table_model.write.saveAsTable('LIBRARYPATH.model_data') # saves it as a table
And, selecting a subset of columns:
new_df = transformed_chrn2[['Var1', 'Var2', 'Var3', 'Var4', 'Var5']]
table_df = spark.createDataFrame(new_df)
table_df.write.saveAsTable('directory_name.table_name')
Sources:
https://stackoverflow.com/questions/30664008/how-to-save-dataframe-directly-to-hive
https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-connect-to-sql-database
https://docs.microsoft.com/en-us/azure/databricks/getting-started/spark/dataframes
Bar chart in Python
In Seaborn, a bar chart of category counts can be created with the sns.countplot function, passing it the data.
https://towardsdatascience.com/introduction-to-data-visualization-in-python-89a54c97fbed
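A minimal sketch of countplot, using a made-up categorical column (the column name and values are illustrative):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical data: a small categorical column
df = pd.DataFrame({"species": ["setosa", "setosa", "versicolor", "virginica"]})

# countplot draws one bar per category; bar height equals the row count
ax = sns.countplot(x="species", data=df)
plt.savefig("species_counts.png")
```

Since there are three distinct categories, the axes object ends up with three bars.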
Split and Substring in Hive QL
Suppose you have a variable such as AccountID:
split(trim(AccountID), '-')[0]
trim removes leading and trailing spaces,
split breaks the string into parts on the '-' delimiter,
and [0] returns the first part ([1] gives the second part, etc.).
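The same trim-then-split behaviour can be sketched in plain Python (the AccountID value here is hypothetical):

```python
# Hypothetical AccountID with surrounding spaces, mirroring the Hive example
account_id = "  ACC-12345-US  "

# trim -> strip(), split on '-' -> split('-'), [0]/[1] -> first/second part
parts = account_id.strip().split("-")
first_part = parts[0]    # "ACC"
second_part = parts[1]   # "12345"
```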
Creating Buckets in Pandas using Query
It is quite easy to create buckets from one column using query(). Note you can also use pd.qcut.
bucket1 = df.query('Column_1 >= 0 and Column_1 < 0.25')
bucket2 = df.query('Column_1 >= 0.25 and Column_1 < 0.5')
bucket3 = df.query('Column_1 >= 0.5 and Column_1 < 0.75')
bucket4 = df.query('Column_1 >= 0.75 and Column_1 <= 1')
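A minimal sketch of both approaches on made-up data (the column name and values are illustrative), showing that pd.cut can label every row with its bucket in one step instead of four query() calls:

```python
import pandas as pd

# Illustrative data: a score column in [0, 1]
df = pd.DataFrame({"Column_1": [0.1, 0.2, 0.3, 0.6, 0.8, 0.95]})

# Bucketing with query(), as above
bucket1 = df.query("Column_1 >= 0 and Column_1 < 0.25")
bucket4 = df.query("Column_1 >= 0.75 and Column_1 <= 1")

# pd.cut assigns each row a bucket label based on the same bin edges
df["bucket"] = pd.cut(
    df["Column_1"],
    bins=[0, 0.25, 0.5, 0.75, 1],
    labels=["b1", "b2", "b3", "b4"],
    include_lowest=True,
)
```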
