A better Iris Dataset for the current era

The Iris dataset is the curse of teaching data science. It makes applying algorithms very simple. What is needed is a bigger dataset with missing values and many variables/features, one that teaches the whole cycle of data science: not just plain machine learning but also data pre-processing, dimensionality, and standardization. Ideally the dataset should be bigger than memory (RAM) to teach efficiency as well. #datascience #machinelearning #algorithms

https://www.linkedin.com/feed/update/urn:li:activity:6423093728885465088
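To make the pre-processing point concrete, here is a minimal sketch of two steps that Iris never forces on a student: filling in missing values and standardizing features. It is an illustration only, not anything from the original post. The function name `impute_and_standardize`, the tiny `rows` table, and the choice of mean imputation with population z-scores are all assumptions made for the example.

```python
import math

def impute_and_standardize(rows):
    """Mean-impute missing values (None), then z-score each column.

    Hypothetical helper for illustration; real projects would usually
    reach for a library and fit these statistics on training data only.
    """
    n_cols = len(rows[0])
    out = [list(r) for r in rows]
    for j in range(n_cols):
        # Column mean over the observed (non-missing) values.
        observed = [r[j] for r in rows if r[j] is not None]
        mean = sum(observed) / len(observed)
        # Replace missing entries with that mean.
        filled = [mean if r[j] is None else r[j] for r in rows]
        # Population standard deviation; fall back to 1.0 for constant columns.
        var = sum((x - mean) ** 2 for x in filled) / len(filled)
        std = math.sqrt(var) or 1.0
        for i, x in enumerate(filled):
            out[i][j] = (x - mean) / std
    return out

# Made-up toy data: two features, one missing value in each.
rows = [[1.0, 10.0], [None, 20.0], [3.0, None]]
scaled = impute_and_standardize(rows)
```

After this step every column has mean 0 and unit variance, so features measured on very different scales become comparable.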

Varun Mahanot

That’s true, Ajay Ohri. I think this dataset is much more complicated and sophisticated, and it has all of the things you mentioned. Unfortunately, I have been struggling with the analysis of this dataset (accuracy about 55% with an ANN): https://he-s3.s3.amazonaws.com/media/hackathon/machine-learning-challenge-6-1/predict-the-energy-used-612632a9-3f496e7f/a490e594-6-Dataset.zip From this challenge: https://www.hackerearth.com/challenge/competitive/machine-learning-challenge-6-1/machine-learning/predict-the-energy-used-612632a9-3f496e7f/

Manish Gupta

Such datasets leave a serious skill gap in the practitioner: how to create your modeling data from scratch in the first place. What would a data scientist do when all they face, in a practical scenario, is data residing in or flowing through a production system for the business, spread across multiple platforms and warehouses, so much so that data threading itself requires a good understanding of business processes? Modeling here stands at the far end of the tunnel. And no online tutorial can teach this to you; you have to get into a real data environment. It’s a far bigger issue than data munging.

Saurabh Dwivedy

Several solutions to this problem exist. You may refer to this (admittedly introductory-level) link for some inspiration – https://machinelearningmastery.com/large-data-files-machine-learning/

You could also refer to this link for a specific use case related to R: https://www.researchgate.net/post/How_to_use_R_language_for_larger_datasets_of_size_more_than_a_machine_RAM_size Since R works in RAM, this is more relevant.
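One common way to handle a file that does not fit in RAM is to stream it row by row and keep only running totals. The sketch below is not taken from either link; it assumes a CSV with a header row and uses Welford's one-pass algorithm to get the mean and standard deviation of a single column while skipping missing entries. The function name `streaming_mean_std`, the column name `energy`, and the in-memory demo data are hypothetical.

```python
import csv
import io
import math

def streaming_mean_std(file_obj, column):
    """One pass over a CSV, holding only three numbers in memory.

    Uses Welford's online algorithm; empty cells are treated as missing.
    """
    count, mean, m2 = 0, 0.0, 0.0
    for row in csv.DictReader(file_obj):  # reads one row at a time
        raw = (row.get(column) or "").strip()
        if raw == "":
            continue  # skip missing values
        x = float(raw)
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    # Population standard deviation of the observed values.
    std = math.sqrt(m2 / count) if count else float("nan")
    return count, mean, std

# Made-up stand-in for a large file; with real data, pass open(path, newline="").
demo = io.StringIO("energy,temp\n10,1\n,2\n20,3\n30,\n")
count, mean, std = streaming_mean_std(demo, "energy")
```

The same statistics can then be used in a second streaming pass to standardize the column, so the full dataset never has to be loaded at once.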