In this one-day intermediate-level course, attendees will learn how to clean and preprocess data.
Much of the available literature focuses on modelling, the Data Science phase that attempts to build representations of data. In industrial settings, however, the majority of a project’s time is spent dealing with data issues. In fact, some of the largest improvements in model performance come from improving the data, not the model.
This training gives you the techniques you need to make the best of your data. We cover how to find and deal with missing data, clean corrupted data, detect anomalies and convert features into a more suitable format. We also look at how to transform and “engineer” new features to boost performance. Finally, we look at methods for choosing the best features for your specific problem.
The training is accompanied by a comprehensive set of examples and challenges using realistic but simple data.
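To give a flavour of the kind of exercise this involves, here is a minimal sketch of two of the topics above: finding and filling missing values, then flagging anomalies. It uses only the Python standard library. The `ages` list is made-up illustrative data, and both the median imputation and the 1.5 × IQR rule are just one common choice each, not necessarily the methods taught in the course.

```python
import statistics

# Hypothetical toy data: one missing value (None) and one implausible entry.
ages = [34, 29, None, 41, 38, 250, 31]

# 1. Find missing values by position.
missing_idx = [i for i, v in enumerate(ages) if v is None]

# 2. Fill gaps with the median of the observed values
#    (the median is less sensitive to the outlier than the mean).
observed = [v for v in ages if v is not None]
median_age = statistics.median(observed)
filled = [median_age if v is None else v for v in ages]

# 3. Flag anomalies with the 1.5 * IQR rule (an assumed threshold).
q1, _, q3 = statistics.quantiles(filled, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
anomalies = [v for v in filled if v < low or v > high]

print("missing at:", missing_idx)
print("imputed with:", median_age)
print("anomalies:", anomalies)
```

Whether a flagged value should be corrected, removed or kept depends on the problem, so a check like this is usually a starting point for inspection rather than an automatic fix.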
Attendees will leave knowing…
Attendees will be able to…
People often assume (due to over-popularisation on the internet) that most of a Data Scientist’s job is modelling. It’s not; it’s working with data. A 2017 report by CrowdFlower found that 60% of a Data Scientist’s time is spent collecting, cleaning or mining for data.
People spend an inordinate amount of time playing with models, when many of the largest performance improvements come from working with the data.