In this one-day intermediate-level course, attendees will learn how to clean and preprocess data.
Much of the available literature focuses on modelling, the Data Science phase that attempts to build representations of data. But in industrial settings the reality is that a majority of a project’s time is spent dealing with data issues. In fact, some of the largest improvements in model performance can come from improving the data, not improving the model.
The focus of this training is to provide you with the techniques required to make the best of your data. We cover topics such as how to find and deal with missing data, clean corrupted data, find anomalies and convert features into a more suitable format. We will also look at how to transform and “engineer” new features to boost model performance. Finally, we will look at methods for choosing the best features to use for your specific problem.
The training is accompanied by a comprehensive set of examples and challenges using realistic but simple data.
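To give a flavour of the topics listed above, here is a minimal sketch in plain Python. It is not taken from the course material: the toy `rows` data, the function names and the 3.5 threshold are all assumptions made for illustration. It fills a missing value with the median, flags an anomaly using a median-based (modified) z-score, and converts a categorical feature into one-hot columns.

```python
from statistics import median

# Hypothetical toy records (an assumption for illustration); None marks a missing reading.
rows = [
    {"sensor": "a", "temp": 20.5},
    {"sensor": "b", "temp": None},
    {"sensor": "a", "temp": 21.0},
    {"sensor": "c", "temp": 19.5},
    {"sensor": "b", "temp": 250.0},  # suspicious spike
    {"sensor": "a", "temp": 20.0},
]


def impute_median(values):
    """Replace None with the median of the observed values."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def flag_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (based on median and MAD) exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    if mad == 0:
        return [False] * len(values)  # no spread, so nothing can be flagged
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


def one_hot(labels):
    """Encode each label as a 0/1 vector, one column per sorted category."""
    categories = sorted(set(labels))
    return [[int(label == c) for c in categories] for label in labels]


temps = impute_median([r["temp"] for r in rows])
flags = flag_outliers(temps)
encoded = one_hot([r["sensor"] for r in rows])

print(temps)
print(flags)
print(encoded)
```

Imputing before looking for anomalies is only one possible ordering; with real data the choice of fill value, the outlier rule and the threshold all depend on the problem at hand.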
This course is aimed at…
- Engineers who work with data (e.g. for tasks such as Monitoring, Data Warehousing, Databases)
- Experienced Data Scientists who are looking to extract more performance or robustness out of their models
- Budding Data Scientists who want to learn how and why we need to clean data
Attendees should have…
- Beginner-level knowledge of Data Science (e.g. terminology and an awareness of models and modelling issues; no expertise is expected)
- Familiarity with Python (the practical exercises are in Python)
- An intuitive understanding of Basic Statistics (e.g. probability distributions, simple summary statistics like the mean and standard deviation)
Attendees will leave knowing…
- How data quality impacts project performance
- The multitude of ways data can become corrupt
- How to deal with corruptions of different types
- Why derived data can be better than the original
Attendees will be able to…
- Decide when and how to clean data
- Spot different types of corruption
- Transform the data to produce better representations of the original
- Clean all types of data: categorical, continuous, time series, etc.
People often assume (due to over-popularisation on the internet) that most of a Data Scientist’s job is modelling. It’s not; it’s working with data. A 2017 report by CrowdFlower found that 60% of a Data Scientist’s time is spent collecting, cleaning or mining data.
People spend an inordinate amount of time playing with models, when most of the performance improvements come from working with the data.