Module overview

What you will learn

This module presents an example of a typical predictive modeling pipeline built on tabular data, that is, data that can be structured in a 2-dimensional table. We present the pipeline progressively. First, we analyze the dataset. Next, we train a first predictive pipeline on a subset of the dataset. Then, we look closely at the two types of data, numerical and categorical, that the model has to handle. Finally, we extend the pipeline to handle mixed data, i.e. numerical and categorical data together.
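To give a rough idea of where the module is heading, the sketch below shows one way a mixed-type pipeline can be assembled with scikit-learn. The toy table, its column names (`age`, `hours_per_week`, `education`), the target, and the choice of `LogisticRegression` are all assumptions made for illustration; they are not the dataset or the model used in the module.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy table: two numerical columns and one categorical column.
data = pd.DataFrame(
    {
        "age": [25, 38, 52, 46, 29, 61, 33, 57],
        "hours_per_week": [40, 50, 60, 45, 20, 38, 42, 55],
        "education": ["HS", "BSc", "MSc", "BSc", "HS", "MSc", "HS", "BSc"],
    }
)
# Hypothetical binary target, one value per row.
target = pd.Series([0, 1, 1, 1, 0, 1, 0, 1], name="high_income")

numerical_columns = ["age", "hours_per_week"]
categorical_columns = ["education"]

# Apply a different preprocessing step to each group of columns.
preprocessor = ColumnTransformer(
    [
        ("numerical", StandardScaler(), numerical_columns),
        (
            "categorical",
            OneHotEncoder(handle_unknown="ignore"),
            categorical_columns,
        ),
    ]
)

# Chain the preprocessing and the classifier into a single estimator.
model = make_pipeline(preprocessor, LogisticRegression())
model.fit(data, target)
predictions = model.predict(data)
```

Wrapping the preprocessing and the classifier in a single pipeline means that one call to `fit` learns the scaling, the encoding, and the model together, and `predict` applies the same transformations to new rows.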

Before getting started

The technical skills required to follow this module are:

  • basic knowledge of Python programming

  • some prior experience with the NumPy, pandas and Matplotlib libraries is recommended but not required

For a quick introduction to these requirements, you can use the following resources:

Objectives and time schedule

The objectives of this module are the following:

  • build intuitions regarding an unknown dataset;

  • identify and differentiate numerical and categorical features;

  • create an advanced predictive pipeline with scikit-learn.
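As a small illustration of the first two objectives, the sketch below inspects a table with pandas and splits its columns by data type. The toy table and its column names are hypothetical, and splitting on the dtype is only a first heuristic, since an integer-coded column can still represent categories.

```python
import pandas as pd

# Hypothetical toy table, used only to illustrate the inspection steps.
data = pd.DataFrame(
    {
        "age": [25, 38, 52, 46],
        "hours_per_week": [40, 50, 60, 45],
        "education": ["HS", "BSc", "MSc", "BSc"],
    }
)

# A first look at the table: column types and summary statistics.
print(data.dtypes)
print(data.describe())

# Split the columns by dtype: numeric columns versus everything else.
numerical_columns = data.select_dtypes(include="number").columns.tolist()
categorical_columns = data.select_dtypes(exclude="number").columns.tolist()

# Count how often each category occurs in every non-numeric column.
for column in categorical_columns:
    print(data[column].value_counts())
```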

The estimated time to go through this module is about 6 hours.