Module overview

What you will learn

This module gives an intuitive introduction to two fundamental concepts in machine learning: overfitting and underfitting.

Machine learning models can never make perfect predictions: the test error is never exactly zero. This failure comes from a fundamental trade-off between modeling flexibility and the limited size of the training dataset.

The first presentation will define those problems and characterize how and why they arise.

Then we will present a methodology to quantify those problems by contrasting the train error with the test error for various choices of model family and model parameters. More importantly, we will emphasize the impact of the size of the training set on this trade-off.
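As an illustration of this methodology, here is a minimal, dependency-free sketch. Everything in it is an assumption made for the example and not part of the course material: the synthetic noisy sine data, the hand-written k-nearest-neighbors regressor, and the helper names `make_data`, `knn_predict` and `mse`. The number of neighbors `k` plays the role of the model parameter controlling flexibility: a small `k` gives a very flexible model, a large `k` a very rigid one.

```python
import math
import random


def make_data(n, rng):
    """Draw n noisy samples of sin(x) on [0, 6] (synthetic data for illustration)."""
    xs = [rng.uniform(0.0, 6.0) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys


def knn_predict(x, train_x, train_y, k):
    """Predict by averaging the targets of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k


def mse(xs, ys, train_x, train_y, k):
    """Mean squared error of the k-NN model on the points (xs, ys)."""
    errors = [(knn_predict(x, train_x, train_y, k) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


rng = random.Random(0)
train_x, train_y = make_data(100, rng)
test_x, test_y = make_data(200, rng)

# 1) Contrast train and test error while varying the model flexibility.
results = {}
for k in (1, 10, 100):
    train_err = mse(train_x, train_y, train_x, train_y, k)
    test_err = mse(test_x, test_y, train_x, train_y, k)
    results[k] = (train_err, test_err)
    print(f"k={k:3d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")

# 2) Keep the model fixed (k=10) and vary the size of the training set.
size_results = {}
for n in (20, 100, 500):
    sub_x, sub_y = make_data(n, rng)
    size_results[n] = mse(test_x, test_y, sub_x, sub_y, 10)
    print(f"n={n:3d}  test MSE={size_results[n]:.3f}")
```

With `k=1` the model memorizes the training set, so its train error is zero while its test error is clearly larger, which is the signature of overfitting. With `k=100` the model always predicts the mean of the training targets, so both errors are high, which is the signature of underfitting. The second loop shows how the same model improves on the test set as more training samples become available.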

Finally, we will relate overfitting and underfitting to the concepts of statistical variance and bias.
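For reference, the standard bias-variance decomposition of the expected squared error can be written as follows. The notation is an assumption introduced here for illustration: the data are generated as $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$, $\hat{f}$ is the model fitted on a random training set, and the expectation is taken over training sets and noise.

```latex
\mathbb{E}\!\left[\big(y - \hat{f}(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Roughly speaking, underfitting corresponds to a large bias term and overfitting to a large variance term, while the noise term cannot be removed by any model.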

Before getting started

The technical skills required to complete this module are:

  • the skills acquired in “The Predictive Modeling Pipeline” module, including basic usage of scikit-learn.

Objectives and time schedule

The objectives of this module are the following:

  • understand the concepts of overfitting and underfitting;

  • understand the concept of generalization;

  • understand the general cross-validation framework used to evaluate a model.

The estimated time to go through this module is about 3 hours.