This glossary describes the main terms used in this course. For terms that you don't find here, we list other useful glossaries at the bottom of this page.

Main terms used in this course#


API#

Acronym that stands for “Application Programming Interface”. It can have a slightly different meaning in different contexts: in some cases it designates an online service that can be accessed by remote programs. In the context of an online service, the term “API” can designate both the service itself and the technical specification of the programming interface used by people who write client applications that connect to this service.

In the context of an offline library such as scikit-learn, it means the list of all (public) functions, classes and methods in the library, along with their documentation via docstrings. It can be browsed online in the scikit-learn API reference.

In scikit-learn we try to adopt simple conventions and keep the number of methods an object must implement to a minimum. Furthermore, scikit-learn tries to use consistent method names for different estimators of the same category: e.g. all transformers expose fit, fit_transform and transform methods and generally accept arguments of similar types and shapes for those methods.
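As a minimal sketch of this convention, assuming `StandardScaler` as the transformer and a small made-up array, calling `fit` followed by `transform` gives the same result as calling `fit_transform`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): 4 samples, 2 features.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])

# Two-step version: learn the mean and scale, then apply them.
scaler = StandardScaler()
scaler.fit(X)
X_two_steps = scaler.transform(X)

# One-step version exposed by all transformers.
X_one_step = StandardScaler().fit_transform(X)

same_result = np.allclose(X_two_steps, X_one_step)
print(same_result)
```

Other transformers can be swapped in without changing the calling code, which is the point of the shared method names.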


classification#

Type of problem where the goal is to predict a target that can take a finite set of values.

Examples of classification problems are:

  • predicting the type of Iris (setosa, versicolor, virginica) from their petal and sepal measurements

  • predicting whether a patient has a particular disease from the results of their medical tests

  • predicting whether an email is spam or not from the email content, sender, title, etc.

When the predicted label can have two values, it is called binary classification. This is the case for the medical and spam use cases above.

When the predicted label can have at least three values, it is called multi-class classification. This is the case for the Iris use case above.

Below, we illustrate an example of binary classification.


The data provided by the user contains 2 features, represented by the x- and y-axis. This is a binary classification problem because the target contains only 2 labels, here encoded by colors with blue and orange data points. Each data point represents a sample, and the entire set was used to train a linear model. The decision rule learned is the black dotted line. This decision rule is used to predict the label of a new sample according to its position with respect to the line: a sample lying on the left of the line will be predicted as a blue sample, while a sample lying on the right of the line will be predicted as an orange sample. Here, we have a linear classifier because the decision rule is defined as a line (in higher dimensions this would be a hyperplane). However, the shape of the decision rule depends on the model used.


classifier#

A model used for classification. These models handle targets that contain discrete values such as 0/1 or cat/dog. For example, in scikit-learn, LogisticRegression and HistGradientBoostingClassifier are classification model classes.

Note: for historical reasons the LogisticRegression name is confusing. LogisticRegression is not a regression model but a classification model, contrary to what the name suggests.
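The short sketch below illustrates this point. It assumes the Iris dataset bundled with scikit-learn and raises `max_iter` only to avoid a convergence warning; the predictions are discrete class labels, not continuous values:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Iris: 3 classes encoded as the integers 0, 1 and 2.
X, y = load_iris(return_X_y=True)

# max_iter is raised here only to avoid a convergence warning on unscaled data.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

# Despite its name, the model outputs class labels.
predicted_labels = clf.predict(X[:5])
print(predicted_labels)
print(clf.classes_)
```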


cross-validation#

A procedure to estimate how well a model will generalize to new data. The main idea behind this is to train a model on a dataset (called train set) and evaluate its performance on a separate dataset (called test set).

This train/evaluate procedure is repeated several times on different train and test sets to estimate the uncertainty of the model's generalization performance.

See this scikit-learn documentation for more details.
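A minimal sketch of this procedure, assuming the Iris dataset, a `LogisticRegression` model and 5 splits (all three are illustrative choices, not prescribed by the course), could look like this:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cv=5: the model is trained and evaluated on 5 different train/test splits.
scores = cross_val_score(model, X, y, cv=5)

# The mean estimates the generalization performance,
# the standard deviation gives an idea of its variability.
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```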

data matrix, input data#

The data containing only the features and not the target.

The data matrix has n_samples rows and n_features columns. For example for the Iris dataset:

  • the data matrix has a number of rows equal to the number of Iris flowers in the dataset

  • the data matrix has 4 columns (for sepal length, sepal width, petal length, and petal width)

In scikit-learn, the data matrix is commonly named X (following the maths convention that matrices use capital letters and that the input is called x, as in y = f(x)).
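For instance, loading the copy of the Iris dataset bundled with scikit-learn (an illustrative choice of loader) shows the shape of the data matrix:

```python
from sklearn.datasets import load_iris

# X is the data matrix (features only), y is the target.
X, y = load_iris(return_X_y=True)

n_samples, n_features = X.shape
print(n_samples, n_features)  # 150 4
```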

early stopping#

This consists in stopping an iterative optimization method before the convergence of the algorithm, to avoid over-fitting. This is generally done by monitoring the generalization score on a validation set.


estimator#

In scikit-learn jargon: an object that has a fit method. The reason for the name estimator is that once the fit method is called on a model, the parameters are learned (or estimated) from the data.

feature, variable, attribute, descriptor, covariate#

A quantity describing a sample (e.g. color, size, weight). You can see a feature as a quantity measured during the dataset collection.

For example, in the Iris dataset, there are four features: sepal length, sepal width, petal length and petal width.

generalization performance, predictive performance, statistical performance#

The performance of a model on the test data. The test data were never seen by the model during the training procedure.


hyperparameters#

Aspects of model configuration that are not learnt from data. Examples of hyperparameters:

  • for a k-nearest neighbor approach, the number of neighbors to use is a hyperparameter

  • for a polynomial model (say of degree between 1 and 10 for example), the degree of the polynomial is a hyperparameter.

Hyperparameters impact the generalization and computational performance of a model. Indeed, hyperparameters are usually inspected with regard to their impact on model performance and tuned to maximize it (usually the generalization performance). This is called hyperparameter tuning and typically involves grid search or randomized search, with model evaluation on some validation sets.

For more details, you can further read the following post
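As a sketch of hyperparameter tuning for the k-nearest neighbors example above, the snippet below runs a grid search over the number of neighbors. The dataset and the candidate values are assumptions made for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Candidate values for the n_neighbors hyperparameter (illustrative choice).
param_grid = {"n_neighbors": [1, 3, 5, 11]}

# Each candidate is evaluated with 5-fold cross-validation.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
```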

infer, inference#

This term has different meanings in machine learning and in statistical inference.

In machine learning, and more generally in this MOOC, we refer to inference as the process of making predictions by applying a trained model to unlabeled data. In other words, inference is equivalent to predicting the target of unseen data using a fitted model.

In statistical inference, the notion of left-out/unseen data is not tied to the definition. Indeed, inference refers to the process of fitting the parameters of a distribution conditioned on some observed data. You can check the Wikipedia article on statistical inference for more details.

learned parameters#

In scikit-learn, the convention is that the names of learned parameters end with an underscore (_). They are only available after fit has been called.

Examples of such parameters are the slope and intercept of a linear model in one dimension; see this section for more details about such a model.

Note: the word parameter can also be used in its general Python sense, as in passing a parameter to a function or a class.
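To make the trailing-underscore convention concrete, here is a small sketch with `LinearRegression` on made-up, noise-free data generated from y = 2*x - 5. The attributes `coef_` and `intercept_` do not exist before `fit` is called:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up, noise-free data following y = 2*x - 5.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = x.reshape(-1, 1)  # scikit-learn expects a 2d data matrix
y = 2 * x - 5

model = LinearRegression()
has_coef_before_fit = hasattr(model, "coef_")  # learned parameters not set yet

model.fit(X, y)
slope = model.coef_[0]        # learned slope, close to 2
intercept = model.intercept_  # learned intercept, close to -5

print(has_coef_before_fit, slope, intercept)
```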


meta-estimator#

In scikit-learn jargon: an estimator that takes another estimator as a parameter. Examples of meta-estimators include Pipeline and GridSearchCV.
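As an illustrative sketch, the `Pipeline` below wraps a `StandardScaler` and a `LogisticRegression`. The step names and the dataset are arbitrary choices made for the example:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The pipeline is itself an estimator built from other estimators.
pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("classifier", LogisticRegression()),
    ]
)

# Calling fit on the meta-estimator fits each wrapped estimator in turn.
pipe.fit(X, y)
accuracy = pipe.score(X, y)
print(accuracy)
```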


model#

Generic term that refers to something that can learn prediction rules from the data.


overfitting#

Overfitting occurs when your model sticks too closely to the training data, so that it ends up learning the noise in the dataset rather than the relevant patterns. You can tell a model is overfitting when it performs great on your train set, but poorly on your test set (or new real-world data).
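The sketch below shows one way to observe this gap. It assumes a synthetic noisy sine dataset and a decision tree without any depth limit, which is flexible enough to memorize the training samples:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (made up): a sine curve plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# No depth limit: the tree can fit every training point, noise included.
tree = DecisionTreeRegressor(random_state=0)
tree.fit(X_train, y_train)

train_score = tree.score(X_train, y_train)  # R2 on data seen during training
test_score = tree.score(X_test, y_test)     # R2 on held-out data
print(train_score, test_score)
```

A large difference between the two scores is the symptom described above.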


predictor#

An estimator (object with a fit method) with a predict and/or fit_predict method. Note that a classifier or a regressor is a predictor. Examples of predictor classes are KNeighborsClassifier and DecisionTreeRegressor.

predict, prediction#

One of the focuses of machine learning is to learn rules from data that we can then use to make predictions on new samples that were not seen during training.

Example with a linear regression: suppose we fit a linear regression in 1d and learn the linear model y = 2*x - 5. If someone asks what the model predicts for x = 10, we can compute y = 2*10 - 5 = 15.
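The same computation can be written as a tiny function. The name `predict_target` is hypothetical and only serves this illustration; it is not part of scikit-learn:

```python
def predict_target(x, slope=2.0, intercept=-5.0):
    """Apply the learned linear rule y = slope * x + intercept."""
    return slope * x + intercept


# Prediction for a new sample with x = 10.
prediction = predict_target(10)
print(prediction)  # 15.0
```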


regression#

The goal is to predict a target that is continuous (contrary to the discrete target of classification problems). Examples of regression problems are:

  • predicting house prices from their descriptions (number of rooms, surface, location, etc.)

  • predicting the age of patients from their MRI scans

Below, we illustrate an example of regression.


The data provided by the user contains 1 feature called x and we want to predict the continuous target y. Each black data point is a sample used to train a model. The model here is a decision tree and thus the decision rule is defined as a piecewise constant function represented by the orange line. To predict the target for a new sample for a given value of the x-axis, the model will output the corresponding y value lying on the orange line.


regressor#

A regressor is a predictor in a regression setting.

In scikit-learn, DecisionTreeRegressor or Ridge are regressor classes.

regularization, penalization#

In linear models, regularization can be used in order to shrink/constrain the weights/parameters towards zero. This can be useful to avoid overfitting.
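As a hedged sketch of this shrinking effect, the snippet below fits two `Ridge` models on synthetic data with a weak and a strong regularization strength. The `alpha` values are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic regression data, generated only for illustration.
X, y = make_regression(n_samples=50, n_features=10, noise=10.0, random_state=0)

# alpha controls the regularization strength in Ridge.
weak = Ridge(alpha=0.01).fit(X, y)
strong = Ridge(alpha=1000.0).fit(X, y)

# Stronger regularization pulls the weights towards zero.
weak_norm = np.linalg.norm(weak.coef_)
strong_norm = np.linalg.norm(strong.coef_)
print(weak_norm, strong_norm)
```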

sample, instance, observation#

A data point in a dataset.

In the 2d data matrix, a sample is a row.

For example in the Iris dataset, a sample would be the measurements of a single flower.

Note: “instance” is also used in an object-oriented meaning in this course. For example, if we define clf = KNeighborsClassifier(), we say that clf is an instance of the KNeighborsClassifier class.

supervised learning#

We can give a concrete graphical example.


The plot represents a supervised classification example. The data are composed of 2 features since we can plot each data point on a 2-axis plot. The color and shape correspond to the target and we have 2 potential choices: blue circle vs. orange square.

Supervised learning boils down to the fact that we have access to the target. During fitting, we know exactly whether a data point is a blue circle or an orange square.

On the contrary, unsupervised learning only has access to the data points and not the target.

Framing a machine learning problem as a supervised or unsupervised learning problem depends on the data available and the data science problem to be solved.

target, label, annotation#

The quantity we are trying to predict from the features. Targets are available in a supervised learning setting and not in an unsupervised learning setting.

For example, in the Iris dataset, the features might include the petal length and petal width, while the label would be the Iris species.

In scikit-learn convention: y is a variable name commonly used to denote the target. This is because the target can be seen as the output of the model and follows the convention that output is called y as in y = f(x).

Target is usually used in regression settings while label is usually used in classification settings.

test set#

The dataset used to make predictions with a model after it is trained, and ultimately to evaluate its generalization performance.

train, learn, fit#

Find ideal model parameters given the data. Let's give a concrete example.


In the figure above, a linear model (blue line) is mathematically defined by y = a*x + b. The parameter a defines the slope of the line while b defines the intercept. We can create infinitely many models by varying the parameters a and b. However, we can search for a specific linear model that fulfills a specific requirement, for instance minimizing the sum of the errors (red lines). Training, learning, or fitting a model refers to the procedure that finds the best possible parameters a and b fulfilling this requirement.

In a more abstract manner, we can represent fitting with the following diagram:


The model state corresponds to the parameters, and the jockey wheels refer to an optimization algorithm that finds the best parameters.
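As a minimal sketch of fitting the linear model y = a*x + b, the snippet below assumes that the requirement is to minimize the sum of squared errors (one common choice) and uses made-up data generated around y = 3*x + 1:

```python
import numpy as np

# Made-up data: points scattered around the line y = 3*x + 1.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Least-squares fit of a degree-1 polynomial: returns slope a, then intercept b.
a, b = np.polyfit(x, y, deg=1)


def sum_squared_errors(slope, intercept):
    """Sum of squared vertical distances between the data and the line."""
    return np.sum((y - (slope * x + intercept)) ** 2)


print(a, b)
# Any other (a, b) pair gives a larger error than the fitted one.
print(sum_squared_errors(a, b) <= sum_squared_errors(a + 0.1, b))
```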

train set#

The dataset used to train the model.


transformer#

An estimator (i.e. an object that has a fit method) supporting transform and/or fit_transform. Examples of transformers are StandardScaler and ColumnTransformer.


underfitting#

Underfitting occurs when your model does not have enough flexibility to represent the data well. You can tell a model is underfitting when it performs poorly on both training and test sets.

The opposite of underfitting is overfitting.

unsupervised learning#

In this setting, samples are not labelled. One particular example of unsupervised learning is clustering, whose goal is to group the data into subsets of similar samples. Potential applications of clustering include:

  • using the content of articles to group them into broad topics

  • finding different types of customers from e-commerce website data

Note that although mentioned, unsupervised learning is not covered in this course. The opposite of unsupervised learning is supervised learning.

validation set#

A machine learning model is evaluated in the following manner: the model is trained using a training set and evaluated using a testing set. In this setting, it is implied that the hyperparameters of the model are fixed.

When one would like to tune the hyperparameters of a model as well, it is necessary to subdivide the training set into a training set and a validation set: we fit several machine learning models with different hyperparameter values and select the one performing best on the validation set. Finally, once the hyperparameters are fixed, we can use the left-out testing set to evaluate this model.

Sometimes, we also use a validation set in the context of early stopping. This applies to machine learning models that are fitted with iterative optimization, when it is not clear how many iterations are needed to train the model. In this case, one uses a validation set to monitor the performance of the model on data different from the training set. Once some criteria are fulfilled, training stops. This model is finally evaluated on the left-out testing set.
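The hyperparameter tuning workflow described above can be sketched as follows. The dataset, the split sizes and the candidate values for `n_neighbors` are assumptions chosen for the example:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# First split: keep a testing set aside for the final evaluation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# Second split: subdivide the rest into a training and a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0
)

# Select the hyperparameter value that performs best on the validation set.
best_k, best_val_score = None, -1.0
for k in [1, 3, 5, 11]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    val_score = model.score(X_val, y_val)
    if val_score > best_val_score:
        best_k, best_val_score = k, val_score

# The testing set is used only once, after the hyperparameter is fixed.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_score = final_model.score(X_test, y_test)
print(best_k, test_score)
```

In practice, `GridSearchCV` automates the selection loop by using cross-validation instead of a single validation set.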

Other useful glossaries#

For generic machine learning terms:

For more advanced scikit-learn related terminology: