This glossary aims to describe the main terms used in this course. For terms that you don’t find in this glossary, we added useful glossaries at the bottom of this page.
Main terms used in this course#
Acronym that stands for “Application Programming Interface”. It can have a slightly different meaning in different contexts: in some cases it can be used to designate an online service that can be accessed by remote programs. In the context of an online service, the term “API” can be used both to designate the service itself, and the technical specification of the programming interface used by people who write client applications that connect to this service.
In the context of an offline library such as scikit-learn, it means the list of all (public) functions, classes and methods in the library, along with their documentation via docstrings. It can be browsed online at:
In scikit-learn we try to adopt simple conventions and limit to a minimum the number of methods an object must implement. Furthermore, scikit-learn tries to use consistent method names for different estimators of the same category: e.g. all transformers expose fit, fit_transform and transform methods and generally accept arguments of similar types and shapes for those methods.
Type of problem where the goal is to predict a target that can take a finite set of values.
Examples of classification problems are:
predicting the type of Iris (setosa, versicolor, virginica) from their petal and sepal measurements
predicting whether a patient has a particular disease from the results of their medical tests
predicting whether an email is spam or not from the email content, sender, title, etc.
When the predicted label can have two values, it is called binary classification. This is the case for the medical and spam use cases above.
When the predicted label can have at least three values, it is called multi-class classification. This is the case for the Iris use case above.
Below, we illustrate an example of binary classification.
The data provided by the user contains 2 features, represented by the x- and y-axis. This is a binary classification problem because the target contains only 2 labels, here encoded by colors with blue and orange data points. Each data point represents a sample, and the entire set was used to train a linear model. The decision rule learned is the black dotted line. This decision rule is used to predict the label of a new sample according to its position with respect to the line: a sample lying on the left of the line will be predicted as a blue sample, while a sample lying on the right of the line will be predicted as an orange sample. Here, we have a linear classifier because the decision rule is defined as a line (in higher dimensions this would be a hyperplane). However, the shape of the decision rule depends on the model used.
A model used for classification. These models handle targets that contain discrete values such as cat or dog. For example in scikit-learn LogisticRegression and HistGradientBoostingClassifier are classification model classes.
Note: for historic reasons the LogisticRegression name is confusing. LogisticRegression is not a regression model but a classification model, contrary to what the name would suggest.
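As a minimal sketch of what a classifier does, the snippet below fits LogisticRegression on synthetic data. The dataset (generated with make_blobs) and all variable names are assumptions chosen for illustration, not part of the course material.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (an assumption for this sketch):
# 2 features and 2 classes encoded as the labels 0 and 1.
X, y = make_blobs(n_samples=200, centers=2, n_features=2, random_state=0)

# Despite its name, LogisticRegression is a classifier.
clf = LogisticRegression()
clf.fit(X, y)

# Predictions are discrete labels taken from the classes seen during fit.
predictions = clf.predict(X[:5])
print(clf.classes_, predictions)
```

The predicted values are always drawn from clf.classes_, which is what distinguishes a classifier from a regressor.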
A procedure to estimate how well a model will generalize to new data. The main idea behind this is to train a model on a dataset (called train set) and evaluate its performance on a separate dataset (called test set).
This train/evaluate procedure is repeated several times on different train and test sets to estimate both the model’s generalization performance and the uncertainty of that estimate.
See this scikit-learn documentation for more details.
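The repeated train/evaluate procedure can be sketched with scikit-learn's cross_val_score. The choice of the Iris dataset, of LogisticRegression, and of 5 folds are assumptions made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: the model is trained and evaluated 5 times,
# each time on a different train/test partition of the data.
scores = cross_val_score(model, X, y, cv=5)

# The mean estimates generalization performance; the standard deviation
# gives a rough idea of the uncertainty of that estimate.
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```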
data matrix, input data#
The data containing only the features and not the target.
The data matrix has n_samples rows and n_features columns. For example for the Iris dataset:
the data matrix has a number of rows equal to the number of Iris flowers in the dataset
the data matrix has 4 columns (for sepal length, sepal width, petal length, and petal width)
In scikit-learn a common name for the data matrix is X (following the maths convention that matrices use capital letters and that the input is called x, as in y = f(x)).
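A short sketch of the shape of a data matrix, using the copy of the Iris dataset that ships with scikit-learn (the variable names n_samples and n_features mirror the terms above and are otherwise arbitrary):

```python
from sklearn.datasets import load_iris

# X is the data matrix (features only); y holds the target.
X, y = load_iris(return_X_y=True)

# One row per flower, one column per measured feature.
n_samples, n_features = X.shape
print(n_samples, n_features)
```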
This consists in stopping an iterative optimization method before the convergence of the algorithm, to avoid over-fitting. This is generally done by monitoring the generalization score on a validation set.
In scikit-learn jargon: an object that has a fit method. The reason for the name estimator is that once the fit method is called on a model, the parameters are learned (or estimated) from the data.
feature, variable, attribute, descriptor, covariate#
A quantity describing a sample (e.g. color, size, weight). You can see a feature as a quantity measured during the dataset collection.
For example, in the Iris dataset, there are four features: sepal length, sepal width, petal length and petal width.
generalization performance, predictive performance, statistical performance#
The performance of a model on the test data. The test data were never seen by the model during the training procedure.
Aspects of model configuration that are not learnt from data. Examples of hyperparameters:
for a k-nearest neighbor approach, the number of neighbors to use is a hyperparameter
for a polynomial model (say of degree between 1 and 10 for example), the degree of the polynomial is a hyperparameter.
Hyperparameters impact the generalization and computational performance of a model. Hyperparameters are usually inspected with regard to their impact on model performance and tuned to maximize it (usually generalization performance). This is called hyperparameter tuning and typically involves grid search or randomized search, with model evaluation on some validation sets.
For more details, you can further read the following post
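Hyperparameter tuning by grid search can be sketched as follows for the k-nearest neighbors case mentioned above. The dataset, the candidate values in param_grid, and the use of 5-fold cross-validation are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Candidate values for the hyperparameter: the number of neighbors.
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}

# Each candidate is evaluated by cross-validation; the best one is kept.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

Note that n_neighbors is set by the user (or by the search), whereas nothing about it is learned by calling fit on a single KNeighborsClassifier.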
This term has a different meaning in machine-learning and statistical inference.
In machine-learning and more generally in this MOOC, we refer to inference as the process of making predictions by applying a trained model to unlabeled data. In other words, inference is equivalent to predicting the target of unseen data using a fitted model.
In statistical inference, the notion of left-out/unseen data is not tied to the definition. Indeed, inference refers to the process of fitting the parameters of a distribution conditioned on some observed data. You can check the Wikipedia article on statistical inference for more details.
In scikit-learn the convention is that the names of learned parameters end with an underscore (_). They are only available after fit has been called.
An example of such parameters is the slope and intercept of a linear model in one dimension; see this section for more details about such a model.
Note: parameters can also be used in a general Python meaning, as in passing a parameter to a function or a class.
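A small sketch of learned parameters, using LinearRegression on noise-free data generated from y = 2*x - 5 (the data are an assumption made so that the expected slope and intercept are known in advance):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free 1d data generated from the line y = 2*x - 5.
x = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * x.ravel() - 5

model = LinearRegression()

# Learned parameters do not exist before fit has been called.
has_coef_before_fit = hasattr(model, "coef_")

model.fit(x, y)

# After fit, the slope and intercept are stored in attributes whose
# names end with an underscore.
print(has_coef_before_fit, model.coef_, model.intercept_)
```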
In scikit-learn jargon: an estimator that takes another estimator as a parameter. Examples of meta-estimators include
Generic term that refers to something that can learn prediction rules from the data.
Overfitting occurs when your model sticks too closely to the training data, so that it ends up learning the noise in the dataset rather than the relevant patterns. You can tell a model is overfitting when it performs great on your train set, but poorly on your test set (or new real-world data).
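The gap between train and test performance can be observed with a deliberately flexible model. In this sketch, the noisy sine data and the unconstrained decision tree are assumptions chosen to make overfitting easy to see.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy 1d data: a sine wave plus random noise.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A tree with no depth limit can memorize the training set, noise included.
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_score = tree.score(X_train, y_train)  # R^2 on data seen during fit
test_score = tree.score(X_test, y_test)     # R^2 on held-out data
print(train_score, test_score)
```

Limiting the flexibility of the model (for instance with the max_depth hyperparameter) is one way to reduce this gap.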
An estimator (object with a fit method) with a predict and/or fit_predict method. Note that a classifier or a regressor is a predictor. Examples of predictor classes are
One of the focuses of machine learning is to learn rules from data that we can then use to make predictions on new samples that were not seen during training.
Example with a linear regression: suppose we fit a linear regression in 1d and learn the linear model y = 2*x - 5. If someone asks what the model predicts for x = 10, we can compute y = 2*10 - 5 = 15.
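The same computation, written as a plain Python function. The function name predict and its default arguments are hypothetical and only mirror the learned model y = 2*x - 5 from the example above.

```python
def predict(x, slope=2.0, intercept=-5.0):
    """Apply the already-learned linear model y = slope * x + intercept."""
    return slope * x + intercept


# Prediction for a new sample that was not seen during training.
y_pred = predict(10)
print(y_pred)  # prints 15.0
```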
The goal is to predict a target that is continuous (as opposed to the discrete target of classification problems). Examples of regression problems are:
predicting house prices from their descriptions (number of rooms, surface, location, etc.)
predicting the age of patients from their MRI scans
Below, we illustrate an example of regression.
The data provided by the user contains 1 feature called x and we want to predict the continuous target y. Each black data point is a sample used to train a model. The model here is a decision tree and thus the decision rule is defined as a piecewise constant function represented by the orange line. To predict the target for a new sample for a given value of the x-axis, the model will output the corresponding y value lying on the orange line.
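The piecewise constant nature of a decision tree's predictions can be checked numerically. This sketch uses synthetic data and max_depth=2, both assumptions for illustration: a tree of depth 2 has at most 4 leaves, so it can output at most 4 distinct values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1d regression data: one feature x, one continuous target y.
rng = np.random.RandomState(0)
x = np.sort(rng.uniform(0, 10, size=100)).reshape(-1, 1)
y = np.sin(x.ravel()) + rng.normal(scale=0.1, size=100)

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(x, y)

# Predict on a fine grid of new x values.
x_new = np.linspace(0, 10, 1000).reshape(-1, 1)
y_pred = tree.predict(x_new)

# Only a handful of distinct values appear: one constant per leaf.
n_levels = len(np.unique(y_pred))
print(n_levels)
```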
A regressor is a predictor in a regression setting. In scikit-learn, Ridge is one example of a regressor class.
In linear models, regularization can be used in order to shrink/constrain the weights/parameters towards zero. This can be useful to avoid overfitting.
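The shrinking effect can be sketched by comparing an unregularized linear regression with Ridge on the same data. The synthetic dataset and the value alpha=100.0 are arbitrary assumptions for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

# Small synthetic regression problem with noisy targets.
X, y = make_regression(n_samples=50, n_features=10, noise=10.0, random_state=0)

# Ordinary least squares: no constraint on the weights.
ols = LinearRegression().fit(X, y)

# Ridge adds an L2 penalty; a larger alpha means stronger shrinkage.
ridge = Ridge(alpha=100.0).fit(X, y)

ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print(ols_norm, ridge_norm)
```

Here alpha is a hyperparameter: it is not learned from the data and would typically be tuned on a validation set.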
sample, instance, observation#
A data point in a dataset.
In the 2d data matrix, a sample is a row.
For example in the Iris dataset, a sample would be the measurements of a single flower.
Note: “instance” is also used in an object-oriented meaning in this course. For example, if we define clf = KNeighborsClassifier(), we say that clf is an instance of the KNeighborsClassifier class.
We can give a concrete graphical example.
The plot represents a supervised classification example. The data are composed of 2 features since we can plot each data point on a 2-axis plot. The color and shape correspond to the target and we have 2 potential choices: blue circle vs. orange square.
Supervised learning boils down to the fact that we have access to the target. During fitting, we know exactly whether a data point is a blue circle or an orange square.
On the contrary, unsupervised learning only has access to the data points and not the target.
Framing a machine learning problem as a supervised or unsupervised learning problem depends on the data available and the data science problem to be solved.
target, label, annotation#
The quantity we are trying to predict from the features. Targets are available in a supervised learning setting and not in an unsupervised learning setting.
For example, in the Iris dataset, the features might include the petal length and petal width, while the label would be the Iris species.
In scikit-learn convention: y is a variable name commonly used to denote the target. This is because the target can be seen as the output of the model and follows the convention that output is called y as in y = f(x).
Target is usually used in a regression setting while label is usually used in a classification setting.
The dataset used to make predictions with a model after it is trained, in order to evaluate its generalization performance.
train, learn, fit#
Find ideal model parameters given the data. Let’s give a concrete example.
On the above figure, a linear model (blue line) is mathematically defined by y = a*x + b. The parameter a defines the slope of the line while b defines the intercept. We can create an infinity of models by varying the parameters a and b. However, we can search for a specific linear model that fulfills a specific requirement, for instance minimizing the sum of the errors (red lines). Training, learning, or fitting a model refers to the procedure that finds the best a and b fulfilling this requirement.
In a more abstract manner, we can represent fitting with the following diagram:
The model state corresponds to the parameters, and the jockey wheels refer to an optimization algorithm that finds the best parameters.
The dataset used to train the model.
An estimator (i.e. an object that has a fit method) supporting transform and/or fit_transform. Examples of transformers are
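As an illustration of the transformer API, the sketch below uses scikit-learn's StandardScaler, which is one transformer among others; the tiny data matrix is made up for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A made-up data matrix with 3 samples and 2 features on different scales.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

scaler = StandardScaler()

# fit_transform learns the per-feature mean and scale, then applies them.
X_scaled = scaler.fit_transform(X)

# The learned statistics are stored as attributes ending with an underscore.
print(scaler.mean_, scaler.scale_)
print(X_scaled)
```

Once fitted, scaler.transform can be applied to new data using the statistics learned from the training data.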
Underfitting occurs when your model does not have enough flexibility to represent the data well. You can tell a model is underfitting when it performs poorly on both training and test sets.
The opposite of underfitting is overfitting.
In this setting, samples are not labelled. One particular example of unsupervised learning is clustering, whose goal is to group the data into subsets of similar samples. Potential applications of clustering include:
using the content of articles to group them into broad topics
finding different types of customers from the data of an e-commerce website
Note that although mentioned, unsupervised learning is not covered in this course. The opposite of unsupervised learning is supervised learning.
A machine learning model is evaluated in the following manner: the model is trained using a training set and evaluated using a testing set. In this setting, it is implied that the hyperparameters of the model are fixed.
When one would like to tune the hyperparameters of a model as well, it is necessary to subdivide the training set into a training set and a validation set: we fit several machine learning models with different hyperparameter values and select the one performing best on the validation set. Finally, once the hyperparameters are fixed, we can use the left-out testing set to evaluate this model.
Sometimes, we also use a validation set in the context of early stopping. This applies to machine learning models that are fitted by iterative optimization, where it is not clear how many iterations are needed to train the model. In this case, one uses a validation set to monitor the performance of the model on some data different from the training set. Once some stopping criteria are fulfilled, training stops and the model is considered trained. This model is finally evaluated on the left-out testing set.
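The train/validation/test workflow for hyperparameter tuning can be sketched by hand as follows. The dataset, the split proportions, and the candidate values of n_neighbors are assumptions for this example; in practice a tool such as GridSearchCV automates the selection step.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# First split: keep a testing set aside for the final evaluation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# Second split: subdivide the rest into a training and a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0
)

# Select the hyperparameter value that performs best on the validation set.
best_k, best_val_score = None, -1.0
for k in [1, 3, 5, 7]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    val_score = model.score(X_val, y_val)
    if val_score > best_val_score:
        best_k, best_val_score = k, val_score

# With the hyperparameter fixed, evaluate once on the left-out testing set.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_score = final_model.score(X_test, y_test)
print(best_k, best_val_score, test_score)
```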
Other useful glossaries#
For generic machine learning terms:
ML cheatsheet glossary: https://ml-cheatsheet.readthedocs.io/en/latest/glossary.html
Google Machine Learning glossary: https://developers.google.com/machine-learning/glossary
For more advanced scikit-learn related terminology:
scikit-learn glossary: https://scikit-learn.org/stable/glossary.html