This glossary aims to describe the main terms used in this course. For terms that you don’t find here, we link to other useful glossaries at the bottom of this page.
Main terms used in this course¶
classification¶
The class of problems where the goal is to predict a target that can take a finite set of values. Examples of classification problems are:
predicting the type of Iris (setosa, versicolor, virginica) from their petal and sepal measurements
predicting whether a patient has a particular disease from the results of their medical tests
predicting whether an email is spam or not from its content, sender, title, etc.
When the predicted label can take only two values, it is called binary classification. This is the case for the medical and spam use cases above.
When the predicted label can take three or more values, it is called multi-class classification. This is the case for the Iris use case above.
Below, we illustrate an example of binary classification.
The data provided by the user contains 2 features, represented by the x- and y-axis. This is a binary classification problem because the target contains only 2 labels, here encoded by colors: blue and orange data points. Each data point represents a sample, and the entire set was used to train a linear model. The learned decision rule is the black dotted line. This decision rule is used to predict the label of a new sample according to its position with respect to the line: a sample lying on the left of the line will be predicted as blue, while a sample lying on the right will be predicted as orange. Here, we have a linear classifier because the decision rule is defined as a line (in higher dimensions, this would be a hyperplane). However, the shape of the decision rule depends on the model used.
classifier¶
A model used for classification. These models handle targets that contain discrete values such as "cat" or "dog". For example, in scikit-learn, HistGradientBoostingClassifier is a classifier class.
Note: for historic reasons, the LogisticRegression name is confusing. LogisticRegression is not a regression model but a classification model, contrary to what the name suggests.
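As a minimal sketch (assuming scikit-learn is installed, as in the rest of this course), a classifier can be fitted on the Iris data and then used to predict discrete class labels:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Load the Iris data: X is the data matrix, y the discrete target (0, 1 or 2)
X, y = load_iris(return_X_y=True)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

# A classifier predicts discrete class labels, not continuous values
pred = clf.predict(X[:1])
print(pred)
```

Note that despite its name, LogisticRegression is used here as a classifier.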
cross-validation¶
A procedure to estimate how well a model will generalize to new data. The main idea is to train a model on one part of a dataset (called the train set) and evaluate its performance on a separate part (called the test set).
This train/evaluate procedure is repeated several times, on different train and test sets, to estimate the uncertainty of the model’s statistical performance.
See this scikit-learn documentation for more details.
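As a small illustration (assuming scikit-learn is available), cross_val_score repeats the train/evaluate procedure on 5 different splits and returns one score per split:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: 5 different train/test splits, hence 5 scores
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())
```

The spread of the scores (their standard deviation) gives an idea of the uncertainty of the performance estimate.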
data matrix, input data¶
The data matrix has n_samples rows and n_features columns. For example, for the Iris dataset:
the data matrix has a number of rows equal to the number of Iris flowers in the dataset
the data matrix has 4 columns (for sepal length, sepal width, petal length, and petal width)
In scikit-learn, a common name for the data matrix is X (following the maths convention that matrices use capital letters and that the input is called x, as in y = f(x)).
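Concretely, for the Iris dataset the shape of the data matrix reflects these conventions:

```python
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# 150 flowers (rows, i.e. samples) and 4 measurements (columns, i.e. features)
print(X.shape)  # (150, 4)
```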
early stopping¶
This consists of stopping an iterative optimization method before the convergence of the algorithm, to avoid over-fitting. This is generally done by monitoring the generalization score on a validation set.
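As a sketch of what this looks like in scikit-learn (one of several estimators supporting it), SGDClassifier can set aside part of the training data as a validation set and stop iterating once the validation score stops improving:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier

X, y = load_iris(return_X_y=True)

# With early_stopping=True, 20% of the training data is set aside as a
# validation set; optimization stops once the validation score stops
# improving for n_iter_no_change consecutive iterations.
clf = SGDClassifier(early_stopping=True, validation_fraction=0.2,
                    n_iter_no_change=5, max_iter=1000, random_state=0)
clf.fit(X, y)
print(clf.n_iter_)  # number of iterations actually run, at most max_iter
```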
estimator¶
In scikit-learn jargon: an object that has a fit method. The reason for the name estimator is that once the fit method is called on a model, the parameters are learned (or estimated) from the data.
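For instance, very different scikit-learn objects, such as a preprocessing step and a linear model, are both estimators because they both expose fit:

```python
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# In scikit-learn, estimators of all kinds expose a `fit` method
print(hasattr(Ridge(), "fit"), hasattr(StandardScaler(), "fit"))  # True True
```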
feature, variable, attribute, descriptor, covariate¶
A quantity describing each sample, i.e. a column of the data matrix. For example, in the Iris dataset, there are four features: sepal length, sepal width, petal length and petal width.
hyperparameters¶
Aspects of model configuration that are not learnt from data. Examples of hyperparameters:
for a k-nearest neighbor approach, the number of neighbors to use is a hyperparameter
for a polynomial model (say of degree between 1 and 10 for example), the degree of the polynomial is a hyperparameter.
Hyperparameters impact both the statistical and the computational performance of a model. They are usually inspected with regard to their impact on model performance and tuned to maximize it (usually the statistical performance). This process is called hyperparameter tuning and involves procedures such as grid search and randomized search, where candidate models are evaluated on validation sets.
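As a sketch of hyperparameter tuning (assuming scikit-learn is available), a grid search over the number of neighbors of a k-nearest neighbors classifier looks like this:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# n_neighbors is a hyperparameter: it is chosen by the user, not learned
# from the data. Grid search tries each candidate using cross-validation.
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": [1, 3, 5, 10]}, cv=5)
search.fit(X, y)
print(search.best_params_)
```

The candidate values [1, 3, 5, 10] are an arbitrary choice for illustration.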
infer/inference¶
This term has a different meaning in machine learning and in statistical inference.
In machine learning, and more generally in this MOOC, inference refers to the process of making predictions by applying a trained model to unlabeled data. In other words, inference is equivalent to predicting the target of unseen data using a fitted model.
In statistical inference, the notion of left-out/unseen data is not part of the definition. Instead, inference refers to the process of fitting the parameters of a distribution conditioned on some observed data. You can check the Wikipedia article on statistical inference for more details.
learned parameters¶
In scikit-learn, the convention is that the names of learned parameters end with an underscore (_), for example coef_. They are only available after fit has been called.
Note: parameters can also be used in the general Python meaning, as in passing a parameter to a function or a class.
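For example, the slope and intercept of a linear regression are learned parameters, exposed as coef_ and intercept_ once the model is fitted:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2 * X.ravel() - 5  # noise-free data following y = 2*x - 5

model = LinearRegression()
model.fit(X, y)

# coef_ and intercept_ end with "_": they only exist after fit was called
print(model.coef_, model.intercept_)  # close to [2.] and -5.0
```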
overfitting¶
Overfitting occurs when your model sticks too closely to the training data, so that it ends up learning the noise in the dataset rather than the relevant patterns. You can tell a model is overfitting when it performs well on your train set but poorly on your test set (or new real-world data).
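As a sketch of this symptom (assuming scikit-learn is available), an unconstrained decision tree can memorize the Iris training set perfectly, while its test-set score is typically lower:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree can memorize the training set perfectly...
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_score = tree.score(X_train, y_train)
# ...but is typically less accurate on the held-out test set
test_score = tree.score(X_test, y_test)
print(train_score, test_score)
```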
predictor¶
An estimator (an object with a fit method) that also has a predict and/or fit_predict method. Note that a classifier or a regressor is a predictor. Examples of predictor classes are KNeighborsClassifier and Ridge.
Example with a linear regression: suppose we do a linear regression in 1d and learn the linear model y = 2*x - 5. If someone then asks what our model predicts for x = 10, we can use it to compute y = 2*10 - 5 = 15.
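The worked example above can be reproduced with a fitted predictor (here LinearRegression, trained on noise-free data following the same line):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2 * X.ravel() - 5  # training data following y = 2*x - 5

model = LinearRegression().fit(X, y)

# The fitted predictor can now answer: what is y when x = 10?
prediction = model.predict([[10.0]])
print(prediction)  # approximately [15.]
```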
regression¶
The class of problems where the goal is to predict a continuous target. Examples of regression problems are:
predicting house prices from their descriptions (number of rooms, surface area, location, etc.)
predicting the age of patients from their MRI scans
Below, we illustrate an example of regression.
The data provided by the user contains 1 feature x, and we want to predict the continuous target y. Each black data point is a sample used to train a model. The model here is a decision tree, and thus the decision rule is defined as a piecewise constant function, represented by the orange line. To predict the target for a new sample at a given value on the x-axis, the model outputs the corresponding y value lying on the orange line.
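A decision-tree regressor like the one illustrated can be sketched as follows; a tree of depth 2 has at most 4 leaves, so its prediction is a piecewise constant function taking at most 4 distinct values:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.linspace(0.0, 1.0, 20).reshape(-1, 1)
y = np.sin(2 * np.pi * X.ravel())  # a smooth continuous target

# A depth-2 tree has at most 4 leaves: its prediction is piecewise constant
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
n_values = len(np.unique(tree.predict(X)))
print(n_values)
```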
regressor¶
A model used for regression. These models handle continuous targets. For example, in scikit-learn, Ridge is a regressor class.
sample, instance, observation¶
A data point in a dataset.
In the 2d data matrix, a sample is a row.
For example in the Iris dataset, a sample would be the measurements of a single flower.
Note: “instance” is also used in an object-oriented meaning in this course. For example, if we define clf = KNeighborsClassifier(), we say that clf is an instance of the KNeighborsClassifier class.
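In code, this object-oriented meaning of “instance” is simply:

```python
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier()
# clf is an instance of the KNeighborsClassifier class
print(isinstance(clf, KNeighborsClassifier))  # True
```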
statistical performance, generalization performance, predictive performance¶
The performance of a model on new, unseen data, i.e. how well the model generalizes.
supervised learning¶
The setting where the target is known for each sample and is used during training. We can give a concrete graphical example.
The plot represents a supervised classification example. The data are composed of 2 features, since we can plot each data point on a 2-axis plot. The color and shape correspond to the target, and there are 2 potential choices: blue circle vs. orange square.
Supervised learning boils down to the fact that we have access to the target: during fitting, we know exactly whether a data point is a blue circle or an orange square.
target, label, annotation¶
The quantity that a model should predict. For example, in the Iris dataset, the features might include the petal length and petal width, while the label would be the Iris species.
In scikit-learn convention, y is a variable name commonly used to denote the target. This is because the target can be seen as the output of the model, and follows the convention that the output is called y, as in y = f(x).
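For the Iris dataset, the target y holds the species of each flower, encoded as integers:

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# The three Iris species are encoded as the integers 0, 1 and 2
print(np.unique(y))  # [0 1 2]
```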
train, learn, fit¶
Find ideal model parameters given the data. Let’s give a concrete example.
On the above figure, a linear model (blue line) is mathematically defined by y = a*x + b. The parameter a defines the slope of the line, while b defines the intercept. We can create an infinity of models by varying the parameters a and b. However, we can search for a specific linear model fulfilling a specific requirement, for instance minimizing the sum of the errors (red lines). Training, learning, or fitting a model refers to the procedure that finds the best a and b fulfilling this requirement.
In a more abstract manner, we can represent fitting with the following diagram:
The model state corresponds to the parameters, and the jockey wheels refer to an optimization algorithm that finds the best parameters.
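The procedure above can be sketched with plain NumPy; here np.polyfit finds the (a, b) minimizing the sum of squared errors on noisy data generated from a known line (the true values 1.5 and 0.5 are an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=50)
y = 1.5 * x + 0.5 + rng.normal(scale=0.05, size=50)  # noisy line, a=1.5, b=0.5

# "Fitting" here means finding the (a, b) minimizing the sum of squared errors
a, b = np.polyfit(x, y, deg=1)
print(a, b)  # close to 1.5 and 0.5
```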
transformer¶
An estimator (i.e. an object that has a fit method) supporting fit_transform. A common example of a transformer in scikit-learn is StandardScaler.
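For example, StandardScaler learns the mean and scale of the data with fit, then standardizes it with transform; fit_transform does both in one call:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])

# fit_transform learns the mean and scale, then standardizes the data
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(), X_scaled.std())  # approximately 0.0 and 1.0
```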
underfitting¶
Underfitting occurs when a model is too simple to capture the relevant patterns in the training data. An underfitting model performs poorly on both the train set and the test set. The opposite of underfitting is overfitting.
unsupervised learning¶
In this setting, samples are not labelled. One particular example of unsupervised learning is clustering, whose goal is to group the data into subsets of similar samples. Potential applications of clustering include:
using the content of articles to group them into broad topics
finding different types of customers from e-commerce website data
Note that although mentioned, unsupervised learning is not covered in this course. The opposite of unsupervised learning is supervised learning.
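As a sketch of clustering (assuming scikit-learn is available), k-means can recover two well-separated groups of points without ever seeing labels; the synthetic blobs below are an arbitrary illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated blobs of 20 points each, with no labels provided
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
               rng.normal(5.0, 0.1, size=(20, 2))])

# fit_predict assigns each sample to one of the 2 clusters
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```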
validation set¶
A machine learning model is evaluated in the following manner: the model is trained using a training set and evaluated using a testing set. In this setting, it is implied that the hyperparameters of the model are fixed.
When one would like to tune the hyperparameters of a model as well, it is necessary to subdivide the training set into a training and a validation set: we fit several machine learning models with different hyperparameter values and select the one performing best on the validation set. Finally, once the hyperparameters are fixed, we can use the left-out testing set to evaluate the model.
Sometimes, we also use a validation set in the context of early stopping. This applies to models fitted with iterative optimization, where it is not clear in advance how many iterations are needed to train the model. In this case, one uses a validation set to monitor the performance of the model on data different from the training set: once some criterion is fulfilled, training stops. The model is finally evaluated on the left-out testing set.
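The three-way split described above can be sketched with two calls to train_test_split (the 20%/25% fractions below are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First set aside a testing set, then split the rest into train/validation
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)
print(len(X_train), len(X_val), len(X_test))  # 90 30 30
```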
Other useful glossaries¶
For generic machine learning terms:
ML cheatsheet glossary: https://ml-cheatsheet.readthedocs.io/en/latest/glossary.html
Google Machine Learning glossary: https://developers.google.com/machine-learning/glossary
For more advanced scikit-learn related terminology:
scikit-learn glossary: https://scikit-learn.org/stable/glossary.html