# Solution for Exercise M7.03
As with the classification metrics exercise, we will evaluate the regression metrics within a cross-validation framework to get familiar with the syntax.
We will use the Ames house prices dataset.
import pandas as pd
import numpy as np

# Load the Ames housing data and separate the target from the features
ames_housing = pd.read_csv("../datasets/house_prices.csv")
data = ames_housing.drop(columns="SalePrice")
target = ames_housing["SalePrice"]

# Keep only the numerical features and express the target in k$
data = data.select_dtypes(np.number)
target /= 1000
Note
If you want a deeper overview of this dataset, you can refer to the Appendix - Datasets description section at the end of this MOOC.
The first step will be to create a linear regression model.
# solution
from sklearn.linear_model import LinearRegression
model = LinearRegression()
Then, use the `cross_val_score` function to estimate the generalization performance of the model. Use a `KFold` cross-validation with 10 folds. Make the use of the \(R^2\) score explicit by assigning the `scoring` parameter (even though it is the default score).
# solution
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, data, target, cv=10, scoring="r2")
print(f"R2 score: {scores.mean():.3f} Β± {scores.std():.3f}")
R2 score: 0.794 Β± 0.103
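When an integer is passed to `cv`, scikit-learn builds a `KFold` splitter under the hood for regressors. As a sanity check, here is a minimal sketch, equivalent to the cell above, that makes the splitter explicit:

from sklearn.model_selection import KFold

# An explicit KFold splitter is equivalent to cv=10 for a regressor
cv = KFold(n_splits=10)
scores = cross_val_score(model, data, target, cv=cv, scoring="r2")
print(f"R2 score: {scores.mean():.3f} ± {scores.std():.3f}")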
Then, instead of using the \(R^2\) score, use the mean absolute error (MAE). You may need to refer to the documentation for the `scoring` parameter.
# solution
scores = cross_val_score(
    model, data, target, cv=10, scoring="neg_mean_absolute_error"
)
errors = -scores
print(f"Mean absolute error: {errors.mean():.3f} k$ ± {errors.std():.3f}")
Mean absolute error: 21.892 k$ ± 2.225
The `scoring` parameter in scikit-learn expects a score: the higher the value, the better the model. For errors the opposite holds (the smaller, the better), so the error is multiplied by -1 to turn it into a score. That's why the string passed to `scoring` starts with `neg_` when dealing with metrics which are errors.
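The same sign convention applies when building a custom scorer: `make_scorer` with `greater_is_better=False` negates the error in the same way. Here is a minimal sketch reproducing the previous cell with a hand-built scorer:

from sklearn.metrics import make_scorer, mean_absolute_error

# greater_is_better=False makes the scorer negate the error, so that
# higher (less negative) values still mean a better model
neg_mae_scorer = make_scorer(mean_absolute_error, greater_is_better=False)
scores = cross_val_score(model, data, target, cv=10, scoring=neg_mae_scorer)
print(f"Mean absolute error: {-scores.mean():.3f} k$")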
Finally, use the `cross_validate` function and compute multiple scores/errors at once by passing a list of scorers to the `scoring` parameter. You can compute the \(R^2\) score and the mean absolute error for instance.
# solution
from sklearn.model_selection import cross_validate
scoring = ["r2", "neg_mean_absolute_error"]
cv_results = cross_validate(model, data, target, scoring=scoring)
scores = {
    "R2": cv_results["test_r2"],
    "MAE": -cv_results["test_neg_mean_absolute_error"],
}
scores = pd.DataFrame(scores)
scores
|   | R2 | MAE |
|---|---|---|
| 0 | 0.848721 | 21.256799 |
| 1 | 0.816374 | 22.084083 |
| 2 | 0.813513 | 22.113367 |
| 3 | 0.814138 | 20.448279 |
| 4 | 0.637473 | 24.370341 |
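If you want a single summary across folds, a small usage sketch (not part of the original exercise) is to aggregate the dataframe:

# Mean and standard deviation of each metric across the folds
scores.aggregate(["mean", "std"])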
In the Regression Metrics notebook, we introduced the concept of loss function, which is the metric optimized when training a model. In the case of `LinearRegression`, the fitting process consists in minimizing the mean squared error (MSE). Some estimators, such as `HistGradientBoostingRegressor`, can use different loss functions, to be set using the `loss` hyperparameter.

Notice that the evaluation metrics and the loss functions are not necessarily the same. Let's see an example:
# solution
from collections import defaultdict

from sklearn.ensemble import HistGradientBoostingRegressor

scoring = ["neg_mean_squared_error", "neg_mean_absolute_error"]
loss_functions = ["squared_error", "absolute_error"]
scores = defaultdict(list)

for loss_func in loss_functions:
    model = HistGradientBoostingRegressor(loss=loss_func)
    cv_results = cross_validate(model, data, target, scoring=scoring)
    mse = -cv_results["test_neg_mean_squared_error"]
    mae = -cv_results["test_neg_mean_absolute_error"]
    scores["loss"].append(loss_func)
    scores["MSE"].append(f"{mse.mean():.1f} ± {mse.std():.1f}")
    scores["MAE"].append(f"{mae.mean():.1f} ± {mae.std():.1f}")

scores = pd.DataFrame(scores)
scores.set_index("loss")
| loss | MSE | MAE |
|---|---|---|
| squared_error | 892.2 ± 243.6 | 17.6 ± 0.9 |
| absolute_error | 923.8 ± 344.6 | 16.7 ± 1.5 |
Even if the score distributions overlap due to the presence of outliers in the dataset, the average MSE is indeed lower when `loss="squared_error"`, whereas the average MAE is lower when `loss="absolute_error"`, as expected. In other words, the loss function should be chosen according to the evaluation metric that we want to optimize for a given use case.
If you feel like going beyond the contents of this MOOC, you can try different combinations of loss functions and evaluation metrics.
Notice that there are some metrics that cannot be directly optimized by optimizing a loss function. This is the case for metrics that evolve in a discontinuous manner with respect to the internal parameters of the model, as learning solvers based on gradient descent or similar optimizers require continuity (the details are beyond the scope of this MOOC).
For instance, classification models are often evaluated using metrics computed on hard class predictions (i.e. whether a sample belongs to a given class) rather than from continuous values such as `predict_proba` (i.e. the estimated probability of belonging to said given class). Because of this, classifiers are typically trained by optimizing a loss function computed from some continuous output of the model. We call it a "surrogate loss" as it substitutes the metric of interest. For instance, `LogisticRegression` minimizes the `log_loss` applied to the `predict_proba` output of the model. By minimizing the surrogate loss, we hope to also maximize the accuracy. However, scikit-learn does not provide surrogate losses for all possible classification metrics.
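To make the distinction concrete, here is a minimal sketch on a synthetic dataset (not part of the original exercise) contrasting the continuous surrogate loss with the metric computed on hard predictions:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

classifier = LogisticRegression().fit(X_train, y_train)

# The surrogate loss is computed from continuous probabilities...
print(f"log loss: {log_loss(y_test, classifier.predict_proba(X_test)):.3f}")
# ...while the evaluation metric uses hard class predictions
print(f"accuracy: {accuracy_score(y_test, classifier.predict(X_test)):.3f}")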