π Solution for Exercise M7.03#
As with the classification metrics exercise, we will evaluate the regression metrics within a cross-validation framework to get familiar with the syntax.
We will use the Ames house prices dataset.
import pandas as pd
import numpy as np
ames_housing = pd.read_csv("../datasets/house_prices.csv")
data = ames_housing.drop(columns="SalePrice")
target = ames_housing["SalePrice"]
data = data.select_dtypes(np.number)
target /= 1000
Note
If you want a deeper overview regarding this dataset, you can refer to the Appendix - Datasets description section at the end of this MOOC.
The first step will be to create a linear regression model.
# solution
from sklearn.linear_model import LinearRegression
model = LinearRegression()
Then, use the cross_val_score
to estimate the generalization performance of
the model. Use a KFold
cross-validation with 10 folds. Make the use of the
\(R^2\) score explicit by assigning the parameter scoring
(even though it is
the default score).
# solution
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, data, target, cv=10, scoring="r2")
print(f"R2 score: {scores.mean():.3f} Β± {scores.std():.3f}")
R2 score: 0.794 Β± 0.103
Then, instead of using the \(R^2\) score, use the mean absolute error. You need
to refer to the documentation for the scoring
parameter.
# solution
scores = cross_val_score(
model, data, target, cv=10, scoring="neg_mean_absolute_error"
)
errors = -scores
print(f"Mean absolute error: {errors.mean():.3f} k$ Β± {errors.std():.3f}")
Mean absolute error: 21.892 k$ Β± 2.225
The scoring
parameter in scikit-learn expects score. It means that the
higher the values, and the smaller the errors are, the better the model is.
Therefore, the error should be multiplied by -1. Thatβs why the string given
the scoring
starts with neg_
when dealing with metrics which are errors.
Finally, use the cross_validate
function and compute multiple scores/errors
at once by passing a list of scorers to the scoring
parameter. You can
compute the \(R^2\) score and the mean absolute error for instance.
# solution
from sklearn.model_selection import cross_validate
scoring = ["r2", "neg_mean_absolute_error"]
cv_results = cross_validate(model, data, target, scoring=scoring)
import pandas as pd
scores = {
"R2": cv_results["test_r2"],
"MAE": -cv_results["test_neg_mean_absolute_error"],
}
scores = pd.DataFrame(scores)
scores
R2 | MAE | |
---|---|---|
0 | 0.848721 | 21.256799 |
1 | 0.816374 | 22.084083 |
2 | 0.813513 | 22.113367 |
3 | 0.814138 | 20.448279 |
4 | 0.637473 | 24.370341 |