📃 Solution for Exercise M7.03

As with the classification metrics exercise, we will evaluate the regression metrics within a cross-validation framework to get familiar with the syntax.

We will use the Ames house prices dataset.

import pandas as pd
import numpy as np

ames_housing = pd.read_csv("../datasets/house_prices.csv")
data = ames_housing.drop(columns="SalePrice")
target = ames_housing["SalePrice"]
# keep only the numerical features
data = data.select_dtypes(np.number)
# express the sale prices in k$
target /= 1000

Note

If you want a deeper overview regarding this dataset, you can refer to the Appendix - Datasets description section at the end of this MOOC.

The first step will be to create a linear regression model.

# solution
from sklearn.linear_model import LinearRegression

model = LinearRegression()

Then, use the cross_val_score function to estimate the generalization performance of the model. Use a KFold cross-validation with 10 folds. Make the use of the \(R^2\) score explicit by setting the scoring parameter (even though it is the default score).

# solution
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, data, target, cv=10, scoring="r2")
print(f"R2 score: {scores.mean():.3f} ± {scores.std():.3f}")
R2 score: 0.794 ± 0.103
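
Passing cv=10 with a regressor is shorthand for a non-shuffled KFold with 10 splits. The sketch below writes that splitter out explicitly and checks that both forms give the same scores. It uses synthetic data from make_regression as a stand-in for the Ames dataset (an assumption, so that the snippet runs on its own), and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the housing data (assumption: any regression data works here)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

model = LinearRegression()

# Integer shorthand: for a regressor, this uses KFold without shuffling
scores_int = cross_val_score(model, X, y, cv=10, scoring="r2")

# Same strategy, written explicitly
cv = KFold(n_splits=10)
scores_kfold = cross_val_score(model, X, y, cv=cv, scoring="r2")

# Both calls should return the same ten scores
print(np.allclose(scores_int, scores_kfold))
```

An explicit splitter becomes useful when the rows are stored in a meaningful order, since KFold(n_splits=10, shuffle=True, random_state=0) shuffles the samples before splitting.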

Then, instead of using the \(R^2\) score, use the mean absolute error. You need to refer to the documentation for the scoring parameter.

# solution
scores = cross_val_score(
    model, data, target, cv=10, scoring="neg_mean_absolute_error"
)
errors = -scores
print(f"Mean absolute error: {errors.mean():.3f} k$ ± {errors.std():.3f}")
Mean absolute error: 21.892 k$ ± 2.225
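
For intuition about what the two metrics measure, here is a minimal sketch that computes them by hand with NumPy. The four prices are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Made-up prices in k$ (illustrative values, not taken from the dataset)
y_true = np.array([200.0, 150.0, 300.0, 250.0])
y_pred = np.array([210.0, 140.0, 290.0, 270.0])

# Mean absolute error: average magnitude of the residuals, in the target's unit
mae = np.mean(np.abs(y_true - y_pred))

# R2: one minus the ratio of the residual to the total sum of squares
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE: {mae:.1f} k$")  # MAE: 12.5 k$
print(f"R2: {r2:.3f}")       # R2: 0.944
```

The mean absolute error is expressed in the unit of the target (here k$), whereas \(R^2\) is unitless, which is why the two are read differently.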

The scoring parameter in scikit-learn expects a score, meaning that higher values indicate a better model. An error follows the opposite convention: the smaller it is, the better the model. Therefore, the error is multiplied by -1 to turn it into a score. That's why the strings passed to scoring start with neg_ when the metric is an error.
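
The sign convention can be checked directly. The sketch below compares the mean_absolute_error metric function with the scorer obtained from the "neg_mean_absolute_error" string. It assumes synthetic data from make_regression and, for brevity, evaluates on the training data rather than in cross-validation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer, mean_absolute_error

# Synthetic data (assumption: used only to make the snippet self-contained)
X, y = make_regression(n_samples=100, n_features=3, noise=5.0, random_state=0)
model = LinearRegression().fit(X, y)

# The metric function returns an error: non-negative, lower is better
mae = mean_absolute_error(y, model.predict(X))

# The scorer built from the "neg_..." string returns the same quantity, sign flipped
neg_mae_scorer = get_scorer("neg_mean_absolute_error")
score = neg_mae_scorer(model, X, y)

# The score should be the negated error
print(np.isclose(score, -mae))
```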

Finally, use the cross_validate function and compute multiple scores/errors at once by passing a list of scorers to the scoring parameter. You can compute the \(R^2\) score and the mean absolute error for instance.

# solution
from sklearn.model_selection import cross_validate

scoring = ["r2", "neg_mean_absolute_error"]
cv_results = cross_validate(model, data, target, scoring=scoring)
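# collect the per-fold results in a dataframe (pandas is already imported as pd);
# the sign of the negated MAE is flipped back to obtain a positive error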

scores = {
    "R2": cv_results["test_r2"],
    "MAE": -cv_results["test_neg_mean_absolute_error"],
}
scores = pd.DataFrame(scores)
scores
         R2        MAE
0  0.848721  21.256799
1  0.816374  22.084083
2  0.813513  22.113367
3  0.814138  20.448279
4  0.637473  24.370341
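
A per-fold table like this one can be condensed into a mean and a standard deviation per metric. The sketch below shows one possible way to do so. It assumes synthetic data from make_regression in place of the Ames dataset, and the name summary is illustrative.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the housing data (assumption)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

cv_results = cross_validate(
    LinearRegression(), X, y, cv=5, scoring=["r2", "neg_mean_absolute_error"]
)

# One "test_<scorer name>" entry per scorer, plus fit and score timings
print(sorted(cv_results))

scores = pd.DataFrame(
    {
        "R2": cv_results["test_r2"],
        "MAE": -cv_results["test_neg_mean_absolute_error"],  # back to a positive error
    }
)

# Mean and standard deviation across folds for each metric
summary = scores.agg(["mean", "std"])
print(summary)
```

Note that pandas computes the sample standard deviation (ddof=1), whereas the NumPy std method used in the earlier cells defaults to ddof=0, so the two can differ slightly on a small number of folds.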