# Comparing results with baseline and chance levelΒΆ

In this notebook, we present how to compare the statistical performance of a model to a minimal baseline.

Indeed, in the previous notebook, we compared the testing error by
taking into account the target distribution. A good practice is to compare
the testing error with a dummy baseline and the chance level. In
regression, we could use the `DummyRegressor`

and predict the mean target
without using the data. The chance level can be determined by permuting the
labels and check the difference of result.

Therefore, we will conduct experiment to get the score of a model and the two baselines. We will start by loading the California housing dataset.

Note

If you want a deeper overview regarding this dataset, you can refer to the Appendix - Datasets description section at the end of this MOOC.

```
from sklearn.datasets import fetch_california_housing
data, target = fetch_california_housing(return_X_y=True, as_frame=True)
target *= 100 # rescale the target in k$
```

Across all evaluations, we will use a `ShuffleSplit`

cross-validation.

```
from sklearn.model_selection import ShuffleSplit
cv = ShuffleSplit(n_splits=30, test_size=0.2, random_state=0)
```

We will start by running the cross-validation for the decision tree regressor which is our model of interest. Besides, we will store the testing error in a pandas series.

```
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_validate
regressor = DecisionTreeRegressor()
result_regressor = cross_validate(regressor, data, target,
cv=cv, scoring="neg_mean_absolute_error",
n_jobs=2)
errors_regressor = pd.Series(-result_regressor["test_score"],
name="Regressor error")
```

Then, we will evaluate our first baseline. This baseline is called a dummy
regressor. This dummy regressor will always predict the mean target computed
on the training. Therefore, the dummy regressor will never use any
information regarding the data `data`

.

```
from sklearn.dummy import DummyRegressor
dummy = DummyRegressor()
result_dummy = cross_validate(dummy, data, target,
cv=cv, scoring="neg_mean_absolute_error",
n_jobs=2)
errors_dummy = pd.Series(-result_dummy["test_score"], name="Dummy error")
```

Finally, we will evaluate the statistical performance of the second baseline. This baseline will provide the statistical performance of the chance level. Indeed, we will train a decision tree on some training data and evaluate the same tree on data where the target vector has been randomized.

```
from sklearn.model_selection import permutation_test_score
regressor = DecisionTreeRegressor()
score, permutation_score, pvalue = permutation_test_score(
regressor, data, target, cv=cv, scoring="neg_mean_absolute_error",
n_jobs=2, n_permutations=30)
errors_permutation = pd.Series(-permutation_score, name="Permuted error")
```

Finally, we plot the testing errors for the two baselines and the actual regressor.

```
final_errors = pd.concat([errors_regressor, errors_dummy, errors_permutation],
axis=1)
```

```
import matplotlib.pyplot as plt
final_errors.plot.hist(bins=50, density=True, edgecolor="black")
plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
plt.xlabel("Mean absolute error (k$)")
_ = plt.title("Distribution of the testing errors")
```

We see that even if the statistical performance of our model is far from being good, it is better than the two baselines. Besides, we see that the dummy regressor is better than a chance level regressor.

In practice, using a dummy regressor might be sufficient as a baseline. Indeed, to obtain a reliable estimate the permutation of the target should be repeated and thus this method is costly. However, it gives the true chance level.