# 🏁 Wrap-up quiz 5

**This quiz requires some programming to be answered.**

Open the dataset `ames_housing_no_missing.csv` with the following command:

```python
import pandas as pd

ames_housing = pd.read_csv(
    "../datasets/ames_housing_no_missing.csv",
    na_filter=False,  # required for pandas>2.0
)
target_name = "SalePrice"
data = ames_housing.drop(columns=target_name)
target = ames_housing[target_name]
```

`ames_housing` is a pandas dataframe. The column "SalePrice" contains the
target variable.

To simplify this exercise, we will only used the numerical features defined
below:

```python
numerical_features = [
    "LotFrontage", "LotArea", "MasVnrArea", "BsmtFinSF1", "BsmtFinSF2",
    "BsmtUnfSF", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "LowQualFinSF",
    "GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "TotRmsAbvGrd", "Fireplaces",
    "GarageCars", "GarageArea", "WoodDeckSF", "OpenPorchSF", "EnclosedPorch",
    "3SsnPorch", "ScreenPorch", "PoolArea", "MiscVal",
]

data_numerical = data[numerical_features]
```

We will compare the generalization performance of a decision tree and a linear
regression. For this purpose, we will create two separate predictive models
and evaluate them by 10-fold cross-validation.

Thus, use `sklearn.linear_model.LinearRegression` and
`sklearn.tree.DecisionTreeRegressor` to create the models. Use the default
parameters for the linear regression and set `random_state=0` for the decision
tree.

Be aware that a linear model requires to scale numerical features.
Please use `sklearn.preprocessing.StandardScaler` so that your
linear regression model behaves the same way as the quiz author
intended ;)

```{admonition} Question
By comparing the cross-validation test scores for both models fold-to-fold, count the number of
times the linear model has a better test score than the decision tree model.
Select the range which this number belongs to:

- a) [0, 3]: the linear model is substantially worse than the decision tree
- b) [4, 6]: both models are almost equivalent
- c) [7, 10]: the linear model is substantially better than the decision tree

_Select a single answer_
```

+++

Instead of using the default parameters for the decision tree regressor, we will
optimize the `max_depth` of the tree. Vary the `max_depth` from 1 level up to 15
levels. Use nested cross-validation to evaluate a grid-search
(`sklearn.model_selection.GridSearchCV`). Set `cv=10` for both the inner and
outer cross-validations, then answer the questions below.

```{admonition} Question
What is the optimal tree depth for the current problem?

- a) The optimal depth is ranging from 3 to 5
- b) The optimal depth is ranging from 5 to 8
- c) The optimal depth is ranging from 8 to 11
- d) The optimal depth is ranging from 11 to 15

_Select a single answer_
```

+++

Now, we want to evaluate the generalization performance of the decision tree
while taking into account the fact that we tune the depth for this specific
dataset. Use the grid-search as an estimator inside a `cross_validate` to
automatically tune the `max_depth` parameter on each cross-validation
fold.

```{admonition} Question
A tree with tuned depth

- a) is always worse than the linear models on all CV folds
- b) is often but not always worse than the linear model
- c) is often but not always better than the linear model
- d) is always better than the linear models on all CV folds

_Select a single answer_

Note: Try to set the random_state of the decision tree to different values
e.g. random_state=1 or random_state=2 and re-run the nested cross-validation
to check that your answer is stable enough.
```

+++

Instead of using only the numerical features you will now use the entire dataset
available in the variable `data`.

Create a preprocessor by dealing separately with the numerical and categorical
columns. For the sake of simplicity, we will assume the following:

- categorical columns can be selected if they have an `object` data type;
- use an `OrdinalEncoder` to encode the categorical columns;
- numerical columns should correspond to the `numerical_features` as defined above.
  This is a subset of the features that are not an `object` data type.

In addition, set the `max_depth` of the decision tree to `7` (fixed, no need
to tune it with a grid-search).

Evaluate this model using `cross_validate` as in the previous questions.

```{admonition} Question
A tree model trained with both numerical and categorical features

- a) is most often worse than the tree model using only the numerical features
- b) is most often better than the tree model using only the numerical features

_Select a single answer_

Note: Try to set the random_state of the decision tree to different values
e.g. random_state=1 or random_state=2 and re-run the (this time single) cross-validation
to check that your answer is stable enough.
```