# Wrap-up quiz

**This quiz requires some programming to be answered.**

Open the dataset `house_prices.csv` with the following command:

```
import pandas as pd

ames_housing = pd.read_csv("../datasets/house_prices.csv", na_values="?")
target_name = "SalePrice"
data = ames_housing.drop(columns=target_name)
target = ames_housing[target_name]
```

`ames_housing` is a pandas dataframe. The column `"SalePrice"` contains the
target variable. Note that we instructed pandas to treat the character `"?"` as a
marker for cells with missing values, also known as "null" values.

To simplify this exercise, we will only use the numerical features defined below:

```
numerical_features = [
    "LotFrontage", "LotArea", "MasVnrArea", "BsmtFinSF1", "BsmtFinSF2",
    "BsmtUnfSF", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "LowQualFinSF",
    "GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "TotRmsAbvGrd", "Fireplaces",
    "GarageCars", "GarageArea", "WoodDeckSF", "OpenPorchSF", "EnclosedPorch",
    "3SsnPorch", "ScreenPorch", "PoolArea", "MiscVal",
]
data_numerical = data[numerical_features]
```

We will compare the generalization performance of a decision tree and a linear regression. For this purpose, we will create two separate predictive models and evaluate them by 10-fold cross-validation.

Thus, use `sklearn.linear_model.LinearRegression` and
`sklearn.tree.DecisionTreeRegressor` to create the models. Use the default
parameters for both models.

**Note**: missing values should be handled with a scikit-learn
`sklearn.impute.SimpleImputer` using the default strategy (`"mean"`). Also be
aware that a linear model requires the data to be scaled. You can use a
`sklearn.preprocessing.StandardScaler`.
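One possible way to set up the two pipelines and compare them with 10-fold cross-validation is sketched below. The small synthetic dataframe is only a stand-in so the snippet runs on its own; substitute the real `data_numerical` and `target` loaded above.

```
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for data_numerical / target; replace with the real data.
rng = np.random.RandomState(0)
data_numerical = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
data_numerical.iloc[::20, 0] = np.nan  # a few missing values, as in the dataset
target = pd.Series(data_numerical["b"] * 3 + rng.normal(size=200))

# Linear model: impute missing values, scale, then fit.
linear_model = make_pipeline(SimpleImputer(), StandardScaler(), LinearRegression())
# Tree: imputation is enough, trees do not need scaled features.
tree_model = make_pipeline(SimpleImputer(), DecisionTreeRegressor())

# cross_validate uses the estimator's default score, i.e. R2 for regressors.
cv_linear = cross_validate(linear_model, data_numerical, target, cv=10)
cv_tree = cross_validate(tree_model, data_numerical, target, cv=10)
print(f"Linear model R2: {cv_linear['test_score'].mean():.3f}")
print(f"Tree model R2:   {cv_tree['test_score'].mean():.3f}")
```

Comparing the two mean test scores answers the question below.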

Question

Is the decision tree model better in terms of \(R^2\) score than the linear regression?

a) Yes

b) No

*Select a single answer*

Instead of using the default parameters for the decision tree regressor, we will
optimize the depth of the tree. Using a grid-search
(`sklearn.model_selection.GridSearchCV`) with a 10-fold cross-validation,
answer the questions below. Vary the `max_depth` from 1 level up to 15 levels.
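The grid-search could be set up as follows. Again, the synthetic dataframe is only a placeholder to make the snippet self-contained; use the real `data_numerical` and `target` instead.

```
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in; replace with data_numerical / target from the dataset.
rng = np.random.RandomState(0)
data_numerical = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
target = pd.Series(data_numerical["b"] * 3 + rng.normal(size=200))

tree_model = make_pipeline(SimpleImputer(), DecisionTreeRegressor(random_state=0))
# In a pipeline, the parameter name is prefixed by the step name.
param_grid = {"decisiontreeregressor__max_depth": np.arange(1, 16)}
tree_search = GridSearchCV(tree_model, param_grid=param_grid, cv=10)
tree_search.fit(data_numerical, target)
print(tree_search.best_params_)
print(f"Best CV R2: {tree_search.best_score_:.3f}")
```

`tree_search.best_params_` gives the optimal depth and `tree_search.best_score_` the corresponding mean cross-validated R2.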

Question

What is the optimal tree depth for the current problem?

a) The optimal depth is ranging from 3 to 5

b) The optimal depth is ranging from 5 to 8

c) The optimal depth is ranging from 8 to 11

d) The optimal depth is ranging from 11 to 15

*Select a single answer*

Question

A tree with an optimal depth has a score of:

a) ~0.74 and is better than the linear model

b) ~0.72 and is equal to the linear model

c) ~0.7 and is worse than the linear model

*Select a single answer*

Instead of using only the numerical dataset, you will now use the entire
dataset available in the variable `data`.

Create a preprocessor by dealing separately with the numerical and categorical columns. For the sake of simplicity, we will assume the following:

- categorical columns can be selected if they have an `object` data type;
- numerical columns can be selected if they do not have an `object` data type; they are the complement of the categorical columns.

**Do not optimize the `max_depth` parameter for this exercise.**

**Fix the random state of the tree by passing the parameter `random_state=0`.**
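A preprocessor of this kind can be sketched with `ColumnTransformer` and `make_column_selector`. The tiny mixed-type dataframe below is a stand-in so the snippet runs on its own; the encoder choice (`OrdinalEncoder` is usually sufficient for a tree-based model) is an assumption, not a prescription.

```
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeRegressor

# Synthetic mixed-type stand-in; replace with the full `data` and `target`.
rng = np.random.RandomState(0)
data = pd.DataFrame({
    "num": rng.normal(size=200),
    "cat": rng.choice(["a", "b", "c"], size=200),
})
target = pd.Series(data["num"] * 2 + (data["cat"] == "a") + rng.normal(size=200))

# Select columns by dtype: object -> categorical, everything else -> numerical.
categorical_selector = make_column_selector(dtype_include=object)
numerical_selector = make_column_selector(dtype_exclude=object)

preprocessor = ColumnTransformer([
    ("categorical",
     OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
     categorical_selector),
    ("numerical", SimpleImputer(), numerical_selector),
])
model = make_pipeline(preprocessor, DecisionTreeRegressor(random_state=0))
cv_results = cross_validate(model, data, target, cv=10)
print(f"Mean R2: {cv_results['test_score'].mean():.3f}")
```

The mean test score of this pipeline is what the question below asks you to compare with the earlier numerical-only tree.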

Question

Is the performance in terms of \(R^2\) better when incorporating the categorical features, in comparison with the previous tree with the optimal depth?

a) No, the generalization performance is the same: ~0.7

b) The generalization performance is slightly better: ~0.72

c) The generalization performance is better: ~0.74

*Select a single answer*