# Wrap-up quiz

**This quiz requires some programming to be answered.**

Open the dataset `house_prices.csv` with the following command:

```
ames_housing = pd.read_csv("../datasets/house_prices.csv", na_values="?")
target_name = "SalePrice"
data = ames_housing.drop(columns=target_name)
target = ames_housing[target_name]
```

`ames_housing` is a pandas dataframe. The column "SalePrice" contains the
target variable. Note that we instructed pandas to treat the character "?" as a
marker for cells with missing values, also known as "null" values.

To simplify this exercise, we will only use the numerical features defined below:

```
numerical_features = [
"LotFrontage", "LotArea", "MasVnrArea", "BsmtFinSF1", "BsmtFinSF2",
"BsmtUnfSF", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "LowQualFinSF",
"GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "TotRmsAbvGrd", "Fireplaces",
"GarageCars", "GarageArea", "WoodDeckSF", "OpenPorchSF", "EnclosedPorch",
"3SsnPorch", "ScreenPorch", "PoolArea", "MiscVal",
]
data_numerical = data[numerical_features]
```

Start by fitting a linear regression model
(`sklearn.linear_model.LinearRegression`).
Use a 10-fold cross-validation and pass the argument `return_estimator=True` to
`sklearn.model_selection.cross_validate` to access the estimator fitted on
each fold. As we saw in the previous notebooks, you will have to use a
`sklearn.preprocessing.StandardScaler` to scale the data before passing it to
the regressor. Also, some missing data are present in the different columns.
You can use a `sklearn.impute.SimpleImputer` with the default parameters to
impute missing data. Thus, you can create a model that will **pipeline the
scaler, followed by the imputer, followed by the linear regression**.
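The steps above can be sketched as follows. This is a minimal sketch: it uses
small synthetic data (with a few injected missing values) in place of the
Ames housing dataset, so the feature names and sizes here are placeholders.

```python
# Sketch: scaler -> imputer -> linear regression, evaluated with 10-fold
# cross-validation while keeping the fitted estimator of each fold.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for data_numerical / target.
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f0", "f1", "f2"])
X.iloc[::10, 0] = np.nan  # inject some missing values
y = pd.Series(rng.normal(size=100))

model = make_pipeline(StandardScaler(), SimpleImputer(), LinearRegression())
cv_results = cross_validate(model, X, y, cv=10, return_estimator=True)

# One fitted pipeline per fold; the regressor's weights live in coef_.
coefs = pd.DataFrame(
    [est[-1].coef_ for est in cv_results["estimator"]], columns=X.columns
)
print(coefs.abs().max().max())  # magnitude of the extremum weight
```

On the real dataset, replace `X`/`y` with `data_numerical`/`target`; the
resulting `coefs` dataframe is what the questions below ask you to inspect.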

Question

What is the order of magnitude of the extremum weight values over all the features?

a) 1e4

b) 1e6

c) 1e18

*Select a single answer*

Repeat the same experiment by fitting a ridge regressor
(`sklearn.linear_model.Ridge`) with the default parameters.

Question

What is the order of magnitude of the extremum weight values over all features?

a) 1e4

b) 1e6

c) 1e18

*Select a single answer*

Question

What are the two most important features used by the ridge regressor? You can make a box-plot of the coefficients across all folds to get a good insight.

a) `"MiscVal"` and `"BsmtFinSF1"`

b) `"GarageCars"` and `"GrLivArea"`

c) `"TotalBsmtSF"` and `"GarageCars"`

*Select a single answer*

Remove the feature `"GarageArea"` from the dataset and repeat the previous
experiment.

Question

What is the impact on the weights of removing `"GarageArea"` from the dataset?

a) None

b) It completely changes the order of the feature importances

c) The variability of the most important feature is reduced

*Select a single answer*

Question

What is the reason for observing the previous impact on the most important weight?

a) Both features are correlated and are carrying similar information

b) Removing a feature reduces the noise in the dataset

c) Just some random effects

*Select a single answer*

Now, we will search for the regularization strength that will maximize the
statistical performance of our predictive model. Fit a
`sklearn.linear_model.RidgeCV` instead of a `Ridge` regressor and pass
`alphas=np.logspace(-1, 3, num=30)` to explore the effect of changing the
regularization strength.

Question

Are there major differences regarding the most important weights?

a) Yes, the weights order is completely different

b) No, the weights order is very similar

*Select a single answer*

Check the attribute `alpha_` (the selected regularization strength) for the
different ridge regressors obtained on each fold.
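A minimal sketch of this step, again on small synthetic data in place of the
housing dataset: the pipeline is unchanged except that `RidgeCV` replaces the
final regressor, and each fold's selected `alpha_` is read back from the
fitted estimators.

```python
# Sketch: RidgeCV in the pipeline, then reading alpha_ on each fold.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f0", "f1", "f2"])
y = pd.Series(2 * X["f0"] + rng.normal(size=100))

alphas = np.logspace(-1, 3, num=30)
model = make_pipeline(StandardScaler(), SimpleImputer(), RidgeCV(alphas=alphas))
cv_results = cross_validate(model, X, y, cv=10, return_estimator=True)

# The regularization strength chosen on each fold.
best_alphas = [est[-1].alpha_ for est in cv_results["estimator"]]
print(best_alphas)
```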

Question

In which range does `alpha_` fall for most folds?

a) between 0.1 and 1

b) between 1 and 10

c) between 10 and 100

d) between 100 and 1000

*Select a single answer*

Now, we will tackle a classification problem instead of a regression problem.
Load the Adult Census dataset with the following snippet of code and we will
work only with **numerical features**.

```
adult_census = pd.read_csv("../datasets/adult-census.csv")
target = adult_census["class"]
data = adult_census.select_dtypes(["integer", "floating"])
data = data.drop(columns=["education-num"])
```

Question

How many numerical features are present in the dataset contained in the
variable `data`?

a) 3

b) 4

c) 5

*Select a single answer*

Question

Are there any missing values in the dataset contained in the variable `data`?

a) Yes

b) No

*Select a single answer*

Hint: you can use `df.info()` to get information regarding each column.

Fit a `sklearn.linear_model.LogisticRegression` classifier using a 10-fold
cross-validation to assess the performance. Since we are dealing with a linear
model, do not forget to scale the data with a `StandardScaler` before training
the model.
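As a minimal sketch of the comparison asked about below, assuming synthetic
binary data in place of the census dataset: evaluate the scaled logistic
regression and a `DummyClassifier` predicting the most frequent class under
the same 10-fold cross-validation, then compare mean test scores.

```python
# Sketch: scaled logistic regression vs. a most-frequent dummy baseline.
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
# The target depends mostly on column "a", so the model has signal to learn.
y = (X["a"] + 0.5 * rng.normal(size=200) > 0).astype(int)

log_reg = make_pipeline(StandardScaler(), LogisticRegression())
dummy = DummyClassifier(strategy="most_frequent")

lr_scores = cross_validate(log_reg, X, y, cv=10)["test_score"]
dummy_scores = cross_validate(dummy, X, y, cv=10)["test_score"]
print(lr_scores.mean() - dummy_scores.mean())
```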

Question

On average, how much better/worse/similar is the logistic regression to a dummy classifier that would predict the most frequent class?

a) Worse than a dummy classifier by ~4%

b) Similar to a dummy classifier

c) Better than a dummy classifier by ~4%

*Select a single answer*

Question

What is the most important feature seen by the logistic regression?

a) `"age"`

b) `"capital-gain"`

c) `"capital-loss"`

d) `"hours-per-week"`

*Select a single answer*

Now, we will work with **both numerical and categorical features**. You can
load Adult Census with the following snippet:

```
adult_census = pd.read_csv("../datasets/adult-census.csv")
target = adult_census["class"]
data = adult_census.drop(columns=["class", "education-num"])
```

Question

Are there missing values in this dataset?

a) Yes

b) No

*Select a single answer*

Hint: you can use `df.info()` to get information regarding each column.

Create a predictive model where the categorical data should be one-hot encoded, the numerical data should be scaled, and the predictor used should be a logistic regression classifier.
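One way to assemble such a model is with a column transformer that routes
categorical columns to a one-hot encoder and numerical columns to a scaler.
This is a sketch on small synthetic data with placeholder column names, not
the census dataset itself.

```python
# Sketch: one-hot encode categorical columns, scale numerical ones,
# then feed everything to a logistic regression.
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.RandomState(0)
data = pd.DataFrame({
    "num": rng.normal(size=50),
    "cat": rng.choice(["a", "b", "c"], size=50),
})
target = rng.choice(["<=50K", ">50K"], size=50)

preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"),
     make_column_selector(dtype_include=object)),
    (StandardScaler(), make_column_selector(dtype_include="number")),
)
model = make_pipeline(preprocessor, LogisticRegression())
model.fit(data, target)
pred = model.predict(data)
```

On the real dataset, the same `make_column_selector` calls pick out the
categorical and numerical columns automatically from their dtypes.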

Question

On average, what is the improvement of using the categorical features?

a) It gives similar results

b) It improves the statistical performance by 2.5%

c) It improves the statistical performance by 5%

d) It improves the statistical performance by 7.5%

e) It improves the statistical performance by 10%

*Select a single answer*

For the following questions, you can use the snippet below to get the feature
names after the preprocessing has been performed:

```
preprocessor.fit(data)
feature_names = (
    preprocessor.named_transformers_["onehotencoder"]
    .get_feature_names(categorical_columns)
).tolist()
feature_names += numerical_columns
```

There are as many feature names as coefficients in the last step of your predictive pipeline.

Question

What are the two most important features used by the logistic regressor?

a) `"hours-per-week"` and `"native-country_Columbia"`

b) `"workclass_?"` and `"native-country_?"`

c) `"capital-gain"` and `"education_Doctorate"`

*Select a single answer*

Question

What is the effect of decreasing the `C` parameter on the coefficients?

a) shrinking the magnitude of the weights towards zero

b) increasing the magnitude of the weights

c) reducing the weights' variance

d) increasing the weights' variance

e) it has no influence on the weights' variance

*Select several answers*