🏁 Wrap-up quiz

This quiz requires some programming to be answered.

Open the dataset house_prices.csv with the following command:

import pandas as pd
ames_housing = pd.read_csv("../datasets/house_prices.csv", na_values="?")
ames_housing = ames_housing.drop(columns="Id")

target_name = "SalePrice"
data, target = ames_housing.drop(columns=target_name), ames_housing[target_name]
target = (target > 200_000).astype(int)

ames_housing is a pandas dataframe. The column “SalePrice” contains the target variable. Note that we instructed pandas to treat the character “?” as a marker for cells with missing values, also known as “null” values.
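To see what the na_values option does in isolation, here is a minimal sketch on a made-up two-row CSV (the content of csv_text is an assumption for illustration, not an excerpt of house_prices.csv):

```python
import io

import pandas as pd

# Hypothetical two-row CSV, used only to illustrate the na_values option.
csv_text = "Id,LotFrontage,Alley\n1,65,?\n2,?,Grvl\n"

toy = pd.read_csv(io.StringIO(csv_text), na_values="?")

# Each "?" cell has been parsed as a missing value (NaN).
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

The same isna().sum() call on the real dataframe gives the number of missing cells per column.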

Furthermore, we ignore the column named “Id” because unique identifiers are usually useless in the context of predictive modeling.

We have not encountered any regression problem yet. Therefore, we will convert the regression target into a classification target to predict whether or not a house is expensive. “Expensive” is defined as a sale price greater than $200,000.


Use the data.info() and data.head() commands to examine the columns of the dataframe. The dataset contains:

  • a) numerical features

  • b) categorical features

  • c) missing data

Select several answers


How many features are available to predict whether or not a house is expensive?

  • a) 79

  • b) 80

  • c) 81

Select a single answer


How many features are represented with numbers?

  • a) 0

  • b) 36

  • c) 42

  • d) 79

Select a single answer

Hint: you can use the method df.select_dtypes or the function sklearn.compose.make_column_selector as shown in a previous notebook.
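As a reminder of how the two options in the hint behave, here is a small sketch on a made-up dataframe (the frame toy and its column names are assumptions for illustration only):

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical toy frame; column names are for illustration only.
toy = pd.DataFrame({
    "area": [850.0, 1200.0, 640.0],
    "rooms": [3, 4, 2],
    "street": ["Pave", "Grvl", "Pave"],
})

# Option 1: pandas keeps only the columns stored with a numeric dtype.
numeric_cols = toy.select_dtypes(include=np.number).columns.tolist()

# Option 2: a scikit-learn selector; calling it on a dataframe
# returns the list of matching column names.
selector = make_column_selector(dtype_include=np.number)
selected = selector(toy)

print(numeric_cols, selected)
```

Both approaches only look at how a column is stored. They cannot tell whether a number encodes a quantity or a category.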

Refer to the dataset description for the meaning of each column.


Among the following columns, which columns express a quantitative numerical value (excluding ordinal categories)?

  • a) “LotFrontage”

  • b) “LotArea”

  • c) “OverallQual”

  • d) “OverallCond”

  • e) “YearBuilt”

Select several answers

We consider the following numerical columns:

numerical_features = [
  "LotFrontage", "LotArea", "MasVnrArea", "BsmtFinSF1", "BsmtFinSF2",
  "BsmtUnfSF", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "LowQualFinSF",
  "GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "TotRmsAbvGrd", "Fireplaces",
  "GarageCars", "GarageArea", "WoodDeckSF", "OpenPorchSF", "EnclosedPorch",
  "3SsnPorch", "ScreenPorch", "PoolArea", "MiscVal",
]

Now create a predictive model that uses these numerical columns as input data. Your predictive model should be a pipeline composed of a standard scaler, a mean imputer (cf. sklearn.impute.SimpleImputer(strategy="mean")) and a sklearn.linear_model.LogisticRegression.
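The pipeline described above could be assembled and cross-validated roughly as follows. This is a sketch on synthetic stand-in data (X_toy, y_toy and numerical_model are names introduced here for illustration), so the printed score says nothing about the housing dataset:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for this sketch, not the Ames dataset).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)
X_toy[rng.random(X_toy.shape) < 0.1] = np.nan  # inject some missing values

# Same step order as described above: scale, impute with the mean, classify.
numerical_model = make_pipeline(
    StandardScaler(), SimpleImputer(strategy="mean"), LogisticRegression()
)

# 5-fold cross-validation; "test_score" holds one accuracy value per fold.
cv_results = cross_validate(numerical_model, X_toy, y_toy, cv=5)
scores = cv_results["test_score"]
print(f"{scores.mean():.3f} +/- {scores.std():.3f}")
```

For the quiz itself, the same call would presumably take data[numerical_features] and target in place of the synthetic arrays.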


What is the accuracy score obtained by 5-fold cross-validation of this pipeline?

  • a) ~0.5

  • b) ~0.7

  • c) ~0.9

Select a single answer

Instead of solely using the numerical columns, let us build a pipeline that can process both the numerical and categorical features together as follows:

  • numerical features should be processed as previously;

  • the left-out columns should be treated as categorical variables using a sklearn.preprocessing.OneHotEncoder;

  • prior to one-hot encoding, insert the sklearn.impute.SimpleImputer(strategy="most_frequent") transformer to replace missing values by the most-frequent value in each column.

Be aware that you can pass a Pipeline as a transformer in a ColumnTransformer. We give a succinct example where we use a ColumnTransformer to select the numerical columns and process them (i.e. scale and impute). We additionally show that we can create a final model combining this preprocessor with a classifier.

scaler_imputer_transformer = make_pipeline(StandardScaler(), SimpleImputer())
preprocessor = ColumnTransformer(transformers=[
    ("num_preprocessor", scaler_imputer_transformer, numerical_features)
])
model = make_pipeline(preprocessor, LogisticRegression())
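One way to extend this idea with a second, categorical branch is sketched below on a tiny made-up dataframe. The names X_toy, y_toy, num_pipe, cat_pipe, preprocessor_toy and model_toy are hypothetical, and handle_unknown="ignore" is an assumption added so that categories unseen during fit do not raise an error at prediction time:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one numerical and one categorical column.
X_toy = pd.DataFrame({
    "area": [850.0, np.nan, 640.0, 1500.0, 990.0, 1310.0],
    "street": pd.Series(
        ["Pave", "Grvl", np.nan, "Pave", "Grvl", "Pave"], dtype=object
    ),
})
y_toy = np.array([0, 1, 0, 1, 0, 1])

num_cols = ["area"]
cat_cols = ["street"]

# Numerical branch: scale, then impute with the mean.
num_pipe = make_pipeline(StandardScaler(), SimpleImputer(strategy="mean"))
# Categorical branch: impute with the most frequent value, then one-hot encode.
cat_pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),  # assumption, see lead-in
)

preprocessor_toy = ColumnTransformer(transformers=[
    ("num_preprocessor", num_pipe, num_cols),
    ("cat_preprocessor", cat_pipe, cat_cols),
])
model_toy = make_pipeline(preprocessor_toy, LogisticRegression())
model_toy.fit(X_toy, y_toy)
predictions = model_toy.predict(X_toy)
```

Each branch only sees the columns listed for it, so the scaler never receives strings and the encoder never receives the numerical columns.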

Let us now define a substantial improvement or deterioration as an increase or decrease of the mean test score of at least three times the standard deviation of the cross-validated test scores of the model using both numerical and categorical features. Here, the change in mean test score is the difference between the mean test score of the model using only numerical features and that of the model using numerical and categorical features together.
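This rule can be written as a few lines of arithmetic. The per-fold scores below are invented purely to illustrate the computation and are not results from the housing data. The variable names are hypothetical as well:

```python
import numpy as np

# Hypothetical per-fold accuracies, made up purely to illustrate the rule.
scores_numerical_only = np.array([0.80, 0.82, 0.81, 0.79, 0.83])
scores_num_and_cat = np.array([0.86, 0.88, 0.87, 0.85, 0.89])

# Difference of the mean test scores (full model minus numerical-only model).
mean_difference = scores_num_and_cat.mean() - scores_numerical_only.mean()

# Threshold: three standard deviations of the model using both feature kinds.
threshold = 3 * scores_num_and_cat.std()

if mean_difference >= threshold:
    verdict = "improves substantially"
elif mean_difference >= 0:
    verdict = "improves slightly"
elif mean_difference > -threshold:
    verdict = "worsens slightly"
else:
    verdict = "worsens substantially"

print(f"difference={mean_difference:.3f}, threshold={threshold:.3f}: {verdict}")
```

In practice, the two arrays would come from the "test_score" entries returned by cross_validate for each model.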


With this heterogeneous pipeline, the accuracy score:

  • a) worsens substantially

  • b) worsens slightly

  • c) improves slightly

  • d) improves substantially

Select a single answer