The Ames housing dataset

In this notebook, we will quickly present the “Ames housing” dataset. We will see that this dataset is similar to the “California housing” dataset. However, it is more complex to handle: it contains missing data and both numerical and categorical features.

This dataset is located in the datasets directory and is stored in a comma-separated values (CSV) file. As previously mentioned, the dataset contains missing values. The character "?" is used as the missing value marker.

We will open the dataset and specify the missing value marker so that pandas parses these entries as missing values when reading the file.

import pandas as pd

ames_housing = pd.read_csv("../datasets/house_prices.csv", na_values="?")
ames_housing = ames_housing.drop(columns="Id")
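To illustrate what the na_values parameter does, here is a minimal sketch that parses a tiny in-memory CSV instead of the real file. The values in csv_text are made up for the illustration; only the "?" marker and the column names mirror the dataset.

```python
import io

import pandas as pd

# Toy CSV content (made-up values, not taken from house_prices.csv).
csv_text = "Id,LotFrontage,Alley\n1,65.0,?\n2,?,Grvl\n"

# Every "?" cell is converted to a missing value (NaN) while parsing.
toy = pd.read_csv(io.StringIO(csv_text), na_values="?")
toy = toy.drop(columns="Id")

# Count the missing values found in each column.
n_missing = toy.isna().sum()
print(n_missing)
```

Without na_values="?", the LotFrontage column would be read as strings because of the "?" entry, instead of being parsed as floating-point numbers.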

We can have a first look at the available columns in this dataset.

ames_housing.head()

	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	LotConfig	...	PoolQC	Fence	MiscFeature	MoSold	YrSold	SaleType	SaleCondition	SalePrice
0	60	RL	65.0	8450	Pave	NaN	Reg	Lvl	AllPub	Inside	...	NaN	NaN	NaN	2	2008	WD	Normal	208500
1	20	RL	80.0	9600	Pave	NaN	Reg	Lvl	AllPub	FR2	...	NaN	NaN	NaN	5	2007	WD	Normal	181500
2	60	RL	68.0	11250	Pave	NaN	IR1	Lvl	AllPub	Inside	...	NaN	NaN	NaN	9	2008	WD	Normal	223500
3	70	RL	60.0	9550	Pave	NaN	IR1	Lvl	AllPub	Corner	...	NaN	NaN	NaN	2	2006	WD	Abnorml	140000
4	60	RL	84.0	14260	Pave	NaN	IR1	Lvl	AllPub	FR2	...	NaN	NaN	NaN	12	2008	WD	Normal	250000

5 rows × 80 columns

We see that the last column named "SalePrice" is indeed the target that we would like to predict. So we will split our dataset into two variables containing the data and the target.

target_name = "SalePrice"
data, target = (
    ames_housing.drop(columns=target_name),
    ames_housing[target_name],
)

Let’s have a quick look at the target before focusing on the data.

target.head()

0    208500
1    181500
2    223500
3    140000
4    250000
Name: SalePrice, dtype: int64

We see that the target contains continuous values. It corresponds to the price of a house in $. We can have a look at the target distribution.

import matplotlib.pyplot as plt

target.plot.hist(bins=20, edgecolor="black")
plt.xlabel("House price in $")
_ = plt.title("Distribution of the house price \nin Ames")

[Figure: histogram "Distribution of the house price in Ames"; x-axis: house price in $]

We see that the distribution has a long tail. It means that most house prices are roughly normally distributed, but a few houses have a much higher price than is typical. It could be critical to take this peculiarity into account when designing a predictive model.
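One way to quantify a long tail is the skewness of the distribution. The sketch below uses synthetic prices drawn from a log-normal distribution (an assumption made for the illustration, not the Ames data) and shows that a logarithmic transform, one common option among others, makes such a target more symmetric.

```python
import numpy as np
import pandas as pd

# Synthetic long-tailed "prices" (illustrative assumption, not the Ames data).
rng = np.random.default_rng(0)
toy_prices = pd.Series(rng.lognormal(mean=12, sigma=0.4, size=1_000))

# A positive skewness indicates a tail on the right-hand side.
skew_raw = toy_prices.skew()
# Taking the logarithm compresses the largest values.
skew_log = np.log(toy_prices).skew()

print(f"skewness of the raw prices: {skew_raw:.2f}")
print(f"skewness after the log:     {skew_log:.2f}")
```

The same two lines could be applied to target to check how pronounced the tail is in the real dataset.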

Now, we can have a look at the available data that we could use to predict house prices.

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 79 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1460 non-null   int64
 1   MSZoning       1460 non-null   object
 2   LotFrontage    1201 non-null   float64
 3   LotArea        1460 non-null   int64
 4   Street         1460 non-null   object
 5   Alley          91 non-null     object
 6   LotShape       1460 non-null   object
 7   LandContour    1460 non-null   object
 8   Utilities      1460 non-null   object
 9   LotConfig      1460 non-null   object
 10  LandSlope      1460 non-null   object
 11  Neighborhood   1460 non-null   object
 12  Condition1     1460 non-null   object
 13  Condition2     1460 non-null   object
 14  BldgType       1460 non-null   object
 15  HouseStyle     1460 non-null   object
 16  OverallQual    1460 non-null   int64
 17  OverallCond    1460 non-null   int64
 18  YearBuilt      1460 non-null   int64
 19  YearRemodAdd   1460 non-null   int64
 20  RoofStyle      1460 non-null   object
 21  RoofMatl       1460 non-null   object
 22  Exterior1st    1460 non-null   object
 23  Exterior2nd    1460 non-null   object
 24  MasVnrType     588 non-null    object
 25  MasVnrArea     1452 non-null   float64
 26  ExterQual      1460 non-null   object
 27  ExterCond      1460 non-null   object
 28  Foundation     1460 non-null   object
 29  BsmtQual       1423 non-null   object
 30  BsmtCond       1423 non-null   object
 31  BsmtExposure   1422 non-null   object
 32  BsmtFinType1   1423 non-null   object
 33  BsmtFinSF1     1460 non-null   int64
 34  BsmtFinType2   1422 non-null   object
 35  BsmtFinSF2     1460 non-null   int64
 36  BsmtUnfSF      1460 non-null   int64
 37  TotalBsmtSF    1460 non-null   int64
 38  Heating        1460 non-null   object
 39  HeatingQC      1460 non-null   object
 40  CentralAir     1460 non-null   object
 41  Electrical     1459 non-null   object
 42  1stFlrSF       1460 non-null   int64
 43  2ndFlrSF       1460 non-null   int64
 44  LowQualFinSF   1460 non-null   int64
 45  GrLivArea      1460 non-null   int64
 46  BsmtFullBath   1460 non-null   int64
 47  BsmtHalfBath   1460 non-null   int64
 48  FullBath       1460 non-null   int64
 49  HalfBath       1460 non-null   int64
 50  BedroomAbvGr   1460 non-null   int64
 51  KitchenAbvGr   1460 non-null   int64
 52  KitchenQual    1460 non-null   object
 53  TotRmsAbvGrd   1460 non-null   int64
 54  Functional     1460 non-null   object
 55  Fireplaces     1460 non-null   int64
 56  FireplaceQu    770 non-null    object
 57  GarageType     1379 non-null   object
 58  GarageYrBlt    1379 non-null   float64
 59  GarageFinish   1379 non-null   object
 60  GarageCars     1460 non-null   int64
 61  GarageArea     1460 non-null   int64
 62  GarageQual     1379 non-null   object
 63  GarageCond     1379 non-null   object
 64  PavedDrive     1460 non-null   object
 65  WoodDeckSF     1460 non-null   int64
 66  OpenPorchSF    1460 non-null   int64
 67  EnclosedPorch  1460 non-null   int64
 68  3SsnPorch      1460 non-null   int64
 69  ScreenPorch    1460 non-null   int64
 70  PoolArea       1460 non-null   int64
 71  PoolQC         7 non-null      object
 72  Fence          281 non-null    object
 73  MiscFeature    54 non-null     object
 74  MiscVal        1460 non-null   int64
 75  MoSold         1460 non-null   int64
 76  YrSold         1460 non-null   int64
 77  SaleType       1460 non-null   object
 78  SaleCondition  1460 non-null   object
dtypes: float64(3), int64(33), object(43)
memory usage: 901.2+ KB

Looking at the general information about the dataframe, we can see that 79 features are available and that the dataset contains 1460 samples. However, some features contain missing values. Also, the type of data is heterogeneous: both numerical and categorical data are available.
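The non-null counts reported by info() can also be summarized as a fraction of missing values per column. Here is a minimal sketch on a small hand-made dataframe (the values are illustrative and not the real data); the same expression could be applied to data.

```python
import numpy as np
import pandas as pd

# Small hand-made dataframe reusing a few column names of the dataset.
toy = pd.DataFrame(
    {
        "LotArea": [8450, 9600, 11250, 9550],
        "LotFrontage": [65.0, np.nan, 68.0, np.nan],
        "Alley": [np.nan, np.nan, np.nan, "Grvl"],
    }
)

# Fraction of missing values per column, largest first.
missing_fraction = toy.isna().mean().sort_values(ascending=False)
# Keep only the columns that actually contain missing values.
columns_with_missing = missing_fraction[missing_fraction > 0]
print(columns_with_missing)
```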

First, we will have a look at the data represented with numbers.

numerical_data = data.select_dtypes("number")
numerical_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 36 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1460 non-null   int64
 1   LotFrontage    1201 non-null   float64
 2   LotArea        1460 non-null   int64
 3   OverallQual    1460 non-null   int64
 4   OverallCond    1460 non-null   int64
 5   YearBuilt      1460 non-null   int64
 6   YearRemodAdd   1460 non-null   int64
 7   MasVnrArea     1452 non-null   float64
 8   BsmtFinSF1     1460 non-null   int64
 9   BsmtFinSF2     1460 non-null   int64
 10  BsmtUnfSF      1460 non-null   int64
 11  TotalBsmtSF    1460 non-null   int64
 12  1stFlrSF       1460 non-null   int64
 13  2ndFlrSF       1460 non-null   int64
 14  LowQualFinSF   1460 non-null   int64
 15  GrLivArea      1460 non-null   int64
 16  BsmtFullBath   1460 non-null   int64
 17  BsmtHalfBath   1460 non-null   int64
 18  FullBath       1460 non-null   int64
 19  HalfBath       1460 non-null   int64
 20  BedroomAbvGr   1460 non-null   int64
 21  KitchenAbvGr   1460 non-null   int64
 22  TotRmsAbvGrd   1460 non-null   int64
 23  Fireplaces     1460 non-null   int64
 24  GarageYrBlt    1379 non-null   float64
 25  GarageCars     1460 non-null   int64
 26  GarageArea     1460 non-null   int64
 27  WoodDeckSF     1460 non-null   int64
 28  OpenPorchSF    1460 non-null   int64
 29  EnclosedPorch  1460 non-null   int64
 30  3SsnPorch      1460 non-null   int64
 31  ScreenPorch    1460 non-null   int64
 32  PoolArea       1460 non-null   int64
 33  MiscVal        1460 non-null   int64
 34  MoSold         1460 non-null   int64
 35  YrSold         1460 non-null   int64
dtypes: float64(3), int64(33)
memory usage: 410.8 KB

We see that the data are mainly represented with integers. Let’s have a look at the histograms of all these features.

numerical_data.hist(
    bins=20, figsize=(12, 22), edgecolor="black", layout=(9, 4)
)
plt.subplots_adjust(hspace=0.8, wspace=0.8)

[Figure: grid of histograms, one per numerical feature]

We see that some features have high peaks at 0. This could be because the value 0 was assigned when the criterion did not apply, for instance the area of the swimming pool when the house has no swimming pool.

We also have some features encoding dates (for instance the year).

This information is useful and should also be considered when designing a predictive model.
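The peaks at 0 seen on the histograms can be checked numerically by computing the fraction of zeros in each column. Below is a minimal sketch on made-up values (the column names are borrowed from the dataset, the numbers are not); the same expression could be applied to numerical_data.

```python
import pandas as pd

# Made-up values: most toy houses have no swimming pool.
toy_numerical = pd.DataFrame(
    {
        "PoolArea": [0, 0, 0, 512],
        "GrLivArea": [1710, 1262, 1786, 1717],
    }
)

# Comparing to 0 gives booleans; their mean is the fraction of zeros.
zero_fraction = (toy_numerical == 0).mean()
print(zero_fraction.sort_values(ascending=False))
```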

Now, let’s have a look at the data encoded with strings.

string_data = data.select_dtypes(object)
string_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 43 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   MSZoning       1460 non-null   object
 1   Street         1460 non-null   object
 2   Alley          91 non-null     object
 3   LotShape       1460 non-null   object
 4   LandContour    1460 non-null   object
 5   Utilities      1460 non-null   object
 6   LotConfig      1460 non-null   object
 7   LandSlope      1460 non-null   object
 8   Neighborhood   1460 non-null   object
 9   Condition1     1460 non-null   object
 10  Condition2     1460 non-null   object
 11  BldgType       1460 non-null   object
 12  HouseStyle     1460 non-null   object
 13  RoofStyle      1460 non-null   object
 14  RoofMatl       1460 non-null   object
 15  Exterior1st    1460 non-null   object
 16  Exterior2nd    1460 non-null   object
 17  MasVnrType     588 non-null    object
 18  ExterQual      1460 non-null   object
 19  ExterCond      1460 non-null   object
 20  Foundation     1460 non-null   object
 21  BsmtQual       1423 non-null   object
 22  BsmtCond       1423 non-null   object
 23  BsmtExposure   1422 non-null   object
 24  BsmtFinType1   1423 non-null   object
 25  BsmtFinType2   1422 non-null   object
 26  Heating        1460 non-null   object
 27  HeatingQC      1460 non-null   object
 28  CentralAir     1460 non-null   object
 29  Electrical     1459 non-null   object
 30  KitchenQual    1460 non-null   object
 31  Functional     1460 non-null   object
 32  FireplaceQu    770 non-null    object
 33  GarageType     1379 non-null   object
 34  GarageFinish   1379 non-null   object
 35  GarageQual     1379 non-null   object
 36  GarageCond     1379 non-null   object
 37  PavedDrive     1460 non-null   object
 38  PoolQC         7 non-null      object
 39  Fence          281 non-null    object
 40  MiscFeature    54 non-null     object
 41  SaleType       1460 non-null   object
 42  SaleCondition  1460 non-null   object
dtypes: object(43)
memory usage: 490.6+ KB

These features are categorical. We can make some bar plots to see the category counts for each feature.

from math import ceil
from itertools import zip_longest

n_string_features = string_data.shape[1]
nrows, ncols = ceil(n_string_features / 4), 4

fig, axs = plt.subplots(ncols=ncols, nrows=nrows, figsize=(14, 80))

for feature_name, ax in zip_longest(string_data, axs.ravel()):
    if feature_name is None:
        # do not show the axis
        ax.axis("off")
        continue

    string_data[feature_name].value_counts().plot.barh(ax=ax)
    ax.set_title(feature_name)

plt.subplots_adjust(hspace=0.2, wspace=0.8)

[Figure: grid of horizontal bar plots showing the category counts of each string feature]

Plotting this information allows us to answer two questions:

Are there few or many categories for a given feature?
Are there rare categories for some features?

Knowing about these peculiarities helps when designing the predictive pipeline.
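Both questions can also be answered numerically. The sketch below counts the categories of each feature and lists those whose relative frequency falls below a threshold. The toy values and the 15% threshold are arbitrary choices made for the illustration; the same loop could be run on string_data.

```python
import pandas as pd

# Made-up categorical values reusing two column names of the dataset.
toy_strings = pd.DataFrame(
    {
        "Street": ["Pave"] * 9 + ["Grvl"],
        "LotShape": [
            "Reg", "IR1", "Reg", "IR1", "Reg",
            "IR2", "Reg", "IR1", "Reg", "Reg",
        ],
    }
)

# Question 1: how many distinct categories does each feature have?
n_categories = toy_strings.nunique()

# Question 2: which categories are rare? The threshold is arbitrary.
rare_threshold = 0.15
rare_categories = {}
for name in toy_strings.columns:
    frequencies = toy_strings[name].value_counts(normalize=True)
    rare_categories[name] = frequencies[
        frequencies < rare_threshold
    ].index.tolist()

print(n_categories)
print(rare_categories)
```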

Note

In order to keep the content of the course simple and didactic, we created a version of this dataset without missing values.

ames_housing_no_missing = pd.read_csv(
    "../datasets/ames_housing_no_missing.csv"
)
ames_housing_no_missing.head()

	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	LotConfig	...	PoolQC	Fence	MiscFeature	MoSold	YrSold	SaleType	SaleCondition	SalePrice
0	60	RL	65.0	8450	Pave	Grvl	Reg	Lvl	AllPub	Inside	...	Gd	MnPrv	Shed	2	2008	WD	Normal	208500
1	20	RL	80.0	9600	Pave	Grvl	Reg	Lvl	AllPub	FR2	...	Gd	MnPrv	Shed	5	2007	WD	Normal	181500
2	60	RL	68.0	11250	Pave	Grvl	IR1	Lvl	AllPub	Inside	...	Gd	MnPrv	Shed	9	2008	WD	Normal	223500
3	70	RL	60.0	9550	Pave	Grvl	IR1	Lvl	AllPub	Corner	...	Gd	MnPrv	Shed	2	2006	WD	Abnorml	140000
4	60	RL	84.0	14260	Pave	Grvl	IR1	Lvl	AllPub	FR2	...	Gd	MnPrv	Shed	12	2008	WD	Normal	250000

5 rows × 80 columns

It contains the same information as the original dataset after using a sklearn.impute.SimpleImputer to replace missing values using the mean along each numerical column (including the target), and the most frequent value along each categorical column.

from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer

numerical_features = [
    "LotFrontage",
    "LotArea",
    "MasVnrArea",
    "BsmtFinSF1",
    "BsmtFinSF2",
    "BsmtUnfSF",
    "TotalBsmtSF",
    "1stFlrSF",
    "2ndFlrSF",
    "LowQualFinSF",
    "GrLivArea",
    "BedroomAbvGr",
    "KitchenAbvGr",
    "TotRmsAbvGrd",
    "Fireplaces",
    "GarageCars",
    "GarageArea",
    "WoodDeckSF",
    "OpenPorchSF",
    "EnclosedPorch",
    "3SsnPorch",
    "ScreenPorch",
    "PoolArea",
    "MiscVal",
    target_name,
]
categorical_features = data.columns.difference(numerical_features)

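# Columns treated as categorical get their most frequent value; numerical
# columns (including the target) get their mean.
most_frequent_imputer = SimpleImputer(strategy="most_frequent")
mean_imputer = SimpleImputer(strategy="mean")

preprocessor = make_column_transformer(
    (most_frequent_imputer, categorical_features),
    (mean_imputer, numerical_features),
)
# The transformer outputs the categorical columns first, then the numerical
# ones, so the column names are given in that order.
ames_housing_preprocessed = pd.DataFrame(
    preprocessor.fit_transform(ames_housing),
    columns=categorical_features.tolist() + numerical_features,
)
# Restore the original column order and dtypes before comparing.
ames_housing_preprocessed = ames_housing_preprocessed[ames_housing.columns]
ames_housing_preprocessed = ames_housing_preprocessed.astype(
    ames_housing.dtypes
)
# Check, column by column, that both versions of the dataset are identical.
(ames_housing_no_missing == ames_housing_preprocessed).all()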

MSSubClass       True
MSZoning         True
LotFrontage      True
LotArea          True
Street           True
                 ... 
MoSold           True
YrSold           True
SaleType         True
SaleCondition    True
SalePrice        True
Length: 80, dtype: bool
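To make the two imputation strategies more concrete, here is a minimal sketch on a toy dataframe with made-up values. It only illustrates what strategy="mean" and strategy="most_frequent" compute; it is not the code that produced ames_housing_no_missing.csv.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with one missing value per column (made-up values).
toy = pd.DataFrame(
    {
        "LotFrontage": [60.0, np.nan, 80.0],
        "Alley": pd.Series(["Grvl", np.nan, "Grvl"], dtype=object),
    }
)

# The missing number is replaced by the mean of the observed values.
mean_imputed = SimpleImputer(strategy="mean").fit_transform(
    toy[["LotFrontage"]]
)
# The missing string is replaced by the most frequent category.
mode_imputed = SimpleImputer(strategy="most_frequent").fit_transform(
    toy[["Alley"]]
)

print(mean_imputed.ravel())
print(mode_imputed.ravel())
```

Note that the mean of integer-valued columns is generally not an integer, which is why the comparison above first casts the imputed columns back to the original dtypes.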