The blood transfusion dataset
In this notebook, we present the "blood transfusion" dataset. This
dataset is available locally in the datasets directory and is stored as a
comma-separated values (CSV) file. We start by loading the entire dataset.
import pandas as pd
blood_transfusion = pd.read_csv("../datasets/blood_transfusion.csv")
We can have a first look at the loaded dataset.
blood_transfusion.head()
| | Recency | Frequency | Monetary | Time | Class |
|---|---|---|---|---|---|
| 0 | 2 | 50 | 12500 | 98 | donated |
| 1 | 0 | 13 | 3250 | 28 | donated |
| 2 | 1 | 16 | 4000 | 35 | donated |
| 3 | 2 | 20 | 5000 | 45 | donated |
| 4 | 1 | 24 | 6000 | 77 | not donated |
In this dataframe, the last column, "Class", corresponds to the target to be
predicted. We create two variables, data and target, to separate the data
from which we could learn a predictive model from the target that should be
predicted.
data = blood_transfusion.drop(columns="Class")
target = blood_transfusion["Class"]
Let's have a first look at the data variable.
data.head()
| | Recency | Frequency | Monetary | Time |
|---|---|---|---|---|
| 0 | 2 | 50 | 12500 | 98 |
| 1 | 0 | 13 | 3250 | 28 |
| 2 | 1 | 16 | 4000 | 35 |
| 3 | 2 | 20 | 5000 | 45 |
| 4 | 1 | 24 | 6000 | 77 |
We observe four columns. Each record corresponds to a person who intended to give blood. The information stored in each column is:
- Recency: the time in months since the last time a person intended to give blood;
- Frequency: the number of times a person intended to give blood in the past;
- Monetary: the amount of blood given in the past (in cm³);
- Time: the time in months since the first time a person intended to give blood.
Now, let's look at the type of data stored in these columns and check whether any values are missing from our dataset.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 748 entries, 0 to 747
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Recency 748 non-null int64
1 Frequency 748 non-null int64
2 Monetary 748 non-null int64
3 Time 748 non-null int64
dtypes: int64(4)
memory usage: 23.5 KB
Our dataset is made of 748 samples. All features are represented with integer numbers and there are no missing values. We can look at the distribution of each feature.
_ = data.hist(figsize=(12, 10), bins=30, edgecolor="black")
Nothing is surprising in these distributions. We only observe a wide range
of values for the features "Recency", "Frequency", and "Monetary", which
means that a few samples take extremely high values for these features.
Now, let's have a look at the target that we would like to predict for this task.
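One way to put numbers on these long tails is to compare each feature's median and an upper percentile with its maximum. The sketch below is only illustrative: the helper name `tail_summary` is hypothetical, and the stand-in dataframe contains just the five rows displayed by `data.head()` above, so its values say nothing about the full dataset.

```python
import pandas as pd

# Stand-in for `data`: only the five rows displayed by data.head() above.
sample = pd.DataFrame(
    {
        "Recency": [2, 0, 1, 2, 1],
        "Frequency": [50, 13, 16, 20, 24],
        "Monetary": [12500, 3250, 4000, 5000, 6000],
        "Time": [98, 28, 35, 45, 77],
    }
)


def tail_summary(df):
    """Return the median, 95th percentile and maximum of each column."""
    # quantile() yields one row per requested quantile; transpose so that
    # each feature becomes a row.
    summary = df.quantile([0.5, 0.95]).T
    summary.columns = ["median", "p95"]
    summary["max"] = df.max()
    return summary


summary = tail_summary(sample)
print(summary)
```

On the full dataset, `tail_summary(data)` would be the call to make: a maximum far above the 95th percentile points to a handful of extreme values.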
target.head()
0 donated
1 donated
2 donated
3 donated
4 not donated
Name: Class, dtype: object
import matplotlib.pyplot as plt
target.value_counts(normalize=True).plot.barh()
plt.xlabel("Proportion of samples")
_ = plt.title("Class distribution")
We see that the target is discrete and contains two categories: whether a
person "donated" or "not donated" blood. Thus the task to be solved is a
classification problem. We should note that the two classes do not contain
the same number of samples.
target.value_counts(normalize=True)
Class
not donated 0.762032
donated 0.237968
Name: proportion, dtype: float64
Indeed, ~76% of the samples belong to the class "not donated". This matters:
a classifier that always predicts the "not donated" class would reach an
accuracy of 76% without using any information from the data itself. This
issue is known as class imbalance. One should therefore choose with care both
the metric used to evaluate the generalization performance of a model and
the predictive model itself.
Now, let's run a naive analysis to see if there is a link between the features and the target, using a pair plot representation.
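To make the effect of class imbalance concrete, here is a minimal sketch of a constant "majority class" predictor. It does not load the CSV file: `toy_target` is a stand-in rebuilt from the class counts implied by the proportions above (748 samples, hence 570 "not donated" and 178 "donated"). Balanced accuracy is used here as one possible imbalance-aware metric; it is an illustrative choice, not one made in the notebook.

```python
import pandas as pd

# Stand-in target rebuilt from the proportions reported above:
# 748 samples, 570 "not donated" (~76.2%) and 178 "donated" (~23.8%).
toy_target = pd.Series(["not donated"] * 570 + ["donated"] * 178, name="Class")

# A constant predictor: always output the most frequent class.
majority_class = toy_target.value_counts().idxmax()
predictions = pd.Series(majority_class, index=toy_target.index)

# Accuracy: fraction of samples predicted correctly.
is_correct = predictions == toy_target
accuracy = is_correct.mean()

# Balanced accuracy: average of the per-class recalls, so that each class
# weighs the same regardless of how many samples it contains.
recall_per_class = is_correct.groupby(toy_target).mean()
balanced_accuracy = recall_per_class.mean()

print(f"Accuracy: {accuracy:.3f}")
print(f"Balanced accuracy: {balanced_accuracy:.3f}")
```

The accuracy is about 0.76 even though the predictor never identifies a single donor, whereas the balanced accuracy drops to 0.5. scikit-learn offers the same ingredients through `DummyClassifier(strategy="most_frequent")` and `balanced_accuracy_score`.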
import seaborn as sns
_ = sns.pairplot(blood_transfusion, hue="Class")
Looking at the diagonal plots, we don't see any feature that could
individually help to separate the two classes. When looking at pairs of
features, we don't see any striking combination either. However, we can note
that the "Monetary" and "Frequency" features are perfectly correlated: all
the data points are aligned on a diagonal.
In conclusion, this is a challenging dataset: it suffers from class imbalance; two of its four features are perfectly correlated, so very few informative features are available to learn a model; and none of the feature combinations we inspected visibly helps to separate the classes.
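The alignment seen in the pair plot can also be checked numerically. The sketch below uses only the five rows displayed earlier as a stand-in (`sample_rows` is a hypothetical name), so it illustrates the check rather than proving anything about all 748 samples; running the same two lines on the full data dataframe is needed to confirm it.

```python
import pandas as pd

# Stand-in: the "Frequency" and "Monetary" values of the five rows
# displayed by data.head() earlier in the notebook.
sample_rows = pd.DataFrame(
    {
        "Frequency": [50, 13, 16, 20, 24],
        "Monetary": [12500, 3250, 4000, 5000, 6000],
    }
)

# Pearson correlation between the two features.
correlation = sample_rows["Frequency"].corr(sample_rows["Monetary"])

# If one feature is a constant multiple of the other, the ratio takes a
# single value across all rows.
ratio = sample_rows["Monetary"] / sample_rows["Frequency"]

print(f"Correlation: {correlation:.3f}")
print(f"Distinct Monetary / Frequency ratios: {ratio.unique()}")
```

In these five rows, "Monetary" is always 250 times "Frequency", which suggests a fixed volume per donation. If that holds for the whole dataset, one of the two columns carries no additional information for a model.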