Wrap-up quiz 7
This quiz requires some programming to be answered.
Open the dataset `bike_rides.csv` with the following commands:

```python
import pandas as pd

cycling = pd.read_csv(
    "../datasets/bike_rides.csv", index_col=0, parse_dates=True
)
cycling.index.name = ""
target_name = "power"
data, target = cycling.drop(columns=target_name), cycling[target_name]
data
```
A detailed description of this dataset is given in the appendix. As a reminder, the problem we are trying to solve with this dataset is to use measurements from cheap sensors (GPS, heart-rate monitor, etc.) to predict a cyclist's power. Power can indeed be recorded via a cycling power meter device, but this device is rather expensive.
Instead of blindly applying machine learning, we first introduce some flavor of classical mechanics: Newton's second law.
\(P_{meca} = \left(\frac{1}{2} \rho \cdot S C_x \cdot V_{a}^{2} + C_r \cdot mg \cdot \cos \alpha + mg \cdot \sin \alpha + ma\right) V_d\)
where \(\rho\) is the air density in kg.m\(^{-3}\), \(S\) is the frontal surface area of the cyclist in m\(^{2}\), \(C_x\) is the drag coefficient, \(V_a\) is the air speed in m.s\(^{-1}\), \(C_r\) is the rolling coefficient, \(m\) is the mass of the rider and bicycle in kg, \(g\) is the standard acceleration due to gravity, equal to 9.81 m.s\(^{-2}\), \(\alpha\) is the slope in radians, \(V_d\) is the rider speed in m.s\(^{-1}\), and \(a\) is the rider acceleration in m.s\(^{-2}\).
This equation might look a bit complex at first, but we can explain in words what the different terms within the parentheses are:
- the first term is the power that a cyclist is required to produce to fight the wind;
- the second term is the power that a cyclist is required to produce to fight the rolling resistance created by the tires on the road;
- the third term is the power that a cyclist is required to produce to go up a hill if the slope is positive; if the slope is negative, the cyclist does not need to produce any power to go forward;
- the fourth and last term is the power that a cyclist requires to change speed (i.e. to accelerate).
We can simplify the model above by using the data that we have at hand. It would look like the following.
\(P_{meca} = \beta_{1} V_{d}^{3} + \beta_{2} V_{d} + \beta_{3} \sin(\alpha) V_{d} + \beta_{4} a V_{d}\)
This model is closer to what we saw previously: it is a linear model trained on a non-linear feature transformation. We will build, train and evaluate such a model as part of this exercise. Thus, you need to:
- create a new data matrix containing the cube of the speed, the speed, the speed multiplied by the sine of the angle of the slope, and the speed multiplied by the acceleration. To compute the angle of the slope, take the arc tangent of the slope (`alpha = np.arctan(slope)`). In addition, we can limit ourselves to positive accelerations only by clipping the negative acceleration values to 0 (they would correspond to power produced by braking, which we are not modeling here);
- using the new data matrix, create a linear predictive model based on a `sklearn.preprocessing.StandardScaler` and a `sklearn.linear_model.RidgeCV`;
- use a `sklearn.model_selection.ShuffleSplit` cross-validation strategy with only 4 splits (`n_splits=4`) to evaluate the generalization performance of the model, as sketched below. Use the mean absolute error (MAE) as the generalization performance metric. Also, pass the parameters `return_estimator=True` and `return_train_score=True` to answer the subsequent questions. Be aware that `ShuffleSplit` is a naive strategy; we will investigate the consequences of this choice in the subsequent questions.
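One possible way to put these steps together is sketched below. It is only a sketch: the column names "speed", "slope", and "acceleration" are assumptions to check against the actual columns of `data`, and `random_state=0` is an arbitrary choice for reproducibility.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical column names; check them against the actual dataframe.
speed = data["speed"]
alpha = np.arctan(data["slope"])  # slope angle in radians
acceleration = data["acceleration"].clip(lower=0)  # discard braking phases

# Non-linear feature transformation of the simplified physical model.
data_linear_model = pd.DataFrame(
    {
        "speed cubed": speed**3,
        "speed": speed,
        "sin(alpha) speed": np.sin(alpha) * speed,
        "acceleration speed": acceleration * speed,
    }
)

linear_model = make_pipeline(StandardScaler(), RidgeCV())
cv = ShuffleSplit(n_splits=4, random_state=0)
cv_results_linear = cross_validate(
    linear_model,
    data_linear_model,
    target,
    cv=cv,
    scoring="neg_mean_absolute_error",
    return_estimator=True,
    return_train_score=True,
)
```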
Question
What is the mean value of the column containing \(\sin(\alpha) V_{d}\)?
a) about -3
b) about -0.3
c) about -0.03
d) about -0.003
Select a single answer
Question
On average, the Mean Absolute Error on the test sets obtained through cross-validation is closest to:
a) 20 Watts
b) 50 Watts
c) 70 Watts
d) 90 Watts
Select a single answer
Hint: pass `scoring="neg_mean_absolute_error"` to the `cross_validate` function to compute the negative of the requested metric.
Hint: it is possible to replace the negative acceleration values by 0 using `data["acceleration"].clip(lower=0)`.
Question
Given the model \(P_{meca} = \beta_{1} V_{d}^{3} + \beta_{2} V_{d} + \beta_{3} \sin(\alpha) V_{d} + \beta_{4} a V_{d}\) that you created, inspect the weights of the linear models fitted during cross-validation and select the correct statements:
a) \(\beta_{1} < \beta_{2} < \beta_{3}\)
b) \(\beta_{3} < \beta_{1} < \beta_{2}\)
c) \(\beta_{2} < \beta_{3} < \beta_{1}\)
d) \(\beta_{1} < 0\)
e) \(\beta_{2} < 0\)
f) \(\beta_{3} < 0\)
g) \(\beta_{4} < 0\)
h) All \(\beta\)s are \(> 0\)
Select all answers that apply
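One possible way to retrieve the fitted weights, reusing the names from the sketch above: since `return_estimator=True` was passed, each entry of `cv_results_linear["estimator"]` is a fitted pipeline, and `[-1]` accesses its final `RidgeCV` step.

```python
# Collect the coefficients of the RidgeCV step of each fitted pipeline.
coefficients = pd.DataFrame(
    [estimator[-1].coef_ for estimator in cv_results_linear["estimator"]],
    columns=data_linear_model.columns,
)
print(coefficients)
```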
Now, we will create a predictive model that uses all the available data, including sensor measurements such as cadence (the speed at which the cyclist turns the pedals, measured in rotations per minute) and heart rate (the number of beats per minute of the cyclist's heart while exercising). We will also use a non-linear regressor, a `sklearn.ensemble.HistGradientBoostingRegressor`. Fix the maximum number of iterations to 1000 (`max_iter=1_000`) and activate early stopping (`early_stopping=True`). Repeat the previous evaluation using this regressor, as sketched below.
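A possible sketch, reusing the `cv` object and the scoring choices from the earlier evaluation:

```python
from sklearn.ensemble import HistGradientBoostingRegressor

hgbr = HistGradientBoostingRegressor(max_iter=1_000, early_stopping=True)
cv_results_hgbr = cross_validate(
    hgbr,
    data,
    target,
    cv=cv,
    scoring="neg_mean_absolute_error",
    return_estimator=True,
    return_train_score=True,
)
```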
Question
On average, the Mean Absolute Error on the test sets obtained through cross-validation is closest to:
a) 20 Watts
b) 40 Watts
c) 60 Watts
d) 80 Watts
Select a single answer
Question
Comparing both the linear model and the histogram gradient boosting model and taking into consideration the train and test MAE obtained via cross-validation, select the correct statements:
a) the generalization performance of the histogram gradient-boosting model is limited by its underfitting
b) the generalization performance of the histogram gradient-boosting model is limited by its overfitting
c) the generalization performance of the linear model is limited by its underfitting
d) the generalization performance of the linear model is limited by its overfitting
Select all answers that apply
Hint: look at the values of the `train_score` and the `test_score` collected in the dictionaries returned by the `cross_validate` function.
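For instance, assuming the result names used in the sketches above:

```python
# The scores are negated MAE values; flip the sign to get errors in Watts.
print("Train MAE:", -cv_results_hgbr["train_score"].mean())
print("Test MAE:", -cv_results_hgbr["test_score"].mean())
```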
In the previous cross-validation, we made the choice of using a `ShuffleSplit` cross-validation strategy. This means that randomly selected samples were used as the testing set, ignoring any time dependency between the rows of the dataframe.
We would like to have a cross-validation strategy that takes into account the groups defined by each individual date. Each group corresponds to a bike ride.
Question
How many bike rides are stored in the dataframe `data`? Do not hesitate to look at the hints.
a) 2
b) 3
c) 4
d) 5
Select a single answer
Hint: you can count the unique days in the `DatetimeIndex` (the index of the dataframe `data`). Indeed, we assume that the rider went cycling at most once per day.
Hint: you can access the date and time of a `DatetimeIndex` using `df.index.date` and `df.index.time`, respectively.
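A minimal sketch combining both hints:

```python
import numpy as np

# At most one ride per day: count the unique calendar days in the index.
n_rides = len(np.unique(data.index.date))
print(n_rides)
```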
We would like to have a cross-validation strategy that evaluates the capacity of
our model to predict on a completely new bike ride: the samples in the
validation set should only come from rides not present in the training set.
Therefore, we can use a `LeaveOneGroupOut` strategy: at each iteration of the cross-validation, we hold out one bike ride for evaluation and use all the other bike rides to train our model.
Thus, you concretely need to:
- create a variable called `groups` that is a 1D numpy array containing the index of each ride present in the dataframe. Therefore, the length of `groups` will be equal to the number of samples in `data`. If we had 2 bike rides, we would expect the indices 0 and 1 in `groups` to differentiate the bike rides. You can use `pd.factorize` to encode any Python type into integer indices;
- create a cross-validation object named `cv` using the `sklearn.model_selection.LeaveOneGroupOut` strategy;
- evaluate both the linear and histogram gradient boosting models with this strategy, as sketched below.
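A minimal sketch of these steps, assuming (as in the earlier hint) that the ride index can be derived from the calendar day:

```python
from sklearn.model_selection import LeaveOneGroupOut

# Encode each calendar day (i.e. each ride) as an integer group index.
groups, _ = pd.factorize(data.index.date)

cv = LeaveOneGroupOut()
cv_results_linear_logo = cross_validate(
    linear_model, data_linear_model, target,
    cv=cv, groups=groups,
    scoring="neg_mean_absolute_error",
    return_estimator=True, return_train_score=True,
)
cv_results_hgbr_logo = cross_validate(
    hgbr, data, target,
    cv=cv, groups=groups,
    scoring="neg_mean_absolute_error",
    return_estimator=True, return_train_score=True,
)
```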
Question
Using the previous evaluations (with the `LeaveOneGroupOut` strategy) and looking at the train and test errors for both models, select the correct statements:
a) the generalization performance of the gradient-boosting model is limited by its underfitting
b) the generalization performance of the gradient-boosting model is limited by its overfitting
c) the generalization performance of the linear model is limited by its underfitting
d) the generalization performance of the linear model is limited by its overfitting
Select all answers that apply
Question
In this case we cannot compare cross-validation scores fold-to-fold as the folds are not aligned (they are not generated by the exact same strategy). Instead, compare the mean of the cross-validation test errors in the evaluations of the linear model to select the correct statement.
When using the `ShuffleSplit` strategy, the mean test error:
a) is greater than the `LeaveOneGroupOut` mean test error by more than 3 Watts, i.e. `ShuffleSplit` is giving over-pessimistic results
b) differs from the `LeaveOneGroupOut` mean test error by less than 3 Watts, i.e. both cross-validation strategies are equivalent
c) is lower than the `LeaveOneGroupOut` mean test error by more than 3 Watts, i.e. `ShuffleSplit` is giving over-optimistic results
Select a single answer
Question
Compare the mean of the cross-validation test errors in the evaluations of the gradient-boosting model to select the correct statement.
When using the `ShuffleSplit` strategy, the mean test error:
a) is greater than the `LeaveOneGroupOut` mean test error by more than 3 Watts, i.e. `ShuffleSplit` is giving over-pessimistic results
b) differs from the `LeaveOneGroupOut` mean test error by less than 3 Watts, i.e. both cross-validation strategies are equivalent
c) is lower than the `LeaveOneGroupOut` mean test error by more than 3 Watts, i.e. `ShuffleSplit` is giving over-optimistic results
Select a single answer
Question
Compare more precisely the errors estimated through cross-validation and select the correct statement:
a) in general, the standard deviation of the train and test errors increased using the `LeaveOneGroupOut` cross-validation
b) in general, the standard deviation of the train and test errors decreased using the `LeaveOneGroupOut` cross-validation
Select a single answer
Now, we will go into more detail by picking a single ride for testing and analysing the predictions of the models for this test ride. To do so, we can reuse the `LeaveOneGroupOut` cross-validation object in the following manner:
```python
cv = LeaveOneGroupOut()
train_indices, test_indices = list(cv.split(data, target, groups=groups))[0]

data_linear_model_train = data_linear_model.iloc[train_indices]
data_linear_model_test = data_linear_model.iloc[test_indices]
data_train = data.iloc[train_indices]
data_test = data.iloc[test_indices]
target_train = target.iloc[train_indices]
target_test = target.iloc[test_indices]
```
Now, fit both the linear model and the histogram gradient boosting regressor on the training data and collect the predictions on the testing data. Make a scatter plot where the x-axis shows the measured powers (true target) and the y-axis shows the predicted powers (predicted target). Make a separate plot for each model, as sketched below.
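A minimal plotting sketch, reusing the model and data names from above:

```python
import matplotlib.pyplot as plt

linear_model.fit(data_linear_model_train, target_train)
linear_predictions = linear_model.predict(data_linear_model_test)

hgbr.fit(data_train, target_train)
hgbr_predictions = hgbr.predict(data_test)

# One scatter plot of predicted vs. measured power per model.
for title, predictions in [
    ("Linear model", linear_predictions),
    ("Histogram gradient boosting", hgbr_predictions),
]:
    plt.figure()
    plt.scatter(target_test, predictions, alpha=0.3)
    plt.xlabel("Measured power (Watts)")
    plt.ylabel("Predicted power (Watts)")
    plt.title(title)
```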
Question
By analysing the plots, select the correct statements:
a) the linear regressor tends to under-predict samples with high power
b) the linear regressor tends to over-predict samples with high power
c) the linear regressor makes catastrophic predictions for samples with power close to zero
d) the histogram gradient boosting regressor tends to under-predict samples with high power
e) the histogram gradient boosting regressor tends to over-predict samples with high power
f) the histogram gradient boosting makes catastrophic predictions for samples with power close to zero
Select all answers that apply
Now select a portion of the testing data using the following code:

```python
time_slice = slice("2020-08-18 17:00:00", "2020-08-18 17:05:00")

data_test_linear_model_subset = data_linear_model_test[time_slice]
data_test_subset = data_test[time_slice]
target_test_subset = target_test[time_slice]
```
This selects the data from 5:00 pm until 5:05 pm. Use the previously fitted models (linear and gradient-boosting regressors) to predict on this portion of the test data. Draw the true targets and the predictions of each model on the same plot, as sketched below.
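A minimal sketch of such a plot, reusing the fitted models from the previous question:

```python
plt.figure()
plt.plot(target_test_subset, label="True power", linestyle="--")
plt.plot(
    target_test_subset.index,
    linear_model.predict(data_test_linear_model_subset),
    label="Linear model",
)
plt.plot(
    target_test_subset.index,
    hgbr.predict(data_test_subset),
    label="Histogram gradient boosting",
)
plt.xlabel("Time")
plt.ylabel("Power (Watts)")
plt.legend()
```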
Question
By using the previous plot, select the correct statements:
a) the linear model is more accurate than the histogram gradient boosting regressor
b) the histogram gradient boosting regressor is more accurate than the linear model
Select a single answer