All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.
All our activities include solutions with explanations on how they work and why we chose them.
Find the most common destination in the Destination column, and fill the null values with it. Perform the fixes in place, modifying the df variable.
You must modify the df variable itself. Don't worry if you mess up the DataFrame! Just reload it with the first line of the notebook.
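A minimal sketch of one way to do it, assuming df has already been loaded from the dataset:

```python
# Most frequent value in the Destination column
most_common_destination = df["Destination"].mode()[0]

# Fill the nulls in that column, modifying df in place
df.fillna({"Destination": most_common_destination}, inplace=True)
```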
Find the most common value for the VIP column, and fill the null values with it. Perform the fixes in place, modifying the df variable.
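The same pattern works here, as a sketch:

```python
# Most frequent value in the VIP column
most_common_vip = df["VIP"].mode()[0]

# Fill the nulls in that column, modifying df in place
df.fillna({"VIP": most_common_vip}, inplace=True)
```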
There are a few variables in the dataset that are independent and won't contribute to a good prediction for our final model. Which ones are they?
Drop the columns that you have previously identified as independent. Perform your drop in place, modifying the df variable. If you have made a mistake, restart your notebook from the beginning.
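A sketch of the drop; the column names below are hypothetical placeholders, substitute the ones you actually identified:

```python
# Hypothetical column names; replace them with the ones you identified
df.drop(columns=["PassengerId", "Name"], inplace=True)
```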
Drop any other row containing a null/NaN value. The final dataframe should have NO null values whatsoever.
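A sketch:

```python
# Remove every remaining row that still contains a null value
df.dropna(inplace=True)

# Sanity check: the total count of nulls should now be zero
assert df.isnull().sum().sum() == 0
```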
What categorical features should be encoded before training our model?
Given the features previously identified as categorical and that need to be encoded, encode them in a new dataframe named df_encoded. IMPORTANT: do not modify df yet; df_encoded should be a brand new DataFrame with only the previously selected features encoded as one-hot values, that is, 1s and 0s.
Don't perform any name changes to the columns; for example, the encoded columns for CryoSleep will be CryoSleep_False and CryoSleep_True. For HomePlanet they'll be HomePlanet_Earth, HomePlanet_Europa, HomePlanet_Mars, etc.
Important! You will (most likely) need to transform the VIP column to an object/string before encoding it. Use the .astype(str) method before encoding.
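A minimal sketch using pandas' get_dummies; the categorical_cols list below is an assumption, adjust it to the features you identified:

```python
import pandas as pd

# Hypothetical list; replace with the categorical features you identified
categorical_cols = ["HomePlanet", "CryoSleep", "Destination", "VIP"]

# Work on a copy so df itself is not modified yet
selected = df[categorical_cols].copy()

# VIP needs to be a string before encoding
selected["VIP"] = selected["VIP"].astype(str)

# One-hot encode into a brand new DataFrame, as 1s and 0s
df_encoded = pd.get_dummies(selected, dtype=int)
```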
Now it's time to drop the original features you have previously identified as categorical and encoded. But, this time, don't remove them from df in place. Create a NEW variable named df_no_categorical that contains the result of the drop operation.
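A sketch, reusing the hypothetical categorical_cols list from the previous example:

```python
# Drop the categorical columns WITHOUT modifying df itself
df_no_categorical = df.drop(columns=categorical_cols)
```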
Create a new DataFrame in the variable df_final that contains the combination of the two previously processed dataframes: df_no_categorical and df_encoded, in that order. The result will contain all the columns from df_no_categorical followed by the one-hot encoded columns from df_encoded.
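A sketch of the combination:

```python
import pandas as pd

# Combine the two DataFrames column-wise, in that order
df_final = pd.concat([df_no_categorical, df_encoded], axis=1)
```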
Using df_final, which contains all our data correctly cleaned and prepared, create two new derivative variables:
The transported variable should be a Series containing ONLY the Transported column.
The df_train variable should be a dataframe containing ALL the columns in df_final, EXCEPT for the Transported column. This is equivalent to saying: "remove the Transported column from df_final and store the result in df_train".
Important: DO NOT modify df_final.
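A sketch:

```python
# The target: just the Transported column, as a Series
transported = df_final["Transported"]

# The features: every column EXCEPT Transported; df_final is left untouched
df_train = df_final.drop(columns=["Transported"])
```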
Given the RandomForestClassifier created (with random_state=42, important, don't change it!), instantiate a GridSearchCV to find the best possible parameter for max_depth, in the range 5 to 25.
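A sketch of the grid search, assuming X_train and y_train have already been created:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# The classifier, with the fixed random state
rf = RandomForestClassifier(random_state=42)

# Try every max_depth from 5 to 25, inclusive
param_grid = {"max_depth": list(range(5, 26))}

grid_search = GridSearchCV(rf, param_grid)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
```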
According to our grid search, what's the best hyperparameter value for max_depth?
For this project, it'll be important to optimize our recall, as we are trying to save people from being transported to another galaxy. So, now create a RandomForestClassifier object in the variable model and train it with X_train and y_train.
You should select the correct hyperparameters to achieve a precision of at least 0.8 and a recall of at least 0.75.
Instantiate and train your model in the variable model.
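A sketch of the final model; it reuses the best max_depth from the grid search above, and the X_test/y_test evaluation at the end is an assumption about how the notebook is set up:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score

# Reuse the best max_depth found by the grid search
model = RandomForestClassifier(
    max_depth=grid_search.best_params_["max_depth"],
    random_state=42,
)
model.fit(X_train, y_train)

# Assumes a held-out X_test/y_test split exists in the notebook
y_pred = model.predict(X_test)
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
```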