All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.
All our activities include solutions with explanations on how they work and why we chose them.
Find the most common destination in the Destination column, and fill the null values with it. Perform the fixes in place, modifying the df variable.
You must modify the df variable itself. Don't worry if you mess up the DataFrame! Just reload it with the first line of the notebook.
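A minimal sketch of one way to do it, assuming df has already been loaded from the dataset:

```python
# Most frequent value in the Destination column
most_common_destination = df["Destination"].mode()[0]

# Fill the nulls in that column, modifying df in place
df.fillna({"Destination": most_common_destination}, inplace=True)
```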
Find the most common value for the VIP column, and fill the null values with it. Perform the fixes in place, modifying the df variable.
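The same pattern works here, as a sketch:

```python
# Most frequent value in the VIP column
most_common_vip = df["VIP"].mode()[0]

# Fill the nulls in that column, modifying df in place
df.fillna({"VIP": most_common_vip}, inplace=True)
```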
There are a few variables in the dataset that are independent and won't contribute to a good prediction for our final model. Which ones are they?
Drop the columns that you have previously identified as independent. Perform your drop in place, modifying the df variable. If you have made a mistake, restart your notebook from the beginning.
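A sketch of the drop; the column names below are hypothetical placeholders, substitute the ones you actually identified:

```python
# Hypothetical column names; replace them with the ones you identified
df.drop(columns=["PassengerId", "Name"], inplace=True)
```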
Drop any other row containing a null/NaN value. The final dataframe should have NO null values whatsoever.
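A sketch:

```python
# Remove every remaining row that still contains a null value
df.dropna(inplace=True)

# Sanity check: the total count of nulls should now be zero
assert df.isnull().sum().sum() == 0
```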
What categorical features should be encoded before training our model?
Given the features previously identified as categorical and that need to be encoded, encode them in a new dataframe named df_encoded. IMPORTANT: do not modify df yet; df_encoded should be a brand new DataFrame with only the previously selected features encoded as one-hot values, that is, 1s and 0s.
Don't perform any name changes to the columns; for example, the encoded columns for CryoSleep will be CryoSleep_False and CryoSleep_True. For HomePlanet they'll be HomePlanet_Earth, HomePlanet_Europa, HomePlanet_Mars, etc.
Important! You will (most likely) need to transform the VIP column to an object/string before encoding it. Use the .astype(str) method before encoding.
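A minimal sketch using pandas' get_dummies; the categorical_cols list below is an assumption, adjust it to the features you identified:

```python
import pandas as pd

# Hypothetical list; replace with the categorical features you identified
categorical_cols = ["HomePlanet", "CryoSleep", "Destination", "VIP"]

# Work on a copy so df itself is not modified yet
selected = df[categorical_cols].copy()

# VIP needs to be a string before encoding
selected["VIP"] = selected["VIP"].astype(str)

# One-hot encode into a brand new DataFrame, as 1s and 0s
df_encoded = pd.get_dummies(selected, dtype=int)
```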
Now it's time to drop the original features you have previously identified as categorical and encoded. But, this time, don't remove them from df in place. Create a NEW variable named df_no_categorical that contains the result of the drop operation.
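A sketch, reusing the hypothetical categorical_cols list from the previous example:

```python
# Drop the categorical columns WITHOUT modifying df itself
df_no_categorical = df.drop(columns=categorical_cols)
```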
Create a new DataFrame in the variable df_final that contains the combination of the two previously processed dataframes: df_no_categorical and df_encoded, in that order. The result will contain all the columns from df_no_categorical followed by the one-hot encoded columns from df_encoded.
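A sketch of the combination:

```python
import pandas as pd

# Combine the two DataFrames column-wise, in that order
df_final = pd.concat([df_no_categorical, df_encoded], axis=1)
```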
Using df_final, which contains all our data correctly cleaned and prepared, create two new derivative variables:
The transported variable should be a Series containing ONLY the Transported column.
The df_train variable should be a dataframe containing ALL the columns in df_final, EXCEPT for the Transported column. This is equivalent to saying: "remove the Transported column from df_final and store the result in df_train".
Important: DO NOT modify df_final.
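A sketch:

```python
# The target: just the Transported column, as a Series
transported = df_final["Transported"]

# The features: every column EXCEPT Transported; df_final is left untouched
df_train = df_final.drop(columns=["Transported"])
```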
Given the RandomForestClassifier created (with random_state=42, important, don't change it!), instantiate a GridSearchCV to find the best possible parameter for max_depth, in the range 5 to 25.
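A sketch of the grid search, assuming X_train and y_train have already been created:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# The classifier, with the fixed random state
rf = RandomForestClassifier(random_state=42)

# Try every max_depth from 5 to 25, inclusive
param_grid = {"max_depth": list(range(5, 26))}

grid_search = GridSearchCV(rf, param_grid)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
```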
According to our grid search, what's the best hyperparameter value for max_depth?
For this project, it'll be important to optimize our recall, as we are trying to save people from being transported to another galaxy. So, now create a RandomForestClassifier object in the variable model and train it with X_train and y_train.
You should select the correct hyperparameters to achieve a precision of at least 0.8 and a recall of at least 0.75.
Instantiate and train your model in the variable model.
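A sketch of the final model; it reuses the best max_depth from the grid search above, and the X_test/y_test evaluation at the end is an assumption about how the notebook is set up:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score

# Reuse the best max_depth found by the grid search
model = RandomForestClassifier(
    max_depth=grid_search.best_params_["max_depth"],
    random_state=42,
)
model.fit(X_train, y_train)

# Assumes a held-out X_test/y_test split exists in the notebook
y_pred = model.predict(X_test)
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
```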