Simple Model Tests
Simple Model Tests Data Science Project
Classification in Depth with Scikit-Learn


The main objective of the first project is to validate your ability to split a dataset into training and test sets stratified by the target class, and to train and compare the performance of different models: decision tree, random forest, LightGBM, and XGBoost. For this project, we will use accuracy as the evaluation metric.

Project Activities

All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.

All our activities include solutions with explanations on how they work and why we chose them.

multiplechoice

Did you find any missing values?

Calculate the percentage of missing values.
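
One way to approach this check is sketched below with pandas. The toy data frame and its column names (age, income, target) are hypothetical stand-ins for the project's dataset, which is not shown on this page.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; replace with the project's actual data frame.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50000, 62000, np.nan, np.nan],
    "target": [0, 1, 0, 1],
})

# isna() marks missing cells as True; the mean of booleans is the
# fraction of missing values per column, scaled here to a percentage.
missing_pct = df.isna().mean() * 100

# True if at least one cell anywhere in the frame is missing.
has_missing = bool(df.isna().any().any())

print(missing_pct)
print("Any missing values:", has_missing)
```

Taking the mean of the boolean mask avoids dividing by the row count by hand, and it reports every column at once.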

codevalidated

Dropping Unnecessary Features

Drop the columns that you have previously identified as independent. Perform your drop in place, modifying the df variable. If you have made a mistake, restart your notebook from the beginning.

Store the data frame in the variable df.
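
A minimal sketch of an in-place drop follows. The column customer_id is a hypothetical example of a feature judged unnecessary; the actual columns to drop depend on your earlier analysis.

```python
import pandas as pd

# Hypothetical toy data; replace with the project's actual data frame.
df = pd.DataFrame({
    "customer_id": [101, 102, 103],  # hypothetical identifier column
    "age": [25, 37, 40],
    "target": [0, 1, 0],
})

# Hypothetical list of columns identified as unnecessary.
cols_to_drop = ["customer_id"]

# inplace=True modifies df directly instead of returning a new frame.
df.drop(columns=cols_to_drop, inplace=True)

print(df.columns.tolist())
```

Because the drop happens in place, running the same cell a second time raises a KeyError (the columns no longer exist), which is one reason the activity suggests restarting the notebook after a mistake.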

codevalidated

Separate the target and the features into two variables.

Store the features in X and the target in y.
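
The separation can be sketched as follows, assuming the target column is named target (a hypothetical name; use the name from your dataset).

```python
import pandas as pd

# Hypothetical toy data; replace with the project's actual data frame.
df = pd.DataFrame({
    "age": [25, 37, 40, 31],
    "income": [50000, 62000, 58000, 45000],
    "target": [0, 1, 0, 1],
})

# Features: every column except the target (drop returns a new frame here).
X = df.drop(columns=["target"])

# Target: the single column to predict.
y = df["target"]

print(X.shape, y.shape)
```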

codevalidated

Use train_test_split to split the data into training and testing sets. Use 80% of the data for training and 20% for testing, with random_state=0.

Store the values in the variables X_train, X_test, y_train, y_test, and random_state.
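
A sketch of the split is shown below on synthetic data from make_classification, which stands in for the project's X and y. The stratify=y argument follows the stratified split mentioned in the project objective; the activity text itself does not name it, so treat it as an assumption and omit it if the validated result differs.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the project's features and target.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

random_state = 0

# test_size=0.2 gives the 80/20 split; stratify=y (an assumption, see above)
# keeps the class proportions similar in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=random_state
)

print(X_train.shape, X_test.shape)
```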

codevalidated

Train a Random Forest with the following parameters: n_estimators=100 and random_state=42, and calculate the accuracy on the testing set.

Train a Random Forest Classifier using the training data, and store the model in rf. You can specify the model parameters such as the maximum depth of the tree or the minimum number of samples required to split an internal node.

Calculate the accuracy of both the training and testing sets and run the code in a Jupyter Notebook.

Store the results in the variables train_accuracy and test_accuracy.
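
The training and scoring steps might look like the sketch below. It uses synthetic data in place of the project's dataset, so the printed accuracies say nothing about the results expected in the activity.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the project's data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Parameters taken from the activity description.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Accuracy on the data the model was fit on, and on held-out data.
train_accuracy = accuracy_score(y_train, rf.predict(X_train))
test_accuracy = accuracy_score(y_test, rf.predict(X_test))

print(f"train accuracy: {train_accuracy:.3f}")
print(f"test accuracy:  {test_accuracy:.3f}")
```

Comparing the two numbers is a quick overfitting check: a training accuracy far above the testing accuracy suggests the model memorized the training set.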

The accuracy you can expect depends on the specifics of the problem and the data. For a simple, well-defined problem with a large and diverse training dataset, a well-trained machine learning model can exceed 80% accuracy in some cases.

multiplechoice

Best-Performing Models

Which two models show the best performance in terms of the evaluation metrics (highest accuracy and AUC)?
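
A comparison along these lines can be sketched as a loop over candidate models. Only the two scikit-learn models are included here so the example runs without extra packages; the data is synthetic, and the dictionary keys are illustrative names. Classifiers from XGBoost and LightGBM that follow the scikit-learn interface could be added to the same dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification stand-in for the project's data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Illustrative subset of the models named in the project.
models = {
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Accuracy uses hard class predictions.
    acc = accuracy_score(y_test, model.predict(X_test))
    # AUC uses the predicted probability of the positive class.
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    results[name] = {"accuracy": acc, "auc": auc}

# Rank model names from highest to lowest test accuracy.
ranked = sorted(results, key=lambda n: results[n]["accuracy"], reverse=True)

for name in ranked:
    print(name, results[name])
```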

Author

Verónica Barraza

This project is part of

Classification in Depth with Scikit-Learn
