Supervised machine learning classification: Customer Churn prediction

In this project, you'll apply the techniques and models covered so far, including data cleaning, feature engineering, and hyperparameter tuning, to a dataset containing information about customer churn. The project combines quizzes and practical activities to guide you toward the best possible results.
Project Created by

Verónica Barraza

Project Activities

All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.

All our activities include solutions with explanations on how they work and why we chose them.

multiplechoice

What are the dimensions of the dataset?

multiplechoice

How many null values are present in the dataset?

multiplechoice

What kind of data is `NumOfProducts`?

multiplechoice

Are `NumOfProducts` and `Age` the same kind of data?

multiplechoice

Is the following statement true or false?

multiplechoice

Which countries are included in this dataset?

Select the correct answer.

input

Determine the percentage of non-active members.

The answer should be rounded to two decimal places.
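
A minimal sketch of one way to compute such a percentage. It assumes the activity flag is stored in a column named `IsActiveMember` with 1 meaning active (a hypothetical name, not confirmed by this page), and it uses a tiny made-up DataFrame instead of the project's dataset.

```python
import pandas as pd

# Toy data, NOT the project dataset; `IsActiveMember` is an assumed column name.
toy = pd.DataFrame({"IsActiveMember": [1, 0, 1, 1, 0, 0, 1, 0]})

# The mean of a boolean mask is the fraction of rows where the mask is True.
pct_inactive = round((toy["IsActiveMember"] == 0).mean() * 100, 2)
print(pct_inactive)
```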

input

What is the mean age of the women who have closed their accounts with the bank?

The answer should be rounded to two decimal places.

multiplechoice

Determine the 25th and 75th percentiles of the `Age` variable for men and women using `groupby`.
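
As a rough illustration of the `groupby` pattern, the sketch below computes both percentiles per group on a small made-up DataFrame. The column names `Gender` and `Age` follow the names used on this page; the values are invented.

```python
import pandas as pd

# Toy data, NOT the project dataset.
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Female", "Male", "Male", "Male"],
    "Age": [30, 40, 50, 20, 40, 60],
})

# quantile() accepts a list, giving one row per (group, quantile) pair.
age_quartiles = toy.groupby("Gender")["Age"].quantile([0.25, 0.75])
print(age_quartiles)
```

The result is a Series indexed by group and quantile, so a single value can be read with, for example, `age_quartiles.loc[("Male", 0.75)]`.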

multiplechoice

How do you calculate the mean, min, and max values, as well as the 25th, 50th (median), and 75th percentiles, for a given dataset?

Select the correct answer.
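
One common option in pandas is `DataFrame.describe()`, sketched here on invented values. It is shown as an illustration, not necessarily as the answer the activity expects.

```python
import pandas as pd

# Toy data, NOT the project dataset.
toy = pd.DataFrame({"Age": [20, 30, 40, 50, 60]})

# describe() reports count, mean, std, min, the 25%/50%/75% percentiles, and max.
summary = toy.describe()
print(summary)
```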

multiplechoice

Compute the number of people per country who have a credit card.

There could be more than just one correct answer.

multiplechoice

What is the mean salary for men over 40 years old?
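
A question like this usually comes down to combining two boolean conditions. The sketch below assumes the columns are named `Gender`, `Age`, and `EstimatedSalary`, and uses made-up rows.

```python
import pandas as pd

# Toy data, NOT the project dataset.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female"],
    "Age": [45, 50, 30, 60],
    "EstimatedSalary": [1000.0, 3000.0, 9000.0, 5000.0],
})

# Combine conditions with & and wrap each one in parentheses.
mask = (toy["Gender"] == "Male") & (toy["Age"] > 40)
mean_salary = toy.loc[mask, "EstimatedSalary"].mean()
print(mean_salary)
```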

multiplechoice

How do you calculate the correlation matrix for a dataset using Pearson's correlation coefficient?

There could be more than just one correct answer.
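
For reference, a minimal sketch of a Pearson correlation matrix in pandas, using invented numeric columns. `Exited` stands in for the target column, as it does elsewhere on this page.

```python
import pandas as pd

# Toy data, NOT the project dataset.
toy = pd.DataFrame({
    "Age": [20, 30, 40, 50],
    "Balance": [1.0, 2.0, 3.0, 4.0],
    "Exited": [0, 0, 1, 1],
})

# Pearson is the default method; it is spelled out here for clarity.
corr = toy.corr(method="pearson")

# Correlation of every feature with the target, strongest first.
target_corr = corr["Exited"].drop("Exited").abs().sort_values(ascending=False)
print(corr)
print(target_corr)
```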

multiplechoice

Correlation

Based on the correlation analysis of the dataset, which variable has the highest correlation with the target column?

multiplechoice

What is the correlation between Balance and the target?

multiplechoice

Is the credit score positively associated with the target?

multiplechoice

Correlation analysis

Based on the calculated correlations between the variables, select the correct statements. There could be more than just one correct answer.

multiplechoice

What type of data is a scatter plot typically used to represent or analyze in data visualization and statistical analysis?

multiplechoice

Visualization

What type of plot would you use to compare the credit score distribution for customers who have left the bank versus those who have not? Please select the figure that shows the correct representation.

multiplechoice

What type of information can you gain from a box plot in statistical analysis and data visualization?

There could be more than just one correct answer.

multiplechoice

How can you create a plot to show the relationship between all variables in a single layout using Seaborn in Python?

For this task, select the following variables: CreditScore, Age, Balance, HasCrCard, and EstimatedSalary, and show the relationships between them grouped by Exited. Then select the correct answer.

multiplechoice

Histogram

Two students created histograms for the 'credit score' variable using the same bin width and boundary values, but their plots have distinctly different shapes. What could be the reason for the different shapes in their histograms?

Let's see the figures:

import matplotlib.pyplot as plt

# df is the churn DataFrame loaded earlier in the project.
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(10, 5))
ax1.hist(df.CreditScore, align='left', color='#0504aa', alpha=0.7)
ax2.hist(df.CreditScore, align='right', color='#0504aa', alpha=0.7)
ax3.hist(df.CreditScore, color='#0504aa', alpha=0.7)
ax1.set_xlabel('Value', fontsize=15)
ax2.set_xlabel('Value', fontsize=15)
ax3.set_xlabel('Value', fontsize=15)
ax1.set_ylabel('Frequency', fontsize=15)
ax1.set_title('a', fontsize=15)
ax2.set_title('b', fontsize=15)
ax3.set_title('c', fontsize=15)
plt.show()


multiplechoice

What is the appropriate figure or chart to represent the number of classes for the 'Exited' variable?

There could be more than just one correct answer.

multiplechoice

Density plot

Overlapping density plots generally avoid the problems of overlapping histograms. The continuous density lines help the viewer tell the distributions apart, and the smooth curves make the shape of each distribution easier to read even when several are shown at once.

Select the code that shows the balance distribution by country.

multiplechoice

Data leakage

Data leakage can cause you to create overly optimistic if not completely invalid predictive models.

Data leakage occurs when information from outside the training dataset is used to create the model. This additional information can let the model learn something it otherwise would not know, which in turn invalidates the estimated performance of the model being constructed.

Which of the following columns would you remove because they would cause data leakage?

There could be more than just one correct answer.

multiplechoice

Drop unwanted features

Now it's time to drop unwanted features ('Surname', 'RowNumber', 'CustomerId'). Which of the following statements are correct?

There could be more than just one correct answer.
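
Dropping these columns can be sketched as follows. The column names come from the activity text above; the rows are made up.

```python
import pandas as pd

# Toy data, NOT the project dataset.
toy = pd.DataFrame({
    "RowNumber": [1, 2],
    "CustomerId": [101, 102],
    "Surname": ["X", "Y"],
    "Age": [30, 40],
    "Exited": [0, 1],
})

# drop() returns a new DataFrame; the original is left unchanged
# unless the result is assigned back to the same name.
cleaned = toy.drop(columns=["RowNumber", "CustomerId", "Surname"])
print(cleaned.columns.tolist())
```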

multiplechoice

Which of the following statements are true about normalization?

multiplechoice

Encode categorical variables

A machine learning algorithm needs to be able to understand the data it receives. There are plenty of methods to encode categorical variables as numbers, and each method comes with its advantages and disadvantages. Which is the correct way to encode the variables gender and geography?

multiplechoice

Encode categorical variables

Execute the following code:

Geography = pd.get_dummies(df['Geography'], drop_first=True)
Gender = pd.get_dummies(df['Gender'], drop_first=True)

df = pd.concat([df, Geography, Gender], axis=1)
df.info()

Which of the following statements are true?

There could be more than just one correct answer.
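
To see what `drop_first=True` does in isolation, here is a small sketch on a made-up column with three placeholder categories ("A", "B", "C"), which are not the countries in the project dataset.

```python
import pandas as pd

# Toy data with placeholder category labels, NOT the project dataset.
toy = pd.DataFrame({"Geography": ["A", "B", "C", "A"]})

# With k categories, drop_first=True keeps k - 1 indicator columns;
# the dropped first category is implied when all remaining indicators are 0.
dummies = pd.get_dummies(toy["Geography"], drop_first=True)
print(dummies)
```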

multiplechoice

Classification or Regression

Based on this, select whether this scenario is a classification or a regression problem.

multiplechoice

Split train and test

Select the correct way to split the dataset into 30% test and 70% train.

There could be more than just one correct answer.
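
For orientation, a 70/30 split with scikit-learn's `train_test_split` might look like the sketch below. The arrays are synthetic, and `random_state=42` and `stratify=y` are illustrative choices rather than requirements of the activity.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and a balanced binary target, NOT the project dataset.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# test_size=0.3 reserves 30% of the rows for testing;
# stratify=y keeps the class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```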

multiplechoice

Confusion Matrix

We ask you to build a predictive model that answers the question: “what sorts of people were more likely to churn?”.

Which of the following statements about the confusion matrix are true?

There could be more than just one correct answer.
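
As a reminder of how the cells of a binary confusion matrix are laid out in scikit-learn, here is a small sketch with hand-written labels (invented, not model output from this project).

```python
from sklearn.metrics import confusion_matrix

# Hand-written labels: 1 = churned, 0 = stayed.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes and columns are predicted classes,
# so ravel() yields TN, FP, FN, TP for a binary problem.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)
```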

multiplechoice

XGBoost

Which of the following statements about XGBoost are true? There could be more than just one correct answer.

multiplechoice

XGBClassifier

Train an XGBClassifier with the following parameters: objective="binary:logistic" and random_state=42. Then calculate the accuracy on the training set.

multiplechoice

XGBoost: evaluation metrics

Now, let's train an XGBoost model with a logistic objective, n_estimators=30, and max_depth=2.

Use random_state=42.

Plot the histogram of the predicted scores, and estimate the precision and recall at thresholds [0.1, 0.5, 0.7, 0.8] using the test dataset.

Based on these results, which of the following statements about the model are true? There could be more than just one correct answer.
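
The threshold step can be sketched independently of the model. Below, `scores` is a hand-written stand-in for the predicted probabilities of the positive class, so the numbers say nothing about the project's actual results.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up labels and scores, NOT output from the project's model.
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
scores = np.array([0.05, 0.2, 0.4, 0.45, 0.6, 0.65, 0.75, 0.9])

results = {}
for threshold in [0.1, 0.5, 0.7, 0.8]:
    # A row is predicted positive when its score reaches the threshold.
    y_pred = (scores >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_true, y_pred, zero_division=0),
        recall_score(y_true, y_pred, zero_division=0),
    )

for threshold, (precision, recall) in results.items():
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, raising the threshold trades recall for precision, which is the pattern the activity asks you to look for.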

multiplechoice

XGBoost: precision and recall

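Use the following function to plot a precision-recall curve for the training set.

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

def plot_prc(name, labels, predictions, **kwargs):
    # precision_recall_curve returns (precision, recall, thresholds)
    precision, recall, _ = precision_recall_curve(labels, predictions)
    # Recall goes on the x-axis and precision on the y-axis, matching the labels
    plt.plot(recall, precision, label=name, linewidth=2, **kwargs)
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.grid(True)
    ax = plt.gca()
    ax.set_aspect('equal')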

Which of the following statements are true?

There could be more than just one correct answer.

multiplechoice

XGBoost: Tuning Parameters

The RandomizedSearchCV() function takes in the following arguments:

  • estimator: The estimator being fit, here XGBoost.
  • param_distributions: Unlike a fixed parameter grid, this holds the distributions or lists of possible hyperparameter values to sample from.
  • cv: Number of cross-validation folds.
  • n_iter: Number of hyperparameter combinations to sample.
  • verbose: Controls how much output is printed; higher values print more.

Follow the instructions and solve the exercise:

  1. Create a parameter grid called rs_param_grid that contains:

    • 'max_depth': list(range(3, 12))
    • 'alpha': [0, 0.001, 0.01, 0.1, 1]
    • 'subsample': [0.5, 0.75, 1]
    • 'learning_rate': np.linspace(0.01, 0.5, 10)
    • 'n_estimators': [10, 25, 40]
  2. Create a RandomizedSearchCV object called xgb_rs, passing in the parameter grid to param_distributions. Also, specify verbose=2, cv=3, and n_iter=5.

  3. Your objective is to maximize F1-score.

  4. Fit the RandomizedSearchCV object to X and y.

What are the best parameters?
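
The steps above can be sketched as follows, with two loudly stated assumptions. First, scikit-learn's `GradientBoostingClassifier` is swapped in for XGBoost so the example depends only on scikit-learn; `alpha` is left out because this stand-in estimator does not accept it. Second, the data is synthetic, so the printed parameters are not the answer to the activity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary classification data, NOT the project dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Same value lists as the exercise, minus 'alpha' (see the note above).
rs_param_grid = {
    "max_depth": list(range(3, 12)),
    "subsample": [0.5, 0.75, 1.0],
    "learning_rate": np.linspace(0.01, 0.5, 10),
    "n_estimators": [10, 25, 40],
}

# Stand-in estimator; the exercise itself uses an XGBoost classifier here.
xgb_rs = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(random_state=42),
    param_distributions=rs_param_grid,
    scoring="f1",      # maximize F1-score
    n_iter=5,          # sample 5 parameter combinations
    cv=3,              # 3-fold cross-validation
    verbose=2,
    random_state=42,   # makes the sampled combinations reproducible
)
xgb_rs.fit(X, y)

print(xgb_rs.best_params_)
print(xgb_rs.best_score_)
```

Because only 5 of the possible combinations are sampled, the "best" parameters depend on the random seed; fixing `random_state` on the search object keeps runs comparable.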

This project is part of

Classification in Depth with Scikit-Learn
