Supervised machine learning classification: Customer Churn prediction

What are the dimensions of the dataset?

How many null values are present in the dataset?

What kind of data is `NumOfProducts`?

Are `NumOfProducts` and `Age` the same kind of data?

Is the following statement true or false?

Which countries are included in this dataset?

Select the correct answer.

Determine the percentage of non-active members.

What is the mean age of the women who have closed their accounts with the bank?

Determine the 25th and 75th percentiles of the `Age` variable for men and women using `groupby`.
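
As a minimal sketch of the `groupby` approach, the snippet below computes per-group percentiles on a tiny made-up table. The values are hypothetical and are not taken from the project dataset; only the `Gender` and `Age` column names follow the source.

```python
import pandas as pd

# Hypothetical toy data; not the project dataset.
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Female", "Male", "Male", "Male"],
    "Age": [20, 30, 40, 30, 50, 70],
})

# Passing a list of quantiles returns one row per (group, quantile) pair;
# unstack() pivots the quantile level into columns.
age_quartiles = toy.groupby("Gender")["Age"].quantile([0.25, 0.75]).unstack()
print(age_quartiles)
```

The same pattern should carry over to the project DataFrame by swapping `toy` for it, assuming the columns keep these names.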

How do you calculate the mean, min, and max values, as well as some percentiles (25th, 50th or median, and 75th), for a given dataset?

Select the correct answer.

Compute the number of people per country who have a credit card.

What is the mean salary for men over 40 years old?

How do you calculate the correlation matrix for a dataset using Pearson's correlation coefficient?

Correlation

Based on the correlation analysis of the dataset, which variable has the highest correlation with the target column?

What is the correlation between Balance and the target?

Is the credit score positively associated with the target?

Correlation analysis

Based on the correlations calculated between the variables, select the correct statements. There could be more than one correct answer.

What type of data is a scatter plot typically used to represent or analyze in data visualization and statistical analysis?

Visualization

What type of plot would you use to compare the credit score distribution for customers who have left the bank (churned) versus those who have not? Select the figure that shows the correct representation.

What type of information can you gain from a box plot in statistical analysis and data visualization?

How can you create a plot to show the relationship between all variables in a single layout using Seaborn in Python?

Histogram

Two students created histograms for the 'credit score' variable using the same bin width and boundary values, but their plots have distinctly different shapes. What could be the reason for the different shapes in their histograms?

Let's see the figures:

import matplotlib.pyplot as plt

# df is the project DataFrame loaded earlier.
# Draw three histograms of CreditScore side by side.
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(10, 5))
ax1.hist(df.CreditScore, align='left', color='#0504aa', alpha=0.7)
ax2.hist(df.CreditScore, align='right', color='#0504aa', alpha=0.7)
ax3.hist(df.CreditScore, color='#0504aa', alpha=0.7)

# Label each panel; only the first panel carries the y-axis label.
for ax, title in zip((ax1, ax2, ax3), ('a', 'b', 'c')):
    ax.set_xlabel('Value', fontsize=15)
    ax.set_title(title, fontsize=15)
ax1.set_ylabel('Frequency', fontsize=15)
plt.show()

What is the appropriate figure or chart to represent the number of classes for the 'Exited' variable?

Density plot

Overlapping density plots generally avoid the readability problems of overlapping histograms, because the continuous density lines help the viewer tell the distributions apart. The smooth lines make the shape of the data easier to grasp, even when several distributions are shown at once.
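
To illustrate the idea, here is a minimal sketch that overlays two kernel density estimates. It uses synthetic samples rather than the project data, and `scipy.stats.gaussian_kde` is just one way to obtain a smooth density curve; the group names and distribution parameters are assumptions made for the example.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Synthetic samples standing in for two customer groups (hypothetical values).
rng = np.random.default_rng(42)
group_a = rng.normal(loc=600, scale=50, size=500)
group_b = rng.normal(loc=680, scale=70, size=500)

# Evaluate each kernel density estimate on a shared grid.
grid = np.linspace(400, 900, 200)
density_a = gaussian_kde(group_a)(grid)
density_b = gaussian_kde(group_b)(grid)

# Two lines on one axis stay readable where stacked bars would overlap.
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(grid, density_a, label="group A")
ax.plot(grid, density_b, label="group B")
ax.set_xlabel("Value")
ax.set_ylabel("Density")
ax.legend()
```

In a script, a final `plt.show()` displays the figure; notebooks usually render it automatically.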

Select the code that shows the balance distribution by country.

Data leakage

Data leakage can cause you to create overly optimistic if not completely invalid predictive models.

Data leakage occurs when information from outside the training dataset is used to create the model. This additional information can allow the model to learn something that it otherwise would not know, which in turn invalidates the estimated performance of the model being constructed.
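
One common form of leakage comes from preprocessing: statistics computed on the full dataset carry test-set information into training. The sketch below, on synthetic values that are purely hypothetical, contrasts the leaky and the safe way to standardize a feature. It illustrates the general idea and is not the specific case asked about in this activity.

```python
import numpy as np

# Synthetic feature values (hypothetical); the last 3 rows play the test set.
rng = np.random.default_rng(0)
values = rng.normal(loc=50.0, scale=10.0, size=10)
train, test = values[:7], values[7:]

# Leaky: statistics computed on ALL rows, so test data influences training.
leaky_mean, leaky_std = values.mean(), values.std()

# Safe: statistics come from the training rows only and are then reused.
train_mean, train_std = train.mean(), train.std()
train_scaled = (train - train_mean) / train_std
test_scaled = (test - train_mean) / train_std

print(f"mean from all rows: {leaky_mean:.2f}, from training rows: {train_mean:.2f}")
```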

Which of the following columns would you remove because they would cause data leakage?

Drop unwanted features

Now it's time to drop unwanted features ('Surname', 'RowNumber', 'CustomerId'). Which of the following statements are correct?

Which of the following statements are true about normalization?

Encode categorical variables

A machine learning algorithm needs to be able to understand the data it receives. There are plenty of methods to encode categorical variables as numeric values, and each method comes with its advantages and disadvantages. Which is the correct way to encode the variables `Gender` and `Geography`?
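
As a rough illustration of how two common encodings differ, the sketch below applies integer codes and one-hot encoding to a made-up column. The `Color` column and its values are hypothetical and unrelated to the project dataset.

```python
import pandas as pd

# Hypothetical toy column; category names are made up for illustration.
toy = pd.DataFrame({"Color": ["red", "green", "blue", "green"]})

# Integer codes: compact, but they imply an order (blue < green < red)
# that does not exist for a nominal variable.
codes = toy["Color"].astype("category").cat.codes

# One-hot encoding: one indicator column per category, no implied order.
one_hot = pd.get_dummies(toy["Color"])

print(codes.tolist())
print(one_hot.columns.tolist())
```

Which encoding suits a given column depends on whether its categories have a natural order and on how many distinct values it has.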

Encode categorical variables

Execute the following code:

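import pandas as pd

# One-hot encode each categorical column, dropping the first category
Geography = pd.get_dummies(df['Geography'], drop_first=True)
Gender = pd.get_dummies(df['Gender'], drop_first=True)

# Append the indicator columns to the existing DataFrame
df = pd.concat([df, Geography, Gender], axis=1)
df.info()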

Which of the following statements are true?

Classification or Regression

Based on this, select whether this scenario is a classification or a regression problem.

Split train and test

Select the correct way to split the dataset into a 30% test set and a 70% training set.

Confusion Matrix

We ask you to build a predictive model that answers the question: “What sorts of people were more likely to churn?”

Which of the following statements about the confusion matrix are true?

XGBoost

Which of the following statements about XGBoost are true? There could be more than one correct answer.

XGBClassifier

Train an XGBoost classifier with the parameters objective='binary:logistic' and random_state=42, then calculate the accuracy on the training set.

XGBoost: evaluation metrics

Now, let's train an XGBoost classifier with a logistic objective, n_estimators=30, and max_depth=2.

Use random_state=42.

Plot the histogram of the predicted scores, and estimate the precision and recall at thresholds of 0.1, 0.5, 0.7, and 0.8 using the test dataset.

Based on these results, which of the following statements about the model are true? There could be more than one correct answer.
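
The thresholding step in this exercise can be sketched as follows. The labels and scores are hypothetical placeholders; in the exercise they would presumably come from the test labels and the trained model's predicted probabilities, for example `model.predict_proba(X_test)[:, 1]`, where `model` and `X_test` are assumed names.

```python
import numpy as np

# Hypothetical labels and predicted probabilities; not results from the project.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.05, 0.15, 0.20, 0.30, 0.45, 0.55, 0.60, 0.75, 0.85, 0.90])

def precision_recall_at(threshold):
    """Return (precision, recall) when scores >= threshold count as positive."""
    predicted = scores >= threshold
    tp = np.sum(predicted & (y_true == 1))
    fp = np.sum(predicted & (y_true == 0))
    fn = np.sum(~predicted & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

for threshold in [0.1, 0.5, 0.7, 0.8]:
    p, r = precision_recall_at(threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, raising the threshold lowers recall, since fewer positives clear the cutoff. Precision does not have to move in one direction.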

XGBoost: precision and recall

Use the following function to make a precision and recall curve for the training set.

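import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

def plot_prc(name, labels, predictions, **kwargs):
    precision, recall, _ = precision_recall_curve(labels, predictions)
    # Recall goes on the x-axis and precision on the y-axis, matching the labels
    plt.plot(recall, precision, label=name, linewidth=2, **kwargs)
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.grid(True)
    ax = plt.gca()
    ax.set_aspect('equal')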

Which of the following statements are true?

XGBoost: Tuning Parameters

The main arguments of RandomizedSearchCV are:
estimator: The estimator being fit; here it is XGBoost.
param_distributions: Unlike a fixed parameter grid, this defines the distribution of possible hyperparameter values to sample from.
cv: Number of cross-validation folds.
n_iter: Number of hyperparameter combinations to sample.
verbose: Controls how much output is printed; higher values print more.

Follow the instructions and solve the exercise:

Create a parameter grid called rs_param_grid that contains:
- 'max_depth': list(range(3, 12))
- 'alpha': [0, 0.001, 0.01, 0.1, 1]
- 'subsample': [0.5, 0.75, 1]
- 'learning_rate': np.linspace(0.01, 0.5, 10)
- 'n_estimators': [10, 25, 40]
Create a RandomizedSearchCV object called xgb_rs, passing in the parameter grid to param_distributions. Also, specify verbose=2, cv=3, and n_iter=5.
Your objective is to maximize the F1 score.
Fit the RandomizedSearchCV object to X and y.

What are the best parameters?

Verónica Barraza
