All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.
All our activities include solutions with explanations on how they work and why we chose them.
Select the correct answer.
The answer should be rounded to two decimal places.
Select the correct answer.
There could be more than just one correct answer.
Based on the correlation analysis of the dataset, which variable has the highest correlation with the target column?
Base your answer on the correlations calculated between the variables. There could be more than just one correct answer.
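A minimal sketch of one way to rank variables by their correlation with a target column, assuming a pandas DataFrame with a binary target. The data below is synthetic; the column names Age, Balance, and Exited are borrowed from this project only for illustration, and the resulting numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the project's dataset).
rng = np.random.default_rng(0)
n = 500
age = rng.normal(40, 10, n)
balance = rng.normal(75_000, 30_000, n)
# Make the toy target depend on age so that one correlation stands out.
exited = (age + rng.normal(0, 10, n) > 45).astype(int)
df = pd.DataFrame({"Age": age, "Balance": balance, "Exited": exited})

# Pearson correlation of every column with the target, ranked by
# absolute value so strong negative correlations are not overlooked.
corr_with_target = (
    df.corr()["Exited"]
    .drop("Exited")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```

Sorting by absolute value is a design choice: a variable with a correlation of -0.6 is more strongly related to the target than one with +0.3.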
What type of plot would you use to compare the credit score distribution of customers who have left the bank (churned) with that of customers who have not? Please select the figure that shows the correct representation.
There could be more than just one correct answer.
For this task, select the variables CreditScore, Age, Balance, HasCrCard, and EstimatedSalary, and show the relationships between them, classified by Exited. Then select the correct answer.
Two students created histograms for the 'credit score' variable using the same bin width and boundary values, but their plots have distinctly different shapes. What could be the reason for the different shapes in their histograms?
Let's see the figures:
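import matplotlib.pyplot as plt

# Three histograms of the same column (df is the project's DataFrame)
# that differ only in how the bars are aligned to the bins.
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(10, 5))
ax1.hist(df.CreditScore, align='left', color='#0504aa', alpha=0.7)   # bars centered on the left bin edges
ax2.hist(df.CreditScore, align='right', color='#0504aa', alpha=0.7)  # bars centered on the right bin edges
ax3.hist(df.CreditScore, color='#0504aa', alpha=0.7)                 # default: align='mid'
ax1.set_xlabel('Value', fontsize=15)
ax2.set_xlabel('Value', fontsize=15)
ax3.set_xlabel('Value', fontsize=15)
ax1.set_ylabel('Frequency', fontsize=15)
ax1.set_title('a', fontsize=15)
ax2.set_title('b', fontsize=15)
ax3.set_title('c', fontsize=15)
plt.show()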
There could be more than just one correct answer.
Compared to overlapping histograms, overlapping density plots generally do not present the same issues, because the continuous density lines help the viewer distinguish between the different distributions. The smooth lines give a more intuitive sense of the shape of the data, even when several distributions are shown at once.
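As a rough illustration of overlapping density plots, the sketch below draws kernel density estimates for two groups on the same axes. The credit scores are synthetic, the group names "Stayed" and "Churned" are assumptions, and scipy's gaussian_kde is just one of several ways to estimate a density.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
# Synthetic credit scores for two groups (illustrative only).
stayed = rng.normal(655, 95, 800)
churned = rng.normal(640, 100, 200)

grid = np.linspace(350, 850, 200)
fig, ax = plt.subplots(figsize=(8, 4))
for label, values in [("Stayed", stayed), ("Churned", churned)]:
    density = gaussian_kde(values)(grid)  # kernel density estimate evaluated on the grid
    ax.plot(grid, density, label=label)
    ax.fill_between(grid, density, alpha=0.2)
ax.set_xlabel("CreditScore")
ax.set_ylabel("Density")
ax.legend()
```

Because each curve integrates to one, the two groups remain comparable even though they contain very different numbers of customers.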
Select the code that shows the balance distribution by country.
Data leakage can cause you to create overly optimistic if not completely invalid predictive models.
Data leakage is when information from outside the training dataset is used to create the model. This additional information can allow the model to learn or know something that it otherwise would not know, and in turn invalidate the estimated performance of the model being constructed.
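The effect can be sketched on synthetic data. Below, a hypothetical "leaky" column that is recorded after the outcome (and so nearly copies it) is added to otherwise weak features; the data, the helper name holdout_accuracy, and the choice of logistic regression are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000
X_clean = rng.normal(size=(n, 3))
# Toy target that is only weakly related to the first feature.
y = (X_clean[:, 0] + rng.normal(scale=2.0, size=n) > 0).astype(int)
# Hypothetical leaky column: derived from the outcome itself plus a little noise.
leaky = y + rng.normal(scale=0.1, size=n)
X_leaky = np.column_stack([X_clean, leaky])

def holdout_accuracy(X, y):
    """Fit on 70% of the rows and return accuracy on the held-out 30%."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    return LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)

acc_clean = holdout_accuracy(X_clean, y)
acc_leaky = holdout_accuracy(X_leaky, y)
print(f"without leaky column: {acc_clean:.2f}, with leaky column: {acc_leaky:.2f}")
```

The score with the leaky column looks excellent, but it would not carry over to new customers, for whom that column is not yet known at prediction time.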
Which of the following columns would you remove because they would cause data leakage?
There could be more than just one correct answer.
Now it's time to drop unwanted features ('Surname', 'RowNumber', 'CustomerId'). Which of the following statements are correct?
There could be more than just one correct answer.
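A small sketch of dropping columns with pandas, using a made-up two-row frame that reuses the column names from the text. It shows one common pattern, not necessarily the one the activity expects.

```python
import pandas as pd

# Tiny made-up frame with the identifier columns named in the text.
df = pd.DataFrame({
    "RowNumber": [1, 2],
    "CustomerId": [101, 102],
    "Surname": ["Smith", "Lee"],
    "CreditScore": [619, 608],
    "Exited": [1, 0],
})

# drop() returns a new DataFrame; reassign the result (or pass inplace=True) to keep it.
df = df.drop(columns=["Surname", "RowNumber", "CustomerId"])
print(df.columns.tolist())
```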
A machine learning algorithm needs to be able to understand the data it receives. There are many methods for encoding categorical variables as numbers, and each comes with its own advantages and disadvantages. Which is the correct way to encode the variables Gender and Geography?
Execute the following code:
Geography = pd.get_dummies(df['Geography'], drop_first=True)
Gender = pd.get_dummies(df['Gender'], drop_first=True)
df = pd.concat([df, Geography, Gender], axis=1)
df.info()
Which of the following statements are true?
There could be more than just one correct answer.
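To see what drop_first changes, here is a toy comparison on a four-value Series. The category values are assumptions for illustration; the real Geography column may contain different ones.

```python
import pandas as pd

# Toy values; the real column may hold different categories.
geography = pd.Series(["France", "Spain", "Germany", "France"], name="Geography")

full = pd.get_dummies(geography)                      # one indicator column per category
reduced = pd.get_dummies(geography, drop_first=True)  # first category (alphabetically) is dropped

print(full.columns.tolist())
print(reduced.columns.tolist())
```

In the reduced encoding, a row with no indicator set stands for the dropped category, so no information is lost while one redundant column is avoided.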
Based on this, select whether this scenario is a classification or a regression problem.
Select the correct way to split the dataset into a 70% training set and a 30% test set.
There could be more than just one correct answer.
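One common way to make such a split is scikit-learn's train_test_split. The sketch below uses placeholder arrays; random_state and stratify are optional extras assumed here, not requirements stated in the activity.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and binary labels (100 rows) standing in for the real data.
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,    # 30% test, the remaining 70% train
    random_state=42,  # assumed seed, for reproducibility
    stratify=y,       # optional: keep class proportions equal in both parts
)
print(X_train.shape, X_test.shape)
```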
We ask you to build a predictive model that answers the question: “what sorts of people were more likely to churn?”.
Which of the following statements about the confusion matrix are true?
There could be more than just one correct answer.
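For reference, this is how a confusion matrix can be read with scikit-learn. The labels below are hand-made for illustration and are unrelated to the model in this project.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels, for illustration only (1 = churned, 0 = stayed).
y_true = [0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 1, 0, 1]

# For binary labels [0, 1], scikit-learn lays the matrix out as
# [[TN, FP],
#  [FN, TP]]  (rows = actual class, columns = predicted class).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tn, fp, fn, tp, precision, recall)
```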
Which of the following statements about XGBoost are true? There could be more than just one correct answer.
Train an XGBoost classifier with the parameters objective='binary:logistic' and random_state=42, and calculate the accuracy on the training set.
Now, let's train an XGBoost model with a logistic objective, n_estimators of 30, and a maximum depth of 2. Use random state = 42.
Plot a histogram of the predicted scores, and estimate the precision and recall at the thresholds [0.1, 0.5, 0.7, 0.8] using the test dataset.
Based on these results, which of the following statements about the model are true? There could be more than just one correct answer.
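The threshold comparison can be sketched as follows. The labels and scores are hand-made for illustration; in practice the scores would come from something like the model's predicted probabilities on the test set.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hand-made labels and scores, for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
scores = np.array([0.05, 0.15, 0.30, 0.55, 0.40, 0.75, 0.60, 0.72, 0.85, 0.95])

results = {}
for threshold in [0.1, 0.5, 0.7, 0.8]:
    # A row is predicted positive when its score reaches the threshold.
    y_pred = (scores >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_true, y_pred, zero_division=0),
        recall_score(y_true, y_pred, zero_division=0),
    )
    print(threshold, results[threshold])
```

In this toy example, raising the threshold lowers recall and raises precision. Recall can never increase as the threshold goes up, whereas precision usually rises but is not guaranteed to.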
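Use the following function to plot a precision-recall curve for the training set.
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

def plot_prc(name, labels, predictions, **kwargs):
    # precision_recall_curve returns (precision, recall, thresholds)
    precision, recall, _ = precision_recall_curve(labels, predictions)
    # Recall goes on the x-axis and precision on the y-axis, matching the axis labels
    plt.plot(recall, precision, label=name, linewidth=2, **kwargs)
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.grid(True)
    ax = plt.gca()
    ax.set_aspect('equal')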
Which of the following statements are true?
There could be more than just one correct answer.
The RandomizedSearchCV() function takes the following arguments:
estimator: The estimator being fit; here it's XGBoost.
param_distributions: Unlike params, this is the distribution of possible hyperparameters to sample from.
cv: Number of cross-validation folds.
n_iter: Number of hyperparameter combinations to sample.
verbose: Prints more output.
Follow the instructions and solve the exercise:
Create a parameter grid called rs_param_grid that contains:
Create a RandomizedSearchCV object called xgb_rs, passing in the parameter grid to param_distributions. Also, specify verbose=2, cv=3, and n_iter=5.
Your objective is to maximize F1-score.
Fit the RandomizedSearchCV object to X and y.
What are the best parameters?
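The steps above might look roughly like the sketch below. Two things are assumptions: the search space in rs_param_grid is hypothetical (the exercise's actual values are not reproduced here), and scikit-learn's GradientBoostingClassifier is used as a stand-in estimator so the example depends only on scikit-learn. An XGBoost classifier with the scikit-learn interface should be usable in its place. The data is synthetic, so the printed parameters are not the answer to the activity.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for X and y.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Hypothetical search space, for illustration only.
rs_param_grid = {
    "learning_rate": [0.05, 0.1, 0.3],
    "n_estimators": [20, 40],
    "max_depth": [2, 3],
}

xgb_rs = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(random_state=42),  # stand-in for the XGBoost model
    param_distributions=rs_param_grid,
    scoring="f1",     # maximize F1-score
    n_iter=5,         # sample 5 parameter combinations
    cv=3,             # 3-fold cross-validation
    verbose=2,
    random_state=42,  # assumed seed so the sampled combinations are repeatable
)
xgb_rs.fit(X, y)
print(xgb_rs.best_params_, xgb_rs.best_score_)
```

After fitting, best_params_ holds the sampled combination with the highest mean cross-validated F1-score, and best_score_ holds that score.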