All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.
All our activities include solutions with explanations on how they work and why we chose them.
Calculate the percentage of missing values. If you find any missing values, drop them using df.dropna(inplace=True).
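A minimal sketch, assuming the dataset is already loaded into a pandas DataFrame named df:

# Percentage of missing values in each column
missing_pct = df.isnull().mean() * 100
print(missing_pct)

# Drop any rows that contain missing values
df.dropna(inplace=True)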
The id column is unique for every row, so it carries no predictive signal and can only mislead the model. Let's remove it. Choose the correct code to drop this feature.
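One common way to do this, assuming the column is named id:

# Drop the id column; axis=1 targets columns rather than rows
df.drop('id', axis=1, inplace=True)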
Complete the following code with the categorical variables and run it in the notebook:
import matplotlib.pyplot as plt
import seaborn as sns

cat_variables = ...  # complete with the list of categorical column names

for i in cat_variables:
    fig_dims = (10, 4)
    fig, ax = plt.subplots(figsize=fig_dims)
    sns.countplot(x=i, hue="stroke", ax=ax, data=df, palette="RdBu", order=df[i].value_counts().index)
    plt.xticks(rotation=90)
    plt.legend(loc='upper right')
    plt.show()
With the following code, we will sample 249 instances from the negative (no-stroke) class, matching the number of positive instances, so the two classes are balanced.
Let's run this code in the notebook.
import pandas as pd

# Undersample the majority (no-stroke) class to match the size of the minority class
non_stroke = df.loc[df.stroke == 0].sample(df.loc[df.stroke == 1].shape[0], random_state=1)
stroke = df.loc[df.stroke == 1]

# Combine the two classes, then shuffle; assign the result so the shuffle is kept
frames = [stroke, non_stroke]
result = pd.concat(frames)
result = result.sample(frac=1, random_state=1).reset_index(drop=True)
result.head()
After that, drop the categorical variables and separate the features and the target into two variables: store the features in X and the target in y.
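One way this could look, assuming result is the balanced DataFrame from the previous step and cat_variables is the list of categorical columns from earlier:

# Remove the categorical columns, then split off the target
numeric = result.drop(columns=cat_variables)
X = numeric.drop(columns=['stroke'])  # features
y = numeric['stroke']                 # target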
First, use train_test_split to split the data into training and testing sets. Split the dataset into 80% training and 20% testing, with random_state=0, and store the results in the variables X_train, X_test, y_train, and y_test.
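A short sketch with scikit-learn:

from sklearn.model_selection import train_test_split

# 80/20 split; a fixed random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)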
Train a Decision Tree Classifier using the training data, and store the model in dt. You can specify model parameters such as the maximum depth of the tree or the minimum number of samples required to split an internal node.
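A minimal sketch; the hyperparameter values here are illustrative assumptions, not prescribed settings:

from sklearn.tree import DecisionTreeClassifier

# max_depth and min_samples_split are example values you can tune
dt = DecisionTreeClassifier(max_depth=5, min_samples_split=2, random_state=0)
dt.fit(X_train, y_train)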
Calculate the accuracy of both the training and testing sets and run the code in a Jupyter Notebook. Store the results in the variables train_accuracy and test_accuracy.
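One way to compute this, assuming the fitted model dt from the previous step; for classifiers, score() returns mean accuracy:

train_accuracy = dt.score(X_train, y_train)
test_accuracy = dt.score(X_test, y_test)
print(train_accuracy, test_accuracy)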
The expected accuracy varies with the specifics of the problem and the data. For a well-defined problem with a reasonably large and diverse training dataset, however, a well-trained model can often exceed 70% accuracy.
Train a KNeighborsClassifier using the training data, and store the model in the variable knn. In KNN, the value of k determines the number of nearest neighbors to consider when making a prediction; in this example, you will set k by passing the n_neighbors parameter to the class constructor. Remember that KNN is sensitive to the scale of the features, so you should use StandardScaler to standardize them, storing the results in the variables X_train_scaler and X_test_scaler.
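A sketch of these steps; the choice of k here (n_neighbors=5) is an illustrative assumption:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, then transform both splits
scaler = StandardScaler()
X_train_scaler = scaler.fit_transform(X_train)
X_test_scaler = scaler.transform(X_test)

# n_neighbors=5 is an example value of k, not a prescribed setting
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaler, y_train)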
After training the model, calculate the accuracy of both the training and testing sets using an appropriate metric. The code should be run in a Jupyter Notebook, and the results stored in the variables train_accuracy and test_accuracy for the training and testing sets, respectively.
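As before, score() gives mean accuracy; remember to evaluate on the scaled features:

train_accuracy = knn.score(X_train_scaler, y_train)
test_accuracy = knn.score(X_test_scaler, y_test)
print(train_accuracy, test_accuracy)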
As with the decision tree, the expected accuracy depends on the problem and the data, but exceeding 70% is a reasonable target here.