All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.
All our activities include solutions with explanations on how they work and why we chose them.
Group the data by topic and language:
Use value_counts()
to find the most common chatbot for each combination.
Handle cases where there might be no common chatbot for a combination.
Create a function to fill missing chatbot names:
If a chatbot name is missing, replace it with the most common one for that topic-language pair.
If there's no common chatbot for a particular combination when a value is missing, keep it as None
type.
Apply the function to fill missing values in the chatbot_name
column.
Replace any remaining missing values with Unknown
.
Handle negative values in the word_count
column:
NaN
(Not a Number).Calculate the median word count:
Fill missing values:
NaN
values in the word_count
column.Standardize data type:
word_count
column to integers.Calculate the mode of non-null sentiments:
Fill missing values:
Calculate quartiles of the user_age
column:
Calculate the Interquartile Range (IQR):
Determine the bounds for outliers:
Identify outliers:
outliers
.Define age limits for the user_age
column:
Clean age data outside the defined limits:
Handle missing age values:
Standardize age data type:
user_age
column to integers.Note : Use
df.head().T
for viewing yourdf
.df.head().T
provides a compact way to view the initial rows as columns, making it easier to scan the data horizontally.
Standardize the date
column:
date
column to datetime format.Sort the dataset:
date
in ascending order (earliest to latest).response_time
in descending order.Remove duplicate entries:
user_id
and date
columns.Reset the dataframe index:
Identify and remove duplicate entries:
chat_id
column as the basis for identifying duplicates.Update the dataframe:
Create a composite key:
message
, topic
, and language
columns.message_context
.Remove duplicate entries:
message_context
column to identify duplicates.Clean up temporary data:
message_context
column.Update the dataframe:
Define subscription hierarchy:
'enterprise'
: 3, 'premium'
: 2, 'basic'
: 1, 'free'
: 0Create a temporary numeric column subscription rank sub_rank
:
subscription_type
to its corresponding numeric value.Sort the data:
Keep only the most recent or premium subscription:
user_id
Clean up:
sub_rank
columnModify the topic
column by capitalizing the first letter of each word.
last_login
column to datetime format.login_month
.chatbot_name
column.chatbot_initials
.'Short'
'Medium'
'Long'
session_duration
column.duration_category
.date
column to a string format '%Y%m%d'
.chat_id
, separating them with an underscore.full_chat_id
.'Unknown'
: for missing or negative values'Under 18'
: for ages below 18'Young Adult'
: for ages 18-29'Adult'
: for ages 30-49'Senior'
: for ages 50 and aboveuser_age
column.age_group
.Task: Create a function to map language codes to full language names and generate a new user_location
column in a DataFrame.
Define a function that:
'en'
(English), 'es'
(Spanish), 'fr'
(French), 'de'
(German), and 'it'
(Italian).'Other'
for any code not in the dictionary.Apply this function to the language
column to get full language names.
Create a new user_location
column by combining user_country
and the full language name in the format Country (Language)
.
Handle missing or unknown data:
'Unknown'
and the language is 'Other'
with 'Unknown'
.Ensure the original language
column remains unchanged.
Remove negative values from the response_time
column:
Handle outliers using the 99th percentile:
Improve precision:
Create a list of valid language codes:
'en'
, 'es'
, 'fr'
, 'de'
, and 'it'
.Create a function to clean and validate language codes. This function should:
'other'
if the code is not in the valid list.'unknown'
if the code is missing or NaN
.Apply this cleaning function to the language
column.
Create a function to clean accuracy_scores
. This function should:
NaN
.Apply this cleaning function to the accuracy_score
column.
Create a list of valid device types:
'mobile'
, 'tablet'
, and 'desktop'
.Create a function to clean device types. This function should:
'unknown'
.'other'
.Apply this cleaning function to the device_type
column.
Create a function to clean the is_returning_user
values. This function should:
NaN
.'true'
, 'yes'
, or '1'
.Apply this cleaning function to the is_returning_user
column.
Convert the entire column to boolean type:
Fill any remaining NaN values with False.
Create a list of valid sentiment categories:
positive
, neutral
, and negative
.Create a function to normalize sentiment values. This function should:
'unknown'
.'unknown'
.Apply this cleaning function to the sentiment
column.
Create a dictionary mapping common country names and abbreviations to their standardized forms. Include the following:
'usa'
, 'us'
, and 'united states'
for 'United States'
'uk'
, 'united kingdom'
, and 'great britain'
for 'United Kingdom'
'germany'
for 'Germany'
'france'
for 'France'
'italy'
for 'Italy'
'spain'
for 'Spain'
'canada'
for 'Canada'
'australia'
for 'Australia'
Create a function to clean country names. This function should:
'Unknown'
.Apply this cleaning function to the user_country
column.
Calculate the Interquartile Range (IQR):
Define bounds for outliers:
Create a function to handle outliers. This function should:
Apply this function to the session_duration
column to remove extreme outliers.
Extract the day of the week from the date
column:
chat_day_of_week
.Create a list defining the order of days:
Count the number of chats for each day of the week:
chat_counts
.Calculate the days between date
and last_login
:
last_login
from date
.days_since_last_login
.Handle potential data errors:
days_since_last_login
are non-negative.Extract the week number from the date
column:
week_number
.Group the data by week number and count chats:
week_number
column.weekly_chats
.weekly_chats
has columns for week number and chat count.Calculate the running total of chats:
weekly_chats
.