NLP Chatbot Interaction Data Cleaning and Analysis Project
Data Science Project: Data Cleaning with Pandas

Dive into the world of chatbots with this hands-on NLP data project! Use Pandas magic to clean messy data. Fill gaps, remove outliers, and standardize formats. Slice and dice timestamps to uncover chat trends. Wrangle text data to extract insights on sentiment and topics. Use .apply(), .fillna(), and .groupby() to transform raw logs into analysis-ready gold. Polish your data cleaning skills while exploring the inner workings of AI chatbots!
Project Created by

Dhrubaraj Roy

Project Activities

All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.

All our activities include solutions with explanations on how they work and why we chose them.

Fill `chatbot_name` gaps using topic and language groups

  1. Group the data by topic and language:

    • Use value_counts() to find the most common chatbot for each combination.

    • Handle cases where there might be no common chatbot for a combination.

  2. Create a function to fill missing chatbot names:

    • If a chatbot name is missing, replace it with the most common one for that topic-language pair.

    • If there's no common chatbot for that combination, leave the missing value as None.

  3. Apply the function to fill missing values in the chatbot_name column.

  4. Replace any remaining missing values with 'Unknown'.
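
One way these steps might look in pandas (the df here is an invented stand-in for the project's dataset):

```python
import pandas as pd

# Hypothetical sample data standing in for the project's dataset
df = pd.DataFrame({
    "topic": ["billing", "billing", "billing", "tech", "tech", "sales"],
    "language": ["en", "en", "en", "es", "es", "fr"],
    "chatbot_name": ["Ava", "Ava", None, None, "Max", None],
})

# Most common chatbot per (topic, language); groups with no known
# chatbot simply have no entry in this Series
most_common = (
    df.dropna(subset=["chatbot_name"])
      .groupby(["topic", "language"])["chatbot_name"]
      .agg(lambda s: s.value_counts().idxmax())
)

def fill_name(row):
    # Missing name -> the group's most common chatbot, or None when
    # the group has no known chatbot at all
    if pd.isna(row["chatbot_name"]):
        return most_common.get((row["topic"], row["language"]), None)
    return row["chatbot_name"]

df["chatbot_name"] = df.apply(fill_name, axis=1).fillna("Unknown")
```

Note the ("sales", "fr") row: no chatbot is known for that group, so it stays None until the final fillna turns it into 'Unknown'.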

Sanitize `word_count` by handling negatives and filling gaps

  1. Handle negative values in the word_count column:

    • Replace any negative values with NaN (Not a Number).
  2. Calculate the median word count:

    • Use only the remaining valid values (non-negative and non-NaN).
  3. Fill missing values:

    • Use the calculated median to fill in all NaN values in the word_count column.
  4. Standardize data type:

    • Convert all values in the word_count column to integers.
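
These four steps map to a few lines of pandas (sample values are invented):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"word_count": [10, -5, 30, None, 20]})  # invented sample

# 1. Negative counts are data errors: mark them as missing
df.loc[df["word_count"] < 0, "word_count"] = np.nan

# 2. Median of the remaining valid (non-NaN) values
median_wc = df["word_count"].median()

# 3. Fill the gaps with the median, then 4. cast to integers
df["word_count"] = df["word_count"].fillna(median_wc).astype(int)
```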

Fill `sentiment` gaps using mode

  1. Calculate the mode of non-null sentiments:

    • Determine the most frequent sentiment value among existing data.
  2. Fill missing values:

    • For rows where sentiment is currently missing, assign the calculated mode value.

Identify `user_age` outliers using IQR method

  1. Calculate quartiles of the user_age column:

    • Compute the first quartile (Q1) as the 25th percentile.
    • Compute the third quartile (Q3) as the 75th percentile.
  2. Calculate the Interquartile Range (IQR):

    • Subtract Q1 from Q3 to get the IQR.
  3. Determine the bounds for outliers:

    • Calculate the lower bound as Q1 - 1.5 * IQR.
    • Calculate the upper bound as Q3 + 1.5 * IQR.
  4. Identify outliers:

    • Find values falling below the lower bound or above the upper bound.
    • Store these outliers in a variable called outliers.
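
A sketch of the IQR method in pandas (the ages below are invented; the project's df supplies the real column):

```python
import pandas as pd

df = pd.DataFrame({"user_age": [15, 22, 25, 28, 30, 31, 34, 35, 38, 120]})

q1 = df["user_age"].quantile(0.25)      # first quartile
q3 = df["user_age"].quantile(0.75)      # third quartile
iqr = q3 - q1                           # interquartile range

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Rows whose age falls outside the bounds
outliers = df[(df["user_age"] < lower_bound) | (df["user_age"] > upper_bound)]
```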

Clean `user_age` using limits and median filling

  1. Define age limits for the user_age column:

    • Set minimum age to 13 years old.
    • Set maximum age to 100 years old.
  2. Clean age data outside the defined limits:

    • Replace ages below 13 or above 100 with NaN.
  3. Handle missing age values:

    • Calculate the median age from the dataset.
    • Fill all NaN values with the calculated median age.
  4. Standardize age data type:

    • Convert all ages in the user_age column to integers.

Note: Use df.head().T to view your dataframe. It displays the first rows as columns, which makes the data easier to scan horizontally.
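
One possible implementation of the steps above (sample ages invented):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"user_age": [25, 8, 150, np.nan, 40, 33]})  # invented sample

MIN_AGE, MAX_AGE = 13, 100

# 1.-2. Ages outside the plausible range become missing
out_of_range = (df["user_age"] < MIN_AGE) | (df["user_age"] > MAX_AGE)
df.loc[out_of_range, "user_age"] = np.nan

# 3. Fill the gaps with the median of the valid ages, then 4. cast to int
df["user_age"] = df["user_age"].fillna(df["user_age"].median()).astype(int)
```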

Sort entries by `date` and `response_time` after conversion

  1. Standardize the date column:

    • Convert the date column to datetime format.
  2. Sort the dataset:

    • Sort by date in ascending order (earliest to latest).
    • Within each date, sort by response_time in descending order.
    • Update the dataframe in place with the new sorting.
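
In pandas this is a conversion plus one sort_values call (sample rows invented):

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2024-01-02", "2024-01-01", "2024-01-01"],   # invented sample
    "response_time": [1.2, 0.5, 3.4],
})

# 1. Standardize the date column
df["date"] = pd.to_datetime(df["date"])

# 2. date ascending, response_time descending within each date
df.sort_values(["date", "response_time"], ascending=[True, False], inplace=True)
```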

Remove `user_id` and `date` duplicates, keep first entries

  1. Remove duplicate entries:

    • Identify duplicates based on user_id and date columns.
    • Keep only the first entry for each user per day.
    • Perform the operation in place, modifying the original dataframe.
  2. Reset the dataframe index:

    • Discard the old index.
    • Create a new sequential index.
    • Update the dataframe in place.

Remove duplicate `chat_id` entries, retain first occurrences

  1. Identify and remove duplicate entries:

    • Use the chat_id column as the basis for identifying duplicates.
    • Keep only the first occurrence of each unique chat ID.
  2. Update the dataframe:

    • Perform the deduplication operation in place.
    • Modify the original dataframe directly.

Purge the `message` Echo Chamber

  1. Create a composite key:

    • Combine message, topic, and language columns.
    • Use underscore ('_') as a separator between the values.
    • Store the result in a new column called message_context.
  2. Remove duplicate entries:

    • Use the message_context column to identify duplicates.
    • Keep only the first occurrence of each unique composite key.
  3. Clean up temporary data:

    • Delete the message_context column.
  4. Update the dataframe:

    • Replace the original dataframe with the deduplicated version.
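
The composite-key dedup might look like this (sample rows invented):

```python
import pandas as pd

df = pd.DataFrame({
    "message": ["hi", "hi", "bye"],          # invented sample
    "topic": ["greet", "greet", "greet"],
    "language": ["en", "en", "en"],
})

# 1. Composite key joining message, topic, and language with underscores
df["message_context"] = df["message"] + "_" + df["topic"] + "_" + df["language"]

# 2. Keep only the first occurrence of each key, then 3.-4. drop the helper
df = df.drop_duplicates(subset="message_context").drop(columns="message_context")
```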

Update `subscription_type` to most recent or premium per user

  1. Define subscription hierarchy:

    • Assign numeric values to subscription types: 'enterprise': 3, 'premium': 2, 'basic': 1, 'free': 0
  2. Create a temporary numeric column called sub_rank (the subscription rank):

    • Map the subscription_type to its corresponding numeric value.
  3. Sort the data:

    • Sort by user ID (ascending)
    • Then by date (ascending)
    • Finally by subscription rank (descending)
  4. Keep only the most recent or premium subscription:

    • Drop duplicates based on user_id
    • Keep the last entry for each user
  5. Clean up:

    • Remove the temporary sub_rank column
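
The five steps above, sketched in pandas (users and dates are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 2, 2],                 # invented sample
    "date": pd.to_datetime(["2024-01-01", "2024-01-05",
                            "2024-01-02", "2024-01-03"]),
    "subscription_type": ["premium", "free", "basic", "enterprise"],
})

# 1.-2. Map each subscription type to its rank in a helper column
rank = {"enterprise": 3, "premium": 2, "basic": 1, "free": 0}
df["sub_rank"] = df["subscription_type"].map(rank)

# 3. Sort so the entry to keep ends up last within each user
df.sort_values(["user_id", "date", "sub_rank"],
               ascending=[True, True, False], inplace=True)

# 4. One row per user, 5. drop the helper column
df.drop_duplicates(subset="user_id", keep="last", inplace=True)
df.drop(columns="sub_rank", inplace=True)
```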

Capitalize first letter of each word in `topic` column

Modify the topic column by capitalizing the first letter of each word.

Extract month name from `last_login` after datetime conversion

  1. Convert the last_login column to datetime format.
  2. Extract the month name from this column.
  3. Store the extracted month name in a new column called login_month.

Generate `chatbot_initials` from `chatbot_name` column

  1. Define a function that extracts initials from a name:
    • Take the first letter of each word in the name.
    • Combine these letters in uppercase.
  2. Apply this function to the chatbot_name column.
  3. Store the result in a new column called chatbot_initials.
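
A minimal version of this function (chatbot names invented):

```python
import pandas as pd

df = pd.DataFrame({"chatbot_name": ["Chat Helper Bot", "ava", "Max Power"]})

def get_initials(name):
    # First letter of each word, joined and uppercased
    return "".join(word[0] for word in name.split()).upper()

df["chatbot_initials"] = df["chatbot_name"].apply(get_initials)
```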

Categorize `session_duration` as Short, Medium, or Long

  1. Create a function to categorize session durations:
    • Less than 30 minutes: 'Short'
    • 30 to less than 60 minutes: 'Medium'
    • 60 minutes or more: 'Long'
  2. Apply this function to the session_duration column.
  3. Store the result in a new column called duration_category.

Create `full_chat_id` Using `date` and `chat_id`

  1. Convert the date column to a string format '%Y%m%d'.
  2. Combine this formatted date with the existing chat_id, separating them with an underscore.
  3. Store the result in a new column called full_chat_id.
  4. Remove the temporary formatted date column.

Group `user_age` into Demographic Categories

  1. Create a function to categorize ages:
    • 'Unknown': for missing or negative values
    • 'Under 18': for ages below 18
    • 'Young Adult': for ages 18-29
    • 'Adult': for ages 30-49
    • 'Senior': for ages 50 and above
  2. Apply this function to the user_age column.
  3. Store the result in a new column called age_group.

Combine `user_country` and `language` for Location Info

Task: Create a function to map language codes to full language names and generate a new user_location column in a DataFrame.

  1. Define a function that:

    • Uses a dictionary to map language codes to full names.
    • Includes mappings for: 'en' (English), 'es' (Spanish), 'fr' (French), 'de' (German), and 'it' (Italian).
    • Returns 'Other' for any code not in the dictionary.
  2. Apply this function to the language column to get full language names.

  3. Create a new user_location column by combining user_country and the full language name in the format Country (Language).

  4. Handle missing or unknown data:

    • Replace entries where the country is NaN or 'Unknown' and the language is 'Other' with 'Unknown'.
  5. Ensure the original language column remains unchanged.

Normalize `response_time` by Removing Negatives and Outliers

  1. Remove negative values from the response_time column:

    • Replace any negative values with zero.
  2. Handle outliers using the 99th percentile:

    • Calculate the 99th percentile of response times.
    • Cap any values above this percentile to the 99th percentile value.
  3. Improve precision:

    • Round all response times to two decimal places.
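
The three steps above might be written as (sample times invented):

```python
import pandas as pd

df = pd.DataFrame({"response_time": [-0.4, 1.234, 2.5, 50.0, 3.1]})  # invented

# 1. Negative times are impossible: floor them at zero
df.loc[df["response_time"] < 0, "response_time"] = 0

# 2. Cap extreme values at the 99th percentile, 3. round to 2 decimals
p99 = df["response_time"].quantile(0.99)
df["response_time"] = df["response_time"].clip(upper=p99).round(2)
```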

Standardize `language` Codes and Handle Invalid Entries

  1. Create a list of valid language codes:

    • Include 'en', 'es', 'fr', 'de', and 'it'.
  2. Create a function to clean and validate language codes. This function should:

    • Convert codes to lowercase.
    • Remove any surrounding whitespace.
    • Check if the code is in the valid list.
    • Assign 'other' if the code is not in the valid list.
    • Assign 'unknown' if the code is missing or NaN.
  3. Apply this cleaning function to the language column.

Clean and Normalize `accuracy_score` Column

  1. Create a function to clean accuracy_scores. This function should:

    • Handle missing values by preserving them as NaN.
    • Convert scores greater than 1 to decimals by dividing by 100 (assuming they were given as percentages).
    • Ensure all other values fall between 0 and 1 using the max and min functions.
  2. Apply this cleaning function to the accuracy_score column.
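
One way to express this cleaning function (sample scores invented):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"accuracy_score": [0.85, 92.0, np.nan, 1.5, -0.2]})  # invented

def clean_accuracy(score):
    if pd.isna(score):
        return np.nan                 # preserve missing values
    if score > 1:
        score = score / 100           # percentage -> decimal
    return max(0.0, min(1.0, score))  # clamp into [0, 1]

df["accuracy_score"] = df["accuracy_score"].apply(clean_accuracy)
```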

Standardize `device_type` Column Values

  1. Create a list of valid device types:

    • Include 'mobile', 'tablet', and 'desktop'.
  2. Create a function to clean device types. This function should:

    • Handle missing values by replacing them with 'unknown'.
    • For non-missing values:
      • Convert the input to lowercase.
      • Remove any leading or trailing spaces.
      • Check if the cleaned value is in the list of valid device types.
      • Keep the value if it's valid; otherwise, label it as 'other'.
  3. Apply this cleaning function to the device_type column.

Normalize `is_returning_user` to Boolean

  1. Create a function to clean the is_returning_user values. This function should:

    • Handle missing values by returning NaN.
    • Preserve existing boolean values.
    • For string inputs:
      • Convert to lowercase.
      • Check if the value is 'true', 'yes', or '1'.
    • For other types, convert to boolean.
  2. Apply this cleaning function to the is_returning_user column.

  3. Convert the entire column to boolean type:

    • Treat NaN values as False.
    • Assume unknown users are new users.
  4. Fill any remaining NaN values with False.
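
A sketch of these steps, assuming the column mixes booleans, strings, numbers, and gaps (sample values invented):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"is_returning_user": [True, "Yes", "0", np.nan, 1]})  # invented

def to_bool(value):
    if pd.isna(value):
        return np.nan                            # keep missing for now
    if isinstance(value, bool):
        return value                             # already boolean
    if isinstance(value, str):
        return value.strip().lower() in ("true", "yes", "1")
    return bool(value)                           # other types, e.g. 0/1

df["is_returning_user"] = df["is_returning_user"].apply(to_bool)

# Unknown users are assumed to be new: NaN -> False, then cast to bool
df["is_returning_user"] = df["is_returning_user"].fillna(False).astype(bool)
```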

Standardize `sentiment` Column Categories

  1. Create a list of valid sentiment categories:

    • Include 'positive', 'neutral', and 'negative'.
  2. Create a function to normalize sentiment values. This function should:

    • Handle missing values by replacing them with 'unknown'.
    • For non-missing values:
      • Convert the input to lowercase.
      • Remove any leading or trailing spaces.
      • Check if the cleaned sentiment is in the list of valid categories.
      • Keep the sentiment if it's valid; otherwise, label it as 'unknown'.
  3. Apply this cleaning function to the sentiment column.

Normalize `user_country` Names

  1. Create a dictionary mapping common country names and abbreviations to their standardized forms. Include the following:

    • 'usa', 'us', and 'united states' for 'United States'
    • 'uk', 'united kingdom', and 'great britain' for 'United Kingdom'
    • 'germany' for 'Germany'
    • 'france' for 'France'
    • 'italy' for 'Italy'
    • 'spain' for 'Spain'
    • 'canada' for 'Canada'
    • 'australia' for 'Australia'
  2. Create a function to clean country names. This function should:

    • Handle missing values by replacing them with 'Unknown'.
    • For non-missing values:
      • Convert the input to lowercase.
      • Remove any leading or trailing spaces.
      • Use the dictionary to look up the standardized form.
      • If not found in the dictionary, capitalize the first letter of each word.
  3. Apply this cleaning function to the user_country column.

Remove `session_duration` Outliers Using IQR

  1. Calculate the Interquartile Range (IQR):

    • Compute the first quartile (Q1) and third quartile (Q3) of session durations.
    • Calculate IQR by subtracting Q1 from Q3.
  2. Define bounds for outliers:

    • Set lower bound as Q1 minus 1.5 times the IQR.
    • Set upper bound as Q3 plus 1.5 times the IQR.
  3. Create a function to handle outliers. This function should:

    • Replace values below the lower bound with the lower bound.
    • Replace values above the upper bound with the upper bound.
    • Keep values between these bounds unchanged.
  4. Apply this function to the session_duration column to remove extreme outliers.

Extract Day of Week from `date` Column

  1. Extract the day of the week from the date column:

    • Create a new column called chat_day_of_week.
    • Use the datetime method to get the day name for each date.
  2. Create a list defining the order of days:

    • Include days from Monday to Sunday in the correct order.
  3. Count the number of chats for each day of the week:

    • Use the predefined list to ensure correct ordering of days.
    • Store these counts in a variable called chat_counts.
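
Put together, the steps might look like this (sample dates invented):

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(
    ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-07"])})  # invented

# 1. Day name for each chat
df["chat_day_of_week"] = df["date"].dt.day_name()

# 2. Calendar order of the days
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# 3. Counts in Monday-to-Sunday order; absent days count as 0
chat_counts = (df["chat_day_of_week"]
               .value_counts()
               .reindex(day_order, fill_value=0))
```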

Calculate Days Since Last Login and Clean Data

  1. Calculate the days between date and last_login:

    • Subtract last_login from date.
    • Extract the number of days from the resulting timedelta.
    • Store the result in a new column called days_since_last_login.
  2. Handle potential data errors:

    • Ensure all values in days_since_last_login are non-negative.
    • Set any negative values to zero.

Analyze Weekly Chat Trends Using `date` Column

  1. Extract the week number from the date column:

    • Create a new column called week_number.
    • Use the ISO calendar system to determine the week number.
  2. Group the data by week number and count chats:

    • Group the DataFrame by the week_number column.
    • Count the number of chats for each week.
    • Store the results in a new DataFrame called weekly_chats.
    • Ensure weekly_chats has columns for week number and chat count.
  3. Calculate the running total of chats:

    • Compute the cumulative sum of chat counts.
    • Add this as a new column called 'cumulative_chats' to weekly_chats.
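
The weekly aggregation could be sketched as (sample dates invented):

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(
    ["2024-01-01", "2024-01-03", "2024-01-08",
     "2024-01-10", "2024-01-15"])})  # invented sample

# 1. ISO week number for each chat
df["week_number"] = df["date"].dt.isocalendar().week

# 2. Chats per week as a small summary DataFrame
weekly_chats = (df.groupby("week_number")
                  .size()
                  .reset_index(name="chat_count"))

# 3. Running total of chats across the weeks
weekly_chats["cumulative_chats"] = weekly_chats["chat_count"].cumsum()
```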
Project created by Dhrubaraj Roy, Project Author at DataWars, responsible for leading the development and delivery of innovative machine learning and data science projects.