NLP Chatbot Interaction Data Cleaning and Analysis Project
Data Science Project: Data Cleaning with Pandas

Dive into the world of chatbots with this hands-on NLP data project! Use Pandas magic to clean messy data. Fill gaps, remove outliers, and standardize formats. Slice and dice timestamps to uncover chat trends. Wrangle text data to extract insights on sentiment and topics. Use .apply(), .fillna(), and .groupby() to transform raw logs into analysis-ready gold. Polish your data cleaning skills while exploring the inner workings of AI chatbots!
Project Created by

Dhrubaraj Roy

Project Activities

All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.

All our activities include solutions with explanations on how they work and why we chose them.

Fill `chatbot_name` gaps using topic and language groups

  1. Group the data by topic and language:

    • Use value_counts() to find the most common chatbot for each combination.

    • Handle cases where there might be no common chatbot for a combination.

  2. Create a function to fill missing chatbot names:

    • If a chatbot name is missing, replace it with the most common one for that topic-language pair.

    • If there's no common chatbot for that combination, leave the missing value as None.

  3. Apply the function to fill missing values in the chatbot_name column.

  4. Replace any remaining missing values with 'Unknown'.
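
One way these steps might look in pandas (the df here is an invented stand-in for the project's dataset):

```python
import pandas as pd

# Hypothetical sample data standing in for the project's dataset
df = pd.DataFrame({
    "topic": ["billing", "billing", "billing", "tech", "tech", "sales"],
    "language": ["en", "en", "en", "es", "es", "fr"],
    "chatbot_name": ["Ava", "Ava", None, None, "Max", None],
})

# Most common chatbot per (topic, language); groups with no known
# chatbot simply have no entry in this Series
most_common = (
    df.dropna(subset=["chatbot_name"])
      .groupby(["topic", "language"])["chatbot_name"]
      .agg(lambda s: s.value_counts().idxmax())
)

def fill_name(row):
    # Missing name -> the group's most common chatbot, or None when
    # the group has no known chatbot at all
    if pd.isna(row["chatbot_name"]):
        return most_common.get((row["topic"], row["language"]), None)
    return row["chatbot_name"]

df["chatbot_name"] = df.apply(fill_name, axis=1).fillna("Unknown")
```

Note the ("sales", "fr") row: no chatbot is known for that group, so it stays None until the final fillna turns it into 'Unknown'.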

Sanitize `word_count` by handling negatives and filling gaps

  1. Handle negative values in the word_count column:

    • Replace any negative values with NaN (Not a Number).
  2. Calculate the median word count:

    • Use only the remaining valid values (non-negative and non-NaN).
  3. Fill missing values:

    • Use the calculated median to fill in all NaN values in the word_count column.
  4. Standardize data type:

    • Convert all values in the word_count column to integers.
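
These four steps map to a few lines of pandas (sample values are invented):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"word_count": [10, -5, 30, None, 20]})  # invented sample

# 1. Negative counts are data errors: mark them as missing
df.loc[df["word_count"] < 0, "word_count"] = np.nan

# 2. Median of the remaining valid (non-NaN) values
median_wc = df["word_count"].median()

# 3. Fill the gaps with the median, then 4. cast to integers
df["word_count"] = df["word_count"].fillna(median_wc).astype(int)
```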

Fill `sentiment` gaps using mode

  1. Calculate the mode of non-null sentiments:

    • Determine the most frequent sentiment value among existing data.
  2. Fill missing values:

    • For rows where sentiment is currently missing, assign the calculated mode value.

Identify `user_age` outliers using IQR method

  1. Calculate quartiles of the user_age column:

    • Compute the first quartile (Q1) as the 25th percentile.
    • Compute the third quartile (Q3) as the 75th percentile.
  2. Calculate the Interquartile Range (IQR):

    • Subtract Q1 from Q3 to get the IQR.
  3. Determine the bounds for outliers:

    • Calculate the lower bound as Q1 - 1.5 * IQR.
    • Calculate the upper bound as Q3 + 1.5 * IQR.
  4. Identify outliers:

    • Find values falling below the lower bound or above the upper bound.
    • Store these outliers in a variable called outliers.
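
A sketch of the IQR method in pandas (the ages below are invented; the project's df supplies the real column):

```python
import pandas as pd

df = pd.DataFrame({"user_age": [15, 22, 25, 28, 30, 31, 34, 35, 38, 120]})

q1 = df["user_age"].quantile(0.25)      # first quartile
q3 = df["user_age"].quantile(0.75)      # third quartile
iqr = q3 - q1                           # interquartile range

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Rows whose age falls outside the bounds
outliers = df[(df["user_age"] < lower_bound) | (df["user_age"] > upper_bound)]
```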

Clean `user_age` using limits and median filling

  1. Define age limits for the user_age column:

    • Set minimum age to 13 years old.
    • Set maximum age to 100 years old.
  2. Clean age data outside the defined limits:

    • Replace ages below 13 or above 100 with NaN.
  3. Handle missing age values:

    • Calculate the median age from the dataset.
    • Fill all NaN values with the calculated median age.
  4. Standardize age data type:

    • Convert all ages in the user_age column to integers.

Note: Use df.head().T to view your dataframe. It displays the first rows as columns, which makes the data easier to scan horizontally.
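
One possible implementation of the steps above (sample ages invented):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"user_age": [25, 8, 150, np.nan, 40, 33]})  # invented sample

MIN_AGE, MAX_AGE = 13, 100

# 1.-2. Ages outside the plausible range become missing
out_of_range = (df["user_age"] < MIN_AGE) | (df["user_age"] > MAX_AGE)
df.loc[out_of_range, "user_age"] = np.nan

# 3. Fill the gaps with the median of the valid ages, then 4. cast to int
df["user_age"] = df["user_age"].fillna(df["user_age"].median()).astype(int)
```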

Sort entries by `date` and `response_time` after conversion

  1. Standardize the date column:

    • Convert the date column to datetime format.
  2. Sort the dataset:

    • Sort by date in ascending order (earliest to latest).
    • Within each date, sort by response_time in descending order.
    • Update the dataframe in place with the new sorting.
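
In pandas this is a conversion plus one sort_values call (sample rows invented):

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2024-01-02", "2024-01-01", "2024-01-01"],   # invented sample
    "response_time": [1.2, 0.5, 3.4],
})

# 1. Standardize the date column
df["date"] = pd.to_datetime(df["date"])

# 2. date ascending, response_time descending within each date
df.sort_values(["date", "response_time"], ascending=[True, False], inplace=True)
```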

Remove `user_id` and `date` duplicates, keep first entries

  1. Remove duplicate entries:

    • Identify duplicates based on user_id and date columns.
    • Keep only the first entry for each user per day.
    • Perform the operation in place, modifying the original dataframe.
  2. Reset the dataframe index:

    • Discard the old index.
    • Create a new sequential index.
    • Update the dataframe in place.

Remove duplicate `chat_id` entries, retain first occurrences

  1. Identify and remove duplicate entries:

    • Use the chat_id column as the basis for identifying duplicates.
    • Keep only the first occurrence of each unique chat ID.
  2. Update the dataframe:

    • Perform the deduplication operation in place.
    • Modify the original dataframe directly.

Purge the `message` Echo Chamber

  1. Create a composite key:

    • Combine message, topic, and language columns.
    • Use underscore ('_') as a separator between the values.
    • Store the result in a new column called message_context.
  2. Remove duplicate entries:

    • Use the message_context column to identify duplicates.
    • Keep only the first occurrence of each unique composite key.
  3. Clean up temporary data:

    • Delete the message_context column.
  4. Update the dataframe:

    • Replace the original dataframe with the deduplicated version.
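
The composite-key dedup might look like this (sample rows invented):

```python
import pandas as pd

df = pd.DataFrame({
    "message": ["hi", "hi", "bye"],          # invented sample
    "topic": ["greet", "greet", "greet"],
    "language": ["en", "en", "en"],
})

# 1. Composite key joining message, topic, and language with underscores
df["message_context"] = df["message"] + "_" + df["topic"] + "_" + df["language"]

# 2. Keep only the first occurrence of each key, then 3.-4. drop the helper
df = df.drop_duplicates(subset="message_context").drop(columns="message_context")
```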

Update `subscription_type` to most recent or premium per user

  1. Define subscription hierarchy:

    • Assign numeric values to subscription types: 'enterprise': 3, 'premium': 2, 'basic': 1, 'free': 0
  2. Create a temporary numeric column called sub_rank (the subscription rank):

    • Map the subscription_type to its corresponding numeric value.
  3. Sort the data:

    • Sort by user ID (ascending)
    • Then by date (ascending)
    • Finally by subscription rank (descending)
  4. Keep only the most recent or premium subscription:

    • Drop duplicates based on user_id
    • Keep the last entry for each user
  5. Clean up:

    • Remove the temporary sub_rank column
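
The five steps above, sketched in pandas (users and dates are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 2, 2],                 # invented sample
    "date": pd.to_datetime(["2024-01-01", "2024-01-05",
                            "2024-01-02", "2024-01-03"]),
    "subscription_type": ["premium", "free", "basic", "enterprise"],
})

# 1.-2. Map each subscription type to its rank in a helper column
rank = {"enterprise": 3, "premium": 2, "basic": 1, "free": 0}
df["sub_rank"] = df["subscription_type"].map(rank)

# 3. Sort so the entry to keep ends up last within each user
df.sort_values(["user_id", "date", "sub_rank"],
               ascending=[True, True, False], inplace=True)

# 4. One row per user, 5. drop the helper column
df.drop_duplicates(subset="user_id", keep="last", inplace=True)
df.drop(columns="sub_rank", inplace=True)
```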

Capitalize first letter of each word in `topic` column

Modify the topic column by capitalizing the first letter of each word.

Extract month name from `last_login` after datetime conversion

  1. Convert the last_login column to datetime format.
  2. Extract the month name from this column.
  3. Store the extracted month name in a new column called login_month.

Generate `chatbot_initials` from `chatbot_name` column

  1. Define a function that extracts initials from a name:
    • Take the first letter of each word in the name.
    • Combine these letters in uppercase.
  2. Apply this function to the chatbot_name column.
  3. Store the result in a new column called chatbot_initials.
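
A minimal version of this function (chatbot names invented):

```python
import pandas as pd

df = pd.DataFrame({"chatbot_name": ["Chat Helper Bot", "ava", "Max Power"]})

def get_initials(name):
    # First letter of each word, joined and uppercased
    return "".join(word[0] for word in name.split()).upper()

df["chatbot_initials"] = df["chatbot_name"].apply(get_initials)
```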

Categorize `session_duration` as Short, Medium, or Long

  1. Create a function to categorize session durations:
    • Less than 30 minutes: 'Short'
    • 30 to less than 60 minutes: 'Medium'
    • 60 minutes or more: 'Long'
  2. Apply this function to the session_duration column.
  3. Store the result in a new column called duration_category.

Create `full_chat_id` Using `date` and `chat_id`

  1. Convert the date column to a string format '%Y%m%d'.
  2. Combine this formatted date with the existing chat_id, separating them with an underscore.
  3. Store the result in a new column called full_chat_id.
  4. Remove the temporary formatted date column.

Group `user_age` into Demographic Categories

  1. Create a function to categorize ages:
    • 'Unknown': for missing or negative values
    • 'Under 18': for ages below 18
    • 'Young Adult': for ages 18-29
    • 'Adult': for ages 30-49
    • 'Senior': for ages 50 and above
  2. Apply this function to the user_age column.
  3. Store the result in a new column called age_group.

Combine `user_country` and `language` for Location Info

Task: Create a function to map language codes to full language names and generate a new user_location column in a DataFrame.

  1. Define a function that:

    • Uses a dictionary to map language codes to full names.
    • Includes mappings for: 'en' (English), 'es' (Spanish), 'fr' (French), 'de' (German), and 'it' (Italian).
    • Returns 'Other' for any code not in the dictionary.
  2. Apply this function to the language column to get full language names.

  3. Create a new user_location column by combining user_country and the full language name in the format Country (Language).

  4. Handle missing or unknown data:

    • Replace entries where the country is NaN or 'Unknown' and the language is 'Other' with 'Unknown'.
  5. Ensure the original language column remains unchanged.

Normalize `response_time` by Removing Negatives and Outliers

  1. Remove negative values from the response_time column:

    • Replace any negative values with zero.
  2. Handle outliers using the 99th percentile:

    • Calculate the 99th percentile of response times.
    • Cap any values above this percentile to the 99th percentile value.
  3. Improve precision:

    • Round all response times to two decimal places.
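
The three steps above might be written as (sample times invented):

```python
import pandas as pd

df = pd.DataFrame({"response_time": [-0.4, 1.234, 2.5, 50.0, 3.1]})  # invented

# 1. Negative times are impossible: floor them at zero
df.loc[df["response_time"] < 0, "response_time"] = 0

# 2. Cap extreme values at the 99th percentile, 3. round to 2 decimals
p99 = df["response_time"].quantile(0.99)
df["response_time"] = df["response_time"].clip(upper=p99).round(2)
```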

Standardize `language` Codes and Handle Invalid Entries

  1. Create a list of valid language codes:

    • Include 'en', 'es', 'fr', 'de', and 'it'.
  2. Create a function to clean and validate language codes. This function should:

    • Convert codes to lowercase.
    • Remove any surrounding whitespace.
    • Check if the code is in the valid list.
    • Assign 'other' if the code is not in the valid list.
    • Assign 'unknown' if the code is missing or NaN.
  3. Apply this cleaning function to the language column.

Clean and Normalize `accuracy_score` Column

  1. Create a function to clean accuracy_scores. This function should:

    • Handle missing values by preserving them as NaN.
    • Convert scores greater than 1 to decimals by dividing by 100 (assuming they were given as percentages).
    • Ensure all other values fall between 0 and 1 using the max and min functions.
  2. Apply this cleaning function to the accuracy_score column.
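
One way to express this cleaning function (sample scores invented):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"accuracy_score": [0.85, 92.0, np.nan, 1.5, -0.2]})  # invented

def clean_accuracy(score):
    if pd.isna(score):
        return np.nan                 # preserve missing values
    if score > 1:
        score = score / 100           # percentage -> decimal
    return max(0.0, min(1.0, score))  # clamp into [0, 1]

df["accuracy_score"] = df["accuracy_score"].apply(clean_accuracy)
```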

Standardize `device_type` Column Values

  1. Create a list of valid device types:

    • Include 'mobile', 'tablet', and 'desktop'.
  2. Create a function to clean device types. This function should:

    • Handle missing values by replacing them with 'unknown'.
    • For non-missing values:
      • Convert the input to lowercase.
      • Remove any leading or trailing spaces.
      • Check if the cleaned value is in the list of valid device types.
      • Keep the value if it's valid; otherwise, label it as 'other'.
  3. Apply this cleaning function to the device_type column.

Normalize `is_returning_user` to Boolean

  1. Create a function to clean the is_returning_user values. This function should:

    • Handle missing values by returning NaN.
    • Preserve existing boolean values.
    • For string inputs:
      • Convert to lowercase.
      • Check if the value is 'true', 'yes', or '1'.
    • For other types, convert to boolean.
  2. Apply this cleaning function to the is_returning_user column.

  3. Convert the entire column to boolean type:

    • Treat NaN values as False.
    • Assume unknown users are new users.
  4. Fill any remaining NaN values with False.
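
A sketch of these steps, assuming the column mixes booleans, strings, numbers, and gaps (sample values invented):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"is_returning_user": [True, "Yes", "0", np.nan, 1]})  # invented

def to_bool(value):
    if pd.isna(value):
        return np.nan                            # keep missing for now
    if isinstance(value, bool):
        return value                             # already boolean
    if isinstance(value, str):
        return value.strip().lower() in ("true", "yes", "1")
    return bool(value)                           # other types, e.g. 0/1

df["is_returning_user"] = df["is_returning_user"].apply(to_bool)

# Unknown users are assumed to be new: NaN -> False, then cast to bool
df["is_returning_user"] = df["is_returning_user"].fillna(False).astype(bool)
```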

Standardize `sentiment` Column Categories

  1. Create a list of valid sentiment categories:

    • Include 'positive', 'neutral', and 'negative'.
  2. Create a function to normalize sentiment values. This function should:

    • Handle missing values by replacing them with 'unknown'.
    • For non-missing values:
      • Convert the input to lowercase.
      • Remove any leading or trailing spaces.
      • Check if the cleaned sentiment is in the list of valid categories.
      • Keep the sentiment if it's valid; otherwise, label it as 'unknown'.
  3. Apply this cleaning function to the sentiment column.

Normalize `user_country` Names

  1. Create a dictionary mapping common country names and abbreviations to their standardized forms. Include the following:

    • 'usa', 'us', and 'united states' for 'United States'
    • 'uk', 'united kingdom', and 'great britain' for 'United Kingdom'
    • 'germany' for 'Germany'
    • 'france' for 'France'
    • 'italy' for 'Italy'
    • 'spain' for 'Spain'
    • 'canada' for 'Canada'
    • 'australia' for 'Australia'
  2. Create a function to clean country names. This function should:

    • Handle missing values by replacing them with 'Unknown'.
    • For non-missing values:
      • Convert the input to lowercase.
      • Remove any leading or trailing spaces.
      • Use the dictionary to look up the standardized form.
      • If not found in the dictionary, capitalize the first letter of each word.
  3. Apply this cleaning function to the user_country column.

Remove `session_duration` Outliers Using IQR

  1. Calculate the Interquartile Range (IQR):

    • Compute the first quartile (Q1) and third quartile (Q3) of session durations.
    • Calculate IQR by subtracting Q1 from Q3.
  2. Define bounds for outliers:

    • Set lower bound as Q1 minus 1.5 times the IQR.
    • Set upper bound as Q3 plus 1.5 times the IQR.
  3. Create a function to handle outliers. This function should:

    • Replace values below the lower bound with the lower bound.
    • Replace values above the upper bound with the upper bound.
    • Keep values between these bounds unchanged.
  4. Apply this function to the session_duration column to remove extreme outliers.

Extract Day of Week from `date` Column

  1. Extract the day of the week from the date column:

    • Create a new column called chat_day_of_week.
    • Use the datetime method to get the day name for each date.
  2. Create a list defining the order of days:

    • Include days from Monday to Sunday in the correct order.
  3. Count the number of chats for each day of the week:

    • Use the predefined list to ensure correct ordering of days.
    • Store these counts in a variable called chat_counts.
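
Put together, the steps might look like this (sample dates invented):

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(
    ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-07"])})  # invented

# 1. Day name for each chat
df["chat_day_of_week"] = df["date"].dt.day_name()

# 2. Calendar order of the days
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# 3. Counts in Monday-to-Sunday order; absent days count as 0
chat_counts = (df["chat_day_of_week"]
               .value_counts()
               .reindex(day_order, fill_value=0))
```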

Calculate Days Since Last Login and Clean Data

  1. Calculate the days between date and last_login:

    • Subtract last_login from date.
    • Extract the number of days from the resulting timedelta.
    • Store the result in a new column called days_since_last_login.
  2. Handle potential data errors:

    • Ensure all values in days_since_last_login are non-negative.
    • Set any negative values to zero.

Analyze Weekly Chat Trends Using `date` Column

  1. Extract the week number from the date column:

    • Create a new column called week_number.
    • Use the ISO calendar system to determine the week number.
  2. Group the data by week number and count chats:

    • Group the DataFrame by the week_number column.
    • Count the number of chats for each week.
    • Store the results in a new DataFrame called weekly_chats.
    • Ensure weekly_chats has columns for week number and chat count.
  3. Calculate the running total of chats:

    • Compute the cumulative sum of chat counts.
    • Add this as a new column called 'cumulative_chats' to weekly_chats.
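
The weekly aggregation could be sketched as (sample dates invented):

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(
    ["2024-01-01", "2024-01-03", "2024-01-08",
     "2024-01-10", "2024-01-15"])})  # invented sample

# 1. ISO week number for each chat
df["week_number"] = df["date"].dt.isocalendar().week

# 2. Chats per week as a small summary DataFrame
weekly_chats = (df.groupby("week_number")
                  .size()
                  .reset_index(name="chat_count"))

# 3. Running total of chats across the weeks
weekly_chats["cumulative_chats"] = weekly_chats["chat_count"].cumsum()
```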
Project created by Dhrubaraj Roy, Project Author at DataWars, responsible for leading the development and delivery of innovative machine learning and data science projects.