Advanced Data Cleaning Capstone Project Using International Football Player Dataset

Fill Missing `foot` Values with Mode

Check how many missing values exist in the foot column. Then, find the most common value (mode) in this column and use it to fill in the missing values. After filling in the blanks, verify that no missing values remain in the foot column.
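A minimal sketch of this step, assuming the data is loaded into a pandas DataFrame named `df`. The toy rows below are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the project's DataFrame; the values are invented.
df = pd.DataFrame({"foot": ["right", "left", None, "right", None]})

# Count the missing entries before filling.
missing_before = df["foot"].isna().sum()

# mode() ignores missing values and returns a Series; take its first entry.
most_common = df["foot"].mode()[0]

# Fill the gaps with the mode, then re-check.
df["foot"] = df["foot"].fillna(most_common)
missing_after = df["foot"].isna().sum()

print(missing_before, most_common, missing_after)
```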

Clean and Convert `international_matches` to Integers

For the international_matches column, replace any dashes with zeros. Then convert all the values in this column to integers. This will ensure that all entries in the column are numeric and can be used for calculations.
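One way this could look, assuming the column holds strings and that a dash is the entire cell value rather than part of a longer string. The sample values are invented.

```python
import pandas as pd

# Toy data; the real column is assumed to contain digit strings and lone dashes.
df = pd.DataFrame({"international_matches": ["12", "-", "3", "-"]})

# Series.replace matches whole cell values, so only cells equal to "-" change.
df["international_matches"] = (
    df["international_matches"].replace("-", "0").astype(int)
)

print(df["international_matches"].tolist())
```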

According to the dataset, what is Nick Pope's market value?

Which of the following statements are true about the dataset? (Select all that apply)

Standardize `market_value` to Millions of Euros

Create a function that transforms market values into millions of euros. This function should handle cases where the value is a dash, contains 'k' for thousands, or is already in millions. For cases where the value is a dash ('-'), replace it with 0 to represent missing or unavailable data. Remove any currency symbols and unit indicators. If the value is in thousands, divide it by 1000 to convert to millions. Apply this function to the market_value column to standardize all values to a consistent numerical format in millions of euros.
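The following sketch shows one possible shape for such a function. The name `to_millions` is hypothetical, and the input formats (`€25.00m`, `€500k`, `-`) are assumptions about how the raw strings look, so adjust them to the actual data.

```python
import pandas as pd

def to_millions(value):
    """Convert a market value string to a float in millions of euros."""
    # A lone dash stands for missing or unavailable data.
    if value == "-":
        return 0.0
    # Drop the currency symbol (assumed to be the euro sign) and whitespace.
    text = str(value).replace("€", "").strip()
    if text.endswith("k"):
        # Thousands: strip the unit and divide by 1000 to get millions.
        return float(text[:-1]) / 1000
    if text.endswith("m"):
        # Already in millions: strip the unit only.
        return float(text[:-1])
    return float(text)

# Toy data with invented values.
df = pd.DataFrame({"market_value": ["€25.00m", "€500k", "-"]})
df["market_value"] = df["market_value"].apply(to_millions)

print(df["market_value"].tolist())
```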

Extract and Clean Dates from `birthdate` Column

Create a function named clean_birthdate to extract the date from the birthdate string. This function should handle missing values and return them as Not-a-Time (NaT). Use a regular expression to find and extract the date in the format '%b %d, %Y'. If a valid date is found, convert it to a datetime object. If no valid date is found, return NaT. Apply this function to the birthdate column and store the results in a new column called birthdate_clean.
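A possible implementation is sketched below. It assumes birthdate strings look like `Apr 19, 1992 (32)`, that is, a date followed by an age in parentheses. The regular expression and the sample rows are illustrative assumptions.

```python
import re
import pandas as pd

def clean_birthdate(value):
    """Extract a '%b %d, %Y' date from a birthdate string, or return NaT."""
    if pd.isna(value):
        return pd.NaT
    # Three letters, a day, a comma, and a four-digit year, e.g. "Apr 19, 1992".
    match = re.search(r"[A-Za-z]{3} \d{1,2}, \d{4}", str(value))
    if match is None:
        return pd.NaT
    # errors="coerce" turns an unparseable match (e.g. a bad month) into NaT.
    return pd.to_datetime(match.group(0), format="%b %d, %Y", errors="coerce")

# Toy data with invented values.
df = pd.DataFrame({"birthdate": ["Apr 19, 1992 (32)", None, "unknown"]})
df["birthdate_clean"] = df["birthdate"].apply(clean_birthdate)

print(df["birthdate_clean"])
```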

Clean and Convert `debut` Dates

Create a function called clean_debut to clean and convert debut dates. This function should handle empty strings and missing values, returning them as Not-a-Time (NaT). For valid date strings, attempt to convert them to datetime objects using the format '%b %d, %Y'. If the conversion fails, return NaT. Apply this function to the debut column and store the cleaned dates in a new column named debut_clean.
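As a rough sketch, the function could be written as follows. The sample rows are invented, and `errors="coerce"` is used here as one way to return NaT when parsing fails.

```python
import pandas as pd

def clean_debut(value):
    """Parse a '%b %d, %Y' debut string, returning NaT when it cannot be parsed."""
    # Missing values and empty (or whitespace-only) strings become NaT.
    if pd.isna(value) or str(value).strip() == "":
        return pd.NaT
    # Strings that do not match the format are coerced to NaT.
    return pd.to_datetime(value, format="%b %d, %Y", errors="coerce")

# Toy data with invented values.
df = pd.DataFrame({"debut": ["Mar 27, 2018", "", None, "n/a"]})
df["debut_clean"] = df["debut"].apply(clean_debut)

print(df["debut_clean"])
```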

Fill Missing `debut_clean` with Estimated Adult Date

For any missing values in the debut_clean column, fill them with a calculated date. This calculated date should be the player's birthdate_clean plus 18 years (approximated as 18 * 365 days). If a debut_clean value already exists, keep it unchanged. Apply this logic to each row of the dataframe, updating the debut_clean column with either the existing or calculated debut date.
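The row-wise logic might be sketched like this, assuming both columns are already datetime typed. The helper name `fill_debut` and the sample dates are hypothetical.

```python
import pandas as pd

# Toy data with invented dates; the second player has no debut on record.
df = pd.DataFrame({
    "birthdate_clean": pd.to_datetime(["1992-04-19", "2000-01-01"]),
    "debut_clean": pd.to_datetime(["2018-03-27", None]),
})

def fill_debut(row):
    """Keep an existing debut date, else estimate it as birthdate + 18 * 365 days."""
    if pd.isna(row["debut_clean"]):
        return row["birthdate_clean"] + pd.Timedelta(days=18 * 365)
    return row["debut_clean"]

df["debut_clean"] = df.apply(fill_debut, axis=1)

print(df)
```

Because 18 * 365 days ignores leap days, the estimated date falls a few days before the player's actual 18th birthday.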

How many unique teams are represented in the dataset?

Identify Duplicate Names in `name` Column

Count how many times each name appears in the name column. Then, identify and display the names that occur more than once in the dataset. Store the count of each name in a variable called name_counts, and keep the names that appear multiple times in a variable named duplicate_names.
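A short sketch of this step using `value_counts`; the names below are placeholders, not rows from the dataset.

```python
import pandas as pd

# Toy data with placeholder names.
df = pd.DataFrame({"name": ["Player A", "Player B", "Player A", "Player C"]})

# Number of rows per name, sorted from most to least frequent.
name_counts = df["name"].value_counts()

# Keep only the names that occur more than once.
duplicate_names = name_counts[name_counts > 1]

print(duplicate_names)
```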

How many rows with the name `Dylan Maes` are in `df`?

Create Unique `player_id` and Flag Duplicates

Create a unique player_id column:
- Combine the 'name' and 'birthdate' columns to create a new player_id column.
- Use string concatenation with an underscore ('_') as the separator between the name and birthdate.
Check for duplicates in the new player_id column:
- Identify rows with duplicate player_ids.
- Mark all duplicates, not just subsequent occurrences.
- Store the result in a new DataFrame called duplicate_ids.
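The two steps above can be sketched as follows, assuming `name` and `birthdate` are both string columns. The sample rows are invented.

```python
import pandas as pd

# Toy data with placeholder names and invented birthdates.
df = pd.DataFrame({
    "name": ["Player A", "Player B", "Player A"],
    "birthdate": ["Apr 19, 1992 (32)", "Jan 01, 2000 (24)", "Apr 19, 1992 (32)"],
})

# Join name and birthdate with an underscore.
df["player_id"] = df["name"] + "_" + df["birthdate"]

# keep=False marks every member of a duplicate group, not just later repeats.
duplicate_ids = df[df["player_id"].duplicated(keep=False)]

print(duplicate_ids)
```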

Sort by `team_id` and `debut_clean`, then remove duplicates

Sort the dataframe by team_id and debut_clean columns, with team_id in ascending order and debut_clean in descending order (most recent first). Store this sorted dataframe in a new variable called df_sorted. Then, remove duplicate entries based on the combination of team_id and name, keeping only the first occurrence (which will be the most recent due to our sorting). Store this cleaned dataframe in a new variable called df_cleaned.
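A minimal sketch of the sort-then-deduplicate pattern, assuming `debut_clean` is already a datetime column. The rows are invented for illustration.

```python
import pandas as pd

# Toy data: "Player A" appears twice for team 1 with different debut dates.
df = pd.DataFrame({
    "team_id": [2, 1, 1],
    "name": ["Player X", "Player A", "Player A"],
    "debut_clean": pd.to_datetime(["2020-01-15", "2015-06-01", "2019-06-01"]),
})

# team_id ascending, debut_clean descending (most recent first within a team).
df_sorted = df.sort_values(["team_id", "debut_clean"], ascending=[True, False])

# The first row of each (team_id, name) pair is now the most recent debut.
df_cleaned = df_sorted.drop_duplicates(subset=["team_id", "name"], keep="first")

print(df_cleaned)
```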

Normalize `team_name` into new `normalized_team` column

Create a new column called normalized_team by transforming the team_name column. Convert all team names to lowercase, remove any occurrence of "fc" (case insensitive), and then convert the result to title case. This normalization process will help standardize team names across the dataset, making it easier to compare and analyze team-related information.
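The chain of string operations could look like this. The team names are invented examples, and the sketch follows the described steps literally without trimming whitespace.

```python
import pandas as pd

# Toy data with invented team names.
df = pd.DataFrame({"team_name": ["Arsenal FC", "FC Example"]})

df["normalized_team"] = (
    df["team_name"]
    .str.lower()                           # lowercase first, so "FC" and "fc" both match
    .str.replace("fc", "", regex=False)    # remove every literal "fc"
    .str.title()                           # capitalize each word
)

print(df["normalized_team"].tolist())
```

Note that removing "fc" this way leaves the adjacent space in place and also matches "fc" inside longer words, so a `.str.strip()` or a word-boundary regex may be worth considering if that matters for later comparisons.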

Standardize foot column into `dominant_foot`

Create a function called standardize_foot that takes a foot value as input. Convert the input to lowercase. If the input contains 'right', return 'Right'. If it contains 'left', return 'Left'. If it contains 'both', return 'Both'. For any other input, return 'Unknown'. Apply this function to the foot column and store the result in a new column called dominant_foot.

Note: The new column is added at the end of the DataFrame.
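A sketch of `standardize_foot` following the rules described above; the sample values are invented.

```python
import pandas as pd

def standardize_foot(value):
    """Map a raw foot value to 'Right', 'Left', 'Both', or 'Unknown'."""
    text = str(value).lower()
    if "right" in text:
        return "Right"
    if "left" in text:
        return "Left"
    if "both" in text:
        return "Both"
    return "Unknown"

# Toy data with invented values.
df = pd.DataFrame({"foot": ["RIGHT", "left ", "Both", "-", None]})
df["dominant_foot"] = df["foot"].apply(standardize_foot)

print(df["dominant_foot"].tolist())
```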

How many unique positions are listed in the dataset? (Enter a number)

Which of the following is NOT one of the standardized options for the `dominant_foot` column after applying the `standardize_foot` function?

Clean and lowercase `club` names

Clean up the club column by removing any leading or trailing whitespace from each entry. Then, convert all the club names to lowercase. Store the results back in the club column.
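For example, with the vectorized string methods (club names below are placeholders):

```python
import pandas as pd

# Toy data with placeholder club names and stray whitespace.
df = pd.DataFrame({"club": ["  Example United ", "Sample City"]})

# Trim leading/trailing whitespace, then lowercase, writing back to the same column.
df["club"] = df["club"].str.strip().str.lower()

print(df["club"].tolist())
```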

Convert height to centimeters in new `height_cm` column

Create a function called height_to_cm that takes a height value as input. Inside the function, replace any comma with a dot and remove the 'm' character from the input. Then, convert the resulting string to a float, multiply it by 100, and round the result. If any error occurs during this process, return np.nan. Apply this function to the height column and store the results in a new column called height_cm.
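A sketch of `height_to_cm`, assuming heights are stored as strings such as `1,85m` (comma as decimal separator, metres as unit). The sample values are invented.

```python
import numpy as np
import pandas as pd

def height_to_cm(value):
    """Convert a height string in metres (e.g. '1,85m') to whole centimetres."""
    try:
        # "1,85m" -> "1.85"
        text = str(value).replace(",", ".").replace("m", "")
        return round(float(text) * 100)
    except Exception:
        # Any failure (unparseable text, NaN input) yields a missing value.
        return np.nan

# Toy data with invented values.
df = pd.DataFrame({"height": ["1,85m", "1,70m", "-"]})
df["height_cm"] = df["height"].apply(height_to_cm)

print(df["height_cm"].tolist())
```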

Extract age from `birthdate` using regex

Extract age from the birthdate column:
- Use the str.extract() method with a regular expression (regex) pattern to find and extract age information.
- The pattern r'\((\d+)\)' looks for one or more digits enclosed in parentheses.
- Apply this extraction to the birthdate column.
Create a new age column:
- Store the extracted age values in a new column called age.
- Convert the extracted values to float data type for numerical operations.
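The steps above can be sketched as follows, assuming the age appears in parentheses at the end of the birthdate string. The sample rows are invented.

```python
import pandas as pd

# Toy data; the second row has no age in parentheses.
df = pd.DataFrame({"birthdate": ["Apr 19, 1992 (32)", "Jan 01, 2000"]})

# expand=False returns a Series for a single capture group;
# rows without a match become NaN, which float can represent.
df["age"] = df["birthdate"].str.extract(r"\((\d+)\)", expand=False).astype(float)

print(df["age"].tolist())
```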

Multi-select Question:

Which of the following operations are performed in the data cleaning and conversion process? (Select all that apply)

Categorize jersey numbers as `Standard` or `Special`

Create a function called categorize_jersey that takes a jersey number as input. Inside the function, try to convert the input to an integer. If the conversion succeeds and the number is between 1 and 99 (inclusive), return 'Standard'. For any other number, or if the conversion fails, return 'Special'. Apply this function to the number column and store the results in a new column called jersey_category.
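One way to write `categorize_jersey`; the sample jersey values are invented.

```python
import pandas as pd

def categorize_jersey(value):
    """Return 'Standard' for integers 1..99, otherwise 'Special'."""
    try:
        number = int(value)
    except (ValueError, TypeError):
        # Non-numeric strings, NaN, and None cannot be converted.
        return "Special"
    return "Standard" if 1 <= number <= 99 else "Special"

# Toy data with invented values.
df = pd.DataFrame({"number": ["7", "0", "-", "100", "99"]})
df["jersey_category"] = df["number"].apply(categorize_jersey)

print(df["jersey_category"].tolist())
```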

Flag top 1% goal scorers after cleaning `goals` data

Clean the goals column by replacing any dash ('-') with 0, then convert the values to integers. Store the results in a new column called goals_clean. If any conversion fails, use 0 instead. Next, calculate the 99th percentile of goals_clean. Create a new column called high_scorer and set it to True for players whose goals_clean value exceeds this threshold, and False otherwise.
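A sketch of this step is shown below. It uses `pd.to_numeric(..., errors="coerce")` as one way to fall back to 0 on failed conversions, which is a choice made here rather than something the task prescribes. The sample values are invented.

```python
import pandas as pd

# Toy data: a dash, valid counts, and one unparseable entry.
df = pd.DataFrame({"goals": ["-", "3", "10", "x", "50"]})

# Replace lone dashes, coerce anything unparseable to NaN, then fall back to 0.
goals = df["goals"].replace("-", "0")
df["goals_clean"] = pd.to_numeric(goals, errors="coerce").fillna(0).astype(int)

# 99th percentile of the cleaned goal counts.
threshold = df["goals_clean"].quantile(0.99)

# Strictly greater than the threshold counts as a high scorer.
df["high_scorer"] = df["goals_clean"] > threshold

print(df)
```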

Generate unique player codes from `name` and `birthdate`

Define a function to generate a player code:
- Take name and birthdate as inputs.
- Split the name into parts.
- Extract the first 3 letters of the last name, converting to uppercase.
- Extract the first 2 letters of the first name, converting to uppercase.
- Extract the last 2 digits of the year from the birthdate using a regex pattern.
- If no year is found, use '00' as a default.
- Combine these elements to create the player code.
Apply the player code generation to the DataFrame:
- Create a new player_code column in the DataFrame.
- Use the apply method to generate a unique code for each row.
- Pass the name and birthdate values from each row to the function.

Note: If a player doesn't have a last name, assume the first name to be the last name as well. For example, if a player's name is Rodri (and he doesn't have a last name), then assume his full name to be Rodri Rodri.
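The steps above might be sketched like this. The function name `generate_player_code`, the order in which the parts are joined (last name, first name, year), and the sample birthdates are assumptions made for illustration.

```python
import re
import pandas as pd

def generate_player_code(name, birthdate):
    """Build a code from the last name, first name, and two-digit birth year."""
    # Guard against empty names so indexing below cannot fail.
    parts = str(name).split() or [""]
    first = parts[0]
    # For single-word names, parts[-1] is the same word, so it doubles as last name.
    last = parts[-1]
    # First four-digit run in the birthdate is taken as the year.
    year_match = re.search(r"\d{4}", str(birthdate))
    year = year_match.group(0)[-2:] if year_match else "00"
    return last[:3].upper() + first[:2].upper() + year

# Toy data; the birthdates are invented placeholders.
df = pd.DataFrame({
    "name": ["Rodri", "Player Example"],
    "birthdate": ["Jan 01, 1990 (34)", "unknown"],
})
df["player_code"] = df.apply(
    lambda row: generate_player_code(row["name"], row["birthdate"]), axis=1
)

print(df["player_code"].tolist())
```

Codes built this way are not guaranteed to be unique: two players sharing name prefixes and a birth year receive the same code.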

Flag players in prime age range (25-29)

Identify players who are in their prime years. Create a new column called in_prime. Set it to True for players whose age falls between 25 and 29 years old (inclusive), and False otherwise. This boolean column will help you quickly filter or analyze players who are typically considered to be at the peak of their careers.
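A minimal sketch using `Series.between`, assuming a numeric `age` column already exists. The ages are invented.

```python
import pandas as pd

# Toy data, including a missing age.
df = pd.DataFrame({"age": [24.0, 25.0, 29.0, 30.0, None]})

# between() includes both endpoints by default; missing ages evaluate to False.
df["in_prime"] = df["age"].between(25, 29)

print(df["in_prime"].tolist())
```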

Dhrubaraj Roy
