All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.
All our activities include solutions with explanations on how they work and why we chose them.
Check how many missing values exist in the foot column. Then, find the most common value (mode) in this column and use it to fill in the missing values. After filling in the blanks, verify that no missing values remain in the foot column.
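A minimal sketch of this activity, assuming a pandas DataFrame named df; the sample rows are hypothetical, since the real dataset is not shown here.

```python
import pandas as pd

# Hypothetical sample data standing in for the real dataset.
df = pd.DataFrame({"foot": ["right", "left", None, "right", None]})

missing_before = df["foot"].isna().sum()      # count missing values
most_common = df["foot"].mode()[0]            # mode() skips missing values by default
df["foot"] = df["foot"].fillna(most_common)   # fill the gaps with the mode
missing_after = df["foot"].isna().sum()       # verify nothing is missing any more

print(missing_before, most_common, missing_after)
```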
For the international_matches column, replace any dashes with zeros. Then convert all the values in this column to integers. This will ensure that all entries in the column are numeric and can be used for calculations.
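One possible way to do this, assuming the column holds strings and that a dash appears as the whole cell value (both are assumptions about the data):

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"international_matches": ["12", "-", "3", "-"]})

# Replace cells that are exactly "-" with "0", then cast the column to integers.
df["international_matches"] = (
    df["international_matches"].replace("-", "0").astype(int)
)
```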
Create a function that transforms market values into millions of euros. This function should handle cases where the value is a dash, contains 'k' for thousands, or is already in millions. For cases where the value is a dash ('-'), replace it with 0 to represent missing or unavailable data. Remove any currency symbols and unit indicators. If the value is in thousands, divide it by 1000 to convert to millions. Apply this function to the market_value column to standardize all values to a consistent numerical format in millions of euros.
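A sketch of such a function. The function name convert_market_value and the input format (strings like '€25.00m' and '€800k') are assumptions, as the text does not show the raw values.

```python
import pandas as pd

def convert_market_value(value):
    """Return a market value string as a float in millions of euros."""
    if value == "-":
        return 0.0                                   # dash means missing or unavailable
    cleaned = value.replace("€", "")                 # drop the currency symbol
    if "k" in cleaned:
        return float(cleaned.replace("k", "")) / 1000   # thousands -> millions
    return float(cleaned.replace("m", ""))           # already in millions

# Hypothetical sample data in the assumed format.
df = pd.DataFrame({"market_value": ["€25.00m", "€800k", "-"]})
df["market_value"] = df["market_value"].apply(convert_market_value)
```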
Create a function named clean_birthdate to extract the date from the birthdate string. This function should handle missing values and return them as Not-a-Time (NaT). Use a regular expression to find and extract the date in the format '%b %d, %Y'. If a valid date is found, convert it to a datetime object. If no valid date is found, return NaT. Apply this function to the birthdate column and store the results in a new column called birthdate_clean.
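A possible implementation, assuming birthdate strings look like 'Jun 24, 1987 (36)' and that month abbreviations are in English; the regular expression is one choice among several.

```python
import re
import pandas as pd

# Matches text such as "Jun 24, 1987" (abbreviated month, day, four-digit year).
DATE_PATTERN = re.compile(r"[A-Za-z]{3} \d{1,2}, \d{4}")

def clean_birthdate(value):
    if pd.isna(value):
        return pd.NaT                      # missing input
    match = DATE_PATTERN.search(str(value))
    if match is None:
        return pd.NaT                      # no date-like text found
    # errors="coerce" turns an unparseable match (e.g. an invalid month) into NaT.
    return pd.to_datetime(match.group(), format="%b %d, %Y", errors="coerce")

# Hypothetical sample data.
df = pd.DataFrame({"birthdate": ["Jun 24, 1987 (36)", None, "unknown"]})
df["birthdate_clean"] = df["birthdate"].apply(clean_birthdate)
```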
Create a function called clean_debut to clean and convert debut dates. This function should handle empty strings and missing values, returning them as Not-a-Time (NaT). For valid date strings, attempt to convert them to datetime objects using the format '%b %d, %Y'. If the conversion fails, return NaT. Apply this function to the debut column and store the cleaned dates in a new column named debut_clean.
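This could be written roughly as follows; the sample values are hypothetical.

```python
import pandas as pd

def clean_debut(value):
    # Treat missing values and empty (or whitespace-only) strings as NaT.
    if pd.isna(value) or str(value).strip() == "":
        return pd.NaT
    try:
        return pd.to_datetime(value, format="%b %d, %Y")
    except (ValueError, TypeError):
        return pd.NaT                      # conversion failed

# Hypothetical sample data.
df = pd.DataFrame({"debut": ["Oct 16, 2004", "", None, "not a date"]})
df["debut_clean"] = df["debut"].apply(clean_debut)
```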
For any missing values in the debut_clean column, fill them with a calculated date. This calculated date should be the player's birthdate_clean plus 18 years (approximated as 18 * 365 days). If a debut_clean value already exists, keep it unchanged. Apply this logic to each row of the dataframe, updating the debut_clean column with either the existing or calculated debut date.
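A row-wise sketch of this logic. The helper name fill_debut and the sample dates are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; the second player has no recorded debut.
df = pd.DataFrame({
    "birthdate_clean": pd.to_datetime(["1987-06-24", "2000-01-01"]),
    "debut_clean": pd.to_datetime(["2004-10-16", None]),
})

def fill_debut(row):
    if pd.isna(row["debut_clean"]):
        # Approximate 18 years as 18 * 365 days, ignoring leap days.
        return row["birthdate_clean"] + pd.Timedelta(days=18 * 365)
    return row["debut_clean"]              # keep an existing debut unchanged

df["debut_clean"] = df.apply(fill_debut, axis=1)
```

Because leap days are ignored, the calculated date falls a few days short of the player's actual 18th birthday.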
Count how many times each name appears in the name column. Then, identify and display the names that occur more than once in the dataset. Store the count of each name in a variable called name_counts, and keep the names that appear multiple times in a variable named duplicate_names.
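One way to do this with value_counts(), shown on hypothetical names:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"name": ["Rodri", "Pedri", "Rodri", "Gavi"]})

name_counts = df["name"].value_counts()          # occurrences per name
duplicate_names = name_counts[name_counts > 1]   # names seen more than once

print(duplicate_names)
```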
Create a unique player_id column: combine each player's name and birthdate into a new player_id column, using '_' to separate the name and birthdate. Check for duplicates in the new player_id column: store the count of each ID in player_ids, and the IDs that appear more than once in duplicate_ids.
Sort the dataframe by the team_id and debut_clean columns, with team_id in ascending order and debut_clean in descending order (most recent first). Store this sorted dataframe in a new variable called df_sorted. Then, remove duplicate entries based on the combination of team_id and name, keeping only the first occurrence (which will be the most recent due to our sorting). Store this cleaned dataframe in a new variable called df_cleaned.
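The player_id and deduplication steps above might look like this. Several details are assumptions: that the ID uses birthdate_clean formatted as YYYY-MM-DD, that player_ids holds per-ID counts, and that duplicate_ids holds the IDs counted more than once.

```python
import pandas as pd

# Hypothetical sample data; "Bo" appears twice for team 1.
df = pd.DataFrame({
    "team_id": [2, 1, 1, 2],
    "name": ["Ana", "Bo", "Bo", "Cy"],
    "birthdate_clean": pd.to_datetime(
        ["1995-03-01", "1990-07-15", "1990-07-15", "1998-11-30"]),
    "debut_clean": pd.to_datetime(
        ["2015-08-01", "2010-05-05", "2012-09-09", "2018-01-20"]),
})

# Assumption: join the name and the ISO-formatted birthdate with "_".
df["player_id"] = df["name"] + "_" + df["birthdate_clean"].dt.strftime("%Y-%m-%d")

player_ids = df["player_id"].value_counts()      # occurrences per ID
duplicate_ids = player_ids[player_ids > 1]       # IDs seen more than once

# team_id ascending, debut_clean descending (most recent first).
df_sorted = df.sort_values(["team_id", "debut_clean"], ascending=[True, False])

# Keep the first (most recent) row for each team_id / name pair.
df_cleaned = df_sorted.drop_duplicates(subset=["team_id", "name"], keep="first")
```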
Create a new column called normalized_team by transforming the team_name column. Convert all team names to lowercase, remove any occurrence of "fc" (case insensitive), and then convert the result to title case. This normalization process will help standardize team names across the dataset, making it easier to compare and analyze team-related information.
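A sketch using chained string methods. The str.strip() call is an addition not mentioned in the text, included here to drop the space left behind once "fc" is removed.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"team_name": ["FC Barcelona", "Liverpool FC", "Real Madrid"]})

df["normalized_team"] = (
    df["team_name"]
    .str.lower()                          # lowercasing makes the removal case insensitive
    .str.replace("fc", "", regex=False)   # remove every literal "fc"
    .str.strip()                          # assumption: trim leftover whitespace
    .str.title()
)
```

Note that removing every "fc" also affects names where those letters occur inside a longer word.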
Create a function called standardize_foot that takes a foot value as input. Convert the input to lowercase. If the input contains 'right', return 'Right'. If it contains 'left', return 'Left'. If it contains 'both', return 'Both'. For any other input, return 'Unknown'. Apply this function to the foot column and store the result in a new column called dominant_foot.
Note: The new column is added at the end of df.
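A minimal version of standardize_foot, assuming non-string inputs (such as missing values) should fall through to 'Unknown':

```python
import pandas as pd

def standardize_foot(foot):
    foot = str(foot).lower()      # str() lets missing values reach the final branch
    if "right" in foot:
        return "Right"
    if "left" in foot:
        return "Left"
    if "both" in foot:
        return "Both"
    return "Unknown"

# Hypothetical sample data.
df = pd.DataFrame({"foot": ["right", "LEFT", "Both feet", None]})
df["dominant_foot"] = df["foot"].apply(standardize_foot)   # appended as the last column
```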
Clean up the club column by removing any leading or trailing whitespace from each entry. Then, convert all the club names to lowercase. Store the results back in the club column.
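For instance, with hypothetical club names:

```python
import pandas as pd

# Hypothetical sample data with stray whitespace and mixed case.
df = pd.DataFrame({"club": ["  Real Madrid ", "ARSENAL", " Ajax"]})

# Trim surrounding whitespace, then lowercase, writing back to the same column.
df["club"] = df["club"].str.strip().str.lower()
```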
Create a function called height_to_cm that takes a height value as input. Inside the function, replace any comma with a dot and remove the m character from the input. Then, convert the resulting string to a float, multiply it by 100, and round the result. If any error occurs during this process, return np.nan. Apply this function to the height column and store the results in a new column called height_cm.
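A sketch of height_to_cm, assuming heights arrive as strings such as '1,85m'; this version catches ValueError and TypeError rather than every exception.

```python
import numpy as np
import pandas as pd

def height_to_cm(height):
    try:
        # "1,85m" -> "1.85" -> 1.85 -> 185
        cleaned = str(height).replace(",", ".").replace("m", "")
        return round(float(cleaned) * 100)
    except (ValueError, TypeError):
        return np.nan                     # anything unparseable becomes NaN

# Hypothetical sample data.
df = pd.DataFrame({"height": ["1,85m", "1.70 m", "-", None]})
df["height_cm"] = df["height"].apply(height_to_cm)
```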
Extract age from the birthdate column: use the str.extract() method with a regular expression (regex) pattern to find and extract age information. The pattern r'\((\d+)\)' looks for one or more digits enclosed in parentheses. Apply it to the birthdate column. Create a new age column: store the extracted values in age.
Which of the following operations are performed in the data cleaning and conversion process? (Select all that apply)
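The age extraction described above can be sketched as follows. Converting the extracted text to numbers with pd.to_numeric is an assumption, as is the sample birthdate format.

```python
import pandas as pd

# Hypothetical sample data; the last row has no age in parentheses.
df = pd.DataFrame({"birthdate": ["Jun 24, 1987 (36)", "Feb 5, 1992 (31)", "unknown"]})

# With one capture group, str.extract() returns a DataFrame whose only column is 0.
extracted = df["birthdate"].str.extract(r"\((\d+)\)")

# Assumption: store ages as numbers; rows without a match stay missing (NaN).
df["age"] = pd.to_numeric(extracted[0])
```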
Create a function called categorize_jersey that takes a jersey number as input. Inside the function, try to convert the input to an integer. If successful and the number is between 1 and 99 (inclusive), return Standard. For any other number or if the conversion fails, return Special. Apply this function to the number column and store the results in a new column called jersey_category.
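A possible categorize_jersey, tried on hypothetical jersey values:

```python
import pandas as pd

def categorize_jersey(number):
    try:
        value = int(number)
    except (ValueError, TypeError):
        return "Special"                  # conversion failed
    return "Standard" if 1 <= value <= 99 else "Special"

# Hypothetical sample data.
df = pd.DataFrame({"number": ["10", "0", "100", "-", None]})
df["jersey_category"] = df["number"].apply(categorize_jersey)
```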
Clean the goals column by replacing any - with 0, then convert the values to integers. Store the results in a new column called goals_clean. If any conversion fails, use 0 instead. Next, calculate the 99th percentile of goals scored. Create a new column called high_scorer and set it to True for players whose goals_clean value exceeds this threshold, and False otherwise.
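One way to express this, using pd.to_numeric with errors="coerce" as the fallback for failed conversions (a choice made for this sketch, not something the text prescribes):

```python
import pandas as pd

# Hypothetical sample data, including a dash and an unparseable entry.
df = pd.DataFrame({"goals": ["5", "-", "12", "n/a", "300"]})

df["goals_clean"] = (
    pd.to_numeric(df["goals"].replace("-", "0"), errors="coerce")
    .fillna(0)          # failed conversions become 0
    .astype(int)
)

threshold = df["goals_clean"].quantile(0.99)       # 99th percentile
df["high_scorer"] = df["goals_clean"] > threshold  # strictly above the threshold
```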
Define a function to generate a player code. The function takes name and birthdate as inputs and uses '00' as a default.
Apply the player code generation to the DataFrame: create a player_code column in the DataFrame, passing the name and birthdate values from each row to the function.
Note: If a player doesn't have a last name, assume the first name to be the last name as well. For example, if a player's name is Rodri (and he doesn't have a last name), then assume his full name to be Rodri Rodri.
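The exact code format is not stated in the text, so the following is only an illustration under a hypothetical format: first-name initial, last-name initial, and the last two digits of the birth year, with '00' used when no year can be found. The function name generate_player_code is also hypothetical.

```python
import re
import pandas as pd

def generate_player_code(name, birthdate):
    # Hypothetical format: initials + two-digit birth year, e.g. "LM87".
    parts = str(name).split()
    first, last = parts[0], parts[-1]     # a single-word name serves as both
    match = re.search(r"\d{4}", str(birthdate))
    year_suffix = match.group()[-2:] if match else "00"   # '00' as the default
    return f"{first[0]}{last[0]}{year_suffix}".upper()

# Hypothetical sample data.
df = pd.DataFrame({
    "name": ["Lionel Messi", "Rodri"],
    "birthdate": ["Jun 24, 1987 (36)", None],
})
df["player_code"] = df.apply(
    lambda row: generate_player_code(row["name"], row["birthdate"]), axis=1
)
```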
Identify players who are in their prime years. Create a new column called in_prime. Set it to True for players whose age falls between 25 and 29 years old (inclusive), and False otherwise. This boolean column will help you quickly filter or analyze players who are typically considered to be at the peak of their careers.
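A short sketch, assuming a numeric age column already exists:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"age": [24, 25, 27, 29, 30]})

# Series.between() includes both endpoints by default.
df["in_prime"] = df["age"].between(25, 29)
```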