All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.
All our activities include solutions with explanations on how they work and why we chose them.
Enter the number of exact duplicate rows in the dataset. Exact duplicates are rows where all column values are identical. Use pandas functionality to identify these duplicates and sum up the total count. Store the result in a variable called duplicate_count
.
Identify Equipment IDs that appear more than once in the dataset. First, count the occurrences of each unique Equipment_ID
. Then, filter this count to keep only the IDs that occur multiple times. Store the result in a variable called duplicate_ids
.
Count the number of duplicate Equipment_ID
and remove these duplicates from the dataset. First, tally the number of duplicate Equipment_ID
and store this count in a variable called duplicates
. Then, create a new dataframe called df_cleaned
that keeps only the first occurrence of each unique Equipment_ID
, discarding any subsequent duplicates.
Enter the number that the duplicates
variable contains.
Standardize the Status
column values and remove any remaining duplicates from the dataset. First, convert all Status
values to lowercase and remove any leading or trailing whitespace. Then, create a new dataframe called df_cleaned
that contains only unique rows, removing any exact duplicates across all columns.
Identify duplicate rows based on specific columns and analyze the manufacturers of these duplicates. First, find rows with identical values in the Type
, Model
, and Year_Acquired
columns, keeping all instances of duplicates. Store these in a duplicates
variable. Then, count the occurrences of each manufacturer within these duplicates. Finally, calculate the percentage each manufacturer represents among the duplicates, rounding to two decimal places. Store the counts and percentages in mfg_counts
and mfg_pct
respectively.
Select all columns except Operational_Cost
and Last_Maintenance
as comparison columns. Use these to identify near-duplicate entries in the dataset. Employ the duplicated()
method with keep=False
to flag all duplicate rows based on the selected columns. Store these near-duplicates in the near_duplicates
variable.
Define key equipment details (Equipment_ID
, Type
, Model
, Manufacturer
, Year_Acquired
, Status
, and Current_Base
) in comparison_cols
. Sort near_duplicates
by these columns and Last_Maintenance
. Keep only the last (most recent) maintenance record for each unique piece of equipment using drop_duplicates()
, considering only the comparison_cols
for determining duplicates. Store the result in resolved_maintenance
. Finally, reset the index of the resulting DataFrame, dropping the old index.
Identify equipment with the same Type
and Model
in the original df
. Use the duplicated()
method with keep=False
to flag all duplicate entries based on the Type
and Model
columns, including the first occurrence. Filter the DataFrame to keep only these flagged duplicate rows. Store these potential duplicate entries in a variable called hidden_duplicates
.
Count how many times each unique combination of Type
and Model
appears in the hidden_duplicates
DataFrame using the value_counts()
method. Then, sort these counts from to lowest to highest. Store the resulting Series of counts in type_model_counts
.