Cleaning Duplicate Data in Military Equipment Records

input

Count exact duplicate rows

Enter the number of exact duplicate rows in the dataset. Exact duplicates are rows where all column values are identical. Use pandas functionality to identify these duplicates and sum up the total count. Store the result in a variable called duplicate_count.

codevalidated

Find `Equipment_ID` with multiple entries

Identify Equipment IDs that appear more than once in the dataset. First, count the occurrences of each unique Equipment_ID. Then, filter this count to keep only the IDs that occur multiple times. Store the result in a variable called duplicate_ids.

codevalidated

Identify Duplicate Equipment Entries by `Equipment_ID`

Count the number of duplicate Equipment_ID and remove these duplicates from the dataset. First, tally the number of duplicate Equipment_ID and store this count in a variable called duplicates. Then, create a new dataframe called df_cleaned that keeps only the first occurrence of each unique Equipment_ID, discarding any subsequent duplicates.

input

How many duplicates are there?

Enter the number that the duplicates variable contains.

codevalidated

Normalize `Status` and deduplicate rows

Standardize the Status column values and remove any remaining duplicates from the dataset. First, convert all Status values to lowercase and remove any leading or trailing whitespace. Then, create a new dataframe called df_cleaned that contains only unique rows, removing any exact duplicates across all columns.

codevalidated

Calculate `Manufacturer` percentages in `Type`-`Model`-`Year_Acquired` duplicates

Identify duplicate rows based on specific columns and analyze the manufacturers of these duplicates. First, find rows with identical values in the Type, Model, and Year_Acquired columns, keeping all instances of duplicates. Store these in a duplicates variable. Then, count the occurrences of each manufacturer within these duplicates. Finally, calculate the percentage each manufacturer represents among the duplicates, rounding to two decimal places. Store the counts and percentages in mfg_counts and mfg_pct respectively.

multiplechoice

Which pandas method is used to identify duplicate rows?

multiplechoice

How can you check for duplicates in a specific column?

codevalidated

Identify similar entries excluding `Operational_Cost` and `Last_Maintenance`

Select all columns except Operational_Cost and Last_Maintenance as comparison columns. Use these to identify near-duplicate entries in the dataset. Employ the duplicated() method with keep=False to flag all duplicate rows based on the selected columns. Store these near-duplicates in the near_duplicates variable.

codevalidated

Select most recent `Last_Maintenance` record for each equipment

Define key equipment details (Equipment_ID, Type, Model, Manufacturer, Year_Acquired, Status, and Current_Base) in comparison_cols. Sort near_duplicates by these columns and Last_Maintenance. Keep only the last (most recent) maintenance record for each unique piece of equipment using drop_duplicates(), considering only the comparison_cols for determining duplicates. Store the result in resolved_maintenance. Finally, reset the index of the resulting DataFrame, dropping the old index.

multiplechoice

When handling near-duplicates by keeping the most recent entry, which operation is typically performed first?

multiplechoice

What's a potential downside of always keeping the lower `Operational_Cost` entry when resolving near-duplicates?

codevalidated

Identify equipment with identical `Type` and `Model`

Identify equipment with the same Type and Model in the original df. Use the duplicated() method with keep=False to flag all duplicate entries based on the Type and Model columns, including the first occurrence. Filter the DataFrame to keep only these flagged duplicate rows. Store these potential duplicate entries in a variable called hidden_duplicates.

codevalidated

Count and rank `Type`, `Model` combinations in duplicates

Count how many times each unique combination of Type and Model appears in the hidden_duplicates DataFrame using the value_counts() method. Then, sort these counts from to lowest to highest. Store the resulting Series of counts in type_model_counts.

multiplechoice

Which pandas method is most useful for creating a hash of multiple columns?

multiplechoice

Dhrubaraj Roy

Project Activities

Count exact duplicate rows

Find `Equipment_ID` with multiple entries

Identify Duplicate Equipment Entries by `Equipment_ID`

How many duplicates are there?

Normalize `Status` and deduplicate rows

Calculate `Manufacturer` percentages in `Type`-`Model`-`Year_Acquired` duplicates

Which pandas method is used to identify duplicate rows?

How can you check for duplicates in a specific column?

Identify similar entries excluding `Operational_Cost` and `Last_Maintenance`

Select most recent `Last_Maintenance` record for each equipment

When handling near-duplicates by keeping the most recent entry, which operation is typically performed first?

What's a potential downside of always keeping the lower `Operational_Cost` entry when resolving near-duplicates?

Identify equipment with identical `Type` and `Model`

Count and rank `Type`, `Model` combinations in duplicates

Which pandas method is most useful for creating a hash of multiple columns?

What's a potential risk of always keeping the entry with the lowest `Equipment_ID` when resolving hidden duplicates?

Dhrubaraj Roy

Data Cleaning with Pandas

Set Operations using Sakila

LIKE Operator using World

Membership and Range Operators with World Database