Cleaning and Standardizing Pharmaceutical Product Data
Cleaning and Standardizing Pharmaceutical Product Data Data Science Project
Data Cleaning with Pandas

Cleaning and Standardizing Pharmaceutical Product Data

Unveil the intricacies of pharmaceutical products through precise data cleansing and standardization! Join this practical project. You'll turn raw data into a reliable source. Manage missing values, eliminate duplicates, and standardize text fields. Acquire valuable data curation skills. They ensure the accuracy of info. The info lays the groundwork for life-saving discoveries. Seize this opportunity to master essential techniques that propel medical advancements forward.
Start this project
Cleaning and Standardizing Pharmaceutical Product DataCleaning and Standardizing Pharmaceutical Product Data
Project Created by

Dhrubaraj Roy

Project Activities

All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.

All our activities include solutions with explanations on how they work and why we chose them.

codevalidated

Identify Missing and Null Values

Determine the number of missing (null) values in each column of the df. Assign the result to the variable missing_values.

input

Missing Values in `END_MARKETING_DATE` Column

How many missing values are present in END_MARKETING_DATE?

codevalidated

Identify Numerical Columns

Identify the columns in the df that have numerical data types (float64 and int64). Store the column names of these numerical columns in the variable numerical_cols.

codevalidated

Impute Missing Values in Numerical Columns Using Mean

Using numerical_cols, replace missing (null) values in the numerical columns of the df with the mean value of the respective column.

codevalidated

Identify Categorical Columns

Find the columns in the df that have categorical (object) data types. Store the column names of these categorical columns in the variable categorical_cols.

codevalidated

Impute Missing Values in Categorical Columns Using Mode

Using categorical_cols, replace missing (null) values in the categorical columns of the df with the mode (most frequent) value of the respective column.

multiplechoice

Identify Duplicate Rows

How many duplicate rows (records) are present in the df?

codevalidated

Identify Duplicates Based on Specific Columns

PRODUCT_NDC is a column representing the National Drug Code for products, while LABELER_NAME is a column indicating the name of the product labeler or manufacturer. Find duplicate rows (records) in the df based on the values in the PRODUCT_NDC and LABELER_NAME columns. Store the duplicate rows in the variable duplicate_specific.

multiplechoice

How many records are in `duplicate_specific`?

codevalidated

Remove duplicate rows based on specific columns, keeping the last occurrence

Eliminate duplicate rows (records) from the df based on the values in the PRODUCT_NDC and LABELER_NAME columns, keeping the last occurrence of each duplicate group.

codevalidated

Standardize Product Type Names

The PRODUCT_TYPE_NAME column categorizes the type of drug or product, such as BULK INGREDIENT, DRUG FOR FURTHER PROCESSING. Normalize the values in the PRODUCT_TYPE_NAME column by mapping the values DRUG FOR FURTHER PROCESSING to DRUG and BULK INGREDIENT to INGREDIENT.

codevalidated

Extract Relevant Information from Product ID

Retrieve the product code from the PRODUCT_ID column of the df and store it in a new column named PRODUCT_CODE.

Note : New column added at the end of the df

codevalidated

Standardize Active Ingredient Units

Normalize the units in the ACTIVE_INGRED_UNIT column of the df. Apply a custom function called standardize_unit to achieve this. This function will follow specific rules:

Convert units like iu/ml and [iu]/ml to IU/mL. Change mg/1 and mg/ml to mg/mL. Modify mg/.5ml to mg/0.5mL. For units containing ml like mg/1.4ml, replace with mg/1.4mL. Standardize g/g and kg/kg to g/g. Rewrite u/5ml and ml/ml as U/5mL and mL/mL, respectively. Finally, convert all other units to uppercase.

codevalidated

Create a Concatenated Product Description

Create a new column PRODUCT_DESCRIPTION in the df by concatenating values from multiple columns: NON_PROPRIETARY_NAME, DOSAGEFORM_NAME, ACTIVE_NUMERATOR_STRENGTH, and ACTIVE_INGRED_UNIT. The ACTIVE_NUMERATOR_STRENGTH column is converted to a string before concatenation. The resulting concatenated string is assigned to the PRODUCT_DESCRIPTION column, and any leading or trailing whitespaces are removed.

Note : Use a single space (" ") for concatenating the values from different columns. New column added at the end of the df.

codevalidated

Identify Invalid Dates

Find rows with invalid or improperly formatted dates in the END_MARKETING_DATE column. Convert the column to datetime format using the format %d_%m_%Y. When converting, use the errors="coerce" parameter to handle invalid date formats properly by converting them to null values. Filter rows where the conversion results in null values, indicating invalid dates. Store these rows in the invalid_dates variable.

codevalidated

Identify Outliers or Extreme Values

Perform two operations on the ACTIVE_NUMERATOR_STRENGTH column of the df. Firstly, convert the column to a numeric data type. Then, identify outliers or extreme values in the column based on the interquartile range (IQR) method. In the IQR method, values below the lower bound (calculated as the first quartile - 1.5 * IQR) and above the upper bound (calculated as the third quartile + 1.5 * IQR) are considered outliers. Store the rows containing these outlier values in the outliers variable.

codevalidated

Format `START_MARKETING_DATE` Column

Perform two operations on the START_MARKETING_DATE column in the df, firstly convert the column to datetime format using the format %Y%m%d. Then format the datetime values in the column to the %d-%m-%Y string format.

codevalidated

Drop `END_MARKETING_DATE` Column

We observed that the values in the END_MARKETING_DATE column are invalid. Let's drop that column.

codevalidated

Drop Outliers Based on `ACTIVE_NUMERATOR_STRENGTH` Column

In the previous activity, we identified outliers in the ACTIVE_NUMERATOR_STRENGTH column using the interquartile range (IQR) method. The lower_bound was calculated as the first quartile (25th percentile) minus 1.5 times the IQR, and the upper_bound was calculated as the third quartile (75th percentile) plus 1.5 times the IQR. Values below the lower_bound or above the upper_bound were considered outliers. Drop all outliers from the ACTIVE_NUMERATOR_STRENGTH column based on these calculated bounds.

Cleaning and Standardizing Pharmaceutical Product DataCleaning and Standardizing Pharmaceutical Product Data
Project Created by

Dhrubaraj Roy

Project Author at DataWars, responsible for leading the development and delivery of innovative machine learning and data science projects.

Project Author at DataWars, responsible for leading the development and delivery of innovative machine learning and data science projects.

This project is part of

Data Cleaning with Pandas

Explore other projects