All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.
All our activities include solutions with explanations on how they work and why we chose them.
Determine the number of missing (null) values in each column of the df
. Assign the result to the variable missing_values
.
How many missing values are present in END_MARKETING_DATE
?
Identify the columns in the df
that have numerical data types (float64 and int64). Store the column names of these numerical columns in the variable numerical_cols
.
Using numerical_cols
, replace missing (null) values in the numerical columns of the df
with the mean value of the respective column.
Find the columns in the df
that have categorical (object) data types. Store the column names of these categorical columns in the variable categorical_cols
.
Using categorical_cols
, replace missing (null) values in the categorical columns of the df
with the mode (most frequent) value of the respective column.
How many duplicate rows (records) are present in the df
?
PRODUCT_NDC
is a column representing the National Drug Code for products, while LABELER_NAME
is a column indicating the name of the product labeler or manufacturer. Find duplicate rows (records) in the df
based on the values in the PRODUCT_NDC
and LABELER_NAME
columns. Store the duplicate rows in the variable duplicate_specific
.
Eliminate duplicate rows (records) from the df
based on the values in the PRODUCT_NDC
and LABELER_NAME
columns, keeping the last occurrence of each duplicate group.
The PRODUCT_TYPE_NAME
column categorizes the type of drug or product, such as BULK INGREDIENT
, DRUG FOR FURTHER PROCESSING
. Normalize the values in the PRODUCT_TYPE_NAME
column by mapping the values DRUG FOR FURTHER PROCESSING
to DRUG
and BULK INGREDIENT
to INGREDIENT
.
Retrieve the product code from the PRODUCT_ID
column of the df
and store it in a new column named PRODUCT_CODE
.
Note : New column added at the end of the
df
Normalize the units in the ACTIVE_INGRED_UNIT
column of the df. Apply a custom function called standardize_unit
to achieve this. This function will follow specific rules:
Convert units like iu/ml
and [iu]/ml
to IU/mL
. Change mg/1
and mg/ml
to mg/mL
. Modify mg/.5ml
to mg/0.5mL
. For units containing ml
like mg/1.4ml
, replace with mg/1.4mL
. Standardize g/g
and kg/kg
to g/g
. Rewrite u/5ml
and ml/ml
as U/5mL
and mL/mL
, respectively. Finally, convert all other units to uppercase.
Create a new column PRODUCT_DESCRIPTION
in the df
by concatenating values from multiple columns: NON_PROPRIETARY_NAME
, DOSAGEFORM_NAME
, ACTIVE_NUMERATOR_STRENGTH
, and ACTIVE_INGRED_UNIT
. The ACTIVE_NUMERATOR_STRENGTH
column is converted to a string before concatenation. The resulting concatenated string is assigned to the PRODUCT_DESCRIPTION
column, and any leading or trailing whitespaces are removed.
Note : Use a single space
(" ")
for concatenating the values from different columns. New column added at the end of thedf
.
Find rows with invalid or improperly formatted dates in the END_MARKETING_DATE
column. Convert the column to datetime
format using the format %d_%m_%Y
. When converting, use the errors="coerce"
parameter to handle invalid date formats properly by converting them to null values. Filter rows where the conversion results in null values, indicating invalid dates. Store these rows in the invalid_dates
variable.
Perform two operations on the ACTIVE_NUMERATOR_STRENGTH
column of the df
. Firstly, convert the column to a numeric data type. Then, identify outliers or extreme values in the column based on the interquartile range (IQR) method. In the IQR method, values below the lower bound (calculated as the first quartile - 1.5 * IQR) and above the upper bound (calculated as the third quartile + 1.5 * IQR) are considered outliers. Store the rows containing these outlier values in the outliers
variable.
Perform two operations on the START_MARKETING_DATE
column in the df
, firstly convert the column to datetime
format using the format %Y%m%d
. Then format the datetime
values in the column to the %d-%m-%Y
string format.
We observed that the values in the END_MARKETING_DATE
column are invalid. Let's drop that column.
In the previous activity, we identified outliers in the ACTIVE_NUMERATOR_STRENGTH
column using the interquartile range (IQR) method. The lower_bound
was calculated as the first quartile (25th percentile) minus 1.5 times the IQR, and the upper_bound
was calculated as the third quartile (75th percentile) plus 1.5 times the IQR. Values below the lower_bound
or above the upper_bound
were considered outliers. Drop all outliers from the ACTIVE_NUMERATOR_STRENGTH
column based on these calculated bounds.