All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.
All our activities include solutions with explanations on how they work and why we chose them.
Replace all missing author entries with "Anonymous"
. The author
column contains the names of the book authors.
Create a custom one using the format "CUSTOM-"
followed by the book's ID padded to 5 digits. The isbn
column contains the International Standard Book Numbers for the books.
Replace all missing genre entries with "Unclassified"
. Then, display a count of books in each genre category after making this change. The genre
column categorizes books by their literary genre or subject matter.
Create a function that assigns ratings to books with missing ratings based on the average rating of their genre. If a book's genre doesn't have an average rating, use the overall average rating. Apply this function to fill in all missing ratings. Then, display summary statistics of ratings for each genre to verify the changes. The rating
column contains numerical ratings for the books.
Convert the publication_date
column to a datetime format, ignoring any errors. Find the earliest publication date in the dataset. Replace any missing publication dates with this earliest date. The publication_date
column contains the dates when the books were published.
Create a new dataframe df_cleaned
by removing duplicates from df
, keeping only the first occurrence of each duplicate group. Do not modify the original df
.
Clean the dataset by removing duplicate books. Keep only the first occurrence of each unique combination of title
, author
, and isbn
, prioritizing entries with the lowest book_id
. Sort the data by book_id
before deduplication to ensure consistency. Store the result in df_cleaned
. The resulting df_cleaned
will contain one row per unique book, retaining the earliest added entry for each.
Group the data by title
, author
, and isbn
. For each group, calculate the median price
, keep the first book_id
and publication_date
, retain the first genre
, compute the mean rating
, and sum the sales
. Include these columns along with the grouping columns (title
, author
, isbn
) in the result. Reset the index to turn the grouping columns back into regular columns. Store the result in df_cleaned
. The final df_cleaned
will contain one row per unique book with aggregated information.
Group the data by title
, author
, and isbn
. For each group, combine unique genres into a comma-separated string using the set()
function to remove duplicates and ', '.join()
method to concatenate. Keep the first book_id
and publication_date
, calculate the mean price
and rating
, and sum the sales. Include these columns along with the grouping columns (title
, author
, isbn
) in the result. Reset the index to turn the grouping columns back into regular columns. Store the result in df_cleaned
. The final df_cleaned
will contain one row per unique book with aggregated information and combined genres.
Create a list called marketing_phrases
containing catchy phrases: ["Must-Read!", "Unputdownable!", "Life-Changing!", "Epic Tale!", "Mind-Blowing!"]
. Generate a new slogan
column by combining the existing title
with a randomly chosen phrase from marketing_phrases
, separated by " - "
. The resulting slogan will be in the format: "Existing Title - Random Marketing Phrase"
.
Note: Use
random_seed = 42
to ensure consistent results.
Create a hidden message for each book using characters from its ISBN. Extract the characters at positions 5, 8, and 11 from the ISBN and combine them. If a book doesn't have an ISBN or the extraction isn't possible, use "No message"
instead. Store this hidden message in a new column called hidden_message
.
The isbn
column contains the ISBN numbers for the books.
Extract the year from the publication_date
column and create a new column called year
. Then, generate a decade
column by rounding down the year to the nearest 10 and appending 's'
to the end (e.g., 1990s). For any missing values in the decade column, replace them with "Unknown"
. The decade
column will contain values like "1990s"
, "2000s"
, etc., or "Unknown"
for missing data.
The publication_date
column contains the original publication dates of the books.
Create unique pseudonyms for each author. Extract the author's first name, take the first three letters of the book's genre (in uppercase), and generate a random two-digit number. Combine these elements to form the pseudonym. If any information is missing, use "Anonymous"
as the pseudonym. Store the result in a new column called pseudonym
. Remove any temporary columns created during this process.
Note: Use random seed 42 for producing the same result.
You're a data analyst working for a publishing house. Your task is to analyze the diversity of book titles in your catalog. You've written a Python script to count unique letters in each title, but now you need to interpret the results.
Given the following code snippet:
df_copy = df.copy()
df_copy['unique_letter_count'] = df_copy['title'].str.lower().str.replace(' ', '').apply(lambda x: len(set(x)))
most_unique = df_copy.loc[df_copy['unique_letter_count'].idxmax()]
Question: What does this code accomplish, and what insight does it provide about the book titles?
Calculate the average price for each book genre, excluding any books with zero or negative prices. For any book with a negative price, replace it with the average price of its genre. If the genre's average price is not available, use the overall average price of all books. Update the price
column with these corrected values.
In your current df
, the minimum price should be zero. As you can see here.
df['price'].describe()
Standardize the in_stock
information for each book. Convert various representations of "yes"
(like 'Y'
, 'y'
, 'Yes'
, 'yes'
) to True, and various representations of "no"
(like 'N'
, 'n'
, 'No'
, 'no'
) to False. If the stock information is missing, assume the book is not in stock. Store the result as a boolean value in the in_stock
column.
The original in_stock
column contains various text representations of stock status.
Extract the year from the publication_date
column and create a new column called publication_year
. Assign the extracted year values to this new column.
You're a data analyst at a publishing company. You've been given the following Python code that processes the 'genre' column in your book database:
df_copy = df.copy()
# Get unique genres
unique_genres = df_copy['genre'].dropna().unique()
# Create a categorical data type
df_copy['genre_cat'] = pd.Categorical(df_copy['genre'], categories=unique_genres)
# Display information about the new categorical column
print(df_copy['genre_cat'].describe())
After running this code, you get the following output:
Question: Based on the code and output, which of the following statements is correct?
Analyze the isbn
column and determine the most appropriate data type for it. The isbn
column contains the International Standard Book Number for each book. Consider the format of ISBN and the operations you might want to perform on this column.
You've run the following code on your book database:
# Calculate the percentage of books with unknown authors
unknown_authors_percentage = ((df_copy['author'] == 'Unknown') | (df_copy['author'] == 'Anonymous')).mean() * 100
# Print the result
print(f"Percentage of books with unknown authors: {unknown_authors_percentage:.2f}%")
The output shows: Percentage of books with unknown authors: 20.55%
Question: What does this result suggest about the author data in your database?
You're a data quality specialist at a library. You've run the following code on your book database to clean up the ISBN column:
df['isbn'] = df['isbn'].fillna('Unknown')
df.loc[df['isbn'] == 'N/A', 'isbn'] = 'Unknown'
df.loc[df['isbn'] != 'Unknown', 'isbn'] = df.loc[df['isbn'] != 'Unknown', 'isbn'].apply(lambda x: re.sub(r'\W+', '', x))
# Display unique values to check for remaining inconsistencies
print(df['isbn'].unique())
After running this code, you see the following output of unique ISBN values:
['CUSTOM00001' 'ISBN0000001' 'ISBN0000002' ... 'ISBN0000684' 'CUSTOM02197' 'CUSTOM02198']
Question: What does this output suggest about the ISBN data after cleaning?