Capstone Project: Cleaning and Preparing Book Sales Data
Capstone Project: Cleaning and Preparing Book Sales Data Data Science Project
Data Cleaning with Pandas

Capstone Project: Cleaning and Preparing Book Sales Data

Dive into the world of book sales data! Hone your Pandas skills by solving real-world problems. These include filling missing authors, standardizing genres, and decoding hidden ISBN messages. Use methods like .fillna(), .map(), and .apply() to clean messy data. Create catchy book slogans and author pseudonyms. Learn to handle duplicates and inconsistent formats. Turn raw book data into insights. Do it like a pro data analyst in publishing!
Start this project
Capstone Project: Cleaning and Preparing Book Sales DataCapstone Project: Cleaning and Preparing Book Sales Data
Project Created by

Dhrubaraj Roy

Project Activities

All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.

All our activities include solutions with explanations on how they work and why we chose them.

codevalidated

Replace Missing Authors in `author` Column

Replace all missing author entries with "Anonymous". The author column contains the names of the book authors.

codevalidated

Generate Custom `ISBNs` for Missing Entries in `isbn` Column

Create a custom one using the format "CUSTOM-" followed by the book's ID padded to 5 digits. The isbn column contains the International Standard Book Numbers for the books.

codevalidated

Categorize Missing Genres as `'Unclassified'` in `genre` Column

Replace all missing genre entries with "Unclassified". Then, display a count of books in each genre category after making this change. The genre column categorizes books by their literary genre or subject matter.

codevalidated

Impute Missing `rating` Values Using Genre Averages

Create a function that assigns ratings to books with missing ratings based on the average rating of their genre. If a book's genre doesn't have an average rating, use the overall average rating. Apply this function to fill in all missing ratings. Then, display summary statistics of ratings for each genre to verify the changes. The rating column contains numerical ratings for the books.

codevalidated

Standardize `publication_date` and Fill Missing Values

Convert the publication_date column to a datetime format, ignoring any errors. Find the earliest publication date in the dataset. Replace any missing publication dates with this earliest date. The publication_date column contains the dates when the books were published.

codevalidated

Remove Duplicate Books Using `title` and `author` Columns

Create a new dataframe df_cleaned by removing duplicates from df, keeping only the first occurrence of each duplicate group. Do not modify the original df.

codevalidated

Deduplicate Entries Using Lowest `book_id` for Identical Books

Clean the dataset by removing duplicate books. Keep only the first occurrence of each unique combination of title, author, and isbn, prioritizing entries with the lowest book_id. Sort the data by book_id before deduplication to ensure consistency. Store the result in df_cleaned. The resulting df_cleaned will contain one row per unique book, retaining the earliest added entry for each.

codevalidated

Consolidate Duplicate Books Using Median `price` and Aggregate Data

Group the data by title, author, and isbn. For each group, calculate the median price, keep the first book_id and publication_date, retain the first genre, compute the mean rating, and sum the sales. Include these columns along with the grouping columns (title, author, isbn) in the result. Reset the index to turn the grouping columns back into regular columns. Store the result in df_cleaned. The final df_cleaned will contain one row per unique book with aggregated information.

codevalidated

Merge Multi-Genre Books and Aggregate Related Data

Group the data by title, author, and isbn. For each group, combine unique genres into a comma-separated string using the set() function to remove duplicates and ', '.join() method to concatenate. Keep the first book_id and publication_date, calculate the mean price and rating, and sum the sales. Include these columns along with the grouping columns (title, author, isbn) in the result. Reset the index to turn the grouping columns back into regular columns. Store the result in df_cleaned. The final df_cleaned will contain one row per unique book with aggregated information and combined genres.

codevalidated

Create catchy slogans for `title` column using random phrases

Create a list called marketing_phrases containing catchy phrases: ["Must-Read!", "Unputdownable!", "Life-Changing!", "Epic Tale!", "Mind-Blowing!"]. Generate a new slogan column by combining the existing title with a randomly chosen phrase from marketing_phrases, separated by " - ". The resulting slogan will be in the format: "Existing Title - Random Marketing Phrase".

Note: Use random_seed = 42 to ensure consistent results.

codevalidated

Extract hidden message from `isbn` column

Create a hidden message for each book using characters from its ISBN. Extract the characters at positions 5, 8, and 11 from the ISBN and combine them. If a book doesn't have an ISBN or the extraction isn't possible, use "No message" instead. Store this hidden message in a new column called hidden_message.

The isbn column contains the ISBN numbers for the books.

codevalidated

Determine publication decade from `publication_date` column

Extract the year from the publication_date column and create a new column called year. Then, generate a decade column by rounding down the year to the nearest 10 and appending 's' to the end (e.g., 1990s). For any missing values in the decade column, replace them with "Unknown". The decade column will contain values like "1990s", "2000s", etc., or "Unknown" for missing data.

The publication_date column contains the original publication dates of the books.

codevalidated

Generate author pseudonyms using `author` and `genre` columns

Create unique pseudonyms for each author. Extract the author's first name, take the first three letters of the book's genre (in uppercase), and generate a random two-digit number. Combine these elements to form the pseudonym. If any information is missing, use "Anonymous" as the pseudonym. Store the result in a new column called pseudonym. Remove any temporary columns created during this process.

Note: Use random seed 42 for producing the same result.

multiplechoice

The Literary Diversity Challenge

You're a data analyst working for a publishing house. Your task is to analyze the diversity of book titles in your catalog. You've written a Python script to count unique letters in each title, but now you need to interpret the results.

Given the following code snippet:

df_copy = df.copy()

df_copy['unique_letter_count'] = df_copy['title'].str.lower().str.replace(' ', '').apply(lambda x: len(set(x)))
most_unique = df_copy.loc[df_copy['unique_letter_count'].idxmax()]

Question: What does this code accomplish, and what insight does it provide about the book titles?

codevalidated

Adjust negative prices in `price` column using genre averages

Calculate the average price for each book genre, excluding any books with zero or negative prices. For any book with a negative price, replace it with the average price of its genre. If the genre's average price is not available, use the overall average price of all books. Update the price column with these corrected values.

In your current df, the minimum price should be zero. As you can see here.

df['price'].describe()
codevalidated

Standardize `in_stock` column to boolean values

Standardize the in_stock information for each book. Convert various representations of "yes"(like 'Y', 'y', 'Yes', 'yes') to True, and various representations of "no" (like 'N', 'n', 'No', 'no') to False. If the stock information is missing, assume the book is not in stock. Store the result as a boolean value in the in_stock column.

The original in_stock column contains various text representations of stock status.

codevalidated

Extract Year from `publication_date` to New Column

Extract the year from the publication_date column and create a new column called publication_year. Assign the extracted year values to this new column.

multiplechoice

Categorizing Book Genres

You're a data analyst at a publishing company. You've been given the following Python code that processes the 'genre' column in your book database:

df_copy = df.copy()

# Get unique genres
unique_genres = df_copy['genre'].dropna().unique()

# Create a categorical data type
df_copy['genre_cat'] = pd.Categorical(df_copy['genre'], categories=unique_genres)

# Display information about the new categorical column
print(df_copy['genre_cat'].describe())

After running this code, you get the following output:

act20

Question: Based on the code and output, which of the following statements is correct?

multiplechoice

Which Data Type Best Suits the `isbn` Column?

Analyze the isbn column and determine the most appropriate data type for it. The isbn column contains the International Standard Book Number for each book. Consider the format of ISBN and the operations you might want to perform on this column.

multiplechoice

Mystery of the Missing Authors

You've run the following code on your book database:

# Calculate the percentage of books with unknown authors
unknown_authors_percentage = ((df_copy['author'] == 'Unknown') | (df_copy['author'] == 'Anonymous')).mean() * 100

# Print the result
print(f"Percentage of books with unknown authors: {unknown_authors_percentage:.2f}%")

The output shows: Percentage of books with unknown authors: 20.55%

Question: What does this result suggest about the author data in your database?

multiplechoice

The ISBN Cleanup Challenge

You're a data quality specialist at a library. You've run the following code on your book database to clean up the ISBN column:

df['isbn'] = df['isbn'].fillna('Unknown')
df.loc[df['isbn'] == 'N/A', 'isbn'] = 'Unknown'
df.loc[df['isbn'] != 'Unknown', 'isbn'] = df.loc[df['isbn'] != 'Unknown', 'isbn'].apply(lambda x: re.sub(r'\W+', '', x))
# Display unique values to check for remaining inconsistencies
print(df['isbn'].unique())

After running this code, you see the following output of unique ISBN values:

['CUSTOM00001' 'ISBN0000001' 'ISBN0000002' ... 'ISBN0000684' 'CUSTOM02197' 'CUSTOM02198']

Question: What does this output suggest about the ISBN data after cleaning?

Capstone Project: Cleaning and Preparing Book Sales DataCapstone Project: Cleaning and Preparing Book Sales Data
Project Created by

Dhrubaraj Roy

Project Author at DataWars, responsible for leading the development and delivery of innovative machine learning and data science projects.

Project Author at DataWars, responsible for leading the development and delivery of innovative machine learning and data science projects.

This project is part of

Data Cleaning with Pandas

Explore other projects