All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.
All our activities include solutions with explanations on how they work and why we chose them.
Some quick trivia before we dive head first into it!
Notice how in the first exercise the result was in full date format? Let's create a column that extracts the year from the PublishDate column. Having a year_published column improves the clarity and readability of the dataset, making it easier for users to understand and interpret the data.
Ensure the column in question is in the correct data type before extracting the year.
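As a rough sketch of the type check and year extraction, assuming the dates are stored as text (the sample rows below are hypothetical, and the real date format may differ):

```python
import pandas as pd

# Hypothetical sample rows; the real dataset's values and formats may differ.
df = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "PublishDate": ["2008-09-14", "1960-07-11", "not a date"],
})

# Convert to datetime first; entries that cannot be parsed become NaT.
df["PublishDate"] = pd.to_datetime(df["PublishDate"], errors="coerce")

# Extract the year; the nullable Int64 type keeps missing years as <NA>.
df["year_published"] = df["PublishDate"].dt.year.astype("Int64")
print(df)
```

Using errors="coerce" is one possible choice here: it keeps the conversion from failing on malformed entries, at the cost of silently turning them into missing values.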
Who are the rockstars of the book world? Can you find the authors with the highest average ratings? Were your favorite authors on the list?
In this activity, you will calculate the average rating for each author by grouping the DataFrame by the author column and computing the mean rating. Reset the index using reset_index() and save the result to a new variable, author_avg_ratings. Finally, sort the DataFrame in descending order by the rating column.
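The steps above could look roughly like this, using a small made-up DataFrame in place of the real dataset:

```python
import pandas as pd

# Hypothetical sample data standing in for the real books dataset.
df = pd.DataFrame({
    "author": ["A", "A", "B", "C"],
    "rating": [4.0, 5.0, 3.5, 4.8],
})

author_avg_ratings = (
    df.groupby("author")["rating"]
    .mean()                                  # average rating per author
    .reset_index()                           # turn the grouped Series into a table
    .sort_values("rating", ascending=False)  # highest average first
)
print(author_avg_ratings)
```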
First, remove the text in parentheses in the author column. Some entries look like this: Markus Zusak (Goodreads Author). Make them look like this: Markus Zusak. Then group the DataFrame by the author column and count the number of books associated with each author.
Reset the index and save your result to a variable named author_book_count.
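One way to sketch the cleanup and the count is with a regular expression. The count column name book_count and the sample rows are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical sample rows; only the author format is taken from the text.
df = pd.DataFrame({
    "title": ["The Book Thief", "I Am the Messenger", "To Kill a Mockingbird"],
    "author": ["Markus Zusak (Goodreads Author)", "Markus Zusak", "Harper Lee"],
})

# Drop any parenthesized text together with the whitespace in front of it.
df["author"] = (
    df["author"]
    .str.replace(r"\s*\([^)]*\)", "", regex=True)
    .str.strip()
)

# Count rows per author; "book_count" is an assumed column name.
author_book_count = df.groupby("author").size().reset_index(name="book_count")
print(author_book_count)
```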
Calculate the mean price of books written by each author. Make sure the price column is the correct data type before grouping.
Reset the index and save your result to a variable named author_avg_price.
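A minimal sketch of the type conversion and grouping, assuming the price column arrives as text with occasional non-numeric entries (the sample values are invented):

```python
import pandas as pd

# Hypothetical sample; prices stored as strings is an assumption.
df = pd.DataFrame({
    "author": ["A", "A", "B"],
    "price": ["5.20", "4.80", "n/a"],
})

# Convert to numbers; values that cannot be parsed become NaN.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

author_avg_price = df.groupby("author")["price"].mean().reset_index()
print(author_avg_price)
```

Authors whose prices are all missing end up with a NaN average, which may need handling before any further analysis.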
The BBE score is an indicator of the overall reader feedback and engagement with a particular book. What is the combined BBE score for each author, obtained by summing the BBE scores of all books written by that author?
Reset the index and save your result to a variable named author_total_bbe_score. Finally, sort the result in descending order.
Calculate the average number of pages written per author. Then organize this information into a table using the reset_index() method and sort it by average pages, with the highest at the top. Finally, select the top 10 authors from this sorted list to identify those who wrote the most pages on average. Save your final result in the variable author_avg_pages.
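The chain of steps might be sketched as follows. The sample data is hypothetical and has fewer than ten authors, so head(10) simply returns all of them:

```python
import pandas as pd

# Hypothetical sample data with numeric page counts.
df = pd.DataFrame({
    "author": ["A", "A", "B", "C"],
    "pages": [300, 500, 1200, 250],
})

author_avg_pages = (
    df.groupby("author")["pages"]
    .mean()
    .reset_index()
    .sort_values("pages", ascending=False)  # highest average first
    .head(10)                               # keep at most the top 10 authors
)
print(author_avg_pages)
```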
This will help us understand the distribution of books across different languages, providing insight into language-specific publishing trends and audience diversity.
Save your result in the variable books_per_language and reset the index.
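A possible sketch of the per-language count, with an assumed count column name (book_count) and invented rows:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "language": ["English", "English", "Spanish", "French"],
})

# Count books per language; "book_count" is an assumed column name.
books_per_language = df.groupby("language").size().reset_index(name="book_count")
print(books_per_language)
```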
Here, we will analyze the yearly distribution of book publications, which will reveal trends in the publishing industry and highlight significant years of activity or growth. But first filter your dataset to only include years before 2022. The column has anomalous entries like dates in the future (e.g., 2027) that might be due to typos.
Save your result in the variable books_per_year and organize this information into a table using the .reset_index() method.
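As an illustration of the filter and the yearly count, assuming the year_published column from the earlier activity is available (the rows and the book_count column name are made up):

```python
import pandas as pd

# Hypothetical sample; 2027 stands in for an anomalous future date.
df = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "year_published": [1999, 1999, 2005, 2027, 2021],
})

# Keep only years before 2022, then count books per year.
df_valid = df[df["year_published"] < 2022]
books_per_year = (
    df_valid.groupby("year_published").size().reset_index(name="book_count")
)
print(books_per_year)
```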
In this activity, your task is to identify the year with the least number of books published.
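Given a per-year count table, the least active year could be found with idxmin(). The table below is a hypothetical stand-in, and its column names are assumptions:

```python
import pandas as pd

# Hypothetical per-year counts; column names are assumed.
books_per_year = pd.DataFrame({
    "year_published": [1999, 2005, 2021],
    "book_count": [12, 3, 7],
})

# idxmin() returns the index label of the smallest count (the first one on ties).
least_row = books_per_year.loc[books_per_year["book_count"].idxmin()]
least_year = int(least_row["year_published"])
print(least_year)
```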
Filter the DataFrame to include only books written in English, then sort the filtered DataFrame by the pages column in descending order and select the top 10 entries.
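A short sketch of the filter, sort, and selection. The variable name longest_english_books is hypothetical, as is the sample data:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "language": ["English", "Spanish", "English", "English"],
    "pages": [320, 900, 1100, 150],
})

# Keep English books only, then take the 10 longest.
english_books = df[df["language"] == "English"]
longest_english_books = english_books.sort_values("pages", ascending=False).head(10)
print(longest_english_books)
```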
Sort the dataset by the PublishDate column in ascending order so the oldest books come first. Then select the top 10 records and create a new DataFrame containing only the title and PublishDate columns. Save this result to a variable named oldest_books.
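Assuming PublishDate has already been converted to a datetime type, the selection might be sketched like this (sample rows are invented):

```python
import pandas as pd

# Hypothetical sample with PublishDate already parsed as datetime.
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "PublishDate": pd.to_datetime(["2008-09-14", "1813-01-28", "1960-07-11"]),
    "rating": [4.3, 4.2, 4.1],
})

# Ascending sort puts the earliest dates first; keep only two columns.
oldest_books = df.sort_values("PublishDate").head(10)[["title", "PublishDate"]]
print(oldest_books)
```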
Which genres reign supreme in the reading world? Can you identify the most popular genres based on the average ratings?
Begin by extracting a new column, first_genre, from the genres column. Then group the data by first_genre to find the top five genres by average rating. Reset the index and save your result in a variable named top_5_genres_by_rating.
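The text does not say how the genres column is stored. The sketch below assumes each entry is a string that looks like a Python list (for example "['Fantasy', 'Fiction']"); if the real column uses another format, the parsing step would need to change. The helper name parse_first_genre is hypothetical.

```python
import ast

import pandas as pd

# Hypothetical sample; the list-as-string format is an assumption.
df = pd.DataFrame({
    "genres": ["['Fantasy', 'Fiction']", "['Fantasy']", "['Romance', 'Fiction']",
               "[]", "['Poetry']", "['Romance']"],
    "rating": [4.4, 4.6, 3.9, 4.0, 4.2, 4.1],
})

def parse_first_genre(raw):
    """Return the first genre from a list-like string, or None if it is empty."""
    genres = ast.literal_eval(raw)
    return genres[0] if genres else None

df["first_genre"] = df["genres"].apply(parse_first_genre)

# Rows without a first genre are dropped by groupby by default.
top_5_genres_by_rating = (
    df.groupby("first_genre")["rating"]
    .mean()
    .reset_index()
    .sort_values("rating", ascending=False)
    .head(5)
)
print(top_5_genres_by_rating)
```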
Is genre a key factor in book pricing? Calculate the average price for each genre to see if fantasy novels leave your wallet feeling fictional, or if self-help books offer more bang for your buck!
Sort your final result in descending order and save it in the variable average_price_by_genre_sorted. Organize this information into a table using the .reset_index() method.
Using the first_genre column, identify the most popular genre.
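If "most popular" is read as "most frequent" (an assumption; the activity could also mean highest rated), a sketch with value_counts() might be:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "first_genre": ["Fantasy", "Fiction", "Fantasy", "Romance", "Fantasy", "Fiction"],
})

# Count how often each genre appears and pick the most frequent one.
genre_counts = df["first_genre"].value_counts()
most_popular_genre = genre_counts.idxmax()
print(most_popular_genre, genre_counts.max())
```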
Which authors are most commonly associated with each genre? Begin by grouping the dataset by first_genre and author, then count the occurrences.
Afterwards, identify the top author for each genre based on frequency, sorting in descending order.
Save your final result to a variable named sorted_genre_author_count and organize this information into a table using the .reset_index() method.
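One possible way to keep a single top author per genre is to sort by count and drop the later duplicates of each genre. This is a sketch rather than the activity's official solution; the count column name and the sample rows are assumptions.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "first_genre": ["Fantasy", "Fantasy", "Fantasy", "Fantasy",
                    "Romance", "Romance", "Romance"],
    "author": ["X", "X", "X", "Y", "Z", "Z", "W"],
})

# Count books per (genre, author) pair; "count" is an assumed column name.
genre_author_count = (
    df.groupby(["first_genre", "author"]).size().reset_index(name="count")
)

# Sort by frequency, then keep the first (most frequent) author for each genre.
sorted_genre_author_count = (
    genre_author_count.sort_values("count", ascending=False)
    .drop_duplicates(subset="first_genre")
    .reset_index(drop=True)
)
print(sorted_genre_author_count)
```

When two authors tie within a genre, this approach keeps whichever one the sort happens to place first, so ties may need explicit handling.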
Group the DataFrame by first_genre and then filter for books with ratings above 4.5.
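The instruction can be read in more than one way. The sketch below takes one reading: keep only books rated strictly above 4.5, then count them per genre. The result name and the book_count column are hypothetical.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "first_genre": ["Fantasy", "Fantasy", "Romance", "Poetry"],
    "rating": [4.7, 4.2, 4.6, 4.5],
})

# Strictly above 4.5, so a rating of exactly 4.5 is excluded.
high_rated = df[df["rating"] > 4.5]

high_rated_per_genre = (
    high_rated.groupby("first_genre").size().reset_index(name="book_count")
)
print(high_rated_per_genre)
```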
Here, we'll calculate the average rating for each language to see if certain tongues tend to inspire higher praise (or criticism) from readers.
Reset the index of your final result and save it in a variable named average_rating_by_language.
Group the DataFrame by language and calculate the average book length (pages).
This explores potential differences in book length across languages.
Save your result in the variable avg_book_length_by_language and reset the index.
Group the DataFrame by language and split the data into highly_rated_genres (above 3.5) and lowly_rated_genres (3.5 or below) groups for each language in our data.
Analyze the distribution of genres within each group for each language. The goal is to explore potential connections between language, book rating, and genre preference.
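A sketch of the split, assuming each book's genre is represented by the first_genre column from the earlier activity and that the distribution is expressed as a count per language and genre (both are assumptions, as is the book_count column name):

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "language": ["English", "English", "English", "Spanish", "Spanish"],
    "first_genre": ["Fantasy", "Fantasy", "Romance", "Fiction", "Poetry"],
    "rating": [4.2, 3.9, 3.1, 3.5, 4.0],
})

# Split on the 3.5 threshold: above goes to "high", 3.5 or below to "low".
high = df[df["rating"] > 3.5]
low = df[df["rating"] <= 3.5]

# Count genres within each language for both groups.
highly_rated_genres = (
    high.groupby(["language", "first_genre"]).size().reset_index(name="book_count")
)
lowly_rated_genres = (
    low.groupby(["language", "first_genre"]).size().reset_index(name="book_count")
)
print(highly_rated_genres)
print(lowly_rated_genres)
```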
This question aims to assess the popularity of books by calculating the ratio of the percentage of users who liked the book (LikedPercent) to the total number of ratings (NumRatings).
Ranking the books by this ratio identifies those that are highly regarded relative to their reader engagement levels.
First, calculate the ratio and store it in a new column named Liked_to_NumRatings_ratio. Then rank the books by this ratio in descending order, creating a Rank column. Finally, sort the DataFrame by this rank and reset the index. Save your final result in the variable df_ratio.
Remember that likedPercent is given as a percentage. This means that to obtain a ratio, likedPercent has to be divided by 100 first.
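The ratio and ranking steps might be sketched as below. The text spells the column both LikedPercent and likedPercent; this sketch uses likedPercent, so adjust the name to match your data. The sample values are invented.

```python
import pandas as pd

# Hypothetical sample data; column spelling is an assumption.
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "likedPercent": [96.0, 90.0, 80.0],
    "NumRatings": [1000, 100, 400],
})

# Convert the percentage to a fraction before dividing by the rating count.
df["Liked_to_NumRatings_ratio"] = (df["likedPercent"] / 100) / df["NumRatings"]

# Rank 1 goes to the highest ratio; ties share the lowest rank of their group.
df["Rank"] = df["Liked_to_NumRatings_ratio"].rank(ascending=False, method="min")

df_ratio = df.sort_values("Rank").reset_index(drop=True)
print(df_ratio)
```

Books with zero ratings would produce an infinite ratio, so they may need to be filtered out first.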
Using the df_ratio calculated in the previous activity, identify the year that had the highest rating.
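"Highest rating" is not defined precisely here. One reading is the year with the highest mean rating, which could be sketched as follows, assuming df_ratio still carries the year_published and rating columns (the data below is a hypothetical stand-in):

```python
import pandas as pd

# Hypothetical stand-in for df_ratio with only the columns used here.
df_ratio = pd.DataFrame({
    "year_published": [1999, 1999, 2005, 2010],
    "rating": [4.0, 4.4, 4.5, 3.8],
})

# Average rating per year, then the year where that average peaks.
avg_rating_by_year = df_ratio.groupby("year_published")["rating"].mean()
best_year = int(avg_rating_by_year.idxmax())
print(best_year)
```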