All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.
All our activities include solutions with explanations on how they work and why we chose them.
Create a new column Price-to-Rating Ratio in the DataFrame that calculates the price-to-rating ratio for each book. This ratio will help us understand how the price of a book relates to its average rating.
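A minimal sketch of this activity, assuming the DataFrame is named df and has numeric price and rating columns (the names match the sample entries later in this list; the two rows below are invented for illustration):

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up).
df = pd.DataFrame({
    "title": ["Book A", "Book B"],
    "price": [10.0, 9.0],
    "rating": [4.0, 4.5],
})

# Element-wise division: price divided by average rating.
df["Price-to-Rating Ratio"] = df["price"] / df["rating"]
print(df)
```

A rating of 0 would produce an infinite ratio, so it may be worth checking for such rows first.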
The "isbn" column is not needed for our analysis. Write a script to remove this column from the dataframe.
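One possible way to do this uses DataFrame.drop with the columns argument. The toy DataFrame below is an assumption made only so the snippet runs on its own:

```python
import pandas as pd

# Toy data (made up) with an isbn column to remove.
df = pd.DataFrame({
    "title": ["Book A", "Book B"],
    "isbn": ["0000000000001", "0000000000002"],
})

# drop returns a new DataFrame, so reassign to keep the result.
df = df.drop(columns=["isbn"])
print(df.columns.tolist())
```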
Write a script to extract the publication year from the publishDate column and create a new column named YearPublished in the dataframe.
After extracting the year, convert it to a datetime format with only the year (e.g., 2000, 2001, etc.).
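The sketch below shows one reading of this activity. It assumes publishDate holds strings such as "1925-04-10" and stores the year as an integer; whether the activity expects an integer or a datetime-like dtype is not clear from the text, so treat the final dtype as an assumption.

```python
import pandas as pd

# Toy data (made up); the real column may use a different date format.
df = pd.DataFrame({"publishDate": ["1925-04-10", "2008-09-14"]})

# Parse the strings into datetimes, then keep only the year component.
parsed = pd.to_datetime(df["publishDate"], format="%Y-%m-%d")
df["YearPublished"] = parsed.dt.year
print(df)
```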
Create a new dataframe that only includes books with ratings equal to or above 4.5. Name this new dataframe best_books.
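Boolean indexing is one straightforward way to build best_books. The sample rows are invented, and the rating column name is taken from the sample entries later in this list:

```python
import pandas as pd

# Toy data (made up).
df = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "rating": [4.7, 4.5, 3.9],
})

# Keep rows whose rating is at least 4.5; copy() makes best_books independent of df.
best_books = df[df["rating"] >= 4.5].copy()
print(best_books)
```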
Each book is associated with multiple genres in the form of a list of strings. Create a new column GenreCount that stores the number of genres associated with each book.
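A hedged sketch of the counting step. It assumes the genres values may arrive either as real lists or as strings that look like lists, and the helper name count_genres is hypothetical:

```python
import ast
import pandas as pd

# Toy data (made up): genres stored as string representations of lists.
df = pd.DataFrame({"genres": ["['Fiction', 'Classics']", "['Fantasy']", "[]"]})

def count_genres(value):
    # Parse a string such as "['Fiction', 'Classics']" into a list first;
    # values that are already lists are used as they are.
    items = ast.literal_eval(value) if isinstance(value, str) else value
    return len(items)

df["GenreCount"] = df["genres"].apply(count_genres)
print(df)
```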
Some analyses might require having the author's first and last names in separate columns. Write a script to create two new columns, FirstName and LastName, from the author column. For simplicity, assume the last word in the author field is the last name and everything before it is the first name.
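The rule above (last word is the last name, the rest is the first name) can be sketched as follows. The author strings are illustrative, and a single-word name ends up with an empty FirstName under this rule:

```python
import pandas as pd

# Toy data (illustrative author names).
df = pd.DataFrame({"author": ["J. R. R. Tolkien", "Harper Lee", "Plato"]})

# Split each name on whitespace into a list of words.
words = df["author"].str.split()

# Everything except the last word becomes FirstName; the last word becomes LastName.
df["FirstName"] = words.apply(lambda w: " ".join(w[:-1]))
df["LastName"] = words.apply(lambda w: w[-1])
print(df)
```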
Some entries in the dataset might represent short stories or other short works. For this activity, remove all rows from the dataframe where the number of pages is less than 100.
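One way to express this filter, assuming a numeric pages column (the name appears in the sample entries later in this list) and made-up rows:

```python
import pandas as pd

# Toy data (made up).
df = pd.DataFrame({
    "title": ["Short Story", "Novella", "Novel"],
    "pages": [48, 100, 328],
})

# Keep books with 100 pages or more, which removes rows with fewer than 100.
df = df[df["pages"] >= 100].reset_index(drop=True)
print(df)
```

Resetting the index is optional. It is included here as a design choice so that the row labels stay contiguous, which matters if rows are later added at index len(df).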
Each book can belong to multiple genres. Create a new column PrimaryGenre that contains only the first genre listed for each book. The genres column contains a list of genres.
Note: The genres column contains a string representation of a list of genres. You can use the eval function to convert the string representation back into a list. Also, if the genres column contains an empty list, set the value of PrimaryGenre to None.
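The sketch below follows the note, with one deliberate substitution: it uses ast.literal_eval from the standard library rather than eval, because literal_eval only parses Python literals and does not execute arbitrary code. The helper name first_genre and the sample rows are hypothetical.

```python
import ast
import pandas as pd

# Toy data (made up): one book with genres, one with an empty list.
df = pd.DataFrame({"genres": ["['Fiction', 'Classics']", "[]"]})

def first_genre(value):
    # Convert the string representation back into a list.
    genres = ast.literal_eval(value)
    # Return the first genre, or None when the list is empty.
    return genres[0] if genres else None

df["PrimaryGenre"] = df["genres"].apply(first_genre)
print(df)
```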
Create a new column MultipleAwards that flags books that have won multiple awards. If a book has won more than one award, set the value of MultipleAwards to True; otherwise, set it to False.
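The text does not name the column that holds the awards, so the sketch below assumes a column called awards that stores a string representation of a list, in the same style as genres. Both the column name and the storage format are assumptions to adjust against the real dataset.

```python
import ast
import pandas as pd

# Toy data (made up); "awards" is an assumed column name and format.
df = pd.DataFrame({"awards": ["['Award X', 'Award Y']", "['Award X']", "[]"]})

# True when the parsed list holds more than one award, False otherwise.
df["MultipleAwards"] = df["awards"].apply(lambda s: len(ast.literal_eval(s)) > 1)
print(df)
```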
Assuming an average reading speed of 250 words per minute and approximately 300 words per page, create a new column ReadingTimeHours that estimates the reading time in hours for each book.
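With the figures stated above, the estimate works out to pages × 300 words ÷ 250 words per minute ÷ 60 minutes per hour. A minimal sketch, assuming a numeric pages column and made-up rows (the constant names are hypothetical):

```python
import pandas as pd

WORDS_PER_PAGE = 300      # from the activity text
WORDS_PER_MINUTE = 250    # from the activity text

# Toy data (made up).
df = pd.DataFrame({"pages": [100, 250]})

# Total words / reading speed gives minutes; divide by 60 for hours.
df["ReadingTimeHours"] = df["pages"] * WORDS_PER_PAGE / WORDS_PER_MINUTE / 60
print(df)
```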
Create a new column Published21stCentury that flags (True/False) whether a book was published in the 21st century (year 2000 and onwards).
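A short sketch of the flag, assuming the YearPublished column from the earlier activity already exists and holds integer years (the rows are invented):

```python
import pandas as pd

# Toy data (made up); YearPublished is assumed to hold integer years.
df = pd.DataFrame({"YearPublished": [1925, 2000, 2011]})

# The comparison yields a boolean Series: True for year 2000 and onwards.
df["Published21stCentury"] = df["YearPublished"] >= 2000
print(df)
```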
Drop the coverImg, description, and ratingsByStars columns from the dataframe as they will not be used in further analysis. Drop these columns permanently by setting the inplace parameter to True.
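As an illustration, the in-place drop might look like this. The toy DataFrame and its values are placeholders:

```python
import pandas as pd

# Toy data (placeholder values).
df = pd.DataFrame({
    "title": ["Book A"],
    "coverImg": ["cover.jpg"],
    "description": ["A short description."],
    "ratingsByStars": ["[10, 5, 3, 1, 0]"],
})

# With inplace=True, drop modifies df directly and returns None.
df.drop(columns=["coverImg", "description", "ratingsByStars"], inplace=True)
print(df.columns.tolist())
```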
Add a new book entry to the dataframe with the following details:
new_book = {
    "bookID": 10000,
    "title": "The Great Gatsby",
    "author": "F. Scott Fitzgerald",
    "rating": 3.9,
    "pages": 180,
    "publishDate": '1925-04-10',
    "publisher": "Scribner",
    "price": 7.99,
    "genres": "['Fiction', 'Classics']",
    "GenreCount": 2,
    "FirstName": "F.",
    "LastName": "Fitzgerald",
    "PrimaryGenre": "Fiction",
    "MultipleAwards": False,
    "ReadingTimeHours": 9.0,
    "Published21stCentury": True
}
Add this new entry at index len(df).
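One way to add a row at index len(df) is label-based assignment with df.loc. The sketch uses a shortened, three-column version of the entry so it stays self-contained, and it assumes df has a default integer index running from 0 to len(df) - 1:

```python
import pandas as pd

# Toy data (made up) with a default RangeIndex: labels 0 and 1.
df = pd.DataFrame({
    "bookID": [1, 2],
    "title": ["Book A", "Book B"],
    "rating": [4.1, 3.8],
})

# Shortened version of the entry, for illustration only.
new_book = {"bookID": 10000, "title": "The Great Gatsby", "rating": 3.9}

# Assigning to a label that does not exist yet appends a new row.
# Wrapping the dict in a Series aligns its keys with the column names.
df.loc[len(df)] = pd.Series(new_book)
print(df)
```

If earlier filtering left gaps in the index, len(df) may already be in use as a label, in which case this assignment would overwrite that row instead of adding one.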
The publishDate and firstPublishDate columns contain dates in object (string) format. Convert these columns into datetime objects to enable more sophisticated date-based operations and analyses.
Use the format "%Y-%m-%d" to convert the string dates into datetime objects.
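A minimal sketch of the conversion with pd.to_datetime and the stated format. The sample dates are illustrative:

```python
import pandas as pd

# Toy data (illustrative dates stored as strings).
df = pd.DataFrame({
    "publishDate": ["1925-04-10", "1960-07-11"],
    "firstPublishDate": ["1925-04-10", "1960-07-11"],
})

# Convert both columns from strings to datetime64 values.
for col in ["publishDate", "firstPublishDate"]:
    df[col] = pd.to_datetime(df[col], format="%Y-%m-%d")

print(df.dtypes)
```

If some rows in the real data do not match the format, passing errors="coerce" turns them into NaT instead of raising an error; whether that is desirable depends on the analysis.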
Add multiple new book entries to the DataFrame at once. This activity involves creating a list of dictionaries, where each dictionary represents a new book entry with values for all the relevant columns, and then appending this list to the existing DataFrame.
Below are the details of the new book entries:
new_books = [
    {
        "bookID": 10001,
        "title": "To Kill a Mockingbird",
        "author": "Harper Lee",
        "rating": 4.3,
        "pages": 281,
        "publishDate": pd.to_datetime('1960-07-11'),
        "firstPublishDate": pd.to_datetime('1960-07-11'),
        "publisher": "J.B. Lippincott & Co.",
        "price": 9.99,
        "genres": "['Fiction', 'Classics']",
        "GenreCount": 2,
        "FirstName": "Harper",
        "LastName": "Lee",
        "PrimaryGenre": "Fiction",
        "MultipleAwards": False,
        "ReadingTimeHours": 11.24,
        "Published21stCentury": False
    },
    {
        "bookID": 10002,
        "title": "1984",
        "author": "George Orwell",
        "rating": 4.2,
        "pages": 328,
        "publishDate": pd.to_datetime('1949-06-08'),
        "firstPublishDate": pd.to_datetime('1949-06-08'),
        "publisher": "Secker & Warburg",
        "price": 12.99,
        "genres": "['Fiction', 'Classics']",
        "GenreCount": 2,
        "FirstName": "George",
        "LastName": "Orwell",
        "PrimaryGenre": "Fiction",
        "MultipleAwards": False,
        "ReadingTimeHours": 13.12,
        "Published21stCentury": False
    }
]
Add these new entries to the dataframe at positions len(df) and len(df) + 1, respectively.
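Since DataFrame.append was removed in pandas 2.0, one option is to build a DataFrame from the list of dictionaries and join it with pd.concat. The sketch uses shortened entries and made-up existing rows, and assumes df has a default integer index so the new rows land at len(df) and len(df) + 1:

```python
import pandas as pd

# Toy data (made up) with a default RangeIndex.
df = pd.DataFrame({
    "bookID": [1, 2],
    "title": ["Book A", "Book B"],
    "rating": [4.1, 3.8],
})

# Shortened versions of the entries, for illustration only.
new_books = [
    {"bookID": 10001, "title": "To Kill a Mockingbird", "rating": 4.3},
    {"bookID": 10002, "title": "1984", "rating": 4.2},
]

start = len(df)  # index the first new row will receive

# ignore_index=True renumbers the rows 0..n-1, so the new entries
# end up at positions start and start + 1.
df = pd.concat([df, pd.DataFrame(new_books)], ignore_index=True)
print(df)
```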