All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.
All our activities include solutions with explanations on how they work and why we chose them.
The year
column represents the release year of each game, and the Global_Sales
column represents the total worldwide sales of the game in millions of units.
Create a new column called Revenue_Per_Year
to store the calculated values. For games released before 2024, divide the Global_Sales
by the number of years between the release year and 2024. For games released in or after 2024, simply use the Global_Sales
value as the revenue per year.
Note : New column added at the end of the
df
Create a new column called Title_Length
to store the length of each game's title. Calculate the number of characters in each game's name from the Name
column and assign this count to the new Title_Length
column.
Create a new column called Genre_Category
to store a simplified categorization of game genres. First, define a function categorize_genre()
that takes a genre as input and returns a broader category. This function should map genres to four main categories: 'Action/Adventure'
, 'Sports/Racing'
, 'RPG/Strategy'
, or 'Other'
. Then, apply this function to the Genre
column to populate the new Genre_Category
column.
The resulting Genre_Category
column will contain simplified genre categories for each game, providing a more streamlined classification of game types.
Create a new column Total_Sales
that represents the overall sales performance of each game globally. This column will sum up the individual sales values from the NA_Sales
, EU_Sales
, JP_Sales
, and Other_Sales
columns for each game entry.
The Total_Sales
column provides a comprehensive view of each game's commercial success across all markets, allowing for easier comparison and analysis of overall game performance.
Calculate regional sales percentages for each game. Use the Total_Sales
column to determine the percentage of sales in North America, Europe, Japan, and Other regions. If Total_Sales
is zero, set all percentages to 0. The result will be stored in new columns: NA_Percentage
, EU_Percentage
, JP_Percentage
, and Other_Percentage
.
Note : New column added at the end of the
df
Create a new column called Sales_Category
to categorize each game's sales performance. First, define a function categorize_sales()
that takes a sales value as input and returns a category: 'Low' for sales less than 1 million, 'Medium' for sales between 1 and 5 million, and 'High' for sales of 5 million or more. Then, apply this function to the Total_Sales
column to populate the new Sales_Category
column.
The resulting Sales_Category
column will contain the sales category ('Low', 'Medium', or 'High') for each game based on its total sales value.
Create a new column called Summary
to store a concise description of each game. For every row in the DataFrame, generate a string that combines the game's name from the Name
column, its platform from the Platform
column, and its total sales figure from the Total_Sales
column. Format the total sales value to display two decimal places and append "million" to indicate the unit.
The resulting Summary
column will contain a formatted string for each game, providing a quick overview of the game's key information in a single field.
Create a new column called Years_Since_Release
to calculate how long ago each game was released. Use the Year
column to determine the time elapsed since each game's release date. For games released in or before the current year (2024), calculate the difference between the current year and the release year. For any games with future release dates, assign a value of 0.
The resulting Years_Since_Release
column will show the age of each game in years, with unreleased games marked as 0 years old.
Note : New column added at the end of the
df
Create a new column called Formatted_Global_Sales
to present the global sales figures in a more readable format. Transform the values from the Global_Sales
column by rounding them to two decimal places, replacing the decimal point with a comma, and adding an "M" suffix to represent millions.
The resulting Formatted_Global_Sales
column will display the sales figures in a standardized format, making it easier to read and compare values across different games.
Create a new column called Release_Decade
to group games by their release decade. Use the Year
column to determine which decade each game belongs to. Round down each year to the nearest decade and format the result as a string ending with "s".
The resulting Release_Decade
column will display the decade of release for each game (e.g., "1980s", "1990s", "2000s"), allowing for easy categorization and analysis of games by era.
Note : New column added at the end of the
df
The NA_Sales
, EU_Sales
, JP_Sales
, and Other_Sales
columns represent the sales figures for each game in North America, Europe, Japan, and other regions, respectively.
Calculate and flag top 10% selling games for each region. Create new columns Top_NA_Sales
, Top_EU_Sales
, Top_JP_Sales
, and Top_Other_Sales
to indicate if a game's sales in that region are in the top 10%. Use the quantile()
function to determine the threshold for each region, then apply a lambda function to compare each game's sales against this threshold
Note : New columns are added at the end of the
df
Create two new columns: Decade
and Is_Recent
.
For Decade
, use the Year
column to determine which decade each game belongs to. Round down each year to the nearest decade and format the result as a string ending with "s".
For Is_Recent
, identify the most recent decade from the Decade
column. Then, mark each game as 'Yes' if it belongs to this most recent decade, or 'No' if it doesn't.
These columns will help categorize games by their release era and highlight those from the most recent decade.
Compare each game's sales performance to the average sales of its genre. First, calculate the mean global sales for each genre. Then, create a new column Sales_vs_Genre_Avg
that shows how each game's global sales compare to its genre's average. For genres with zero average sales, set the ratio to zero to avoid division errors.
The Sales_vs_Genre_Avg
column will contain values above 1 for games that outperform their genre average, and below 1 for those that underperform. This metric helps identify standout games within each genre.
Create two new columns: Platform_Count
and Cross_Platform_Hit
.
For Platform_Count
, use the Name
and Platform
columns to count how many different platforms each game was released on.
For Cross_Platform_Hit
, identify games that were released on multiple platforms (more than one) and had global sales exceeding 5 million units. Mark these games as 'Yes', and others as 'No'.
These columns will help analyze the prevalence and success of multi-platform game releases.
Create a new column called Publisher_Success_Rate
. First, define a function called success_rate
that calculates the percentage of a publisher's games with global sales exceeding 1 million units. Then, apply this success_rate
function to each publisher's game catalog using the Publisher
and Global_Sales
columns.
The Publisher_Success_Rate
column will show the percentage of successful games for each publisher, providing insight into which publishers consistently produce high-selling games.
Create a new column Dominant_Region
that identifies the region with the highest sales. Use a dictionary to compare sales across North America, Europe, Japan, and Other regions. If North America is the dominant region, represent it as NaN; otherwise, return the region code.
NA_Sales
, EU_Sales
, JP_Sales
, and Other_Sales
columns represent sales figures for each region.
Calculate the performance ranking of each genre within its release decade based on total global sales. Group the data by decade and genre, then rank genres within each decade according to their total sales. Create a new column Genre_Rank_in_Decade
to store these rankings.
Next, analyze how each genre's popularity changes over time. Compare a genre's rank in its current decade to its average rank in previous decades. Create a Genre_Trend
column, marking a genre as 'Rising' if its current rank is better than its historical average, and 'Falling' otherwise.
This analysis helps track how different game genres perform and evolve in popularity across different decades.
Create a new column called NA_EU_Ratio
to compare North American and European sales for each game. Use the NA_Sales
and EU_Sales
columns to calculate this ratio. For each game, divide the North American sales by the European sales, but only if European sales are greater than zero. If European sales are zero, set the ratio to zero to avoid division errors.
The NA_EU_Ratio
column will show how North American sales compare to European sales for each game, with higher values indicating stronger performance in North America relative to Europe.
Format all numeric columns in the dataset to display two decimal places. First, select columns with float or integer data types. Then, apply string formatting to these numeric values, rounding them to two decimal places. Finally, create a new DataFrame df_numeric
containing only these formatted numeric columns.
df_numeric
will contain all numeric columns from the original dataset, with values formatted as strings with two decimal places.