All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.
All our activities include solutions with explanations on how they work and why we chose them.
Start by merging the datasets and count the number of unique ITEM TYPE
there are in the DataFrame.
Merge the datasets containing sales information for these two years. Once combined, filter the merged DataFrame to isolate retail sales data for each year individually. Sum up these filtered values, calculate the total retail sales for each year and sum them up. Your final answer should be a float, like 7.65
.
Merge the data from all years. Then, calculate the total RETAIL SALES
for each unique item across all years. Finally, use the .sort_values()
function in descending order to identify the top-selling items for the entire period (2017-2020)in hierarchical. Store your final output in the variable top_sellers_all_years
.
From the output of the previous activity, determine the ITEM TYPE
with the highest value.
Calculate the percentage change in sales for each item year-over-year (YoY). Start by adding a new column to the combined DataFrame:TOTAL SALES
as the sum of RETAIL SALES
and WAREHOUSE SALES
. Group the data by YEAR
to compute annual total sales, then calculate the year-over-year percentage change in these sales. Finally, store your final result in the variable: annual_sales
.
Calculating the year-over-year percentage change in sales is important because it provides valuable insights into the company's growth and performance over time. It helps identify trends, such as consistent growth or decline, and can reveal the impact of market conditions, strategic decisions, or external factors on sales.
In this activity, the goal is to determine and rank the top-selling items based on their total sales for each individual year from 2017 to 2020. Group merged data by YEAR
and ITEM TYPE
and reset the index. Then sort the aggregated data in descending order based on total sales for each year. Your final result should be stored in the variable: yearly_item_sales_sorted
.
Concatenate selected datasets to ensure the analysis focuses solely on years 2018, 2019, and 2020. Next, group the data by SUPPLIER
and TOTAL SALES
(Make sure to create this column anew and reset the index ) and calculate the total sales contributed by each supplier across these three years. Finally sort the results to identify the top ten performing suppliers in terms of sales contribution. Store your result in the variable: supplier_sales_contribution_sorted
.
Begin by merging the datasets in question, then filter the merged dataset to include WINE
, BEER
and LIQUOR
. Next, aggregate the filtered data by YEAR
and SUPPLIER
, calculating the total sales for wine attributed to each supplier annually and reset the index. Using these aggregated results, identify the supplier with the highest total wine sales for each year within the dataset. Store your final output in the variable: top_supplier_wine
.
Combine the datasets for all years and group the combined dataset by the SUPPLIER
column and count the unique ITEM TYPE
entries for each supplier, then calculate the number of unique items each supplier provides. The resulting DataFrame(using .reset_index()
) will contain two columns: SUPPLIER
and UNIQUE_ITEM_COUNT
.
Store your final output in the variable: unique_items_per_supplier
.
Using the output from the previous, sort unique_items_per_supplier
in descending order based on the count of unique item codes to identify the top suppliers with the most diverse product portfolios. Finally, display the top 10 result in the variable top_10_suppliers_by_items
Begin by applying a dictionary that maps numeric month values to month names to a combined dataset of all years, creating a new MONTH_NAME
column. Group this DataFrame by YEAR
and MONTH_NAME
and calculate the average sales per month. Reset the index and then identify the months with the highest average sales yearly. Store your output as best_months
.
Start by merging the datasets for years before 2020. Filter the combined DataFrame to include only entries where the ITEM TYPE
is BEER
. Next, compute a TOTAL SALES
column and group the data by YEAR
and MONTH
(resetting the index), summing the TOTAL SALES
for each month.
Store your final result in beer_sales_by_month
.
Calculate and compare the total sales revenue generated through warehouse channels versus retail channels for each year from 2017 to 2020.
Aggregate the merged DataFrame by YEAR
, summing up the total WAREHOUSE SALES
and RETAIL SALES
for each year. The resulting output: sales_comparison
DataFrame is then filtered for retail and warehouse sales individually.
Merge the result of retail and warehouse sales, and store in the variable: merged_sales
.
We aim to find out the average value of each transaction for every sales channel across all years from 2017 to 2020. To do this, we need the total sales amount per transaction, aggregated by year and sales channel. Once we have this, we can determine whether transaction sizes vary depending on whether the sales were warehouse or retail.
First, use the merged DataFrame and aggregate it by both YEAR
and ITEM TYPE
. Calculate the total sales using .sum()
function and count the number of transactions using .count()
function for each item type in each year.
With the total sales and transaction count in hand, compute the average sales per transaction by dividing the total sales by the number of transactions. Append this calculation as a new column AVG_SALES_PER_TRANSACTION
to your DataFrame.
Finally, store your output to the variable: :average_sales_comparison
.
Merge selected datasets, to include the years in question, then group the merged DataFrame by YEAR
and aggregate the sum of WAREHOUSE SALES
, RETAIL SALES
, and TOTAL SALES
. Next, calculate the percentage contribution of warehouse and retail sales to the total sales for each year. This is achieved by dividing the sum of WAREHOUSE SALES
and RETAIL SALES
by TOTAL SALES
and multiplying by 100. Store your final output in the variable: sales_contribution
.
Do certain suppliers dominate one channel over the other, and what factors might influence these preferences?
First, create a new DataFrame by grouping the merged DataFrame for all years by SUPPLIER
and ITEM TYPE
and aggregating the sum of RETAIL SALES
and WAREHOUSE SALES
. Then calculate the total sales for each supplier across both channels by adding RETAIL SALES
and WAREHOUSE SALES
. Next, compute the percentage of sales contributed by each channel relative to the total sales for each supplier. This is done by dividing RETAIL SALES
and WAREHOUSE SALES
by TOTAL SALES
and multiplying by 100. Finally, identify suppliers that dominate each channel based on higher percentage sales. Suppliers with a higher percentage of retail sales compared to warehouse sales are categorized as retail-dominant, while those with higher warehouse sales percentages are classified as warehouse-dominant.
Combine both results and store in the variable: dominant_suppliers
.
Group the combined data for all years by Year
and Item Type
, aggregating the sum of Retail Sales
and Warehouse Sales
. At the same time, calculate the total sales per year across both channels; for this, reset the index for both computations. Merge the aggregated item sales data with the total sales data to compute the percentage of total sales contributed by each item type in each channel. Achieve this by dividing the Retail Sales Item and Warehouse Sales Item by the Retail Sales Total and Warehouse Sales Total, respectively, and then multiplying by 100.
The resulting DataFrame: item_sales
is your result.