From Continuous to Categorical: Data Wrangling with Dairy Sales data
From Continuous to Categorical: Data Wrangling with Dairy Sales data  Data Science Project
Data Wrangling with Pandas

From Continuous to Categorical: Data Wrangling with Dairy Sales data

Prepare for an exploration in our lab on Data Transformation techniques, focusing on Discretization, Binning, and Dummies. We'll analyze the "Dairy Goods Sales" dataset to enhance your Pandas skills, categorizing sales, creating binary indicators, and uncovering hidden patterns for data-driven success in dairy products. Join us for analytical excellence.
Start this project
From Continuous to Categorical: Data Wrangling with Dairy Sales data From Continuous to Categorical: Data Wrangling with Dairy Sales data
Project Created by

Anurag Verma

Project Activities

All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.

All our activities include solutions with explanations on how they work and why we chose them.

codevalidated

Categorize Land Area into Small, Medium, Large, and Extra Large Categories

To explore the relationship between land size and total sales, we want to categorize the land area using the Total Land Area (acres) column into meaningful categories: Small, Medium, Large, and Extra Large. We will use the following ranges to define each category: 0, 50, 100, 200, and inf.

Create a new column named LandAreaCategory and store the result in this column. The categories should be as follows:

  • Small: land areas between 0 and 50 acres.
  • Medium: land areas between 50 and 100 acres.
  • Large: land areas between 100 and 200 acres.
  • Extra Large: land areas above 200 acres.

Your result should look similar to the series shown in the image below:

activity2-answer

codevalidated

Categorize Product Prices into Different Price Categories

In order to analyze the distribution of sales across different price levels, we need to categorize the prices of our products. We will use the Price per Unit column to do this.

Below are the price categories and their corresponding price ranges:

  • Low: prices between 0 and 10
  • Medium: prices between 10 and 20
  • High: prices between 20 and 30
  • Premium: prices above 30

Create a new column called PriceCategory to store these categories. Your result should look similar to the series shown in the image below:

activity3-answer

Please make sure to create the categories and store them in the PriceCategory column as mentioned.

codevalidated

Creating Dummy Variables for Dairy Brands

To examine the impact of different dairy brands on sales performance, we will create dummy variables for each brand using the Brand column. The dummy variables should have the prefix Brand and use the default separator.

Store the resulting dummy variables in the variable brand_dummies.

Please note that the resulting dataframe should look similar to the example shown below:

activity4-answer

codevalidated

Investigating the Effect of Product Shelf Life on Customer Demand

We want to analyze the impact of product shelf life on customer demand by discretizing the Shelf Life (days) column into different categories. To do this, create a new column called ShelfLifeCategory to store the results. The categories should be as follows:

  • Short: shelf life between 0 and 7 days
  • Medium: shelf life between 7 and 14 days
  • Long: shelf life between 14 and 30 days
  • Very Long: shelf life above 30 days

Note: Make sure that the first interval is included in your analysis. The resulting categories should be similar to the series shown below:

activity5-answer

codevalidated

Categorizing Storage Conditions Using Dummy Variables for Sales Analysis

Create dummy variables for the Storage Condition column, which contains different conditions such as refrigerated and ambient. This will allow us to examine the effects of storage conditions on sales. Use the prefix "Storage" and the default separator when creating the dummy variables. Save the resulting dataframe in a variable named storage_dummies.

In the resulting dataframe, the storage conditions should be represented as separate columns with binary values indicating whether a particular storage condition is present or not, similar to the example shown below:

activity6-answer

codevalidated

Create Binary Indicators for Different Sales Channels to Identify Effective Channels for Dairy Sales

Create binary indicators for different sales channels, such as online and offline, to identify the most effective channels for dairy sales. Use the prefix Channel: as a separator for each indicator. Save the results in a variable called channel_dummies.

Make sure your output dataframe looks similar to the example provided above.

activity7-answer

codevalidated

Discretize Production Dates into Monthly Time Periods for Analyzing Seasonal Sales Trends

In order to analyze the seasonal sales trends, we need to categorize the production dates into specific time periods (months) and create a new column called ProductionMonth.

The production dates will be divided into different time periods based on month ranges throughout the dataset, leaving the first interval inclusive. Each time period will be labeled with a number from 1 to the length of the month bins.

Once the categorization is completed, the results should be stored in the ProductionMonth column. The final outcome should resemble the series shown in the provided image.

activity8-answer

codevalidated

Categorize Stock Levels into Low, Medium, and High Categories

To assess the adequacy of stock levels and determine optimal replenishment strategies, we need to analyze the Quantity in Stock (liters/kg) column. We will categorize the stock levels into three meaningful categories:

  • Low - for stock levels below the first quantile
  • Medium - for stock levels between the first and second quantiles
  • High - for stock levels above the second quantile

Create a new column named StockCategory to store the results. Additionally, store the bins in a variable stock_bins.

Note that the expected output should resemble the provided image.

activity9-answer

codevalidated

Categorize Revenue into Low, Medium, High, and Premium Segments

Sort the revenue, from the 'Approx. Total Revenue(INR)' column, into four distinctive categories: Low, Medium, High, and Premium.

The classification guidelines are as follows:

  • Low for the first quarter of revenue (first 25%)
  • Medium for the second quarter (25% to 50%)
  • High for the third quarter (50% to 75%)
  • Premium for the last quarter (75% to 100%)

Generate these categories and include them in a newly created column named RevenueCategory.

Below is an illustration of what the results might look like:

activity10-answer

codevalidated

Analyze Reorder Quantities: Segregate the 'Reorder Quantity (liters/kg)' Into Four Sections - Low, Medium, High, Very High

To optimize reorder amounts for various dairy products, characterize the 'Reorder Quantity (liters/kg)' column into four categories: Low, Medium, High, and Very High.

The outcome should be saved in a new column known as 'ReorderCategory'. Detailed information on how to categorize would look like this:

  • 'Low' refers to the first quarter of reorder quantities (first 25%).
  • 'Medium' implies the second quarter (25% to 50%).
  • 'High' comprises the third quarter (50% to 75%).
  • 'Very High' represents the fourth quarter (75% to 100%).

Please see the preview below:

activity11-answer

codevalidated

Plot a bar chart to examine the stock levels `StockCategory` by counting the number of goods in each category

Store the bar chart in the variable stock_category_plot.

Notes:

  • Your bar chart should:

  • Have a Distribution of Stock Categories title

  • Have Stock Category as the X label value
  • Have Count as the Y label value
  • Have the default color of the bars
  • Have (10, 6) as the value of figure size

The desired output chart should look like this:

activity12-answer

codevalidated

Generate a line plot to analyze the monthly production trends `ProductionMonth` by calculating the sum of `Quantity (liters/kg)` per ProductionMonth

Store the line chart in the variable monthly_production_plot.

Notes:

  • Your line chart should:

  • Have a Monthly Production Trends title

  • Have Production Month as the X label value
  • Have Total Quantity as the Y label value
  • Have the default color of the bars
  • Have (10, 6) as the value of figure size
  • Have an o marker

The desired output chart should look like this:

activity13-answer

codevalidated

Compute the average quantity sold (`Quantity Sold (liters/kg)`) for each shelf life category (`ShelfLifeCategory`)

Store the resulting series in the variable median_quantity_sold.

The resulting series should look like this:

activity14-answer

codevalidated

Determine the total revenue `Approx. Total Revenue(INR)` for each price category (`PriceCategory`)

Store the resulting series in the variable total_revenue.

The resulting series should look like this:

activity15-answer

From Continuous to Categorical: Data Wrangling with Dairy Sales data From Continuous to Categorical: Data Wrangling with Dairy Sales data
Project Created by

Anurag Verma

What's up, friends! 👋 I'm a computer science student about to finish my last year of college. 🎓 I LOVE writing code! ❤️ It makes me so happy! 😄 Whether I'm goofing in notebooks 📓 or coding in Python 🐍, writing programs is a blast! 💥

What's up, friends! 👋 I'm a computer science student about to finish my last year of college. 🎓 I LOVE writing code! ❤️ It makes me so happy! 😄 Whether I'm goofing in notebooks 📓 or coding in Python 🐍, writing programs is a blast! 💥

This project is part of

Data Wrangling with Pandas

Explore other projects