Pandas Data Wrangling Practices Using YouTube Trending Videos Dataset
Pandas Data Wrangling Practices Using YouTube Trending Videos Dataset Data Science Project
Data Wrangling with Pandas

Pandas Data Wrangling Practices Using YouTube Trending Videos Dataset

Want to know what makes YouTube videos go viral? Here's your chance! We'll use real YouTube data to find out. You'll learn to group videos by type, turn view counts into easy categories, and clean up messy data. We'll use cool Pandas tricks like `groupby()`, `cut()`, and `merge()` to dig deep into video stats. By the end, you'll be a pro at spotting trends in big data. Come on, let's figure out what makes YouTube videos popular!
Start this project
Pandas Data Wrangling Practices Using YouTube Trending Videos DatasetPandas Data Wrangling Practices Using YouTube Trending Videos Dataset
Project Created by

Dhrubaraj Roy

Project Activities

All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.

All our activities include solutions with explanations on how they work and why we chose them.

codevalidated

Find Video with Maximum `views` and Extract `title`, `channel_title` and `views`

Using the pandas DataFrame df, find the video with the highest number of views. Extract its title, channel name, and view count. Store this information in the max_views_video variable as a pandas Series. This analysis will reveal the single most popular video in the dataset, providing insights into what content has achieved the highest level of viewer engagement.

codevalidated

Analyze Video Distribution and Mean `views` per `category_id`

Group the videos by their category_id and count how many videos are in each category. Then, calculate the mean number of views for each category. Combine these results into a single DataFrame called category_analysis with columns Video Count and Mean Views. Use pd.concat() to merge the results along axis=1 (columns) This analysis will reveal which categories are most prevalent in the dataset and which tend to attract the most views.

codevalidated

Calculate `likes_ratio` and Analyze Channel Engagement Metrics

Create a new column likes_ratio by dividing likes by views for each video. Then, group the data by channel_title and calculate the mean and maximum likes ratio for each channel. Store these results in channel_engagement, sorting channels by their mean likes ratio in descending order. This analysis will reveal which channels are most effective at generating likes relative to their view counts, potentially indicating higher audience engagement or content quality.

codevalidated

Categorize Videos as 'Above Average' or 'Below Average' Based on `views` Compared to Category Mean

Calculate the mean number of views for each video category and store it in a new column category_mean_views. Then, compare each video's view count to its category average. Create a new column popularity_tier that labels videos as Above Average if their views exceed their category mean, and Below Average otherwise. This analysis will help identify which videos are outperforming others in their category.

Note : New column added at the end of the df

codevalidated

Analyze Distribution of Video Categories Across Publishing Hours Using `category_id` and `publish_hour`

Convert publish_time to datetime, then extract the hour into a new publish_hour column. Create a cross-tabulation of video categories and publishing hours, storing the result in category_hour_crosstab. Ensure the final cross-tab has no missing values by filling it with 0 and uses integer data type.

codevalidated

Identify Top 10 Most Frequently Used Video Tags by Splitting `tags` Column

Split the tags column, which contains pipe-separated tags for each video, into a list of individual tags. Store this list in a new column called tags_list in the original DataFrame. Then, create a new DataFrame tags_df by exploding the tags_list column, so each tag gets its own row. Count the occurrences of each unique tag and find the 10 most common ones. Store these in common_tags. This analysis will reveal which tags are most frequently used by popular YouTube videos.

codevalidated

Analyze Impact of Disabling Comments and Ratings on Average Video Engagement Metrics

Do disabled features affect popularity? Group videos based on whether comments and ratings are enabled or disabled. Calculate the average views, likes, and dislikes for each group. Store the results in disabled_impact. The comments_disabled and ratings_disabled columns indicate whether these features are turned off for each video. This analysis will reveal how disabling user interactions affects key engagement metrics.

codevalidated

Analyze Trending Video Distribution by Day of Week Using `trending_date`

Convert the trending_date column to a datetime format, then extract the day of the week for each trending date. Store this in a new column called trending_day in the original DataFrame. Count how many videos trend on each day of the week. The trending_date column represents when a video appeared on the trending list. Store the results in trending_day_counts. This analysis will reveal if there's a pattern in which days videos are more likely to become trending.

Note: Dates in trending_date are formatted as '%y.%d.%m' (e.g., '23.15.07' for July 15, 2023). Also, use value_counts() method to count the values as using different method, you might get different results.

codevalidated

Categorize Videos into 5 Popularity Tiers Based on `views`

Divide the range of view counts into five equal-width bins and assign each video to one of these categories: 'Very Low', 'Low', 'Medium', 'High', or 'Very High'. Store the results in a new column called views_category at the end of the df dataframe. This categorization will help you understand the distribution of video popularity and identify which videos fall into different viewership tiers.

Note : New column added at the end of the df

codevalidated

Group `likes` into Quartiles and Analyze Average `views` per `likes_quartile`

Divide videos into four equal-sized groups (quartiles) based on their number of likes, labeling them Q1 to Q4 in a new column likes_quartile. Then, calculate the average number of views for each quartile. Store the results in likes_analysis and sort them in descending order. This analysis will reveal if videos with more likes tend to have higher view counts.

codevalidated

Categorize `publish_time` into Time Periods and Analyze `views` by `publish_time_category`

Extract the hour from the publish_time column (publish_hour is already created in 5th activity) and categorize it into time periods: Night (9PM-4AM), Morning (5AM-11AM), Afternoon (12PM-4PM), and Evening (5PM-8PM). Create new column publish_time_category in the original DataFrame with these labels. Then, calculate the average views for videos published in each time category, storing the results in time_analysis. Sort the time_analysis results in descending order of average views. This analysis will reveal if certain publishing times tend to result in higher view counts.

codevalidated

Categorize Videos by Comment Count and Analyze Average `views` per `comment_category`

Create a function that bins videos into 'Low' (0-100), 'Medium' (101-1000), 'High' (1001-10000), or 'Viral' (>10000) categories based on their comment count. Apply this function to the comment_count column, which represents the number of comments each video received. Store the results in a new comment_category column. Then, calculate the average views for each comment category, storing the results in comment_analysis. Sort the comment_analysis results in descending order of average views. This analysis will reveal if videos with more comments tend to have higher view counts.

codevalidated

Analyze Relationship Between `likes_ratio` and Average `views` Using Discretized Categories

Using the likes_ratio column created in activity 3, create a new column called likes_ratio_category by dividing it into five equal-sized groups. Label these groups as Very Low, Low, Medium, High, and Very High. Then, calculate the average number of views for each category and sort the results in descending order. Store the final result in a variable called ratio_analysis.

codevalidated

Categorize `title` Length and Analyze Average `views` by `title_length_category`

Calculate the length of each video title and store it in a new title_length column. Then categorize it as 'Short' (0-30 characters), 'Medium' (31-60 characters), or 'Long' (over 60 characters). Store these categories in a new title_length_category column. The title column contains the name of each video. Compute the average number of views for videos in each title length category, storing the results in title_analysis. Sort the title_analysis results in descending order of average views. This analysis will help determine if the length of a video's title has any correlation with its viewership.

codevalidated

Analyze Average `views` by `description` Word Count Quartiles

Count the words in each video's description and store the count in a new description_word_count column. Before counting words, replace any NaN values in the description column with an empty string. Categorize the word counts into four equal-sized groups: Brief, Moderate, Detailed, and Very Detailed. Store these categories in a new description_detail column. The description column contains the text description for each video. Then, calculate the average views for videos in each description length category, storing the results in description_analysis. Sort the description_analysis results in descending order of average views. This analysis will reveal if more detailed descriptions correlate with higher view counts.

codevalidated

Map `category_id` to Descriptive Names and Merge with Main Dataset

Create a DataFrame category_map with columns category_id and category_name that links category IDs to their corresponding names. The category names should be: 'Film & Animation', 'Autos & Vehicles', 'Music', 'Pets & Animals', 'Sports', 'Gaming', 'People & Blogs', 'Comedy', 'Entertainment', 'News & Politics', 'Howto & Style', 'Education', 'Science & Technology', and 'Nonprofits & Activism'. Perform a left merge of this mapping with the main dataset using the category_id column as the key. Store the result in df_with_categories. This process adds a new category_name column to the dataset, providing descriptive category names alongside the numeric category IDs.

Use the below mapping for category_id to category_name

category_mapping = {
    'category_id': [1, 2, 10, 15, 17, 20, 22, 23, 24, 25, 26, 27, 28, 29],
    'category_name': ['Film & Animation', 'Autos & Vehicles', 'Music', 'Pets & Animals', 'Sports', 'Gaming', 'People & Blogs', 'Comedy', 'Entertainment', 'News & Politics', 'Howto & Style', 'Education', 'Science & Technology', 'Nonprofits & Activism']
}
codevalidated

Remove Duplicate Videos Based on `video_id`

No video should appear twice! Remove any duplicate rows from the dataset, keeping only the first occurrence of each unique video. The video_id column contains a unique identifier for each video.

codevalidated

Enhancing Dataset with Sample Channel Metadata: A Merge Operation on `channel_title`

Set a random seed of 42 for reproducibility. Create a sample DataFrame channel_metadata with extra information for the first 10 unique channels, including the columns channel_title, subscriber_count, and channel_age_years. Generate random subscriber counts between 100,000 and 10,000,000, and random channel ages between 1 and 15 years. Perform a left merge of this data with the main dataset based on the channel_title column, which represents the name of each YouTube channel. Store the result in df_enriched. This process demonstrates how to combine data from multiple sources, simulating the enrichment of our dataset with external channel information to gain more comprehensive insights into video performance in relation to channel characteristics.

codevalidated

Split Dataset by `trending_date`, Concatenate Groups, and Reset Index

Split the dataset by trending_date, storing groups in a list. Concatenate these groups using unique trending_date values as keys. Reset the index to make trending_date a regular column again. Store the result in df_concat. The trending_date column shows when a video trended on YouTube. You'll have this date in two columns - keep both for now.

codevalidated

Merge Calculated Engagement Rate with Main Dataset on `video_id`

Create a new DataFrame performance_data with columns video_id and engagement_rate (calculated as (likes + dislikes + comment_count) / views * 100). Then, left merge this data with the main dataset using the video_id column as the key. Store the result in df_with_performance.

codevalidated

Aggregate Channel Performance Metrics by Cleaned `channel_title`

Clean the channel_title column by creating a new column clean_channel_title with extra spaces removed and text converted to lowercase. Then, group the data by these cleaned channel names and calculate: the number of videos, average views, average likes, and average dislikes. Store the results in a new DataFrame called channel_stats. Reset the index after grouping and rename columns to channel, video_count, avg_views, avg_likes, and avg_dislikes. This process creates a summary of channel performance for comparison.

codevalidated

Classify Video Titles as 'Positive', 'Negative', or 'Neutral' Based on Sentiment

Create a function to analyze sentiment in titles. Define lists of positive words (amazing, awesome, best, great, love, perfect, top) and negative words (worst, terrible, bad, awful, hate, disappointing). Extract words from lowercase titles using regex. Check if any positive or negative words are present. Assign sentiment based on these conditions: 'Neutral' if both positive and negative words exist or neither exist, 'Positive' if only positive words are found, and 'Negative' if only negative words are present. Apply this function to the title column to create a new title_sentiment column.

codevalidated

Log-Transform Numerical Columns for Normalized Analysis

Select the numerical columns views, likes, dislikes, and comment_count. Create a new DataFrame called df_log by applying a logarithmic transformation (log(x+1)) to these columns. This transformation helps to normalize the distribution of these metrics, which often have a wide range of values. The resulting df_log DataFrame will allow for easier visualization and comparison of these metrics across different scales.

Pandas Data Wrangling Practices Using YouTube Trending Videos DatasetPandas Data Wrangling Practices Using YouTube Trending Videos Dataset
Project Created by

Dhrubaraj Roy

Project Author at DataWars, responsible for leading the development and delivery of innovative machine learning and data science projects.

Project Author at DataWars, responsible for leading the development and delivery of innovative machine learning and data science projects.

This project is part of

Data Wrangling with Pandas

Explore other projects