All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.
All our activities include solutions with explanations on how they work and why we chose them.
Using the pandas DataFrame df, find the video with the highest number of views. Extract its title, channel name, and view count. Store this information in the max_views_video variable as a pandas Series. This analysis will reveal the single most popular video in the dataset, providing insights into what content has achieved the highest level of viewer engagement.
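A minimal sketch of one approach, assuming df uses the views, title, and channel_title column names from this dataset:

import pandas as pd

# idxmax() returns the index label of the row with the most views;
# selecting a single row label with .loc yields a pandas Series.
max_views_video = df.loc[df['views'].idxmax(), ['title', 'channel_title', 'views']]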
Group the videos by their category_id and count how many videos are in each category. Then, calculate the mean number of views for each category. Combine these results into a single DataFrame called category_analysis with columns Video Count and Mean Views. Use pd.concat() to merge the results along axis=1 (columns). This analysis will reveal which categories are most prevalent in the dataset and which tend to attract the most views.
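One way this could look (pandas imported as pd, as above):

# Videos per category and mean views per category, as two Series.
video_count = df.groupby('category_id').size()
mean_views = df.groupby('category_id')['views'].mean()

# Align the two Series side by side, then name the columns.
category_analysis = pd.concat([video_count, mean_views], axis=1)
category_analysis.columns = ['Video Count', 'Mean Views']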
Create a new column likes_ratio by dividing likes by views for each video. Then, group the data by channel_title and calculate the mean and maximum likes ratio for each channel. Store these results in channel_engagement, sorting channels by their mean likes ratio in descending order. This analysis will reveal which channels are most effective at generating likes relative to their view counts, potentially indicating higher audience engagement or content quality.
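A possible sketch:

# Likes-to-views ratio per video.
df['likes_ratio'] = df['likes'] / df['views']

# Per-channel mean and max of the ratio, best-engaged channels first.
channel_engagement = (
    df.groupby('channel_title')['likes_ratio']
      .agg(['mean', 'max'])
      .sort_values('mean', ascending=False)
)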
Calculate the mean number of views for each video category and store it in a new column category_mean_views. Then, compare each video's view count to its category average. Create a new column popularity_tier that labels videos as Above Average if their views exceed their category mean, and Below Average otherwise. This analysis will help identify which videos are outperforming others in their category.
Note: the new column is added at the end of the df DataFrame.
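One way to sketch this, using groupby().transform() to broadcast each category's mean back onto every row:

import numpy as np

df['category_mean_views'] = df.groupby('category_id')['views'].transform('mean')
df['popularity_tier'] = np.where(
    df['views'] > df['category_mean_views'], 'Above Average', 'Below Average'
)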
Convert publish_time to datetime, then extract the hour into a new publish_hour column. Create a cross-tabulation of video categories and publishing hours, storing the result in category_hour_crosstab. Ensure the final cross-tab has no missing values by filling it with 0 and uses integer data type.
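A sketch, assuming publish_time parses with pandas' default datetime inference:

df['publish_time'] = pd.to_datetime(df['publish_time'])
df['publish_hour'] = df['publish_time'].dt.hour

# Cross-tabulate categories against publishing hours; fill any gaps
# with 0 and cast to integers.
category_hour_crosstab = (
    pd.crosstab(df['category_id'], df['publish_hour'])
      .fillna(0)
      .astype(int)
)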
Split the tags column, which contains pipe-separated tags for each video, into a list of individual tags. Store this list in a new column called tags_list in the original DataFrame. Then, create a new DataFrame tags_df by exploding the tags_list column, so each tag gets its own row. Count the occurrences of each unique tag and find the 10 most common ones. Store these in common_tags. This analysis will reveal which tags are most frequently used by popular YouTube videos.
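Roughly:

# Split the pipe-separated tag string into a Python list per row.
df['tags_list'] = df['tags'].str.split('|')

# explode() gives each tag its own row; value_counts() then ranks them.
tags_df = df.explode('tags_list')
common_tags = tags_df['tags_list'].value_counts().head(10)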
Do disabled features affect popularity?
Group videos based on whether comments and ratings are enabled or disabled. Calculate the average views, likes, and dislikes for each group. Store the results in disabled_impact. The comments_disabled and ratings_disabled columns indicate whether these features are turned off for each video. This analysis will reveal how disabling user interactions affects key engagement metrics.
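One possible sketch:

# Average engagement metrics for each combination of the two flags.
disabled_impact = (
    df.groupby(['comments_disabled', 'ratings_disabled'])
      [['views', 'likes', 'dislikes']]
      .mean()
)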
Convert the trending_date column to a datetime format, then extract the day of the week for each trending date. Store this in a new column called trending_day in the original DataFrame. Count how many videos trend on each day of the week. The trending_date column represents when a video appeared on the trending list. Store the results in trending_day_counts. This analysis will reveal if there's a pattern in which days videos are more likely to become trending.
Note: Dates in trending_date are formatted as '%y.%d.%m' (e.g., '23.15.07' for July 15, 2023). Also, use the value_counts() method to count the values; a different method might give different results.
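A sketch following the note above:

df['trending_date'] = pd.to_datetime(df['trending_date'], format='%y.%d.%m')
df['trending_day'] = df['trending_date'].dt.day_name()
trending_day_counts = df['trending_day'].value_counts()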
Divide the range of view counts into five equal-width bins and assign each video to one of these categories: 'Very Low', 'Low', 'Medium', 'High', or 'Very High'. Store the results in a new column called views_category at the end of the df DataFrame. This categorization will help you understand the distribution of video popularity and identify which videos fall into different viewership tiers.
Note: the new column is added at the end of the df DataFrame.
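A sketch; passing an integer to pd.cut produces equal-width bins:

labels = ['Very Low', 'Low', 'Medium', 'High', 'Very High']
df['views_category'] = pd.cut(df['views'], bins=5, labels=labels)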
Divide videos into four equal-sized groups (quartiles) based on their number of likes, labeling them Q1 to Q4 in a new column likes_quartile. Then, calculate the average number of views for each quartile. Store the results in likes_analysis and sort them in descending order. This analysis will reveal if videos with more likes tend to have higher view counts.
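One way, using pd.qcut (quantile-based, so groups are equal-sized rather than equal-width):

df['likes_quartile'] = pd.qcut(df['likes'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
likes_analysis = (
    df.groupby('likes_quartile')['views'].mean().sort_values(ascending=False)
)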
Extract the hour from the publish_time column (the publish_hour column was already created in activity 5) and categorize it into time periods: Night (9PM-4AM), Morning (5AM-11AM), Afternoon (12PM-4PM), and Evening (5PM-8PM). Create a new column publish_time_category in the original DataFrame with these labels. Then, calculate the average views for videos published in each time category, storing the results in time_analysis. Sort the time_analysis results in descending order of average views. This analysis will reveal if certain publishing times tend to result in higher view counts.
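Because Night wraps around midnight (9PM-4AM covers hours 21-23 and 0-4), a small helper function is simpler than pd.cut here; the helper name is illustrative:

def hour_to_period(hour):
    if 5 <= hour <= 11:
        return 'Morning'
    if 12 <= hour <= 16:
        return 'Afternoon'
    if 17 <= hour <= 20:
        return 'Evening'
    return 'Night'  # hours 21-23 and 0-4

df['publish_time_category'] = df['publish_hour'].apply(hour_to_period)
time_analysis = (
    df.groupby('publish_time_category')['views'].mean().sort_values(ascending=False)
)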
Create a function that bins videos into 'Low' (0-100), 'Medium' (101-1000), 'High' (1001-10000), or 'Viral' (>10000) categories based on their comment count. Apply this function to the comment_count column, which represents the number of comments each video received. Store the results in a new comment_category column. Then, calculate the average views for each comment category, storing the results in comment_analysis. Sort the comment_analysis results in descending order of average views. This analysis will reveal if videos with more comments tend to have higher view counts.
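A sketch (the function name is illustrative):

def comment_bin(count):
    if count <= 100:
        return 'Low'
    if count <= 1000:
        return 'Medium'
    if count <= 10000:
        return 'High'
    return 'Viral'

df['comment_category'] = df['comment_count'].apply(comment_bin)
comment_analysis = (
    df.groupby('comment_category')['views'].mean().sort_values(ascending=False)
)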
Using the likes_ratio column created in activity 3, create a new column called likes_ratio_category by dividing it into five equal-sized groups. Label these groups as Very Low, Low, Medium, High, and Very High. Then, calculate the average number of views for each category and sort the results in descending order. Store the final result in a variable called ratio_analysis.
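Equal-sized groups again point to pd.qcut:

labels = ['Very Low', 'Low', 'Medium', 'High', 'Very High']
df['likes_ratio_category'] = pd.qcut(df['likes_ratio'], q=5, labels=labels)
ratio_analysis = (
    df.groupby('likes_ratio_category')['views'].mean().sort_values(ascending=False)
)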
Calculate the length of each video title and store it in a new title_length column. Then categorize it as 'Short' (0-30 characters), 'Medium' (31-60 characters), or 'Long' (over 60 characters). Store these categories in a new title_length_category column. The title column contains the name of each video. Compute the average number of views for videos in each title length category, storing the results in title_analysis. Sort the title_analysis results in descending order of average views. This analysis will help determine if the length of a video's title has any correlation with its viewership.
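A sketch using explicit bin edges with pd.cut:

df['title_length'] = df['title'].str.len()
df['title_length_category'] = pd.cut(
    df['title_length'],
    bins=[0, 30, 60, float('inf')],
    labels=['Short', 'Medium', 'Long'],
    include_lowest=True,
)
title_analysis = (
    df.groupby('title_length_category')['views'].mean().sort_values(ascending=False)
)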
Count the words in each video's description and store the count in a new description_word_count column. Before counting words, replace any NaN values in the description column with an empty string. Categorize the word counts into four equal-sized groups: Brief, Moderate, Detailed, and Very Detailed. Store these categories in a new description_detail column. The description column contains the text description for each video. Then, calculate the average views for videos in each description length category, storing the results in description_analysis. Sort the description_analysis results in descending order of average views. This analysis will reveal if more detailed descriptions correlate with higher view counts.
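One possible sketch; note that pd.qcut can raise an error about duplicate bin edges if many videos share the same word count, so this assumes the counts are varied enough:

df['description'] = df['description'].fillna('')
df['description_word_count'] = df['description'].str.split().str.len()
df['description_detail'] = pd.qcut(
    df['description_word_count'],
    q=4,
    labels=['Brief', 'Moderate', 'Detailed', 'Very Detailed'],
)
description_analysis = (
    df.groupby('description_detail')['views'].mean().sort_values(ascending=False)
)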
Create a DataFrame category_map with columns category_id and category_name that links category IDs to their corresponding names. The category names should be: 'Film & Animation', 'Autos & Vehicles', 'Music', 'Pets & Animals', 'Sports', 'Gaming', 'People & Blogs', 'Comedy', 'Entertainment', 'News & Politics', 'Howto & Style', 'Education', 'Science & Technology', and 'Nonprofits & Activism'. Perform a left merge of this mapping with the main dataset using the category_id column as the key. Store the result in df_with_categories. This process adds a new category_name column to the dataset, providing descriptive category names alongside the numeric category IDs.
Use the mapping below for category_id to category_name:
category_mapping = {
'category_id': [1, 2, 10, 15, 17, 20, 22, 23, 24, 25, 26, 27, 28, 29],
'category_name': ['Film & Animation', 'Autos & Vehicles', 'Music', 'Pets & Animals', 'Sports', 'Gaming', 'People & Blogs', 'Comedy', 'Entertainment', 'News & Politics', 'Howto & Style', 'Education', 'Science & Technology', 'Nonprofits & Activism']
}
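Using that dict, the merge could look like:

category_map = pd.DataFrame(category_mapping)
df_with_categories = df.merge(category_map, on='category_id', how='left')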
No video should appear twice! Remove any duplicate rows from the dataset, keeping only the first occurrence of each unique video. The video_id column contains a unique identifier for each video.
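A one-line sketch:

df = df.drop_duplicates(subset='video_id', keep='first')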
Set a random seed of 42 for reproducibility. Create a sample DataFrame channel_metadata with extra information for the first 10 unique channels, including the columns channel_title, subscriber_count, and channel_age_years. Generate random subscriber counts between 100,000 and 10,000,000, and random channel ages between 1 and 15 years. Perform a left merge of this data with the main dataset based on the channel_title column, which represents the name of each YouTube channel. Store the result in df_enriched. This process demonstrates how to combine data from multiple sources, simulating the enrichment of our dataset with external channel information to gain more comprehensive insights into video performance in relation to channel characteristics.
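A sketch with NumPy's global seed; the exact values generated depend on the order of the random calls, so a reference solution may differ:

import numpy as np

np.random.seed(42)
first_channels = df['channel_title'].unique()[:10]
channel_metadata = pd.DataFrame({
    'channel_title': first_channels,
    'subscriber_count': np.random.randint(100_000, 10_000_001, size=10),
    'channel_age_years': np.random.randint(1, 16, size=10),
})

df_enriched = df.merge(channel_metadata, on='channel_title', how='left')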
Split the dataset by trending_date, storing the groups in a list. Concatenate these groups using the unique trending_date values as keys. Reset the index to make trending_date a regular column again. Store the result in df_concat. The trending_date column shows when a video trended on YouTube. You'll end up with this date in two columns; keep both for now.
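One way to sketch it; iterating a GroupBy yields (key, group) pairs, and the unnamed key level comes back as a level_0 column after reset_index:

keys, groups = zip(*df.groupby('trending_date'))
df_concat = pd.concat(groups, keys=keys)   # dates become the outer index level
df_concat = df_concat.reset_index(level=0) # date now also a regular column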
Create a new DataFrame performance_data with columns video_id and engagement_rate (calculated as (likes + dislikes + comment_count) / views * 100). Then, left merge this data with the main dataset using the video_id column as the key. Store the result in df_with_performance.
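Roughly:

performance_data = pd.DataFrame({
    'video_id': df['video_id'],
    'engagement_rate': (df['likes'] + df['dislikes'] + df['comment_count'])
                       / df['views'] * 100,
})

df_with_performance = df.merge(performance_data, on='video_id', how='left')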
Clean the channel_title column by creating a new column clean_channel_title with extra spaces removed and text converted to lowercase. Then, group the data by these cleaned channel names and calculate: the number of videos, average views, average likes, and average dislikes. Store the results in a new DataFrame called channel_stats. Reset the index after grouping and rename columns to channel, video_count, avg_views, avg_likes, and avg_dislikes. This process creates a summary of channel performance for comparison.
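A sketch using named aggregation so the output columns get their final names directly; collapsing internal runs of whitespace is one reading of "extra spaces removed":

df['clean_channel_title'] = (
    df['channel_title']
      .str.replace(r'\s+', ' ', regex=True)
      .str.strip()
      .str.lower()
)

channel_stats = (
    df.groupby('clean_channel_title')
      .agg(
          video_count=('video_id', 'count'),
          avg_views=('views', 'mean'),
          avg_likes=('likes', 'mean'),
          avg_dislikes=('dislikes', 'mean'),
      )
      .reset_index()
      .rename(columns={'clean_channel_title': 'channel'})
)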
Create a function to analyze sentiment in titles. Define lists of positive words (amazing, awesome, best, great, love, perfect, top) and negative words (worst, terrible, bad, awful, hate, disappointing). Extract words from lowercase titles using regex. Check if any positive or negative words are present. Assign sentiment based on these conditions: 'Neutral' if both positive and negative words exist or neither exist, 'Positive' if only positive words are found, and 'Negative' if only negative words are present. Apply this function to the title column to create a new title_sentiment column.
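One possible sketch (function and constant names are illustrative):

import re

POSITIVE = {'amazing', 'awesome', 'best', 'great', 'love', 'perfect', 'top'}
NEGATIVE = {'worst', 'terrible', 'bad', 'awful', 'hate', 'disappointing'}

def classify_title(title):
    words = set(re.findall(r'\w+', title.lower()))
    has_pos = bool(words & POSITIVE)
    has_neg = bool(words & NEGATIVE)
    if has_pos == has_neg:  # both present, or neither
        return 'Neutral'
    return 'Positive' if has_pos else 'Negative'

df['title_sentiment'] = df['title'].apply(classify_title)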
Select the numerical columns views, likes, dislikes, and comment_count. Create a new DataFrame called df_log by applying a logarithmic transformation (log(x+1)) to these columns. This transformation helps to normalize the distribution of these metrics, which often have a wide range of values. The resulting df_log DataFrame will allow for easier visualization and comparison of these metrics across different scales.
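A sketch; np.log1p computes log(x + 1), which stays defined for zero counts:

import numpy as np

numeric_cols = ['views', 'likes', 'dislikes', 'comment_count']
df_log = np.log1p(df[numeric_cols])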