Football Data Analysis: Merging and Joining with Pandas

codevalidated

Create a unique list of teams that have played in the `Premier League`.

Create a list of unique team names that have played in the Premier League. The list should be stored in prem_teams and be sorted lexicographically in ascending order. It'll look something like:

prem_teams = [
 'Arsenal',
 'Aston Villa',
 ...
 'West Ham',
 'Wolverhampton Wanderers'
]

codevalidated

Calculate the total number of goals scored by each team in the `2020` season.

Merge the teamstats_df and teams_df datasets using the teamID column. Next, filter the merged dataset to only include games from the 2020 season. Once you've done this, group the filtered dataset by team name and calculate the sum of the goals for each team. The resulting data showing team names with their respective total goals scored is stored in the total_goals variable.

The result should be like this series:

Act2-Answer

input

Which team has the highest number of goals in 2020 season?

Solve the previous activity to guide you to this activity's answer.

codevalidated

Which top 10 players have the most assists in `2017` season?

Merge the games_df, appearances_df, and players_df datasets using the gameID and playerID columns. Ensure that you only include games from the 2017 season. Once filtered, group the dataset by player name, then calculate the total assists for each player by summing the assists column. Save the series of player names and total assists to the top_assists_2017 variable.

The result should be like below series:

Act4-Answer

The result may contains more players than the image above.

codevalidated

Count the number of appearances in `2017` season for each player besides his number of assists.

Merge the appearances_df and games_df datasets using gameID as the key. Only include the games played in the 2017 season. Next, merge the filtered dataset with the player_top_assists_df dataset using playerID as the key. Group the new merged dataset by player name and count each player's appearances. Merge the created dataframe with the top_assists_2017 dataframe using the player name as the index. Rename the gameID column to appearances_2017. Finally, store the counts of appearances and assists for each player in the 2017 season in a variable called appearances_assists_2017.

The result should be like this dataframe:

Act5-Answer

codevalidated

Calculate the average goals per game for each team in a season.

Merge the teamstats_df and teams_df datasets using the teamID column, employing a many-to-one relationship for the merge. Ensure each team is uniquely represented in the merged dataset. Group this merged dataset by team name and season, and compute the average of the goals column for each group. Following this, reset the index of the resultant dataframe. Finally, store the average goals per game for each team in the season in the average_goals_per_game variable.

The result should be like below dataframe:

Act6-Answer

multiplechoice

Which merging relationship resulted in the last activity (Activity 6) if we put `teamstats_df` in the left?

codevalidated

Count the number of games in each league in `2015` season.

Merge the games_df and leagues_df datasets using the leagueID column, making sure to validate it as m:1. Then, filter the merged dataset to only include games that were played during the 2015 season. Calculate the number of games played in each league and save these results in a new dataframe, named games_per_league. Finally, rename the columns of the games_per_league dataframe as league and games_count.

The result should be like this dataframe:

Act8-Answer

codevalidated

Count the number of shots for each player in each game and its league.

Merge the players_df and shots_df datasets using the playerID column and a left join. This ensures that all players are included in the merged dataset, even if they didn't take any shots. Next, merge this dataframe with the game_league dataframe (which is already a result of merging games_df and leagues_df datasets on leagueID column) on the gameID column. Group the final merged dataframe by player name, game ID, and league name. Then count the number of shots each player took in every league and game. Afterward, reset the index of the resulting dataframe and label the new column as shots. Finally, store the shot count for each player in each game and its respective league in the player_game_shots variable.

The result should be like this dataframe:

Act9-Answer

codevalidated

Count how many of these shots had a `shotResult` of `Goal`.

Merge the players_df, shots_df, and leagues_df datasets using the playerID, shooterID, and leagueID columns, respectively.

Group the merged dataframe by player name, game ID, and league name, then count the number of shots taken by each player in each game and league. Self-merge this resulting dataframe on the player name, game ID, and league name columns, but filter the second dataframe to only include shots that led to a goal. Fill in any missing values in the goal_shots column with 0. Finally, store the number of shots that led to a goal for each player in each game and league in the player_game_goal_shots variable.

The result should be like this dataframe:

Act10-Answer

codevalidated

Count the number of goals for each team in `2017` season and add them besides each team.

Filter the teamstats_df dataframe to only include rows where the season column is equal to 2017. Group the filtered dataframe by teamID and calculate the sum of the goals column for each team. Merge the resulting dataframe with the teams_df dataframe on the teamID column. Use a left join to ensure that all teams are included in the merged dataframe, even if they didn't score any goals in the 2017 season. Replace any missing values in the goals column with zero. Rename the goals column to goals_2017. Store the result, which displays the number of goals for each team in the 2017 season, in the teams_goals_2017_df variable.

The result should be like this dataframe:

Act11-Answer

codevalidated

Count the number of MissedShots for each player across all seasons and add them besides each player.

Filter the shots_df dataframe to include only rows where the shotResult column is MissedShots. Group the filtered dataframe by playerID and count the number of rows within each group. Then, merge the resulting dataframe with the players_df dataframe on the playerID column. Use a left join to ensure that every player is included in the merged dataframe, even if they have not missed any shots. Replace any missing values in the shotResult column with 0. Afterward, rename the shotResult column as shotsMissed. Finally, store the number of each player's missed shots for all seasons in the players_missed variable.

The result should be like this dataframe:

Act12-Answer

input

Name of the player who had the most redcards in `2020` season.

Merge the appearances_df and games_df datasets on the gameID column. Afterward, filter the combined dataset to only include games from the 2020 season. Group the filtered dataframe by playerID and calculate the number of red cards each player has received. Following that, merge this dataset with the players_df dataset using the playerID column. Finally, search for the row in the merged dataset that has the highest value in the redCard column.

input

Name of the player who had the most shots with LeftFoot in `2019` season

Filter the shots_df dataframe to only include shots taken with the left foot. Merge this filtered dataframe with the games_df dataframe, using the gameID column for the merge. Then, filter the merged dataframe to display games played only in the 2019 season. Merge this new filtered dataframe with the players_df dataframe using the shooterID column. Find the row in the merged dataframe with the name column's highest value. Finally, extract the player's name from that particular row.

codevalidated

Select the player in each season who played the most time in his game appearances with his team.

Merge the appearances_df and games_df datasets using the gameID column. Further, merge this newly combined dataframe with the players_df dataframe on the playerID column. Group the final merged dataframe by season and name, then calculate the sum of the time column for each group. Afterward, reset the index of this grouped dataframe.

Identify the rows with the highest time value for each season in the grouped dataframe. From these rows, extract the player's name.

Store this final dataframe in the players_most_appeared variable.

The result should be like below dataframe:

Act15-Answer

Anurag Verma

Project Activities

Create a unique list of teams that have played in the `Premier League`.

Calculate the total number of goals scored by each team in the `2020` season.

Which team has the highest number of goals in 2020 season?

Which top 10 players have the most assists in `2017` season?

Count the number of appearances in `2017` season for each player besides his number of assists.

Calculate the average goals per game for each team in a season.

Which merging relationship resulted in the last activity (Activity 6) if we put `teamstats_df` in the left?

Count the number of games in each league in `2015` season.

Count the number of shots for each player in each game and its league.

Count how many of these shots had a `shotResult` of `Goal`.

Count the number of goals for each team in `2017` season and add them besides each team.

Count the number of MissedShots for each player across all seasons and add them besides each player.

Name of the player who had the most redcards in `2020` season.

Name of the player who had the most shots with LeftFoot in `2019` season

Select the player in each season who played the most time in his game appearances with his team.

Anurag Verma

Data Wrangling with Pandas

Set Operations using Sakila

LIKE Operator using World

Membership and Range Operators with World Database