All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.
All our activities include solutions with explanations on how they work and why we chose them.
Create a list of unique team names that have played in the Premier League. The list should be stored in prem_teams
and be sorted lexicographically in ascending order. It'll look something like:
prem_teams = [
'Arsenal',
'Aston Villa',
...
'West Ham',
'Wolverhampton Wanderers'
]
Merge the teamstats_df
and teams_df
datasets using the teamID
column. Next, filter the merged dataset to only include games from the 2020 season. Once you've done this, group the filtered dataset by team name and calculate the sum of the goals for each team. The resulting data showing team names with their respective total goals scored is stored in the total_goals
variable.
The result should be like this series:
Solve the previous activity to guide you to this activity's answer.
Merge the games_df
, appearances_df
, and players_df
datasets using the gameID
and playerID
columns. Ensure that you only include games from the 2017 season. Once filtered, group the dataset by player name, then calculate the total assists for each player by summing the assists
column. Save the series of player names and total assists to the top_assists_2017
variable.
The result should be like below series:
The result may contains more players than the image above.
Merge the appearances_df
and games_df
datasets using gameID
as the key. Only include the games played in the 2017 season. Next, merge the filtered dataset with the player_top_assists_df
dataset using playerID
as the key. Group the new merged dataset by player name and count each player's appearances. Merge the created dataframe with the top_assists_2017
dataframe using the player name as the index. Rename the gameID
column to appearances_2017
. Finally, store the counts of appearances and assists for each player in the 2017 season in a variable called appearances_assists_2017
.
The result should be like this dataframe:
Merge the teamstats_df
and teams_df
datasets using the teamID
column, employing a many-to-one relationship for the merge. Ensure each team is uniquely represented in the merged dataset. Group this merged dataset by team name and season, and compute the average of the goals column for each group. Following this, reset the index of the resultant dataframe. Finally, store the average goals per game for each team in the season in the average_goals_per_game
variable.
The result should be like below dataframe:
Merge the games_df
and leagues_df
datasets using the leagueID
column, making sure to validate it as m:1
. Then, filter the merged dataset to only include games that were played during the 2015 season. Calculate the number of games played in each league and save these results in a new dataframe, named games_per_league
. Finally, rename the columns of the games_per_league
dataframe as league
and games_count
.
The result should be like this dataframe:
Merge the players_df
and shots_df
datasets using the playerID
column and a left join
. This ensures that all players are included in the merged dataset, even if they didn't take any shots. Next, merge this dataframe with the game_league
dataframe (which is already a result of merging games_df
and leagues_df
datasets on leagueID
column) on the gameID
column. Group the final merged dataframe by player name, game ID, and league name. Then count the number of shots each player took in every league and game. Afterward, reset the index of the resulting dataframe and label the new column as shots
. Finally, store the shot count for each player in each game and its respective league in the player_game_shots
variable.
The result should be like this dataframe:
Merge the players_df
, shots_df
, and leagues_df
datasets using the playerID
, shooterID
, and leagueID
columns, respectively.
Group the merged dataframe by player name, game ID, and league name, then count the number of shots taken by each player in each game and league. Self-merge this resulting dataframe on the player name, game ID, and league name columns, but filter the second dataframe to only include shots that led to a goal. Fill in any missing values in the goal_shots
column with 0. Finally, store the number of shots that led to a goal for each player in each game and league in the player_game_goal_shots
variable.
The result should be like this dataframe:
Filter the teamstats_df
dataframe to only include rows where the season
column is equal to 2017. Group the filtered dataframe by teamID
and calculate the sum of the goals
column for each team. Merge the resulting dataframe with the teams_df
dataframe on the teamID
column. Use a left
join to ensure that all teams are included in the merged dataframe, even if they didn't score any goals in the 2017 season. Replace any missing values in the goals
column with zero. Rename the goals
column to goals_2017
. Store the result, which displays the number of goals for each team in the 2017 season, in the teams_goals_2017_df
variable.
The result should be like this dataframe:
Filter the shots_df
dataframe to include only rows where the shotResult
column is MissedShots
. Group the filtered dataframe by playerID
and count the number of rows within each group. Then, merge the resulting dataframe with the players_df
dataframe on the playerID
column. Use a left
join to ensure that every player is included in the merged dataframe, even if they have not missed any shots. Replace any missing values in the shotResult
column with 0. Afterward, rename the shotResult
column as shotsMissed
. Finally, store the number of each player's missed shots for all seasons in the players_missed
variable.
The result should be like this dataframe:
Merge the appearances_df
and games_df
datasets on the gameID
column. Afterward, filter the combined dataset to only include games from the 2020 season. Group the filtered dataframe by playerID
and calculate the number of red cards each player has received. Following that, merge this dataset with the players_df
dataset using the playerID
column. Finally, search for the row in the merged dataset that has the highest value in the redCard
column.
Filter the shots_df
dataframe to only include shots taken with the left foot. Merge this filtered dataframe with the games_df
dataframe, using the gameID
column for the merge. Then, filter the merged dataframe to display games played only in the 2019 season. Merge this new filtered dataframe with the players_df
dataframe using the shooterID
column. Find the row in the merged dataframe with the name column's highest value. Finally, extract the player's name from that particular row.
Merge the appearances_df
and games_df
datasets using the gameID
column. Further, merge this newly combined dataframe with the players_df
dataframe on the playerID
column. Group the final merged dataframe by season
and name
, then calculate the sum of the time
column for each group. Afterward, reset the index of this grouped dataframe.
Identify the rows with the highest time
value for each season
in the grouped dataframe. From these rows, extract the player's name.
Store this final dataframe in the players_most_appeared
variable.
The result should be like below dataframe: