All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.
All our activities include solutions with explanations on how they work and why we chose them.
Merge s2017_df
and players_df
using a left outer join, that means, we want to have all the stats information, but if there are missing values that can't be matched from players_df
, we want to set those season values as null.
Store the results from the merge in the variable df
.
As we performed a left outer join, if some values in from s2017_df
couldn't be matched in players_df
, the result will be null values (also referred as "misses" or "mismatches"). Are there any?
How many misses were in the resulting frame?
Extract the names of the players from s2017_df
that couldn't be matched with players_df
in a list (it must be a list
), under the variable name player_misses
.
Now it's time to do some detective work...
We can guarantee that the players missing in players_df
do exist, but they just have different names. Now, you must find the players discrepancies and update the names in the players_df
dataframe.
For example, if in s2017_df
the player's name was "Michael J. Jordan"
, and in players_df
it was just "Michael Jordan"
, the task is to modify players_df
to make it now "Michael J. Jordan"
.
Important: Modify the players_df
dataframe in place! If you "break" something, you can always read the data again.
Now that you've fixed the data in players_df
, perform the merge between s2017_df
and players_df
again. Should be the same merge as before, left outer. Store the result in df
.
We won't use some columns in our follow up analysis, so we can drop them to simplify the understanding of the data. Drop from df
the following columns:
columns_to_drop = [
"Year",
"PER",
"TS%",
"3PAr",
"FTr",
"USG%",
"blanl",
"OWS",
"DWS",
"WS",
"WS/48",
"blank2",
"OBPM",
"DBPM",
"BPM",
"VORP",
"FG%",
"3P%",
"eFG%",
"FT%",
"name",
]
Important: you must modify df
in place, removing the columns directly in the same dataframe. If you think you've made a mistake, re-read the data and perform the joins again.
The Tm
column contains an acronym of the team. For example, GSW
for Golden State Warriors. Create a new column Team
with the full name of the team. In the associated notebook, you can find a mapping to help you in the process.
The column birth_date
is a string in the format Month Day, Year
(August 1, 1993
).
Convert the column to a datetime object.
Finally, if you explore the dataset, you'll notice that there's a team TOT
. In reality, that team doesn't exist, and it's just an aggregation for players that have switched teams in the season.
Your task is to delete all the rows that have TOT
in the column Tm
. Perform the modification in the df
dataframe, in place.
Count the number of players registered in each team and answer which team has the most players.
What's the team with the lowest sum of field goals (FG
)?
FG%
is defined as FG / FGA
, that is, total field goals divided by the number of attempts. What team has the best FG%
in the league? Enter the full team name below (example, Dallas Mavericks
).
It is known that Shooting Guards (SG) are the best 3P throwers (by efficacy). The question is, what's the difference (in accuracy / efficacy) with the worst 3P throwers, always considering by position?
Note: use the position from the Pos
column.
Create a new dataframe containing the best scorers per team (by PTS
, total points scored). The resulting dataframe should contain the columns Player
, Team
, Pos
and PTS
, and should be stored in the variable best_scorers_per_team
. It should be sorted by PTS
in descending mode.
Here's a preview of the expected result:
Calculate the average player age per team and answer which team has the "youngest squad"?