All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.
All our activities include solutions with explanations on how they work and why we chose them.
Merge s2017_df and players_df using a left outer join, that means, we want to have all the stats information, but if there are missing values that can't be matched from players_df, we want to set those season values as null.
Store the results from the merge in the variable df.
As we performed a left outer join, if some values in from s2017_df couldn't be matched in players_df, the result will be null values (also referred as "misses" or "mismatches"). Are there any?
How many misses were in the resulting frame?
Extract the names of the players from s2017_df that couldn't be matched with players_df in a list (it must be a list), under the variable name player_misses.
Now it's time to do some detective work...
We can guarantee that the players missing in players_df do exist, but they just have different names. Now, you must find the players discrepancies and update the names in the players_df dataframe.
For example, if in s2017_df the player's name was "Michael J. Jordan", and in players_df it was just "Michael Jordan", the task is to modify players_df to make it now "Michael J. Jordan".
Important: Modify the players_df dataframe in place! If you "break" something, you can always read the data again.
Now that you've fixed the data in players_df, perform the merge between s2017_df and players_df again. Should be the same merge as before, left outer. Store the result in df.
We won't use some columns in our follow up analysis, so we can drop them to simplify the understanding of the data. Drop from df the following columns:
columns_to_drop = [
"Year",
"PER",
"TS%",
"3PAr",
"FTr",
"USG%",
"blanl",
"OWS",
"DWS",
"WS",
"WS/48",
"blank2",
"OBPM",
"DBPM",
"BPM",
"VORP",
"FG%",
"3P%",
"eFG%",
"FT%",
"name",
]
Important: you must modify df in place, removing the columns directly in the same dataframe. If you think you've made a mistake, re-read the data and perform the joins again.
The Tm column contains an acronym of the team. For example, GSW for Golden State Warriors. Create a new column Team with the full name of the team. In the associated notebook, you can find a mapping to help you in the process.
The column birth_date is a string in the format Month Day, Year (August 1, 1993).
Convert the column to a datetime object.
Finally, if you explore the dataset, you'll notice that there's a team TOT. In reality, that team doesn't exist, and it's just an aggregation for players that have switched teams in the season.
Your task is to delete all the rows that have TOT in the column Tm. Perform the modification in the df dataframe, in place.
Count the number of players registered in each team and answer which team has the most players.
What's the team with the lowest sum of field goals (FG)?
FG% is defined as FG / FGA, that is, total field goals divided by the number of attempts. What team has the best FG% in the league? Enter the full team name below (example, Dallas Mavericks).
It is known that Shooting Guards (SG) are the best 3P throwers (by efficacy). The question is, what's the difference (in accuracy / efficacy) with the worst 3P throwers, always considering by position?
Note: use the position from the Pos column.
Create a new dataframe containing the best scorers per team (by PTS, total points scored). The resulting dataframe should contain the columns Player, Team, Pos and PTS, and should be stored in the variable best_scorers_per_team. It should be sorted by PTS in descending mode.
Here's a preview of the expected result:

Calculate the average player age per team and answer which team has the "youngest squad"?