All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.
All our activities include solutions with explanations on how they work and why we chose them.
The first step is to identify null values in our dataframe. Let's start with the name column. Count the number of null values in the name column and answer: how many np.nan values are there?
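A minimal sketch of the count, assuming a dataframe df with name, quantity and year columns. The toy rows below are made up for illustration and are not the project data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the project dataframe (values are made up).
df = pd.DataFrame({
    "name": ["Maria", np.nan, "Juan", np.nan, "Carlos"],
    "quantity": [10, 4, 7, 2, 5],
    "year": [1930, 1930, 1931, 1931, 1932],
})

# isna() flags each missing entry; summing the booleans counts them.
null_names = int(df["name"].isna().sum())
print(null_names)
```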
Let's now clean the dataframe by removing any rows that have null values in any column. Perform the cleaning task in place, that is, by modifying the original df variable.
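One way to do this is dropna with inplace=True, sketched below on made-up rows. The column names are assumptions about the project dataframe.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the project dataframe (values are made up).
df = pd.DataFrame({
    "name": ["Maria", np.nan, "Juan", "Ana"],
    "quantity": [10.0, 4.0, np.nan, 3.0],
    "year": [1930, 1930, 1931, 1931],
})

# inplace=True mutates df itself and returns None, so do not reassign the result.
df.dropna(inplace=True)

rows_left = len(df)
print(rows_left)
```

Note that writing df = df.dropna(inplace=True) would replace df with None, which is a common slip.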
If there are multiple names with the same quantity, enter the name with the highest index value.
Input answer as an integer, without any commas or dots.
The growth rate measures the change from one period to another. In this case, we want to see the total change in the number of babies between 1930 and 1990. Select the option that best matches the growth rate. Keep in mind that these are all approximate figures.
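A sketch of the calculation, assuming the usual definition of growth rate, (final - initial) / initial, applied to the yearly totals. The figures below are made up and the column names are assumptions.

```python
import pandas as pd

# Toy stand-in for the project dataframe (values are made up).
df = pd.DataFrame({
    "name": ["Maria", "Juan", "Maria", "Juan"],
    "quantity": [100, 100, 350, 250],
    "year": [1930, 1930, 1990, 1990],
})

# Total babies per year, then the relative change between the two years.
totals = df.groupby("year")["quantity"].sum()
growth_rate = (totals.loc[1990] - totals.loc[1930]) / totals.loc[1930]

print(f"{growth_rate:.0%}")
```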
Create a plot showing the total babies born per year. Use the fig and ax variables already defined, and create your plot on the axis ax. The plot must have the title "Number of babies born per year" and the y-axis should be formatted using , as the thousands separator.
Your plot must match perfectly the figure that you see below:
Note: Plot activity checks are performed on a pixel-by-pixel basis, so your plot has to match perfectly what you see in the image above, including the values of the axis, labels, titles, etc.
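A rough sketch of such a plot on made-up data. The figure size, line style and column names are assumptions, so this is not guaranteed to reproduce the reference image pixel for pixel.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.ticker import StrMethodFormatter

# Toy stand-in for the project dataframe (values are made up).
df = pd.DataFrame({
    "name": ["Maria", "Juan", "Maria", "Juan", "Maria", "Juan"],
    "quantity": [12000, 9000, 15000, 11000, 18000, 14000],
    "year": [1930, 1930, 1931, 1931, 1932, 1932],
})

# In the activity fig and ax already exist; they are created here for completeness.
fig, ax = plt.subplots(figsize=(10, 5))

babies_per_year = df.groupby("year")["quantity"].sum()
babies_per_year.plot(ax=ax)

ax.set_title("Number of babies born per year")
# "{x:,.0f}" renders tick values such as 21000 as "21,000".
ax.yaxis.set_major_formatter(StrMethodFormatter("{x:,.0f}"))
```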
We want to analyze how creative parents were with newborn names across the years. To do so, we'll compare the number of unique names in each year to the total number of babies born. Uniqueness is then defined as: Total Unique Names / Total Newborns. For example, given the following baby names in a year:
John
Jane
Jane
Mary
We get a "uniqueness" score of 0.75 (3 unique names / 4 total babies).
Store your results in a new dataframe named unique_names_df. The dataframe should contain the columns Total Unique Names, Total Newborns and Uniqueness. It should be indexed by year, in ascending order.
It should look something like this:
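A possible way to build it, assuming each row of df holds one name for one year together with its quantity. The toy 1950 rows mirror the John/Jane/Jane/Mary example (Jane has quantity 2); the 1951 rows are made up.

```python
import pandas as pd

# Toy stand-in for the project dataframe.
df = pd.DataFrame({
    "name": ["John", "Jane", "Mary", "John", "Jane"],
    "quantity": [1, 2, 1, 3, 1],
    "year": [1950, 1950, 1950, 1951, 1951],
})

# Named aggregation; the ** unpacking allows column names that contain spaces.
unique_names_df = df.groupby("year").agg(**{
    "Total Unique Names": ("name", "nunique"),
    "Total Newborns": ("quantity", "sum"),
}).sort_index()

unique_names_df["Uniqueness"] = (
    unique_names_df["Total Unique Names"] / unique_names_df["Total Newborns"]
)
print(unique_names_df)
```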
Using the dataframe created before, which year has the greatest variety of names, that is, the highest uniqueness score?
Similar to the previous activity, now answer: which was the year with the least uniqueness?
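Both questions reduce to finding the index label of an extreme value. A small sketch with made-up scores, assuming unique_names_df is indexed by year:

```python
import pandas as pd

# Toy stand-in for unique_names_df (values are made up).
unique_names_df = pd.DataFrame(
    {"Total Unique Names": [3, 2, 4], "Total Newborns": [4, 4, 5]},
    index=pd.Index([1950, 1951, 1952], name="year"),
)
unique_names_df["Uniqueness"] = (
    unique_names_df["Total Unique Names"] / unique_names_df["Total Newborns"]
)

# idxmax / idxmin return the index label (the year), not the score itself.
most_unique_year = unique_names_df["Uniqueness"].idxmax()
least_unique_year = unique_names_df["Uniqueness"].idxmin()
print(most_unique_year, least_unique_year)
```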
Using the dataframe unique_names_df, create a plot displaying the uniqueness of names across the years.
The title of your chart should be "Baby name uniqueness across the years" and it should contain the legend "Uniqueness of names" for the single series plotted. It should look like this:
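A sketch of the chart on made-up scores. Figure size and styling are assumptions and may differ from the reference image.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for unique_names_df (values are made up).
unique_names_df = pd.DataFrame(
    {"Uniqueness": [0.75, 0.5, 0.8]},
    index=pd.Index([1950, 1951, 1952], name="year"),
)

fig, ax = plt.subplots()
# The label passed here is the text that ax.legend() displays.
ax.plot(unique_names_df.index, unique_names_df["Uniqueness"],
        label="Uniqueness of names")
ax.set_title("Baby name uniqueness across the years")
ax.legend()
```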
Juan Carlos, Jose Carlos, Giancarlos, or just plain Carlos...
Carlos is a very popular name in Spanish-speaking countries.
So, answer the following: how many people were named "Carlos" throughout history?
Warning! The following are all valid "Carlos", so be mindful about casing: Juan Carlos, Carlos, Giancarlos.
Input your answer as an integer, without any commas or dots.
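A case-insensitive substring match covers all three variants. The sketch below uses made-up rows and assumes the name and quantity columns.

```python
import pandas as pd

# Toy stand-in for the project dataframe (values are made up).
df = pd.DataFrame({
    "name": ["Juan Carlos", "Carlos", "Giancarlos", "Carla", "JOSE CARLOS"],
    "quantity": [10, 20, 5, 7, 3],
    "year": [1990, 1990, 1991, 1991, 1992],
})

# case=False ignores casing; na=False treats missing names as non-matches.
is_carlos = df["name"].str.contains("carlos", case=False, regex=False, na=False)

# Sum the quantities, since each row may stand for many babies.
total_carlos = int(df.loc[is_carlos, "quantity"].sum())
print(total_carlos)
```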
Is it Carlo Alberto, Roberto Carlos, or just plain Carlos?
What's the most popular name containing carlos?
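One way to answer this is to filter the matching names, total their quantities across all years and take the largest. The rows below are made up, and matching is assumed to be case-insensitive as in the previous activity.

```python
import pandas as pd

# Toy stand-in for the project dataframe (values are made up).
df = pd.DataFrame({
    "name": ["Juan Carlos", "Juan Carlos", "Carlos", "Roberto Carlos", "Carlo Alberto"],
    "quantity": [10, 15, 20, 5, 50],
    "year": [1990, 1991, 1990, 1991, 1991],
})

is_carlos = df["name"].str.contains("carlos", case=False, regex=False, na=False)

# Aggregate across years before comparing names.
totals = df.loc[is_carlos].groupby("name")["quantity"].sum()
most_popular_carlos = totals.idxmax()
print(most_popular_carlos)
```

"Carlo Alberto" is excluded here because it contains "carlo" but not "carlos".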
"Diego Maradona" was a renowned Soccer/Football player from Argentina that played in the 80s/90s. He was an absolute sensation in Argentina. We want to know if he impacted new baby names.
Create a Dataframe containing an aggregation of the total number of babies named "Diego" per year, in any variation: "Diego Martin", "Diego Alejandro", or just "Diego". In this case, we don't want to count any names that contain "diego" (in lowercase), just the names that contain the actual "Diego" name.
Create the aggregation and store the result in the series diegos_per_year_s. It should look something like:
year
1922 22
1923 21
1924 21
1925 33
1926 41
Name: quantity, dtype: int64
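A sketch of the aggregation on made-up rows, assuming the name, quantity and year columns. The match is deliberately case-sensitive so that lowercase "diego" is left out.

```python
import pandas as pd

# Toy stand-in for the project dataframe (values are made up).
df = pd.DataFrame({
    "name": ["Diego", "Diego Martin", "diego", "Diego Alejandro", "Maria", "Diego"],
    "quantity": [30, 5, 2, 8, 50, 40],
    "year": [1986, 1986, 1986, 1987, 1987, 1987],
})

# str.contains is case-sensitive by default, so "diego" does not match.
is_diego = df["name"].str.contains("Diego", regex=False, na=False)

# Selecting the quantity column keeps the result a Series named "quantity".
diegos_per_year_s = df.loc[is_diego].groupby("year")["quantity"].sum()
print(diegos_per_year_s)
```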
Create a plot showing the number of "Diegos" born between 1960 and 2015, including both limits ([1960, 2015]).
The plot should have the title Total 'Diegos' born per year [1960-2015], and it should look something like:
Create a DataFrame containing the most popular name for each year (the one with the highest quantity). Store it in the variable most_popular_per_year_df. It should be sorted by year in ascending order.
Important! The index of the dataframe matters: each row must keep the index of the original row that holds that year's most popular name. It should look like this:
For example, the most popular name of 1999 was Valentina with 3084 occurrences, and in 2015 it was Benjamin with 3695 occurrences.
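A sketch that keeps the original row index by selecting rows through idxmax. The Valentina and Benjamin figures come from the example above; the other rows and the row order are made up, and the column names are assumptions.

```python
import pandas as pd

# Toy stand-in for the project dataframe; Sofia and Mateo rows are made up.
df = pd.DataFrame({
    "name": ["Benjamin", "Sofia", "Valentina", "Mateo"],
    "quantity": [3695, 2900, 3084, 3100],
    "year": [2015, 1999, 1999, 2015],
})

# For each year, idxmax gives the index label of the row with the top quantity.
top_rows = df.groupby("year")["quantity"].idxmax()

# .loc with those labels returns the original rows, original index included.
most_popular_per_year_df = df.loc[top_rows].sort_values("year")
print(most_popular_per_year_df)
```

When two names tie within a year, idxmax returns the first one it encounters, so ties may need separate handling.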
Which name got the "most popular name of the year" the most times?
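Counting how often each name appears in most_popular_per_year_df answers this. A sketch on made-up rows:

```python
import pandas as pd

# Toy stand-in for most_popular_per_year_df (values are made up).
most_popular_per_year_df = pd.DataFrame({
    "name": ["Maria", "Maria", "Juan", "Maria", "Juan"],
    "quantity": [500, 520, 610, 640, 700],
    "year": [1950, 1951, 1952, 1953, 1954],
})

# value_counts tallies the years each name won; idxmax picks the top name.
wins = most_popular_per_year_df["name"].value_counts()
most_frequent_winner = wins.idxmax()
print(most_frequent_winner)
```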