All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.
All our activities include solutions with explanations on how they work and why we chose them.
Select the columns that you have identified having null/missing values. We encourage you to use the missingno
library.
This is a 3-part activity:
Rating
(if any). Just set them as NaN
.Rating
column using the mean()
Perform the modifications "in place", modifying df
. If you make a mistake, re-load the data.
You'll notice that some columns from this dataframe which should be numeric, were parsed as object
(string). That's because sometimes the numbers are expressed with M
, or k
to indicate Mega
or kilo
.
Clean the Review
column by transforming the values to the correct numeric representation. For example, 5M
should be 5000000
.
Count the number of duplicated rows. That is, if the app Twitter
appears 2 times, that counts as 2.
Now that the Reviews
column is numeric, we can use it to clean duplicated apps.
Drop duplicated apps, keeping just one copy of each, the one with the greatest number of reviews.
Hint: you'll need to sort the dataframe by App
and Reviews
, and that will change the order of your df
.
Categories are all uppercase and words are separated using underscores. Instead, we want them with capitalized in the first character and the underscores transformed as whitespaces.
Example, the category AUTO_AND_VEHICLES
should be transformed to: Auto and vehicles
Clean and transform Installs
as a numeric type. Some values in Installs
will have a +
modifier. Just remove the string and honor the original number (for example +2,500
or 2,500+
should be transformed to the number 2500
).
The Size
column is of type object. Some values contain either a M
or a k
that indicate Kilobytes (1024 bytes) or Megabytes (1024 kb).
These values should be transformed to their corresponding value in bytes. For example, 898k
will become 919552
(898 * 1024
).
Some other values are completely invalid (there's no way to infer the numeric type from them). For these, just replace the value for 0.
Some other rules are related to +
modifiers, apply the same rules as the previous task.
Values of the Price
column are strings representing price with special symbol '$'.
Now that you have cleaned the Price
column, let's create another auxiliary Distribution
column.
This column should contain Free
/Paid
values depending on the app's price.
What company has the greatest number of reviews?
Find the most expensive app in the Game
category and enter its name:
What app (from the Finance
category) has the most installs?
What app from the Game
category and catalogued as Teen
in Content Rating
has the greatest number of reviews?
What free app (ie. price == 0
) from the Game
category has the greatest number of reviews?
This app produced the greatest amount of bytes transfer. Enter your answer in Terabytes as a whole number (rounding down to the nearest integer). Example, if you find the total transfer to be 780.9581
TB, just enter 780
.