Capstone Project: Cleaning Google Playstore data

Unlock the hidden potential of the Google Play Store dataset! Join our project to clean, refine, and enhance this treasure trove of mobile app information. From handling missing data to tackling outliers, we're on a mission to ensure you get the most accurate insights for smarter decisions. Dive in now!
Project Created by

Matias Caputti

Project Activities

All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.

All our activities include solutions with explanations on how they work and why we chose them.

multiplechoice

Which of the following column(s) has/have null values?

Select the columns that you have identified as having null/missing values. We encourage you to use the missingno library.
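As a minimal sketch of how missing values can be located, the snippet below counts nulls per column with plain pandas. The small DataFrame named `sample` is made-up data standing in for the real dataset, and the column names are only illustrative; in the project you would run the same calls on `df`.

```python
import numpy as np
import pandas as pd

# Made-up sample rows standing in for the real dataset
sample = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Rating": [4.1, np.nan, 3.9],
    "Type": ["Free", "Paid", None],
})

# Number of missing values in each column
null_counts = sample.isna().sum()

# Names of the columns that contain at least one missing value
columns_with_nulls = null_counts[null_counts > 0].index.tolist()
print(columns_with_nulls)
```

The missingno library offers a visual view of the same information, which can be easier to scan on a wide dataframe.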

codevalidated

Clean the `Rating` column and the other columns containing null values

This is a 3-part activity:

  • Remove the invalid values from Rating (if any). Just set them as NaN.
  • Fill the null values in the Rating column using the mean().
  • Clean the other (non-numerical) columns that contain null values by dropping the affected rows.

Perform the modifications "in place", modifying df. If you make a mistake, re-load the data.
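The three steps above can be sketched as follows. This is only an illustration on made-up data: the DataFrame `sample` and the rule that a rating above 5 is invalid are assumptions, so check the real column before settling on a threshold.

```python
import numpy as np
import pandas as pd

# Made-up sample rows; 19.0 plays the role of an invalid rating
sample = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.0, 19.0, np.nan, 2.0],
    "Type": ["Free", "Free", "Paid", None],
})

# Assumption: ratings above 5 are invalid, so they are set to NaN
sample.loc[sample["Rating"] > 5, "Rating"] = np.nan

# Fill missing ratings with the mean of the remaining valid ratings
sample["Rating"] = sample["Rating"].fillna(sample["Rating"].mean())

# Drop the rows that still have nulls in any other column, in place
sample.dropna(inplace=True)
print(sample)
```

The order matters here: invalid values are blanked out before the mean is computed, so they do not distort the fill value.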

codevalidated

Clean the column `Reviews` and make it numeric

You'll notice that some columns in this dataframe that should be numeric were parsed as object (string). That's because some numbers are written with an M or k suffix to indicate mega or kilo.

Clean the Reviews column by transforming the values to the correct numeric representation. For example, 5M should be 5000000.
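One possible way to handle the suffixes is a small helper that multiplies the numeric part by the right factor. The function name `parse_count` is hypothetical, and treating k as 1,000 is an assumption (the text only confirms that M means 1,000,000).

```python
def parse_count(text):
    """Convert strings such as '5M', '1.5k' or '120' to an integer."""
    # Assumption: k means 1,000 and M means 1,000,000
    multipliers = {"k": 1_000, "M": 1_000_000}
    text = str(text).strip()
    suffix = text[-1] if text else ""
    if suffix in multipliers:
        # Multiply the numeric part by the factor for its suffix
        return int(round(float(text[:-1]) * multipliers[suffix]))
    return int(round(float(text)))


print(parse_count("5M"))
```

With pandas, a helper like this could be applied to the whole column, for example with `df["Reviews"].map(parse_count)`.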

input

How many duplicated apps are there?

Count the number of duplicated rows. That is, if the app Twitter appears 2 times, that counts as 2.
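A minimal sketch of this way of counting, assuming duplicates are identified by the App name alone. The DataFrame `sample` is made-up data.

```python
import pandas as pd

# Made-up sample: Twitter appears twice, Chess three times
sample = pd.DataFrame({
    "App": ["Twitter", "Twitter", "Maps", "Chess", "Chess", "Chess"],
    "Reviews": [10, 12, 7, 3, 4, 5],
})

# keep=False marks every occurrence of a repeated app,
# not only the extra copies after the first one
duplicated_rows = sample.duplicated(subset=["App"], keep=False).sum()
print(duplicated_rows)
```

With the default `keep="first"`, the first occurrence of each app would not be counted, which does not match the rule described above.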

codevalidated

Drop duplicated apps keeping only the ones with the greatest number of reviews

Now that the Reviews column is numeric, we can use it to clean duplicated apps. Drop duplicated apps, keeping just one copy of each, the one with the greatest number of reviews.

Hint: you'll need to sort the dataframe by App and Reviews, and that will change the order of your df.
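The hint can be sketched like this on made-up data. It assumes Reviews is already numeric and that apps are matched by the App column only.

```python
import pandas as pd

# Made-up sample with repeated apps and different review counts
sample = pd.DataFrame({
    "App": ["Twitter", "Maps", "Twitter", "Chess", "Chess"],
    "Reviews": [10, 7, 12, 5, 3],
})

# Sort so that, within each app, the row with the most reviews comes last
sample.sort_values(by=["App", "Reviews"], inplace=True)

# Keep only that last row for every app
sample.drop_duplicates(subset=["App"], keep="last", inplace=True)
print(sample)
```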

codevalidated

Format the `Category` column

Categories are all uppercase, with words separated by underscores. Instead, we want only the first character capitalized and the underscores replaced with spaces.

For example, the category AUTO_AND_VEHICLES should be transformed to: Auto and vehicles
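The transformation can be illustrated on a single value with plain string methods. The helper name `format_category` is hypothetical.

```python
def format_category(name):
    """Turn 'AUTO_AND_VEHICLES' into 'Auto and vehicles'."""
    # Replace underscores first, then capitalize only the first character
    return name.replace("_", " ").capitalize()


print(format_category("AUTO_AND_VEHICLES"))
```

On the dataframe, the same idea could be written with the string accessor, for example `df["Category"].str.replace("_", " ").str.capitalize()`.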

codevalidated

Clean and convert the `Installs` column to numeric type

Clean the Installs column and convert it to a numeric type. Some values in Installs have a + modifier. Just remove the modifier and keep the original number (for example, +2,500 or 2,500+ should be transformed to the number 2500).
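A minimal sketch of the cleaning rule for a single value. The helper name `parse_installs` is hypothetical, and it assumes that the comma is only ever a thousands separator.

```python
def parse_installs(text):
    """Convert strings such as '2,500+' or '+2,500' to an integer."""
    # Drop the + modifier and the thousands separators before converting
    cleaned = str(text).replace("+", "").replace(",", "").strip()
    return int(cleaned)


print(parse_installs("2,500+"))
```

A value that is not a number at all would raise a `ValueError` here, which can be a useful signal that the row needs a closer look.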

codevalidated

Clean and convert the `Size` column to numeric (representing bytes)

The Size column is of type object. Some values end in k or M, indicating kilobytes (1024 bytes) or megabytes (1024 kilobytes). These values should be transformed to their corresponding value in bytes. For example, 898k will become 919552 (898 * 1024).

Some other values are completely invalid (there's no way to infer a number from them). For these, just replace the value with 0.

Other values carry + modifiers; apply the same rules as in the previous task.
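The rules above can be combined in one helper, sketched below. The name `parse_size` is hypothetical, and treating a value without a unit as a plain byte count is an assumption.

```python
def parse_size(text):
    """Convert sizes such as '898k' or '19M' to bytes; invalid values become 0."""
    units = {"k": 1024, "M": 1024 * 1024}
    # Same rule as for Installs: drop + modifiers and thousands separators
    cleaned = str(text).replace("+", "").replace(",", "").strip()
    factor = 1  # assumption: no unit means the value is already in bytes
    if cleaned and cleaned[-1] in units:
        factor = units[cleaned[-1]]
        cleaned = cleaned[:-1]
    try:
        return int(round(float(cleaned) * factor))
    except (ValueError, OverflowError):
        # Nothing numeric could be recovered from the string
        return 0


print(parse_size("898k"))
```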

codevalidated

Clean and convert the `Price` column to numeric

Values in the Price column are strings that represent prices with the special symbol '$'. Remove the symbol and convert the values to a numeric type.
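For a single value, the conversion might look like the sketch below. The helper name `parse_price` is hypothetical, and it assumes that '$' is the only non-numeric character in the column.

```python
def parse_price(text):
    """Convert strings such as '$4.99' or '0' to a float."""
    # Strip the currency symbol, then let float() do the conversion
    return float(str(text).replace("$", "").strip())


print(parse_price("$4.99"))
```

On the dataframe, a comparable approach is `df["Price"].str.replace("$", "", regex=False)` followed by `pd.to_numeric`; `regex=False` matters because '$' has a special meaning in regular expressions.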

codevalidated

Paid or free?

Now that you have cleaned the Price column, let's create another auxiliary Distribution column.

This column should contain Free/Paid values depending on the app's price.
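One way to build such a column is `numpy.where`, sketched here on made-up data. It assumes Price is already numeric and that a price of exactly 0 means the app is free.

```python
import numpy as np
import pandas as pd

# Made-up sample with a numeric Price column
sample = pd.DataFrame({"App": ["A", "B", "C"], "Price": [0.0, 4.99, 0.0]})

# Apps with a price of 0 are labelled Free, everything else Paid
sample["Distribution"] = np.where(sample["Price"] == 0, "Free", "Paid")
print(sample)
```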

input

What company has the most reviews?

What company has the greatest number of reviews?

input

Which category has the most uploaded apps?

input

To which category does the most expensive app belong?

input

What's the name of the most expensive game?

Find the most expensive app in the Game category and enter its name:

input

Which is the most popular Finance App?

What app (from the Finance category) has the most installs?

input

What *Teen* Game has the most reviews?

What app from the Game category and catalogued as Teen in Content Rating has the greatest number of reviews?

input

What free game has the most reviews?

What free app (i.e. price == 0) from the Game category has the greatest number of reviews?
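Questions of this kind follow a filter-then-maximum pattern. The sketch below shows it on made-up data; the app names and the exact category label "Game" are placeholders, and the real labels depend on how you cleaned the Category column.

```python
import pandas as pd

# Made-up sample; names and values are placeholders
sample = pd.DataFrame({
    "App": ["Game A", "Game B", "Game C", "Bank D"],
    "Category": ["Game", "Game", "Game", "Finance"],
    "Price": [0.0, 2.99, 0.0, 0.0],
    "Reviews": [150, 900, 400, 1000],
})

# Keep only the free apps in the Game category
free_games = sample[(sample["Category"] == "Game") & (sample["Price"] == 0)]

# idxmax returns the index label of the row with the most reviews
top_free_game = free_games.loc[free_games["Reviews"].idxmax(), "App"]
print(top_free_game)
```

The same pattern works for the other questions by swapping the filter condition and the column passed to `idxmax`.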

input

How many TB (terabytes) were transferred (overall) for the most popular Lifestyle app?

This is the app that produced the greatest amount of data transfer. Enter your answer in terabytes as a whole number, rounding down to the nearest integer. For example, if you find the total transfer to be 780.9581 TB, just enter 780.
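The arithmetic can be sketched as follows, under two assumptions that the text does not spell out: total transfer is taken as size in bytes times number of installs, and 1 TB is taken as 1024**4 bytes, in line with the 1024-based units used for the Size column. The function name `terabytes_transferred` is hypothetical.

```python
def terabytes_transferred(size_bytes, installs):
    """Whole terabytes moved if every install downloaded the app once."""
    # Assumption: total transfer = size in bytes * number of installs
    total_bytes = size_bytes * installs
    # Assumption: 1 TB = 1024**4 bytes; floor division rounds down
    return total_bytes // 1024**4


# A 10 MB app installed one million times
print(terabytes_transferred(10 * 1024**2, 1_000_000))
```

Floor division on integers avoids floating-point rounding, so the result is already the whole number the answer asks for.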

This project is part of

Data Cleaning with Pandas