All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.
All our activities include solutions with explanations on how they work and why we chose them.
Ratings in the Google Play Store fall in the range 0-5. However, by observing the histogram, you will find invalid values that lie outside this range.
Select the invalid values and store the results in the variable df_invalid_ratings.
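A minimal sketch of this selection, assuming the ratings live in a column named Rating (the column name and the sample rows are hypothetical, not taken from the project notebook):

```python
import pandas as pd

# Hypothetical sample data; the real project loads the Google Play dataset.
df = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.5, 19.0, -1.0, 0.0],
})

# Rows whose rating lies outside the valid 0-5 range.
# Missing ratings (NaN) are not selected, because comparisons with NaN are False.
df_invalid_ratings = df[(df["Rating"] < 0) | (df["Rating"] > 5)]
print(df_invalid_ratings)
```

The boundaries 0 and 5 are treated as valid here, which matches the stated range.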
Because an app cannot reasonably have a rating greater than 0 without ever being installed, invalid values are defined as any app with maximum installs of 0 and a rating above 0.
Select the invalid values and store the results in the variable df_invalid_install_ratings.
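One way to combine the two conditions is a boolean mask joined with &. The column names Maximum Installs and Rating below are assumptions, as is the sample data:

```python
import pandas as pd

# Hypothetical sample data with assumed column names.
df = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Maximum Installs": [0, 0, 1500],
    "Rating": [4.2, 0.0, 3.9],
})

# An app is invalid when it was never installed but still has a positive rating.
never_installed = df["Maximum Installs"] == 0
has_rating = df["Rating"] > 0
df_invalid_install_ratings = df[never_installed & has_rating]
print(df_invalid_install_ratings)
```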
As the world population is currently below 9 billion people, invalid values are defined as any install count greater than or equal to 9 billion.
Select the invalid values and store the results in the variable df_invalid_installs.
Take a look at the histogram in the Notebook. Based on it, outliers are defined as any values 3 or more standard deviations below or above the mean.
Identify the outliers and store the results in a new column df_rating['Rating_cleaned'].
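The standard-deviation rule can be sketched as follows. The activity does not say what the cleaned column should hold, so this example assumes that outliers are replaced with NaN; the sample data is hypothetical:

```python
import pandas as pd

# Hypothetical data: nineteen ordinary ratings and one extreme value.
df_rating = pd.DataFrame({"Rating": [4.0] * 19 + [50.0]})

mean = df_rating["Rating"].mean()
std = df_rating["Rating"].std()  # sample standard deviation (ddof=1)

# Flag values lying 3 or more standard deviations away from the mean.
is_outlier = (df_rating["Rating"] - mean).abs() >= 3 * std

# Assumption: the cleaned column keeps normal values and sets outliers to NaN.
df_rating["Rating_cleaned"] = df_rating["Rating"].mask(is_outlier)
print(df_rating.tail(3))
```

If the notebook expects a different treatment, such as replacing outliers with the mean or dropping the rows, only the last assignment needs to change.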
Take a look at the box plot in the Notebook. Based on it, outliers are defined as any values that lie more than 1.5 × IQR below the first quartile or above the third quartile.
Identify the outliers and store the results in a new column df_Price['Price_cleaned'].
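A sketch of the IQR rule, assuming the usual box plot convention (fences at 1.5 × IQR beyond the first and third quartiles) and assuming outliers are replaced with NaN in the cleaned column. The prices are hypothetical:

```python
import pandas as pd

# Hypothetical prices with one extreme value.
df_Price = pd.DataFrame({"Price": [0.0, 0.99, 1.99, 2.99, 3.99, 4.99, 399.99]})

q1 = df_Price["Price"].quantile(0.25)
q3 = df_Price["Price"].quantile(0.75)
iqr = q3 - q1

# Fences of the box plot whiskers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

is_outlier = (df_Price["Price"] < lower) | (df_Price["Price"] > upper)

# Assumption: outliers become NaN in the cleaned column.
df_Price["Price_cleaned"] = df_Price["Price"].mask(is_outlier)
print(df_Price)
```

The same pattern applies to other multipliers by swapping the 1.5 factor.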
Take a look at the size counts in the Notebook. Based on them, outliers are defined as any value expressed in gigabytes (G).
Identify the outliers and store the results in a new column df['Size_cleaned'].
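Assuming the Size column holds strings with a unit suffix such as 10M or 1.5G (an assumption about the data format, as is the sample below), the gigabyte entries can be flagged with a string test:

```python
import pandas as pd

# Hypothetical size strings, including a missing value.
df = pd.DataFrame({"Size": ["10M", "1.5G", "500k", "Varies with device", None]})

# True where the size string ends with the gigabyte suffix; missing values count as False.
is_gigabyte = df["Size"].str.endswith("G", na=False)

# Assumption: gigabyte-sized entries are replaced with NaN in the cleaned column.
df["Size_cleaned"] = df["Size"].mask(is_gigabyte)
print(df)
```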
Invalid values are defined as any value that contains a date in the future (later than now).
Select the invalid values and store the results in the variable df_invalid_release_date.
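A possible approach is to parse the dates and compare them with the current timestamp. The column name Released and the ISO date format are assumptions made for this sketch; the real dataset may use a different format string:

```python
import pandas as pd

# Hypothetical release dates; the last entry is unparseable on purpose.
df = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Released": ["2019-05-01", "2100-01-01", "not a date"],
})

# Unparseable strings become NaT instead of raising an error.
released = pd.to_datetime(df["Released"], format="%Y-%m-%d", errors="coerce")

# Rows whose release date lies in the future. NaT never compares as greater.
now = pd.Timestamp.now()
df_invalid_release_date = df[released > now]
print(df_invalid_release_date)
```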
Invalid values are defined as any email value that does not contain @.
Select the invalid values and store the results in the variable invalid_emails.
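The check for a missing @ can be sketched with a literal substring test. The column name Developer Email is an assumption, and so is the decision to treat missing emails as invalid:

```python
import pandas as pd

# Hypothetical email data, including a missing value.
df = pd.DataFrame({
    "Developer Email": ["dev@example.com", "no-at-sign.example.com", None],
})

# Literal substring match (regex=False); missing values are treated as not containing "@".
has_at = df["Developer Email"].str.contains("@", regex=False, na=False)

# Assumption: rows with a missing email are also reported as invalid.
invalid_emails = df[~has_at]
print(invalid_emails)
```

To leave missing values out of the result, pass na=True instead.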
Take a look at the histogram in the Notebook. By analyzing it, you will find different size units. As mobile phones nowadays have a maximum storage of 1 TB, let's define invalid values as any value greater than or equal to 1 TB.
Select the invalid values and store the results in the variable df_invalid_size.
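Comparing sizes across units requires converting them to a common unit first. The sketch below assumes size strings with suffixes k, M, G, and T and binary multiples (1 TB = 1024^4 bytes). The helper size_to_bytes, the suffix set, and the sample data are all hypothetical:

```python
import pandas as pd

# Assumed unit suffixes and their size in bytes (binary multiples).
UNIT_BYTES = {"k": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

def size_to_bytes(size):
    """Convert a string such as '10M' to bytes; return NaN when it cannot be parsed."""
    if not isinstance(size, str) or size[-1:] not in UNIT_BYTES:
        return float("nan")
    try:
        return float(size[:-1].replace(",", "")) * UNIT_BYTES[size[-1]]
    except ValueError:
        return float("nan")

# Hypothetical sample data.
df = pd.DataFrame({
    "App": ["A", "B", "C", "D", "E"],
    "Size": ["10M", "2T", "1,024G", "Varies with device", "500k"],
})

size_in_bytes = df["Size"].map(size_to_bytes)

# Invalid: at least 1 TB. Unparseable sizes are NaN and are not selected.
df_invalid_size = df[size_in_bytes >= UNIT_BYTES["T"]]
print(df_invalid_size)
```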
Take a look at the box plot in the Notebook. Based on it, outliers are defined as any values at or above the 95th percentile.
Identify the outliers and store the results in a new column df_installs['Installs_cleaned'].
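The percentile rule can be sketched with Series.quantile. As before, replacing outliers with NaN is an assumption about what the cleaned column should contain, and the data is hypothetical:

```python
import pandas as pd

# Hypothetical install counts: the integers 1 through 20.
df_installs = pd.DataFrame({"Installs": list(range(1, 21))})

# Value at the 95th percentile (linear interpolation by default).
threshold = df_installs["Installs"].quantile(0.95)

# Flag values at or above the threshold.
is_outlier = df_installs["Installs"] >= threshold

# Assumption: outliers become NaN, so the column is converted to float.
df_installs["Installs_cleaned"] = df_installs["Installs"].mask(is_outlier)
print(df_installs.tail(3))
```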
Take a look at the box plot in the Notebook. Based on it, outliers are defined as any values 2.5 or more standard deviations below or above the mean.
Identify the outliers and store the results in a new series Category_outliers.
Take a look at the box plot in the Notebook. Based on it, outliers are defined as any values that lie more than 1.8 × IQR below the first quartile or above the third quartile.
Identify the outliers and store the results in a new column df_release_year['Release_Year_cleaned'].