All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.
All our activities include solutions with explanations on how they work and why we chose them.
From the df.info summary, what is the number of rows in the dataset?
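A minimal sketch, assuming the listings have already been loaded into a DataFrame named df (the CSV filename below is hypothetical):

```python
import pandas as pd

# Hypothetical filename; use the file provided with the project
df = pd.read_csv("london_airbnb_listings.csv")

# info() prints the total entry (row) count, each column's dtype,
# and the non-null count per column
df.info()
```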
Get the sum of all missing entries in each column and store it in a variable named null_counts.
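A minimal sketch, assuming df is the DataFrame from the previous step:

```python
# isnull() marks every missing entry as True; summing down each
# column counts the missing entries per column
null_counts = df.isnull().sum()
print(null_counts)
```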
One of these columns primarily serves as an index and the other is full of null values. Neither contributes meaningfully to our analysis, so both can be safely removed.
Here, your objective is to remove columns that contain only null values and then eliminate rows with any missing values, resetting the index afterward.
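One way to do this, sketched below under the assumption that df carries over from the previous steps:

```python
# Drop columns in which every value is null
df = df.dropna(axis=1, how="all")

# Drop any remaining rows that contain at least one missing value
df = df.dropna()

# Reset the index so the row labels run 0..n-1 again
df = df.reset_index(drop=True)
```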
Round your result to one decimal place.
In this activity, you will ensure that each header in the DataFrame is correctly capitalized. Capitalized column headers improve the readability and comprehensibility of the data. The Pandas library in Python offers several methods for modifying column headers, such as capitalizing the first letter of each word. This step is part of the Data Cleaning process.
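One possible approach, sketched below; str.capitalize() upper-cases the first letter of each header and lower-cases the rest, though the activity may expect a different convention such as title case:

```python
# "neighbourhood" -> "Neighbourhood", "MINIMUM_NIGHTS" -> "Minimum_nights"
df.columns = df.columns.str.capitalize()
```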
What are the unique London neighbourhoods in the DataFrame? Store the unique values in the variable neighborhood.
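A minimal sketch; the column name Neighbourhood is an assumption based on the capitalization step above:

```python
# unique() returns the distinct values as a NumPy array
neighborhood = df["Neighbourhood"].unique()  # assumed column name
print(neighborhood)
```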
Quick break:
We've been cleaning so far; let's satisfy a curiosity (I know I'm dying to know).
Separate the currency symbol from the price values. This ensures data consistency and facilitates numerical analysis without the influence of formatting characters. Separating the currency symbol also allows for easier currency conversion and comparison across different datasets. Finally, change the column's data type from object to numeric.
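A sketch of one approach; the column name Price and the "£" symbol are assumptions about this dataset:

```python
# Strip the currency symbol and thousands separators (assumed formats),
# then coerce the cleaned strings to numbers
df["Price"] = df["Price"].astype(str).str.replace("£", "", regex=False)
df["Price"] = df["Price"].str.replace(",", "", regex=False)
df["Price"] = pd.to_numeric(df["Price"], errors="coerce")
```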
The Last_reviews column is being read as an object; convert it to the correct datatype. Also, rename it from Last_reviews to Last_review.
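A minimal sketch of both steps:

```python
# Parse the date strings into datetime64 values; entries that cannot
# be parsed become NaT instead of raising an error
df["Last_reviews"] = pd.to_datetime(df["Last_reviews"], errors="coerce")

# Rename the column to the singular form
df = df.rename(columns={"Last_reviews": "Last_review"})
```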
To facilitate future data manipulation, it would be beneficial to have a year column in your dataset. Add a Review_year column to your DataFrame, pulling the relevant year from the appropriate source in your dataset.
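Assuming Last_review was converted to datetime in the previous activity, a sketch:

```python
# The .dt accessor exposes datetime components; .dt.year extracts
# the year from each Last_review date
df["Review_year"] = df["Last_review"].dt.year
```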
The columns Price, Minimum_nights, Number_of_reviews, Reviews_per_month, and Calculated_host_listings_count contain outliers that can skew our analysis. To improve accuracy, first convert the columns to numeric values, then calculate the first quartile (the value below which 25% of the data falls), the third quartile (the value below which 75% of the data falls), and the interquartile range (IQR). Use these to determine the lower and upper bounds; then create a mask to identify rows with values outside these bounds and filter the DataFrame to exclude those outliers.
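A sketch using the conventional 1.5 × IQR fences (the multiplier is an assumption; the activity may specify a different one):

```python
cols = ["Price", "Minimum_nights", "Number_of_reviews",
        "Reviews_per_month", "Calculated_host_listings_count"]

# Make sure the columns are numeric before computing quartiles
df[cols] = df[cols].apply(pd.to_numeric, errors="coerce")

# First quartile, third quartile, and interquartile range per column
q1 = df[cols].quantile(0.25)
q3 = df[cols].quantile(0.75)
iqr = q3 - q1

# Lower and upper bounds using the 1.5 * IQR rule
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only rows where every column falls inside its bounds
mask = ((df[cols] >= lower) & (df[cols] <= upper)).all(axis=1)
df = df[mask].reset_index(drop=True)
```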
Remove from the dataset any listings that are not available for at least one day in the year.
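A one-line sketch; the column name Availability_365 is an assumption based on typical Airbnb listing data:

```python
# Keep only listings available for at least one day in the year
df = df[df["Availability_365"] > 0].reset_index(drop=True)  # assumed column
```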
You have a dataset that includes columns for longitude and latitude. To determine how many rows contain coordinates that fall within the actual geographical boundaries of London, use the following rough coordinates:
Latitude: 51.28°N to 51.70°N
Longitude: 0.51°W to 0.33°E (that is, -0.51 to 0.33 in decimal degrees)
Store your output in the variable valid_coordinates and compare the number of rows to the original dataset df. If the results are identical, the answer is "Yes"; otherwise, the answer is "No".
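A sketch of the comparison; the column names Latitude and Longitude are assumptions:

```python
# Rough bounding box for Greater London, from the activity
lat_mask = df["Latitude"].between(51.28, 51.70)   # assumed column name
lon_mask = df["Longitude"].between(-0.51, 0.33)   # assumed column name

valid_coordinates = df[lat_mask & lon_mask]

# "Yes" if no rows fall outside the box, "No" otherwise
print("Yes" if len(valid_coordinates) == len(df) else "No")
```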
Write your answer in the input box in this format: number of rows, number of columns. For example: 22345, 34.