The curious case of London's Airbnbs
Data Cleaning with Pandas


Welcome to London, a bustling city filled with history, culture, and vibrant neighborhoods. In the modern age, Airbnb has become a popular way to explore the city, offering a glimpse into local life and hidden gems. In this project, we will analyze a dataset of London Airbnbs to identify hidden gems and interesting trends. But there's a catch - the data is a mess! Missing values, inconsistencies, and strange characters are lurking everywhere. This is where you, our data whiz, come in!
Project Created by

Adeyinka Odiaka

Project Activities

All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.

All our activities include solutions with explanations on how they work and why we chose them.

input

How many rows are in the dataset?

From the df.info() summary, how many rows does the dataset contain?
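
One way to check, assuming df is the DataFrame the project environment loads for you:

```python
# info() prints the number of entries (rows), the columns, and their dtypes.
df.info()

# Equivalently, the row count alone:
print(len(df))
```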

codevalidated

Find the missing entries in the dataset

Get the sum of all missing entries in each column and store it in a variable named null_counts.
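 
A minimal sketch, assuming df is the DataFrame from the previous step:

```python
# isnull() flags missing cells; sum() totals them per column.
null_counts = df.isnull().sum()
print(null_counts)
```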

codevalidated

Drop the `Unnamed ` and `Unnamed: 14` columns

One of these columns primarily serves as an index and the other is full of null values. Neither contributes meaningfully to our analysis, so they can be safely removed.
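
One possible approach; the exact column labels may differ slightly in the raw file, so verify them against df.columns first:

```python
# Drop the index-like column and the all-null column.
df = df.drop(columns=["Unnamed ", "Unnamed: 14"])
```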

codevalidated

Eliminate rows with missing values in the DataFrame

Here, your objective is to remove columns that contain only null values and then eliminate rows with any missing values, resetting the index afterward.
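
A sketch of one way to do this:

```python
# Drop columns that are entirely null, then drop rows with any missing value,
# and reset the index so it runs consecutively again.
df = df.dropna(axis=1, how="all")
df = df.dropna().reset_index(drop=True)
```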

input

What is the average number of reviews per month?

Round your result to one decimal place.
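
A sketch, assuming the column is still named reviews_per_month at this stage (headers are capitalized in a later step, so check df.columns for the exact label):

```python
# Mean reviews per month, rounded to one decimal place.
avg_reviews = round(df["reviews_per_month"].mean(), 1)
print(avg_reviews)
```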

codevalidated

Capitalize all the column headers in the DataFrame

In this activity, ensure that each column header in the DataFrame is correctly capitalized. Capitalizing column headers enhances the readability and comprehensibility of the data. The Pandas library offers several ways to modify column headers, such as capitalizing the first letter of each word. This step forms part of the data cleaning process.
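
One way to do it:

```python
# Capitalize the first letter of every column header (e.g. "room_type" -> "Room_type").
df.columns = df.columns.str.capitalize()
```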

codevalidated

Identify the London neighbourhoods in the dataset.

What are the unique London neighbourhoods in the dataframe? Store the unique values in the variable neighborhood.
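
A minimal sketch, assuming the column is named Neighbourhood after the headers were capitalized:

```python
# Unique neighbourhood values in the dataset.
neighborhood = df["Neighbourhood"].unique()
print(neighborhood)
```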

input

Identify the most patronised `Room_type`

Quick break:

We've been cleaning so far; let's satisfy a curiosity (I know I'm dying to know).
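
One way to answer it:

```python
# Room type with the highest number of listings.
most_patronised = df["Room_type"].value_counts().idxmax()
print(most_patronised)
```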

codevalidated

Clean the `Price` column.

Remove the currency symbol and other formatting characters from the Price values. This ensures data consistency and facilitates numerical analysis without the influence of formatting characters, and separating out the currency symbol allows for easier currency conversion and comparison across different datasets. Also, change the data type from an object to numeric.
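
A sketch under the assumption that prices are stored as strings such as "£1,200":

```python
import pandas as pd

# Optionally keep the currency symbol in its own Series before stripping it.
currency_symbol = df["Price"].astype(str).str.extract(r"([^\d.,\s]+)", expand=False)

# Remove everything that is not a digit or decimal point, then convert to numeric.
df["Price"] = pd.to_numeric(
    df["Price"].astype(str).str.replace(r"[^\d.]", "", regex=True),
    errors="coerce",
)
```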

codevalidated

Convert the `Last_reviews` Column to Correct Data Type

The Last_reviews column is being read as an object; convert it to the appropriate datetime type. Also, rename it from Last_reviews to Last_review.
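
One way to do both steps:

```python
import pandas as pd

# Parse the review dates, then rename the column.
df["Last_reviews"] = pd.to_datetime(df["Last_reviews"])
df = df.rename(columns={"Last_reviews": "Last_review"})
```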

codevalidated

Generate a `Review_year` Column in Your Dataframe

To facilitate future data manipulation, it would be beneficial to have a year column in your dataset. Add a Review_year column to your dataframe, ensuring to pull the relevant year data from the appropriate source in your dataset.
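
A minimal sketch, assuming the year should come from the Last_review dates converted in the previous step:

```python
# Extract the year from each review date.
df["Review_year"] = df["Last_review"].dt.year
```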

codevalidated

Drop the outliers in the dataset.

The columns Price, Minimum_nights, Number_of_reviews, Reviews_per_month, and Calculated_host_listings_count contain outliers that can skew our analysis. To improve accuracy, first convert these columns to numeric values. Then calculate the first quartile (Q1, the value below which 25% of the data falls), the third quartile (Q3, the value below which 75% of the data falls), and the interquartile range (IQR = Q3 - Q1). Use these to determine the lower and upper bounds, create a mask that identifies rows with values outside those bounds, and filter the DataFrame to exclude those outliers.
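
A sketch using the common 1.5 * IQR rule for the bounds (the multiplier is an assumption; use whatever the project specifies):

```python
import pandas as pd

cols = ["Price", "Minimum_nights", "Number_of_reviews",
        "Reviews_per_month", "Calculated_host_listings_count"]

# Ensure the columns are numeric; unparsable values become NaN.
df[cols] = df[cols].apply(pd.to_numeric, errors="coerce")

# Quartiles and interquartile range per column.
q1 = df[cols].quantile(0.25)
q3 = df[cols].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only rows where every screened column falls within its bounds.
mask = ((df[cols] >= lower) & (df[cols] <= upper)).all(axis=1)
df = df[mask].reset_index(drop=True)
```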

codevalidated

Filter the DataFrame to exclude listings with less than 1 in the `Availability_365` column

Remove any listings that are not available for at least one day in the year from the dataset.
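
One way to express this filter:

```python
# Keep only listings available for at least one day in the year.
df = df[df["Availability_365"] >= 1].reset_index(drop=True)
```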

input

Do the longitude and latitude entries fall within the real London coordinates?

You have a dataset that includes columns for longitude and latitude. To determine how many rows contain coordinates that fall within the actual geographical boundaries of London, use the following rough bounds: latitude from 51.28°N to 51.70°N, and longitude from 0.51°W (-0.51) to 0.33°E. Store your output in the variable valid_coordinates and compare its number of rows to the original DataFrame df. If the counts are identical, the answer is "Yes"; otherwise, the answer is "No".
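
A sketch, assuming the coordinate columns are named Latitude and Longitude after capitalizing the headers:

```python
# Rows whose coordinates lie inside the rough London bounding box.
in_london = df["Latitude"].between(51.28, 51.70) & df["Longitude"].between(-0.51, 0.33)
valid_coordinates = df[in_london]

# Identical row counts mean every listing falls within London.
print("Yes" if len(valid_coordinates) == len(df) else "No")
```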

input

How many rows and columns are left in the dataset after cleaning it?

Write your answer in the input box in this format: number of rows, number of columns. For example 22345, 34.
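
A one-liner to read off the answer:

```python
# (number of rows, number of columns) after cleaning.
print(df.shape)
```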
