Data Cleaning Using the NASA Exoplanet Archive Dataset

Dive into NASA's exoplanet data and uncover alien worlds! Fix star temperatures, fill in missing planet masses, and classify orbits from fast to slow. Use Pandas to wrangle real astronomical data, from scorching hot Jupiters to potentially habitable super-Earths, and turn raw numbers into insights about planetary systems across the galaxy. Sharpen your data cleaning and feature engineering skills while exploring the cosmos!
Project Created by

Dhrubaraj Roy

Project Activities

All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.

All our activities include solutions with explanations on how they work and why we chose them.

codevalidated

Find 5 nearest star systems using `st_dist`

Find the five star systems closest to Earth. Sort the dataset by distance in ascending order and select the top 5 entries. Show only the planet host name and distance columns for these systems. The pl_hostname column contains the name of each planet's host star, while st_dist represents the distance to the star system in parsecs. Store the result in the closest_systems variable.
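A minimal sketch of the sort-and-select pattern described above. The DataFrame here is a hypothetical toy stand-in (host names and distances are made up), not the project's dataset:

```python
import pandas as pd

# Toy stand-in for the archive table; values are illustrative only.
df = pd.DataFrame({
    "pl_hostname": ["A", "B", "C", "D", "E", "F"],
    "st_dist": [12.5, 3.2, 48.0, 1.3, 7.7, 20.1],
})

# Sort ascending by distance, keep the first five rows and two columns.
closest_systems = df.sort_values("st_dist").head(5)[["pl_hostname", "st_dist"]]
print(closest_systems)
```

`df.nsmallest(5, "st_dist")` is an equivalent alternative for the sort-then-head step.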

codevalidated

Identify 3 hottest star systems using `st_teff`

Identify the three star systems with the highest temperatures. Sort the dataset by temperature in descending order and choose the top 3 entries. Display only the planet host name and temperature columns for these systems. The pl_hostname column contains the name of each planet's host star, while st_teff represents the effective temperature of the star in Kelvin. Store the result in the hottest_systems variable.
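The same pattern works in descending order. This sketch again uses a hypothetical toy DataFrame with made-up temperatures:

```python
import pandas as pd

# Toy stand-in; temperatures in Kelvin are illustrative only.
df = pd.DataFrame({
    "pl_hostname": ["A", "B", "C", "D", "E"],
    "st_teff": [5778.0, 9940.0, 3042.0, 6530.0, 4300.0],
})

# Sort descending by temperature, keep the first three rows and two columns.
hottest_systems = (
    df.sort_values("st_teff", ascending=False)
      .head(3)[["pl_hostname", "st_teff"]]
)
print(hottest_systems)
```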

multiplechoice

What are the shortest and longest orbital periods among all planets in the dataset, as represented by the `pl_orbper` column?

codevalidated

Rank top 5 facilities by planet discoveries

Determine the five facilities that have discovered the most exoplanets. Count the number of planets discovered by each facility and select the top 5 with the highest counts. Display these facilities along with their planet discovery counts. The pl_facility column contains the names of the facilities responsible for discovering each planet. Store the result in the top_facilities variable.
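One way to count rows per category is `value_counts`, which already sorts from most to least frequent. The facility counts below are invented for illustration:

```python
import pandas as pd

# Toy stand-in; the per-facility counts are made up.
df = pd.DataFrame({
    "pl_facility": (
        ["Kepler"] * 6 + ["K2"] * 5 + ["La Silla Observatory"] * 4
        + ["W. M. Keck Observatory"] * 3 + ["OGLE"] * 2 + ["HATNet"]
    )
})

# value_counts() returns counts in descending order; head(5) keeps the top five.
top_facilities = df["pl_facility"].value_counts().head(5)
print(top_facilities)
```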

input

Identify Missing Data in `pl_orbeccen`

Enter how many planets are missing orbital eccentricity data. Calculate the total number of missing values in the eccentricity column. The pl_orbeccen column contains the orbital eccentricity values for the planets.
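Counting missing values usually comes down to `isna()` followed by `sum()`. A small sketch on hypothetical data:

```python
import pandas as pd

# Toy stand-in; None becomes NaN in a float column.
df = pd.DataFrame({"pl_orbeccen": [0.1, None, 0.0, None, 0.3]})

# isna() yields a boolean Series; summing it counts the True values.
missing_eccentricity = int(df["pl_orbeccen"].isna().sum())
print(missing_eccentricity)
```

The same expression applied to a list of columns, for example `df[["pl_radj", "pl_bmassj"]].isna().sum()`, returns one count per column.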

codevalidated

Fill Missing `pl_bmassj` with Median

Handle missing data in the planet mass column. Replace all missing values with the median mass from the dataset. The pl_bmassj column represents the planet's mass in Jupiter masses.
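A minimal sketch of median imputation, assuming the column is numeric. The masses are made up:

```python
import pandas as pd

# Toy stand-in; masses in Jupiter masses are illustrative only.
df = pd.DataFrame({"pl_bmassj": [1.0, None, 3.0, None, 10.0]})

# median() skips NaN by default, so it reflects only the known masses.
median_mass = df["pl_bmassj"].median()
df["pl_bmassj"] = df["pl_bmassj"].fillna(median_mass)
print(df)
```

The median is less sensitive than the mean to a few very massive planets, which is a common reason to prefer it for skewed columns.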

codevalidated

Drop Rows with Missing `st_teff`

Remove all the rows from the dataset that have missing values in the star temperature column. Delete all entries where the effective temperature of the star is not available. The st_teff column contains the effective temperature of the stars in Kelvin.
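Dropping rows based on a single column can be sketched with `dropna` and its `subset` argument. The rows here are hypothetical:

```python
import pandas as pd

# Toy stand-in; two of the four rows lack a temperature.
df = pd.DataFrame({
    "pl_hostname": ["A", "B", "C", "D"],
    "st_teff": [5778.0, None, 3042.0, None],
})

# subset limits the missing-value check to st_teff only.
df = df.dropna(subset=["st_teff"])
print(df)
```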

input

Investigate Missing Data Patterns in `pl_radj` and `pl_bmassj`

Enter the number of missing values in the pl_radj and pl_bmassj columns, in that order and separated by a comma. For example: 123, 321

codevalidated

Handle Missing `pl_orbper` with Forward Fill

Fill in missing values in the pl_orbper column using the forward fill method. The pl_orbper column represents the orbital period of planets.
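Forward fill copies the last known value down into the gaps that follow it. A sketch on made-up periods:

```python
import pandas as pd

# Toy stand-in; periods in days are illustrative only.
df = pd.DataFrame({"pl_orbper": [3.5, None, None, 365.25, None]})

# ffill() propagates the most recent non-missing value forward.
df["pl_orbper"] = df["pl_orbper"].ffill()
print(df)
```

Two caveats: the result depends on row order, and a missing value in the very first row has nothing before it to copy, so it stays missing.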

codevalidated

Convert `pl_orbper` to Integer Days

Create a new column called orbit_days by rounding the values in the pl_orbper column to the nearest whole number and converting them to integers. This new column will represent the orbital period in whole days. The pl_orbper column contains the orbital period of planets.
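A sketch of the round-then-cast step, assuming missing periods have already been filled (casting NaN to an integer type raises an error). The values are hypothetical:

```python
import pandas as pd

# Toy stand-in with no missing values.
df = pd.DataFrame({"pl_orbper": [3.52, 10.4, 365.25, 0.49]})

# Round to the nearest whole number, then convert float -> integer.
df["orbit_days"] = df["pl_orbper"].round().astype("int64")
print(df)
```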

codevalidated

Create Yes/No Labels for `pl_kepflag`

Create a new column named kepler_detected by mapping the values in the pl_kepflag column. Assign 'Yes' to entries with a value of 1 and 'No' to entries with a value of 0. The pl_kepflag column indicates whether a planet was detected by the Kepler mission.
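Mapping flag values to labels can be sketched with `Series.map` and a dictionary. The flags below are made up:

```python
import pandas as pd

# Toy stand-in; 1 means flagged, 0 means not flagged.
df = pd.DataFrame({"pl_kepflag": [1, 0, 0, 1]})

# Each value is looked up in the dictionary; unmapped values would become NaN.
df["kepler_detected"] = df["pl_kepflag"].map({1: "Yes", 0: "No"})
print(df)
```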

codevalidated

Extract Year from `rowupdate`

Extract the year from the rowupdate column and store it in a new column called update_year. First, convert the rowupdate column to datetime format, then extract just the year component. The rowupdate column contains the date when the row was last updated. Ensure that the new update_year column has the data type int64. You may need to use .astype('int64') to convert the extracted year to the correct data type.
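A minimal sketch of the convert-then-extract sequence. The date strings are hypothetical and assumed to be in an ISO-like format that `pd.to_datetime` can parse:

```python
import pandas as pd

# Toy stand-in; dates are illustrative only.
df = pd.DataFrame({"rowupdate": ["2014-05-14", "2018-09-06", "2015-03-26"]})

# Parse strings to datetime, pull out the year, and pin the dtype to int64.
df["rowupdate"] = pd.to_datetime(df["rowupdate"])
df["update_year"] = df["rowupdate"].dt.year.astype("int64")
print(df.dtypes)
```

The explicit cast matters because `.dt.year` may return a narrower integer type depending on the pandas version.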

codevalidated

Simplify `pl_facility` Names

Create a new column called facility_short by extracting the first word from the pl_facility column. Split the text in pl_facility by spaces and take the first element of the resulting list. The pl_facility column contains the names of facilities involved in planetary discoveries.
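The split-and-take-first step can be sketched with the `.str` accessor. Facility names here are examples chosen for illustration:

```python
import pandas as pd

# Toy stand-in with one-word and multi-word facility names.
df = pd.DataFrame({
    "pl_facility": ["Kepler", "La Silla Observatory", "W. M. Keck Observatory"]
})

# str.split(" ") yields a list per row; .str[0] picks its first element.
df["facility_short"] = df["pl_facility"].str.split(" ").str[0]
print(df)
```

Note that the first word is not always a meaningful short name ("La", "W."), which is a limitation of this simple rule.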

input

Spot Negative Orbits in `pl_orbper`

Enter the number of negative values in the pl_orbper column.

input

Validate Planet Masses in `pl_bmassj`

Enter the number of negative values in the pl_bmassj column.
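Both negative-value checks follow the same pattern: build a boolean mask and sum it. A sketch on invented data:

```python
import pandas as pd

# Toy stand-in; the negative entries are deliberately planted.
df = pd.DataFrame({
    "pl_orbper": [3.5, -1.0, 365.25, None],
    "pl_bmassj": [1.0, 0.3, -0.2, -5.0],
})

# Comparisons with NaN evaluate to False, so missing values are not counted.
negative_periods = int((df["pl_orbper"] < 0).sum())
negative_masses = int((df["pl_bmassj"] < 0).sum())
print(negative_periods, negative_masses)
```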

codevalidated

Check `pl_discmethod` Categories

Identify planets discovered using unexpected methods. Create a list of valid discovery methods, which includes 'Transit', 'Radial Velocity', 'Imaging', 'Microlensing', and 'Astrometry'. Then filter the dataframe to find rows where the pl_discmethod value is not in this list of valid methods. Store the results in a new dataframe called unexpected_methods. The pl_discmethod column contains the method used to discover each planet.
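Filtering for values outside an allowed set can be sketched with `isin` and the `~` negation operator. The planet names below are hypothetical:

```python
import pandas as pd

# Toy stand-in; planet names are made up.
df = pd.DataFrame({
    "pl_name": ["p1", "p2", "p3", "p4", "p5"],
    "pl_discmethod": [
        "Transit", "Radial Velocity", "Pulsar Timing",
        "Imaging", "Transit Timing Variations",
    ],
})

valid_methods = ["Transit", "Radial Velocity", "Imaging", "Microlensing", "Astrometry"]

# ~ inverts the mask, keeping rows whose method is NOT in the valid list.
unexpected_methods = df[~df["pl_discmethod"].isin(valid_methods)]
print(unexpected_methods)
```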

input

Find Duplicate Planet Entries

Enter the number of duplicates present in the pl_name column.
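A sketch of duplicate counting with `duplicated`, using made-up planet names:

```python
import pandas as pd

# Toy stand-in; "x b" appears three times and "y c" twice.
df = pd.DataFrame({"pl_name": ["x b", "x b", "y c", "z d", "y c", "x b"]})

# duplicated() marks every repeat after the first occurrence of a value.
duplicate_count = int(df["pl_name"].duplicated().sum())
print(duplicate_count)
```

With the default `keep="first"`, the first occurrence of each name is not counted as a duplicate; passing `keep=False` would count every row that shares a name with another.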

codevalidated

Validate Star Temperature Range in `st_teff`

Check for star temperatures that fall outside the expected range. Identify stars with temperatures below 2000 K or above 40000 K. Store any stars meeting these criteria in a variable called invalid_temps. The st_teff column represents the effective temperature of each star in Kelvin.
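A range check like this can be sketched by combining two boolean masks with `|`. The temperatures are invented:

```python
import pandas as pd

# Toy stand-in; two values are deliberately out of range.
df = pd.DataFrame({
    "pl_hostname": ["A", "B", "C", "D", "E"],
    "st_teff": [5778.0, 1500.0, 42000.0, 3042.0, None],
})

# Each comparison needs its own parentheses because | binds tighter than < and >.
invalid_temps = df[(df["st_teff"] < 2000) | (df["st_teff"] > 40000)]
print(invalid_temps)
```

Rows with a missing temperature are not flagged, since comparisons against NaN evaluate to False.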

codevalidated

Categorize Planets by Size

Create a function to categorize planet sizes based on their radius. Apply this function to each planet's radius, creating a new column called planet_size. The function should classify planets as 'Small' if their radius is less than 0.5, 'Medium' if it is at least 0.5 but less than 1.5, and 'Large' if it is 1.5 or greater. If the radius is unknown, label it as 'Unknown'. The pl_radj column represents the planet's radius in Jupiter radii.
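One possible shape for such a function, shown on hypothetical radii. The function name `categorize_size` is an assumption for illustration:

```python
import pandas as pd

def categorize_size(radius):
    """Label a radius (in Jupiter radii) using the thresholds above."""
    if pd.isna(radius):  # handle missing values before any comparison
        return "Unknown"
    if radius < 0.5:
        return "Small"
    if radius < 1.5:
        return "Medium"
    return "Large"

# Toy stand-in covering each branch and both boundary values.
df = pd.DataFrame({"pl_radj": [0.1, 0.5, 1.49, 1.5, None]})
df["planet_size"] = df["pl_radj"].apply(categorize_size)
print(df)
```

Checking for missing values first matters, because comparisons against NaN are all False and the row would otherwise fall through to 'Large'.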

codevalidated

Rename Columns for Clarity

Rename specific columns in the dataset to improve clarity and consistency. Change pl_hostname to star_name, pl_discmethod to discovery_method, and pl_orbper to orbital_period_days. These new column names should better reflect the data they contain.
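Renaming can be sketched with `DataFrame.rename` and a mapping from old to new names. The single row below is a hypothetical placeholder:

```python
import pandas as pd

# Toy stand-in with one made-up row.
df = pd.DataFrame({
    "pl_hostname": ["A"],
    "pl_discmethod": ["Transit"],
    "pl_orbper": [3.5],
})

# Columns not listed in the mapping are left unchanged.
df = df.rename(columns={
    "pl_hostname": "star_name",
    "pl_discmethod": "discovery_method",
    "pl_orbper": "orbital_period_days",
})
print(df.columns.tolist())
```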

codevalidated

Create a Simplified Dataset

Create a simplified version of the dataset by selecting a subset of columns. Include the columns pl_name, star_name, discovery_method, orbital_period_days, pl_radj, and pl_bmassj. Store this new, streamlined dataset in a variable called simple_df.

codevalidated

Classify Orbital Periods

Create a function to classify orbital periods into categories. Apply this function to the orbital_period_days column, creating a new orbit_type column. The function should return 'Unknown' for missing values, 'Short' for periods less than 10 days, 'Medium' for periods of at least 10 days but less than 100 days, and 'Long' for periods of 100 days or more. Use this function to categorize each planet's orbit in the dataset.
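A sketch of the classification function, with boundary values included in the toy data. The function name `classify_orbit` is an assumption for illustration:

```python
import pandas as pd

def classify_orbit(period):
    """Label an orbital period (in days) using the thresholds above."""
    if pd.isna(period):
        return "Unknown"
    if period < 10:
        return "Short"
    if period < 100:
        return "Medium"
    return "Long"

# Toy stand-in covering each branch and both boundary values.
df = pd.DataFrame({"orbital_period_days": [None, 3.5, 10.0, 99.9, 100.0]})
df["orbit_type"] = df["orbital_period_days"].apply(classify_orbit)
print(df)
```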

codevalidated

Create a Binary Flag for Multi-Planet Systems

Count the number of planets for each star in the dataset. Create a new column called multi_planet_system that flags whether a star hosts more than one planet using boolean values (True/False). The star_name column contains the names of the stars. Use this information to identify and mark multi-planet systems in the dataset with True for stars hosting multiple planets and False for those hosting a single planet.
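One way to broadcast a per-star planet count back onto every row is `groupby` with `transform`. The star and planet names here are made up:

```python
import pandas as pd

# Toy stand-in; star S1 hosts two planets, S2 one, S3 three.
df = pd.DataFrame({
    "pl_name": ["S1 b", "S1 c", "S2 b", "S3 b", "S3 c", "S3 d"],
    "star_name": ["S1", "S1", "S2", "S3", "S3", "S3"],
})

# transform("count") returns one value per row, aligned with the original index.
planets_per_star = df.groupby("star_name")["pl_name"].transform("count")
df["multi_planet_system"] = planets_per_star > 1
print(df)
```

Unlike a plain aggregation, `transform` keeps the original row count, so the result can be assigned directly as a new column.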

Project Created by

Dhrubaraj Roy

Project Author at DataWars, responsible for leading the development and delivery of innovative machine learning and data science projects.

This project is part of

Data Cleaning with Pandas