All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.
All our activities include solutions with explanations on how they work and why we chose them.
Find the five star systems closest to Earth. Sort the dataset by distance in ascending order and select the top 5 entries. Show only the planet host name and distance columns for these systems. The pl_hostname column contains the names of the host stars or planets, while st_dist represents the distance to the star system in parsecs. Store the result in the closest_systems variable.
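One possible approach is sketched below. The small DataFrame is a made-up stand-in for the real dataset (its star names and distances are illustrative assumptions), and only the column names come from the activity text.

```python
import pandas as pd

# Hypothetical miniature stand-in for the exoplanet dataset.
df = pd.DataFrame({
    "pl_hostname": ["Star A", "Star B", "Star C", "Star D", "Star E", "Star F"],
    "st_dist": [12.5, 3.2, 40.0, 1.3, 7.7, 25.1],
})

# Sort by distance (ascending), keep the first five rows,
# then keep only the two requested columns.
closest_systems = (
    df.sort_values("st_dist", ascending=True)
      .head(5)[["pl_hostname", "st_dist"]]
)
print(closest_systems)
```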
Identify the three star systems with the highest temperatures. Sort the dataset by temperature in descending order and choose the top 3 entries. Display only the planet host name and temperature columns for these systems. The pl_hostname column contains the names of the host stars or planets, while st_teff represents the effective temperature of the star in Kelvin. Store the result in the hottest_systems variable.
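The same sort-and-slice pattern works here with the sort order reversed. This is a sketch on invented rows, not the real dataset.

```python
import pandas as pd

# Hypothetical rows; temperatures are illustrative values in Kelvin.
df = pd.DataFrame({
    "pl_hostname": ["Star A", "Star B", "Star C", "Star D"],
    "st_teff": [5778.0, 9600.0, 3200.0, 7100.0],
})

# Sort by temperature (descending) and keep the three hottest rows.
hottest_systems = (
    df.sort_values("st_teff", ascending=False)
      .head(3)[["pl_hostname", "st_teff"]]
)
print(hottest_systems)
```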
Determine the five facilities that have discovered the most exoplanets. Count the number of planets discovered by each facility and select the top 5 with the highest counts. Display these facilities along with their planet discovery counts. The pl_facility column contains the names of the facilities responsible for discovering each planet. Store the result in the top_facilities variable.
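A minimal sketch, assuming one row per planet: value_counts() returns counts sorted from highest to lowest, so taking the first five entries gives the top facilities. The facility names below are invented for illustration.

```python
import pandas as pd

# Hypothetical facility names; each row stands for one discovered planet.
facilities = (
    ["Facility A"] * 6 + ["Facility B"] * 5 + ["Facility C"] * 4
    + ["Facility D"] * 3 + ["Facility E"] * 2 + ["Facility F"] * 1
)
df = pd.DataFrame({"pl_facility": facilities})

# value_counts() sorts by count in descending order by default.
top_facilities = df["pl_facility"].value_counts().head(5)
print(top_facilities)
```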
Enter how many planets are missing orbital eccentricity data. Calculate the total number of missing values in the eccentricity column. The pl_orbeccen column contains the orbital eccentricity values for the planets.
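Counting missing values usually comes down to isna() followed by sum(). The sketch below uses made-up eccentricity values.

```python
import numpy as np
import pandas as pd

# Hypothetical eccentricities with two missing entries.
df = pd.DataFrame({"pl_orbeccen": [0.05, np.nan, 0.31, np.nan, 0.0]})

# isna() marks missing values as True; sum() counts the True entries.
missing_eccentricity = int(df["pl_orbeccen"].isna().sum())
print(missing_eccentricity)
```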
Handle missing data in the planet mass column. Replace all missing values with the median mass from the dataset. The pl_bmassj column represents the planet's mass in Jupiter masses.
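A sketch of median imputation on invented masses. Note that median() skips missing values by default, so the median is computed from the known masses only.

```python
import numpy as np
import pandas as pd

# Hypothetical masses (in Jupiter masses) with two gaps.
df = pd.DataFrame({"pl_bmassj": [1.0, np.nan, 3.0, np.nan, 8.0]})

# The median of the known values [1.0, 3.0, 8.0] is 3.0.
median_mass = df["pl_bmassj"].median()
df["pl_bmassj"] = df["pl_bmassj"].fillna(median_mass)
print(df)
```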
Remove all the rows from the dataset that have missing values in the star temperature column. Delete all entries where the effective temperature of the star is not available. The st_teff column contains the effective temperature of the stars in Kelvin.
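One way to do this is dropna() with the subset argument, which limits the missing-value check to the named column. The rows below are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; two stars have no recorded temperature.
df = pd.DataFrame({
    "pl_hostname": ["Star A", "Star B", "Star C", "Star D"],
    "st_teff": [5778.0, np.nan, 3200.0, np.nan],
})

# Drop only the rows where st_teff is missing; other columns are not checked.
df = df.dropna(subset=["st_teff"])
print(df)
```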
Enter the number of missing values present in columns pl_radj and pl_bmassj. For example: 123, 321
Fill in missing values in the pl_orbper column using the forward fill method. The pl_orbper column represents the orbital period of planets.
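Forward fill copies the last known value into each gap that follows it. A short sketch on invented periods shows the effect; a missing value in the very first row would stay missing, because there is nothing before it to copy.

```python
import numpy as np
import pandas as pd

# Hypothetical orbital periods with gaps.
df = pd.DataFrame({"pl_orbper": [3.5, np.nan, np.nan, 10.0, np.nan]})

# Each NaN takes the most recent non-missing value above it.
df["pl_orbper"] = df["pl_orbper"].ffill()
print(df)
```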
Create a new column called orbit_days by rounding the values in the pl_orbper column to the nearest whole number and converting them to integers. This new column will represent the orbital period in whole days. The pl_orbper column contains the orbital period of planets.
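A sketch of the rounding step, assuming pl_orbper no longer contains missing values (casting NaN to an integer type raises an error). The periods are made up.

```python
import pandas as pd

# Hypothetical orbital periods in days, with no missing values.
df = pd.DataFrame({"pl_orbper": [3.52, 10.2, 365.25]})

# round() keeps the float dtype, so an explicit cast to int follows.
df["orbit_days"] = df["pl_orbper"].round().astype(int)
print(df)
```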
Create a new column named kepler_detected by mapping the values in the pl_kepflag column. Assign 'Yes' to entries with a value of 1 and 'No' to entries with a value of 0. The pl_kepflag column indicates whether a planet was detected by the Kepler mission.
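Series.map() with a dictionary is one straightforward way to translate flag values into labels. The flags below are illustrative.

```python
import pandas as pd

# Hypothetical flag values: 1 = detected by Kepler, 0 = not.
df = pd.DataFrame({"pl_kepflag": [1, 0, 0, 1]})

# Values not present in the dictionary would become NaN.
df["kepler_detected"] = df["pl_kepflag"].map({1: "Yes", 0: "No"})
print(df)
```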
Extract the year from the rowupdate column and store it in a new column called update_year. First, convert the rowupdate column to datetime format, then extract just the year component. The rowupdate column contains the date when the row was last updated. Ensure that the new update_year column has the data type int64. You may need to use .astype('int64') to convert the extracted year to the correct data type.
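The two steps can be sketched as follows. The dates are invented, and the ISO-style YYYY-MM-DD format is an assumption about how rowupdate is stored.

```python
import pandas as pd

# Hypothetical update dates stored as strings.
df = pd.DataFrame({"rowupdate": ["2014-05-14", "2018-09-06", "2015-03-26"]})

# Step 1: parse the strings into datetime values.
df["rowupdate"] = pd.to_datetime(df["rowupdate"])

# Step 2: take the year and cast explicitly, since .dt.year may not
# return int64 on every pandas version.
df["update_year"] = df["rowupdate"].dt.year.astype("int64")
print(df.dtypes)
```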
Create a new column called facility_short by extracting the first word from the pl_facility column. Split the text in pl_facility by spaces and take the first element of the resulting list. The pl_facility column contains the names of facilities involved in planetary discoveries.
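A sketch using the .str accessor, which applies string operations to every row. The facility names are made up for illustration.

```python
import pandas as pd

# Hypothetical facility names.
df = pd.DataFrame({
    "pl_facility": ["Alpha Ridge Observatory", "Beta", "Gamma Space Telescope"],
})

# Split on spaces, then take element 0 of each resulting list.
df["facility_short"] = df["pl_facility"].str.split(" ").str[0]
print(df)
```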
Enter the number of negative values in the pl_orbper column.
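Negative values can be counted by summing a boolean mask. In this sketch on invented data, the missing value compares as False and is therefore not counted.

```python
import numpy as np
import pandas as pd

# Hypothetical periods, including two invalid negative entries.
df = pd.DataFrame({"pl_orbper": [3.5, -1.0, np.nan, 10.0, -20.0]})

# The comparison yields True/False per row; sum() counts the True values.
negative_periods = int((df["pl_orbper"] < 0).sum())
print(negative_periods)
```

The same expression applies to any other numeric column, such as pl_bmassj.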
Enter the number of negative values in the pl_bmassj column.
Identify planets discovered using unexpected methods. Create a list of valid discovery methods, which includes 'Transit', 'Radial Velocity', 'Imaging', 'Microlensing', and 'Astrometry'. Then filter the dataframe to find rows where the pl_discmethod value is not in this list of valid methods. Store the results in a new dataframe called unexpected_methods. The pl_discmethod column contains the method used to discover each planet.
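The filter described above can be sketched with isin() and the ~ negation operator. The planet names and the two out-of-list method strings below are illustrative assumptions.

```python
import pandas as pd

# Hypothetical rows; the last two methods are not in the valid list.
df = pd.DataFrame({
    "pl_name": ["P1", "P2", "P3", "P4"],
    "pl_discmethod": ["Transit", "Imaging", "Pulsar Timing", "Other Method"],
})

valid_methods = ["Transit", "Radial Velocity", "Imaging", "Microlensing", "Astrometry"]

# isin() is True for valid methods; ~ flips the mask to select the rest.
unexpected_methods = df[~df["pl_discmethod"].isin(valid_methods)]
print(unexpected_methods)
```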
Enter the number of duplicates present in the pl_name column.
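One common reading of "number of duplicates" is the number of rows that repeat an earlier value, which is what duplicated() counts with its default keep='first'. That interpretation is an assumption, as are the names below.

```python
import pandas as pd

# Hypothetical planet names: "P1" appears twice, "P3" three times.
df = pd.DataFrame({"pl_name": ["P1", "P1", "P2", "P3", "P3", "P3"]})

# duplicated() marks every occurrence after the first as True.
duplicate_count = int(df["pl_name"].duplicated().sum())
print(duplicate_count)
```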
Check for star temperatures that fall outside the expected range. Identify stars with temperatures below 2000 or above 40000. Store any stars meeting these criteria in a variable called invalid_temps. The st_teff column represents the effective temperature of each star in Kelvin.
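A sketch of the range check using two boolean conditions combined with | (element-wise "or"). Each condition needs its own parentheses. The temperatures are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical temperatures in Kelvin; two are outside 2000..40000.
df = pd.DataFrame({
    "pl_hostname": ["Star A", "Star B", "Star C", "Star D", "Star E"],
    "st_teff": [5778.0, 1500.0, 45000.0, 3200.0, np.nan],
})

# Missing temperatures compare as False, so they are not flagged here.
invalid_temps = df[(df["st_teff"] < 2000) | (df["st_teff"] > 40000)]
print(invalid_temps)
```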
Create a function to categorize planet sizes based on their radius. Apply this function to each planet's radius, creating a new column called planet_size. The function should classify planets as 'Small' if their radius is less than 0.5, 'Medium' if less than 1.5 and greater than or equal to 0.5, and 'Large' if greater than or equal to 1.5. If the radius is unknown, label it as 'Unknown'. The pl_radj column represents the planet's radius in Jupiter radii.
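The thresholds above translate directly into a chain of if statements. In this sketch the function name categorize_size is a hypothetical choice and the radii are made up; the missing-value check comes first because comparisons with NaN are always False.

```python
import numpy as np
import pandas as pd

def categorize_size(radius):
    """Classify a radius given in Jupiter radii (hypothetical helper)."""
    if pd.isna(radius):
        return "Unknown"
    if radius < 0.5:
        return "Small"
    if radius < 1.5:
        return "Medium"
    return "Large"

# Hypothetical radii chosen to hit each branch and both boundaries.
df = pd.DataFrame({"pl_radj": [0.1, 0.5, 1.49, 1.5, np.nan]})
df["planet_size"] = df["pl_radj"].apply(categorize_size)
print(df)
```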
Rename specific columns in the dataset to improve clarity and consistency. Change pl_hostname to star_name, pl_discmethod to discovery_method, and pl_orbper to orbital_period_days. These new column names should better reflect the data they contain.
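Renaming can be sketched with rename() and a dictionary that maps old names to new ones. The single row of data is illustrative.

```python
import pandas as pd

# Hypothetical one-row frame with the original column names.
df = pd.DataFrame({
    "pl_hostname": ["Star A"],
    "pl_discmethod": ["Transit"],
    "pl_orbper": [3.5],
})

# Columns not listed in the mapping keep their current names.
df = df.rename(columns={
    "pl_hostname": "star_name",
    "pl_discmethod": "discovery_method",
    "pl_orbper": "orbital_period_days",
})
print(df.columns.tolist())
```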
Create a simplified version of the dataset by selecting a subset of columns. Include the columns pl_name, star_name, discovery_method, orbital_period_days, pl_radj, and pl_bmassj. Store this new, streamlined dataset in a variable called simple_df.
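Selecting a list of column names returns a narrower DataFrame. The sketch assumes the columns have already been renamed; the extra st_teff column and the row values are invented to show that unlisted columns are left out.

```python
import pandas as pd

# Hypothetical frame that already uses the renamed columns.
df = pd.DataFrame({
    "pl_name": ["P1"],
    "star_name": ["Star A"],
    "discovery_method": ["Transit"],
    "orbital_period_days": [3.5],
    "pl_radj": [1.1],
    "pl_bmassj": [0.9],
    "st_teff": [5778.0],
})

columns_to_keep = [
    "pl_name", "star_name", "discovery_method",
    "orbital_period_days", "pl_radj", "pl_bmassj",
]

# copy() makes simple_df independent of the original frame.
simple_df = df[columns_to_keep].copy()
print(simple_df)
```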
Create a function to classify orbital periods into categories. Apply this function to the orbital_period_days column, creating a new orbit_type column. The function should return 'Unknown' for missing values, 'Short' for periods less than 10 days, 'Medium' for periods greater than or equal to 10 days and less than 100 days, and 'Long' for periods of 100 days or more. Use this function to categorize each planet's orbit in the dataset.
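A sketch of the classifier follows. The function name classify_orbit is a hypothetical choice, and the periods are made up to exercise each category and both boundaries.

```python
import numpy as np
import pandas as pd

def classify_orbit(period):
    """Label an orbital period given in days (hypothetical helper)."""
    if pd.isna(period):
        return "Unknown"
    if period < 10:
        return "Short"
    if period < 100:
        return "Medium"
    return "Long"

# Hypothetical periods in days.
df = pd.DataFrame({"orbital_period_days": [np.nan, 9.99, 10.0, 99.9, 100.0]})
df["orbit_type"] = df["orbital_period_days"].apply(classify_orbit)
print(df)
```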
Count the number of planets for each star in the dataset. Create a new column called multi_planet_system that flags whether a star hosts more than one planet using boolean values (True/False). The star_name column contains the names of the stars. Use this information to identify and mark multi-planet systems in the dataset with True for stars hosting multiple planets and False for those hosting a single planet.
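One way to attach a per-star count to every row is groupby() with transform(), which returns a result aligned with the original rows. The sketch assumes one row per planet, and the names are illustrative.

```python
import pandas as pd

# Hypothetical rows: Star A hosts two planets, Star B one, Star C three.
df = pd.DataFrame({
    "pl_name": ["P1", "P2", "P3", "P4", "P5", "P6"],
    "star_name": ["Star A", "Star A", "Star B", "Star C", "Star C", "Star C"],
})

# transform("count") repeats each star's planet count on all of its rows.
planets_per_star = df.groupby("star_name")["star_name"].transform("count")

# The comparison produces a boolean column.
df["multi_planet_system"] = planets_per_star > 1
print(df)
```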