  species     island  bill_length_mm  ...  flipper_length_mm  body_mass_g     sex
0  Adelie  Torgersen            39.1  ...              181.0       3750.0    Male
1  Adelie  Torgersen            39.5  ...              186.0       3800.0  Female
2  Adelie  Torgersen            40.3  ...              195.0       3250.0  Female
3  Adelie  Torgersen             NaN  ...                NaN          NaN     NaN
4  Adelie  Torgersen            36.7  ...              193.0       3450.0  Female

[5 rows x 7 columns]
species              0
island               0
bill_length_mm       2
bill_depth_mm        2
flipper_length_mm    2
body_mass_g          2
sex                 11
dtype: int64
So, the dataset has 7 columns, and several of them contain missing values. Here, we are interested only in the numerical columns, so let's drop the categorical columns other than species (that is, island and sex). Then we fill the missing values in each numerical column with that column's median.
import seaborn
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.preprocessing import LabelEncoder

# Load the penguins dataset and inspect it
df = seaborn.load_dataset("penguins")
print(df.head())
print(df.isnull().sum())

# Drop the categorical columns we do not need, keeping species as the last column
df = df.drop(labels=["island", "sex"], axis=1)
df = df[["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g", "species"]]

# Fill missing values with the column median; assigning the result back avoids
# the chained inplace fillna, which newer pandas versions warn about or ignore
df["bill_length_mm"] = df["bill_length_mm"].fillna(df["bill_length_mm"].median())
df["bill_depth_mm"] = df["bill_depth_mm"].fillna(df["bill_depth_mm"].median())
df["flipper_length_mm"] = df["flipper_length_mm"].fillna(df["flipper_length_mm"].median())
df["body_mass_g"] = df["body_mass_g"].fillna(df["body_mass_g"].median())
print(df.isnull().sum())
The output shows the following: …
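As a side note, the per-column median fill can probably be written more compactly by passing a Series of medians to `fillna`. The sketch below is only an illustration: `toy`, `numeric_cols`, and `filled` are hypothetical names, and the small DataFrame contains made-up values standing in for the penguins data so that the example runs without downloading anything.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the penguins data (values are made up)
toy = pd.DataFrame({
    "bill_length_mm": [39.0, np.nan, 41.0, 45.0],
    "body_mass_g": [3700.0, 3800.0, np.nan, 4200.0],
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
})

# Select only the numeric columns; species is left untouched
numeric_cols = toy.select_dtypes(include="number").columns

# median() returns one value per numeric column, and fillna matches
# those values to the columns by label
filled = toy.copy()
filled[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

print(filled.isnull().sum())
```

This should behave the same as filling each column individually, while avoiding the need to repeat the column names.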





