import seaborn
from matplotlib import pyplot

df = seaborn.load_dataset("titanic")
seaborn.boxplot(data=df, x="age")
pyplot.savefig("titanic-age-outliers.png")
pyplot.close()

mean_age = df["age"].mean()
std_age = df["age"].std()
lower_cutoff = mean_age - 3 * std_age
upper_cutoff = mean_age + 3 * std_age
print("Lower cutoff of age: ", lower_cutoff)
print("Upper cutoff of age: ", upper_cutoff)
print(df[(df["age"] > upper_cutoff) | (df["age"] < lower_cutoff)])

df = df[(df["age"] >= lower_cutoff) & (df["age"] <= upper_cutoff)]
print(df.head())
seaborn.boxplot(data=df, x="age")
pyplot.savefig("titanic-age-without-outliers-2.png")
pyplot.close()
Here, we first calculate the mean and standard deviation of the data in the age column. We then use the previously mentioned formula to calculate lower_cutoff and upper_cutoff, which lie three standard deviations below and above the mean age.
We are then printing the outliers using the following Python statement.
print(df[(df["age"] > upper_cutoff) | (df["age"] < lower_cutoff)])
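The cutoff calculation and outlier selection can be tried on a small series without downloading the Titanic dataset. The following is a minimal sketch: the helper name three_sigma_cutoffs, its parameter k, and the sample ages are hypothetical and are not part of the program above.

```python
import pandas as pd


def three_sigma_cutoffs(values, k=3.0):
    """Return (lower, upper) cutoffs at mean -/+ k standard deviations."""
    mean = values.mean()  # missing values are skipped by default
    std = values.std()    # sample standard deviation (ddof=1)
    return mean - k * std, mean + k * std


# Hypothetical ages: nineteen values of 30 and one extreme value
ages = pd.Series([30.0] * 19 + [130.0])
lower, upper = three_sigma_cutoffs(ages)

# Same selection logic as in the program above: outside either cutoff
outliers = ages[(ages > upper) | (ages < lower)]
print("Lower cutoff:", lower)
print("Upper cutoff:", upper)
print(outliers)
```

Note that an extreme value also inflates the mean and standard deviation used to compute the cutoffs, so in very small samples a single outlier may not fall outside three standard deviations.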
After that, we remove the outliers from the dataset using the following Python statement. This statement selects only those rows from the dataset for which the age in the age column is greater than or equal to the lower cutoff and less than or equal to the upper cutoff.
df = df[(df["age"] >= lower_cutoff) & (df["age"] <= upper_cutoff)]
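One side effect of this filter is worth knowing: comparisons against a missing value evaluate to False, so rows with no recorded age are dropped along with the outliers. The sketch below shows this on hypothetical data, with illustrative cutoffs that are not computed from the Titanic dataset. It uses the pandas method Series.between, which performs the same inclusive range check, and shows one possible way to keep rows with missing ages if that is what you want.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the cutoffs are illustrative, not computed from Titanic
people = pd.DataFrame({"age": [22.0, np.nan, 35.0, 95.0, 4.0]})
lower_cutoff, upper_cutoff = 0.0, 80.0

# Inclusive range check; NaN compares as False, so missing ages are dropped
in_range = people["age"].between(lower_cutoff, upper_cutoff)
filtered = people[in_range]

# Variant that removes only the out-of-range ages and keeps missing ones
filtered_keep_missing = people[in_range | people["age"].isna()]

print(filtered)
print(filtered_keep_missing)
```

Whether to keep the rows with missing ages depends on the analysis that follows; the original statement simply removes them.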
The output of the above program will be: …





