read the titanic dataset and plot a box plot based on the age column of the dataset.
import seaborn
from matplotlib import pyplot

df = seaborn.load_dataset("titanic")
seaborn.boxplot(data=df, x="age")
pyplot.savefig("titanic-age-outliers.png")
pyplot.close()
The resulting box plot will look like the following:
As we can see, the outliers are displayed as dots to the right of the box, beyond the end of the right whisker.
How to remove outliers using the mean and standard deviation of data?
Before we look at how to trim outliers using the mean and standard deviation of the data, we need to understand the 3 sigma rule, also called the empirical rule, of the normal distribution. As per this rule, if the data in a column is normally distributed, then about 68.27% of the values fall within one standard deviation of the mean, about 95.45% fall within two standard deviations, and about 99.73% fall within three standard deviations.
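To see where these percentages come from, here is a minimal sketch that computes them from the normal distribution itself. It assumes only the Python standard library; the helper name coverage_within is made up for this illustration.

```python
import math


def coverage_within(k: float) -> float:
    """Fraction of a normal distribution within k standard deviations of the mean."""
    # For a standard normal variable Z, P(|Z| < k) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {coverage_within(k) * 100:.2f}%")
```

The three printed values should match the percentages quoted in the rule. Keep in mind that they only hold when the column is at least roughly normally distributed.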
So, if we calculate the mean and standard deviation of the data, we can treat values that are greater than (mean + 3 × standard deviation) or less than (mean – 3 × standard deviation) as outliers, because only about 0.27% of normally distributed values fall outside this range.
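As a rough illustration of this rule, the sketch below applies it to a small made-up list of ages, not the titanic data. The function name trim_three_sigma and the sample values are assumptions for this example, and it uses the sample standard deviation from Python's statistics module.

```python
import statistics


def trim_three_sigma(values):
    """Keep the values that lie within mean ± 3 standard deviations."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    lower = mean - 3 * std
    upper = mean + 3 * std
    kept = [v for v in values if lower <= v <= upper]
    return kept, lower, upper


# Hypothetical ages: twenty plausible values plus one obvious data-entry error.
ages = [22, 24, 25, 26, 27, 28, 29, 29, 30, 30,
        31, 31, 32, 33, 34, 35, 36, 38, 40, 20, 250]

kept, lower, upper = trim_three_sigma(ages)
removed = [v for v in ages if v not in kept]
print(f"bounds: {lower:.1f} to {upper:.1f}")
print(f"removed: {removed}")
```

Note that the outlier itself inflates both the mean and the standard deviation, so in very small samples no value can ever exceed the 3 sigma limit. On a pandas column, the same bounds could be applied with a filter such as df[df["age"].between(lower, upper)], which would also drop rows where the age is missing.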
Let’s read the titanic dataset. We can use the following Python code to remove outliers using the mean and standard deviation of the age column…





