What is end-of-distribution imputation?
If values in a numerical column are missing at random, mean or median imputation is usually a reasonable choice. If they are not missing at random, we may want to perform end-of-distribution (also called end-of-tail) imputation instead.
In end-of-distribution imputation, a value is chosen from the tail of the column's distribution and used to fill in the missing values of that column.
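For example, if the data are normally distributed, a value v is chosen such that:
v = mean + 3 × standard deviation
where the mean and standard deviation are computed from the observed (non-missing) values of the column. The computed value v is then used to fill in the missing values of the column.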
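A minimal sketch of this idea, assuming the common convention of taking the mean plus three standard deviations as the end-of-distribution value. The column name ages and its values are made up for illustration and do not come from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing entries (illustrative values only).
ages = pd.Series([22.0, 38.0, np.nan, 35.0, 54.0, np.nan, 27.0, 14.0])

# pandas skips NaN by default when computing the mean and standard deviation,
# so both statistics are based on the observed values only.
v = ages.mean() + 3 * ages.std()

# Fill the missing entries with the end-of-distribution value.
filled = ages.fillna(v)

print(round(v, 2))            # the value used for imputation
print(filled.isnull().sum())  # number of missing values left
```

Because v lies beyond the largest observed value in this toy column, the imputed entries sit in the far right tail and remain easy to tell apart from the observed ones.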
How to perform end-of-distribution imputation in machine learning?
Let’s read the Titanic dataset. Suppose we want to use end-of-distribution imputation to fill the missing values in the age column of the dataset. We can use the following Python code for that purpose.
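import seaborn

# Load the Titanic dataset that ships with seaborn.
df = seaborn.load_dataset("titanic")

# End-of-distribution value: mean plus 3 standard deviations of the observed ages.
eod_value = df["age"].mean() + 3 * df["age"].std()

# Fill the missing ages with the computed value and assign the column back.
df["age"] = df["age"].fillna(eod_value)

# Percentage of missing values per column after imputation.
print(df.isnull().mean() * 100)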
Here, eod_value is the value that is calculated using the mentioned formula. After that, the calculated value is used to fill …
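The same steps can be wrapped in a small reusable helper. This is only a sketch: the function name impute_end_of_distribution, its n_std parameter, and the toy DataFrame are hypothetical and chosen for illustration so that the example runs without downloading any dataset.

```python
import numpy as np
import pandas as pd


def impute_end_of_distribution(series: pd.Series, n_std: float = 3.0) -> pd.Series:
    """Return a copy of `series` with NaNs replaced by mean + n_std * std.

    Hypothetical helper for illustration; assumes a numeric column.
    """
    eod_value = series.mean() + n_std * series.std()
    return series.fillna(eod_value)


# Hypothetical stand-in for a dataset with a partly missing numeric column.
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, np.nan, 54.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05, 51.86],
})

df["age"] = impute_end_of_distribution(df["age"])

# Percentage of missing values per column after imputation.
print(df.isnull().mean() * 100)
```

Returning a new Series and assigning it back leaves the original column untouched until the assignment, which makes it easy to compare the column before and after imputation.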





