Let’s say we want to know how the data in one column of a dataset is distributed. This is useful, for example, before filling in missing values in that column. If the distribution is approximately normal, we can fill in the missing values with the mean of the column. If the distribution is skewed, the median is usually a better choice, because it is less affected by extreme values. A histogram lets us see the distribution of the data in a column.
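The mean-or-median rule above can be sketched in a few lines of pandas. This is only an illustration: the helper name fill_missing, the 0.5 skewness cutoff (a common rule of thumb, not something this article prescribes), and the sample values are all assumptions made for the example.

```python
import pandas as pd


def fill_missing(column: pd.Series, skew_threshold: float = 0.5) -> pd.Series:
    """Fill NaN values with the mean if the column looks roughly symmetric,
    otherwise with the median. The threshold is an assumed rule of thumb."""
    skewness = column.skew()  # missing values are skipped by default
    if abs(skewness) < skew_threshold:
        fill_value = column.mean()
    else:
        fill_value = column.median()
    return column.fillna(fill_value)


# Made-up sample data, each with one missing value at the end
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, None])
skewed = pd.Series([1.0, 1.0, 2.0, 2.0, 50.0, None])

filled_symmetric = fill_missing(symmetric)  # gap filled with the mean
filled_skewed = fill_missing(skewed)        # gap filled with the median
```

In the skewed sample, the single large value pulls the mean far above most of the data, which is why the median is the safer fill value there. A skewness number is only a shortcut, so it is still worth looking at the histogram before deciding.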
Let’s look at an example using the Titanic dataset. The dataset contains information such as the age of each passenger, the town where each passenger embarked, and whether the passenger survived. Let’s say we want to know the distribution of the data in the age column. We can use the following Python code for that purpose.
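import pandas
from matplotlib import pyplot

# Load the Titanic dataset (assumes titanic.csv is in the working directory)
df = pandas.read_csv("titanic.csv")

# Plot the age column with 20 bins; missing ages are dropped first
pyplot.hist(df["age"].dropna(), bins=20)
pyplot.savefig("matplotlib-histogram.png")
pyplot.close()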
Here, we are using the pyplot.hist() function to plot the histogram. The bins parameter sets how many equal-width bins the range of the data is divided into. The resulting histogram is saved to matplotlib-histogram.png.
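To see what the bins parameter does to the data, here is a small sketch that uses numpy.histogram, the function matplotlib documents as the one pyplot.hist() relies on for binning. The ages below are made-up values, not taken from the Titanic dataset, and 4 bins are used only to keep the output small.

```python
import numpy as np

# Made-up ages for illustration only
ages = np.array([2, 8, 15, 22, 24, 29, 31, 35, 40, 80])

# Split the range from min(ages) to max(ages) into 4 equal-width bins
counts, edges = np.histogram(ages, bins=4)

print(counts)  # number of values that fall into each bin
print(edges)   # bin boundaries: one more edge than there are bins
```

Each bin here is (80 - 2) / 4 = 19.5 units wide. The third bin is empty because no age falls between 41 and 60.5, which would show up as a gap in the plotted histogram. Increasing bins gives a finer picture of the distribution, at the cost of noisier bar heights when the dataset is small.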





