A histogram visualizes the distribution of values in a dataset.
To plot a histogram of our data with matplotlib, we use:
plt.hist(data)
bins
sets the number of bins (bars) in our histogram.
By default, matplotlib creates a histogram with 10 bins of equal size spanning from the smallest sample to the largest sample in our data.
To change the number of bins, we use the keyword bins:
plt.hist(data, bins=nbins)
For example, we can divide the histogram into 20 bins to see more detail:
plt.hist(data, bins=20)
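As a minimal sketch of the default and the 20-bin case, the snippet below uses a made-up sample (the `rng` and `data` values are assumptions for illustration, not data from this text). It relies on the fact that plt.hist returns the bar heights, the bin edges, and the drawn patches, so the number of bins can be checked directly.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical sample: 500 values drawn around 170
rng = np.random.default_rng(0)
data = rng.normal(loc=170, scale=10, size=500)

# Default call: 10 equal-width bins from data.min() to data.max()
plt.figure()
counts_default, edges_default, _ = plt.hist(data)

# Same data split into 20 narrower bins
plt.figure()
counts_20, edges_20, _ = plt.hist(data, bins=20)

# n bins always come with n + 1 bin edges
print(len(counts_default), len(edges_default))
print(len(counts_20), len(edges_20))
```

More bins show finer detail, but each bin then holds fewer samples, so the bar heights become noisier.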
range
sets the minimum and maximum values that we will include in our histogram. Values outside this range are ignored.
plt.hist(data, range=(xmin, xmax))
For example:
plt.hist(df.height, range=(50, 180))
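To illustrate what range does to samples that fall outside it, here is a small sketch with a hand-written array (the `heights` values are hypothetical and stand in for df.height). Summing the returned counts shows how many samples were actually binned.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical heights; 40 and 200 lie outside the chosen range
heights = np.array([40, 55, 60, 120, 150, 179, 180, 200])

counts, edges, _ = plt.hist(heights, range=(50, 180))

# Only values between 50 and 180 (both ends included) are counted
inside = int(counts.sum())
print(inside, "of", len(heights), "values fall inside the range")
print(edges[0], edges[-1])  # the bins now span the requested range
```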
Normalization scales the height of each bar by a constant factor so that the areas of all the bars sum to one.
It makes two histograms comparable even if the sample sizes are different.
We can normalize a histogram with the keyword density=True. The area of each bar will then represent the proportion of the entire dataset that falls in that bin.
plt.hist(df.male_weight, density=True)
plt.hist(df.female_weight, density=True)
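The claim that the bar areas sum to one can be checked numerically. The sketch below assumes two made-up samples of very different sizes (`small` and `large` are hypothetical stand-ins for the weight columns) and multiplies each bar height by its bin width.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
small = rng.normal(loc=70, scale=8, size=200)     # hypothetical small sample
large = rng.normal(loc=85, scale=10, size=5000)   # hypothetical large sample

heights_small, edges_small, _ = plt.hist(small, density=True)
heights_large, edges_large, _ = plt.hist(large, density=True)

# Area of a bar = height * bin width; the total is 1 for each histogram
area_small = (heights_small * np.diff(edges_small)).sum()
area_large = (heights_large * np.diff(edges_large)).sum()
print(area_small, area_large)
```

Because both totals are one regardless of sample size, the two shapes can be compared on the same axes.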
When plotting multiple histograms, bars drawn on top of each other can be difficult to read.
To solve this problem, we can use:
alpha
(between 0 and 1) to set the transparency of the histogram
histtype='step'
to draw just the outline of a histogram
plt.hist(budget1, bins=20, alpha=0.4)
plt.hist(budget2, bins=20, alpha=0.4)
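Both options can be tried side by side. In this sketch the contents of budget1 and budget2 are invented for illustration, and the two-panel layout is just one possible way to compare the styles.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
budget1 = rng.normal(loc=50, scale=10, size=1000)  # hypothetical data
budget2 = rng.normal(loc=65, scale=12, size=1000)  # hypothetical data

# Left panel: semi-transparent filled bars
plt.subplot(1, 2, 1)
counts1, _, bars1 = plt.hist(budget1, bins=20, alpha=0.4)
counts2, _, bars2 = plt.hist(budget2, bins=20, alpha=0.4)

# Right panel: outlines only, so neither histogram hides the other
plt.subplot(1, 2, 2)
plt.hist(budget1, bins=20, histtype='step')
plt.hist(budget2, bins=20, histtype='step')

plt.savefig("overlapping_histograms.png")
```

A lower alpha makes the overlap region easier to see, while histtype='step' avoids filled areas altogether.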
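To classify a dataset, we can count the number of distinct peaks present in the graph: one peak means the distribution is unimodal, two peaks mean it is bimodal, and more than two mean it is multimodal.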
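Counting peaks by eye is usually enough, but the idea can also be sketched in code. The helper below is hypothetical (count_peaks is not a matplotlib function) and simply counts bins that are strictly taller than both neighbors. The two lists of counts are made up; real histograms are noisier and may need wider bins or smoothing before this works well.

```python
import numpy as np

def count_peaks(counts):
    """Count bins that are strictly taller than both of their neighbors."""
    c = np.asarray(counts, dtype=float)
    # Pad with -inf so the first and last bins can also qualify as peaks
    padded = np.concatenate(([-np.inf], c, [-np.inf]))
    is_peak = (padded[1:-1] > padded[:-2]) & (padded[1:-1] > padded[2:])
    return int(is_peak.sum())

# Hypothetical bar heights, such as those returned by plt.hist
unimodal_counts = [1, 4, 9, 12, 8, 3, 1]
bimodal_counts = [2, 7, 11, 5, 2, 6, 10, 4, 1]

print(count_peaks(unimodal_counts))  # one peak
print(count_peaks(bimodal_counts))   # two peaks
```

Flat-topped peaks (two equal neighboring bins) are not counted by this strict comparison, which is one reason to treat it as a rough check only.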