normal distribution curve doesn't fit well over histogram in subplots using matplotlib

Question

I am using "plt.subplots(2, 2, sharex=True, sharey=True)" to draw a 2*2 subplots. Each subplot has two Y axis and contains normal distribution curve over a histogram. Noting I particularly set "sharex=True, sharey=True" here in order to make all subplots share the same X axis and Y axis.

After running my code, everything is fine except the second, three, and fourth subplots where the normal distribution curve doesn't fit the histogram very well (please see the figure here)

I did googling but failed to get this issue solved. However, if I set "sharex=True, sharey=False" in my code, then the figure looks correct, but all subplots use their own Y axix which isn't what I want. Please see the figure here

Hope this issue can be fixed by experts in StackOverflow. Many thanks in advance!

Below is my code:

import matplotlib.pyplot as plt

from scipy.stats import norm

def align_yaxis(ax1, v1, ax2, v2):
    #adjust ax2 ylimit so that v2 in ax2 is aligned to v1 in ax1
    _, y1 = ax1.transData.transform((0, v1))
    _, y2 = ax2.transData.transform((0, v2))
    inv = ax2.transData.inverted()
    _, dy = inv.transform((0, 0)) - inv.transform((0, y1-y2))
    miny, maxy = ax2.get_ylim()
    ax2.set_ylim(miny+dy, maxy+dy)
    
def drawSingle(myax, mydf , title, offset):
    
    num_bins = 200
    xs = mydf["gap"]
    x = np.linspace(-1,1,1000)
    
    mu =np.mean(x) 
    sigma =np.std(xs)
    n, bins, patche =  myax.hist(xs, num_bins, alpha=0.8, facecolor='blue', density=False) 
    

    myax.set_ylabel('frequency',color="black",fontsize=12, weight = "bold")
    myax.set_xlabel('X', fontsize=12, weight = "bold",horizontalalignment='center')

    ax_twin = myax.twinx()
    y_normcurve = norm.pdf(bins, mu, sigma)
    ax_twin.plot(bins, y_normcurve, 'r--') 

    align_yaxis(myax,0,ax_twin,0)
    peakpoint = norm.pdf(mu,loc=mu,scale=sigma)
    plt.vlines(mu, 0, peakpoint, 'y', '--', label='example')
    
    ax_twin.set_ylabel("probablility dense",color="black",fontsize=12, weight = "bold")
    
         
def drawSubplots(mydf1,mydf2,mydf3,mydf4, pos1,pos2,pos3,pos4, title, filename):
    plt.rcParams['figure.figsize'] = (18,15 )
    
    my_x_ticks = np.arange(-0.8, 0.8,0.1)
   
    rows, cols = 2, 2
    fig, ax = plt.subplots(2, 2, sharex=True, sharey=True)

    drawSingle(ax[0][0], mydf1, "Subplot1", pos1)
    drawSingle(ax[0][1], mydf2, "Subplot2", pos2)
    drawSingle(ax[1][0], mydf3, "Subplot3", pos3)
    drawSingle(ax[1][1], mydf4, "Subplot4", pos4)
    
    plt.text(-1, -1, title, horizontalalignment='center', fontsize=18)
    
    plt.show()

    
drawSubplots(df1, df2,df3,df4,3.2,3.1,2.7,2.85,"test9", "test9")

Note that with sharey=True, the histograms of dataframes with less rows will be smaller. If you want those to have a similar height, you need to normalize the heights with hist(..., density=True) (this scales their area to 1). — JohanC
You need density=True to properly fit the normal curve. Or, otherwise, multiply y_normcurve with len(xs) and by the binwidth (y_normcurve*len(xs)*(bins[1]-bins[0)) to have their area equal. Better leave out the twinx() and plot everything on the same ax. Also, mu=np.mean(xs) is more appropriate than mu=np.mean(x). Finally, the curve would look better with y_normcurve = norm.pdf(x, mu, sigma) and also draw it as myax.plot(x, ....) — JohanC
Hi JohanC, thanks a lot for your help!. I followed your advice and revised the code in the function of drawSingle. (set density=True) Now, the normal distribution curve looks fitting better over the histogram, however the left Y axis in each plot is not frequency. Can I still have two Y axis (the left one for frequency, the right one for probablity dense)? Thanks again! — Tony Zeng
also, I am not quite understanding what you mean (y_normcurve*len(xs)*(bins[1]-bins[0)) to have their area equal). Could you kindly please advise me how to do it in the code? Thanks again! — Tony Zeng
With density=True, the y-axis will be the height of the "probabiliy distribution function". Note that "frequency" is only a useful measure if you have a well-defined bin width. If you use bins=200, the bin width will be (xs.max() - xs.min()) / 200 which is different in the 4 plots. Multiplying (y_normcurve*len(xs)*(bins[1]-bins[0)) is only needed if you'd use density=False. — JohanC

JohanC JohanC · Accepted Answer · 2021-04-06T15:16:40

Here is an attempt to:

have the left y-axes being "frequency" (which is very uninformative in the case of the current bin widths) and shared among the 4 subplots
have the right y-axes be a "probability density"; note how the top of all gaussians is around y=0.02 (the twin axes can only be set at the end because the shared y axes can be updated via later subplots)
have the histogram and the normal curve aligned

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy.stats import norm

def drawSingle(myax, mydf, title):
    num_bins = 200
    xs = mydf["gap"]
    x = np.linspace(-1, 1, 1000)

    mu = np.mean(x)
    sigma = np.std(xs)
    n, bins, patches = myax.hist(xs, num_bins, alpha=0.8, facecolor='blue', density=False)

    myax.set_ylabel('frequency', color="black", fontsize=12, weight="bold")
    myax.set_xlabel('X', fontsize=12, weight="bold", horizontalalignment='center')

    normalization_factor = len(xs) * (bins[1] - bins[0])
    y_normcurve = norm.pdf(x, mu, sigma) * normalization_factor
    myax.plot(x, y_normcurve, 'r--')
    myax.vlines(mu, 0, y_normcurve.max(), 'y', '--', color='lime', label='example')
    return normalization_factor

def drawSubplots(mydf1, mydf2, mydf3, mydf4, title):
    plt.rcParams['figure.figsize'] = (18, 15)

    fig, ax = plt.subplots(nrows=2, ncols=2, sharex=True, sharey=True)

    dfs = [mydf1, mydf2, mydf3, mydf4]
    norm_factors = [drawSingle(ax_i, df, title)
                    for ax_i, df, title in zip(ax.ravel(), dfs, ["Subplot1", "Subplot2", "Subplot3", "Subplot4"])]
    for ax_i, norm_factor in zip(ax.ravel(), norm_factors):
        ax_twin = ax_i.twinx()
        ymax = ax_i.get_ylim()[1]
        ax_twin.set_ylim(0, ymax / norm_factor)
    plt.suptitle(title, fontsize=18)
    plt.tight_layout()
    plt.show()

df1, df2, df3, df4 = [pd.DataFrame({"gap": np.random.normal(0, 0.2, n)}) for n in [6000, 4000, 1800, 1200]]
drawSubplots(df1, df2, df3, df4, "Title")

normal distribution curve doesn't fit well over histogram in subplots using matplotlib

2 Answers