0
votes

I am using "plt.subplots(2, 2, sharex=True, sharey=True)" to draw a 2*2 subplots. Each subplot has two Y axis and contains normal distribution curve over a histogram. Noting I particularly set "sharex=True, sharey=True" here in order to make all subplots share the same X axis and Y axis.

After running my code, everything is fine except the second, three, and fourth subplots where the normal distribution curve doesn't fit the histogram very well (please see the figure here)

enter image description here

I did googling but failed to get this issue solved. However, if I set "sharex=True, sharey=False" in my code, then the figure looks correct, but all subplots use their own Y axix which isn't what I want. Please see the figure here

enter image description here

Hope this issue can be fixed by experts in StackOverflow. Many thanks in advance!

Below is my code:

import matplotlib.pyplot as plt

from scipy.stats import norm

def align_yaxis(ax1, v1, ax2, v2):
    #adjust ax2 ylimit so that v2 in ax2 is aligned to v1 in ax1
    _, y1 = ax1.transData.transform((0, v1))
    _, y2 = ax2.transData.transform((0, v2))
    inv = ax2.transData.inverted()
    _, dy = inv.transform((0, 0)) - inv.transform((0, y1-y2))
    miny, maxy = ax2.get_ylim()
    ax2.set_ylim(miny+dy, maxy+dy)
    
def drawSingle(myax, mydf , title, offset):
    
    num_bins = 200
    xs = mydf["gap"]
    x = np.linspace(-1,1,1000)
    
    mu =np.mean(x) 
    sigma =np.std(xs)
    n, bins, patche =  myax.hist(xs, num_bins, alpha=0.8, facecolor='blue', density=False) 
    

    myax.set_ylabel('frequency',color="black",fontsize=12, weight = "bold")
    myax.set_xlabel('X', fontsize=12, weight = "bold",horizontalalignment='center')

    ax_twin = myax.twinx()
    y_normcurve = norm.pdf(bins, mu, sigma)
    ax_twin.plot(bins, y_normcurve, 'r--') 

    align_yaxis(myax,0,ax_twin,0)
    peakpoint = norm.pdf(mu,loc=mu,scale=sigma)
    plt.vlines(mu, 0, peakpoint, 'y', '--', label='example')
    
    ax_twin.set_ylabel("probablility dense",color="black",fontsize=12, weight = "bold")
    
         
def drawSubplots(mydf1,mydf2,mydf3,mydf4, pos1,pos2,pos3,pos4, title, filename):
    plt.rcParams['figure.figsize'] = (18,15 )
    
    my_x_ticks = np.arange(-0.8, 0.8,0.1)
   
    rows, cols = 2, 2
    fig, ax = plt.subplots(2, 2, sharex=True, sharey=True)

    drawSingle(ax[0][0], mydf1, "Subplot1", pos1)
    drawSingle(ax[0][1], mydf2, "Subplot2", pos2)
    drawSingle(ax[1][0], mydf3, "Subplot3", pos3)
    drawSingle(ax[1][1], mydf4, "Subplot4", pos4)
    
    plt.text(-1, -1, title, horizontalalignment='center', fontsize=18)
    
    plt.show()

    
drawSubplots(df1, df2,df3,df4,3.2,3.1,2.7,2.85,"test9", "test9")
2
Note that with sharey=True, the histograms of dataframes with less rows will be smaller. If you want those to have a similar height, you need to normalize the heights with hist(..., density=True) (this scales their area to 1).JohanC
You need density=True to properly fit the normal curve. Or, otherwise, multiply y_normcurve with len(xs) and by the binwidth (y_normcurve*len(xs)*(bins[1]-bins[0)) to have their area equal. Better leave out the twinx() and plot everything on the same ax. Also, mu=np.mean(xs) is more appropriate than mu=np.mean(x). Finally, the curve would look better with y_normcurve = norm.pdf(x, mu, sigma) and also draw it as myax.plot(x, ....)JohanC
Hi JohanC, thanks a lot for your help!. I followed your advice and revised the code in the function of drawSingle. (set density=True) Now, the normal distribution curve looks fitting better over the histogram, however the left Y axis in each plot is not frequency. Can I still have two Y axis (the left one for frequency, the right one for probablity dense)? Thanks again!Tony Zeng
also, I am not quite understanding what you mean (y_normcurve*len(xs)*(bins[1]-bins[0)) to have their area equal). Could you kindly please advise me how to do it in the code? Thanks again!Tony Zeng
With density=True, the y-axis will be the height of the "probabiliy distribution function". Note that "frequency" is only a useful measure if you have a well-defined bin width. If you use bins=200, the bin width will be (xs.max() - xs.min()) / 200 which is different in the 4 plots. Multiplying (y_normcurve*len(xs)*(bins[1]-bins[0)) is only needed if you'd use density=False.JohanC

2 Answers

0
votes

Here is an attempt to:

  • have the left y-axes being "frequency" (which is very uninformative in the case of the current bin widths) and shared among the 4 subplots
  • have the right y-axes be a "probability density"; note how the top of all gaussians is around y=0.02 (the twin axes can only be set at the end because the shared y axes can be updated via later subplots)
  • have the histogram and the normal curve aligned
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy.stats import norm

def drawSingle(myax, mydf, title):
    num_bins = 200
    xs = mydf["gap"]
    x = np.linspace(-1, 1, 1000)

    mu = np.mean(x)
    sigma = np.std(xs)
    n, bins, patches = myax.hist(xs, num_bins, alpha=0.8, facecolor='blue', density=False)

    myax.set_ylabel('frequency', color="black", fontsize=12, weight="bold")
    myax.set_xlabel('X', fontsize=12, weight="bold", horizontalalignment='center')

    normalization_factor = len(xs) * (bins[1] - bins[0])
    y_normcurve = norm.pdf(x, mu, sigma) * normalization_factor
    myax.plot(x, y_normcurve, 'r--')
    myax.vlines(mu, 0, y_normcurve.max(), 'y', '--', color='lime', label='example')
    return normalization_factor

def drawSubplots(mydf1, mydf2, mydf3, mydf4, title):
    plt.rcParams['figure.figsize'] = (18, 15)

    fig, ax = plt.subplots(nrows=2, ncols=2, sharex=True, sharey=True)

    dfs = [mydf1, mydf2, mydf3, mydf4]
    norm_factors = [drawSingle(ax_i, df, title)
                    for ax_i, df, title in zip(ax.ravel(), dfs, ["Subplot1", "Subplot2", "Subplot3", "Subplot4"])]
    for ax_i, norm_factor in zip(ax.ravel(), norm_factors):
        ax_twin = ax_i.twinx()
        ymax = ax_i.get_ylim()[1]
        ax_twin.set_ylim(0, ymax / norm_factor)
    plt.suptitle(title, fontsize=18)
    plt.tight_layout()
    plt.show()

df1, df2, df3, df4 = [pd.DataFrame({"gap": np.random.normal(0, 0.2, n)}) for n in [6000, 4000, 1800, 1200]]
drawSubplots(df1, df2, df3, df4, "Title")

subplots with aligned histograms

0
votes

Many thanks JohanC, you are amazing.

Based on your code, I just added a few lines of code within drawSubplots function in order to make 95% of the Gaussian curve area shaded between the lower bound and upper bound for each subplot. The following is my try. It seems that ax_twin.fill_between doesn't work normally here. As you could see from the figure that the shaded area is out of the Gaussian curve enter image description here. What I want is only to shade the area under the Gaussian curve between the lower bound and upper bound. If you don't mind, would you please check it out my mistake? Thank you very much!

import matplotlib.pyplot as plt
import math

from scipy.stats import norm

def align_yaxis(ax1, v1, ax2, v2):
    #adjust ax2 ylimit so that v2 in ax2 is aligned to v1 in ax1
    _, y1 = ax1.transData.transform((0, v1))
    _, y2 = ax2.transData.transform((0, v2))
    inv = ax2.transData.inverted()
    _, dy = inv.transform((0, 0)) - inv.transform((0, y1-y2))
    miny, maxy = ax2.get_ylim()
    ax2.set_ylim(miny+dy, maxy+dy)
    
    

def drawSingle(myax, mydf , title):
    
    num_bins = 200
    xs = mydf["gap"]
    x = np.linspace(-1,1,1000)
 
    mu =np.mean(xs) 
    sigma =np.std(xs)
    
    n, bins, patches = myax.hist(xs, num_bins, alpha=0.8, facecolor='blue', density=False)
    

    myax.set_ylabel('Frequency', color="black", fontsize=12, weight="bold")
    myax.set_xlabel(title, fontsize=12, weight="bold", horizontalalignment='center')

    normalization_factor = len(xs) * (bins[1] - bins[0])
    y_normcurve = norm.pdf(x, mu, sigma) * normalization_factor
    myax.plot(x, y_normcurve, 'r--')

    myax.vlines(mu, 0, y_normcurve.max(), 'y', '--', color='lime', label='example')
    
    plt.xlim(-0.8,0.8) 
    my_x_ticks = np.arange(-0.8, 0.8,0.1)
    plt.xticks(my_x_ticks)

    return normalization_factor, mu, sigma
    
    
def drawSubplots(mydf1,mydf2,mydf3,mydf4, title):
    plt.rcParams['figure.figsize'] = (18,15 )
    
    norm_factors = []
    mus = []
    sigmas = []
    
    my_x_ticks = np.arange(-0.8, 0.8,0.1)
    
    
    rows, cols = 2, 2
    fig, ax = plt.subplots(nrows=rows, ncols=cols, sharex=True, sharey=True)
    

    dfs = [mydf1, mydf2, mydf3, mydf4]
    #norm_factors = [drawSingle(ax_i, df, title)
                    #for ax_i, df, title in zip(ax.ravel(), dfs, ["Subplot1", "Subplot2", "Subplot3", "Subplot4"])]
    
    
    for ax_i, df, title in zip(ax.ravel(), dfs, ["Subplot1", "Subplot2", "Subplot3", "Subplot4"]):
        norm_factor, mu, sigma = drawSingle(ax_i, df, title)
        norm_factors.append(norm_factor)
        mus.append(mu)
        sigmas.append(sigma)
    
    
    for ax_i, norm_factor, mu, sigma in zip(ax.ravel(), norm_factors, mus, sigmas ):
        ax_twin = ax_i.twinx()
        
        xmax = ax_i.get_xlim()[1]
        ax_twin.set_ylim(0, xmax / norm_factor)
        ax_twin.set_ylabel("probablility dense",color="black",fontsize=12, weight = "bold")
        
        
        CI_95_lower = mu - (1.96*sigma)
        CI_95_upper = mu + (1.96*sigma)

        px_shaded = np.arange(CI_95_lower,CI_95_upper,0.1)
        ax_twin.fill_between(px_shaded,norm.pdf(px_shaded,loc=mu,scale=sigma) * norm_factor,alpha=0.75, color='pink')
        area_shaded_95_CI = norm.cdf(x=CI_95_upper, loc=mu, scale=sigma)-norm.cdf(x=CI_95_lower, loc=mu, scale=sigma)
        ax_twin.text(-0.06,0.01,str(round(area_shaded_95_CI*100,1))+"%", fontsize=20)
       
        ax_twin.annotate(s=f'lower bound= {CI_95_lower:.3f}',xy=(CI_95_lower,norm.pdf(CI_95_lower,loc=mu,scale=sigma)),xytext=(-0.75,0.01),weight='bold',color='blue',\
                 arrowprops=dict(arrowstyle='-|>',connectionstyle='arc3',color='green'),\
                 fontsize=12
                )

        ax_twin.annotate(s=f'upper bound= {CI_95_upper:.3f}',xy=(CI_95_upper,norm.pdf(CI_95_upper,loc=mu,scale=sigma)),xytext=(0.28,0.01),weight='bold',color='blue',\
                 arrowprops=dict(arrowstyle='-|>',connectionstyle='arc3',color='green'),\
                  fontsize=12
                )

        ax_twin.text(0.05, 0.03, r"$\mu=" + f'{mu:.6f}' + ", \sigma=" + f'{sigma:.6f}' + "$" + ", confidence interval=95%" ,
            horizontalalignment='center', fontsize=15)
        
        
    
    plt.suptitle(title, fontsize=18)
    plt.tight_layout()
    plt.show()


df1, df2, df3, df4 = [pd.DataFrame({"gap": np.random.normal(0, 0.2, n)}) for n in [6000, 4000, 1800, 1200]]

drawSubplots(df1, df2, df3, df4, "Title")