
I have a dataframe in which I created a new column that sums the values in the first three columns (one column per date). I then calculated a rate for each row based on the population column.

I would like to calculate lower and upper 95% confidence limits for "sum_of_days_rate" for each row in this dataset.

I can calculate the mean of the first three columns, but I am not sure how to calculate lower and upper bounds for the rate based on the sum of these three columns.

Sample of the dataset below:

import pandas as pd

data = {'09/01/2021': [74, 84, 38],
        '10/11/2021': [43, 35, 35],
        '12/01/2021': [35, 37, 16],
        'population': [23000, 69000, 48000]}

df = pd.DataFrame(data, columns=['09/01/2021', '10/11/2021', '12/01/2021', 'population'])
# Sum the three date columns row-wise, then express the sum as a rate per 100,000 population
df['sum_of_days'] = df.loc[:, df.columns[0:3]].sum(axis=1)
df['sum_of_days_rate'] = df['sum_of_days'] / df['population'] * 100000

1 Answer


To estimate a confidence interval you need to make certain assumptions about the data: how it is distributed, or what the associated error would be. I am not sure what your data points mean or why you are summing them.

A commonly used distribution for rates is the Poisson distribution, and you can construct the interval from it, given a mean:

import scipy.stats

# Central 95% interval of a Poisson distribution whose mean is the rate
lb, ub = scipy.stats.poisson.interval(0.95, df.sum_of_days_rate)
df['lb'] = lb
df['ub'] = ub

The arrays lb and ub hold the lower and upper bounds of the 95% interval, computed from a Poisson distribution whose mean is sum_of_days_rate. The final data frame looks like this:

   09/01/2021  10/11/2021  12/01/2021  population  sum_of_days  sum_of_days_rate     lb     ub
0          74          43          35       23000          152        660.869565  611.0  712.0
1          84          35          37       69000          156        226.086957  197.0  256.0
2          38          35          16       48000           89        185.416667  159.0  213.0
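
Note that the interval above is taken on the rate itself, so its width depends on the choice of "per 100,000" scaling. As an alternative sketch (not the method used above), one could treat the raw count sum_of_days as the Poisson observation, compute an exact 95% interval for the count from chi-square quantiles (often called the Garwood interval), and only then scale to a rate. This assumes the daily counts are independent Poisson counts and that population is a fixed denominator; the column names rate_lb and rate_ub are made up for illustration.

```python
import pandas as pd
from scipy import stats

data = {
    "09/01/2021": [74, 84, 38],
    "10/11/2021": [43, 35, 35],
    "12/01/2021": [35, 37, 16],
    "population": [23000, 69000, 48000],
}
df = pd.DataFrame(data)

scale = 100_000  # rate is expressed per 100,000 population
df["sum_of_days"] = df.iloc[:, 0:3].sum(axis=1)
df["sum_of_days_rate"] = df["sum_of_days"] / df["population"] * scale

# Exact 95% interval for the Poisson mean, based on the observed count k.
# Assumption: every count is > 0 (for k == 0 the lower bound should be set to 0).
alpha = 0.05
k = df["sum_of_days"]
count_lb = stats.chi2.ppf(alpha / 2, 2 * k) / 2
count_ub = stats.chi2.ppf(1 - alpha / 2, 2 * (k + 1)) / 2

# Scale the count bounds to the same units as sum_of_days_rate
df["rate_lb"] = count_lb / df["population"] * scale
df["rate_ub"] = count_ub / df["population"] * scale

print(df[["sum_of_days", "sum_of_days_rate", "rate_lb", "rate_ub"]])
```

Because the bounds are computed on the counts, changing the scale factor rescales the rate and both bounds by the same amount, so the interval stays consistent with the rate.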