Let's assume I have the following pandas DataFrame for my regression analysis.
import pandas
import math
import numpy
df = pandas.DataFrame(numpy.random.randint(0,100,size=(100, 2)), columns=['labels','predictions'])
I would now like to calculate the RMSE as
math.sqrt(numpy.mean((df["predictions"] - df["labels"]) ** 2))
separately for label values falling in consecutive intervals of width 7.
Below is some very ugly code that does the job. It would be nice if you could help me make it more Pythonic.
# define step
step = 7
# initialize counter
idx = 0
# initialize empty dataframe
rmse = pandas.DataFrame(columns=['bout' , 'rmse'],index=range(0,len(range(int(df['labels'].min())+step,int(df['labels'].max()),step))))
# start loop to calculate rmse every 7 units
for i in range(int(df['labels'].min())+step,int(df['labels'].max()),step):
    # select values in interval
    df_bout = df[(df['labels']>=i-step) & (df['labels']<i)]
    # calculate rmse in interval
    rmse.loc[idx] = [str(i-step)+'-'+str(i),math.sqrt(numpy.mean((df_bout.predictions - df_bout.labels) ** 2))]
    # increment counter
    idx = idx + 1
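
For comparison, here is a rough sketch of the direction I have in mind, binning the labels with `pandas.cut` and aggregating with `groupby`. The helper name `rmse_by_interval` is made up for illustration. Note that it is not an exact drop-in for the loop above: it also covers the last (possibly partial) interval up to the maximum label, and it drops intervals that contain no rows instead of reporting NaN for them.

```python
import numpy as np
import pandas as pd


def rmse_by_interval(df, step=7):
    """RMSE of predictions vs. labels within label bins of width `step`.

    Hypothetical helper: expects columns 'labels' and 'predictions'.
    """
    lo = int(df["labels"].min())
    hi = int(df["labels"].max())
    # Bin edges lo, lo+step, ... with the last edge strictly above the max label
    edges = np.arange(lo, hi + step + 1, step)
    # Human-readable bin names such as "0-7", matching the loop's 'bout' strings
    names = [f"{a}-{b}" for a, b in zip(edges[:-1], edges[1:])]
    # right=False gives half-open bins [a, b), like the >= / < test in the loop
    bout = pd.cut(df["labels"], bins=edges, right=False, labels=names)
    sq_err = (df["predictions"] - df["labels"]) ** 2
    # observed=True keeps only bins that actually contain rows
    rmse = np.sqrt(sq_err.groupby(bout, observed=True).mean())
    return rmse.rename("rmse").rename_axis("bout").reset_index()


# Same random example data as above
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 2)),
                  columns=["labels", "predictions"])
print(rmse_by_interval(df, step=7))
```

Passing `observed=False` instead should keep empty intervals in the result with NaN as their RMSE, which is closer to what the loop produces.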