4
votes

I am running multiple times a logistic regression over more than 1000 samples taken from a dataset. My question is what is the best way to show my results ? how can I plot my outputs for both the fit and the prediction curve?

This is an example of what I am doing, using the baseball dataset from R. For example I want to fit and predict the model 5 times. Each time I take one sample out (for the prediction) and use another for the fit.

library(corrgram)
data(baseball)

#Exclude rows with NA values
dataset=baseball[complete.cases(baseball),]

#Create vector replacing the Leage (A our N) by 1 or 0.
PA=rep(0,dim(dataset)[1])
PA[which(dataset[,2]=="A")]=1

#Model the player be league A in function of the Hits,Runs,Errors and Salary  
fit_glm_list=list()
prd_glm_list=list()
for (k in 1:5){
  sp=sample(seq(1:length(PA)),30,replace=FALSE)
  fit_glm<-glm(PA[sp[1:15]]~baseball$Hits[sp[1:15]]+baseball$Runs[sp[1:15]]+baseball$Errors[sp[1:15]]+baseball$Salary[sp[1:15]])    
  prd_glm<-predict(fit_glm,baseball[sp[16:30],c(6,8,20,21)])
  fit_glm_list[[k]]=fit_glm;prd_glm_list[[k]]=fit_glm
}
1
where is the "baseball" dataset?Ricardo Oliveros-Ramos
@A.R What do you mean by "show my results"? Plot the distribution of predictions? Plot the residuals? Plot other regression diagnostics? Something else?pteetor
There is some package that you loaded (and did not mention) that contains the baseball data. Maybe corrgram?Seth
@RicardoOliveros-Ramos library(corrgram)Gago-Silva
This is not logistic regression! I tried it and fit_glm$family returns "gaussian".Tomas

1 Answers

1
votes

There are a number of issues here.

  • PA is a subset of baseball$League but the model is constructed on columns from the whole baseball data frame, i.e. they do not match.
  • PA is treated as a continuous response when using the default family (gaussian), it should be changed to a factor and binomial family.
  • prd_glm_list[[k]]=fit_glm should probably be prd_glm_list[[k]]=prd_glm
  • You must save the true class labels for the predictions otherwise you have nothing to compare to.

My take on your code looks like this.

library(corrgram)
data(baseball)

dataset <- baseball[complete.cases(baseball),]

fits <- preds <- truths <- vector("list", 5)
for (k in 1:5){
  sp <- sample(nrow(dataset), 30, replace=FALSE)
  fits[[k]] <- glm(League ~ Hits + Runs + Errors + Salary,
                   family="binomial", data=dataset[sp[1:15],])    
  preds[[k]] <- predict(fits[[k]], dataset[sp[16:30],], type="response")
  truths[[k]] <- dataset$League[sp[1:15]]
}
plot(unlist(truths), unlist(preds))

The model performs poorly but at least the code runs without problems. The y-axis in the plot shows the estimated probabilities that the examples belong to league N, i.e. ideally the left box should be close to 0 and the right close to 1.

enter image description here