I'm working on a model to predict the probability that college baseball players will make the major leagues. My dataset has 633 observations and 13 predictors with a binary response. The code below generates smaller reproducible examples of training and testing datasets:
set.seed(1)
OBP <- rnorm(50, mean=1, sd=.2)
HR.PCT <- rnorm(50, mean=1, sd=.2)
AGE <- rnorm(50, mean=21, sd=1)
CONF <- sample(c("A","B","C","D","E"), size=50, replace=TRUE)
CONF <- factor(CONF, levels=c("A","B","C","D","E"))
df.train <- data.frame(OBP, HR.PCT, AGE, CONF)
df.train <- df.train[order(-OBP),]
df.train$MADE.MAJORS <- 0
df.train$MADE.MAJORS[1:10] <- 1
OBP <- rnorm(10, mean=1, sd=.2)
HR.PCT <- rnorm(10, mean=1, sd=.2)
AGE <- rnorm(10, mean=21, sd=1)
CONF <- sample(c("A","B","C","D","E"), size=10, replace=TRUE)
CONF <- factor(CONF, levels=c("A","B","C","D","E"))
MADE.MAJORS <- sample(0:1, size=10, replace=TRUE, prob=c(0.8,0.2))
df.test <- data.frame(OBP, HR.PCT, AGE, CONF, MADE.MAJORS)
I then used glmnet to perform the lasso with logistic regression and generate predictions. I want the predictions to be in the form of probabilities (that is, between 0 and 1).
library(glmnet)
train.mtx <- with(df.train, model.matrix(MADE.MAJORS ~ OBP + HR.PCT + AGE + CONF)[,-1])
glmmod <- glmnet(x=train.mtx, y=as.factor(df.train$MADE.MAJORS), alpha=1, family="binomial")
cv.glmmod <- cv.glmnet(x=train.mtx, y=df.train$MADE.MAJORS, alpha=1)
test.mtx <- with(df.test, model.matrix(MADE.MAJORS ~ OBP + HR.PCT + AGE + CONF)[,-1])
preds <- predict.glmnet(object=glmmod, newx=test.mtx, s=cv.glmmod$lambda.min, type="response")
cv.preds <- predict.cv.glmnet(object=cv.glmmod, newx=test.mtx, s="lambda.min")
Here are the predictions:
> preds
1
1 -3.2589440
2 -0.4435265
3 3.9646670
4 0.3772816
5 0.9952887
6 -7.3555661
7 0.2283675
8 -2.3871317
9 -8.1632749
10 -1.3563051
> cv.preds
1
1 0.1568839
2 0.3630938
3 0.7435941
4 0.4808428
5 0.5261076
6 -0.1431655
7 0.4123054
8 0.2207381
9 -0.1446941
10 0.2962391
I have a few questions about these results. Feel free to answer any or all (or none) of them. I'm most interested in an answer for the first question.
Why are the predictions from
predict.glmnet(thepredsvector) not in the form of probabilities? I put thepredsvalues through the inverse logit function and got reasonable probabilities. Was that correct?The predictions from
predict.cv.glmnet(thecv.predsvector) mostly look like probabilities, but some of them are negative. Why is this?When I use the
glmnetfunction to create the glmmod object, I include thefamily="binomial"argument to indicate that I'm using logistic regression. However, when I use thecv.glmnetfunction to find the best value for lambda, I'm not able to specify logistic regression. Am I actually getting the best value for lambda if the cross-validation doesn't use logistic regression?Similarly, when I use the
predict.cv.glmnetfunction, I'm not able to specify logistic regression. Does this function produce the predictions that I want?
predictto provide probabilities, and how to ensure thatcv.glmnetuses alogitlink function, which are quintessentially coding questions. - eipi10