I tried implementing logistic regression using glm in R on the Wisconsin breast cancer dataset. I analysed the dataset and found that wbc$V7 contained missing values. I imputed the missing values using the Hmisc package and then fitted a logistic regression with glm:
wbc=read.csv(file="https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data",header=FALSE)
wbc[wbc=='?']=NA #replacing '?' with NA
a=sapply(wbc,function(x) sum(is.na(x))) #analyse the number of NA in each column
print(a)
library(Hmisc)
wbc$V7=impute(wbc$V7,mode) #impute missing values with mode in V7
wbc$V11[wbc$V11==2]=0; #V11 has either '2' or '4' as entries, replacing '2' with '0' and '4' with '1'
wbc$V11[wbc$V11==4]=1;
model <- glm(V11~V2+V3+V4+V5+V6+V7+V8+V9+V10,family=binomial(),data=wbc)
OUTPUT:
Call:  glm(formula = V11 ~ V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10,
    family = binomial(), data = wbc)
Coefficients:
(Intercept)           V2           V3           V4           V5           V6
     8.6625       0.4511      -0.1013       0.4842       0.2206       0.1684
        V71         V710          V72          V73          V74          V75
   -18.7466     -14.8168     -17.6684     -16.0272     -15.3552     -16.3765
        V76          V77          V78          V79           V8           V9
     0.7704     -16.2944     -16.6171           NA       0.5052       0.1144
        V10
     0.4550
Degrees of Freedom: 698 Total (i.e. Null); 681 Residual
Null Deviance: 900.5
Residual Deviance: 102.9 AIC: 138.9
Why does the output contain coefficients for V71, V710, V72, V73, V74, V75, V76, V77, V78 and V79 when the wbc data frame only has the columns V1 through V11?
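One possible way to investigate is to check how R stores V7. The sketch below is not the real dataset: `toy`, `mm_factor`, `v7_num` and `mm_numeric` are hypothetical names, and `stringsAsFactors = TRUE` is an assumption standing in for how older versions of read.csv treat a column that contains '?'. It shows that a factor column is expanded into one coefficient per level (named column + level, e.g. V710), whereas a numeric column gets a single coefficient.

```r
# Toy stand-in for wbc (assumption): V7 holds text because of the "?" entry
toy <- data.frame(V7 = c("1", "10", "?", "2", "10", "1"),
                  stringsAsFactors = TRUE)

toy[toy == "?"] <- NA        # same replacement as in the question

print(class(toy$V7))         # the column is a factor, not a number
print(levels(toy$V7))        # "?" is still listed as a level

# A factor predictor becomes one dummy column per non-baseline level
mm_factor <- model.matrix(~ V7, data = droplevels(toy))
print(colnames(mm_factor))

# Converting via character gives a numeric predictor with a single column
v7_num <- as.numeric(as.character(toy$V7))
mm_numeric <- model.matrix(~ v7_num)
print(colnames(mm_numeric))
```

If class(wbc$V7) reports "factor" or "character" on the real data, that would be consistent with the V71 ... V79 names in the glm output above.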