
I have data with continuous and categorical variables; the response variable is 1 or 0:

> str(test3)
'data.frame':   690 obs. of  7 variables:
 $ A1 : Factor w/ 3 levels "?","a","b": 3 2 2 3 3 3 3 2 3 3 ...
 $ A2 : num  30.8 58.7 24.5 27.8 20.2 ...
 $ A3 : num  0 4.46 0.5 1.54 5.62 ...
 $ A4 : Factor w/ 4 levels "?","l","u","y": 3 3 3 3 3 3 3 3 4 4 ...
 $ A8 : num  1.25 3.04 1.5 3.75 1.71 ...
 $ A11: int  1 6 0 5 0 0 0 0 0 0 ...
 $ A16: num  1 1 1 1 1 1 1 1 1 1 ...

How can I plot the model? Should I separate the categorical and continuous variables? I have tried this:

    mod3 <- glm(A16 ~ ., data = credit, family = binomial)
    mod3$coefficients
    summary(mod3)

But I received this error:

glm.fit: fitted probabilities numerically 0 or 1 occurred 


head(test3, n=30)
   A1    A2     A3 A4     A8 A11 A16
1   b 30.83  0.000  u  1.250   1   1
2   a 58.67  4.460  u  3.040   6   1
3   a 24.50  0.500  u  1.500   0   1
4   b 27.83  1.540  u  3.750   5   1
5   b 20.17  5.625  u  1.710   0   1
6   b 32.08  4.000  u  2.500   0   1
7   b 33.17  1.040  u  6.500   0   1
8   a 22.92 11.585  u  0.040   0   1
9   b 54.42  0.500  y  3.960   0   1
10  b 42.50  4.915  y  3.165   0   1
11  b 22.08  0.830  u  2.165   0   1
12  b 29.92  1.835  u  4.335   0   1
13  a 38.25  6.000  u  1.000   0   1
14  b 48.08  6.040  u  0.040   0   1
15  a 45.83 10.500  u  5.000   7   1
16  b 36.67  4.415  y  0.250  10   1
17  b 28.25  0.875  u  0.960   3   1
18  a 23.25  5.875  u  3.170  10   1
19  b 21.83  0.250  u  0.665   0   1
20  a 19.17  8.585  u  0.750   7   1
21  b 25.00 11.250  u  2.500  17   1
22  b 23.25  1.000  u  0.835   0   1
23  a 47.75  8.000  u  7.875   6   1
24  a 27.42 14.500  u  3.085   1   1
25  a 41.17  6.500  u  0.500   3   1
26  a 15.83  0.585  u  1.500   2   1
27  a 47.00 13.000  u  5.165   9   1
28  b 56.58 18.500  u 15.000  17   1
29  b 57.42  8.500  u  7.000   3   1
30  b 42.08  1.040  u  5.000   6   1
So your question isn't really about plotting; it is about why your model is failing. Could you please share a sample of your data beyond the structure, for example dput(head(test3)) or dput(head(credit)), whichever you're actually using? - Chuck P
Sure, please see below: dput(head(test3)) structure(list(A1 = structure(c(3L, 2L, 2L, 3L, 3L, 3L), .Label = c("?", "a", "b"), class = "factor"), A2 = c(30.83, 58.67, 24.5, 27.83, 20.17, 32.08), A3 = c(0, 4.46, 0.5, 1.54, 5.625, 4), A4 = structure(c(3L, 3L, 3L, 3L, 3L, 3L), .Label = c("?", "l", "u", "y"), class = "factor"), A8 = c(1.25, 3.04, 1.5, 3.75, 1.71, 2.5), A11 = c(1L, 6L, 0L, 5L, 0L, 0L), A16 = c(1, 1, 1, 1, 1, 1)), row.names = c(NA, 6L), class = "data.frame") - codeforfun
Thank you. That's probably not enough; I should have asked for head 30. In the meantime, do you mean to have factor levels called "?" in variables A1 and A4? Is variable A11 really an integer? - Chuck P
I have added the head 30 as a question edit. The "?" values are NAs that are going to be filled. - codeforfun
Dave2e, that is only the head with n=30; there are more observations, including some with a response of 0. - codeforfun

1 Answer


Without a look at your full dataset, I'm perplexed. I'm suspicious of the question marks as factor levels, but none of the other oddities seem to matter. I mocked up a similar data set, and the model runs fine with or without na.omit.
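Since the comments say the "?" levels stand for missing values, here is a minimal sketch of one way to turn them into real NAs before fitting. The data frame `toy` is a hypothetical stand-in for the real data, not the asker's actual file.

```r
# Hypothetical stand-in for the asker's data: "?" is used as a missing-value code
toy <- data.frame(
  A1 = factor(c("b", "a", "?", "b")),
  A4 = factor(c("u", "?", "y", "u"))
)

# Find the factor columns, set "?" entries to NA, then drop the unused "?" level
is_fac <- vapply(toy, is.factor, logical(1))
toy[is_fac] <- lapply(toy[is_fac], function(f) {
  f[f %in% "?"] <- NA
  droplevels(f)
})

str(toy)
```

If the data come from a text file, passing na.strings = "?" to read.csv() would probably achieve the same thing at import time.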

The short answer is no, you don't have to do anything special to tell glm() the variable types.

set.seed(2020)
# Mock predictors that mimic the structure of the question's data
A1 <- factor(sample(letters[1:3], size = 100, replace = TRUE))
A2 <- runif(100, min = 20, max = 70)
A3 <- runif(100, min = 0, max = 10)
A4 <- factor(sample(c("l", "u", "y", "x"), size = 100, replace = TRUE))
A8 <- runif(100, min = 0, max = 20)
A11 <- sample(0:20, size = 100, replace = TRUE)
# Binary response, roughly 90% ones
A16 <- as.numeric(sample(0:1, size = 100, replace = TRUE, prob = c(.1, .9)))
credit <- data.frame(A1, A2, A3, A4, A8, A11, A16)
str(credit)
#> 'data.frame':    100 obs. of  7 variables:
#>  $ A1 : Factor w/ 3 levels "a","b","c": 3 2 1 1 2 2 1 1 2 2 ...
#>  $ A2 : num  38.8 54.1 29.1 23.3 32 ...
#>  $ A3 : num  0.118 2.288 0.986 3.363 5.745 ...
#>  $ A4 : Factor w/ 4 levels "l","u","x","y": 4 2 2 2 2 3 2 2 2 3 ...
#>  $ A8 : num  8.85 17.94 4.42 2.88 14.77 ...
#>  $ A11: int  4 2 13 2 20 18 20 20 9 18 ...
#>  $ A16: num  1 1 1 1 1 1 0 1 1 1 ...
mod3 <- glm(A16 ~ ., data = credit, family = binomial, na.action = na.omit)
mod3
#> 
#> Call:  glm(formula = A16 ~ ., family = binomial, data = credit, na.action = na.omit)
#> 
#> Coefficients:
#> (Intercept)          A1b          A1c           A2           A3          A4u  
#>     0.37850     -0.49031     -0.52429      0.02990      0.07271      1.08706  
#>         A4x          A4y           A8          A11  
#>     1.05172      0.38511     -0.00192     -0.02511  
#> 
#> Degrees of Freedom: 99 Total (i.e. Null);  90 Residual
#> Null Deviance:       69.3 
#> Residual Deviance: 65.55     AIC: 85.55
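
The mock-up above does not reproduce the message from the question. One common cause of "fitted probabilities numerically 0 or 1 occurred" is (quasi-)complete separation, where some predictor or combination of predictors splits the 0s from the 1s perfectly. Whether that is what happens in the real data is an assumption I can't check without the full dataset. A minimal sketch with made-up data (`x`, `y`, and `msgs` are hypothetical names) that triggers the same message:

```r
# Toy data: y is 1 exactly when x > 0, so x separates the two classes perfectly
x <- c(-5:-1, 1:5)
y <- as.numeric(x > 0)

# Fit the model and collect any warnings instead of printing them
msgs <- character(0)
fit <- withCallingHandlers(
  glm(y ~ x, family = binomial),
  warning = function(cnd) {
    msgs <<- c(msgs, conditionMessage(cnd))
    invokeRestart("muffleWarning")
  }
)
msgs

# A cross-tabulation with empty off-diagonal cells is a quick hint of separation
table(x > 0, y)
```

On the real data, a similar cross-tabulation of each factor against A16 (for example table(test3$A4, test3$A16)) may show a level that only ever occurs with one outcome.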