Include NAs when using group_by in dplyr

Question

I am using dplyr to summarise some data, grouping by two factors. The problem is that not every level of the second factor occurs within each level of the first, and my summarised data frame omits the combinations that have no rows.

I want to include an na.rm=FALSE statement (I think), but this isn't working.

I've also tried using mutate to include all levels of the factor, but that isn't working either.

Here is my code with the mutate included:

Dataframe <- UKData %>%
  filter(!is.na(REGION)) %>%
  group_by(REGION, EMPSIZE) %>%
  summarise(NumberofEmployers = length(Employers)) %>%
  mutate(EMPSIZE = factor(EMPSIZE, levels = z)) %>%
  arrange(REGION, EMPSIZE)

So the issue is that not every region has all employer sizes. The employer size band contains 7 levels. I want the table to show NA where a region doesn't have a particular size band. Is this possible?

UPDATE:

So the data will look something like this:

Employers     REGION    EMPSIZE
Number 1    Scotland    1-4
Number 2    Scotland    5-49
Number 3    Scotland    50-499
Number 4    Scotland    500-999
Number 5    Scotland    1000-4999
Number 6    Scotland    5000+
Number 7    Scotland    50-499
Number 8    North West  5-49
Number 9    North West  1000-4999
Number 10   Yorkshire   5000+
Number 11   Yorkshire   50-499
Number 12   Yorkshire   5-49
Number 13   London      1-4
Number 14   London      5-49
Number 15   London      50-499
Number 16   London      500-999
Number 17   London      1000-4999
Number 18   London      5000+
Number 19   East        50-499
Number 20   East        1000-4999

So only Scotland and London have all 6 possible size bands; the other regions do not. The table I want should look like this:

REGION    EMPSIZE       number
Scotland    1-4             1
Scotland    5-49            1
Scotland    50-499          2
Scotland    500-999         1
Scotland    1000-4999       1
Scotland    5000+           1
North West  1-4             NA
North West  5-49            1
North West  50-499          NA
North West  500-999         NA
North West  1000-4999       1
North West  5000+           NA
Yorkshire   1-4             NA
Yorkshire   5-49            1
Yorkshire   50-499          1
Yorkshire   500-999         NA
Yorkshire   1000-4999       NA
Yorkshire   5000+           1
London      1-4             1
London      5-49            1
London      50-499          1
London      500-999         1
London      1000-4999       1
London      5000+           1
East        1-4             NA
East        5-49            NA
East        50-499          1
East        500-999         NA
East        1000-4999       1
East        5000+           NA

In hindsight, perhaps I don't care whether they are NA or 0. I do want every level shown in the table, though.

It's not entirely clear what you mean. NA is implicitly allowed in any collection of factors (I believe ... rbind(data.frame(b=letters[1:3]), data.frame(b=NA_character_)) works), but since we don't know what UKData looks like, it's hard to do much. Can you provide a representative sample of it? Refs: stackoverflow.com/questions/5963269, stackoverflow.com/help/mcve, and stackoverflow.com/tags/r/info. — r2evans
So the idea is that you want to show every possible combination of REGION and EMPSIZE, even if it doesn't appear in the data, right? Take a look at the answers to these questions: stackoverflow.com/questions/32247211/fill-in-missing-rows-in-r, stackoverflow.com/questions/43233682/… — divibisan
Bingo - that second link you provided has given me precisely what I'm after. as.data.frame(xtabs(Value ~ Group + Date, DF), responseName = "Value") — Nottles82
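
A minimal sketch of how that xtabs idiom might be adapted to the sample in the question, where rows are counted rather than a Value column summed. The data frame below is rebuilt from the question's UPDATE, and the names size_levels and counts are made up for illustration. Note that xtabs fills the missing combinations with 0, not NA.

```r
# Sample rows copied from the question's UPDATE (the real UKData is not shown).
UKData <- data.frame(
  Employers = paste("Number", 1:20),
  REGION = c(rep("Scotland", 7), rep("North West", 2), rep("Yorkshire", 3),
             rep("London", 6), rep("East", 2)),
  EMPSIZE = c("1-4", "5-49", "50-499", "500-999", "1000-4999", "5000+", "50-499",
              "5-49", "1000-4999",
              "5000+", "50-499", "5-49",
              "1-4", "5-49", "50-499", "500-999", "1000-4999", "5000+",
              "50-499", "1000-4999"),
  stringsAsFactors = FALSE
)

# Assumed ordering of the size bands; factors keep levels that have no rows.
size_levels <- c("1-4", "5-49", "50-499", "500-999", "1000-4999", "5000+")
UKData$EMPSIZE <- factor(UKData$EMPSIZE, levels = size_levels)
UKData$REGION <- factor(UKData$REGION, levels = unique(UKData$REGION))

# A one-sided formula makes xtabs count rows per REGION x EMPSIZE cell;
# as.data.frame() then returns one row per cell, including the empty ones.
counts <- as.data.frame(xtabs(~ REGION + EMPSIZE, data = UKData),
                        responseName = "NumberofEmployers")
counts <- counts[order(counts$REGION, counts$EMPSIZE), ]
print(counts)
```

With the 5 regions and 6 size bands in the sample, this gives one row for each of the 30 combinations.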

Cathryn Beeson-Lynch · Accepted Answer · 2018-10-15T18:44:11

If a region doesn't have a given EmpSize, why would you want to include it in your data frame? If I were you, I would leave the rows with missing data out: you're not "losing" any information by removing missing values if your goal is to display EmpSize, and the NAs only make the table harder to read. (Also, 0 is not the same as NA, so replacing NAs with 0s might not be a good idea.)
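
For readers who do want every level listed, as the question asks, here is a minimal sketch that keeps the NA versus 0 distinction explicit. It assumes the tidyr package is available alongside dplyr and uses tidyr's complete(), which is not part of the original code. The data frame is rebuilt from the question's sample, and size_levels, with_na and with_zero are illustrative names.

```r
library(dplyr)
library(tidyr)

# Sample rows copied from the question's UPDATE (the real UKData is not shown).
size_levels <- c("1-4", "5-49", "50-499", "500-999", "1000-4999", "5000+")
UKData <- data.frame(
  Employers = paste("Number", 1:20),
  REGION = c(rep("Scotland", 7), rep("North West", 2), rep("Yorkshire", 3),
             rep("London", 6), rep("East", 2)),
  EMPSIZE = factor(
    c("1-4", "5-49", "50-499", "500-999", "1000-4999", "5000+", "50-499",
      "5-49", "1000-4999",
      "5000+", "50-499", "5-49",
      "1-4", "5-49", "50-499", "500-999", "1000-4999", "5000+",
      "50-499", "1000-4999"),
    levels = size_levels
  ),
  stringsAsFactors = FALSE
)

# count() returns an ungrouped result, so complete() expands across all
# regions; absent REGION/EMPSIZE combinations are added with NA counts.
with_na <- UKData %>%
  filter(!is.na(REGION)) %>%
  count(REGION, EMPSIZE, name = "NumberofEmployers") %>%
  complete(REGION, EMPSIZE) %>%
  arrange(REGION, EMPSIZE)

# Same expansion, but absent combinations are reported as zero employers.
with_zero <- UKData %>%
  filter(!is.na(REGION)) %>%
  count(REGION, EMPSIZE, name = "NumberofEmployers") %>%
  complete(REGION, EMPSIZE, fill = list(NumberofEmployers = 0L)) %>%
  arrange(REGION, EMPSIZE)
```

Which fill to use depends on what an absent row means: 0 says "no employers of that size were observed", whereas NA says "unknown". Because REGION is a plain character column here, arrange() sorts it alphabetically; converting it to a factor would control the order.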
