3
votes

I couldn't find any similar question although I doubt that this has not been posted before. My question is related to Calculate using dplyr, percentage of NA'S in each column.


In a dataset whith are multiple observations per subject, it is useful to not only calculate the total number of missing data entries (i.e. total NA's per column) but also how many subjects have missing data of some sort.

For instance, in dataset db (see below) item_1 is missing for 2 subjects and item_2 is missing for 1 subject.

Edit 1: What I am interested in is how many subjects have (any) missing value per item. Even if in item_2 there are two missing observations for subject number 1, this should be counted as 1 since it is still the same subject.

library("dplyr")

db <- data.frame(
  subject = c(1, 1, 1, 2),
  item_1 = c(NA, 2, 3, NA),
  item_2 = c(1, NA, NA, 4)
)
db
#>   subject item_1 item_2
#> 1       1     NA      1
#> 2       1      2     NA
#> 3       1      3     NA
#> 4       2     NA      4

My approach so far was to cbind all single calculations into one new data.frame but this quickly gets messy (with more columns) and is surely not well coded.

Edit 1: However, this shows the desired values, as item_1 is missing for two subjects (1 and 2) and item_2 is only missing for 1 subject (subject 2).

cbind(
  db %>%
    filter(is.na(item_1)) %>%
    summarise(na_item_1 = n_distinct(subject)),
  db %>%
    filter(is.na(item_2)) %>%
    summarise(na_item_2 = n_distinct(subject))
)
#>   na_item_1 na_item_2
#> 1         2         1

Question: is there an approach in dplyr to calculate this?

Ideally, I would also like to add the proportion of missings somewhere (like in the example below):

data.frame(
  type = c("n", "proportion"),
  na_item_1 = c(2, 1.0),
  na_item_2 = c(1, 0.5)
)
#>         type na_item_1 na_item_2
#> 1          n       2.0       1.0
#> 2 proportion       1.0       0.5

Created on 2019-04-16 by the reprex package (v0.2.1)

Thanks in advance!

2
Why does item_2 have 1 missing value?NelsonGon
It is about the subject and not the values. What I mean is that item_2 has missing values for 1 subject (subject number 2). I will try to make this clearer in the post.Frederick
Thanks for pointing out @RonakShah. This is wrong. It should be 1.0 as 2 out of 2 subjects have missings in item_1. Corrected in question.Frederick

2 Answers

3
votes

Another dplyr version is to first group_by subject and find out the group which has any NA value, then group_by column and calculate total value of NAs for n and divide it by total unique values of subject to get prop.

library(dplyr)
library(tidyr)

db %>%
  group_by(subject) %>%
  summarise_all(~any(is.na(.))) %>%
  ungroup() %>%
  select(-subject) %>%
  gather() %>%
  group_by(key) %>%
  summarise(n = sum(value), 
            prop = n/n_distinct(db$subject))

#   key       n  prop
#   <chr>  <int> <dbl>
#1 item_1     2   1  
#2 item_2     1   0.5
1
votes

A different tidyverse possibility to assess the number of NA per item and per ID could be:

db %>%
 gather(var, val, -subject) %>%
 group_by(var, subject) %>%
 summarise(val = sum(is.na(val))) %>%
 spread(var, val)

  subject item_1 item_2
    <dbl>  <int>  <int>
1       1      1      2
2       2      1      0

Or if you want the overall number of NA and the proportion of NA per ID:

db %>%
 gather(var, val, -subject) %>%
 group_by(subject) %>%
 summarise(count = sum(is.na(val)),
           proportion = sum(is.na(val))/n())

  subject count proportion
    <dbl> <int>      <dbl>
1       1     3        0.5
2       2     1        0.5

Or if you want the count and proportion of NA just per item:

bind_rows(db %>%
 select(-subject) %>%
 gather(var, val) %>%
 group_by(var) %>%
 summarise(val = sum(is.na(val))) %>%
 spread(var, val) %>%
 mutate(type = "count"),
 db %>%
 select(-subject) %>%
 gather(var, val) %>%
 group_by(var) %>%
 summarise(val = sum(is.na(val))/n()) %>%
 spread(var, val) %>%
 mutate(type = "proportion"))

  item_1 item_2 type      
   <dbl>  <dbl> <chr>     
1    2      2   count     
2    0.5    0.5 proportion

Or if you want the number and proportion of unique subjects with NA per item:

bind_rows(db %>%
 gather(var, val, -subject) %>%
 filter(is.na(val)) %>%
 group_by(var) %>%
 summarise(val = n_distinct(subject)) %>%
 spread(var, val) %>%
 mutate(type = "count"),
 db %>%
 gather(var, val, -subject) %>%
 group_by(var) %>%
 mutate(n = n_distinct(subject)) %>%
 filter(is.na(val)) %>%
 group_by(var) %>%
 summarise(val = first(n_distinct(subject)/n)) %>%
 spread(var, val) %>%
 mutate(type = "proportion"))

  item_1 item_2 type      
   <dbl>  <dbl> <chr>     
1      2    1   count     
2      1    0.5 proportion