I couldn't find any similar question although I doubt that this has not been posted before. My question is related to Calculate using dplyr, percentage of NA'S in each column.
In a dataset whith are multiple observations per subject, it is useful to not only calculate the total number of missing data entries (i.e. total NA's per column) but also how many subjects have missing data of some sort.
For instance, in dataset db
(see below) item_1
is missing for 2 subjects and item_2
is missing for 1 subject.
Edit 1: What I am interested in is how many subjects have (any) missing value per item. Even if in
item_2
there are two missing observations for subject number 1, this should be counted as 1 since it is still the same subject.
library("dplyr")
db <- data.frame(
subject = c(1, 1, 1, 2),
item_1 = c(NA, 2, 3, NA),
item_2 = c(1, NA, NA, 4)
)
db
#> subject item_1 item_2
#> 1 1 NA 1
#> 2 1 2 NA
#> 3 1 3 NA
#> 4 2 NA 4
My approach so far was to cbind
all single calculations into one new data.frame
but this quickly gets messy (with more columns) and is surely not well coded.
Edit 1: However, this shows the desired values, as
item_1
is missing for two subjects (1 and 2) and item_2 is only missing for 1 subject (subject 2).
cbind(
db %>%
filter(is.na(item_1)) %>%
summarise(na_item_1 = n_distinct(subject)),
db %>%
filter(is.na(item_2)) %>%
summarise(na_item_2 = n_distinct(subject))
)
#> na_item_1 na_item_2
#> 1 2 1
Question: is there an approach in dplyr
to calculate this?
Ideally, I would also like to add the proportion of missings somewhere (like in the example below):
data.frame(
type = c("n", "proportion"),
na_item_1 = c(2, 1.0),
na_item_2 = c(1, 0.5)
)
#> type na_item_1 na_item_2
#> 1 n 2.0 1.0
#> 2 proportion 1.0 0.5
Created on 2019-04-16 by the reprex package (v0.2.1)
Thanks in advance!
item_1
. Corrected in question. – Frederick