I have a huge vector which has a couple of NA
values, and I'm trying to find the max value in that vector (the vector is all numbers), but I can't do this because of the NA
values.
How can I remove the NA
values so that I can compute the max?
Trying ?max
, you'll see that it actually has a na.rm =
argument, set by default to FALSE
. (That's the common default for many other R functions, including sum()
, mean()
, etc.)
Setting na.rm=TRUE
does just what you're asking for:
d <- c(1, 100, NA, 10)
max(d, na.rm=TRUE)
If you do want to remove all of the NA
s, use this idiom instead:
d <- d[!is.na(d)]
A final note: Other functions (e.g. table()
, lm()
, and sort()
) have NA
-related arguments that use different names (and offer different options). So if NA
's cause you problems in a function call, it's worth checking for a built-in solution among the function's arguments. I've found there's usually one already there.
Use discard
from purrr (works with lists and vectors).
discard(v, is.na)
The benefit is that it is easy to use pipes; alternatively use the built-in subsetting function [
:
v %>% discard(is.na)
v %>% `[`(!is.na(.))
Note that na.omit
does not work on lists:
> x <- list(a=1, b=2, c=NA)
> na.omit(x)
$a
[1] 1
$b
[1] 2
$c
[1] NA
Just in case someone new to R wants a simplified answer to the original question
How can I remove NA values from a vector?
Here it is:
Assume you have a vector foo
as follows:
foo = c(1:10, NA, 20:30)
running length(foo)
gives 22.
nona_foo = foo[!is.na(foo)]
length(nona_foo)
is 21, because the NA values have been removed.
Remember is.na(foo)
returns a boolean matrix, so indexing foo
with the opposite of this value will give you all the elements which are not NA.
I ran a quick benchmark comparing the two base
approaches and it turns out that x[!is.na(x)]
is faster than na.omit
. User qwr
suggested I try purrr::dicard
also - this turned out to be massively slower (though I'll happily take comments on my implementation & test!)
microbenchmark::microbenchmark(
purrr::map(airquality,function(x) {x[!is.na(x)]}),
purrr::map(airquality,na.omit),
purrr::map(airquality, ~purrr::discard(.x, .p = is.na)),
times = 1e6)
Unit: microseconds
expr min lq mean median uq max neval cld
purrr::map(airquality, function(x) { x[!is.na(x)] }) 66.8 75.9 130.5643 86.2 131.80 541125.5 1e+06 a
purrr::map(airquality, na.omit) 95.7 107.4 185.5108 129.3 190.50 534795.5 1e+06 b
purrr::map(airquality, ~purrr::discard(.x, .p = is.na)) 3391.7 3648.6 5615.8965 4079.7 6486.45 1121975.4 1e+06 c
For reference, here's the original test of x[!is.na(x)]
vs na.omit
:
microbenchmark::microbenchmark(
purrr::map(airquality,function(x) {x[!is.na(x)]}),
purrr::map(airquality,na.omit),
times = 1000000)
Unit: microseconds
expr min lq mean median uq max neval cld
map(airquality, function(x) { x[!is.na(x)] }) 53.0 56.6 86.48231 58.1 64.8 414195.2 1e+06 a
map(airquality, na.omit) 85.3 90.4 134.49964 92.5 104.9 348352.8 1e+06 b