
Using simulations, I want to test and demonstrate the effects of "censored" data, where some cases are unavailable to us or have values outside the measurement range of our instruments.

Here, I want to label cases as "observed" or "unobserved" based on the rank score of a numeric variable.

My attempts so far confuse tables with element values, and I don't know what to try next. I'm sure it will be head-smackingly simple when I see some suggestions.

## generate some data
n_rows <- 20

x <- rnorm(n_rows)
status <- rep("unobserved", n_rows)
data <- data.frame(x, status)

library(dplyr)

## how many observed cases?
n_observed <- 5


## Failure #1
data$status[data$x == dplyr::top_n(data$x, n_observed)] <- "observed"

#> Error in UseMethod("tbl_vars"): no applicable method for 'tbl_vars' applied to an object of class "c('double', 'numeric')"


## Failure #2
data$status[which((data$x == dplyr::top_n(data, x, n_observed)))] <- "observed"

#> Warning in if (n > 0) {: the condition has length > 1 and only the first element will be used


## Failure #3
data$status[top_n(data, x, n_observed) %in% data] <- "observed"

#> Warning in if (n > 0) {: the condition has length > 1 and only the first element will be used

Comment from Mikael Jagan: How about data$status[rank(data$x) <= 5] <- "observed" or data$status[rank(-data$x) <= 5] <- "observed" (depending on your desired ordering)?

1 Answer


If you want ranks, then use rank! Here are two examples: status1 marks the five lowest-ranked (smallest) values of x as "observed", and status2 marks the five highest-ranked (largest).

data <- data.frame(x = sample(20), status1 = "unobserved", status2 = "unobserved")
data$status1[rank(data$x)  <= 5] <- "observed"
data$status2[rank(-data$x) <= 5] <- "observed"
data
#>     x    status1    status2
#> 1   2   observed unobserved
#> 2  11 unobserved unobserved
#> 3   3   observed unobserved
#> 4   4   observed unobserved
#> 5  14 unobserved unobserved
#> 6  15 unobserved unobserved
#> 7   1   observed unobserved
#> 8   8 unobserved unobserved
#> 9   7 unobserved unobserved
#> 10 20 unobserved   observed
#> 11 13 unobserved unobserved
#> 12 16 unobserved   observed
#> 13  9 unobserved unobserved
#> 14 10 unobserved unobserved
#> 15 17 unobserved   observed
#> 16  5   observed unobserved
#> 17 18 unobserved   observed
#> 18 19 unobserved   observed
#> 19 12 unobserved unobserved
#> 20  6 unobserved unobserved

You'll have to be slightly more careful if you expect x to contain duplicates. rank has an optional argument ties.method that you can use to specify behaviour in that case.
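To make the point about ties concrete, here is a minimal sketch on a made-up vector (the values and the cutoff of 2 are illustrative assumptions, not taken from the question). It compares three of the ties.method options that base R's rank accepts and counts how many cases each one would label "observed".

```r
# Toy vector with a tie (the two 2s); values are illustrative only
x <- c(3, 1, 2, 2, 5)
n_observed <- 2

# Default ties.method = "average": tied values share the mean rank (2.5)
rank(x)

# "min": tied values share the lowest rank, so both can pass the cutoff
rank(x, ties.method = "min")

# "first": ties are broken by position, so every rank is distinct
rank(x, ties.method = "first")

# Number of cases that would be labelled "observed" under each method
n_labelled <- sapply(
  c(average = "average", min = "min", first = "first"),
  function(m) sum(rank(x, ties.method = m) <= n_observed)
)
n_labelled
```

With this vector the default labels only one case, "min" labels three, and "first" labels exactly two. If the simulation needs exactly n_observed cases every time, "first" (or "random") is probably the safer choice.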

FYI, dplyr::top_n is intended for "data frame in, data frame out". It should not be used for indexing.
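For completeness, here is a hedged sketch of how the dplyr functions could be used in the "data frame in, data frame out" style. It assumes dplyr 1.0.0 or later is installed; the seed and the object names top_rows, top_rows2 and labelled are made up for illustration. Note that top_n takes the count before the ordering variable, so the warnings in Failures #2 and #3 appear to come from x being passed where the count was expected.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
n_observed <- 5
data <- data.frame(x = rnorm(20))

# Data frame in, data frame out: keep only the n_observed largest x.
# Argument order matters: the count comes before the ordering variable.
top_rows <- top_n(data, n_observed, x)

# Same idea with slice_max(), which supersedes top_n() in dplyr >= 1.0.0
top_rows2 <- slice_max(data, x, n = n_observed)

# To label rows instead of filtering them, rank inside mutate()
labelled <- mutate(
  data,
  status = ifelse(min_rank(desc(x)) <= n_observed, "observed", "unobserved")
)
table(labelled$status)
```

The first two calls drop the unobserved rows entirely, whereas the mutate() version keeps all 20 rows and only adds the label, which is closer to what the question asks for.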