I need to clean up data that was collected with a Likert scale. That is, each observation comes from a person who chose one option on an ordinal scale, such as "on a scale of 1-5, where 1 means awful and 5 means wonderful, how would you rate your liking of eggplants?"
Thus, a typical dataset will look like this:
library(tibble)
set.seed(123)
df_a <-
tibble(name = c("clara", "john", "michelle", "dan", 'timothy', "cindy", "george", "monica", "david", "rebecca"),
response = sample(1:5, 10, replace = TRUE))
name response
<chr> <int>
1 clara 3
2 john 3
3 michelle 2
4 dan 2
5 timothy 3
6 cindy 5
7 george 4
8 monica 1
9 david 2
10 rebecca 3
My task is to test whether the data is indeed Likert-scale data, meaning that (1) the values are integers, and (2) the unique values, once sorted, are consecutive.
- Testing whether all are integers can be done by
all((df_a$response - round(df_a$response)) == 0) ## https://stackoverflow.com/a/10114038/6105259
[1] TRUE
- Testing whether unique values are consecutive [actually I don't know how to do this, but my problem doesn't end here].
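One possible way to approach the consecutiveness check, as a minimal sketch: sort the unique values and confirm that every step between neighbours is exactly 1. The helper name `is_consecutive` is hypothetical, and the sketch assumes the input is already numeric.

```r
# Hypothetical helper: TRUE when the sorted unique values increase by exactly 1.
# sort() silently drops NA values, so missing entries do not affect the result.
is_consecutive <- function(x) {
  u <- sort(unique(x))
  length(u) > 0 && all(diff(u) == 1)
}

is_consecutive(c(3, 3, 2, 2, 3, 5, 4, 1, 2, 3))  # unique values are 1 2 3 4 5
is_consecutive(c(1, 3, 4, 5))                    # 2 is missing
```

The first call should pass and the second should fail, because `diff()` reports a step of 2 between 1 and 3.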
My real problem is that a Likert scale can come in different variations, and that other strings might show up in the data, adding noise:
- A valid Likert scale could span different ranges, for example 1-5, 0-3, or 1-10.
- Many times there will be additional strings such as "irrelevant", "I don't know", "I don't think so", "not applicable to me", and so on. I cannot anticipate which such strings will be present in the data, if any at all.
Under such circumstances, I need to detect whether my data is likely to come from a Likert scale.
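Since the noise strings cannot be listed in advance, one option is to treat everything that does not parse as a number as noise. The sketch below assumes that is acceptable; the vector `x` is made-up illustration data. Note that `as.numeric()` also accepts inputs such as "1e3" or "Inf", so a stricter pattern check may be needed for real data.

```r
# Made-up example: responses stored as character, mixed with noise strings.
x <- c("1", "3", "irrelevant", "2", "I don't know", "abcd34", "5")

# as.numeric() yields NA (with a warning) for anything it cannot parse.
num <- suppressWarnings(as.numeric(x))
is_noise <- is.na(num)

mean(is_noise)   # share of entries treated as noise
num[!is_noise]   # numeric part, ready for the integer and range checks
```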
Criteria to decide that data is Likert scale:
- numeric values are integers.
- when we take the unique values, they are consecutive (in the sense that sort(unique(df_a$response)) returns 1 2 3 4 5. If it had returned 1 3 4 5 then it would have failed the "consecutiveness" criterion).
- the smallest value in the range is either 0 or 1.
- the greatest value is at most 10.
- noise strings that aren't numeric (such as "I don't know", "abcd34", "irrelevant") account for less than 50% of the data.
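The criteria above could be combined into a single check along these lines. This is only a sketch: `is_likert`, `max_value`, and `max_noise` are hypothetical names, the upper bound of 10 is assumed to be a maximum rather than a required value, and anything `as.numeric()` cannot parse is treated as noise.

```r
# Hypothetical helper; the defaults mirror the criteria listed above.
is_likert <- function(x, max_value = 10, max_noise = 0.5) {
  # Non-numeric entries become NA and are counted as noise.
  num <- suppressWarnings(as.numeric(as.character(x)))
  noise <- is.na(num)
  # Reject when there is no numeric data or the noise share is too high.
  if (all(noise) || mean(noise) >= max_noise) return(FALSE)
  vals <- sort(unique(num[!noise]))
  all(vals == round(vals)) &&          # integers only
    all(diff(vals) == 1) &&            # unique values are consecutive
    vals[1] %in% c(0, 1) &&            # scale starts at 0 or 1
    vals[length(vals)] <= max_value    # assumed upper bound
}

is_likert(c(1:5, "irrelevant"))      # small noise share, values 1..5
is_likert(c(1, 3, 4, 5))             # gap at 2
is_likert(c("a", "b", "c", 1, 2))    # noise is 60% of the data
is_likert(1:30)                      # range too wide
```

Only the first call should return TRUE. The thresholds are arguments so that a different maximum or noise tolerance can be tried without touching the function body.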
Below are 4 examples to demonstrate possible types of data and what I expect should happen when testing them for whether they're "Likert" or not.
In the examples I use stringi::stri_rand_strings to simulate the "noise" strings (e.g., "I don't know", "irrelevant", and the other examples I gave above).
Example 1 -- testing for "is likert scale" should return TRUE
library(stringi)
set.seed(19)
val_begin <- sample(0:1, 1)
val_end <- sample(3:10, 1)
my_seq <- seq(from = val_begin, to = val_end)
additional_strings <- stri_rand_strings(n = 2, length = 5, pattern = "[A-Za-z0-9]")
vec_example_1 <- sample(c(my_seq, additional_strings), size = 100 , replace = TRUE)
barplot(prop.table(table(vec_example_1)), main = "vec example 1")
Example 2 -- testing for "is likert scale" should return FALSE
In the following data, numbers are not consecutive
set.seed(19)
my_seq_2 <- sort(c(seq(0,4), seq(7, 9)))
additional_strings_2 <- stri_rand_strings(n = 2, length = 5, pattern = "[A-Za-z0-9]")
vec_example_2 <- sample(c(my_seq_2, additional_strings_2), size = 100 , replace = TRUE)
barplot(prop.table(table(vec_example_2)), main = "vec example 2")
Example 3 -- testing for "is likert scale" should return FALSE
In the following data, the "additional strings" account for more than 50% of data, making it unlikely that the core of data is likert scale
set.seed(19)
vec_example_3 <- sample(c(rep(additional_strings, 70), sample(my_seq, 30, replace = TRUE)))
barplot(prop.table(table(vec_example_3)), main = "vec example 3")
Example 4 -- testing for "is likert scale" should return FALSE
Just random numbers and strings, with no reason to believe this is a Likert scale. Even though the unique values happen to be consecutive, a range of 1 to 30 is simply unlikely to be Likert.
set.seed(19)
vec_example_4 <- sample(c(1:30, additional_strings), 1000, replace = TRUE)
barplot(prop.table(table(vec_example_4)), main = "vec example 4")
What I'm asking
I assume that a full solution would be pretty lengthy, so maybe it's too much to ask from people here. I will be happy with even just tips, a general approach, or ideas on how to tackle this.
anyDuplicated(rle(x)$values) == 0
will be TRUE if every unique value is in a single run. – G. Grothendieck