I am trying to extract the first, second, third, etc. word from the end of a string. stringr::word() can do this given the string and the position you want (a minus sign counts from the end of the string). I am trying to do this over a potentially long list of variable-length strings (i.e. I don't know the length of each string in advance). When stringr::word() hits a string that is shorter than the position I want to extract (where I would expect an NA), it halts my while loop with an error message. How can I ignore this and move on to the next string?

Here is an example: word("yum just made fresh", -5)

Output:

[1] NA
Warning messages:
1: In stri_sub(string, from = start, to = end) :
  argument is not an atomic vector; coercing
2: In stri_sub(string, from = start, to = end) :
  argument is not an atomic vector; coercing

And for some reason this code:

word("ifkoalasshadarealityshow cake", -5)

will yield this

output: [1] "ifkoalasshadarealityshow"

even though the default separator is a space.

Here is my loop as the counter is increasing:

Subset part of the data

x <- c("would be really into in", "demands the return of the", "", "tomato sugar free lemonada is", "thoughts of eating a piece of", "ifkolalashadarealityshow cake", "yum just made fresh", "ever had a")

Extract last word (not a problem)

word(x, -1) 
#[1] "in"    "the"   ""      "is"    "of"    "cake"  "fresh" "a"

Extract second to last word (warning, but usable output)

word(x, -2)

[1] "into"                     "of"                       NA                         "lemonada"                 "piece"                   
[6] "ifkolalashadarealityshow" "made"                     "had"

Warning messages:
1: In stri_sub(string, from = start, to = end) :
  argument is not an atomic vector; coercing
2: In stri_sub(string, from = start, to = end) :
  argument is not an atomic vector; coercing

Similar with the third and fourth to last words (warning, but usable output)

word(x, -3)

[1] "really" "return" NA       "free"   "a"      NA       "just"   "ever" 

Warning messages:
1: In stri_sub(string, from = start, to = end) :
  argument is not an atomic vector; coercing
2: In stri_sub(string, from = start, to = end) :
  argument is not an atomic vector; coercing

 word(x, -4)
[1] "be"     "the"    ""       "sugar"  "eating" "cake"   "yum"    NA     

Warning messages:
1: In stri_sub(string, from = start, to = end) :
  argument is not an atomic vector; coercing
2: In stri_sub(string, from = start, to = end) :
  argument is not an atomic vector; coercing

The fifth to last raises an error and halts the loop

 word(x, -5)

Error in stri_sub(string, from = start, to = end) :
  (list) object cannot be coerced to type 'integer'
In addition: Warning message:
In stri_sub(string, from = start, to = end) :
  argument is not an atomic vector; coercing

At the fifth iteration, the loop stops. I would like to bypass any errors and continue processing all the data.

Thanks for reading and any tips.

1 Answer


You can use str_count to count the number of spaces in each string (adding 1 gives the word count), then use that to select only the elements of x with >= 5 words:

library(stringr)

word(x[str_count(x, ' ') + 1 >= 5], -5)

#[1] "would"   "demands" "tomato"  "of" 

Or if you want to keep the NAs

good <- str_count(x, ' ') + 1 >= 5
replace(rep(NA, length(x)), which(good), word(x[good], -5))

[1] "would"   "demands" NA        "tomato"  "of"      NA        NA        NA

or

library(tidyverse)

map_chr(x, ~ if (str_count(.x, ' ') + 1 >= 5) word(.x, -5) else NA_character_)

[1] "would"   "demands" NA        "tomato"  "of"      NA        NA        NA
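
If you would rather not depend on how word() handles short strings at all, the same idea can be sketched in base R by splitting each string on spaces and indexing from the end. This is only a minimal sketch: nth_from_end is a hypothetical helper name (not part of stringr), and it assumes words are separated by single spaces, as in the example data.

```r
# Hypothetical helper (not a stringr function): return the n-th word from the
# end of each string, or NA when a string has fewer than n words.
nth_from_end <- function(strings, n) {
  # Split on single literal spaces; an empty string yields zero pieces.
  pieces <- strsplit(strings, " ", fixed = TRUE)
  vapply(pieces, function(p) {
    if (length(p) >= n) p[length(p) - n + 1] else NA_character_
  }, character(1))
}

x <- c("would be really into in", "demands the return of the", "",
       "tomato sugar free lemonada is", "thoughts of eating a piece of",
       "ifkolalashadarealityshow cake", "yum just made fresh", "ever had a")

nth_from_end(x, 5)
# [1] "would"   "demands" NA        "tomato"  "of"      NA        NA        NA
```

Because short strings simply give NA, this can be called inside a loop over n without any iteration stopping. Note that, unlike word(x, -1), it returns NA rather than "" for an empty string.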