I am trying to extract a series of words from a set of .txt documents with the 'str_extract_all' function from stringr. Everything works well except that the extracted words are cut off at non-ASCII characters such as accented letters, even though those characters display correctly in the UTF-8 texts the words are extracted from. Does anybody know why this is happening?
[I am using RStudio on Windows 10.1]
I have converted my corpus of 5 .txt documents (novels) to a data frame through the following command:
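library(tidyverse)  # loads stringr, purrr, readr and tibble
# Read every .txt file in the working directory into a one-column tibble
tbl <- list.files(pattern = "\\.txt$") %>%
  map_chr(read_file) %>%
  tibble(text = .)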
Accented characters look fine in 'tbl', but when I run the str_extract_all function, they disappear from the output. Here is my code:
uppercase <- sapply(str_extract_all(tbl, '(?<!^|\\.\\s|\\?\\s|\\!\\s)[A-Z][a-z]+'), paste)
This is the result I get:
[1,] "For"
[2,] "Ant"
[3,] "Pati"
etc.
When it should read:
[1,] "For"
[2,] "Antón"
[3,] "Patiño"
etc.
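The truncation can probably be reproduced without the corpus. Below is a minimal sketch on a single made-up string: the sample text and variable names are assumptions, not taken from the novels, and the lookbehind is dropped for brevity. The second call swaps in the Unicode property classes `\p{Lu}` and `\p{Ll}` purely for comparison, to show whether the ASCII-only ranges `[A-Z]` and `[a-z]` are what stops the match.

```r
library(stringr)

# Hypothetical sample; \u escapes keep the source file encoding out of the picture
sample_text <- "Luego Ant\u00f3n Pati\u00f1o"

# Same character classes as in the question: ASCII ranges only
ascii_matches <- str_extract_all(sample_text, "[A-Z][a-z]+")[[1]]
print(ascii_matches)
# "Luego" "Ant" "Pati"

# Comparison: Unicode-aware upper/lower case letter classes
unicode_matches <- str_extract_all(sample_text, "\\p{Lu}\\p{Ll}+")[[1]]
print(unicode_matches)
# "Luego" "Antón" "Patiño"
```

If the second call returns the full names on the real data as well, the pattern rather than the file encoding would be the likely culprit.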
Is this a stringr bug, or has anybody experienced anything similar before? Any help will be much appreciated. Thank you!