I am trying to extract a series of words from a set of .txt documents with stringr's 'str_extract_all' function. Everything works well, except that the extracted words are cut off at non-ASCII characters such as accented letters (which display correctly in the UTF-8 source texts). Does anybody know why this is happening?


[I am using RStudio on Windows 10.1]

I have converted my corpus of 5 .txt documents (novels) to a data frame through the following command:

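library(tidyverse)  # loads purrr (map_chr), readr (read_file), tibble, and %>%

# Read every .txt file in the working directory into a one-column tibble
tbl <- list.files(pattern = "\\.txt$") %>%
    map_chr(read_file) %>%
    tibble(text = .)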

Unicode characters look fine on 'tbl', but when I run the str_extract_all function, they disappear. Here is my code:

uppercase <- sapply(str_extract_all(tbl, '(?<!^|\\.\\s|\\?\\s|\\!\\s)[A-Z][a-z]+'), paste)

This is the result I get:

[1,] "For"       
[2,] "Ant"       
[3,] "Pati"      

etc.

When it should read:

[1,] "For"       
[2,] "Antón"       
[3,] "Patiño"      

etc.

Is this a stringr bug, or has anybody experienced anything similar before? Any help will be much appreciated. Thank you!

1 Answer


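Unfortunately, the character class [A-z] (and the narrower [A-Z] and [a-z]) only covers ASCII characters, so it does not match accented letters such as ñ and ó. (Note that [A-z] also matches the punctuation characters that sit between Z and a in ASCII, such as _ and ^.) [[:alpha:]] (alphabetic characters), on the other hand, does match them.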

stringr::str_extract_all(c("Antón", "Patiño"), '[A-z]+')

returns:

[[1]]
[1] "Ant" "n"  

[[2]]
[1] "Pati" "o"   

whereas

stringr::str_extract_all(c("Antón", "Patiño"), '[[:alpha:]]+')

returns the desired outcome:

[[1]]
[1] "Antón"

[[2]]
[1] "Patiño"
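
Applied to the pattern from the question, one possible fix is to swap [A-Z] and [a-z] for the Unicode properties \p{Lu} (uppercase letter) and \p{Ll} (lowercase letter), which stringr's regex engine supports. The following is only a sketch: the sample sentence and the variable names are made up for illustration, and it assumes the accented letters are stored as single precomposed characters. It also passes a character vector (for the question's data, that would be tbl$text) rather than the whole data frame.

```r
library(stringr)

# Hypothetical sample standing in for the text of one novel
texts <- "Ayer vino Antón. Luego habló con Patiño."

# Same lookbehind as in the question, but with Unicode-aware letter classes:
# \p{Lu} = any uppercase letter, \p{Ll} = any lowercase letter
pattern <- "(?<!^|\\.\\s|\\?\\s|\\!\\s)\\p{Lu}\\p{Ll}+"

hits <- str_extract_all(texts, pattern)
print(hits[[1]])  # capitalised words that do not start a sentence

# Side note: [A-z] is not "all letters"; it also matches ASCII punctuation
# such as "_", so this whole string comes back as a single match
ascii_range <- str_extract_all("snake_case", "[A-z]+")
print(ascii_range[[1]])
```

With this sample, the first call should return "Antón" and "Patiño" whole, while skipping the sentence-initial "Ayer" and "Luego".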