13
votes

I'm using R 2.15.0 on Windows 7 64-bit. I would like to output unicode (CJK) text to a file.

The following code shows how a Unicode character sent to write on a UTF-8 file connection does not work as (I) expected:

rty <- file("test.txt",encoding="UTF-8")
write("在", file=rty)
close(rty)
rty <- file("test.txt",encoding="UTF-8")
scan(rty,what=character())
close(rty)

As shown by the output of scan:

Read 1 item 
[1] "<U+5728>"

The file was not written with the UTF-8 character itself, but with some kind of ANSI-compliant fallback. Can I make it work right the first time (i.e. produce a text file that has "在" in it), or can I work some extra magic to convert the output to Unicode, with the proper character replacing the code string?

Thanks.

[More info: the same code behaves properly in Cygwin, R 2.14.2, while 2.14.2 on Win7 is also broken. Is this on my end somewhere?]

5
[Belated update] The issues tend to be with locale rather than encoding. I have resolved gibberish-output issues by temporarily changing the locale to something "appropriate". God help you if you have language data from more than one locale. – Patrick
Maybe this post will help. – DJJ
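The temporary locale switch mentioned in the comment above can be sketched like this (a minimal sketch; the locale names are assumptions and differ by platform, so the actual switch is left commented out):

```r
# Save the current character-type locale so it can be restored afterwards.
old_locale <- Sys.getlocale("LC_CTYPE")

# Hypothetical example: switch to a locale matching the text's language.
# On Windows this might be "Chinese"; on Linux, e.g. "zh_CN.UTF-8".
# Sys.setlocale("LC_CTYPE", "Chinese")

# ... write or print the CJK text here ...

# Restore the original locale so other code is unaffected.
Sys.setlocale("LC_CTYPE", old_locale)
```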

5 Answers

22
votes

The problem is due to some Windows-specific behaviour in R (it re-encodes output to the default system code page, or uses system write functions that do; I do not know the specifics, but the behaviour is well known).

To write UTF-8-encoded text on Windows, one has to use the useBytes=TRUE option in writeLines, and read it back with encoding="UTF-8" in readLines:

txt <- "在"
writeLines(txt, "test.txt", useBytes=TRUE)

readLines("test.txt", encoding="UTF-8")
[1] "在"

Kevin Ushey has written a really good article that goes into much more detail: http://kevinushey.github.io/blog/2018/02/21/string-encoding-and-r/
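The approach above can be wrapped in a small round-trip helper (a sketch; the name write_utf8 is my own, not part of base R):

```r
# Write character data as UTF-8 bytes, bypassing R's re-encoding on Windows.
write_utf8 <- function(text, path) {
  con <- file(path, open = "w")
  on.exit(close(con))
  # enc2utf8() converts to UTF-8; useBytes = TRUE writes those bytes verbatim.
  writeLines(enc2utf8(text), con, useBytes = TRUE)
}

# Round trip: the file should contain the literal character, not "<U+5728>".
write_utf8("在", "test.txt")
readLines("test.txt", encoding = "UTF-8")
```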

8
votes

This saves UTF-8 strings to a text file:

kLogFileName <- "parser.log"

log <- function(msg = "") {
  con <- file(kLogFileName, "a")  # open in append mode
  tryCatch({
    # Convert the message to UTF-8 before writing.
    cat(iconv(msg, to = "UTF-8"), file = con, sep = "\n")
  },
  finally = {
    close(con)  # always release the connection
  })
}
7
votes

For anyone coming upon this question later, see the stringi package (https://cran.r-project.org/web/packages/stringi/index.html). It includes numerous functions to enable consistent, cross-platform UTF-8 string support in R. Most relevant to this thread, the stri_read_lines(), stri_read_raw(), and stri_write_lines() functions can consistently input/output UTF-8, even on Windows.

0
votes

I think you are having problems because write is constructed so that it takes the name of an object, and you do not appear to have built such a named object. Try this instead:

txt <- "在"
rty <- file("test.txt", encoding="UTF-8")
write(txt, file=rty)
close(rty)
rty <- file("test.txt", encoding="UTF-8")
inp <- scan(rty, what=character())
#Read 1 item
close(rty)
inp
#[1] "在"
0
votes

I had this problem with UTF-8 strings coming from a database.

The only way I found to save them properly is to write the file in binary mode.

F <- file(file.name, "wb")
tryCatch({
  writeBin(charToRaw(the_utf8_str), F)
},
finally = {
  close(F)
})