0
votes

I used R and DBI to query data from a PostgreSQL but got corrupted results with weird characters. I check both server encoding and client encoding, and they are all UTF-8. Here's what I did

  1. Connect to the db

    library(DBI)

    conn <- dbConnect( drv = RPostgreSQL::PostgreSQL(), host = *, dbname = , user = , password = * )

  2. Fetch the result

    result_df <- dbGetQuery( conn, "select description from some_table where user_id = 1234" )

  3. Result What I got is ã¯ã˜ã‚ã¾ã—ã¦1008(ã¨ãƒ¼ã‚„orã¨ã…ーや)ã¨ã„ã„ã¾ã™ã€‚ 表ç¾ã™ã‚‹ã®ãŒå¤§å¥½ããªå­ã§ã™!! 感性ãŒäººã¨ã‚ºãƒ¬ã¦ã¾ã™(笑) ãŠæ­Œã‚’ã‚‚ã£ã¨ã†ã¾ããªã‚ŠãŸã„ã§ã™ã€‚ while the actual text is はじめまして1008(とーやorとぅーや)といいます。 表現するのが大好きな子です!! 感性が人とズレてます(笑) お歌をもっとうまくなりたいです。

I test again by reading the text and then print it back

x <- 'はじめまして1008(とーやorとぅーや)といいます。 表現するのが大好きな子です!! 感性が人とズレてます(笑) お歌をもっとうまくなりたいです。'
print(x)
>> はじめまして1008(とーやorとぅーや)といいます。 表現するのが大好きな子です!! 感性が人とズレてます(笑) お歌をもっとうまくなりたいです。

There's nothing wrong with the encoding in this case, so I think the problem happened when the data is returned from the database into R, but I don't know how to fix it.

1
How does it look when you do the query with psql from R host machine?Love Tätting
it looked like this ã¯ã˜ã‚ã¾ã—ã¦1008(ã¨ãƒ¼ã‚„orã¨ã…ーや)ã¨ã„ã„ã¾ã™ã€‚ 表ç¾ã™ã‚‹ã®ãŒå¤§å¥½ããªå­ã§ã™!! 感性ãŒäººã¨ã‚ºãƒ¬ã¦ã¾ã™(笑) ãŠæ­Œã‚’ã‚‚ã£ã¨ã†ã¾ããªã‚ŠãŸã„ã§ã™ã€Tue Nguyen
So it does seem to have nothing to do with R. What does psql -l show and SHOW client_encoding;. Do you have any conversion going on in pg_conversion?Love Tätting
When I start a new databse with default en_US.utf8 and paste your text to a new table with a text column, it works and pg_dump shows your text correctly as do select. What locale do you have on your machines? They might interfere. I have sv_SE.UTF-8 on my machine.Love Tätting

1 Answers

1
votes


RPostgreSQL does not set string encodings properly, see this issue.

So in the meantime you have to declare the encoding yourself. With dplyr, you can do it like this:

suppressPackageStartupMessages({
  library(dplyr)
  library(RPostgreSQL)
})

my_con <- 
  dbConnect(
    PostgreSQL(),
    user = my_username,
    password = my_password,
    host = my_host,
    dbname = my_dbname
  )

my_table <- tribble(
  ~user_id,    ~description,
  "1234",     "はじめまして1008(とーやorとぅーや)といいます。 表現するのが大好きな子です!! 感性が人とズレてます(笑) お歌をもっとうまくなりたいです。"
)

copy_to(my_con, my_table)

result_df <- dbGetQuery(my_con, "SELECT description FROM my_table WHERE user_id = '1234'" )

declare_utf8 <- function(x) {
  Encoding(x) <- "UTF-8"
  x
}

result_df %>% mutate_if(is.character, declare_utf8) %>% pull()
#> [1] "はじめまして1008(とーやorとぅーや)といいます。 表現するのが大好きな子です!! 感性が人とズレてます(笑) お歌をもっとうまくなりたいです。"

dbDisconnect(my_con)
#> [1] TRUE

pull() is not necessary here. I just use it because your description would not print right as a data.frame() on my machine due to another character encoding bug, so I am pulling it out as a vector.