8
votes

The function url_parse is very fast and works fine most of the time. However, domain names can now contain non-ASCII characters, for example:

url <- "www.cordes-tiefkühlprodukte.de"

Now if I apply url_parse to this url, I get the escaped byte "<fc>" instead of "ü" in the domain column:

url_parse(url)
  scheme                            domain port path parameter fragment
1   <NA> www.cordes-tiefk<fc>hlprodukte.de <NA> <NA>      <NA>     <NA>

My question is: How can I "fix" this entry to UTF-8? I tried iconv and some functions from the stringi package, but with no success.

(I am aware of httr::parse_url, which does not have this problem. So one approach would be to detect the URLs that are not ASCII, then use url_parse on the ASCII URLs and httr::parse_url on the few special cases. However, this raises the problem of detecting the non-ASCII URLs efficiently.)
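One possible way to do that detection in base R is sketched below. The helper name is_non_ascii is made up for illustration and is not part of urltools or httr; the sketch assumes that "non-ASCII" simply means "contains at least one byte above 0x7F".

```r
# Hypothetical helper (not from urltools or httr): TRUE for every element
# that contains at least one byte outside the 7-bit ASCII range.
is_non_ascii <- function(x) {
  # useBytes = TRUE makes the regex match raw bytes, so the result does not
  # depend on how (or whether) the encoding of each string is declared.
  grepl("[^\\x01-\\x7F]", x, perl = TRUE, useBytes = TRUE)
}

# "\u00fc" is "ü", written as an escape so the script itself stays ASCII.
urls <- c("www.example.com", "www.cordes-tiefk\u00fchlprodukte.de")
non_ascii <- is_non_ascii(urls)
non_ascii
```

The resulting logical vector could then be used to send urls[!non_ascii] to url_parse and urls[non_ascii] to httr::parse_url, as described above.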

EDIT: Unfortunately, url1 <- URLencode(enc2utf8(url)) does not help. When I do

robotstxt::paths_allowed(
    url1, 
    domain=urltools::suffix_extract(urltools::domain(url1))
)

I get the error "could not resolve host". However, if I plug in the original URL and the second-level domain by hand, paths_allowed works.

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] urltools_1.7.3 fortunes_1.5-4

loaded via a namespace (and not attached):
[1] compiler_3.6.1  Rcpp_1.0.1      triebeard_0.3.0

2
I don't get the special character you showed when I run url <- "www.cordes-tiefkühlprodukte.de"; urltools::url_parse(url). The domain column shows www.cordes-tiefkühlprodukte.de, which is the same as url. My R version is 3.5.2 and packageVersion("urltools") is ‘1.7.3’. You might want to update the post with your sessionInfo(). - Ronak Shah
Maybe it is a problem with my locale? - Karsten W.
I also get the special character. Does this solve your problem: URLencode(enc2utf8(url))? - Ben Nutzer

2 Answers

4
votes

I could reproduce the issue. I could convert the column domain to UTF-8 by reading it with readr::parse_character and latin1 encoding:

library(urltools)
library(tidyverse)

url <- "www.cordes-tiefkühlprodukte.de"

parts <- 
  url_parse(url) %>% 
  mutate(domain = parse_character(domain, locale = locale(encoding = "latin1")))

parts

  scheme                         domain port path parameter fragment
1   <NA> www.cordes-tiefkühlprodukte.de <NA> <NA>      <NA>     <NA>

I guess that the encoding you have to specify (here latin1) depends only on your locale and not on the url's special characters, but I'm not 100% sure about that.
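For comparison, a base-R sketch of the same idea (reinterpreting the bytes as Latin-1 and converting them to UTF-8) is shown below. It only simulates the symptom with hand-built bytes; that the domain column returned by url_parse holds Latin-1/CP1252 bytes is an assumption based on the locale in the question, and the question reports no luck with iconv, so this is not verified against url_parse itself.

```r
# Simulate the symptom: the bytes of "kühl" in Latin-1, where "ü" is the
# single byte 0xFC (the <fc> shown in the question's output).
raw_bytes <- as.raw(c(0x6b, 0xfc, 0x68, 0x6c))
broken <- rawToChar(raw_bytes)

# Treat the bytes as Latin-1 (assumption) and convert them to UTF-8.
fixed <- iconv(broken, from = "latin1", to = "UTF-8")

Encoding(fixed)   # declared encoding of the converted string
utf8ToInt(fixed)  # code points of the result; 252 is U+00FC ("ü")
```

If the source encoding is something other than Latin-1, the from argument would need to change accordingly.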

1
votes

Just for reference, another method that works fine for me is:

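library(stringi)
url <- "www.cordes-tiefkühlprodukte.de"
# Escape non-ASCII characters (e.g. "ü" becomes "\u00fc") so the parser only sees ASCII
url <- stri_escape_unicode(url)
# urltools exports url_parse; parse_url is the httr function
dat <- urltools::url_parse(url)
# Turn the escapes back into the original characters, column by column
for (cn in colnames(dat)) dat[, cn] <- stri_unescape_unicode(dat[, cn])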