The function url_parse from the urltools package is very fast and works fine most of the time. However, domain names may now contain non-ASCII characters, for example
url <- "www.cordes-tiefkühlprodukte.de"
Now if I apply url_parse to this URL, I get a special character "<fc>" in the domain column:
url_parse(url)
  scheme                            domain port path parameter fragment
1   <NA> www.cordes-tiefk<fc>hlprodukte.de <NA> <NA>      <NA>     <NA>
My question is: how can I "fix" this entry to UTF-8? I tried iconv and some functions from the stringi package, but with no success.
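For illustration: the "<fc>" is the single byte 0xFC, which is "ü" in Latin-1/CP1252 but not valid UTF-8 on its own. One plausible reading is that the bytes come back in the native encoding without a matching encoding mark. The following is only a sketch under that assumption; the variable names are made up, and the hand-built string stands in for the value in the domain column.

```r
# Build the problematic string byte by byte: "tiefk" + 0xFC + "hl".
# 0xFC is "ü" in Latin-1/CP1252 but is not valid UTF-8 on its own.
broken <- rawToChar(c(charToRaw("tiefk"), as.raw(0xfc), charToRaw("hl")))

# Assumption: the bytes really are Latin-1. Declare that, then convert.
fixed <- broken
Encoding(fixed) <- "latin1"
fixed <- enc2utf8(fixed)

# Equivalent one-step conversion with iconv().
fixed_iconv <- iconv(broken, from = "latin1", to = "UTF-8")

# Both results now hold the two-byte UTF-8 sequence C3 BC for "ü".
print(charToRaw(fixed))
```

If this assumption holds for the real data, the same two steps could be applied to the domain column of the parsed result. Converting the input with enc2utf8 before parsing may be another option, but I have not verified that here.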
(I am aware of httr::parse_url, which does not have this problem. So one approach would be to detect the URLs that are not ASCII, and use url_parse on the ASCII ones and parse_url on the few special cases. However, this raises the problem of detecting the non-ASCII URLs efficiently.)
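The detection step could be sketched in base R as follows. The helper has_non_ascii and the example address www.example.com are hypothetical names introduced for illustration only; the idea is simply to look for any byte above 0x7F.

```r
# TRUE for every element that contains at least one byte above 0x7F.
# Working on raw bytes sidesteps locale settings and encoding marks.
has_non_ascii <- function(x) {
  vapply(
    x,
    function(s) any(as.integer(charToRaw(s)) > 127L),
    logical(1),
    USE.NAMES = FALSE
  )
}

urls <- c("www.example.com", "www.cordes-tiefk\u00fchlprodukte.de")
flag <- has_non_ascii(urls)

ascii_urls   <- urls[!flag]  # candidates for the fast parser
special_urls <- urls[flag]   # candidates for the slower fallback

print(flag)
```

With this split, ascii_urls could go to url_parse and special_urls to parse_url. If stringi is installed, stri_enc_isascii should give an equivalent vectorised check and is probably faster on long vectors.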
EDIT: Unfortunately, url1 <- URLencode(enc2utf8(url)) does not help. When I do
robotstxt::paths_allowed(
url1,
domain=urltools::suffix_extract(urltools::domain(url1))
)
I get the error "could not resolve host". However, when I plug in the original URL and the second-level domain by hand, paths_allowed works.
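A likely explanation, sketched here rather than confirmed: URLencode percent-encodes the umlaut, but host names with non-ASCII characters are normally looked up in their Punycode form ("xn--..." labels), not in percent-encoded form. The snippet below shows what URLencode produces and, assuming urltools is installed and exports puny_encode, what the Punycode form looks like.

```r
url <- "www.cordes-tiefk\u00fchlprodukte.de"

# Percent-encoding turns the umlaut into %C3%BC, which is fine in a path
# or query string but is not a form DNS can look up as a host name.
url1 <- utils::URLencode(enc2utf8(url))
print(url1)

# Internationalised host names are normally sent as Punycode.
# Assumption: urltools is installed and exports puny_encode();
# the call is skipped otherwise.
if (requireNamespace("urltools", quietly = TRUE)) {
  print(urltools::puny_encode(enc2utf8(url)))
}
```

If the Punycode form resolves, it might be the better value to pass on to paths_allowed, but that is untested here.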
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)
Matrix products: default
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
other attached packages:
[1] urltools_1.7.3 fortunes_1.5-4
loaded via a namespace (and not attached):
[1] compiler_3.6.1  Rcpp_1.0.1      triebeard_0.3.0
url <- "www.cordes-tiefkühlprodukte.de"; urltools::url_parse(url). The domain column shows as www.cordes-tiefkühlprodukte.de, which is the same as url. My R version is R version 3.5.2 and packageVersion("urltools") is ‘1.7.3’. You might want to update the post with your sessionInfo() – Ronak Shah
URLencode(enc2utf8(url))? – Ben Nutzer