0
votes

I am trying to crawl seed URLs that are http/https, but for a few https URLs I get the error below:

FetcherThread INFO api.HttpRobotRulesParser (168) - Couldn't get robots.txt for https://corporate.douglas.de/investors/?lang=en: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

On the other hand, https://www.integrafin.co.uk/annual-reports/ is crawled perfectly fine.

Below is my configuration:

plugin.includes protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor|more|static|links)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|urlmeta|language-identifier


2 Answers

0
votes

I think you need to import the certificate of the server https://corporate.douglas.de/investors/?lang=en into the "cacerts" truststore of the JVM that runs your code.

First, download the certificate using Chrome: click the padlock icon in the address bar and open the certificate viewer.

Then, click the "Details" tab and then the "Copy to File" button.

In the export wizard, select the option "DER binary... (.CER)".

Now you can use the tool portecle (http://portecle.sourceforge.net/) to add the certificate to the cacerts file of your JVM, following these steps: http://portecle.sourceforge.net/import-trusted-cert.html

Hope this works for you.
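If you prefer the command line over portecle, the same import can be done with the standard openssl and keytool tools. This is only a sketch: the alias douglas and the file name douglas.cer are arbitrary names I chose, the cacerts path and the changeit password are the JDK defaults, and on JDK 8 the truststore lives under $JAVA_HOME/jre/lib/security/cacerts instead.

```shell
# Fetch the server's leaf certificate in DER form (hostname taken from the question)
openssl s_client -connect corporate.douglas.de:443 \
        -servername corporate.douglas.de </dev/null \
  | openssl x509 -outform DER -out douglas.cer

# Import it into the JVM's default truststore ("changeit" is the stock password)
keytool -importcert -noprompt -alias douglas \
        -file douglas.cer \
        -keystore "$JAVA_HOME/lib/security/cacerts" \
        -storepass changeit
```

Make sure you run keytool against the same JVM that Nutch actually uses, otherwise the crawler will keep reading the old truststore.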

0
votes

You could try using a more recent version of Nutch, or build directly from master, and then try the http.tls.certificates.check setting from https://github.com/apache/nutch/pull/388. This essentially allows you to skip TLS/SSL certificate verification.
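For reference, enabling that would look roughly like this in conf/nutch-site.xml. This is a sketch based on the property name in that pull request; check the Nutch version you build for the exact name and default value.

```xml
<!-- conf/nutch-site.xml -->
<property>
  <name>http.tls.certificates.check</name>
  <!-- false = do not validate TLS/SSL certificates (accepts self-signed or
       otherwise untrusted chains, as in the PKIX error from the question) -->
  <value>false</value>
  <description>Whether to verify TLS/SSL certificates when fetching https URLs.</description>
</property>
```

Note that skipping verification removes protection against man-in-the-middle attacks, so importing the certificate into the truststore (as in the other answer) is the safer fix.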