Crawling the Deep Web with Nutch 2.3.1

Question

I followed the tutorial from the Nutch Wiki, "SetupNutchAndTor" (https://wiki.apache.org/nutch/SetupNutchAndTor), and set up nutch-site.xml as follows:

  <property>
    <name>http.proxy.host</name>
    <value>127.0.0.1</value>
    <description>The proxy hostname. If empty, no proxy is used.</description>
  </property>

  <property>
    <name>http.proxy.port</name>
    <value>8118</value>
    <description>The proxy port.</description>
  </property>

However, nothing is crawled from the .onion link and nothing is indexed into Solr. Does anyone know what the problem is?

Julien Nioche · Accepted Answer · 2018-02-09T18:20:12

Anything in the logs?

FYI, with StormCrawler you can use a SOCKS proxy directly thanks to this commit.

You'd need to use OkHttp for the protocol implementation and configure it like this:

http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"
https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"

http.proxy.host: localhost
http.proxy.port: 9050
http.proxy.type: "SOCKS"
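
Alongside checking the logs, it may help to confirm that something is actually listening on the proxy ports used in this thread. The following is only a rough diagnostic sketch, not part of Nutch or StormCrawler: the helper name is_port_open is made up for illustration, and it assumes the HTTP proxy from the question runs on 127.0.0.1:8118 and the SOCKS proxy on 127.0.0.1:9050 (Tor's default SOCKS port). A successful TCP connect only shows that the port is open, not that .onion pages can be fetched through it.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it checks reachability only and
    does not speak HTTP or SOCKS to the proxy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 8118 is the HTTP proxy port from the question's nutch-site.xml;
    # 9050 is the SOCKS port from the StormCrawler configuration.
    for name, port in (("HTTP proxy", 8118), ("SOCKS proxy", 9050)):
        state = "listening" if is_port_open("127.0.0.1", port) else "not reachable"
        print(f"{name} on 127.0.0.1:{port}: {state}")
```

If either port shows as not reachable, the crawler cannot use that proxy regardless of how nutch-site.xml or the StormCrawler configuration is set up.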
