I have Nutch 1.10 installed, configured, and working with the crawl script, and I'm trying to upgrade to Nutch 1.13. I can't get the bin/crawl script to run under v1.13.

This command worked with v1.10:

bin/crawl -i -D elastic.server.url=http://localhost:9300/search-index/ urls/ searchcrawl/ 2

However, when I run the same command with v1.13, it only prints the usage message:

Usage: crawl [-i|--index] [-D "key=value"] [-w|--wait] [-s <Seed Dir>] <Crawl Dir> <Num Rounds>
-i|--index  Indexes crawl results into a configured indexer
-D      A Java property to pass to Nutch calls
-w|--wait   NUMBER[SUFFIX] Time to wait before generating a new segment when no URLs
        are scheduled for fetching. Suffix can be: s for second,
        m for minute, h for hour and d for day. If no suffix is
        specified second is used by default.
-s Seed Dir Path to seeds file(s)
Crawl Dir   Directory where the crawl/link/segments dirs are saved
Num Rounds  The number of rounds to run this crawl for

I don't see anything in the docs that describes a change. Am I missing something? How can I get the crawl script to work with v1.13?

1 Answer

Just found the answer after some better searching.

It seems that in 1.13 the bin/crawl script expects the path to the seed directory to be preceded by -s, as the usage message above shows.

This works: bin/crawl -i -D elastic.server.url=http://localhost:9300/search-index/ -s urls/ searchcrawl/ 2
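
For anyone scripting the upgrade, here is a minimal sketch of the new argument order. The helper name build_crawl_cmd is hypothetical (it is not part of Nutch), and it only prints the command rather than running it, so you can check the result before pointing it at a real install. The flags are the ones listed in the 1.13 usage message above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Nutch): prints a 1.13-style crawl command.
# Old 1.10 form:  bin/crawl -i -D key=value <Seed Dir> <Crawl Dir> <Num Rounds>
# New 1.13 form:  bin/crawl -i -D key=value -s <Seed Dir> <Crawl Dir> <Num Rounds>
build_crawl_cmd() {
  local seed_dir="$1" crawl_dir="$2" rounds="$3"
  # -s marks the seed directory; crawl dir and round count stay positional.
  printf '%s ' bin/crawl -i -D "elastic.server.url=http://localhost:9300/search-index/" \
    -s "$seed_dir" "$crawl_dir"
  printf '%s\n' "$rounds"
}

# Print the command for the directories used in the question.
build_crawl_cmd urls/ searchcrawl/ 2
```

To actually run the crawl, execute the printed line from the Nutch runtime directory.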

Hope this helps anyone else.