
I followed a tutorial for web crawling with Nutch using Cygwin, Tomcat, Nutch 1.4 and Solr 3.4. I was able to crawl a URL once, but somehow this no longer works, no matter which URL I try. My regex-urlfilter.txt in runtime/local/conf is as follows:

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
 +^http://([a-z0-9]*\.)*nutch.apache.org/

The only URL in my seed.txt (located in runtime/local/bin/urls) is http://nutch.apache.org/.
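
For reference, the seed file is just one URL per line (a sketch of the assumed layout):

# runtime/local/bin/urls/seed.txt
http://nutch.apache.org/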

For crawling I use command

$ ./nutch crawl urls -dir newCrawl3 -solr http://localhost:8080/solr/ -depth 2 -topN 3

Console output is:

cygpath: can't convert empty path
crawl started in: newCrawl3
rootUrlDir = urls
threads = 10
depth = 2
solrUrl=http://localhost:8080/solr/
topN = 3
Injector: starting at 2017-05-18 17:03:25
Injector: crawlDb: newCrawl3/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2017-05-18 17:03:28, elapsed: 00:00:02
Generator: starting at 2017-05-18 17:03:28
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 3
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: newCrawl3

I know there are a few similar questions, but most of them are not resolved. Can anyone help?

Thank you very much in advance!


2 Answers


Why are you using a Nutch version that is really old? Nevertheless, the problem you're facing is the space at the beginning of this line:

 _+^http://([a-z0-9]*\.)*nutch.apache.org/

(I've highlighted the space with an underscore.) Every line that starts with a space, a newline, or # is ignored by the configuration parser; take a look at: https://github.com/apache/nutch/blob/master/src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexURLFilterBase.java#L258-L269
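
Removing the leading space, so that the line starts directly with +, should make the accept rule take effect:

# accept anything else
+^http://([a-z0-9]*\.)*nutch.apache.org/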


You can also try deleting the directory newCrawl3. Nutch will not crawl a URL again when it has been crawled recently.
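
For example (assuming the crawl was started from runtime/local/bin, so the directory sits there):

$ rm -rf newCrawl3

Re-running the crawl command afterwards will inject the seed URL into a fresh crawldb.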