I am trying to run Nutch as a crawler over some local directories, using examples taken from both the main tutorial (https://cwiki.apache.org/confluence/display/nutch/FAQ#FAQ-HowdoIindexmylocalfilesystem?) and from other sources. Nutch crawls the web without any problem, but for some reason it refuses to scan local directories.
My configuration files are as follows:
regex-urlfilter:
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip http: ftp: and mailto: urls
-^(http|ftp|mailto):
# This change is not necessary but may make your life easier.
# Any file types you do not want to index need to be added to the list, otherwise
# Nutch will often try to parse them and fail, as it doesn't know
# how to deal with a lot of binary file types:
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
#-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS
#|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|gz|GZ|rpm|RPM|tgz|TGZ|mov
#|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS|asp|ASP|xxx|XXX|yyy|YYY
#|cs|CS|dll|DLL|refresh|REFRESH)$
# skip URLs longer than 2048 characters, see also db.max.outlink.length
#-^.{2049,}
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-(?i)\.(?:gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# For safe web crawling if crawled content is exposed in a public search interface:
# - exclude private network addresses to avoid that information
# can be leaked by placing links pointing to web interfaces of services
# running on the crawling machines (e.g., HDFS, Hadoop YARN)
# - in addition, file:// URLs should be either excluded by a URL filter rule
# or ignored by not enabling protocol-file
#
# - exclude localhost and loop-back addresses
# http://localhost:8080
# http://127.0.0.1/ .. http://127.255.255.255/
# http://[::1]/
#-^https?://(?:localhost|127(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3}|\[::1\])(?::\d+)?(?:/|$)
#
# - exclude private IP address spaces
# 10.0.0.0/8
#-^https?://(?:10(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3})(?::\d+)?(?:/|$)
# 192.168.0.0/16
#-^https?://(?:192\.168(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)
# 172.16.0.0/12
#-^https?://(?:172\.(?:1[6789]|2[0-9]|3[01])(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)
# accept anything else
+.
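To double-check my reading of the "first matching pattern decides" rule, I replayed the active (non-commented) rules above with plain Java regexes. This is just my own sketch, not Nutch's actual RegexURLFilter, and the sample URLs are my own examples:
import java.util.regex.Pattern;

public class RegexUrlFilterSketch {
    // the active rules from regex-urlfilter above, in order: sign + pattern
    static final String[][] RULES = {
        {"-", "^(http|ftp|mailto):"},
        {"-", "(?i)\\.(?:gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$"},
        {"-", "[?*!@=]"},
        {"-", ".*(/[^/]+)/[^/]+\\1/[^/]+\\1/"},
        {"+", "."}
    };

    // first rule whose pattern is found in the URL decides; no match = ignored
    static String check(String url) {
        for (String[] rule : RULES) {
            if (Pattern.compile(rule[1]).matcher(url).find()) {
                return rule[0];
            }
        }
        return "-";
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://example.com/",
            "https://example.com/",
            "file:/C:/Users/abc/Desktop/adirectory/"  // my example of a file: seed
        };
        for (String url : samples) {
            System.out.println(check(url) + " " + url);
        }
    }
}
By this reading, a file: URL ending in a slash falls through to the final +. rule, so I would expect it to be accepted by regex-urlfilter on its own.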
nutch-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>NutchSpiderTest</value>
</property>
<property>
<name>http.robots.agents</name>
<value>NutchSpiderTest,*</value>
<description>The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence. You should
put the value of http.agent.name as the first agent name, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</description>
</property>
<property>
<name>http.agent.description</name>
<value>I am just testing nutch, please tell me if it's bothering your website</value>
<description>Further description of our bot; this text is used in
the User-Agent header. It appears in parentheses after the agent name.
</description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-file|protocol-http|protocol-httpclient|urlfilter-(regex|validator)|parse-(html|tika|text)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|index-more</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
By default Nutch includes plugins to crawl HTML and various other
document formats via HTTP/HTTPS and indexing the crawled content
into Solr. More plugins are available to support more indexing
backends, to fetch ftp:// and file:// URLs, for focused crawling,
and many other use cases.
</description>
</property>
<property>
<name>file.content.limit</name>
<value>-1</value>
<description> Needed to stop buffer overflow errors - Unable to read.....</description>
</property>
<property>
<name>file.crawl.parent</name>
<value>false</value>
<description>By default the crawler is not restricted to the directories specified in
the URLs file; it climbs into the parent directories as well. Setting this to false
restricts the crawl to the directories beneath the ones you specify.</description>
</property>
</configuration>
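My understanding is that plugin.includes is itself a regular expression matched against plugin directory names, so I sanity-checked that the plugins I care about (notably protocol-file) match it. This is only my own check of the regex, not how Nutch actually loads plugins:
import java.util.regex.Pattern;

public class PluginIncludesCheck {
    public static void main(String[] args) {
        // the plugin.includes value from nutch-site.xml above
        String includes = "protocol-file|protocol-http|protocol-httpclient"
                + "|urlfilter-(regex|validator)|parse-(html|tika|text)"
                + "|index-(basic|anchor)|indexer-solr|scoring-opic"
                + "|urlnormalizer-(pass|regex|basic)|index-more";
        Pattern pattern = Pattern.compile(includes);
        // a short list of plugin directory names I expect to be loaded
        String[] plugins = {"protocol-file", "urlfilter-regex", "urlfilter-validator", "parse-tika"};
        for (String plugin : plugins) {
            System.out.println(plugin + " -> " + (pattern.matcher(plugin).matches() ? "included" : "excluded"));
        }
    }
}
All four come out as included, so I don't think the file protocol plugin is being excluded here.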
And finally, I commented out this part of regex-normalize.xml:
<!-- removes duplicate slashes but -->
<!-- * allow 2 slashes after colon ':' (indicating protocol) -->
<!-- we do not need this with files
<regex>
<pattern>(?<!:)/{2,}</pattern>
<substitution>/</substitution>
</regex>
-->
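The comment in that file says the rule removes duplicate slashes while allowing two after the protocol colon; since my seeds can have three slashes after file:, here is what the substitution would do to a couple of URLs (the file: path is just my own example):
import java.util.regex.Pattern;

public class DupSlashNormalizeSketch {
    public static void main(String[] args) {
        // the substitution I commented out in regex-normalize.xml
        Pattern dupSlashes = Pattern.compile("(?<!:)/{2,}");
        String[] urls = {
            "http://example.com//a//b",                  // the web case the rule is meant for
            "file:///C:/Users/abc/Desktop/adirectory/"   // my example of a file: seed
        };
        for (String url : urls) {
            System.out.println(url + " -> " + dupSlashes.matcher(url).replaceAll("/"));
        }
    }
}
The http URL keeps its protocol slashes, but file:/// gets collapsed to file://, which is why I thought it safer to disable the rule while testing file crawling.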
I am running Nutch under Cygwin on Windows 10, on a distribution built with ant, from the runtime/local directory, using the command:
bin/crawl -s dirs dircrawl 2 >& dircrawl.log
Here dirs is the folder containing the following seed.txt file (I included several versions of the links, since it does not seem consistent which form should work, though I may simply not have found a definitive answer):
/cygdrive/c/Users/abc/Desktop/adirectory/
file:/cygdrive/c/Users/abc/Desktop/adirectory/
file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/
file://cygdrive/c/Users/abc/Desktop/anotherdirectory/
file:///cygdrive/c/Users/abc/Desktop/anotherdirectory/
dircrawl is the directory I want to save the crawl to, and the number of rounds/max depth is set to 2. After a few seconds, the crawl produces the following hadoop.txt log file:
2020-03-24 14:08:58,184 INFO crawl.Injector - Injector: starting at 2020-03-24 14:08:58
2020-03-24 14:08:58,184 INFO crawl.Injector - Injector: crawlDb: dircrawl/crawldb
2020-03-24 14:08:58,184 INFO crawl.Injector - Injector: urlDir: dirs
2020-03-24 14:08:58,184 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2020-03-24 14:08:58,948 INFO crawl.Injector - Injecting seed URL file file:/C:/Users/abc/Desktop/nutch/runtime/local/dirs/seed.txt
2020-03-24 14:08:59,011 WARN impl.MetricsConfig - Cannot locate configuration: tried hadoop-metrics2-jobtracker.properties,hadoop-metrics2.properties
2020-03-24 14:08:59,888 INFO mapreduce.Job - The url to track the job: http://localhost:8080/
2020-03-24 14:08:59,890 INFO mapreduce.Job - Running job: job_local1269520609_0001
2020-03-24 14:09:00,897 WARN crawl.Injector - Skipping /cygdrive/c/Users/abc/Desktop/adirectory/:java.net.MalformedURLException: no protocol: /cygdrive/c/Users/abc/Desktop/adirectory/
2020-03-24 14:09:00,902 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2020-03-24 14:09:00,906 INFO mapreduce.Job - Job job_local1269520609_0001 running in uber mode : false
2020-03-24 14:09:00,908 INFO mapreduce.Job - map 0% reduce 0%
2020-03-24 14:09:01,158 WARN impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2020-03-24 14:09:01,447 WARN zlib.ZlibFactory - Failed to load/initialize native-zlib library
2020-03-24 14:09:01,461 INFO crawl.Injector - Injector: overwrite: false
2020-03-24 14:09:01,461 INFO crawl.Injector - Injector: update: false
2020-03-24 14:09:01,924 INFO mapreduce.Job - map 100% reduce 100%
2020-03-24 14:09:01,926 INFO mapreduce.Job - Job job_local1269520609_0001 completed successfully
2020-03-24 14:09:01,951 INFO mapreduce.Job - Counters: 31
File System Counters
FILE: Number of bytes read=1857050
FILE: Number of bytes written=3067581
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=5
Map output records=0
Map output bytes=0
Map output materialized bytes=6
Input split bytes=289
Combine input records=0
Combine output records=0
Reduce input groups=0
Reduce shuffle bytes=6
Reduce input records=0
Reduce output records=0
Spilled Records=0
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=13
Total committed heap usage (bytes)=402653184
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
injector
urls_filtered=5
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=239
2020-03-24 14:09:02,022 INFO crawl.Injector - Injector: Total urls rejected by filters: 5
2020-03-24 14:09:02,023 INFO crawl.Injector - Injector: Total urls injected after normalization and filtering: 0
2020-03-24 14:09:02,023 INFO crawl.Injector - Injector: Total urls injected but already in CrawlDb: 0
2020-03-24 14:09:02,023 INFO crawl.Injector - Injector: Total new urls injected: 0
2020-03-24 14:09:02,054 INFO crawl.Injector - Injector: finished at 2020-03-24 14:09:02, elapsed: 00:00:03
2020-03-24 14:09:08,708 INFO crawl.Generator - Generator: starting at 2020-03-24 14:09:08
2020-03-24 14:09:08,708 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2020-03-24 14:09:08,708 INFO crawl.Generator - Generator: filtering: false
2020-03-24 14:09:08,708 INFO crawl.Generator - Generator: normalizing: true
2020-03-24 14:09:08,715 INFO crawl.Generator - Generator: topN: 50000
2020-03-24 14:09:08,879 WARN impl.MetricsConfig - Cannot locate configuration: tried hadoop-metrics2-jobtracker.properties,hadoop-metrics2.properties
2020-03-24 14:09:10,418 INFO mapreduce.Job - The url to track the job: http://localhost:8080/
2020-03-24 14:09:10,424 INFO mapreduce.Job - Running job: job_local828841059_0001
2020-03-24 14:09:11,450 INFO mapreduce.Job - Job job_local828841059_0001 running in uber mode : false
2020-03-24 14:09:11,453 INFO mapreduce.Job - map 0% reduce 0%
2020-03-24 14:09:11,784 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2020-03-24 14:09:11,784 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2020-03-24 14:09:11,784 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2020-03-24 14:09:11,816 WARN zlib.ZlibFactory - Failed to load/initialize native-zlib library
2020-03-24 14:09:12,073 WARN impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2020-03-24 14:09:12,475 INFO mapreduce.Job - map 100% reduce 100%
2020-03-24 14:09:12,505 WARN impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2020-03-24 14:09:13,485 INFO mapreduce.Job - Job job_local828841059_0001 completed successfully
2020-03-24 14:09:13,502 INFO mapreduce.Job - Counters: 30
File System Counters
FILE: Number of bytes read=2784859
FILE: Number of bytes written=4605489
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=0
Map output records=0
Map output bytes=0
Map output materialized bytes=28
Input split bytes=156
Combine input records=0
Combine output records=0
Reduce input groups=0
Reduce shuffle bytes=28
Reduce input records=0
Reduce output records=0
Spilled Records=0
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=15
Total committed heap usage (bytes)=603979776
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=98
File Output Format Counters
Bytes Written=16
2020-03-24 14:09:13,502 INFO crawl.Generator - Generator: number of items rejected during selection:
2020-03-24 14:09:13,521 WARN crawl.Generator - Generator: 0 records selected for fetching, exiting ...
While the dircrawl.log file yields:
Injecting seed URLs
/cygdrive/c/Users/abc/Desktop/nutch/runtime/local/bin/nutch inject dircrawl/crawldb dirs
Injector: starting at 2020-03-24 14:08:58
Injector: crawlDb: dircrawl/crawldb
Injector: urlDir: dirs
Injector: Converting injected urls to crawl db entries.
Injecting seed URL file file:/C:/Users/abc/Desktop/nutch/runtime/local/dirs/seed.txt
Skipping /cygdrive/c/Users/abc/Desktop/adirectory/:java.net.MalformedURLException: no protocol: /cygdrive/c/Users/abc/Desktop/adirectory/
Injector: overwrite: false
Injector: update: false
Injector: Total urls rejected by filters: 5
Injector: Total urls injected after normalization and filtering: 0
Injector: Total urls injected but already in CrawlDb: 0
Injector: Total new urls injected: 0
Injector: finished at 2020-03-24 14:09:02, elapsed: 00:00:03
24 Mar 2020 14:09:02 : Iteration 1 of 2
Generating a new segment
/cygdrive/c/Users/abc/Desktop/nutch/runtime/local/bin/nutch generate -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true dircrawl/crawldb dircrawl/segments -topN 50000 -numFetchers 1 -noFilter
Generator: starting at 2020-03-24 14:09:08
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: number of items rejected during selection:
Generator: 0 records selected for fetching, exiting ...
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now
So now I am stuck. I've tried to undo some of my changes, but no matter what I do I cannot get the configuration to work with local directories. Does anyone know what I'm doing wrong?
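For reference, the "no protocol" MalformedURLException in the injector log made me check what plain java.net.URL does with each seed form. This is just a standalone sketch, not Nutch code; the last, Windows-drive entry is my own addition, modelled on the way the injector itself printed the seed file path (file:/C:/...):
import java.net.MalformedURLException;
import java.net.URL;

public class SeedUrlParseCheck {
    public static void main(String[] args) {
        // the seed variants from seed.txt, plus one Windows-drive form for comparison
        String[] seeds = {
            "/cygdrive/c/Users/abc/Desktop/adirectory/",
            "file:/cygdrive/c/Users/abc/Desktop/adirectory/",
            "file://cygdrive/c/Users/abc/Desktop/anotherdirectory/",
            "file:///cygdrive/c/Users/abc/Desktop/anotherdirectory/",
            "file:/C:/Users/abc/Desktop/adirectory/"   // my own guess, not in my seed.txt
        };
        for (String seed : seeds) {
            try {
                URL url = new URL(seed);
                System.out.println("parsed   " + seed + "  (host='" + url.getHost() + "', path='" + url.getPath() + "')");
            } catch (MalformedURLException e) {
                System.out.println("rejected " + seed + "  (" + e.getMessage() + ")");
            }
        }
    }
}
Only the first line is rejected ("no protocol", the same message as in the log); the others parse, though the file:// form treats "cygdrive" as a host name.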
Update: running cat seed_urls.txt | $NUTCH_HOME/bin/nutch filterchecker -stdin, as suggested in stackoverflow.com/questions/48148398/…, returns a "-" next to all "file:..." and http URLs, but a "+" next to https. Uncommenting the line in regex-urlfilter allows both http and https, BUT it still won't allow file://.