I am trying to run Nutch as a crawler over some local directories, using examples taken from both the main tutorial (https://cwiki.apache.org/confluence/display/nutch/FAQ#FAQ-HowdoIindexmylocalfilesystem?) and from other sources. Nutch crawls the web without any problem, but for some reason it refuses to scan local directories.
My configuration files are as follows:
regex-urlfilter:
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip http: ftp: and mailto: urls
-^(http|ftp|mailto):
# This change is not necessary but may make your life easier.
# Any file types you do not want to index need to be added to the list, otherwise
# Nutch will often try to parse them and fail, as it doesn't know
# how to deal with a lot of binary file types:
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
#-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS
#|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|gz|GZ|rpm|RPM|tgz|TGZ|mov
#|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS|asp|ASP|xxx|XXX|yyy|YYY
#|cs|CS|dll|DLL|refresh|REFRESH)$
# skip URLs longer than 2048 characters, see also db.max.outlink.length
#-^.{2049,}
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-(?i)\.(?:gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# For safe web crawling if crawled content is exposed in a public search interface:
# - exclude private network addresses to avoid that information
# can be leaked by placing links pointing to web interfaces of services
# running on the crawling machines (e.g., HDFS, Hadoop YARN)
# - in addition, file:// URLs should be either excluded by a URL filter rule
# or ignored by not enabling protocol-file
#
# - exclude localhost and loop-back addresses
# http://localhost:8080
# http://127.0.0.1/ .. http://127.255.255.255/
# http://[::1]/
#-^https?://(?:localhost|127(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3}|\[::1\])(?::\d+)?(?:/|$)
#
# - exclude private IP address spaces
# 10.0.0.0/8
#-^https?://(?:10(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3})(?::\d+)?(?:/|$)
# 192.168.0.0/16
#-^https?://(?:192\.168(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)
# 172.16.0.0/12
#-^https?://(?:172\.(?:1[6789]|2[0-9]|3[01])(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)
# accept anything else
+.
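To double-check my reading of the "first matching pattern decides" rule, I replayed the active (non-commented) rules above with plain Java regexes. This is just my own sketch, not Nutch's actual RegexURLFilter, and the sample URLs are my own examples:
import java.util.regex.Pattern;

public class RegexUrlFilterSketch {
    // the active rules from regex-urlfilter above, in order: sign + pattern
    static final String[][] RULES = {
        {"-", "^(http|ftp|mailto):"},
        {"-", "(?i)\\.(?:gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$"},
        {"-", "[?*!@=]"},
        {"-", ".*(/[^/]+)/[^/]+\\1/[^/]+\\1/"},
        {"+", "."}
    };

    // first rule whose pattern is found in the URL decides; no match = ignored
    static String check(String url) {
        for (String[] rule : RULES) {
            if (Pattern.compile(rule[1]).matcher(url).find()) {
                return rule[0];
            }
        }
        return "-";
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://example.com/",
            "https://example.com/",
            "file:/C:/Users/abc/Desktop/adirectory/"  // my example of a file: seed
        };
        for (String url : samples) {
            System.out.println(check(url) + " " + url);
        }
    }
}
By this reading, a file: URL ending in a slash falls through to the final +. rule, so I would expect it to be accepted by regex-urlfilter on its own.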
nutch-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>NutchSpiderTest</value>
</property>
<property>
<name>http.robots.agents</name>
<value>NutchSpiderTest,*</value>
<description>The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence. You should
put the value of http.agent.name as the first agent name, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</description>
</property>
<property>
<name>http.agent.description</name>
<value>I am just testing nutch, please tell me if it's bothering your website</value>
<description>Further description of our bot; this text is used in
the User-Agent header. It appears in parentheses after the agent name.
</description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-file|protocol-http|protocol-httpclient|urlfilter-(regex|validator)|parse-(html|tika|text)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|index-more</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
By default Nutch includes plugins to crawl HTML and various other
document formats via HTTP/HTTPS and indexing the crawled content
into Solr. More plugins are available to support more indexing
backends, to fetch ftp:// and file:// URLs, for focused crawling,
and many other use cases.
</description>
</property>
<property>
<name>file.content.limit</name>
<value>-1</value>
<description> Needed to stop buffer overflow errors - Unable to read.....</description>
</property>
<property>
<name>file.crawl.parent</name>
<value>false</value>
<description>By default the crawler is not restricted to the directories specified in
the URLs file; it climbs into the parent directories as well. Setting this to false
restricts the crawl to the directories beneath the ones you specify.</description>
</property>
</configuration>
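My understanding is that plugin.includes is itself a regular expression matched against plugin directory names, so I sanity-checked that the plugins I care about (notably protocol-file) match it. This is only my own check of the regex, not how Nutch actually loads plugins:
import java.util.regex.Pattern;

public class PluginIncludesCheck {
    public static void main(String[] args) {
        // the plugin.includes value from nutch-site.xml above
        String includes = "protocol-file|protocol-http|protocol-httpclient"
                + "|urlfilter-(regex|validator)|parse-(html|tika|text)"
                + "|index-(basic|anchor)|indexer-solr|scoring-opic"
                + "|urlnormalizer-(pass|regex|basic)|index-more";
        Pattern pattern = Pattern.compile(includes);
        // a short list of plugin directory names I expect to be loaded
        String[] plugins = {"protocol-file", "urlfilter-regex", "urlfilter-validator", "parse-tika"};
        for (String plugin : plugins) {
            System.out.println(plugin + " -> " + (pattern.matcher(plugin).matches() ? "included" : "excluded"));
        }
    }
}
All four come out as included, so I don't think the file protocol plugin is being excluded here.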
And finally, I commented out this part of regex-normalize.xml:
<!-- removes duplicate slashes but -->
<!-- * allow 2 slashes after colon ':' (indicating protocol) -->
<!-- we do not need this with files
<regex>
<pattern>(?<!:)/{2,}</pattern>
<substitution>/</substitution>
</regex>
-->
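The comment in that file says the rule removes duplicate slashes while allowing two after the protocol colon; since my seeds can have three slashes after file:, here is what the substitution would do to a couple of URLs (the file: path is just my own example):
import java.util.regex.Pattern;

public class DupSlashNormalizeSketch {
    public static void main(String[] args) {
        // the substitution I commented out in regex-normalize.xml
        Pattern dupSlashes = Pattern.compile("(?<!:)/{2,}");
        String[] urls = {
            "http://example.com//a//b",                  // the web case the rule is meant for
            "file:///C:/Users/abc/Desktop/adirectory/"   // my example of a file: seed
        };
        for (String url : urls) {
            System.out.println(url + " -> " + dupSlashes.matcher(url).replaceAll("/"));
        }
    }
}
The http URL keeps its protocol slashes, but file:/// gets collapsed to file://, which is why I thought it safer to disable the rule while testing file crawling.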
I am running Nutch under Cygwin on Windows 10, on a distribution built with ant, from the runtime/local directory, using the command:
bin/crawl -s dirs dircrawl 2 >& dircrawl.log
Here dirs is the folder containing the following seed.txt file (I included several versions of the links, since it does not seem consistent which form should work, though I may simply not have found a definitive answer):
/cygdrive/c/Users/abc/Desktop/adirectory/
file:/cygdrive/c/Users/abc/Desktop/adirectory/
file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/
file://cygdrive/c/Users/abc/Desktop/anotherdirectory/
file:///cygdrive/c/Users/abc/Desktop/anotherdirectory/
dircrawl is the directory I want to save the crawl to, and the number of rounds/max depth is set to 2. After a few seconds, the crawl produces the following hadoop.txt log file:
2020-03-24 14:08:58,184 INFO crawl.Injector - Injector: starting at 2020-03-24 14:08:58
2020-03-24 14:08:58,184 INFO crawl.Injector - Injector: crawlDb: dircrawl/crawldb
2020-03-24 14:08:58,184 INFO crawl.Injector - Injector: urlDir: dirs
2020-03-24 14:08:58,184 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2020-03-24 14:08:58,948 INFO crawl.Injector - Injecting seed URL file file:/C:/Users/abc/Desktop/nutch/runtime/local/dirs/seed.txt
2020-03-24 14:08:59,011 WARN impl.MetricsConfig - Cannot locate configuration: tried hadoop-metrics2-jobtracker.properties,hadoop-metrics2.properties
2020-03-24 14:08:59,888 INFO mapreduce.Job - The url to track the job: http://localhost:8080/
2020-03-24 14:08:59,890 INFO mapreduce.Job - Running job: job_local1269520609_0001
2020-03-24 14:09:00,897 WARN crawl.Injector - Skipping /cygdrive/c/Users/abc/Desktop/adirectory/:java.net.MalformedURLException: no protocol: /cygdrive/c/Users/abc/Desktop/adirectory/
2020-03-24 14:09:00,902 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2020-03-24 14:09:00,906 INFO mapreduce.Job - Job job_local1269520609_0001 running in uber mode : false
2020-03-24 14:09:00,908 INFO mapreduce.Job - map 0% reduce 0%
2020-03-24 14:09:01,158 WARN impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2020-03-24 14:09:01,447 WARN zlib.ZlibFactory - Failed to load/initialize native-zlib library
2020-03-24 14:09:01,461 INFO crawl.Injector - Injector: overwrite: false
2020-03-24 14:09:01,461 INFO crawl.Injector - Injector: update: false
2020-03-24 14:09:01,924 INFO mapreduce.Job - map 100% reduce 100%
2020-03-24 14:09:01,926 INFO mapreduce.Job - Job job_local1269520609_0001 completed successfully
2020-03-24 14:09:01,951 INFO mapreduce.Job - Counters: 31
File System Counters
FILE: Number of bytes read=1857050
FILE: Number of bytes written=3067581
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=5
Map output records=0
Map output bytes=0
Map output materialized bytes=6
Input split bytes=289
Combine input records=0
Combine output records=0
Reduce input groups=0
Reduce shuffle bytes=6
Reduce input records=0
Reduce output records=0
Spilled Records=0
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=13
Total committed heap usage (bytes)=402653184
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
injector
urls_filtered=5
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=239
2020-03-24 14:09:02,022 INFO crawl.Injector - Injector: Total urls rejected by filters: 5
2020-03-24 14:09:02,023 INFO crawl.Injector - Injector: Total urls injected after normalization and filtering: 0
2020-03-24 14:09:02,023 INFO crawl.Injector - Injector: Total urls injected but already in CrawlDb: 0
2020-03-24 14:09:02,023 INFO crawl.Injector - Injector: Total new urls injected: 0
2020-03-24 14:09:02,054 INFO crawl.Injector - Injector: finished at 2020-03-24 14:09:02, elapsed: 00:00:03
2020-03-24 14:09:08,708 INFO crawl.Generator - Generator: starting at 2020-03-24 14:09:08
2020-03-24 14:09:08,708 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2020-03-24 14:09:08,708 INFO crawl.Generator - Generator: filtering: false
2020-03-24 14:09:08,708 INFO crawl.Generator - Generator: normalizing: true
2020-03-24 14:09:08,715 INFO crawl.Generator - Generator: topN: 50000
2020-03-24 14:09:08,879 WARN impl.MetricsConfig - Cannot locate configuration: tried hadoop-metrics2-jobtracker.properties,hadoop-metrics2.properties
2020-03-24 14:09:10,418 INFO mapreduce.Job - The url to track the job: http://localhost:8080/
2020-03-24 14:09:10,424 INFO mapreduce.Job - Running job: job_local828841059_0001
2020-03-24 14:09:11,450 INFO mapreduce.Job - Job job_local828841059_0001 running in uber mode : false
2020-03-24 14:09:11,453 INFO mapreduce.Job - map 0% reduce 0%
2020-03-24 14:09:11,784 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2020-03-24 14:09:11,784 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2020-03-24 14:09:11,784 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2020-03-24 14:09:11,816 WARN zlib.ZlibFactory - Failed to load/initialize native-zlib library
2020-03-24 14:09:12,073 WARN impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2020-03-24 14:09:12,475 INFO mapreduce.Job - map 100% reduce 100%
2020-03-24 14:09:12,505 WARN impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2020-03-24 14:09:13,485 INFO mapreduce.Job - Job job_local828841059_0001 completed successfully
2020-03-24 14:09:13,502 INFO mapreduce.Job - Counters: 30
File System Counters
FILE: Number of bytes read=2784859
FILE: Number of bytes written=4605489
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=0
Map output records=0
Map output bytes=0
Map output materialized bytes=28
Input split bytes=156
Combine input records=0
Combine output records=0
Reduce input groups=0
Reduce shuffle bytes=28
Reduce input records=0
Reduce output records=0
Spilled Records=0
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=15
Total committed heap usage (bytes)=603979776
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=98
File Output Format Counters
Bytes Written=16
2020-03-24 14:09:13,502 INFO crawl.Generator - Generator: number of items rejected during selection:
2020-03-24 14:09:13,521 WARN crawl.Generator - Generator: 0 records selected for fetching, exiting ...
While the dircrawl.log file yields:
Injecting seed URLs
/cygdrive/c/Users/abc/Desktop/nutch/runtime/local/bin/nutch inject dircrawl/crawldb dirs
Injector: starting at 2020-03-24 14:08:58
Injector: crawlDb: dircrawl/crawldb
Injector: urlDir: dirs
Injector: Converting injected urls to crawl db entries.
Injecting seed URL file file:/C:/Users/abc/Desktop/nutch/runtime/local/dirs/seed.txt
Skipping /cygdrive/c/Users/abc/Desktop/adirectory/:java.net.MalformedURLException: no protocol: /cygdrive/c/Users/abc/Desktop/adirectory/
Injector: overwrite: false
Injector: update: false
Injector: Total urls rejected by filters: 5
Injector: Total urls injected after normalization and filtering: 0
Injector: Total urls injected but already in CrawlDb: 0
Injector: Total new urls injected: 0
Injector: finished at 2020-03-24 14:09:02, elapsed: 00:00:03
24 Mar 2020 14:09:02 : Iteration 1 of 2
Generating a new segment
/cygdrive/c/Users/abc/Desktop/nutch/runtime/local/bin/nutch generate -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true dircrawl/crawldb dircrawl/segments -topN 50000 -numFetchers 1 -noFilter
Generator: starting at 2020-03-24 14:09:08
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: number of items rejected during selection:
Generator: 0 records selected for fetching, exiting ...
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now
So now I am stuck. I've tried to undo some of my changes, but no matter what I do I cannot get the configuration to work with local directories. Does anyone know what I'm doing wrong?
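For reference, the "no protocol" MalformedURLException in the injector log made me check what plain java.net.URL does with each seed form. This is just a standalone sketch, not Nutch code; the last, Windows-drive entry is my own addition, modelled on the way the injector itself printed the seed file path (file:/C:/...):
import java.net.MalformedURLException;
import java.net.URL;

public class SeedUrlParseCheck {
    public static void main(String[] args) {
        // the seed variants from seed.txt, plus one Windows-drive form for comparison
        String[] seeds = {
            "/cygdrive/c/Users/abc/Desktop/adirectory/",
            "file:/cygdrive/c/Users/abc/Desktop/adirectory/",
            "file://cygdrive/c/Users/abc/Desktop/anotherdirectory/",
            "file:///cygdrive/c/Users/abc/Desktop/anotherdirectory/",
            "file:/C:/Users/abc/Desktop/adirectory/"   // my own guess, not in my seed.txt
        };
        for (String seed : seeds) {
            try {
                URL url = new URL(seed);
                System.out.println("parsed   " + seed + "  (host='" + url.getHost() + "', path='" + url.getPath() + "')");
            } catch (MalformedURLException e) {
                System.out.println("rejected " + seed + "  (" + e.getMessage() + ")");
            }
        }
    }
}
Only the first line is rejected ("no protocol", the same message as in the log); the others parse, though the file:// form treats "cygdrive" as a host name.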
Update: running cat seed_urls.txt | $NUTCH_HOME/bin/nutch filterchecker -stdin, as suggested in stackoverflow.com/questions/48148398/…, returns a "-" next to all "file:..." and http URLs, but a "+" next to https. Uncommenting the line in regex-urlfilter allows both http and https, BUT it still won't allow file://.