~/runtime/local/bin/urls/seed.txt >>

http://nutch.apache.org/

~/runtime/local/conf/nutch-site.xml >>

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
            <name>http.agent.name</name>
            <value>My Nutch Spider</value>
    </property>

    <property>
            <name>http.timeout</name>
            <value>99999999</value>
            <description></description>
    </property>

    <property>
            <name>plugin.includes</name>
            <value>protocol-file|protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|index-more</value>
            <description>Regular expression naming plugin directory names to
            include.  Any plugin not matching this expression is excluded.
            In any case you need at least include the nutch-extensionpoints plugin.
            </description>
    </property>
</configuration>

~/runtime/local/conf/regex-urlfilter.txt >>

# skip http: ftp: and mailto: urls
-^(http|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+^http://([a-z0-9\-A-Z]*\.)*nutch.apache.org/([a-z0-9\-A-Z]*\/)*

When I run the crawl, I get this output:

/home/apache-nutch-1.4-bin/runtime/local/bin
$ ./nutch crawl urls -dir newCrawl/ -depth 3 -topN 3
cygpath: can't convert empty path
solrUrl is not set, indexing will be skipped...
crawl started in: newCrawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
topN = 3
Injector: starting at 2014-07-18 11:35:36
Injector: crawlDb: newCrawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2014-07-18 11:35:39, elapsed: 00:00:02
Generator: starting at 2014-07-18 11:35:39
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 3
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: newCrawl

No matter what the web addresses are, it always says there are no URLs to fetch. I have been struggling with this problem for 3 days. Please help!

Add a Wikipedia page or a site with many links to seed.txt and try. – Ramanan

1 Answer


I was looking at your regex filter and spotted a few glitches that you might want to try fixing. Since this won't fit well into a comment, I'm posting it here even though it might not be the complete answer.

  1. Your customized regular expression +^http://([a-z0-9\-A-Z]*\.)*nutch.apache.org/([a-z0-9\-A-Z]*\/)* might be the problem. Nutch's regex-urlfilter can get really confusing, so I would highly recommend starting with something that works for everyone, maybe +^http://([a-z0-9]*\.)*nutch.apache.org/ from the wiki, just to get started (see the sketch after this list).
  2. Once you are sure Nutch is working with that simpler pattern, you can tweak the regex.
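
For reference, a minimal regex-urlfilter.txt built around that wiki rule might look like the sketch below. It follows the stock filter file that ships with Nutch, with the final accept line swapped in; the image-suffix and loop-breaking rules are omitted for brevity, so treat it as a starting point rather than a drop-in replacement.

# skip file: ftp: and mailto: urls (note: the stock rule skips file:, not http:)
-^(file|ftp|mailto):

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept everything under nutch.apache.org
+^http://([a-z0-9]*\.)*nutch.apache.org/

Once this broad accept rule actually fetches pages, you can tighten it again one change at a time.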

I have found two ways to test the regex:

  1. Feed a list of URLs as the seed, inject them into a fresh crawl db, and see which ones were injected or rejected. This doesn't really require any coding (a rough command-line sketch follows after this list).
  2. You can set up Nutch in Eclipse and call the corresponding class to test it.
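
For the first route, a rough command-line sketch under your Nutch 1.4 local runtime might look like this; testCrawl is just a placeholder directory name, and urls is the seed directory you already have next to bin/nutch:

# inject the seeds into a throwaway crawl db; the configured filters and normalizers are applied here
./nutch inject testCrawl/crawldb urls

# check how many URLs survived injection, then dump them to inspect individually
./nutch readdb testCrawl/crawldb -stats
./nutch readdb testCrawl/crawldb -dump testCrawl/dump

If the stats report zero URLs, the filters (or normalizers) rejected your seeds before the generator ever ran. For the second route, the class to look at is org.apache.nutch.net.URLFilterChecker, but check its usage output for the exact arguments in your version.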
