I have been trying to crawl with Nutch for a while now, but it just doesn't seem to run. I'm building a Solr search for a website, using Nutch for crawling and for indexing into Solr.
There were some permission problems originally, but those have been fixed. The URL I'm trying to crawl is http://172.30.162.202:10200/, which is not publicly accessible; it is an internal URL that can be reached from the Solr server, which I verified by browsing it with Lynx.
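For what it's worth, this is roughly the reachability check I did from that server (a quick sketch; lynx is what I actually used, and the curl line is just an alternative I assume would work if curl is installed there):

# Verify that the internal URL answers from the machine where Nutch runs.
# lynx -dump prints the rendered page; curl -I shows only the response headers.
lynx -dump http://172.30.162.202:10200/ | head -n 20
curl -I http://172.30.162.202:10200/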
Here is the output from the Nutch crawl command:
[abgu01@app01 local]$ ./bin/nutch crawl /home/abgu01/urls/url1.txt -dir /home/abgu01/crawl -depth 5 -topN 100
log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: /opt/apache-nutch-1.4-bin/runtime/local/logs/hadoop.log (No such file or directory)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:212)
at java.io.FileOutputStream.<init>(FileOutputStream.java:136)
at org.apache.log4j.FileAppender.setFile(FileAppender.java:290)
at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:164)
at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:216)
at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:257)
at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:133)
at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:97)
at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:689)
at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:647)
at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:544)
at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:440)
at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:476)
at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:471)
at org.apache.log4j.LogManager.<clinit>(LogManager.java:125)
at org.slf4j.impl.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:73)
at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:242)
at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:254)
at org.apache.nutch.crawl.Crawl.<clinit>(Crawl.java:43)
log4j:ERROR Either File or DatePattern options are not set for appender [DRFA].
solrUrl is not set, indexing will be skipped...
crawl started in: /home/abgu01/crawl
rootUrlDir = /home/abgu01/urls/url1.txt
threads = 10
depth = 5
solrUrl=null
topN = 100
Injector: starting at 2012-07-27 15:47:00
Injector: crawlDb: /home/abgu01/crawl/crawldb
Injector: urlDir: /home/abgu01/urls/url1.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-07-27 15:47:03, elapsed: 00:00:02
Generator: starting at 2012-07-27 15:47:03
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 100
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: /home/abgu01/crawl/segments/20120727154705
Generator: finished at 2012-07-27 15:47:06, elapsed: 00:00:03
Fetcher: starting at 2012-07-27 15:47:06
Fetcher: segment: /home/abgu01/crawl/segments/20120727154705
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://172.30.162.202:10200/
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Fetcher: throughput threshold: -1
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-07-27 15:47:08, elapsed: 00:00:02
ParseSegment: starting at 2012-07-27 15:47:08
ParseSegment: segment: /home/abgu01/crawl/segments/20120727154705
ParseSegment: finished at 2012-07-27 15:47:09, elapsed: 00:00:01
CrawlDb update: starting at 2012-07-27 15:47:09
CrawlDb update: db: /home/abgu01/crawl/crawldb
CrawlDb update: segments: [/home/abgu01/crawl/segments/20120727154705]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-07-27 15:47:10, elapsed: 00:00:01
Generator: starting at 2012-07-27 15:47:10
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 100
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting at 2012-07-27 15:47:11
LinkDb: linkdb: /home/abgu01/crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/home/abgu01/crawl/segments/20120727154705
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
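As a side note, I am guessing the log4j error at the very top only means that the logs directory under the local runtime does not exist yet, so hadoop.log cannot be created. Presumably something like this (path taken straight from the stack trace) would fix that part and at least give me a hadoop.log with the fetch details:

# Create the missing logs directory so log4j can open hadoop.log
# (the path comes from the FileNotFoundException above).
mkdir -p /opt/apache-nutch-1.4-bin/runtime/local/logs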
Can anyone suggest why the crawl isn't running? It always ends with "Stopping at depth=1 - no more URLs to fetch", regardless of the values of the depth and topN parameters. Looking at the output above, I suspect the reason is that the Fetcher isn't able to fetch any content from the URL.
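To check that suspicion, this is roughly what I plan to run next (a sketch; readdb, readseg and parsechecker are standard bin/nutch subcommands as far as I know, and the paths are the ones from the output above):

# Per-status counts in the crawl db (how many URLs are db_fetched, db_gone, db_unfetched, ...).
./bin/nutch readdb /home/abgu01/crawl/crawldb -stats

# Dump the one generated segment to see what the fetcher actually stored for the URL.
./bin/nutch readseg -dump /home/abgu01/crawl/segments/20120727154705 /home/abgu01/segdump
less /home/abgu01/segdump/dump

# Try fetching and parsing the page outside the crawl cycle.
./bin/nutch parsechecker http://172.30.162.202:10200/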
Any input is appreciated!