2
votes

I followed https://wiki.apache.org/nutch/NutchTutorial and am trying to install Nutch 1.12 and integrate it with Solr 5.5.2. I installed Nutch by following the steps in the tutorial, but when I try to integrate it with Solr by running the command below, it throws the exception below.

bin/nutch index http://10.209.18.213:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/* -filter -normalize

Exception:

2016-08-11 09:18:40,076 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-08-11 09:18:40,383 WARN  segment.SegmentChecker - The input path at crawldb is not a segment... skipping
2016-08-11 09:18:40,397 INFO  segment.SegmentChecker - Segment dir is complete: crawl/segments/20160810110110.
2016-08-11 09:18:40,403 INFO  segment.SegmentChecker - Segment dir is complete: crawl/segments/20160810112551.
2016-08-11 09:18:40,408 INFO  segment.SegmentChecker - Segment dir is complete: crawl/segments/20160810112952.
2016-08-11 09:18:40,409 INFO  indexer.IndexingJob - Indexer: starting at 2016-08-11 09:18:40
2016-08-11 09:18:40,415 INFO  indexer.IndexingJob - Indexer: deleting gone documents: false
2016-08-11 09:18:40,415 INFO  indexer.IndexingJob - Indexer: URL filtering: true
2016-08-11 09:18:40,415 INFO  indexer.IndexingJob - Indexer: URL normalizing: true
2016-08-11 09:18:40,672 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2016-08-11 09:18:40,672 INFO  indexer.IndexingJob - Active IndexWriters :
SOLRIndexWriter
        solr.server.url : URL of the SOLR instance
        solr.zookeeper.hosts : URL of the Zookeeper quorum
        solr.commit.size : buffer size when sending to SOLR (default 1000)
        solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
        solr.auth : use authentication (default false)
        solr.auth.username : username for authentication
        solr.auth.password : password for authentication


2016-08-11 09:18:40,677 INFO  indexer.IndexerMapReduce - IndexerMapReduce: crawldb: http://10.209.18.213:8983/solr
2016-08-11 09:18:40,677 INFO  indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
2016-08-11 09:18:40,677 INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20160810110110
2016-08-11 09:18:40,683 INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20160810112551
2016-08-11 09:18:40,684 INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20160810112952
2016-08-11 09:18:41,362 ERROR indexer.IndexingJob - Indexer: java.io.IOException: No FileSystem for scheme: http
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2385)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
        at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:520)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:512)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
        at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
        at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:833)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237)
1
I'm having the same problem. Did you find any solution? – LucaoA

1 Answer

-2
votes

The tutorial still mentions the deprecated solrindex command. The index command should be:

bin/nutch index -Dsolr.server.url=http://.../solr crawldb/ -linkdb linkdb/ segments/*
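Applied to the setup in the question (the URL and crawl paths are copied from there; verify they match your installation), the corrected invocation would look like this sketch:

```shell
# Sketch of the corrected command for the setup in the question.
# The crucial change: the Solr URL is passed as the Java property
# -Dsolr.server.url rather than as the first positional argument, which
# Hadoop would otherwise try to open as a filesystem and fail with
# "No FileSystem for scheme: http". Shown via echo for inspection;
# drop the echo (and the quotes around the glob) to actually run it.
echo bin/nutch index -Dsolr.server.url="http://10.209.18.213:8983/solr" \
    crawl/crawldb/ -linkdb crawl/linkdb/ 'crawl/segments/*' \
    -filter -normalize
```

Note that Hadoop's generic option parsing expects -D properties immediately after the command name, before the positional arguments.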

Run without arguments, Nutch commands print command-line help:

bin/nutch index
Usage: Indexer <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize] [-addBinaryContent] [-base64]
Active IndexWriters :
SOLRIndexWriter
        solr.server.url : URL of the SOLR instance
        solr.zookeeper.hosts : URL of the Zookeeper quorum
        solr.commit.size : buffer size when sending to SOLR (default 1000)
        solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
        solr.auth : use authentication (default false)
        solr.auth.username : username for authentication
        solr.auth.password : password for authentication