This is similar to the solr5.3.15-nutch question here, but with a few extra wrinkles. First, some background: I tried Solr 4.9.1 with Nutch and had no problems. Then I moved up to Solr 6.0.1. Integration worked great in standalone mode, and I got backend code working to parse the JSON, etc. However, we ultimately need security, and we don't want to use Kerberos. According to the Solr security documentation, basic auth and rule-based auth (which is what we want) work only in cloud mode (as an aside, if anyone has suggestions for getting non-Kerberos security working in standalone mode, that would work as well).

So I went through the doc at Solr-Cloud-Ref, using the interactive startup and taking all the defaults, except for the name of the collection, which I made "nndcweb" instead of "gettingstarted". The configset I used was data_driven_schema_configs.
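In case it matters, the cloud setup amounted to roughly the following (this is paraphrasing what the interactive script does under the hood, so treat the exact flags as approximate), with the plan being to push a security.json into ZooKeeper afterwards for basic auth:

# interactive SolrCloud startup, taking the defaults (embedded ZooKeeper ends up on localhost:9983)
bin/solr -e cloud
# roughly what the interactive prompts did for the collection, with nndcweb instead of gettingstarted
bin/solr create -c nndcweb -d data_driven_schema_configs -shards 2 -replicationFactor 2
# eventual plan for basic auth: upload a security.json to ZooKeeper, e.g.
server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 -cmd putfile /security.json /path/to/security.json

To integrate Nutch, I made many permutations of attempts. I'll only give the last two, which seemed to come closest based on what I've been able to find so far. Following the earlier Stack Overflow reference, the last one I tried was (note that all URLs actually have the http:// prefix, but the posting system for Stack Overflow was complaining, so I took them out for the sake of this post):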
bin/nutch index crawl/crawldb -linkdb crawl/linkdb -D solr.server.url=localhost:8939/solr/nndcweb/ -Dsolr.server.type=cloud -D solr.zookeeper.url=localhost:9983/ -dir crawl/segments/* -normalize
I ended up with the same problem noted in the previously mentioned thread, namely:
Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 15: solr.server.url=localhost:8939/solr/nndcweb
    at org.apache.hadoop.fs.Path.initialize(Path.java:206)
    at org.apache.hadoop.fs.Path.<init>(Path.java:172)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:217)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
Caused by: java.net.URISyntaxException: Illegal character in scheme name at index 15: solr.server.url=localhost:8939/solr/nndcweb
    at java.net.URI$Parser.fail(URI.java:2848)
    at java.net.URI$Parser.checkChars(URI.java:3021)
    at java.net.URI$Parser.parse(URI.java:3048)
    at java.net.URI.<init>(URI.java:746)
    at org.apache.hadoop.fs.Path.initialize(Path.java:203)
I also tried:
bin/nutch solrindex localhost:8983/solr/nndcweb crawl/crawldb -linkdb crawl/linkdb -Dsolr.server.type=cloud -D solr.zookeeper.url=localhost:9983/ -dir crawl/segments/* -normalize
and got the same thing. The help for solrindex indicates using -params with an "&" separating the options (in contrast to using -D). However, that only serves to tell my Linux shell to try to run some strange things in the background, of course.
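For the record, the & just needs quoting or escaping so the shell doesn't treat it as the background operator; presumably the form the help text intends is something like this (with the ZooKeeper address left as plain host:port, since ZooKeeper connection strings don't normally take an http:// scheme):

bin/nutch solrindex http://localhost:8983/solr/nndcweb crawl/crawldb -linkdb crawl/linkdb -params "solr.server.type=cloud&solr.zookeeper.url=localhost:9983" -dir crawl/segments/* -normalize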
Does anybody have any suggestions on what to try next? Thanks!
Update I updated the commands used above to reflect the correction of a silly mistake I made. Note that all URL references, in practice, do have the http:// prefix, but I had to take them out to be able to post. In spite of the fix, though, I'm still getting the same exception (a sample of which I used to replace the original above, again with the http:// cut out, which does make things confusing; sorry about that).
Yet Another Update So, this is interesting. Using the solrindex option, I just took out the port from the zookeeper URL, leaving just localhost (with the http:// prefix): 15 characters. The URISyntaxException says the problem is at index 18 (from org.apache.hadoop.fs.Path.initialize(Path.java:206)). This happens to match the "=" in "solr.zookeeper.url=". So it seems like hadoop.fs.Path.initialize() is taking the whole string as the URL. Perhaps I am not setting that up correctly? Or is this a bug in Hadoop? That would be hard to believe.
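(If that reading is right and the trailing "prop=value" tokens are being swallowed as segment paths, then presumably the -D options need to come before the positional arguments so Hadoop's generic option parser picks them up. I haven't verified this, but something along these lines:)

bin/nutch index -D solr.server.url=http://localhost:8983/solr/nndcweb -D solr.server.type=cloud -D solr.zookeeper.url=localhost:9983 crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments -normalize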
An Almost There Update Alright, given the results of the last attempt, I decided to put the solr server type (cloud) and the zookeeper URL in the nutch-site.xml config file.
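The nutch-site.xml entries look roughly like this (using the same property names as on the command line above; I'm taking it on faith that these are exactly the names the Solr indexer plugin reads):

<property>
  <name>solr.server.type</name>
  <value>cloud</value>
</property>
<property>
  <name>solr.zookeeper.url</name>
  <value>localhost:9983</value>
</property>

Then did: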
bin/nutch solrindex http://localhost:8983/solr/nndcweb crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments -normalize
(Great: no complaints about the URL from Stack Overflow now.) No URI exception anymore. Now the error I get is:
(cutting verbiage at the top)
Indexing 250 documents
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
Digging deeper into the nutch logs, I see the following:
No Collection Param specified on request and no default collection has been set.
Apparently, this has been mentioned on the Nutch mailing list, in connection with Nutch 1.11 and Solr 5 (cloud mode). There it was mentioned that it was not going to work, but that a patch would be uploaded (this was back in January 2016). Digging around on the Nutch development site, I haven't come across anything on this exact issue, only something a little bit similar for Nutch 1.13, which apparently is not officially released yet. Still digging, but if anybody actually has this working somehow, I'd love to hear how you did it.
Edit, July 12, 2016
So, after a few weeks' diversion on another unrelated project, I'm back to this. Before seeing S. Doe's response below, I decided to give Elasticsearch a try instead, as this is a completely new project and we're not tied to anything yet. So far so good: Nutch is working well with it, although to use the distributed binaries I had to back the Elasticsearch version down to 1.4.1. I haven't tried the security aspect yet. Out of curiosity, I will try S. Doe's suggestion with Solr eventually and will post how that goes later.