How to delete data from Lucene's index exactly by document id?

Question

I have a collection of documents in MongoDB (url: String, title: String, content: String). url is a unique field and contains something like server://aaa/bbb/1.html.

I would like to index the data with Lucene, not Mongo (I can change the storage). I'm going to store url in Lucene's index. When a user searches by keywords, I'll run the query with Lucene, read the url field, and go to Mongo to fetch the doc by that url. This works well.

But I can't delete data from Lucene's index by url because it contains a lot of disallowed symbols. I use the following settings for the url field:

store = true
analyzed = false
indexed = true

(Should I index this field? What if I don't? Will Lucene do a full scan? The collection can contain millions of documents.)

If I want good performance, should I create a secondary index (Int or Long) and not search by url?

I use the latest versions of the JVM, Lucene, Ubuntu and Mongo.

What do you mean by "a lot of not allowed symbols"? You could use a special analyzer in Lucene to keep the URL field as is, or nearly so. — Mysterion
Also, show some code where you try to delete docs and it doesn't work. — Mysterion
I'm using Clojure and the github.com/weavejester/clucy wrapper around Lucene. But it doesn't matter, I can write my own implementation around IndexWriter. I'm interested in how to do it in Java, and then I'll implement it in Clojure. The main question is: should I analyze and index the url field? Is that the correct way to delete documents from the index? — Curiosity
The problem is that I can run the query "some.url" but can't run "some.url" - a trailing char causes an exception. I suppose the default parser has problems with URLs. — Curiosity

Mysterion · Accepted Answer · 2015-02-11T10:52:40

You need to properly encode the URL in your query; that should help.

E.g. in your case some.url/foo should be passed in the query as some.url%2Ffoo. You can try encoding/decoding online here - http://www.url-encode-decode.com/

For more info about escaping characters in a Solr query, take a look here - https://wiki.apache.org/solr/SolrQuerySyntax#NOTE:_URL_Escaping_Special_Characters
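Since the question asks how to do this in Java, here is a minimal sketch of the same encoding done in code with the JDK's `java.net.URLEncoder` rather than an online tool. The class name `UrlEncodeExample` and the method `encode` are made up for illustration and are not part of Lucene or clucy. Note that `URLEncoder` applies form-encoding rules, so a space would become `+` rather than `%20`.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for illustration; the names are not from the original post.
final class UrlEncodeExample {

    // Percent-encodes a value as UTF-8 using application/x-www-form-urlencoded rules.
    static String encode(String raw) {
        try {
            return URLEncoder.encode(raw, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The example from the answer: the slash becomes %2F.
        System.out.println(encode("some.url/foo"));
        // The URL shape from the question.
        System.out.println(encode("server://aaa/bbb/1.html"));
    }
}
```

The value must be encoded the same way at index time and at query or delete time, otherwise the stored term and the lookup term will not match.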
