3
votes

I'm new to Lucene. I want to write sample code for PyLucene 6.5 in Python 3. I adapted the sample code below to that version, but I could find little documentation and I'm not sure my changes are correct.

# indexer.py
import sys
import lucene

from java.io import File
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field, StringField, FieldType
from org.apache.lucene.index import IndexWriter, IndexWriterConfig
from org.apache.lucene.store import SimpleFSDirectory, FSDirectory
from org.apache.lucene.util import Version

if __name__ == "__main__":
    lucene.initVM()
    indexPath = File("index/").toPath()
    indexDir = FSDirectory.open(indexPath)
    writerConfig = IndexWriterConfig(StandardAnalyzer())
    writer = IndexWriter(indexDir, writerConfig)

    print("%d docs in index" % writer.numDocs())
    print("Reading lines from sys.stdin...")

    tft = FieldType()
    tft.setStored(True)
    tft.setTokenized(True)
    n = 0  # stays 0 if stdin is empty
    for n, l in enumerate(sys.stdin, 1):  # count lines starting at 1
        doc = Document()
        doc.add(Field("text", l, tft))
        writer.addDocument(doc)
    print("Indexed %d lines from stdin (%d docs in index)" % (n, writer.numDocs()))
    print("Closing index of %d docs..." % writer.numDocs())
    writer.close()

This code reads lines from stdin and stores them in the index directory.

# retriever.py
import sys
import lucene

from java.io import File
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field
from org.apache.lucene.search import IndexSearcher
from org.apache.lucene.index import IndexReader, DirectoryReader
from org.apache.lucene.queryparser.classic import QueryParser
from org.apache.lucene.store import SimpleFSDirectory, FSDirectory
from org.apache.lucene.util import Version

if __name__ == "__main__":
    lucene.initVM()
    analyzer = StandardAnalyzer()
    indexPath = File("index/").toPath()
    indexDir = FSDirectory.open(indexPath)
    reader = DirectoryReader.open(indexDir)
    searcher = IndexSearcher(reader)

    query = QueryParser("text", analyzer).parse("hello")
    MAX = 1000
    hits = searcher.search(query, MAX)

    print("Found %d document(s) that matched query '%s':" % (hits.totalHits, query))
    for hit in hits.scoreDocs:
        print(hit.score, hit.doc, hit.toString())
        doc = searcher.doc(hit.doc)
        print(doc.get("text"))  # already a str in Python 3; no encode needed

retriever.py should be able to search the index, but it returns no results. What's wrong with it?

2 Answers

2
votes

I think the best way for you to get started is to download PyLucene's tarball (version of your choice):

https://www.apache.org/dist/lucene/pylucene/

Inside you will find a test3/ folder (for Python 3; test2/ for Python 2) containing Python tests. These cover common operations such as indexing, reading, searching, and much more. I found them extremely helpful, given the terrible lack of documentation around PyLucene.

Check out test_PyLucene.py in particular.

Note

This is also a very good way to quickly grasp the changes and adapt your code across releases if the Changelog is not intuitive enough for you.

(Why I'm not providing code in this answer: PyLucene code snippets in SO answers quickly become obsolete as soon as a new version is released, as most of the existing answers show.)

1
votes
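Inspecting the FieldType built in indexer.py shows that the "text" field is never indexed:

In []: tft.indexOptions()
Out[]: <IndexOptions: NONE>

Although the documentation says DOCS_AND_FREQS_AND_POSITIONS is the default, that is no longer the case for a bare FieldType. It is the default for a TextField; on a FieldType you must call setIndexOptions explicitly, otherwise the field is stored but not searchable.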
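As a minimal sketch of the two fixes this implies (not tested against every PyLucene release), the field type can either set its index options explicitly or be replaced by a TextField. The helper names make_text_field_type and make_text_field are hypothetical, chosen only for illustration; the imports are deferred into the functions on the assumption that lucene has been imported and lucene.initVM() has been called before either helper is used.

```python
def make_text_field_type():
    """Return a FieldType that is stored, tokenized, and indexed.

    Hypothetical helper; assumes lucene.initVM() has already been called.
    """
    from org.apache.lucene.document import FieldType
    from org.apache.lucene.index import IndexOptions

    tft = FieldType()
    tft.setStored(True)
    tft.setTokenized(True)
    # Without this call the type keeps IndexOptions.NONE,
    # so the field is stored but never searchable.
    tft.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS)
    return tft


def make_text_field(name, value):
    """Alternative: a stored TextField, which is indexed by default.

    Hypothetical helper; assumes lucene.initVM() has already been called.
    """
    from org.apache.lucene.document import Field, TextField

    return TextField(name, value, Field.Store.YES)
```

In indexer.py, either build tft with make_text_field_type() or replace Field("text", l, tft) with make_text_field("text", l). Documents added while the index options were NONE have no postings, so the index has to be rebuilt before retriever.py can find them.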