Print a Lucene index in inverted index format

2

votes

As I understand it, Lucene uses an inverted index. Is there a way to extract or print a Lucene index (Lucene 6) in inverted index format, like this:

term1   <doc1, doc100, ..., doc555>
term2   <doc1, ..., doc100, ..., doc89>
term3   <doc3, doc2, doc5, ...>
.
.
.
termN   <doc10, doc43, ..., docK>

lucene inverted-index

1

votes

You can use a TermsEnum to iterate over the terms in your inverted index. Then, for each term, use its PostingsEnum to iterate over the postings. The following code was written for an index with a single segment (Lucene 6.5.1):

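import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.*;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

String indexPath = "your_index_path";
String field = "your_index_field";
try (FSDirectory directory = FSDirectory.open(Paths.get(indexPath));
     IndexReader reader = DirectoryReader.open(directory)) {
    Terms terms = MultiFields.getTerms(reader, field);
    final TermsEnum it = terms.iterator();
    BytesRef term = it.next();
    while (term != null) {
        String termString = term.utf8ToString();
        System.out.print(termString + ": ");
        int postingSize = 0; // number of documents containing this term
        for (LeafReaderContext lrc : reader.leaves()) {
            LeafReader lr = lrc.reader();
            PostingsEnum pe = lr.postings(new Term(field, termString));
            if (pe == null) {
                continue; // the term does not occur in this segment
            }
            int docId = pe.nextDoc();
            while (docId != PostingsEnum.NO_MORE_DOCS) {
                postingSize++;
                Document doc = lr.document(docId);
                // here you can print your document title, id, etc.
                docId = pe.nextDoc();
            }
        }
        System.out.println("(" + postingSize + " docs)");
        term = it.next();
    }
} catch (IOException e) {
    e.printStackTrace();
}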

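If your index has more than one segment, reader.leaves() returns one LeafReaderContext per segment. The list is already flat, so no tree traversal is needed and the for loop visits every segment. Two details matter in that case: lr.postings(...) returns null for a segment that does not contain the term, so check for null before calling nextDoc(); and the doc IDs returned by the PostingsEnum are local to the segment, so add lrc.docBase if you need index-wide doc IDs.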
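Because each leaf numbers its documents from zero, a posting read from a leaf has to be shifted by that leaf's docBase (exposed in Lucene as LeafReaderContext.docBase) to obtain an index-wide document ID. The following is a plain-Java model of that bookkeeping, not Lucene code: the Segment class and the globalPostings method are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DocBaseSketch {

    /** Hypothetical stand-in for one index segment (one leaf reader). */
    static final class Segment {
        final int docBase;                 // index-wide ID of this segment's document 0
        final Map<String, int[]> postings; // term -> segment-local doc IDs

        Segment(int docBase, Map<String, int[]> postings) {
            this.docBase = docBase;
            this.postings = postings;
        }
    }

    /** Collects the index-wide doc IDs for a term across all segments. */
    static List<Integer> globalPostings(List<Segment> segments, String term) {
        List<Integer> result = new ArrayList<>();
        for (Segment segment : segments) {
            int[] local = segment.postings.get(term);
            if (local == null) {
                continue; // the term does not occur in this segment
            }
            for (int localDocId : local) {
                result.add(segment.docBase + localDocId);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, int[]> first = new HashMap<>();
        first.put("lucene", new int[] {0, 2});
        Map<String, int[]> second = new HashMap<>();
        second.put("lucene", new int[] {1});
        second.put("index", new int[] {0});

        // The first segment holds 3 documents, so the second one starts at docBase 3.
        List<Segment> segments = Arrays.asList(new Segment(0, first), new Segment(3, second));

        System.out.println("lucene: " + globalPostings(segments, "lucene")); // prints "lucene: [0, 2, 4]"
        System.out.println("index: " + globalPostings(segments, "index"));   // prints "index: [3]"
    }
}
```

The null check in the model mirrors the case where a segment has no postings for the term at all.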

1

votes

I am using Lucene 6.x.x and do not know of an easy way, but a solution is better than no solution at all. Something like this works for me, using MatchAllDocsQuery:

private static void printWholeIndex(IndexSearcher searcher) throws IOException {
    MatchAllDocsQuery query = new MatchAllDocsQuery();
    TopDocs hits = searcher.search(query, Integer.MAX_VALUE);

    Map<String, Set<Integer>> invertedIndex = new HashMap<>();

    if (null == hits.scoreDocs || hits.scoreDocs.length <= 0) {
        System.out.println("No Hits Found with MatchAllDocsQuery");
        return;
    }

    for (ScoreDoc hit : hits.scoreDocs) {
        Document doc = searcher.doc(hit.doc);
        List<IndexableField> allFields = doc.getFields();

        for (IndexableField field : allFields) {
            // Single document inverted index
            Terms terms = searcher.getIndexReader().getTermVector(hit.doc, field.name());
            if (terms == null) {
                continue;
            }
            TermsEnum termsEnum = terms.iterator();
            while (termsEnum.next() != null) {
                String term = termsEnum.term().utf8ToString();
                if (invertedIndex.containsKey(term)) {
                    invertedIndex.get(term).add(hit.doc);
                } else {
                    Set<Integer> docs = new TreeSet<>();
                    docs.add(hit.doc);
                    invertedIndex.put(term, docs);
                }
            }
        }
    }

    System.out.println("Printing Inverted Index:");
    invertedIndex.forEach((key, value) -> System.out.println(key + ":" + value));
}

Two points:

1. The maximum number of documents supported is Integer.MAX_VALUE. I have not tried it, but this limit can probably be removed by using the searcher's searchAfter method and performing multiple searches.

2. doc.getFields() returns only the fields that are stored. If not all of your indexed fields are stored, you can probably keep a static array of field names instead, since the line Terms terms = searcher.getIndexReader().getTermVector(hit.doc, field.name()); also works for fields that are not stored, as long as term vectors were enabled for them at index time.
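The containsKey/get/put branching in printWholeIndex can be written more compactly with Map.computeIfAbsent, which creates the posting set the first time a term is seen and returns the existing set afterwards. A minimal plain-Java sketch of that idiom follows. The class name PostingMapSketch and the helper addPosting are hypothetical, and a TreeMap is assumed here only so that terms print in sorted order.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class PostingMapSketch {

    /** Records that docId contains term, creating the posting set on first use. */
    static void addPosting(Map<String, Set<Integer>> invertedIndex, String term, int docId) {
        invertedIndex.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> invertedIndex = new TreeMap<>();
        addPosting(invertedIndex, "proof", 0);
        addPosting(invertedIndex, "doubt", 0);
        addPosting(invertedIndex, "proof", 1);
        addPosting(invertedIndex, "proof", 1); // duplicates are absorbed by the set

        // Same output style as the forEach in printWholeIndex.
        invertedIndex.forEach((term, docs) -> System.out.println(term + ":" + docs));
        // prints:
        // doubt:[0]
        // proof:[0, 1]
    }
}
```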

0

votes

Worked out a version that prints docId:tokenPos for Lucene 6.6.

Directory directory = new RAMDirectory();
Analyzer analyzer = new StandardAnalyzer();
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setOpenMode(OpenMode.CREATE);
IndexWriter writer = new IndexWriter(directory, iwc);

FieldType type = new FieldType();
type.setStoreTermVectors(true);
type.setStoreTermVectorPositions(true);
type.setStoreTermVectorOffsets(true);
type.setIndexOptions(IndexOptions.DOCS);

Field fieldStore = new Field("text", "We hold that proof beyond a reasonable doubt is required.", type);
Document doc = new Document();
doc.add(fieldStore);
writer.addDocument(doc);

fieldStore = new Field("text", "We hold that proof requires reasonable preponderance of the evidence.", type);
doc = new Document();
doc.add(fieldStore);
writer.addDocument(doc);

writer.close();

DirectoryReader reader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);

MatchAllDocsQuery query = new MatchAllDocsQuery();
TopDocs hits = searcher.search(query, Integer.MAX_VALUE);

Map<String, Set<String>> invertedIndex = new HashMap<>();
BiFunction<Integer, Integer, Set<String>> mergeValue = 
    (docId, pos)-> {TreeSet<String> s = new TreeSet<>(); s.add((docId+1)+":"+pos); return s;};

for (ScoreDoc scoreDoc : hits.scoreDocs) {
    Fields termVs = reader.getTermVectors(scoreDoc.doc);
    Terms terms = termVs.terms("text");
    TermsEnum termsIt = terms.iterator();
    PostingsEnum docsAndPosEnum = null;
    BytesRef bytesRef;
    while ((bytesRef = termsIt.next()) != null) {
        docsAndPosEnum = termsIt.postings(docsAndPosEnum, PostingsEnum.ALL);
        docsAndPosEnum.nextDoc(); // a term vector covers exactly one document
        String term = bytesRef.utf8ToString();
        // record every position of the term, not only its first occurrence
        for (int i = 0; i < docsAndPosEnum.freq(); i++) {
            int pos = docsAndPosEnum.nextPosition();
            invertedIndex.merge(
                term,
                mergeValue.apply(scoreDoc.doc, pos),
                (s1, s2) -> { s1.addAll(s2); return s1; }
            );
        }
    }
}
System.out.println(invertedIndex);
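To show the shape of the result without setting up an index, here is a plain-Java sketch of the same Map.merge accumulation. It is only a model: it splits on whitespace instead of using StandardAnalyzer, so the tokens and positions will not match what Lucene reports for the same text, and the class and method names are hypothetical. As in the snippet above, document IDs are printed 1-based and positions 0-based.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class PositionalIndexSketch {

    /** Builds term -> {"docId:position", ...} from whitespace-tokenized documents. */
    static Map<String, Set<String>> build(List<String> docs) {
        Map<String, Set<String>> invertedIndex = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            String[] tokens = docs.get(docId).trim().split("\\s+");
            for (int pos = 0; pos < tokens.length; pos++) {
                Set<String> posting = new TreeSet<>();
                posting.add((docId + 1) + ":" + pos);
                // Merge the single-entry set into whatever is already stored for the term.
                invertedIndex.merge(tokens[pos], posting, (existing, added) -> {
                    existing.addAll(added);
                    return existing;
                });
            }
        }
        return invertedIndex;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("we hold that proof", "proof requires evidence");
        System.out.println(build(docs));
        // prints {evidence=[2:2], hold=[1:1], proof=[1:3, 2:0], requires=[2:1], that=[1:2], we=[1:0]}
    }
}
```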
