
I'm having difficulty indexing TREC in Lucene 7. Until now I only needed to index text files, which was easily achievable using an InputStreamReader as described in the demo.

/** Indexes a single document */
static void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOException {
try (InputStream stream = Files.newInputStream(file)) {
  // make a new, empty document
  Document doc = new Document();

  Field pathField = new StringField("path", file.toString(), Field.Store.YES);
  doc.add(pathField);

  doc.add(new LongPoint("modified", lastModified));

  doc.add(new TextField("contents", new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8))));

  if (writer.getConfig().getOpenMode() == OpenMode.CREATE) {
    System.out.println("adding " + file);
    writer.addDocument(doc);
  } else {
    System.out.println("updating " + file);
    writer.updateDocument(new Term("path", file.toString()), doc);
  }
}
}

But TREC has various tags that store information not relevant to the search results, such as Header, Title, DocNo and many more. How would I adjust this code to save specific tags in their own text field with their appropriate content?


1 Answers


Answering my own question since I found a solution. It might not be optimal, and it is by no means the best-looking one.

My solution is to take the complete InputStream and read it line by line, taking the appropriate action when a certain tag is found. Here is a small example:

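        BufferedReader in = new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
        String read;
        // Declared outside the loop so the flag survives from one line to the next.
        boolean text = false;
        while ((read = in.readLine()) != null) {

            // Split the line on whitespace so each tag is its own token.
            String[] splited = read.split("\\s+");

            for (String part : splited) {
                if (part.equals("<TEXT>")) {
                    text = true;
                }
            }
        }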

This solves my problem, but I'm fairly certain that there is a better-looking solution out there.
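
One possibly tidier variant, as a minimal sketch only: read the whole document into a String and pull out each tag's content with a regular expression instead of tracking flags by hand. The class name `TrecTagExtractor` and the method `extractTags` are made up for illustration, and the sketch assumes each tag appears as a plain `<TAG>...</TAG>` pair without attributes or nesting.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene.
class TrecTagExtractor {

    /**
     * Returns the trimmed content of the first occurrence of each requested tag.
     * Tags that do not occur in the document are left out of the map.
     */
    static Map<String, String> extractTags(String rawDoc, String... tagNames) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String tag : tagNames) {
            // DOTALL lets '.' span line breaks; '.*?' stops at the first closing tag.
            Pattern p = Pattern.compile(
                    "<" + Pattern.quote(tag) + ">(.*?)</" + Pattern.quote(tag) + ">",
                    Pattern.DOTALL);
            Matcher m = p.matcher(rawDoc);
            if (m.find()) {
                fields.put(tag, m.group(1).trim());
            }
        }
        return fields;
    }
}
```

Inside indexDoc, each entry of the returned map could then presumably go into its own field, for example with `doc.add(new TextField(entry.getKey(), entry.getValue(), Field.Store.YES))`. Since only the first occurrence of each tag is picked up, a file that holds several `<DOC>` blocks would need to be split into individual documents first.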