Indexing NTriples with Lucene

Question

A part of my project is to index the s-p-o in ntriples and I need some help figuring out how exactly to do this via Java (or other language if possible).

Problem statement: We have around 10 files with the extension “.ntriple”. Each file having at least 10k triples. The format of this file is multiple RDF TRIPLEs

<subject1_uri> <predicate1_uri> <object1_uri>
<subject2_uri> <predicate1_uri> <object2_uri>
<subject2_uri> <predicate1_uri> <object3_uri>
            …..
            …..

What I need to perform is, index each of these subjects, predicates and objects so that we can have a fast search and retrieve for queries like “Give me all subjects and objects for predicate1_uri” and so on.

I gave it a try using this example but I saw that this was doing a Full Text Search. This doesn't seem to be efficient as the ntriple files could be as large as 50MB per file.

Then I thought of NOT doing a full text search, instead just store s-p-o as an index Document and each (s,p,o) as a Document Field with another Field as Id (offset of the s-p-o in corresponding ntriple file).

I have two questions:

Is Lucene the Only option for what I am trying to achieve ?
Would the size of the Index files themselves be larger than Half the size of the data itself ?!

Any and all help really appreciated.

Rob Hall Rob Hall · Accepted Answer · 2018-03-30T13:40:01

To answer your first question: No, Lucene is not the only option to do this. You can (and probably should) use any generic RDF database to store the triples. Then you can query the triples using their Java API or using SPARQL. I personally recommend Apache Jena as a Java API for working with RDF.

If you need free-text search across literals in your dataset, there is Lucene Integration with Apache Jena through Jena Text.

Regarding index sizes, this depends entirely on the entropy of your data. If you have 40,000 lines in an NTRIPLE file, but it's all replications of the same triples, then the index will be relatively small. Typically, however, RDF databases make multiple indexes of the data and you will see a size increase.

The primary benefit of this indexing is that you can ask more generic questions than “Give me all subjects and objects for predicate1_uri”. That question can be answered by linearly processing all of the NTRIPLE files without even knowing you're using RDF. The following SPARQL-like query shows an example of a more difficult search facilitated by these data stores:

SELECT DISTINCT ?owner
WHERE {
  ?owner :owns ?thing
  ?thing rdf:type/rdfs:subClassOf :Automobile
  ?thing :hasColor "red"@en
}

In the preceding query, we locate owners of something that is an automobile or any more specific sublcass of automobile so long as the color of that thing is "red" (as specified in english).

Indexing NTriples with Lucene

1 Answers