A part of my project is to index the s-p-o in ntriples and I need some help figuring out how exactly to do this via Java (or other language if possible).
Problem statement: We have around 10 files with the extension “.ntriple”. Each file having at least 10k triples. The format of this file is multiple RDF TRIPLEs
<subject1_uri> <predicate1_uri> <object1_uri>
<subject2_uri> <predicate1_uri> <object2_uri>
<subject2_uri> <predicate1_uri> <object3_uri>
…..
…..
What I need to perform is, index each of these subjects, predicates and objects so that we can have a fast search and retrieve for queries like “Give me all subjects and objects for predicate1_uri” and so on.
I gave it a try using this example but I saw that this was doing a Full Text Search. This doesn't seem to be efficient as the ntriple files could be as large as 50MB per file.
Then I thought of NOT doing a full text search, instead just store s-p-o as an index Document and each (s,p,o) as a Document Field with another Field as Id (offset of the s-p-o in corresponding ntriple file).
I have two questions:
- Is Lucene the Only option for what I am trying to achieve ?
- Would the size of the Index files themselves be larger than Half the size of the data itself ?!
Any and all help really appreciated.