1 vote

We have an OpenLink Virtuoso-based triple (or rather quad-) store with about 6 billion triples in it. Our collaborators are asking us to give them a small subset of the data so they can test some of their queries and algorithms. Naturally, if we extract a random subset of graph-subject-predicate-object quads from the entire set, most of their SPARQL queries against the subset will find no solutions, because a small random subset of quads will represent an almost entirely disconnected graph. Is there a technique (possibly Virtuoso-specific) that would allow the extraction of a subset of quads s from the entire set S such that, for a given “select” or “construct” SPARQL query Q, Q executed against s would return the same solution as Q executed against the entire set S? If this could be done, it would be possible to run all sample queries that the collaborators want to be able to run against our dataset, extract the smallest possible subset, and send it to them (as an n-quads file) so they can load it into their triple store.

If you extract a "random subset", how can you ensure that an arbitrary query returns the same solutions as on the entire dataset? Maybe I don't understand you, but this sounds like the holy grail: shrinking a dataset without losing results for any query. If you do know the sample queries, then you could use a SPARQL CONSTRUCT query. - UninformedUser
@DmitriiRassokhin, probably you could expose your endpoint to your collaborator, setting up miscellaneous limitations. Or give them results of 3 or 4 iterations of DESCRIBE. Or give them results of a monstrous CONSTRUCT query which emulates these 3-4 rounds of DESCRIBE... FYI: chapter 3. - Stanislav Kralin
It's sometimes also called a Concise Bounded Description of depth n. But in any case, you need some starting point, and for DESCRIBE you'd need some particular instance/URI - UninformedUser
@AKSW I actually meant to say "for any given query". That is, there is a non-empty set of quads S and some SPARQL query Q, which, when executed against S, returns a non-empty result R. How can one extract the smallest possible subset of quads s from S such that, when the same query Q is executed against s, it returns the same result R? - Dmitrii Rassokhin
Ah, ok. It basically depends on the query, I'd say. I mean, for any extraction you have to start from a set of nodes in the graph, and then expand the graph as long as you find new triples. As far as I know, this is more or less impossible with a single SPARQL query, although there are some wildcard hacks for property paths like <p>|!<p>. But if you're using a script, you should be able to do this via a bunch of SPARQL queries. Not sure how slow this will be in the end. Maybe we could start with a running example, if you have something in mind - UninformedUser

3 Answers

1 vote

You must have a known count of entity types in your database, right? Assuming that to be true, why don't you simply apply a SPARQL DESCRIBE to a sampling of each entity per entity type?

Example:

```sparql
DESCRIBE ?EntitySample
WHERE
  {
    { SELECT (SAMPLE(?Entity) AS ?EntitySample)
             (COUNT(?Entity)  AS ?EntityCount)
             ?EntityType
      WHERE  { ?Entity a ?EntityType }
      GROUP BY ?EntityType
      HAVING (COUNT(?Entity) > 10)
      LIMIT 50
    }
  }
```
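If this approach fits, the query could also be submitted programmatically and the result saved as an N-Quads file for the collaborators. A minimal sketch in Python using only the standard library; the endpoint URL is an assumption (Virtuoso's default local SPARQL endpoint), and the `Accept` header asks for N-Quads output, assuming the server can serialize that format:

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint URL -- Virtuoso's default local SPARQL endpoint.
ENDPOINT = "http://localhost:8890/sparql"

def build_sparql_request(query, endpoint=ENDPOINT):
    """Prepare a POST request asking the endpoint to evaluate `query`
    and return the result serialized as N-Quads."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=data,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/n-quads",
        },
        method="POST",
    )

# The response body could then be fetched with
# urllib.request.urlopen(build_sparql_request(...)).read()
# and written straight to a .nq file.
```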

1 vote

This doesn't have a pre-built Virtuoso-specific solution. (ObDisclaimer: OpenLink Software produces Virtuoso, and employs me.)

As noted in the comments above, it's actually a rather complex question. From many perspectives, the simple answer to "what is the minimal set to deliver the same result to the same query" is "all of it," and this might well work out to be the final answer, due to the effort needed to arrive at any satisfactory smaller subset.

The following are more along the lines of experimental exploration, than concrete advice.

  • It sounds like your collaborators want to run some known query, so I'd start by running that query against your full dataset, then do a DESCRIBE over each ?s, ?p, and ?o that appears in the result, load all that output as your subset, and test the original query against that.

    If known, explicitly including all the ontological data from the large set in the small may help.

    If this sequence doesn't deliver the expected result set, you might try a second, third, or more rounds of the DESCRIBE, this time targeting every new ?s and ?p and ?o that appeared in the previous round.

  • The idea of exposing your existing endpoint, with the full data set, to your collaborators is worth considering. You could grant them only READ permissions and/or adjust server configuration to limit the processing time, result set size, and other aspects of their activity.

  • A sideways approach that may help in thinking about this: in the SQL Relational Table world, a useful subset of a single table is easy to produce. It can be just a few rows, including at least the one(s) that you want your query to return (and often at least a few that you want your query to not return).

    With a SQL Relational Schema involving multiple Tables, the useful subset extends to include the rows of each table which are relationally connected to the (un)desired row(s) in any of the others.

    Now, in the RDF Relational Graph world, each "row" of those SQL Relational Tables might be thought of as having been decomposed, with the primary key of each table becoming a ?s, and each column/field becoming a ?p, and each value becoming a ?o.

    The reality is (obviously?) more complex (you might look at the W3C RDF2RDF work, and the Direct Mapping and R2RML results of that work, for more detail), but this gives a conceptual starting point for considering how to find the quads (or triples) from an RDF dataset that comprise the minimum sub-dataset that will satisfy a given query against both dataset and subdataset.
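The round-by-round DESCRIBE idea from the first bullet can be sketched outside SPARQL as plain breadth-first expansion over an in-memory set of quads. This is a toy illustration of the technique, not a Virtuoso feature: quads are 4-tuples, and each round stands in for one batch of DESCRIBE calls over the nodes discovered so far:

```python
def extract_subset(quads, seed_nodes, rounds=2):
    """Breadth-first expansion: starting from the nodes that appear in
    a query's solutions, repeatedly pull in every quad whose subject or
    object is already known -- a rough stand-in for one round of
    DESCRIBE per node."""
    seen_nodes = set(seed_nodes)
    subset = set()
    for _ in range(rounds):
        new_nodes = set()
        for g, s, p, o in quads:
            if (s in seen_nodes or o in seen_nodes) and (g, s, p, o) not in subset:
                subset.add((g, s, p, o))
                new_nodes.update((s, o))
        new_nodes -= seen_nodes
        if not new_nodes:          # closure reached early
            break
        seen_nodes |= new_nodes
    return subset

# Hypothetical data: a chain a->b->c->d plus an unrelated quad.
quads = [
    ("g", "a", "knows", "b"),
    ("g", "b", "knows", "c"),
    ("g", "c", "knows", "d"),
    ("g", "x", "knows", "y"),
]
subset = extract_subset(quads, {"a"}, rounds=2)
```

With two rounds, the subset reaches two hops out from the seed node while the disconnected ("g", "x", "knows", "y") quad is never pulled in; more rounds widen the neighborhood, which mirrors the "second, third, or more rounds of DESCRIBE" suggestion above.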

0 votes

Querying a subset of quads cannot, in general, give the same results as querying the entire dataset, because by keeping only a small percentage of the quads, you lose quads that belong in the answer too.