1 vote

We have an OpenLink Virtuoso-based triple (or rather quad-) store with about 6 billion triples in it. Our collaborators are asking us to give them a small subset of the data so they can test some of their queries and algorithms. Naturally, if we extract a random subset of graph-subject-predicate-object quads from the entire set, most of their SPARQL queries against the subset will find no solutions, because a small random subset of quads will represent an almost entirely disconnected graph. Is there a technique (possibly Virtuoso-specific) that would allow the extraction of a subset of quads s from the entire set S such that, for a given “select” or “construct” SPARQL query Q, Q executed against s would return the same solution as Q executed against the entire set S? If this could be done, it would be possible to run all sample queries that the collaborators want to be able to run against our dataset, extract the smallest possible subset, and send it to them (as an n-quads file) so they can load it into their triple store.

If you extract a "random subset", how can you ensure that an arbitrary query returns the same solutions as on the entire dataset? Maybe I don't understand you, but this sounds like the holy grail: shrinking a dataset without losing results for any query. If you do know the sample queries, then you could use a SPARQL CONSTRUCT query. - UninformedUser
@DmitriiRassokhin, probably you could expose your endpoint to your collaborator, setting up miscellaneous limitations. Or give them results of 3 or 4 iterations of DESCRIBE. Or give them results of a monstrous CONSTRUCT query which emulates these 3-4 rounds of DESCRIBE... FYI: chapter 3. - Stanislav Kralin
It's sometimes also called a Concise Bounded Description of depth n. But in any case, you need some starting point, and for DESCRIBE you'd need some particular instance/URI - UninformedUser
@AKSW I actually meant to say "for any given query". That is, there is a non-empty set of quads S and some SPARQL query Q, which, when executed against S, returns a non-empty result R. How can one extract the smallest possible subset of quads s from S such that, when the same query Q is executed against s, it returns the same result R? - Dmitrii Rassokhin
Ah, ok. It basically depends on the query, I'd say. I mean, for any extraction you have to start from a set of nodes in the graph, and then expand the graph as long as you find new triples. As far as I know, this is more or less impossible with a single SPARQL query, although there are some wildcard hacks for property paths like <p>|!<p>. But if you're using a script, you should be able to do this via a bunch of SPARQL queries. Not sure how slow this will be in the end. Maybe we could start with a running example, if you have something in mind - UninformedUser

3 Answers

1 vote

You must have a known count of entity types in your database, right? Assuming that to be true, why don't you simply apply a SPARQL DESCRIBE to a sampling of each entity per entity type?

Example:

```sparql
DESCRIBE ?EntitySample
WHERE
  {
    { SELECT (SAMPLE(?Entity) AS ?EntitySample)
             (COUNT(?Entity)  AS ?EntityCount)
             ?EntityType
      WHERE  { ?Entity a ?EntityType }
      GROUP BY ?EntityType
      HAVING (COUNT(?Entity) > 10)
      LIMIT 50
    }
  }
```
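If this approach fits, the query could also be submitted programmatically and the result saved as an N-Quads file for the collaborators. A minimal sketch in Python using only the standard library; the endpoint URL is an assumption (Virtuoso's default local SPARQL endpoint), and the `Accept` header asks for N-Quads output, assuming the server can serialize that format:

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint URL -- Virtuoso's default local SPARQL endpoint.
ENDPOINT = "http://localhost:8890/sparql"

def build_sparql_request(query, endpoint=ENDPOINT):
    """Prepare a POST request asking the endpoint to evaluate `query`
    and return the result serialized as N-Quads."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=data,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/n-quads",
        },
        method="POST",
    )

# The response body could then be fetched with
# urllib.request.urlopen(build_sparql_request(...)).read()
# and written straight to a .nq file.
```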

1 vote

This doesn't have a pre-built Virtuoso-specific solution. (ObDisclaimer: OpenLink Software produces Virtuoso, and employs me.)

As noted in the comments above, it's actually a rather complex question. From many perspectives, the simple answer to "what is the minimal set to deliver the same result to the same query" is "all of it," and this might well work out to be the final answer, due to the effort needed to arrive at any satisfactory smaller subset.

The following are more along the lines of experimental exploration, than concrete advice.

  • It sounds like your collaborators want to run some known query, so I'd start by running that query against your full dataset, then do a DESCRIBE over each ?s, ?p, and ?o that appears in the result, load all that output as your subset, and test the original query against that.

    If known, explicitly including all the ontological data from the large set in the small may help.

    If this sequence doesn't deliver the expected result set, you might try a second, third, or more rounds of the DESCRIBE, this time targeting every new ?s and ?p and ?o that appeared in the previous round.

  • The idea of exposing your existing endpoint, with the full data set, to your collaborators is worth considering. You could grant them only READ permissions and/or adjust server configuration to limit the processing time, result set size, and other aspects of their activity.

  • A sideways approach that may help in thinking about this: in the SQL Relational Table world, a useful subset of a single table is easy to produce. It can be just a few rows, including at least the one(s) that you want your query to return (and often at least a few that you want your query to not return).

    With a SQL Relational Schema involving multiple Tables, the useful subset extends to include the rows of each table which are relationally connected to the (un)desired row(s) in any of the others.

    Now, in the RDF Relational Graph world, each "row" of those SQL Relational Tables might be thought of as having been decomposed, with the primary key of each table becoming a ?s, and each column/field becoming a ?p, and each value becoming a ?o.

    The reality is (obviously?) more complex (you might look at the W3C RDF2RDF work, and the Direct Mapping and R2RML results of that work, for more detail), but this gives a conceptual starting point for considering how to find the quads (or triples) from an RDF dataset that comprise the minimum sub-dataset that will satisfy a given query against both dataset and subdataset.
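The round-by-round DESCRIBE idea from the first bullet can be sketched outside SPARQL as plain breadth-first expansion over an in-memory set of quads. This is a toy illustration of the technique, not a Virtuoso feature: quads are 4-tuples, and each round stands in for one batch of DESCRIBE calls over the nodes discovered so far:

```python
def extract_subset(quads, seed_nodes, rounds=2):
    """Breadth-first expansion: starting from the nodes that appear in
    a query's solutions, repeatedly pull in every quad whose subject or
    object is already known -- a rough stand-in for one round of
    DESCRIBE per node."""
    seen_nodes = set(seed_nodes)
    subset = set()
    for _ in range(rounds):
        new_nodes = set()
        for g, s, p, o in quads:
            if (s in seen_nodes or o in seen_nodes) and (g, s, p, o) not in subset:
                subset.add((g, s, p, o))
                new_nodes.update((s, o))
        new_nodes -= seen_nodes
        if not new_nodes:          # closure reached early
            break
        seen_nodes |= new_nodes
    return subset

# Hypothetical data: a chain a->b->c->d plus an unrelated quad.
quads = [
    ("g", "a", "knows", "b"),
    ("g", "b", "knows", "c"),
    ("g", "c", "knows", "d"),
    ("g", "x", "knows", "y"),
]
subset = extract_subset(quads, {"a"}, rounds=2)
```

With two rounds, the subset reaches two hops out from the seed node while the disconnected ("g", "x", "knows", "y") quad is never pulled in; more rounds widen the neighborhood, which mirrors the "second, third, or more rounds of DESCRIBE" suggestion above.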

0 votes

Querying a subset of quads cannot, in general, give the same results as querying the entire dataset, because by keeping only a small percentage of the quads, you lose quads that belong in the answer too.