Optimize SPARQL query to load SAMPLE labels

Question

The following query loads contracts from my data set (a contract is between an organization and a partner).

SELECT ?contract ?organisation ?partner
WHERE {
    ?organisation
        a gr:BusinessEntity ;
        rejstriky:contract ?contract .

    ?contract a rejstriky:Contract ;
        rejstriky:partner ?partner .
}
GROUP BY ?contract ?organisation ?partner

This query returns around 8000 contracts almost immediately (it takes just a fraction of a second). Now I need to load labels/names for both the organization and the partner. There might be multiple names available, but I need only one. This is my query:

SELECT ?contract ?organisation ?partner
    (SAMPLE(?organisationNames) AS ?organisationName)
    (SAMPLE(?partnerNames) AS ?partnerName)
WHERE {
    ?organisation
        a gr:BusinessEntity ;
        rejstriky:contract ?contract .

    ?contract a rejstriky:Contract ;
        rejstriky:partner ?partner .

    ?organisation gr:legalName ?organisationNames .
    ?partner gr:legalName ?partnerNames .
}
GROUP BY ?contract ?organisation ?partner

This query suddenly takes several minutes to finish.

I did some experiments and found that fetching the names with separate SPARQL calls (40 names per batch) would take less than 2 minutes, which is significantly faster. Regardless, if I'm able to generate those 8000 items within a fraction of a second, loading two labels for each item should not take that long.

Do you have any ideas how to optimize my query? Note that I'm using Virtuoso.

This looks like a minor issue in Virtuoso's query planner - there's no obvious reason this should take so long. Have you tried reporting the issue directly to see whether they have a solution? — Jeen Broekstra
My first guess is a query optimization error within Virtuoso. Have you tested the speed without the SAMPLE aggregate? That is, changing the SELECT list to ?contract ?organisation ?partner ?organisationNames ?partnerNames? You might also raise this on the Virtuoso Users mailing list or the OpenLink Support Forums, whose audiences include several members of the Virtuoso Development team... — TallTed
I found out that the dataset is slightly corrupted. Around 1000 of the partners are represented by a single URI, which therefore has 1000 different legal names attached. If I remove the aggregation, the query actually runs about 3 times faster, but it generates 41 million entries. Maybe this is what's tripping up the SAMPLE aggregation and slowing the query down. However, I'd still say that selecting one sample value for 8000 items should be reasonably fast regardless of the size of the set I'm choosing from. What do you think? — tobik
Okay, explicitly filtering out that single URI representing 1000 different partners reduced the query duration to just a few seconds. I'll see if I can fix the dataset. Thanks for the help! — tobik
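
One way to spot this kind of data problem is to count the legal names attached to each entity. The query below is a minimal sketch and not something taken from the thread: it assumes gr: is the GoodRelations namespace and that names are stored under gr:legalName, as in the question. The variable names ?entity and ?nameCount are illustrative.

```sparql
# Assumption: gr: is the standard GoodRelations namespace.
PREFIX gr: <http://purl.org/goodrelations/v1#>

# List the entities carrying the most legal names. A URI with
# hundreds of names is a likely source of the row explosion.
SELECT ?entity (COUNT(DISTINCT ?name) AS ?nameCount)
WHERE {
    ?entity gr:legalName ?name .
}
GROUP BY ?entity
ORDER BY DESC(?nameCount)
LIMIT 10
```

Any URI that shows up with far more names than the rest multiplies the join: before SAMPLE picks a value, each contract contributes one row per combination of organisation name and partner name.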

chrisis · Accepted Answer · 2016-04-06T08:48:55

Without access to sample data or Virtuoso, it's difficult to be sure whether this will help, but you might try avoiding the use of SAMPLE.

SELECT ?contract ?organisation ?organisationName ?partner ?partnerName
WHERE {
    ?organisation
        a gr:BusinessEntity ;
        rejstriky:contract ?contract .

    ?contract a rejstriky:Contract ;
        rejstriky:partner ?partner .

    { SELECT ?organisationName WHERE { ?organisation gr:legalName ?organisationName . } LIMIT 1 }
    { SELECT ?partnerName WHERE { ?partner gr:legalName ?partnerName . } LIMIT 1 }
}
GROUP BY ?contract ?organisation ?organisationName ?partner ?partnerName
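
One caveat, hedged because it was not tested on Virtuoso: under SPARQL 1.1 semantics a subquery is evaluated on its own, and only the variables it projects are joined with the outer pattern. Since the subqueries above project only the name, LIMIT 1 may yield a single name for the whole result rather than one per organisation or partner. A possible alternative, sketched below, keeps SAMPLE after all but moves it into per-entity subqueries, so that each name is picked before the join with the contracts. The prefixes are assumed to be declared as in the question, and ?name is an illustrative variable.

```sparql
# Sketch only: prefix declarations for gr: and rejstriky: are assumed
# to be the same as in the question and are omitted here.
SELECT ?contract ?organisation ?organisationName ?partner ?partnerName
WHERE {
    ?organisation
        a gr:BusinessEntity ;
        rejstriky:contract ?contract .

    ?contract a rejstriky:Contract ;
        rejstriky:partner ?partner .

    # One arbitrary legal name per organisation, chosen before the join.
    {
        SELECT ?organisation (SAMPLE(?name) AS ?organisationName)
        WHERE { ?organisation gr:legalName ?name . }
        GROUP BY ?organisation
    }

    # One arbitrary legal name per partner, chosen before the join.
    {
        SELECT ?partner (SAMPLE(?name) AS ?partnerName)
        WHERE { ?partner gr:legalName ?name . }
        GROUP BY ?partner
    }
}
```

Because each subquery returns at most one row per entity, the outer join cannot multiply contract rows by the number of names. Whether this is actually faster depends on the query planner, so it is worth timing against the original.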
