0
votes

Using the DBpedia-Live SPARQL endpoint http://dbpedia-live.openlinksw.com/sparql, I am trying to count the total number of triples associated with the instances of type owl:Thing. Because the count is very large, the query fails with the exception "Virtuoso 42000 Error The estimated execution time". To get around this, I tried using a subselect with limit and offset. However, when the offset is greater than or equal to the limit, this doesn't work and the same exception (Virtuoso 42000 Error) is thrown again. Can anyone identify the problem with my query, or suggest a workaround? Here is the query I was trying:

select count(?s) as ?count
where
{
  ?s ?p ?o
  {
    select ?s
    where
    {
      ?s rdf:type owl:Thing.
    }
    limit 10000
    offset 10000
  }
}
1
Works for me on dbpedia.org/sparql. Note that it's a shared resource used by many people; you don't have any performance guarantees, nor a guarantee of uptime. The workaround is to load the DBpedia dump and process the data locally. In your case, you could even use UNIX commands such as grep. - UninformedUser
Thanks for replying. Unfortunately, the dumps would be a little older, and I am trying to run an experiment against the current state of DBpedia and the live changes it produces. I am still unsure why the query would work on the static DBpedia endpoint and not on the live endpoint. I am assuming there would not be a drastic difference between the configurations of the two environments. Thanks again for the response. - singha
Well, the most obvious difference is Virtuoso 7 vs. Virtuoso 8. That alone can lead to different query execution plans etc. Moreover, different servers, different Virtuoso config, there can be so many things making the difference. - UninformedUser
Thanks for the information. So, my takeaway from this conversation would be that there is no way to get the count of the triples related to owl:Thing class from DBpedia LIVE SPARQL endpoint. - singha
That's something I cannot answer; only the DBpedia Live maintainers, Virtuoso devs, or more experienced SPARQL users can. - UninformedUser

1 Answer

2
votes

Your solution starts with patience. Virtuoso's Anytime Query feature returns some results when a timeout strikes, and keeps running the query in the background -- so if you come back later, you'll typically get more solutions, up to the complete result set.
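If you run the query from a script rather than the web form, it helps to know whether a given response was cut short by the Anytime Query timeout. The sketch below is a minimal check, assuming (from my reading of the Virtuoso documentation, so please verify against your endpoint) that a partial result is flagged with the HTTP response header X-SQL-State set to S1TAT. The function name is_partial_result is made up for illustration.

```python
from typing import Mapping


def is_partial_result(headers: Mapping[str, str]) -> bool:
    """Return True if the response headers look like a timed-out Anytime Query.

    Assumption: Virtuoso marks incomplete results with the header
    "X-SQL-State: S1TAT". Header names are compared case-insensitively.
    """
    for name, value in headers.items():
        if name.lower() == "x-sql-state" and value.strip().upper() == "S1TAT":
            return True
    return False
```

Pass it the header mapping from whatever HTTP client you use; if it returns True, treat the count as a lower bound and retry later.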

I had to guess at your original query, since you only posted the piecemeal one you were trying to use --

select ( count(?s) as ?count )
where
{
          ?s rdf:type owl:Thing.
}

I got 3,923,114 within a few seconds, without hitting any timeout. I had set a timeout of 3000000 milliseconds (= 3000 seconds = 50 minutes) on the form -- in contrast to the endpoint's default timeout of 30000 milliseconds (= 30 seconds) -- but clearly hit neither of these, nor the endpoint's server-side configured timeout.
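The timeout I set on the form can also be sent programmatically. Here is a minimal sketch of building the request URL, assuming the endpoint accepts the same timeout (in milliseconds) and format request parameters that the Virtuoso query form submits. The helper name sparql_url is illustrative, and the server-side configured limit still applies no matter what you ask for.

```python
from urllib.parse import urlencode

# Endpoint taken from the question.
ENDPOINT = "http://dbpedia-live.openlinksw.com/sparql"


def sparql_url(query: str, timeout_ms: int = 30_000) -> str:
    """Build a GET URL for the endpoint with an explicit timeout in milliseconds."""
    params = {
        "query": query,
        # Assumption: Virtuoso's "format" parameter selects the result serialization.
        "format": "application/sparql-results+json",
        # Assumption: "timeout" is the parameter behind the form's timeout field.
        "timeout": str(timeout_ms),
    }
    return ENDPOINT + "?" + urlencode(params)
```

For example, sparql_url(my_query, 3_000_000) requests the 50-minute timeout mentioned above.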

I think you already understand this, but please do note that this count is a moving target, and will change regularly as the DBpedia-Live content continues to be updated from the Wikipedia firehose.


Your divide-and-conquer effort has a significant issue. Note that without an ORDER BY clause in combination with your LIMIT/OFFSET clauses, you may find that some solutions (in this case, some values of ?s) repeat and/or some solutions never appear in a final aggregation that combines all those partial results.

Also, as you are trying to count triples, you should probably do a count(*) instead of count(?s). If nothing else, this helps readers of the query understand what you're doing.
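The two points above (a stable ORDER BY for paging, and count(*)) can be combined in a small client-side driver. This is only a sketch under stated assumptions: run_count is a hypothetical callback that sends a query to the endpoint and returns the integer in ?count; I added DISTINCT to the subselect so a subject listed more than once is not counted twice; and the loop stops at the first page that counts zero, which works because every owl:Thing instance has at least its rdf:type triple. The counts can still drift between pages, since the data is live.

```python
from typing import Callable

PREFIXES = (
    "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
    "PREFIX owl: <http://www.w3.org/2002/07/owl#>\n"
)


def page_count_query(limit: int, offset: int) -> str:
    """Count the triples whose subject is in one ordered page of owl:Thing instances."""
    if limit <= 0 or offset < 0:
        raise ValueError("limit must be positive and offset non-negative")
    return PREFIXES + (
        "SELECT (COUNT(*) AS ?count)\n"
        "WHERE {\n"
        "  {\n"
        "    SELECT DISTINCT ?s\n"
        "    WHERE { ?s rdf:type owl:Thing . }\n"
        "    ORDER BY ?s\n"  # makes LIMIT/OFFSET pages non-overlapping
        f"    LIMIT {limit}\n"
        f"    OFFSET {offset}\n"
        "  }\n"
        "  ?s ?p ?o .\n"
        "}\n"
    )


def total_triples(run_count: Callable[[str], int], page_size: int = 10_000) -> int:
    """Sum per-page counts until a page returns zero.

    run_count is a hypothetical callback: it executes the query and
    returns the value bound to ?count as an int.
    """
    total = 0
    offset = 0
    while True:
        page_total = run_count(page_count_query(page_size, offset))
        if page_total == 0:
            return total
        total += page_total
        offset += page_size
```

Be aware that sorting millions of subjects for every page is itself expensive, so this may still run into the endpoint's limits at large offsets.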


Toward being able to adjust such execution time limits as your query is hitting -- the easiest way would be to instantiate your own mirror via the DBpedia-Live AMI; unfortunately, this is not currently available for new customers, for a number of reasons. (Existing customers may continue to use their AMIs.) We will likely revive this at some point, but the timing is indefinite; you could open a Support Case to register your interest, and be notified when the AMI is made available for new users.


Toward an ultimate solution... There may be better ways to get to your actual end goal than those you're currently working on. You might consider asking on the DBpedia mailing list or the OpenLink Community Forum.