2
votes

I have a program that crawls semantic web documents, e.g. RDF and OWL.

It takes the URIs it finds and puts them into a list for further processing. However, it also finds URLs that are part of some statements (I am using Wikipedia data, so this is usually the object of the http://dbpedia.org/ontology/wikiPageExternalLink property).

How can I tell which is a semantic web URI and which is just a URL, with as little fuss as possible? I am developing in Java, and I am thinking that if it takes more than a certain amount of time to read a file, the program should just move on. But I am not sure how to do this.

I know my question is vague; tell me what more detail I should give. I haven't posted code because I don't think it would help in this case.


2 Answers

1
votes

Why not take your crawled information, put [some of] it into a triple store, and use SPARQL to query it? If this is just one step in a series of processing, you don't need to go for a giant triple store; you could just use Jena with TDB for simple file-based storage, or even just in-memory models.
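A minimal sketch of that setup, assuming a current Apache Jena (org.apache.jena packages); the input file name and TDB directory are just placeholders:

    import org.apache.jena.query.Dataset;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.tdb.TDBFactory;

    public class LoadCrawl {
        public static void main(String[] args) {
            // In-memory model: fine for one batch of crawled documents at a time.
            Model model = ModelFactory.createDefaultModel();
            model.read("crawled-batch.rdf");              // placeholder file of crawled triples

            // Or a file-backed TDB dataset if you want the data to survive between runs.
            Dataset dataset = TDBFactory.createDataset("tdb-store");
            dataset.getDefaultModel().add(model);

            System.out.println("Triples loaded: " + model.size());
        }
    }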

The advantage of SPARQL is that you would have all the flexibility you need to treat that list as a graph, and then query that graph.

Since the URIs you encounter can be subjects, predicates, or objects, you really just need to work out which graph patterns make the most sense to do further processing on. Are you interested in the s,p,o triples where p = wikiPageExternalLink? If so, write a SPARQL query for that pattern, pull out the object values, and happily process the result set.
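For example, something along these lines with Jena/ARQ (a sketch only; the model is assumed to be loaded as above, and a Jena version where QueryExecution is AutoCloseable):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.QueryFactory;
    import org.apache.jena.query.ResultSet;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.RDFNode;

    public class ExternalLinks {
        /** Collects every object of wikiPageExternalLink found in the model. */
        public static List<RDFNode> externalLinkObjects(Model model) {
            String q = "SELECT ?o WHERE { ?s <http://dbpedia.org/ontology/wikiPageExternalLink> ?o }";
            List<RDFNode> objects = new ArrayList<>();
            try (QueryExecution qexec = QueryExecutionFactory.create(QueryFactory.create(q), model)) {
                ResultSet results = qexec.execSelect();
                while (results.hasNext()) {
                    objects.add(results.next().get("o"));     // the link value, literal or IRI
                }
            }
            return objects;
        }
    }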

Note that some objects in that triple pattern will be string literals (e.g. "http://...") and maybe those are the ones you want to process further, rather than following subject links in the DBpedia graph (e.g. s,p,s2). Again, SPARQL to the rescue with isLiteral().

If it's a subject, I think that would qualify it as a "semantic web URI", in that there should at least be some more RDF statements about it - as opposed to a string literal, which is just the string of some URI with no other significance in the graph. The corresponding function is isIRI, so there you could divide the URLs you find into two buckets - literals and IRIs.

See the example in the official spec: http://www.w3.org/TR/rdf-sparql-query/#func-isIRI
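Roughly, the two buckets could be selected with queries like these (query strings only; run each through the same QueryExecution wiring as above):

    public class UriBuckets {
        // Objects that are plain string literals, e.g. "http://..." values with no further statements.
        static final String LITERAL_OBJECTS =
            "SELECT ?o WHERE { ?s ?p ?o . FILTER(isLiteral(?o)) }";

        // Objects that are IRIs, i.e. resources the graph may say more about ("semantic web URIs").
        static final String IRI_OBJECTS =
            "SELECT ?o WHERE { ?s ?p ?o . FILTER(isIRI(?o)) }";
    }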

SPARQL, and specifically Jena's ARQ, has a bunch of functions, filters, and regex support that can be applied to make this as flexible as you need (e.g. maybe you want to whitelist/blacklist certain domains or patterns, or do some string manipulation before continuing).
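As a sketch, a domain whitelist filter might look like this - the example.org pattern is purely hypothetical:

    public class DomainFilter {
        // Keep only external links whose string form matches a (hypothetical) whitelisted domain.
        static final String WHITELISTED_LINKS =
            "SELECT ?o WHERE { "
            + "  ?s <http://dbpedia.org/ontology/wikiPageExternalLink> ?o . "
            + "  FILTER( regex(str(?o), \"^https?://([^/]*\\\\.)?example\\\\.org/\", \"i\") ) "
            + "}";
    }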

0
votes

First, it's important to acknowledge that URLs are a subset of URIs (e.g. http://en.wikipedia.org/wiki/Tim_Berners-Lee is the URI for the Wikipedia page about Tim Berners-Lee). All URIs and URLs play an important role in the Semantic Web.

I suppose the big problem you have is deciding which URIs are going to yield RDF triples.

The first approach is to attempt to parse triples out of every URI you come across; e.g. even if a page seems to be HTML, it may have RDFa present too. (I suppose you could make HTTP requests that accept only RDF MIME types, but you would potentially lose a wealth of RDFa data.)
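If you do want to try content negotiation first, a rough sketch with plain HttpURLConnection might look like this (the Accept header values and timeout handling are assumptions on my part; the timeout also covers the "move on if it takes too long" part of the question):

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class RdfFetcher {
        /**
         * Asks the server for RDF via content negotiation. Returns null when the
         * response is something else (e.g. HTML you might still scan for RDFa).
         */
        public static InputStream fetchRdf(String uri, int timeoutMillis) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(uri).openConnection();
            conn.setRequestProperty("Accept", "application/rdf+xml, text/turtle;q=0.9");
            conn.setConnectTimeout(timeoutMillis);   // give up on slow hosts instead of hanging
            conn.setReadTimeout(timeoutMillis);

            String contentType = conn.getContentType();
            if (contentType != null
                    && (contentType.startsWith("application/rdf+xml")
                        || contentType.startsWith("text/turtle"))) {
                return conn.getInputStream();
            }
            conn.disconnect();
            return null;
        }
    }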

Another approach is to presume that none of the objects of the http://dbpedia.org/ontology/wikiPageExternalLink property are going to yield any interesting facts.

Another approach is to note domain names/subdomains that don't publish RDF and ignore them.
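For that, a simple host-based skip list is probably enough - the host names here are just hypothetical examples:

    import java.net.URI;
    import java.net.URISyntaxException;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class DomainSkipList {
        // Hypothetical hosts found (by experience) not to publish any RDF.
        private static final Set<String> NON_RDF_HOSTS =
                new HashSet<>(Arrays.asList("www.youtube.com", "twitter.com"));

        /** True if the URI's host is on the skip list (or the URI cannot be parsed). */
        public static boolean shouldSkip(String uriString) {
            try {
                String host = new URI(uriString).getHost();
                return host == null || NON_RDF_HOSTS.contains(host.toLowerCase());
            } catch (URISyntaxException e) {
                return true;
            }
        }
    }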