2
votes

I have a program that crawls semantic web documents, e.g. RDF and OWL.

It takes the URIs it finds and puts them into a list for further processing. However, it also finds URLs that are part of some statements (I am using Wikipedia data, so this is usually the object of the http://dbpedia.org/ontology/wikiPageExternalLink property).

How can I tell which is a semantic web URI and which is just a URL, with as little fuss as possible? I am developing in Java, and I am thinking that if it takes more than a certain amount of time to read a file, the program should just move on. But I am not sure how to do this.

I know my question is vague; tell me what more detail I should give. I haven't posted code because I don't think it would help in this case.


2 Answers

1
votes

Why not take your crawled information, put [some of] it into a triple store, and use SPARQL to query it? If this is just one step in a series of processing, you don't need to go for a giant triple store; you could just use Jena with TDB for simple file-based storage, or even just in-memory models.
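A minimal sketch of that setup, assuming a current Apache Jena (org.apache.jena packages); the input file name and TDB directory are just placeholders:

    import org.apache.jena.query.Dataset;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.tdb.TDBFactory;

    public class LoadCrawl {
        public static void main(String[] args) {
            // In-memory model: fine for one batch of crawled documents at a time.
            Model model = ModelFactory.createDefaultModel();
            model.read("crawled-batch.rdf");              // placeholder file of crawled triples

            // Or a file-backed TDB dataset if you want the data to survive between runs.
            Dataset dataset = TDBFactory.createDataset("tdb-store");
            dataset.getDefaultModel().add(model);

            System.out.println("Triples loaded: " + model.size());
        }
    }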

The advantage of SPARQL is that you would have all the flexibility you need to treat that list as a graph, and then query that graph.

Since the URIs you encounter can be subjects, predicates, or objects, you really just need to work out which graph patterns make the most sense to do further processing on. Are you interested in the s,p,o triples where p = wikiPageExternalLink? If so, write a SPARQL query for that pattern, pull out the object values, and happily process the result set.
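For example, something along these lines with Jena/ARQ (a sketch only; the model is assumed to be loaded as above, and a Jena version where QueryExecution is AutoCloseable):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.QueryFactory;
    import org.apache.jena.query.ResultSet;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.RDFNode;

    public class ExternalLinks {
        /** Collects every object of wikiPageExternalLink found in the model. */
        public static List<RDFNode> externalLinkObjects(Model model) {
            String q = "SELECT ?o WHERE { ?s <http://dbpedia.org/ontology/wikiPageExternalLink> ?o }";
            List<RDFNode> objects = new ArrayList<>();
            try (QueryExecution qexec = QueryExecutionFactory.create(QueryFactory.create(q), model)) {
                ResultSet results = qexec.execSelect();
                while (results.hasNext()) {
                    objects.add(results.next().get("o"));     // the link value, literal or IRI
                }
            }
            return objects;
        }
    }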

Note that some objects in that triple pattern will be string literals (e.g. "http://...") and maybe those are the ones you want to process further, rather than following subject links in the DBpedia graph (e.g. s,p,s2). Again, SPARQL to the rescue with isLiteral().

If it's a subject, I think that would qualify it as a "semantic web URI", in that there should at least be some more RDF statements about it - as opposed to a string literal, which is just the string of some URI with no other significance in the graph. The corresponding function is isIRI, so there you could divide the URLs you find into two buckets - literals and IRIs.

See the example in the official spec: http://www.w3.org/TR/rdf-sparql-query/#func-isIRI
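Roughly, the two buckets could be selected with queries like these (query strings only; run each through the same QueryExecution wiring as above):

    public class UriBuckets {
        // Objects that are plain string literals, e.g. "http://..." values with no further statements.
        static final String LITERAL_OBJECTS =
            "SELECT ?o WHERE { ?s ?p ?o . FILTER(isLiteral(?o)) }";

        // Objects that are IRIs, i.e. resources the graph may say more about ("semantic web URIs").
        static final String IRI_OBJECTS =
            "SELECT ?o WHERE { ?s ?p ?o . FILTER(isIRI(?o)) }";
    }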

SPARQL, and specifically Jena's ARQ, has a bunch of functions, filters, and regex support that can be applied to make this as flexible as you need (e.g. maybe you want to whitelist/blacklist certain domains or patterns, or do some string manipulation before continuing).
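As a sketch, a domain whitelist filter might look like this - the example.org pattern is purely hypothetical:

    public class DomainFilter {
        // Keep only external links whose string form matches a (hypothetical) whitelisted domain.
        static final String WHITELISTED_LINKS =
            "SELECT ?o WHERE { "
            + "  ?s <http://dbpedia.org/ontology/wikiPageExternalLink> ?o . "
            + "  FILTER( regex(str(?o), \"^https?://([^/]*\\\\.)?example\\\\.org/\", \"i\") ) "
            + "}";
    }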

0
votes

First, it's important to acknowledge that URLs are a subset of URIs (e.g. http://en.wikipedia.org/wiki/Tim_Berners-Lee is the URI for the Wikipedia page about Tim Berners-Lee). All URIs and URLs play an important role in the Semantic Web.

I suppose the big problem you have is deciding which URIs are going to yield RDF triples.

The first approach is to attempt to parse triples out of every URI you come across; e.g. even if a page seems to be HTML, it may have RDFa present too. (I suppose you could make HTTP requests that accept only RDF MIME types, but you would potentially lose a wealth of RDFa data.)
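If you do want to try content negotiation first, a rough sketch with plain HttpURLConnection might look like this (the Accept header values and timeout handling are assumptions on my part; the timeout also covers the "move on if it takes too long" part of the question):

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class RdfFetcher {
        /**
         * Asks the server for RDF via content negotiation. Returns null when the
         * response is something else (e.g. HTML you might still scan for RDFa).
         */
        public static InputStream fetchRdf(String uri, int timeoutMillis) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(uri).openConnection();
            conn.setRequestProperty("Accept", "application/rdf+xml, text/turtle;q=0.9");
            conn.setConnectTimeout(timeoutMillis);   // give up on slow hosts instead of hanging
            conn.setReadTimeout(timeoutMillis);

            String contentType = conn.getContentType();
            if (contentType != null
                    && (contentType.startsWith("application/rdf+xml")
                        || contentType.startsWith("text/turtle"))) {
                return conn.getInputStream();
            }
            conn.disconnect();
            return null;
        }
    }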

Another approach is to presume that none of the objects of the http://dbpedia.org/ontology/wikiPageExternalLink property are going to yield any interesting facts.

Another approach is to note domain names/subdomains that don't publish RDF and ignore them.
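For that, a simple host-based skip list is probably enough - the host names here are just hypothetical examples:

    import java.net.URI;
    import java.net.URISyntaxException;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class DomainSkipList {
        // Hypothetical hosts found (by experience) not to publish any RDF.
        private static final Set<String> NON_RDF_HOSTS =
                new HashSet<>(Arrays.asList("www.youtube.com", "twitter.com"));

        /** True if the URI's host is on the skip list (or the URI cannot be parsed). */
        public static boolean shouldSkip(String uriString) {
            try {
                String host = new URI(uriString).getHost();
                return host == null || NON_RDF_HOSTS.contains(host.toLowerCase());
            } catch (URISyntaxException e) {
                return true;
            }
        }
    }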