Why not take your crawled information and put [some of] it into a triple store, and use SPARQL to query it? If this is just one step in a longer processing pipeline, you don't need to go for a giant triple store; you could just use Jena with TDB for simple flat-file storage, or even plain in-memory models.
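For example, a minimal sketch with Jena (assuming a current Apache Jena on the classpath; "crawl-output.nt" and the "tdb-store" directory are placeholder names for your crawler's output and a local store):

    import org.apache.jena.query.Dataset;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.riot.RDFDataMgr;
    import org.apache.jena.tdb.TDBFactory;

    public class LoadCrawl {
        public static void main(String[] args) {
            // Option 1: plain in-memory model - fine for a one-off processing step.
            Model mem = ModelFactory.createDefaultModel();
            RDFDataMgr.read(mem, "crawl-output.nt");

            // Option 2: TDB-backed dataset persisted to a local directory -
            // useful when the crawl is too big to keep in memory.
            Dataset tdb = TDBFactory.createDataset("tdb-store");
            tdb.getDefaultModel().add(mem);
            tdb.close();
        }
    }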
The advantage of SPARQL is that you have all the flexibility you need to turn that list into a graph, and then query that graph.
Since the URIs you encounter can appear as subjects, predicates, or objects, you really just need to work out which graph patterns make the most sense to process further. Do you like the (s, p, o) triples where p = wikiPageExternalLink? If so, write a SPARQL query for that pattern, grab the object values, and happily process the result sets.
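A rough sketch of that query through Jena (assuming the DBpedia ontology namespace for wikiPageExternalLink, and a Model loaded as in the snippet above; class and method names are just placeholders):

    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.QuerySolution;
    import org.apache.jena.query.ResultSet;
    import org.apache.jena.rdf.model.Model;

    public class ExternalLinks {
        // Print every object of dbo:wikiPageExternalLink in the given model.
        static void printExternalLinks(Model model) {
            String q =
                "PREFIX dbo: <http://dbpedia.org/ontology/> " +
                "SELECT ?page ?link WHERE { ?page dbo:wikiPageExternalLink ?link }";
            try (QueryExecution qe = QueryExecutionFactory.create(q, model)) {
                ResultSet rs = qe.execSelect();
                while (rs.hasNext()) {
                    QuerySolution row = rs.nextSolution();
                    System.out.println(row.get("link")); // hand each value to your next processing step
                }
            }
        }
    }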
Note that some objects in those triple patterns will be string literals (e.g. "http://...") and maybe those are the ones you want to process further, rather than following subject links within the DBpedia graph (i.e. s, p, s2). Again, SPARQL to the rescue with isLiteral().
If it's a subject, I think that would qualify it as a "semantic web URI", in that there should at least be some more RDF statements about it - as opposed to a string literal, which is just the string form of some URI with no other significance in the graph. The corresponding function is isIRI(), so you could divide the URLs you find into two buckets - literals and IRIs.
See the example in the official spec:
http://www.w3.org/TR/rdf-sparql-query/#func-isIRI
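In practice the two buckets could just be two FILTERed variants of the same query - a sketch, again assuming the dbo: namespace (run them with QueryExecutionFactory as in the earlier snippet):

    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.ResultSet;
    import org.apache.jena.rdf.model.Model;

    public class SplitLinks {
        static final String PREFIX = "PREFIX dbo: <http://dbpedia.org/ontology/> ";

        // Bucket 1: objects that are plain literals - the URL is just a string in the graph.
        static final String LITERAL_LINKS = PREFIX +
            "SELECT ?link WHERE { ?s dbo:wikiPageExternalLink ?link . FILTER(isLiteral(?link)) }";

        // Bucket 2: objects that are IRIs - resources you can expect more statements about.
        static final String IRI_LINKS = PREFIX +
            "SELECT ?link WHERE { ?s dbo:wikiPageExternalLink ?link . FILTER(isIRI(?link)) }";

        static void printBucket(Model model, String query) {
            try (QueryExecution qe = QueryExecutionFactory.create(query, model)) {
                ResultSet rs = qe.execSelect();
                while (rs.hasNext()) {
                    System.out.println(rs.nextSolution().get("link"));
                }
            }
        }
    }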
SPARQL, and specifically Jena's ARQ, has a bunch of functions, filters, and REGEX support that can be applied to make this as flexible as possible (e.g. maybe you want to whitelist/blacklist certain domains/patterns, or do some string manipulation before continuing).
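For instance, a sketch of a domain whitelist ("example.org" is just a placeholder; swap in whatever domains or patterns you care about), which you can run exactly like the snippets above:

    public class DomainFilter {
        // STR() makes the filter work whether ?link comes back as a literal or an IRI.
        static final String WHITELISTED_LINKS =
            "PREFIX dbo: <http://dbpedia.org/ontology/> " +
            "SELECT ?link WHERE { " +
            "  ?s dbo:wikiPageExternalLink ?link . " +
            "  FILTER(REGEX(STR(?link), \"^https?://(www\\\\.)?example\\\\.org/\", \"i\")) " +
            "}";
    }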