0
votes

I want to match all links in my HTML content variable whose href starts with http://www.example.com

Example

should match:

<a href="http://www.example.com">foo</a>

shouldn't match:

<a href="/bar/">bar</a>

should also match (with line breaks and other HTML tags inside the anchor):

<a class="bla" id="blubb" href="http://www.example.com/asdf/" title="oops">
<img src="..." alt="" />
</a>

I started with something like this:

<CFSAVECONTENT variable="html">
    <a class="bla" id="blubb" href="http://www.example.com/asdf/" title="oops">
        <img src="..." alt="" /> some Text
    </a>
</CFSAVECONTENT>
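<!--- Pattern is single-quoted so the double quotes around the href value don't end the string --->
<CFSET result = REReplace(html, '<a[^>]*href="http://www\.example\.com[^"]*"[^>]*>([^<]+)</a>', "\1") />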

But of course this one doesn't match my last link example, where an img tag sits inside the a tag, because [^<]+ stops at the first nested tag.

Any hints on this one?

Upon further investigation, the question title doesn't match what you are asking: do you want to match all links that start with http://, or all links that start with example.com? I'll have to modify my answer based on what you want. - Shawn Holmes

It should match all links starting with example.com. - Seybsen

2 Answers

1
votes

Assuming:

<CFSAVECONTENT variable="html">
    <a class="bla" id="blubb" href="http://www.example.com/asdf/" title="oops">
        <img src="..." alt="" /> some Text
    </a>
    <a href="http://www.example.com/foo">foo</a>
    <a href="http://www.yahoo.com">abc</a>
    <a href="http://www.example.com/bar">bar</a>
</CFSAVECONTENT>

Use:

<cfset links = ReMatch('<a[^>]*href="http://www\.example\.com[^"]*"[^>]*>(.+?)</a>', html) />

'links' is now an array of the matched anchors, i.e. the full <a ...>...</a> strings rather than just the URLs (anchors 1, 2, and 4 should be in the array).

Bear in mind my answer was framed under the assumption you wanted to match all anchors that start with http://www.example.com, which may not necessarily match what you were asking in the title of this question.
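
If you want just the URLs rather than the whole anchors, or you want to try the pattern outside CF, here is a rough java.util.regex sketch of the same idea. It is not the ReMatch call itself: the capture group around the URL, the DOTALL flag (Java's '.' does not cross line breaks by default) and the names AnchorRegexSketch and matchingHrefs are my own additions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorRegexSketch {
    // Same shape as the ReMatch pattern above, plus a capture group around the URL.
    // DOTALL lets (.+?) span the line breaks inside the first anchor.
    private static final Pattern ANCHOR = Pattern.compile(
        "<a[^>]*href=\"(http://www\\.example\\.com[^\"]*)\"[^>]*>(.+?)</a>",
        Pattern.DOTALL);

    // Hypothetical helper: returns the href of every anchor the pattern matches.
    static List<String> matchingHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(1)); // group 1 is the URL, group 2 the anchor body
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html =
            "<a class=\"bla\" id=\"blubb\" href=\"http://www.example.com/asdf/\" title=\"oops\">\n"
          + "    <img src=\"...\" alt=\"\" /> some Text\n"
          + "</a>\n"
          + "<a href=\"http://www.example.com/foo\">foo</a>\n"
          + "<a href=\"http://www.yahoo.com\">abc</a>\n"
          + "<a href=\"http://www.example.com/bar\">bar</a>\n";
        for (String href : matchingHrefs(html)) {
            System.out.println(href);
        }
    }
}
```

The lazy (.+?) is what lets the img tag sit inside the anchor: it takes everything up to the first closing </a> instead of stopping at the first <.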

0
votes

It can be difficult and dangerous to try using regex against HTML like this (especially if it's not your HTML but is "wild" code from the Internet), for a whole host of reasons.

The correct tool for this job is an HTML parser that can provide a DOM for you to manipulate.

Unfortunately there aren't any for CF, so you need to look at the Java ones. I've heard good things about Jericho but never used it myself.
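
To give an idea of what the parser route looks like, here is a minimal sketch. Since I can't vouch for Jericho's API, it swaps in the HTML parser that ships with the JDK (javax.swing.text.html), which is an event-callback parser rather than a DOM. The class name AnchorParserSketch and the helper hrefsStartingWith are made up for illustration, and I have not tested this from inside CF.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class AnchorParserSketch {
    // Hypothetical helper: collects the href of every <a> whose URL starts with prefix.
    static List<String> hrefsStartingWith(String html, String prefix) throws IOException {
        List<String> hrefs = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    // The parser has already dealt with attribute order and quoting.
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null && href.toString().startsWith(prefix)) {
                        hrefs.add(href.toString());
                    }
                }
            }
        };
        // 'true' tells the parser to ignore any charset declared in the markup.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return hrefs;
    }

    public static void main(String[] args) throws IOException {
        String html =
            "<a class=\"bla\" id=\"blubb\" href=\"http://www.example.com/asdf/\" title=\"oops\">\n"
          + "    <img src=\"...\" alt=\"\" /> some Text\n"
          + "</a>\n"
          + "<a href=\"/bar/\">bar</a>\n"
          + "<a href=\"http://www.example.com/foo\">foo</a>\n";
        for (String href : hrefsStartingWith(html, "http://www.example.com")) {
            System.out.println(href);
        }
    }
}
```

The point of the design is that the check becomes a plain startsWith on the attribute value, so nested tags, line breaks and attribute order inside the anchor stop mattering.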