0
votes

I have a possibly simple task ahead of me, but my RegEx skills are poor. Can anyone help me, or point me in the right direction? :-)

Below is an example of the text I'm parsing. I would like to do a foreach over the matches and, for each one, get the value of "URL" and the text between the tags:

Lorem ipsum dolor sit amet, consectetur[URL=/test.aspx?ID=12345]lorem ipsum[/URL] adipiscing elit. Nullam interdum eleifend mauris, nec condimentum nisi lacinia sit amet. Mauris faucibus, orci ac[URL=/Default.aspx?ID=222222]lorem[/URL] convallis volutpat, dolor libero sollicitudin quam, id feugiat magna orci[URL=/Default.aspx?ID=333333]lorem ipsum dolor[/URL] quis augue. Integer nec euismod sem.

3
-1 for really bad title. - gsharp
How about using String.IndexOf() to find the URL value, and then reading from that index up to the point where the next URL string is found? Hope you're getting the idea. - Zenwalker
When you feel comfortable enough you can take a look at this gem : shop.oreilly.com/product/9781565922570.do - FailedDev
Good suggestions on where to start reading. - Christopher W. Brandsdal

3 Answers

4
votes

This should do it for you:

// Requires: using System.Text.RegularExpressions;
Regex theRegex = new Regex(@"\[URL=([^\]]+)\]([^\[]+)\[/URL\]");
string text = "Lorem ipsum dolor sit amet, consectetur[URL=/test.aspx?ID=12345]lorem ipsum[/URL] adipiscing elit. Nullam interdum eleifend mauris, nec condimentum nisi lacinia sit amet. Mauris faucibus, orci ac[URL=/Default.aspx?ID=222222]lorem[/URL] convallis volutpat, dolor libero sollicitudin quam, id feugiat magna orci[URL=/Default.aspx?ID=333333]lorem ipsum dolor[/URL] quis augue. Integer nec euismod sem.";
MatchCollection matches = theRegex.Matches(text);
foreach (Match thisMatch in matches)
{
    // thisMatch.Groups[0].Value is the whole match, e.g. "[URL=/test.aspx?ID=12345]lorem ipsum[/URL]"
    // thisMatch.Groups[1].Value is the URL, e.g. "/test.aspx?ID=12345"
    // thisMatch.Groups[2].Value is the link text, e.g. "lorem ipsum"
}
0
votes

This pattern will work if your text looks exactly like your example, i.e. there are no nested URL tags and the URL tag is always in capitals:

 @"\[URL=([^\]]*)\]([^\[]*)\[/URL\]"

This should capture two groups: 1 = the stuff after URL=, 2 = the stuff between the [URL=...] and [/URL] marks.

Basically,

  • as [ and ] are reserved tokens in a regex (they delimit character classes), to match them literally you need to prefix them with a backslash (i.e. "escape" them)

  • [^\[] matches any character that isn't an open-bracket.

  • the parentheses determine groups that can be captured.

Caveats: nested URL tags won't work, tags that themselves contain square brackets won't work, and quoted strings "..." must also be free from brackets, because the regex won't treat them the way a proper markup parser would.

The only way round this kind of problem, as far as I know, is to do a full parse.

But if you're sure the data doesn't have these sorts of anomalies, you'll be OK!
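
To make the caveats above concrete, here is a minimal C# sketch. It assumes the pattern from this answer (with the second group properly closed) plus RegexOptions.IgnoreCase, which is an addition of mine so that lower-case [url=...] tags also match. The class name UrlTagDemo, the helper Extract and the sample inputs are all made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlTagDemo
{
    // Pattern from the answer; IgnoreCase is an added assumption so [url=...] also matches.
    private static readonly Regex UrlTag = new Regex(
        @"\[URL=([^\]]*)\]([^\[]*)\[/URL\]",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns "url|text" for every tag found in the input.
    public static string[] Extract(string input)
    {
        MatchCollection matches = UrlTag.Matches(input);
        var results = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            results[i] = matches[i].Groups[1].Value + "|" + matches[i].Groups[2].Value;
        }
        return results;
    }

    public static void Main()
    {
        // Flat tags, in either case, are found.
        foreach (string hit in Extract("a[URL=/x.aspx?ID=1]one[/URL] b[url=/y.aspx]two[/url]"))
        {
            Console.WriteLine(hit);
        }

        // Link text containing a bracket is skipped, because group 2 cannot cross '['.
        Console.WriteLine(Extract("[URL=/z.aspx]see [1][/URL]").Length);
    }
}
```

Run as written, this should print the two url|text pairs followed by 0, since the second input has a square bracket inside the link text and therefore never matches.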

0
votes

Here is the requested regex:

\[URL=(?<url>[^\]]*)\](?<text>[^\[]*)\[/URL\]

You can access the requested values with the following code:

   // Requires: using System.Diagnostics; using System.Text.RegularExpressions;
   var regex = new Regex(@"\[URL=(?<url>[^\]]*)\](?<text>[^\[]*)\[/URL\]");
   var matches = regex.Matches(textToSearchIn);

   foreach (Match match in matches)
   {
       Debug.Print("Url: {0} Text: {1}", match.Groups["url"].Value, match.Groups["text"].Value);
   }