0
votes

First, yes, I know that regex should never be used to parse HTML. However, in this situation I'm taking a long string of text (the output of var_dump(), actually) and using several regexes to transform it into XHTML, so I know exactly what tags I will be dealing with. The last two regexes in my sequence look for the curly braces and transform them into pieces of XHTML. It works great EXCEPT when the curly braces are contained in a string variable, which I output between <var></var> tags in a previous regex.

So, currently, I'm using: /\s*{\s*/u. What I need to do is adjust this to ignore any curly brace anywhere within the <var></var> tags.

I've tried using: /\s*{\s*(?!(?<!<var>)[^\{]*<\/var>)/u but that isn't quite right. I have not yet pinpointed what the conditions are that make it not work correctly. So, I may be close with this regex or I may be way off. Hence the need for the SO expertise. Thank you.

Also, if this is simply not possible, there are other hacks I can do, i.e., base64_encode() the string, stick it in the <var></var> tags, and then, as a last regex, base64_decode() anything surrounded by <var></var> tags. I'd prefer to find a usable regex and, more importantly, I'm simply curious whether it's possible.
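For reference, the base64 fallback described above might look roughly like the sketch below. It is written in Python rather than PHP, and the helper names (protect, restore, transform_braces) and the [OPEN] marker are made up for illustration; the real brace regexes would produce XHTML instead. It also assumes the string contents never contain a literal </var>.

```python
import base64
import re

# Matches one <var>...</var> block; DOTALL lets the contents span lines.
VAR_BLOCK = re.compile(r"<var>(.*?)</var>", re.DOTALL)

def protect(text):
    """Base64-encode the contents of every <var> block (hypothetical helper)."""
    def encode(match):
        raw = match.group(1).encode("utf-8")
        return "<var>" + base64.b64encode(raw).decode("ascii") + "</var>"
    return VAR_BLOCK.sub(encode, text)

def restore(text):
    """Decode the contents of every <var> block back to the original string."""
    def decode(match):
        raw = base64.b64decode(match.group(1))
        return "<var>" + raw.decode("utf-8") + "</var>"
    return VAR_BLOCK.sub(decode, text)

def transform_braces(text):
    """Stand-in for the brace regex; [OPEN] is a placeholder, not real XHTML."""
    return re.sub(r"\s*\{\s*", "[OPEN]", text)

sample = '<var>"a { b"</var> {'
output = restore(transform_braces(protect(sample)))
print(output)
```

Because the base64 alphabet contains no curly braces or angle brackets, the brace regex cannot touch the encoded contents, at the cost of two extra passes over the text.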

1
You've stumbled into one of the areas where Regular Expressions are not good for parsing HTML. While you can likely code around it in your case, I can't help but note this. – Jason McCreary
Sounds like you're reinventing the wheel. If you want to take a var_dump() and put it on the screen for debugging purposes, look at Krumo, PQP or one of the countless other projects like that. If you insist on going your own way, you're probably better off reimplementing the logic of var_dump() and walking your objects, rather than trying to transform the text. – Sean McSomething
@SeanMcSomething: I wrote this long ago and was not aware of those projects at the time (although I found out about them shortly after). As of now, most of my work has already been done and it works exactly as I want it to, save this pretty rare exception of the curly brace IN a string variable. I've just been living with it but have some time to fix it now. – heath
@JasonMcCreary: I completely agree. It's WRONG to parse HTML with regex. Haha. But, when I first began, I thought I could write one monster regex to convert the whole string into XHTML. I couldn't do it, so it got broken into a few regexes. Which leaves me with trying to parse HTML, albeit very simple and predictable. But yes, you are absolutely correct. – heath

1 Answer

3
votes

This might work:

\s*{\s*(?:(?!(?:.*?</var>))|(?=[^<]+<var>))

Pretty much, I rephrased the question: Instead of not matching curly braces within <var>, I only match curly braces that can be proved to be outside of <var>. So, a curly brace is outside of a <var> if:

  1. It can be asserted that this is true: (?!(?:.*?</var>)), which uses a negative lookahead to ensure that no closing </var> tag follows the brace (on the same line, since . does not match newlines by default), or
  2. It can be asserted that this is true: (?=[^<]+<var>), which uses a positive lookahead to ensure that the next tag after the brace is an opening <var> tag, with at least one other character in between.
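To see the two conditions in action, here is a minimal Python sketch. It assumes Python's re module treats this pattern the same way PCRE does (the brace is escaped just to be safe); the sample string and the X marker are invented for illustration and stand in for the real XHTML replacement.

```python
import re

# The pattern from above, with the curly brace escaped.
OUTSIDE_BRACE = re.compile(r"\s*\{\s*(?:(?!(?:.*?</var>))|(?=[^<]+<var>))")

def replace_outer_braces(text, marker="X"):
    """Replace braces that the lookaheads judge to be outside <var> tags."""
    return OUTSIDE_BRACE.sub(marker, text)

# Three braces: before the <var> block, inside it, and after it.
sample = "a { <var>x { y</var> } b {"
result = replace_outer_braces(sample)
print(result)
```

In this sample the brace between the <var> tags is left alone: a closing </var> follows it, and the next tag after it is not an opening <var>. The other two braces each satisfy one of the conditions and are replaced.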

It will definitely fail with nested <var> tags, but it seems to work with the test case I used. You can run it on RegExr and tell me what you think.
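If the lookaheads turn out to be too fragile, a different technique (not the regex above) is to match whole <var> blocks and braces in one alternation and only replace the braces. Below is a hedged Python sketch of that idea; the function name and the X marker are hypothetical, and it still assumes <var> tags are not nested.

```python
import re

# Branch 1 captures a whole <var>...</var> block; branch 2 matches a brace
# together with its surrounding whitespace.
VAR_OR_BRACE = re.compile(r"(<var>.*?</var>)|\s*\{\s*", re.DOTALL)

def replace_braces_outside_var(text, marker="X"):
    """Keep <var> blocks untouched and replace every other brace."""
    def swap(match):
        # Group 1 is set only when a <var> block matched: put it back as-is.
        if match.group(1) is not None:
            return match.group(1)
        return marker
    return VAR_OR_BRACE.sub(swap, text)

sample = "a { <var>x { y</var> } b {"
result = replace_braces_outside_var(sample)
print(result)
```

Since each <var> block is consumed as a single match, a brace inside it is never examined on its own. In PHP the same idea would need preg_replace_callback() rather than a plain replacement string.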