Regular expression for syntax highlighting attributes in HTML tag

Question

I'm working on regular expressions for syntax highlighting in a Sublime/TextMate language file. The rule needs to "begin" on a non-self-closing HTML tag and end on the matching closing tag:

begin: (<)([a-zA-Z0-9:.]+)[^/>]*(>)
end: (</)(\2)([^>]*>)

So far, so good: I can capture the tag name, and the end pattern matches it, so the appropriate patterns are applied to the area between the tags.
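
A rough way to experiment with this begin/end pairing outside the editor is to mimic it in Python. This is only a sketch: the function name `find_tag_area` is made up for illustration, Python's `re` module is not the engine Sublime/TextMate uses, and nested tags with the same name are not handled.

```python
import re

# Same "begin" pattern as above: an opening tag that is not self-closing.
BEGIN = re.compile(r'(<)([a-zA-Z0-9:.]+)[^/>]*(>)')


def find_tag_area(text):
    """Return (tag_name, inner_text) for the first begin/end pair, or None."""
    begin = BEGIN.search(text)
    if begin is None:
        return None
    # Emulate the \2 backreference in "end" by inserting the captured tag name.
    end = re.compile(r'(</)(' + re.escape(begin.group(2)) + r')([^>]*>)')
    end_match = end.search(text, begin.end())
    if end_match is None:
        return None
    return begin.group(2), text[begin.end():end_match.start()]


if __name__ == "__main__":
    print(find_tag_area('<div class="a">hello <b>x</b></div>'))
```

A self-closing tag such as `<br/>` never matches the begin pattern here, because `[^/>]*` cannot step over the `/`.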

jsx-tag-area:
    begin: (<)([a-zA-Z0-9:.]+)[^/>]*>
    beginCaptures:
      '1': {name: punctuation.definition.tag.begin.jsx}
      '2': {name: entity.name.tag.jsx}
    end: (</)(\2)([^>]*>)
    endCaptures:
      '1': {name: punctuation.definition.tag.begin.jsx}
      '2': {name: entity.name.tag.jsx}
      '3': {name: punctuation.definition.tag.end.jsx}
    name: jsx.tag-area.jsx
    patterns:
    - {include: '#jsx'}
    - {include: '#jsx-evaluated-code'}

Now I'd also like to capture zero or more of the HTML attributes in the opening tag so that I can highlight them.

So if the tag were <div attr="Something" data-attr="test" data-foo>

It would be able to match on attr, data-attr, and data-foo, as well as the < and div

Something like (this is very rough):

(<)([a-zA-Z0-9:.]+)(?:\s(?:([0-9a-zA-Z_-]*=?))\s?)*[^/>]*(>)

It doesn't need to be perfect; it's only for syntax highlighting. But I'm having a hard time figuring out how to get multiple capture groups within the tag, whether I should be using look-arounds, or whether this is even possible with a single expression.
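
One obstacle with the single-expression approach, at least in Python's `re` module (used here purely for illustration; it is not the engine Sublime uses), is that a capture group inside a repetition keeps only its last match. The sketch below shows that behavior next to a two-pass alternative that first isolates the attribute text and then scans it. The names `REPEATED`, `TAG`, `ATTR`, `last_capture_only`, and `attribute_names` are hypothetical, and the patterns are simplified assumptions rather than a full HTML grammar.

```python
import re

# Single expression: the attribute-name group sits inside a repeated group.
REPEATED = re.compile(r'<(\w+)(?:\s+([\w-]+)(?:="[^"]*")?)*>')

# Two passes: grab the raw attribute text, then scan it attribute by attribute.
TAG = re.compile(r'<([a-zA-Z0-9:.]+)([^/>]*)>')
ATTR = re.compile(r'''([0-9a-zA-Z_:-]+)(?:=(?:"[^"]*"|'[^']*'|[^\s>]+))?''')


def last_capture_only(tag):
    """Show what the repeated group retains after a successful match."""
    m = REPEATED.match(tag)
    return m.group(2) if m else None


def attribute_names(tag):
    """Return every attribute name in an opening tag, in order."""
    m = TAG.match(tag)
    if m is None:
        return []
    # Consuming each value along with its name skips words inside quotes.
    return [a.group(1) for a in ATTR.finditer(m.group(2))]


if __name__ == "__main__":
    tag = '<div attr="Something" data-attr="test" data-foo>'
    print(last_capture_only(tag))  # only the final attribute name survives
    print(attribute_names(tag))    # all three names
```

The two-pass shape is similar in spirit to matching the opening tag with one rule and applying a second rule to its attribute text.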

Edit: here are more details about the specific case / question - https://github.com/reactjs/sublime-react/issues/18

This probably won't work very well if you're trying to capture an arbitrary amount of attributes. If it's a variable amount of attributes the regex is going to be very messy and unreadable. This is how ugly it looks capturing two attributes — skamazin
You've had a look at RegEx match open tags except XHTML self-contained tags? — Bergi
Yes of course :) I'm not trying to faithfully parse the html, I'm trying to roughly pattern match it... take a look at the use case github.com/reactjs/sublime-react/issues/18 — tgriesser
Also, the issue is half with the actual matching and half with how it should actually work based on Sublime's syntax highlighting rules (or if I'm going about this the wrong way) — tgriesser
It's a shame I can't really play with this one... From the tutorial it looks like you can use "include": "$self" for recursive matching, which is very cute. Can it also be used for a specific group? For example: match <[Tag][All Attributes]>...</[Tag]>, and then use another rule to parse [All Attributes]? — Kobi

Oscar Hermosilla · Accepted Answer · 2014-09-10T09:28:50

I may have found a possible solution.

It is not perfect: as @skamazin said in the comments, a single regex cannot capture an arbitrary number of attributes. You have to repeat the attribute-matching pattern once for each attribute you want to allow, which caps the number of attributes it can handle.

The regex is pretty scary, but it may work for your goal. It can probably be simplified a bit, and you may need to adjust some things.

For a single attribute, it looks like this:

(<)([a-zA-Z0-9:.]+)(?:(?: ((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)(?: |>))

For each additional attribute, append this fragment once:

(?:(?:((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)?(?: |>))?

For example, to allow a maximum of three attributes, the regex looks like this:

(<)([a-zA-Z0-9:.]+)(?:(?: ((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)(?: |>))(?:(?:((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)?(?: |>))?(?:(?:((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)?(?: |>))?
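
As a quick sanity check, the three-attribute pattern above can be tried in Python against the tag from the question. This is a hedged sketch: the names `ATTR_FIRST`, `ATTR_MORE`, `TAG`, and `captures` are invented for the example, and Python's `re` module may differ in details from the engine Sublime uses.

```python
import re

# The pattern above, assembled from its repeated pieces for readability.
ATTR_FIRST = r'(?:(?: ((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)(?: |>))'
ATTR_MORE = r'(?:(?:((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)?(?: |>))?'
TAG = re.compile(r'(<)([a-zA-Z0-9:.]+)' + ATTR_FIRST + ATTR_MORE * 2)


def captures(tag):
    """Return the five capture groups for an opening tag, or None."""
    m = TAG.match(tag)
    return m.groups() if m else None


if __name__ == "__main__":
    print(captures('<div attr="Something" data-attr="test" data-foo>'))
    print(captures('<a href="x">'))
```

Two limits show up when experimenting with it: the first attribute must carry a value (so `<div data-foo>` does not match), and a fourth attribute is simply left uncaptured. Attribute values containing spaces are not supported either.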

Tell me if it suits you and if you need further details.