
I am trying to parse an HLS m3u8 file, and where I am stuck is matching the m3u8 links. If URI= exists (as in #EXT-X-I-FRAME-STREAM-INF), grab the value in quotation marks; if it doesn't (as in #EXT-X-STREAM-INF), grab the link from the next line.

Text:

#EXT-X-STREAM-INF:BANDWIDTH=263851,CODECS="mp4a.40.2, avc1.4d400d",RESOLUTION=416x234,AUDIO="bipbop_audio",SUBTITLES="subs"
gear1/prog_index.m3u8 <== new line link
#EXT-X-I-FRAME-STREAM-INF:URI="gear1/iframe_index.m3u8",CODECS="avc1.4d400d",BANDWIDTH=28451


Regex:

(?:#EXT-X-STREAM-INF:|#EXT-X-I-FRAME-STREAM-INF:)(?:BANDWIDTH=(?<BANDWIDTH>\d+),?|CODECS=(?<CODECS>"[^"]*"),?|RESOLUTION=(?<RESOLUTION>\d+x\d+),?|AUDIO=(?<AUDIO>"[^"]*"),?|SUBTITLES=(?<SUBTITLES>"[^"]*"),?|URI=(?<URI>"[^"]*"),?)*


Please see this demo, do you want something like this? Match an additional line and capture it into the 2nd URI group (with the (?J) modifier) if #EXT-X-STREAM-INF was matched in Group 1. - Wiktor Stribiżew
@Wiktor Stribiżew You are beyond godlike! Please make a post, so I can upvote and accept it as an answer. - Srdjan M.
Are you sure your engine is PCRE? Will it work in the actual current project code? - Wiktor Stribiżew
@Wiktor Stribiżew I don't know. I am using PHP. - Srdjan M.
Yes, PHP uses PCRE. - Wiktor Stribiżew

1 Answer


A quick fix for your pattern will look like this:

  • Capture the #EXT-X-STREAM-INF part into Group 1
  • Add (?J) modifier to allow named capturing groups with identical names
  • Add a conditional construct that will capture the whole line after the current pattern if Group 1 matched.
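The three steps above can be illustrated on a toy input before applying them to the real pattern. This is only a sketch: the `key`/`inline` labels, the `VAL` group name, and the `~` delimiter are made up for the illustration and are not part of the HLS format.

```php
<?php
// Toy pattern (hypothetical labels): "key:" takes its value from the next line,
// while "inline:" carries its value on the same line.
// (?J) lets both groups be named VAL; (?(1)...) only runs if Group 1 ("key") matched.
$pattern = '~(?J)(?:(key)|inline):(?:val=(?<VAL>\w+))?(?(1)\R(?<VAL>\w+))~';
$text = "key:\nfoo\ninline:val=bar";

preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // Whichever VAL group took part in the match ends up under the 'VAL' key.
    echo $m['VAL'], PHP_EOL;
}
```

Without (?J), PCRE should refuse to compile a pattern that declares the same group name twice, so the modifier is what makes the shared name possible.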

The pattern will look like

(?J)(?:(#EXT-X-STREAM-INF)|#EXT-X-I-FRAME-STREAM-INF):(?:BANDWIDTH=(?<BANDWIDTH>\d+),?|CODECS=(?<CODECS>"[^"]*"),?|RESOLUTION=(?<RESOLUTION>\d+x\d+),?|AUDIO=(?<AUDIO>"[^"]*"),?|SUBTITLES=(?<SUBTITLES>"[^"]*"),?|URI=(?<URI>"[^"]*"),?)*(?(1)\R(?<URI>(?:(?!#EXT)\S)+))


So, basically, I added (?(1)\R(?<URI>(?:(?!#EXT)\S)+)) at the end and captured (#EXT-X-STREAM-INF) at the start.

The conditional construct matches like this:

  • (? - start of the conditional construct
    • (1) - if Group 1 matched
    • \R - a line break
    • (?<URI> - start of a named capturing group
      • (?:(?!#EXT)\S)+ - any non-whitespace char (\S), 1 or more occurrences (+), that is not the starting char of a #EXT char sequence (the so-called "tempered greedy token")
    • ) - end of the named capturing group
  • ) - end of the conditional construct
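
Since the project is in PHP, here is a minimal sketch of how the pattern might be used with preg_match_all. The variable names, the `~` delimiter, the PREG_SET_ORDER flag, and the trim() call that strips the quotes from the quoted URI are choices made for this illustration; the sample text is the one from the question without the "<== new line link" annotation.

```php
<?php
// Sample playlist from the question (annotation removed).
$playlist = <<<'M3U8'
#EXT-X-STREAM-INF:BANDWIDTH=263851,CODECS="mp4a.40.2, avc1.4d400d",RESOLUTION=416x234,AUDIO="bipbop_audio",SUBTITLES="subs"
gear1/prog_index.m3u8
#EXT-X-I-FRAME-STREAM-INF:URI="gear1/iframe_index.m3u8",CODECS="avc1.4d400d",BANDWIDTH=28451
M3U8;

// "~" is used as the delimiter because the pattern itself contains "#".
$pattern = '~(?J)(?:(#EXT-X-STREAM-INF)|#EXT-X-I-FRAME-STREAM-INF):'
    . '(?:BANDWIDTH=(?<BANDWIDTH>\d+),?|CODECS=(?<CODECS>"[^"]*"),?'
    . '|RESOLUTION=(?<RESOLUTION>\d+x\d+),?|AUDIO=(?<AUDIO>"[^"]*"),?'
    . '|SUBTITLES=(?<SUBTITLES>"[^"]*"),?|URI=(?<URI>"[^"]*"),?)*'
    . '(?(1)\R(?<URI>(?:(?!#EXT)\S)+))~';

// PREG_SET_ORDER yields one array per matched tag line.
preg_match_all($pattern, $playlist, $matches, PREG_SET_ORDER);

$links = [];
foreach ($matches as $m) {
    // The quoted URI= form keeps its surrounding quotes, so strip them here.
    $links[] = trim($m['URI'], '"');
}

print_r($links);
// Expected entries: gear1/prog_index.m3u8 and gear1/iframe_index.m3u8
```

The other named groups (BANDWIDTH, CODECS, and so on) are available on each `$m` in the same way; attributes that are absent from a tag line come back as empty strings.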