8
votes

Note that this question is in the context of Julia, and therefore (to my knowledge) PCRE.

Suppose that you had a string like this:

"sssppaaasspaapppssss"

and you wanted to match, individually, the repeating characters at the end of the string (in the case of our string, the four "s" characters - that is, so that matchall gives ["s","s","s","s"], not ["ssss"]). This is easy:

r"(.)(?=\1*$)"

It's practically trivial (and easily used - replace(r"(.)(?=\1*$)","hell","k") will give "hekk" while replace(r"(.)(?=\1*$)","hello","k") will give "hellk"). And it can be generalised for repeating patterns by switching out the dot for something more complex:

r"(\S+)(?=( \1)*$)"

which will, for instance, independently match the last three instances of "abc" in "abc abc defg abc h abc abc abc".

Which then leads to the question... how would you match the repeating character or pattern at the start of the string, instead? Specifically, using regex in the way it's used above.

The obvious approach would be to reverse the direction of the above regex as r"(?<=^\1*)(.)" - but PCRE/Julia doesn't allow lookbehinds to have variable length (except where it's fixed-variable, like (?<=ab|cde)), and thus throws an error. The next thought is to use "\K" as something along the lines of r"^\1*\K(.)", but this only manages to match the first character (presumably because it "advances" after matching it, and no longer matches the caret).

For clarity: I'm seeking a regex that will, for instance, result in

replace("abc abc defg abc h abc abc abc",<regex here>,"hello")

producing

"hello hello defg abc h abc abc abc"

As you can see, it's replacing each "abc" from the start with "hello", but only until the first non-match. The reverse one I provide above does this at the other end of the string:

replace("abc abc defg abc h abc abc abc",r"(\S+)(?=( \1)*$)","hello")

produces

"abc abc defg abc h hello hello hello"
2
use this ^(\S+)(?:\s+\1)* and then do splitting on space character,Avinash Raj
@AvinashRaj - if you read through the full question, you'll see that I'm wanting to know if it can be done without multiple steps - that is, just with a regex. I can do it with the end-of-string equivalent, with r"(.)(?=\1*$)" (or more generally r"(\S+)(?=( \1)*$)").Glen O
Reverse the string. ;-)1983
@KingMob - you can do that, if you only want to match in one direction. But it's not a regex solution, and it doesn't help if you want to match in both directions (for instance, to capture the first sho in each part of the string "shorts shoes shop shovel shortstuff shoplifter shopshop", and not the second sho in shopshop, you need to look in both directions - you're basically figuring out which part isn't the delimiter, in a sense)Glen O
Well, you could replace (\S+)((?= \1)|.*) with "hello$2" (JavaScript for example), but unfortunately you need to be able to reference the submatches in the replacement string which AFAIK you can't do in Julia :-(1983

2 Answers

8
votes

You can use the \G anchor that matches the position after the previous match or at the start of the string. In this way you ensure the contiguity of results from the start of the string to the last occurrence:

\G(\S+)( (?=\1 ))?

demo

or to be able to match until the end of the string:

\G(\S+)( (?=\1(?: |\z)))?
3
votes

For PCRE style engines, unfortunately there is no way to do this without
variable length lookbehind.

A pure solution is not possible.
There is no \G anchor trickery that can accomplish this.

Here is why the \G anchor won't work.

With the anchor, the only guarantee you have is that the last match
resulted in a match where the forward overlap was checked to be equal
to the current match.

As a result, you can only globally match up to N-1 of the duplicate's from the beginning.

Here is a proof:

Regex:

 # (?:\G([a-c]+)(?=\1))

 (?:
      \G 
      ( [a-c]+ )                    # (1)
      (?=
           \1 
      )
 )

Input:

abcabcabcbca

Output:

 **  Grp 0 -  ( pos 0 , len 3 ) 
abc  
 **  Grp 1 -  ( pos 0 , len 3 ) 
abc  
------------
 **  Grp 0 -  ( pos 3 , len 3 ) 
abc  
 **  Grp 1 -  ( pos 3 , len 3 ) 
abc  

Conclusion:

Even though you know the Nth one is there from the previous lookahead,
the Nth one can't be matched without the condition of the current lookahead.

Sorry, and good luck!
Let me know if you find a pure regex solution.