2
votes

PCRE Regex: Is it possible for Regex to check for a pattern match within only the first X characters of a string, ignoring other parts of the string beyond that point?

My Regex:

I have a Regex:

/\S+V\s*/

This checks the string for non-whitespace characters which have a trailing 'V' and then a whitespace character or the end of the string.

This works. For example:

Example A:

 SEBSTI FMDE OPORV AWEN STEM students into STEM 

// Match found in 'OPORV' (correct)

Example B:

 ARKFE SSETE BLMI EDSF BRNT CARFR (name removed) Academy Networking Event 
      
//Match not found (correct).   

Re: the capitalised text: each letter, and each letter's placement, has a meaning in the source data. This is followed by generic info for humans to read ("Academy Networking Event", etc.).

My Issue:

In theory, the names can sometimes include Roman numerals, such as:

Example C:

 ARKFE SSETE BLME CARFR Academy IV Networking Event 
      
//Match found (incorrect).  
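
To reproduce the three cases, here is a minimal PHP sketch. The `$rows` and `$results` names are only illustrative, and the strings are trimmed copies of Examples A, B and C:

```php
<?php
// Trimmed copies of Examples A, B and C (illustrative data only).
$rows = [
    'A' => 'SEBSTI FMDE OPORV AWEN STEM students into STEM',
    'B' => 'ARKFE SSETE BLMI EDSF BRNT CARFR (name removed) Academy Networking Event',
    'C' => 'ARKFE SSETE BLME CARFR Academy IV Networking Event',
];

$results = [];
foreach ($rows as $label => $row) {
    // preg_match() returns 1 on a match and 0 otherwise;
    // store the matched word (without trailing whitespace) or null.
    $results[$label] = preg_match('/\S+V\s*/', $row, $m) ? trim($m[0]) : null;
}

var_dump($results);
```

Row A yields 'OPORV', row B yields no match, and row C yields 'IV', which is the false positive described above.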

I would like my Regex above to only check the first X characters of the string.

Can this be done in PCRE Regex itself? I can't find any reference to length counting in Regex and I suspect this can't easily be achieved. String lengths are completely arbitrary. (We have no control over the source data).

Intention:

/\S+V\s*/{check within first 25 characters only}
 ARKFE SSETE BLME CARFR Academy IV Networking Event 
                         ^
                         \-  Cut off point. Not found so far so stop. 

//Match not found (correct).  

Workaround:

The Regex is used in PHP, and my current solution is to cut the string in PHP so that only the first X characters (typically the first 20) are checked. I was curious whether there is a way of doing this within the Regex itself, without needing to manipulate the string directly in PHP.

$valueSubstring = substr($coreRow['value'],0,20); /* first 20 characters only */
$virtualCount = preg_match_all('/\S+V\s*/',$valueSubstring); 
Your "workaround" sounds like the simplest way to do it and it is probably what anyone would recommend. Finding a way to do it using only a regular expression would probably make a regular expression more difficult to understand. – Hernán Alarcón
@HernánAlarcón absolutely, but I was curious as I have never ever seen any Regex reference to searching only a sub-part of a string. – Martin
If you insist, I guess you could use a positive lookbehind to match up to 25-n characters before your match (with length n). – Hernán Alarcón
What about "ARKFE SSETE BLMEV CARFR Academy IV Networking Event"? – Gerard H. Pille

3 Answers

1
votes

The trick is to capture the end of the line after the first 25 characters in a lookahead and to check if it follows the eventual match of your subpattern:

$pattern = '~^(?=.{0,25}(.*)).*?\K\S+V\b(?=.*\1)~m';

details:

^ # start of the line

(?= # open a lookahead assertion
    .{0,25} # the first 25 characters (or fewer if the line is shorter)
    (.*) # capture the end of the line
) # close the lookahead

.*? # consume lazily the characters

\K # the match result starts here

\S+V    # your pattern
\b      # a word boundary (that matches between a letter and a white-space
        # or the end of the string)

(?=.*\1) # check that the end of the line follows with a reference to
         # the capture group 1 content.

Note that you can also write the pattern in a more readable way like this:

$pattern = '~^
    (*positive_lookahead: .{0,25} (?<line_end> .* ) )
    .*?    \K    \S+ V \b
    (*positive_lookahead: .*? \g{line_end} )   ~xm';

(The alternative syntax (*positive_lookahead: ...) is available since PHP 7.3)
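
As a quick check, the first pattern can be run against the strings from the question. This is only a sketch: the variable names are illustrative, and the strings are trimmed copies of Example A, Example C and the string raised in the comments.

```php
<?php
// Pattern from the answer above; 25 is the cut-off length.
$pattern = '~^(?=.{0,25}(.*)).*?\K\S+V\b(?=.*\1)~m';

// Trimmed copies of Example A, Example C and the string from the comments.
$a = 'SEBSTI FMDE OPORV AWEN STEM students into STEM';
$c = 'ARKFE SSETE BLME CARFR Academy IV Networking Event';
$g = 'ARKFE SSETE BLMEV CARFR Academy IV Networking Event';

$foundA = preg_match($pattern, $a, $mA); // 1, $mA[0] is 'OPORV'
$foundC = preg_match($pattern, $c);      // 0, 'IV' lies beyond the cut-off
$foundG = preg_match($pattern, $g, $mG); // 1, $mG[0] is 'BLMEV'

var_dump($foundA, $foundC, $foundG);
```

Because of `\K`, the overall match `$mA[0]` contains only the word itself, while `$mA[1]` holds the captured end of the line.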

0
votes

You can first look for your pattern after the first X chars and, if it is found there, skip the whole string; otherwise, match your pattern as usual. So, if X=25:

^.{25,}\S+V.*(*SKIP)(*F)|\S+V\s*

Details:

  • ^.{25,}\S+V.*(*SKIP)(*F) - start of string, 25 or more chars other than line break chars, as many as possible, then one or more non-whitespace chars and V, and then the rest of the string; the match is then failed and the matched text is skipped
  • | - or
  • \S+V\s* - match one or more non-whitespaces, V and zero or more whitespace chars.
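
A short PHP sketch of how this behaves on the strings from the question (variable names are illustrative, and the strings are trimmed copies of Example A, Example C and the string from the comments):

```php
<?php
// Pattern from the answer above, with X = 25.
$pattern = '/^.{25,}\S+V.*(*SKIP)(*F)|\S+V\s*/';

$a = 'SEBSTI FMDE OPORV AWEN STEM students into STEM';
$c = 'ARKFE SSETE BLME CARFR Academy IV Networking Event';
$g = 'ARKFE SSETE BLMEV CARFR Academy IV Networking Event';

$foundA = preg_match($pattern, $a, $mA); // 1, trim($mA[0]) is 'OPORV'
$foundC = preg_match($pattern, $c);      // 0, the whole string is skipped
$foundG = preg_match($pattern, $g);      // 0, see the note below

var_dump($foundA, $foundC, $foundG);
```

Note that the string from the comments also gives no match: as soon as a V word occurs after the first 25 characters, the whole string is skipped, even though 'BLMEV' sits inside the first 25. Whether that is acceptable depends on the data.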
-2
votes

Anything ending in V within the first 25 positions

^.{1,24}V\s


Any word ending in V in the first 25 positions

^.{1,23}[A-Z]V\s
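
A minimal PHP sketch of the second pattern, assuming trimmed copies of Examples A and C; the `$e` string is a hypothetical row added only to show the effect of the trailing `\s`:

```php
<?php
// Second pattern from the answer above: a capital letter plus V, followed by
// whitespace, with the V no later than position 25.
$pattern = '/^.{1,23}[A-Z]V\s/';

$a = 'SEBSTI FMDE OPORV AWEN STEM students into STEM';
$c = 'ARKFE SSETE BLME CARFR Academy IV Networking Event';
$e = 'ARKFE SSETE OPORV'; // hypothetical row that ends in a V word

$foundA = preg_match($pattern, $a); // 1
$foundC = preg_match($pattern, $c); // 0, 'IV' is beyond position 25
$foundE = preg_match($pattern, $e); // 0, no whitespace follows the final V

var_dump($foundA, $foundC, $foundE);
```

Since the pattern starts with `^.{1,23}`, the match itself is the whole prefix up to and including the V word, not just the word, and a V word at the very end of the string is not matched because `\s` requires a following whitespace character.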