PCRE Regex: Is it possible to check within only the first X characters of a string for a match

Question

PCRE Regex: Is it possible for Regex to check for a pattern match within only the first X characters of a string, ignoring other parts of the string beyond that point?

My Regex:

I have a Regex:

/\S+V\s*/

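This checks the string for a run of non-whitespace characters which has a trailing 'V', followed by optional whitespace. The intent is to find a 'V' that ends a word, i.e. one that sits before a whitespace character or the end of the string.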

This works. For example:

Example A:

 SEBSTI FMDE OPORV AWEN STEM students into STEM 

// Match found in 'OPORV' (correct)

Example B:

 ARKFE SSETE BLMI EDSF BRNT CARFR (name removed) Academy Networking Event 
      
//Match not found (correct).

Re: the capitalised text: each letter, and each letter's placement, has a meaning in the source data. This is followed by generic info for humans to read ("Academy Networking Event", etc.)

My Issue:

In theory, the human-readable part can contain names with Roman numerals, such as:

Example C:

 ARKFE SSETE BLME CARFR Academy IV Networking Event 
      
//Match found (incorrect).

I would like my Regex above to only check the first X characters of the string.

Can this be done in PCRE Regex itself? I can't find any reference to length counting in Regex and I suspect this can't easily be achieved. String lengths are completely arbitrary. (We have no control over the source data).

Intention:

/\S+V\s*/{check within first 25 characters only}

 ARKFE SSETE BLME CARFR Academy IV Networking Event 
                         ^
                         \-  Cut off point. Not found so far so stop. 

//Match not found (correct).

Workaround:

The Regex is used in PHP, and my current solution is to cut the string in PHP so that only the first X characters (typically the first 20) are checked. I was curious whether there is a way of doing this within the Regex itself, without needing to manipulate the string directly in PHP.

$valueSubstring = substr($coreRow['value'],0,20); /* first 20 characters only */
$virtualCount = preg_match_all('/\S+V\s*/',$valueSubstring);
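
For reference, the workaround above can be wrapped in a small function and run against Examples A to C. This is only a sketch: the function name countVCodesInPrefix and the default limit of 20 are assumptions for illustration, not part of the original code.

```php
<?php
// Hypothetical helper (not from the original post): count codes ending in 'V'
// within the first $limit bytes of $value only.
function countVCodesInPrefix(string $value, int $limit = 20): int
{
    // substr() counts bytes; mb_substr() would count characters instead.
    $prefix = substr($value, 0, $limit);
    $count = preg_match_all('/\S+V\s*/', $prefix);

    // preg_match_all() returns false on error; treat that as "no match" here.
    return $count === false ? 0 : $count;
}

$examples = [
    'A' => 'SEBSTI FMDE OPORV AWEN STEM students into STEM',
    'B' => 'ARKFE SSETE BLMI EDSF BRNT CARFR (name removed) Academy Networking Event',
    'C' => 'ARKFE SSETE BLME CARFR Academy IV Networking Event',
];

foreach ($examples as $label => $value) {
    printf("%s: %d\n", $label, countVCodesInPrefix($value));
}
```

With the default limit of 20 this should print 1 for Example A and 0 for Examples B and C, because the "IV" in Example C lies beyond the cut-off.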

Your "workaround" sounds like the simplest way to do it and it is probably what anyone would recommend. Finding a way to do it using only a regular expression would probably make a regular expression more difficult to understand. — Hernán Alarcón
@HernánAlarcón absolutely, but I was curious as I have never ever seen any Regex reference to searching only a sub-part of a string. — Martin
If you insist, I guess you could use a positive lookbehind to match up to 25-n characters before your match (with length n). — Hernán Alarcón
What about "ARKFE SSETE BLMEV CARFR Academy IV Networking Event" ? — Gerard H. Pille

Casimir et Hippolyte · Accepted Answer · 2021-02-10T21:06:33

The trick is to capture, in a lookahead, whatever remains of the line after the first 25 characters, and then to check that this captured remainder still follows a possible match of your subpattern:

$pattern = '~^(?=.{0,25}(.*)).*?\K\S+V\b(?=.*\1)~m';

details:

^ # start of the line

(?= # open a lookahead assertion
    .{0,25} # up to the first twenty-five characters
    (.*) # capture the end of the line
) # close the lookahead

.*? # lazily consume the characters before the match

\K # the match result starts here

\S+V    # your pattern
\b      # a word boundary (matches between a letter and whitespace,
        # or at the end of the string)

(?=.*\1) # check that the end of the line follows with a reference to
         # the capture group 1 content.
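
As a quick check, the pattern can be run against the strings from the question and from the comments. This is a minimal sketch: the helper name firstVCodeWithinLimit is an assumption for illustration, and the sample strings are used without their leading space.

```php
<?php
// Pattern copied from above; the limit of 25 is the one chosen in this answer.
$pattern = '~^(?=.{0,25}(.*)).*?\K\S+V\b(?=.*\1)~m';

// Hypothetical helper: returns the matched code, or null when nothing matches.
function firstVCodeWithinLimit(string $pattern, string $value): ?string
{
    return preg_match($pattern, $value, $m) === 1 ? $m[0] : null;
}

$examples = [
    'A'       => 'SEBSTI FMDE OPORV AWEN STEM students into STEM',
    'C'       => 'ARKFE SSETE BLME CARFR Academy IV Networking Event',
    'comment' => 'ARKFE SSETE BLMEV CARFR Academy IV Networking Event',
];

foreach ($examples as $label => $value) {
    echo $label, ': ';
    var_export(firstVCodeWithinLimit($pattern, $value));
    echo "\n";
}
```

This should report 'OPORV' for Example A, NULL for Example C and 'BLMEV' for the string from the comments. Because the pattern is anchored with ^, it returns at most one match per line, whereas preg_match_all() on a truncated string counts every code inside the cut.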

Note that you can also write the pattern in a more readable way like this:

$pattern = '~^
    (*positive_lookahead: .{0,25} (?<line_end> .* ) )
    .*?    \K    \S+ V \b
    (*positive_lookahead: .*? \g{line_end} )   ~xm';

(The alternative syntax (*positive_lookahead: ...) is available since PHP 7.3)
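
Since the question asks about an arbitrary X, the limit can also be injected when the pattern is built. The following is a sketch only: buildLimitedPattern is a hypothetical helper name, and the range check reflects PCRE's upper bound of 65535 for counted repeats.

```php
<?php
// Hypothetical helper: builds the short pattern from this answer for any limit.
function buildLimitedPattern(int $limit): string
{
    // PCRE caps counted repeats such as {0,n} at 65535.
    if ($limit < 0 || $limit > 65535) {
        throw new InvalidArgumentException('limit must be between 0 and 65535');
    }

    // %d is the only placeholder; the rest of the pattern is copied verbatim.
    return sprintf('~^(?=.{0,%d}(.*)).*?\K\S+V\b(?=.*\1)~m', $limit);
}

$value = 'ARKFE SSETE BLME CARFR Academy IV Networking Event';

var_dump(preg_match(buildLimitedPattern(25), $value)); // int(0): "IV" ends after character 25
var_dump(preg_match(buildLimitedPattern(40), $value)); // int(1): "IV" ends within the first 40
```

Note that the limit applies to where the match ends: a code is accepted only if it lies entirely within the first X characters of the line.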