0
votes

Here is my regular expression:

❰(❮\d+[\-\d]*❯)⦓([^⦔]*)⦔❱

Here is the test text (online demo in javascript where it works fine):

Nulla imperdiet ❰❮6❯⦓“Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse gravida consectetur mauris, eget ornare velit consequat vitae.⦔❱❰❮7❯⦓Morbi in quam id nulla facilisis vestibulum sit amet ornare est. Duis dolor erat, porttitor at eleifend congue, lacinia vitae est. Phasellus ac sem ut velit fermentum porta at sit amet neque.⦔❱❰❮8❯⦓Etiam in congue turpis. Cras volutpat est mauris. Nulla imperdiet libero vitae metus semper, sit amet dictum lectus placerat. Aenean at venenatis libero.⦔❱❰❮9-10❯⦓Aenean luctus at nibh eget scelerisque. Phasellus vel consequat dui, eu euismod lacus. Nam id tellus tincidunt, tristique quam eu, cursus nulla. Suspendisse ac nibh lacinia, tempus enim quis, elementum nulla. .⦔❱ eu euismod.

But it does not work in PHP. That is, it does not retrieve the first match, i.e., from ❰❮6❯⦓“ to vitae.⦔❱. Intriguingly, if I remove the Unicode double-quote character (“), it works fine, but with it present the first match is not found. Why is this, and how can it be avoided?


Explanation of the regex: I want to match the content between ⦓ and ⦔ if it is the only content, apart from the digit content (❮ ❯), between ❰ and ❱.

Example for Match:

❰❮6❯⦓Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse gravida consectetur mauris, eget ornare velit consequat vitae.⦔❱

Example for Not a Match:

❰❮6❯⦓Lorem ipsum dolor sit amet, consectetur adipiscing elit.⦔ Suspendisse gravida consectetur mauris, eget ornare velit consequat vitae.❱


My PHP Code:

<?php
$subject = "Nulla imperdiet ❰❮6❯⦓“Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse gravida consectetur mauris,
         eget ornare velit consequat vitae.⦔❱❰❮7❯⦓Morbi in quam id nulla facilisis vestibulum sit amet ornare est. Duis dolor erat, 
        porttitor at eleifend congue, lacinia vitae est. Phasellus ac sem ut velit fermentum porta at sit amet neque.⦔❱❰❮8❯⦓Etiam in congue turpis. 
        Cras volutpat est mauris. Nulla imperdiet libero vitae metus semper, sit amet dictum lectus placerat. Aenean at venenatis libero.⦔❱❰❮9-10❯⦓Aenean 
        luctus at nibh eget scelerisque. Phasellus vel consequat dui, eu euismod lacus. Nam id tellus tincidunt, tristique quam eu,
        cursus nulla. Suspendisse ac nibh lacinia, tempus enim quis, elementum nulla. .⦔❱ eu euismod.";


$pattern = '#❰(❮\d+[\-\d]*❯)⦓([^⦔]*)⦔❱#';
preg_match_all($pattern, $subject, $matches);
echo '<pre>';
print_r($matches);
echo '</pre>';    
?>

output:

Array
(
    [0] => Array
        (
            [0] => ❰❮7❯⦓Morbi in quam id nulla facilisis vestibulum sit amet ornare est. Duis dolor erat, 
        porttitor at eleifend congue, lacinia vitae est. Phasellus ac sem ut velit fermentum porta at sit amet neque.⦔❱
            [1] => ❰❮8❯⦓Etiam in congue turpis. 
        Cras volutpat est mauris. Nulla imperdiet libero vitae metus semper, sit amet dictum lectus placerat. Aenean at venenatis libero.⦔❱
            [2] => ❰❮9-10❯⦓Aenean 
        luctus at nibh eget scelerisque. Phasellus vel consequat dui, eu euismod lacus. Nam id tellus tincidunt, tristique quam eu,
        cursus nulla. Suspendisse ac nibh lacinia, tempus enim quis, elementum nulla. .⦔❱
        )

    [1] => Array
        (
            [0] => ❮7❯
            [1] => ❮8❯
            [2] => ❮9-10❯
        )

    [2] => Array
        (
            [0] => Morbi in quam id nulla facilisis vestibulum sit amet ornare est. Duis dolor erat, 
        porttitor at eleifend congue, lacinia vitae est. Phasellus ac sem ut velit fermentum porta at sit amet neque.
            [1] => Etiam in congue turpis. 
        Cras volutpat est mauris. Nulla imperdiet libero vitae metus semper, sit amet dictum lectus placerat. Aenean at venenatis libero.
            [2] => Aenean 
        luctus at nibh eget scelerisque. Phasellus vel consequat dui, eu euismod lacus. Nam id tellus tincidunt, tristique quam eu,
        cursus nulla. Suspendisse ac nibh lacinia, tempus enim quis, elementum nulla. .
        )

)
@anubhava: I mentioned that it was working fine on the online demo. It is not working in PHP. That is my problem; please test it here: phpliveregex.com - Jayarathina Madharasan
Show a PHP code demo which is not working - anubhava
It seems that you're matching unicode, but aren't using the u modifier - try to add it. - h2ooooooo
@h2ooooooo: Thanks a lot. That worked... :) Please add your comment as an answer. - Jayarathina Madharasan

1 Answer

4
votes

You're matching Unicode characters, but you haven't included the Unicode modifier (u), so PCRE treats the pattern and the subject as raw bytes rather than as UTF-8 characters.
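
A small sketch of why the curly quote in particular breaks the match, assuming the script file is saved as UTF-8. The variable names are only for illustration. In UTF-8, ⦔ and “ both start with the byte E2, and without u the class [^⦔] excludes each of the three bytes of ⦔ separately, so it cannot consume the first byte of “:

```php
<?php
// Assumption: this file is saved as UTF-8.
$close = "⦔"; // U+2994
$quote = "“"; // U+201C

// Both characters share the leading byte E2 in UTF-8.
$closeHex = bin2hex($close);
$quoteHex = bin2hex($quote);
echo $closeHex, "\n"; // e2a694
echo $quoteHex, "\n"; // e2809c

// Without u: [^⦔] is a byte class excluding E2, A6 and 94,
// so it refuses the first byte of the quote.
$byteMatch = preg_match('#^[^⦔]#', $quote);

// With u: [^⦔] excludes only the code point U+2994,
// so the quote is accepted as one character.
$utf8Match = preg_match('#^[^⦔]#u', $quote);

var_dump($byteMatch); // int(0)
var_dump($utf8Match); // int(1)
```

The other text in the question is plain ASCII, which is why only the match containing the quote was lost.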

From the manual:

u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern and the subject is checked since PHP 4.3.5. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.

To fix your problem, simply append u to your regex:

$pattern = '#❰(❮\d+[\-\d]*❯)⦓([^⦔]*)⦔❱#u';
// Add the unicode modifier            ^
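
To see the difference side by side, here is a minimal comparison on a shortened version of the subject from the question (the shortened text and the variable names are my own, and the file is assumed to be saved as UTF-8):

```php
<?php
// Shortened subject: the first block contains the curly quote, the second does not.
$subject = 'Nulla ❰❮6❯⦓“Lorem ipsum.⦔❱❰❮7❯⦓Morbi in quam.⦔❱ eu euismod.';

$bytePattern = '#❰(❮\d+[\-\d]*❯)⦓([^⦔]*)⦔❱#';  // original pattern, byte mode
$utf8Pattern = '#❰(❮\d+[\-\d]*❯)⦓([^⦔]*)⦔❱#u'; // same pattern with the u modifier

$withoutU = preg_match_all($bytePattern, $subject, $byteMatches);
$withU    = preg_match_all($utf8Pattern, $subject, $utf8Matches);

echo "without u: $withoutU match(es)\n"; // only the ❮7❯ block
echo "with u:    $withU match(es)\n";    // both blocks

// Group 1 holds the ❮…❯ part of every match found in UTF-8 mode.
print_r($utf8Matches[1]);
```

With u, both ❮6❯ and ❮7❯ show up in group 1; without it, only ❮7❯ does.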