
I'm looking for a regex that I can use in my tokenizer to compile a config file. Here is a snippet from a PHP class:

private $token = array(
    "PATH" => "([a-zA-Z\_-]+\.|\*\.)+([a-zA-Z\_-]+|\*)",
    "MIXED" => "[a-zA-Z0-9-_\(\)\/]{2,}",
    "STRING" => "[a-zA-Z-_]{2,}"
);

private function getToken($string) {
    foreach($this->token as $name => $pattern) {
        preg_match("/^".$pattern."/", $string, $match);
        if(!empty($match))
            return array($name, $match[0]);
    }

    return false;
}

"MIXED" should match "foo/bar" but not "foobar", and "STRING" should match "foobar" but not "foo/bar". Currently both "foobar" and "foo/bar" come out as "MIXED".

How do I write this "AND NOT" in a single pattern?

Thank you.

1
"MIXED" => "[a-zA-Z0-9-_()]+\/[a-zA-Z0-9-_()]+" - Cougar
To be more precise: "MIXED" is also "foo()", "foo(255)" - Greggel
Cougar is on the right path. What you want is to express the idea "contains at least one slash". - Kaz
Which parsing strategy? First match? Largest match? - hakre
First match. A "CHARACTER" is at least. - Greggel

1 Answer


This pattern will match any sequence of letters, digits, underscores, hyphens and slashes which contains at least one slash:

[a-zA-Z0-9-_/]*\/[a-zA-Z0-9-_/]*

This gives you the general idea of how to reject tokens like abc while matching ab/c. It is very similar to distinguishing floating-point constants from integer constants: the float pattern requires a decimal point, just as this one requires a slash.
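As a minimal sketch of that idea, here is the pattern wrapped in a small function. The name matchMixed is hypothetical and not part of the asker's class. The hyphen is moved to the end of the character class so it cannot be read as a range, and '#' is used as the delimiter so the slash needs no escaping; both are stylistic choices, not requirements.

```php
<?php
// Hypothetical helper (not from the original class): returns the MIXED
// token found at the start of $input, or null when there is no slash.
function matchMixed(string $input): ?string
{
    // Same character set as the pattern above: letters, digits,
    // underscore, hyphen and slash, with one slash made mandatory.
    $pattern = '#^[a-zA-Z0-9_/-]*/[a-zA-Z0-9_/-]*#';

    return preg_match($pattern, $input, $m) === 1 ? $m[0] : null;
}

var_dump(matchMixed('ab/c')); // string(4) "ab/c"
var_dump(matchMixed('abc'));  // NULL
```

Because "abc" is rejected here, a later STRING pattern in the token table gets the chance to match it under a first-match strategy.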

You should probably be tokenizing inputs like foo/bar(255) as four tokens: "foo/bar", "(", "255" and ")".
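A rough sketch of what that could look like, assuming a first-match strategy as stated in the comments. The token names NUMBER, LPAREN and RPAREN, the exact patterns, and the tokenize function are assumptions made for illustration; they do not come from the question.

```php
<?php
// Hypothetical token table. Order matters: the first matching pattern wins,
// so the slash-requiring MIXED is tried before NUMBER and STRING.
const TOKEN_PATTERNS = [
    'MIXED'  => '[a-zA-Z0-9_-]+(?:/[a-zA-Z0-9_-]+)+', // needs at least one slash
    'NUMBER' => '[0-9]+',
    'STRING' => '[a-zA-Z_-]{2,}',
    'LPAREN' => '\(',
    'RPAREN' => '\)',
];

function tokenize(string $input): array
{
    $tokens = [];
    $offset = 0;
    $length = strlen($input);

    while ($offset < $length) {
        $matched = false;
        foreach (TOKEN_PATTERNS as $name => $pattern) {
            // The A modifier anchors the match at $offset.
            if (preg_match('#' . $pattern . '#A', $input, $m, 0, $offset) === 1) {
                $tokens[] = [$name, $m[0]];
                $offset += strlen($m[0]);
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            throw new InvalidArgumentException("Unexpected input at offset $offset");
        }
    }

    return $tokens;
}

print_r(tokenize('foo/bar(255)'));
```

With parentheses handled as their own tokens, MIXED no longer needs "(" and ")" in its character class, and the parser rather than the regex decides whether "foo(255)" is valid.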

Otherwise, enforcing the slash requirement gets complicated. With the naive approaches, MIXED can end up matching things like these:

foo(255/255)
foo(/)

or even:

)/-

just because it contains a slash somewhere, not necessarily where you want.
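To see this concretely, here is a quick check. It assumes the naive pattern is simply the asker's MIXED character set (with parentheses) plus one mandatory slash; the function name isNaiveMixed is made up for this illustration.

```php
<?php
// Illustration of the problem, not a recommended pattern: the asker's
// character set, including parentheses, with "a slash somewhere" required.
function isNaiveMixed(string $input): bool
{
    return preg_match('#^[a-zA-Z0-9_()/-]*/[a-zA-Z0-9_()/-]*$#', $input) === 1;
}

foreach (['foo/bar', 'foo(255/255)', 'foo(/)', ')/-', 'foobar'] as $input) {
    echo $input, ' => ', isNaiveMixed($input) ? 'MIXED' : 'no match', PHP_EOL;
}
```

Every input except foobar is accepted as MIXED, including the three that are almost certainly not meant to be valid.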

Clarify your requirements.