I'm trying to match any prefix of the string Something, ignoring case. For example, the inputs So, SOM, SomeTH, some and S should all be accepted, because they are all prefixes of Something.

My code

Ss[oO]|Ss[omOMOmoM] {
        printf("Accept Something: %s\n", yytext);
}

Input

Som

Output

Accept Something: So
Invalid Character

It's supposed to match Som, because Som is a prefix of Something. I don't understand why my code doesn't work. Can anyone tell me what I am doing wrong?

2 Answers

2
votes

I don't know what you think the meaning of

Ss[oO]|Ss[omOMOmoM]

is, but what it matches is either:

  • an S followed by an s followed by exactly one of the letters o or O, or
  • an S followed by an s followed by exactly one of the letters o, O, m or M. Putting a symbol more than once inside a bracket expression has no effect.
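To check that reading of the pattern outside of flex, here is a minimal sketch using POSIX regcomp/regexec. It assumes a POSIX system, and it relies on flex and POSIX extended regular expressions agreeing for a pattern this simple (they differ in other areas). The helper name matches_whole is made up for this illustration.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if the whole of `input` matches the POSIX
 * extended regex `pattern`, 0 if it does not, and -1 if the pattern is too
 * long for the local buffer or fails to compile. */
int matches_whole(const char *pattern, const char *input)
{
    char anchored[256];
    regex_t re;

    /* Wrap the pattern as ^(...)$ so the alternation is anchored as a unit. */
    int n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;

    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    int rc = regexec(&re, input, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With this helper, matches_whole("Ss[oO]|Ss[omOMOmoM]", "Sso") returns 1, while the same pattern with "Som" returns 0, because the literal s after the S is required.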

Also, I don't see how that could produce the output you report. Perhaps there was a copy-and-paste error, or perhaps you have other pattern rules.

If you want to match prefixes, use nested optional matches:

s(o(m(e(t(h(i(ng?)?)?)?)?)?)?)?

If you want case-insensitive matches, you could write out all the character classes, but that gets tiresome; it is simpler to use a case-insensitive flag:

(?i:s(o(m(e(t(h(i(ng?)?)?)?)?)?)?)?)

(?i: turns on the case-insensitive flag up to the matching close parenthesis.
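The nested optional pattern can be tried outside of flex as well. The sketch below assumes a POSIX system and uses regcomp with REG_ICASE in place of the flex-specific (?i: ...) syntax, which POSIX extended regular expressions do not have. The function name is_something_prefix is invented for the example.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `input` is a non-empty prefix of
 * "something" (ignoring case), 0 if it is not, and -1 if the pattern
 * fails to compile. */
int is_something_prefix(const char *input)
{
    /* Each "(x ... )?" layer makes the rest of the word optional, so the
     * match can stop after any letter. "ng?" covers "n" and "ng". */
    static const char pattern[] = "^s(o(m(e(t(h(i(ng?)?)?)?)?)?)?)?$";
    regex_t re;

    /* REG_ICASE stands in for flex's (?i: ... ) group. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_ICASE | REG_NOSUB) != 0)
        return -1;

    int rc = regexec(&re, input, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

The inputs from the question (So, SOM, SomeTH, some, S) all return 1, while something like Somx returns 0.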

In practice, this is probably not what you want. Normally, you will want to recognise a complete word as a token. You could then check to see if the word is a prefix in the rule action:

[[:alpha:]]+    { /* strncasecmp is declared in <strings.h> */
                  if (yyleng <= strlen("something") && 0 == strncasecmp(yytext, "something", yyleng)) {
                      /* do something */
                  }
                }
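The check inside that action can be pulled out into a small function, which makes it easy to test without running the scanner. This is only a sketch: the name is_prefix_ignore_case and its parameters are hypothetical, strncasecmp is the POSIX function from <strings.h>, and the len > 0 guard is an addition (the [[:alpha:]]+ rule never produces an empty token, so the action itself does not need it).

```c
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* Hypothetical helper: returns 1 if the first `len` bytes of `text` are a
 * non-empty prefix of `word`, ignoring case, and 0 otherwise.
 * In a flex action, `text` would be yytext and `len` would be yyleng. */
int is_prefix_ignore_case(const char *text, size_t len, const char *word)
{
    return len > 0
        && len <= strlen(word)                 /* a prefix cannot be longer than the word */
        && strncasecmp(text, word, len) == 0;  /* compare only the first len bytes */
}
```

In the rule action this would be called as is_prefix_ignore_case(yytext, yyleng, "something").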

There is lots of information in the Flex manual.

2
votes

Right now your code (as shown) should only match "Sso" or "SsO" or "Ssm" or "SsM".

You have two alternatives that each start with Ss (without square brackets), so those characters will be matched literally. That's followed by either [oO] or [omOMOmoM], but the characters in square brackets represent alternatives, so the second one is equivalent to [oOmM], i.e., any one of o, O, m or M.

I'd start with: %option caseless to make it a case-insensitive scanner, so you don't have to list the upper- and lower-case equivalents of every letter.

Then it's probably easiest to just list the alternatives literally:

s|so|som|some|somet|someth|somethi|somethin|something { printf("found prefix"); }

I guess you can make the pattern a bit shorter (at least in the source code) by doing something on this order:

s(o(m(e(t(h(i(n(g)?)?)?)?)?)?)?)? { printf("found prefix"); }

Doesn't seem like a huge improvement to me, but some might find it more attractive than I do.
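The nested form is easy to mistype by hand, so it can be worth generating it. The following is a rough sketch of one way to do that; build_prefix_pattern is a made-up name, and it assumes the word contains only plain letters or digits (nothing that would need escaping in a flex pattern).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: writes a pattern matching every non-empty prefix of
 * `word` into `out`, e.g. "abc" becomes "a(b(c)?)?".
 * Returns 0 on success, -1 if `word` is empty or `out` is too small.
 * Assumes `word` has no regex metacharacters. */
int build_prefix_pattern(const char *word, char *out, size_t out_size)
{
    size_t n = strlen(word);
    if (n == 0)
        return -1;

    /* First letter, then "(x" and ")?" for each remaining letter, plus NUL. */
    size_t needed = 1 + 4 * (n - 1) + 1;
    if (out_size < needed)
        return -1;

    size_t pos = 0;
    out[pos++] = word[0];
    for (size_t i = 1; i < n; i++) {   /* open one group per remaining letter */
        out[pos++] = '(';
        out[pos++] = word[i];
    }
    for (size_t i = 1; i < n; i++) {   /* close each group and make it optional */
        out[pos++] = ')';
        out[pos++] = '?';
    }
    out[pos] = '\0';
    return 0;
}
```

The result can be pasted into the rules section of the scanner source. For a nine-letter word such as "something" the output is 33 characters long, so a 64-byte buffer is plenty.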

If you don't want to use %option caseless the basic idea helps more:

[sS]([oO]([mM]([eE]([tT]([hH]([iI]([nN]([gG])?)?)?)?)?)?)?)? { printf("found prefix"); }

Listing every possible combination of upper and lower case would get tedious.