
I've been searching for two hours now and I don't really know what to do.

I'm trying to build an analyser that uses a lexer able to match several thousand words. These are natural-language words, which is why there are so many of them.

I first tried the simple way, with just 1000 different matches for one token:

    TOKEN :
    {
        <VIRG: ",">
    |   <COORD: "et">
    |   <ADVERBE: "vraiment">
    |   <DET: "la">
    |   <ADJECTIF: "bonne">
    |   <NOM: "pomme"
            |   "émails"
            |   "émaux"
            |   "APL"
            |   "APLs"
            |   "Acide"
            |   "Acides"
            |   "Inuk"

[...]

When I compile the generated code, javac complains that the code is too large.

So, how can I manage thousands of tokens in my lexer?

  1. I've read that it is more efficient to use n tokens, one for each word, than one token matching n words. But in that case I would have grammar rules referring to 1000+ tokens, which doesn't look like a better idea;

  2. I could modify the token manager, or build my own, so that it simply matches words against a list;

  3. From what I've read, the lexer is a finite-state machine, which is why this isn't possible; so is there any way to use another lexer?;

  4. I could automatically generate a huge regular expression that matches every word (a small generator along the lines of the sketch after this list), but that wouldn't let me handle the words independently afterwards, and I'm not sure that writing a 60-line regular expression would be a great idea;

  5. Maybe there is a way to load the tokens from a file; this solution is pretty close to solutions 2 and 3;

  6. Maybe I should use another language? I'm trying to migrate from XLE (which can handle a lexicon of more than 70,000 tokens) to Java, and what is interesting here is to generate Java files!
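To make option 4 concrete, here is roughly the kind of generator I have in mind; it's only an illustration (the file name noms.txt and the token name NOM are made up, and real words would need proper string escaping):

    // Illustration only: reads one word per line from a hypothetical noms.txt
    // and prints a JavaCC token production listing every word as an alternative.
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;

    public class GenerateNomToken {
        public static void main(String[] args) throws IOException {
            List<String> words = Files.readAllLines(Paths.get("noms.txt"),
                                                    StandardCharsets.UTF_8);
            StringBuilder out = new StringBuilder("TOKEN :\n{\n    <NOM: ");
            for (int i = 0; i < words.size(); i++) {
                if (i > 0) out.append("\n        |   ");
                out.append('"').append(words.get(i).trim()).append('"');
            }
            out.append(">\n}\n");
            System.out.print(out);
        }
    }

But I suspect the generated lexer would end up just as big anyway.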

So there it is: I can't find a way to handle several thousand tokens with a JavaCC lexer. It would be great if anyone who has dealt with this before has an idea.

Best

Corentin

If you are using JavaCC to build a parser, you may want to use a custom lexer (see option USER_TOKEN_MANAGER) that uses the implementation technique that @rici mentions below. If you only want a lexer, JavaCC is likely not the best tool. – Theodore Norvell
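A minimal sketch of what such a hand-written token manager might look like, assuming the grammar is built with USER_TOKEN_MANAGER = true so that JavaCC only generates the Token class, the constants file and a TokenManager interface to implement; the class name, MyParserConstants and the INCONNU kind are invented names, and the lexicon would really be loaded from a file rather than hard-coded:

    // Hedged sketch, not a drop-in implementation: assumes
    // options { USER_TOKEN_MANAGER = true; } in the .jj file, so JavaCC
    // generates a TokenManager interface (getNextToken()) plus Token and
    // the constants file. MyParserConstants and INCONNU are made-up names.
    import java.io.IOException;
    import java.io.PushbackReader;
    import java.io.Reader;
    import java.util.HashMap;
    import java.util.Map;

    public class DictionaryTokenManager implements TokenManager {

        private final PushbackReader in;
        private final Map<String, Integer> lexicon = new HashMap<>();

        public DictionaryTokenManager(Reader reader) {
            this.in = new PushbackReader(reader);
            // In practice the lexicon would be filled from the word-list files.
            lexicon.put(",",        MyParserConstants.VIRG);
            lexicon.put("et",       MyParserConstants.COORD);
            lexicon.put("vraiment", MyParserConstants.ADVERBE);
            lexicon.put("la",       MyParserConstants.DET);
            lexicon.put("bonne",    MyParserConstants.ADJECTIF);
            lexicon.put("pomme",    MyParserConstants.NOM);
        }

        public Token getNextToken() {
            try {
                int c = in.read();
                while (c != -1 && Character.isWhitespace(c)) {   // skip blanks
                    c = in.read();
                }
                if (c == -1) {
                    return Token.newToken(MyParserConstants.EOF, "");
                }
                StringBuilder word = new StringBuilder();
                if (Character.isLetter(c)) {                     // read a whole word
                    while (c != -1 && Character.isLetter(c)) {
                        word.append((char) c);
                        c = in.read();
                    }
                    if (c != -1) {
                        in.unread(c);                            // give back the lookahead
                    }
                } else {
                    word.append((char) c);                       // punctuation: one character
                }
                String image = word.toString();
                Integer kind = lexicon.get(image);               // the dictionary decides the kind
                return Token.newToken(kind != null ? kind : MyParserConstants.INCONNU, image);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }

The generated parser can then be handed an instance of this class through its TokenManager constructor, so the grammar itself never has to list the words.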

1 Answer


I don't know how javacc builds its DFA, but it's certain that a DFA capable of distinguishing thousands of words would be quite large. (But by no means unreasonably large: I've gotten flex to build DFAs with hundreds of thousands of states without major problems.)

The usual approach to lexicons with a huge number of fixed lexemes is to use the DFA to recognize a potential word (e.g., a sequence of alphabetical characters) and then look the word up in a dictionary to get the token type. That's also more flexible because you can update the dictionary without recompiling.
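In JavaCC terms that could look roughly like the sketch below (untested; MOT, the character ranges and the Lexicon helper are placeholders, and accented letters may need the UNICODE_INPUT option): one catch-all word token whose lexical action reclassifies the token through a dictionary lookup.

    TOKEN :
    {
        <VIRG: ",">
    |   <MOT: (["a"-"z","A"-"Z","à"-"ÿ"])+>
        {
            // Reclassify the word: look up its image in a dictionary
            // (e.g. a HashMap<String, Integer> loaded from a file) and
            // overwrite the token kind with NOM, ADJECTIF, etc.
            matchedToken.kind = Lexicon.kindOf(matchedToken.image);
        }
    }

One detail to watch: the NOM, ADJECTIF, ... constants still have to be declared in the grammar for the parser to reference them; declaring each category with a single representative word is one way to get the constants while letting the dictionary classify everything else.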