I've been searching for two hours now and I don't really know what to do.
I'm trying to build an analyser which uses a lexer that can match several thousand words. These are natural-language words, which is why there are so many of them.
I first tried a simple approach, with just 1000 different matches for one token:
TOKEN :
{
<VIRG: ",">
| <COORD: "et">
| <ADVERBE: "vraiment">
| <DET: "la">
| <ADJECTIF: "bonne">
| <NOM: "pomme"
| "émails"
| "émaux"
| "APL"
| "APLs"
| "Acide"
| "Acides"
| "Inuk"
[...]
When I compile the generated code, javac fails with the "code too large" error.
So, how can I manage thousands of tokens in my lexer? Here are the options I have considered:
1. I've read that it is more efficient to use n tokens for each word than one token for n words, but then I would end up with rules containing 1000+ tokens, which doesn't look like a better idea.
2. I could modify the token manager, or build my own, so that it simply matches words against a list.
3. I know the lexer is a finite state machine, and that this is why matching against a list isn't possible, so is there any way to use another lexer?
4. I could automatically generate a huge regular expression that matches every word, but that wouldn't let me handle the words independently afterwards, and I'm not sure that writing a 60-line regular expression would be a great idea.
5. Maybe there is a way to load the tokens from a file; this solution is pretty close to solutions 2 and 3 (a minimal sketch of this idea follows the list).
6. Maybe I should use another language? I'm trying to migrate from XLE (which can handle a lexicon of more than 70,000 tokens) to Java, and what is interesting here is to generate Java files!
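To make options 2 and 5 more concrete, here is a minimal sketch of what I have in mind; the file name, format and class name are just placeholders. The lexicon lives in a plain text file (one tab-separated "word, category" pair per line), the lexer only needs a single generic WORD token, and the category is looked up in a map:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Sketch only: loads tab-separated "word<TAB>category" lines, e.g. "pomme<TAB>NOM",
// into a map so the analyser can classify words without a giant generated DFA.
public class Lexicon {
    private final Map<String, String> categories = new HashMap<>();

    public Lexicon(String path) throws IOException {
        for (String line : Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8)) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) {
                categories.put(parts[0], parts[1]);   // word -> category (NOM, DET, ...)
            }
        }
    }

    /** Returns the category of a word, or null if the word is not in the lexicon. */
    public String categoryOf(String word) {
        return categories.get(word);
    }
}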
So here it is: I can't find a way to handle several thousand tokens with a JavaCC lexer. It would be great if anyone who is used to this has an idea.
Best
Corentin
Comment: You could write your own token manager (JavaCC option USER_TOKEN_MANAGER) that uses the implementation technique that @rici mentions below. If you only want a lexer, JavaCC is likely not the best tool. – Theodore Norvell
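A minimal sketch of what such a hand-written token manager could look like, assuming JavaCC is run with USER_TOKEN_MANAGER=true so that it generates the TokenManager interface (a single getNextToken() method) and the Token class; the class name, the whitespace-based word splitting and the unknownKind fallback are placeholders, not generated names:

import java.util.Map;
import java.util.Scanner;

// Sketch only: classifies whitespace-separated words by map lookup
// instead of compiling thousands of literals into a generated DFA.
public class LexiconTokenManager implements TokenManager {
    private static final int EOF_KIND = 0;          // kind 0 is EOF in the generated constants
    private final Scanner in;                       // source text, split on whitespace
    private final Map<String, Integer> lexicon;     // word -> token kind
    private final int unknownKind;                  // kind used for out-of-lexicon words

    public LexiconTokenManager(Scanner in, Map<String, Integer> lexicon, int unknownKind) {
        this.in = in;
        this.lexicon = lexicon;
        this.unknownKind = unknownKind;
    }

    public Token getNextToken() {
        Token t = new Token();
        if (!in.hasNext()) {
            t.kind = EOF_KIND;
            t.image = "";
            return t;
        }
        String word = in.next();
        t.image = word;
        Integer kind = lexicon.get(word);           // O(1) lookup in the lexicon
        t.kind = (kind != null) ? kind : unknownKind;
        return t;
    }
}

The generated parser can then be constructed with this token manager instead of the default one (the parser class name depends on the grammar).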