I am working on a project that involves transforming part of speech tagged text into an ANTLR3 AST with phrases as nodes of the AST.
The input to ANTLR looks like:
DT-THE The NN dog VBD sat IN-ON on DT-THE the NN mat STOP .
i.e. (tag token)+ where neither the tag or the token contain white space.
Is the following a good way of lexing this:
WS : (' ')+ {skip();};
TOKEN : (~' ')+;
The grammar then has entries like the following to describe the lowest level of the AST:
dtTHE:'DT-THE' TOKEN -> ^('DT-THE' TOKEN);
nn:'NN' TOKEN -> ^('NN' TOKEN);
(and 186 more of these!)
This approach seems to work but results in a ~9000 line Java Lexer and takes a large amount of memory to build (~2gb) hence I was wondering whether this is the optimal way of solving this problem.