Parse sentences with different word types

Question

I'm looking for a grammar for analyzing two types of sentences, where a sentence is a sequence of words separated by whitespace:

ID1: sentences whose words do not begin with a digit
ID2: sentences whose words either do not begin with a digit or are plain numbers

Basically, the structure of the grammar should look like

ID1 separator ID2  

ID1: Word can contain number like Var1234 but not start with a number  

ID2: Same as above but 1234 is allowed  

separator: e.g. '='

@Bart
I just tried to add two tokens, '_' and '"', as a lexer rule Special for later use in the lexer rule Word. Even though I haven't used Special yet in the following grammar, I get this error in ANTLRWorks 1.4.2:
The following token definitions can never be matched because prior tokens match the same input: Special
But when I add fragment before Special, the error goes away. Why?

grammar Sentence1b1;

tokens
{
  TCUnderscore  = '_' ;
  TCQuote       = '"' ;
}

assignment
  :  id1 '=' id2
  ;

id1
  :  Word+
  ;

id2
  :  ( Word | Int )+
  ;

Int
  :  Digit+
  ;

// A word must start with a letter
Word
  :  ( 'a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | Digit )*
  ;

Special
  : ( TCUnderscore | TCQuote )
  ;

Space
  :  ( ' ' | '\t' | '\r' | '\n' ) { $channel = HIDDEN; }
  ;

fragment Digit
  :  '0'..'9'
  ;

The lexer rule Special is then meant to be used in the lexer rule Word:

Word
  :  ( 'a'..'z' | 'A'..'Z' | Special ) ('a'..'z' | 'A'..'Z' | Special | Digit )*
  ;

Your definitions of the two different types of sentences could be interpreted in a couple of ways (and you only mean 1 specific way, I assume :)). Could you give some concrete examples of the two types of sentences? Thanks! — Bart Kiers
you can edit your original question with new information. I wouldn't post HTML though, have a look at the Markdown help to see how to properly format the question. — Bart Kiers

Bart Kiers · Accepted Answer · 2011-07-29T12:35:01

I'd go for something like this:

grammar Sentence;

assignment
  :  id1 '=' id2
  ;

id1
  :  Word+
  ;

id2
  :  (Word | Int)+
  ;

Int
  :  Digit+
  ;

// A word must start with a letter
Word
  :  ('a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | Digit)*
  ;

Space
  :  (' ' | '\t' | '\r' | '\n') {skip();}
  ;

fragment Digit
  :  '0'..'9'
  ;

which will parse the input:

Word can contain number like Var1234 but not start with a number = Same as above but 1234 is allowed

as follows:

[image of the resulting parse not reproduced]
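
One optional refinement, not part of the grammar above: as written, the assignment rule may succeed after consuming only a prefix of the token stream. A common way to make the parser account for every token is to end the entry rule with EOF. A minimal sketch in the same ANTLR 3 syntax; the rule name parse is made up for illustration:

```antlr
// Hypothetical entry rule; 'parse' does not appear in the original grammar.
// EOF forces the parser to reach the end of the input, so leftover tokens
// are reported as errors instead of being silently ignored.
parse
  :  assignment EOF
  ;
```

You would then start parsing from parse instead of assignment; the remaining rules stay unchanged.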

EDIT

To keep the lexer rules nicely packed together, I'd keep them all at the bottom of the grammar instead of partly in the tokens { ... } block, which I only use for defining "imaginary tokens" (used in AST creation). The order of those lexer rules matters, though:

// wrong!
Special      : (TCUnderscore | TCQuote);
TCUnderscore : '_';
TCQuote      : '"';

Now, with the rules above, TCUnderscore and TCQuote can never become a token because when the lexer stumbles upon a _ or ", a Special token is created. Or in this case:

// wrong!
TCUnderscore : '_';
TCQuote      : '"';
Special      : (TCUnderscore | TCQuote);

the Special token can never be created because the lexer would first create TCUnderscore and TCQuote tokens. Hence the error:

The following token definitions can never be matched because prior tokens match the same input: ...

If you make TCUnderscore and TCQuote fragment rules, you don't have that problem because fragment rules only "serve" other lexer rules. So this works:

// good!
Special               : (TCUnderscore | TCQuote);
fragment TCUnderscore : '_';
fragment TCQuote      : '"';

Also, fragment rules can therefore never be "visible" in any of your parser rules (the lexer will never create a TCUnderscore or TCQuote token!).

// wrong!
parse : TCUnderscore;

Special               : (TCUnderscore | TCQuote);
fragment TCUnderscore : '_';
fragment TCQuote      : '"';
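
Applied to the Word rule from the question, the fragment approach might look like the sketch below. It assumes that _ and " are only ever needed inside words, so Special itself can be a fragment. The grammar name SentenceSpecial and the helper rule Letter are my own additions for illustration.

```antlr
grammar SentenceSpecial;

assignment
  :  id1 '=' id2
  ;

id1
  :  Word+
  ;

id2
  :  (Word | Int)+
  ;

Int
  :  Digit+
  ;

// A word must start with a letter or a special character, never a digit
Word
  :  (Letter | Special) (Letter | Special | Digit)*
  ;

Space
  :  (' ' | '\t' | '\r' | '\n') {skip();}
  ;

// Fragments never become tokens on their own; they only help build Int and Word
fragment Special : '_' | '"' ;
fragment Letter  : 'a'..'z' | 'A'..'Z' ;
fragment Digit   : '0'..'9' ;
```

Because Special, Letter and Digit are all fragments, none of them competes with Word or Int for the same input, so the "can never be matched" error should not appear. An input such as my_var = "x" 12 would be expected to tokenize as Word, '=', Word, Int.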
