3
votes

I'm writing a simple HTML parser using yacc (bison) and flex. How can I write this rule?

column -> <td>text</td>column | NULL

I've tried many forms like this:

COLUMN : L_TAG T_OPEN R_TAG ID L_TAG T_CLOSE R_TAG COLUMN
| 
;

//The tokens are specified in lex.

Unfortunately it doesn't work: bison reports a shift/reduce conflict, whether I put COLUMN at the beginning of the rule or at the end, and whether I write the NULL alternative like this:

{$$ = NULL}

or leave it empty. I need the NULL alternative so that the rule can recurse and match the same tag several times in succession, something like this:

<tr>name</tr><tr>age</tr>

How can I make this work?

2 Answers

2
votes
column       : /* empty */
             | column '<' TD '>' TEXT '<' '/' TD '>'
             ;

You can tighten your rules further by making them more specific. For the recursion, make the rule left-recursive, as above; that is the usual form for a list in an LALR(1) grammar.
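To try the left-recursive rule on its own, here is a minimal sketch of a complete bison file. It rests on several assumptions: the `row` wrapper rule, the cell-counting actions, the hard-coded input string and the hand-written `yylex` are all invented for illustration, and the `yylex` only stands in for your flex scanner.

```yacc
%{
#include <stdio.h>
#include <string.h>

int yylex(void);
void yyerror(const char *msg) { fprintf(stderr, "%s\n", msg); }
%}

%token TD TEXT

%%

/* Hypothetical start rule: report how many cells were matched. */
row     : column                                  { printf("%d cell(s)\n", $1); }
        ;

/* Left-recursive list: zero or more <td>TEXT</td> cells. */
column  : /* empty */                             { $$ = 0; }
        | column '<' TD '>' TEXT '<' '/' TD '>'   { $$ = $1 + 1; }
        ;

%%

/* Assumed sample input; a real parser would read from flex instead. */
static const char *input = "<td>name</td><td>age</td>";

/* Hand-written stand-in for the flex scanner. */
int yylex(void)
{
    static int in_tag = 0;              /* are we between '<' and '>'? */

    if (*input == '\0') return 0;       /* end of input */
    if (*input == '<') { in_tag = 1; return *input++; }
    if (*input == '>') { in_tag = 0; return *input++; }
    if (in_tag && strncmp(input, "td", 2) == 0) { input += 2; return TD; }
    if (in_tag) return *input++;        /* '/' or an unexpected character */

    /* Outside a tag: everything up to the next '<' is one TEXT token. */
    while (*input != '\0' && *input != '<') input++;
    return TEXT;
}

int main(void)
{
    return yyparse();
}
```

Saved as, say, cols.y, this should build with `bison -o cols.c cols.y` followed by `cc cols.c -o cols`. With the sample input above I would expect it to report two cells, and bison should not report any conflicts for this grammar.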

Good luck.

0
votes

You'd normally break it up into a number of pieces:

table : TABLE rows CLOSE_TABLE
      ;

rows: 
    | rows row
    ;

row:  TR row_header cells CLOSE_TR
   ;

row_header: TH TEXT CLOSE_TH
          ;

cells: 
     | cells cell
     ;

cell: TD TEXT CLOSE_TD
    ;

Where TABLE is <table>, CLOSE_TABLE is </table>, and so on.

Oh, just to be clear: I haven't worked very hard to ensure this really parses HTML correctly. In fact, I'm pretty sure it doesn't right now. As one obvious example, I believe a row header should really be optional.
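One way the row header could be made optional, as a sketch only: `opt_row_header` is a name made up here, and the tokens are assumed to be whole tags returned by the scanner, as described above. There is no scanner or `main` in this fragment, so it can only be checked by running bison over it, and I would expect it to go through without conflicts.

```yacc
%token TABLE CLOSE_TABLE TR CLOSE_TR TH CLOSE_TH TD CLOSE_TD TEXT

%%

table : TABLE rows CLOSE_TABLE
      ;

rows : /* empty */
     | rows row
     ;

row : TR opt_row_header cells CLOSE_TR
    ;

/* Hypothetical helper: either nothing or a single header cell. */
opt_row_header : /* empty */
               | TH TEXT CLOSE_TH
               ;

cells : /* empty */
      | cells cell
      ;

cell : TD TEXT CLOSE_TD
     ;
```

After TR the parser only has to look at the next token: TH starts a header, anything else means the header is absent, so one token of lookahead is enough.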