I plan to include text metadata (like bold, font-size, etc.) in the process of parsing to achieve better recognition.
For instance, I have a given structure, where a word on its own line word/r/n
which is bold and sized 24px, is the title for some article. In order to get better recognition results, I want to take the characters as well as the metadata in account. In terms of ANTRL I'm not sure how this could be done best. I'd like to do something like:
- Wrap each character of the original text into a custom object with fields for the metadata and pass that to ANTLR.
- Preprocess the text and insert at specific places annotations for the metadata which is considered by the grammer.
I really like to take option 1. but I'm not sure which part from ANTLR I need to subclass etc. Do I have to start at the ANTLRInputStream-Object
, in order to get a proper stream for a subclassed Lexer to get custom Tokens for a subclassed Parser etc. Is there a more elegant way, especially in querying the tokens while parsing with actions in a {}
block ?
If anyone has some hints and/or experiences this would be great!
EDIT:
Here is a more specific simple example: I have a file wich includes the encoding of metadata which I parse forehand. the actual text including newline look like the following:
entryOne
Here is some content one.
entryTwo
Here is some content two.
Where the titlesentryOne
and entryTwo
are originally font-size of 24px and the content is font-size of 12px (as exemplary given values). Char by char I create a new instance of a custom object encapsulating the character as String and the font-size.
I initialize respective objects for each of the characters with fields of the font-size, e.g for the first letter of entryOne
like
MyChar aTitelChar = new MyChar("e", 24);
For the content, like the second line Here is some content one.
I create instances of MyChar like:
MyChar aContentChar= new MyChar("H", 12);
All characters of the texts are wrapped in instances of the below MyChar
-Class and added to a List<MyChar>
in order to produce a new input for ANTLR.
below is the Java Class for the characters:
public class MyChar {
private int fontSizePx;
private String text;
public MyChar(String text, int fontSizePx) {
this.text = text;
this.fontSizePx = fontSizePx;
}
public int getFontSizePx() {
return fontSizePx;
}
public String getText() {
return text;
}
}
I want that my grammar matches the above two entries (or more formatted this way) which in turn consist each of a title and a content which is terminated with a fullstop. This grammar could look like this:
rule: entry+ NEWLINE
;
entry:
title
content
;
title:
letters NEWLINE
;
content:
(letters)+ '.' NEWLINE
;
letters:
LETTERS
;
LETTERS:
('a'..'z' | 'A'..'Z')+
;
WS:
(' ' | '\t' | 'f' ) + {$channel = HIDDEN;};
NEWLINE:'\r'? '\n';
Now, for instance, what I want to do is to find out if it's really a title of an entry by checking the font-size of all letters encompassing the title-token before titel
-rule returns. In case the input conforms to the grammar but is actually some kind of mistake (the original metadata-encoded file starts with something that conforms to the title
-rule but its actually the content) the author of the grammar could sort that out if he knows that the original font-size for titles is 24 and check this. If one of the letter-tokens doesn't equal to font-size 24 throw an exception/don't return/do smthg. appropriate.
The thing I'm pondering on is where to plug in the List<MyChar>
to provide this functionality (to query kinds of metadata while parsing in context of ANTLR). I'm experimenting with ANTLR's Classes but as I'm new to ANTLR I thought probably some of the experienced users can point me in the right direction, like where would be a good insertion points for custom objects? should I start by implenting CharStream
and override some methods? Probably there is something which ANTLR provides which I haven't found yet?
List<MyChar>
variable that you mentioned? – user1201210List<MyChar>
holds objects with a field for plain text and one for font-size. The input-text example above is just for the grammar. I want to use the ANTLR_lexer_/_parser like I'd pass a String-Instance to the defaultANTLRStringStream(String input)
but something likenew MyANTLRStream(List<MyChar>input)
(MyANTLRStream
extends`ANTLRStringStream') - but here I'm not sure if I'm thinking correctly. I just want to reference fields of MyChar instances in blocks in the grammar like the value of a rule but access the Mychar-Instances of the respective Token as well. – turricanList<MyChar>
in a customANTLRStringStream
, it would be easier to pass it to the parser itself. Will there be oneMyChar
for every character of input that ANTLR reads? – user1201210