1 vote

You normally parse text with parsers like Lark (Python) or DCGs (Prolog).

What I have instead is a tagged data string, which in a sense is still a sequence of words, but in reality is a list of tuples.

Example:

word1 word2 ... ==process==> [(word1,tag1), (word2, tag5), (w3,t3) ...] ==parser==> struct[ predicate1(word2,word5), p2(w2,w1), ...]

I want to parse it in a similar fashion to a normal string, i.e. using a grammar and so on, where matching is based mostly on the tokens' properties and sometimes on the word itself.

As a result I'm building, for now, a list of tuples; in the future it may be a list of trees/ASTs. Currently I'm parsing only single sentences, one by one.

Here is what the tokenized structure looks like (I pass it through spaCy):

You will have to join us before the match starts.

[('You', 'PRON', 'nsubj'), ('will', 'VERB', 'aux'), ('have', 'AUX', 'ROOT'), ('to', 'PART', 'aux'), ('join', 'VERB', 'xcomp'), ('us', 'PRON', 'dobj'), ('before', 'ADP', 'mark'), ('the', 'DET', 'det'), ('match', 'NOUN', 'nsubj'), ('starts', 'VERB', 'advcl'), ('.', 'PUNCT', 'punct')]
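
For reference, the tuples are produced roughly like this (en_core_web_sm is just the small English model I happen to use; the exact tags can differ between model versions):

import spacy

# any spaCy pipeline with a tagger and a dependency parser will do
nlp = spacy.load("en_core_web_sm")

doc = nlp("You will have to join us before the match starts.")
tokens = [(t.text, t.pos_, t.dep_) for t in doc]
print(tokens)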

I may add more tags in the future; that will be part of the pre-parsing stage.

Any solution for Python, or alternatively for Prolog, is welcome. Currently I've selected Python, but I'm still experimenting.

Example grammar (pseudocode):

Sentence : ....
SVO : token.nsubj token*? token.root token*? token.pobj { sent += svo(root, nsubj, pobj)  }
adj : token.adj { sent += adj(word) }
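
To make it concrete, here is a rough Python sketch of the kind of property-based matching I have in mind (the helper names token_matches, find and svo are just illustrative, not from any library):

WILD = None  # wildcard: matches any value in that position

def token_matches(tok, pat):
    # a (word, pos, dep) token matches a pattern triple if every
    # non-wildcard field is equal
    return all(p is WILD or p == t for t, p in zip(tok, pat))

def find(tokens, pattern):
    # return the word of the first token matching the pattern, or None
    for tok in tokens:
        if token_matches(tok, pattern):
            return tok[0]
    return None

def svo(tokens):
    # rough counterpart of the SVO rule above: pick out nsubj, ROOT and dobj/pobj
    subj = find(tokens, (WILD, WILD, 'nsubj'))
    root = find(tokens, (WILD, WILD, 'ROOT'))
    obj = find(tokens, (WILD, WILD, 'dobj')) or find(tokens, (WILD, WILD, 'pobj'))
    if subj and root and obj:
        return ('svo', root, subj, obj)
    return None

tokens = [('You', 'PRON', 'nsubj'), ('will', 'VERB', 'aux'), ('have', 'AUX', 'ROOT'),
          ('to', 'PART', 'aux'), ('join', 'VERB', 'xcomp'), ('us', 'PRON', 'dobj'),
          ('before', 'ADP', 'mark'), ('the', 'DET', 'det'), ('match', 'NOUN', 'nsubj'),
          ('starts', 'VERB', 'advcl'), ('.', 'PUNCT', 'punct')]

print(svo(tokens))  # ('svo', 'have', 'You', 'us')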
I don't understand your question. – furas
Instead of parsing a pure string, I'm trying to parse a string tokenized into tuples/properties. Standard grammars can't do that, because they expect string input, not a list of tuples. – sten
I don't understand why you would parse a list of tuples. If you have a list, then use it as a list. And if you have this list as a string, then simply convert it to a list with eval() or json.loads(). – furas
Because the PARSING depends on information which is not available in the pure string; only after pre-processing can you generate this info, like lexical category, position, order, semantic information, etc. – sten
Having parsed with parsers that use a lexer to tokenize, and also having parsed with DCGs going directly from a character stream to an AST, I can tell you that trying to understand one and using it to learn the other is not a wise decision. Yes, there are some commonalities, but once you learn to do a mental context switch before doing one or the other, life is better. In other words, if you are doing tokenization with a lexer, then think and say only that; when you are parsing with DCGs and going straight from a character stream to an AST, then think and say only that. Don't try to draw analogies. – Guy Coder

1 Answer

3 votes

In a DCG, the list elements are generic Prolog terms, so you can express the pattern matching in the most natural way, using the anonymous variable _ wherever you don't care about the actual value:


% SVO : token.nsubj token*? token.root token*? token.pobj { sent += svo(root, nsubj, pobj)  }
% note: each bare _ matches exactly one token, not a sequence like token*?
svo(svo(Y,X,Z)) -->
   [(X,'PRON',nsubj),_,(Y,_,'ROOT'),_,(Z,_,pobj)].

% adj : token.adj { sent += adj(word) }
% 'ADJ' is a POS tag, so it sits in the second position of the tuple
adj(adj(X)) -->
   [(X,'ADJ',_)].

Sorry, I don't know the spaCy format, so please take my guesses above as most probably wrong...

About token formation: IMHO it can be easier to handle the tokenization directly in the DCG, without relying on a lexer, provided of course that the file size is 'reasonable'. I did this to parse some MySQL backups (plain SQL, some 10~30 MB) and it worked well in SWI-Prolog.