You normally parse text with parsers such as Lark (Python) or DCGs (Prolog).
What I have instead is a tagged data string: in a sense it is still a sequence of words, but in reality it is a list of tuples.
Example:
word1 word2 ... ==process==> [(word1, tag1), (word2, tag5), (w3, t3), ...] ==parser==> struct[ predicate1(word2, word5), p2(w2, w1), ... ]
I want to parse it in a similar fashion to a normal string, i.e. with a grammar, where matching is based mostly on the token properties and only sometimes on the word itself.
As a result I'm building a list of tuples for now; in the future it may become a list of trees/ASTs. Currently I parse only single sentences, one by one.
Here is what the tokenized structure looks like (I pass the text through spaCy):
You will have to join us before the match starts.
[('You', 'PRON', 'nsubj'), ('will', 'VERB', 'aux'), ('have', 'AUX', 'ROOT'), ('to', 'PART', 'aux'), ('join', 'VERB', 'xcomp'), ('us', 'PRON', 'dobj'), ('before', 'ADP', 'mark'), ('the', 'DET', 'det'), ('match', 'NOUN', 'nsubj'), ('starts', 'VERB', 'advcl'), ('.', 'PUNCT', 'punct')]
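To make the grammar rules read naturally (matching on token properties rather than tuple indices), one option is to wrap each tuple in a namedtuple first. A minimal sketch; the `Token` name and its fields are my own, not spaCy's:

```python
from collections import namedtuple

# hypothetical wrapper so a rule can say t.dep instead of t[2]
Token = namedtuple("Token", ["word", "pos", "dep"])

raw = [('You', 'PRON', 'nsubj'), ('will', 'VERB', 'aux'), ('have', 'AUX', 'ROOT'),
       ('to', 'PART', 'aux'), ('join', 'VERB', 'xcomp'), ('us', 'PRON', 'dobj'),
       ('before', 'ADP', 'mark'), ('the', 'DET', 'det'), ('match', 'NOUN', 'nsubj'),
       ('starts', 'VERB', 'advcl'), ('.', 'PUNCT', 'punct')]

tokens = [Token(*t) for t in raw]

# matching on a token property rather than the surface word
subjects = [t.word for t in tokens if t.dep == 'nsubj']
print(subjects)  # ['You', 'match']
```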
I may add more tags in the future; that will be part of the pre-parsing stage.
Any solution for Python, or possibly for Prolog, is welcome. Currently I've selected Python, but I'm still experimenting.
Example grammar (pseudo code):
Sentence : ....
SVO : token.nsubj token*? token.root token*? token.pobj { sent += svo(root, nsubj, pobj) }
adj : token.adj { sent += adj(word) }
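The two rules above can be approximated with a plain scan over the tuples. Only a sketch: I use `dobj` instead of `pobj` because the sample sentence has no `pobj`, and the `svo`/`adj` "predicates" are represented as tagged tuples here:

```python
def parse_sentence(tagged):
    """Tiny hand-rolled matcher over (word, pos, dep) tuples.

    SVO : nsubj ... ROOT ... dobj  -> ('svo', root, subj, obj)
    adj : any token with pos ADJ   -> ('adj', word)
    """
    sent = []
    subj = root = None
    for word, pos, dep in tagged:
        if pos == 'ADJ':                      # adj rule
            sent.append(('adj', word))
        if dep == 'nsubj' and subj is None:   # remember the first subject
            subj = word
        elif dep == 'ROOT' and subj is not None:
            root = word
        elif dep == 'dobj' and root is not None:
            sent.append(('svo', root, subj, word))
            subj = root = None                # reset for further clauses
    return sent

tagged = [('You', 'PRON', 'nsubj'), ('will', 'VERB', 'aux'), ('have', 'AUX', 'ROOT'),
          ('to', 'PART', 'aux'), ('join', 'VERB', 'xcomp'), ('us', 'PRON', 'dobj'),
          ('before', 'ADP', 'mark'), ('the', 'DET', 'det'), ('match', 'NOUN', 'nsubj'),
          ('starts', 'VERB', 'advcl'), ('.', 'PUNCT', 'punct')]

print(parse_sentence(tagged))  # [('svo', 'have', 'You', 'us')]
```

For more than a handful of rules, a real token-stream parser will scale better than hand-rolled scans; as far as I know, both Lark (with a custom lexer feeding your tag symbols as terminals) and spaCy's own `Matcher` (which can match on `DEP`/`POS` attributes) support this kind of property-based matching.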