How to use dependency parsing features for text classification?

Question

I ran dependency parsing on a sentence using spaCy and obtained the syntactic dependency tags.

import spacy
# The 'en' shortcut was removed in spaCy v3; load the model package by name.
nlp = spacy.load('en_core_web_sm')
doc = nlp('Wall Street Journal just published an interesting piece on crypto currencies')

for token in doc:
    print("{0}/{1} <--{2}-- {3}/{4}".format(
        token.text, token.tag_, token.dep_, token.head.text, token.head.tag_))

Output

Wall/NNP <--compound-- Street/NNP
Street/NNP <--compound-- Journal/NNP
Journal/NNP <--nsubj-- published/VBD
just/RB <--advmod-- published/VBD
published/VBD <--ROOT-- published/VBD
an/DT <--det-- piece/NN
interesting/JJ <--amod-- piece/NN
piece/NN <--dobj-- published/VBD
on/IN <--prep-- piece/NN
crypto/JJ <--compound-- currencies/NNS
currencies/NNS <--pobj-- on/IN

I don't understand how to use this information to generate dependency-based features for text classification. What are the possible ways to generate features from this output?

Thanks in advance.

Sofie VL · Accepted Answer · 2020-03-30T09:38:29

In spaCy, there is currently no direct way to feed dependency features into the textcat component, unless you hack your way through the internals of the code.

In general, you'll have to think about what kind of features would give useful clues to your textcat algorithm. You could generate one feature for every possible "dependency path" in your data, such as "RB --advmod-- VBD", and then count how many times each one occurs in a text, but the number of distinct paths grows quickly and you'll end up with a very sparse dataset.
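As a minimal sketch of the "dependency path" counting idea, here is one way it could look. The helper name dependency_path_counts and the TRIPLES list are illustrative assumptions, not part of spaCy; the triples are copied from the parser output in the question so that the snippet runs without a model installed.

```python
from collections import Counter

# (tag, dep, head_tag) triples transcribed from the question's output.
# With a spaCy Doc they could be built as:
#   [(t.tag_, t.dep_, t.head.tag_) for t in doc]
TRIPLES = [
    ("NNP", "compound", "NNP"),
    ("NNP", "compound", "NNP"),
    ("NNP", "nsubj", "VBD"),
    ("RB", "advmod", "VBD"),
    ("VBD", "ROOT", "VBD"),
    ("DT", "det", "NN"),
    ("JJ", "amod", "NN"),
    ("NN", "dobj", "VBD"),
    ("IN", "prep", "NN"),
    ("JJ", "compound", "NNS"),
    ("NNS", "pobj", "IN"),
]


def dependency_path_counts(triples):
    """Count each 'TAG --dep-- HEAD_TAG' arc (hypothetical helper).

    The ROOT arc is skipped because the root token is its own head,
    so it carries no child-to-head relation.
    """
    return Counter(
        f"{tag} --{dep}-- {head_tag}"
        for tag, dep, head_tag in triples
        if dep != "ROOT"
    )


features = dependency_path_counts(TRIPLES)
print(features)
```

One such dictionary per document could then be turned into a sparse matrix, for example with scikit-learn's DictVectorizer, and passed to a classifier outside spaCy. Whether these counts help at all depends on the task.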

You may also be interested in other features, such as "what POS is the ROOT word?" or whether the sentence includes patterns like "two nouns connected by a verb". But it really depends on the application.
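The two sentence-level features mentioned above could be sketched as follows. The function names root_tag and has_noun_verb_noun are hypothetical, and "two nouns connected by a verb" is interpreted here, as an assumption, as a verb with both a noun subject (nsubj) and a noun direct object (dobj). The rows are transcribed from the question's output.

```python
# (text, tag, dep, head_index) rows transcribed from the question's output.
# With a spaCy Doc they could be built as:
#   [(t.text, t.tag_, t.dep_, t.head.i) for t in doc]
TOKENS = [
    ("Wall", "NNP", "compound", 1),
    ("Street", "NNP", "compound", 2),
    ("Journal", "NNP", "nsubj", 4),
    ("just", "RB", "advmod", 4),
    ("published", "VBD", "ROOT", 4),
    ("an", "DT", "det", 7),
    ("interesting", "JJ", "amod", 7),
    ("piece", "NN", "dobj", 4),
    ("on", "IN", "prep", 7),
    ("crypto", "JJ", "compound", 10),
    ("currencies", "NNS", "pobj", 8),
]


def root_tag(tokens):
    """Return the fine-grained POS tag of the ROOT token, or None."""
    for _, tag, dep, _ in tokens:
        if dep == "ROOT":
            return tag
    return None


def has_noun_verb_noun(tokens):
    """True if some verb has both a noun subject and a noun direct object."""
    subject_heads, object_heads = set(), set()
    for _, tag, dep, head in tokens:
        # Only noun tokens (NN, NNS, NNP, NNPS) attached to a verb count.
        if not tag.startswith("NN") or not tokens[head][1].startswith("VB"):
            continue
        if dep == "nsubj":
            subject_heads.add(head)
        elif dep == "dobj":
            object_heads.add(head)
    return bool(subject_heads & object_heads)


sentence_features = {
    f"root_tag={root_tag(TOKENS)}": 1,
    "noun_verb_noun": int(has_noun_verb_noun(TOKENS)),
}
print(sentence_features)
```

For the example sentence, "Journal" and "piece" both attach to "published", so the pattern fires. Features like these are far less sparse than raw path counts, but they have to be designed by hand for the application.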
