Here is my situation.

I have a class TextProcessor that processes a text. I need to find the coreferences in that text and then extract information with Stanford's OpenIE tool. I use these two pipelines:

"tokenize,ssplit,pos,lemma,ner,parse,mention,coref" for coreferences.

and

"tokenize,ssplit,pos,lemma,depparse,natlog,openie" for Information Extraction.

Running them separately on a single text takes a lot of time, but for the moment I have to, because running them together requires a large amount of memory and the combined pipeline would exceed my memory limits.

import java.util.Properties;

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class TextProcessor {
    Properties props;
    StanfordCoreNLP pipeline;

    public TextProcessor() {
        props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,mention,coref");
        pipeline = new StanfordCoreNLP(props);
    }

    // Performs NER and COREF
    public void process(String text) {
        Annotation document = new Annotation(text);
        pipeline.annotate(document);

        // Process text (tokenization, pos, lemma, ner, coref)....
    }

    // Rebuilds the pipeline with the OpenIE annotators and runs it from scratch
    public void extractInformation(String document) {
        props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,depparse,natlog,openie");
        pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation(document);
        pipeline.annotate(doc);

        // Extract information from doc ...
    }
}

Is there a way to put together the two pipelines dynamically? I mean, something like this:

1) "tokenize,ssplit,pos,lemma,ner,depparse,mention,coref"

2) "tokenize,ssplit,pos,lemma,ner,depparse,mention,coref,natlog,openie".
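
To make the idea concrete: the annotators that list 2) adds on top of list 1) can be worked out with plain string handling. This is only a rough sketch. AnnotatorDiff and remaining are made-up names, not part of CoreNLP, and the helper only compares names; it does not check whether each annotator's requirements are actually satisfied.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper (not CoreNLP API): compares two comma-separated annotator lists.
final class AnnotatorDiff {

    // Returns the annotators listed in `wanted` that do not appear in `alreadyRun`,
    // in the order in which they appear in `wanted`.
    static String remaining(String alreadyRun, String wanted) {
        Set<String> done = new LinkedHashSet<>(Arrays.asList(alreadyRun.trim().split("\\s*,\\s*")));
        Set<String> todo = new LinkedHashSet<>(Arrays.asList(wanted.trim().split("\\s*,\\s*")));
        todo.removeAll(done);
        return String.join(",", todo);
    }

    public static void main(String[] args) {
        String first = "tokenize,ssplit,pos,lemma,ner,depparse,mention,coref";
        String combined = "tokenize,ssplit,pos,lemma,ner,depparse,mention,coref,natlog,openie";
        System.out.println(remaining(first, combined)); // prints natlog,openie
    }
}
```

The result is the string I would like to hand to a second pipeline as its "annotators" property.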

I tried returning an Annotation object from the first method, process(String text), and then running the three remaining annotators on it in extractInformation(Annotation document), like this:

    public Annotation process(String text) {
        Annotation document = new Annotation(text);
        pipeline.annotate(document);

        // Process text (tokenization, pos, lemma, ner, coref)....
        return document;
    }

    public void extractInformation(Annotation document) {
        props.setProperty("annotators", "depparse,natlog,openie");
        pipeline = new StanfordCoreNLP(props);
        pipeline.annotate(document);

        // Extract information from doc ...
    }

But I get this error:

annotator "depparse" requires annotation "TextAnnotation". The usual requirements for this annotator are: tokenize,ssplit,pos.

I thought that adding the three new annotators (depparse, natlog, openie) to an already annotated document (with tokenize, ssplit, pos) would work, but it didn't.

So, is there a way to add those annotators to the existing pipeline without running the whole pipeline again (plus the new annotators), and without exceeding the memory limits?



UPDATE

All I needed to do was this:

    public Annotation process(String text) {
        Annotation document = new Annotation(text);
        pipeline.annotate(document);

        // Process text (tokenization, pos, lemma, ner, coref)....
        StanfordCoreNLP.clearAnnotatorPool(); // <-- Added: to get rid of the models and solve the memory issue
        return document;
    }

    public void extractInformation(Annotation document) {
        props.setProperty("annotators", "natlog,openie");
        props.setProperty("enforceRequirements", "false"); // <-- Added

        pipeline = new StanfordCoreNLP(props);
        pipeline.annotate(document);

        // Extract information from doc ...
    }

Alternatively, instead of setting the enforceRequirements property, you can use:

pipeline = new StanfordCoreNLP(props, false);

in extractInformation(Annotation document).

1 Answer

It sounds like you want to build a first pipeline, run it on a set of documents, clear the memory, and then build a second pipeline and run it on the same set of documents.

If you run the second pipeline on the same set of Annotations, it will just pick up where the first pipeline finished. But you need to set enforceRequirements to false so the second pipeline won't crash. Also, after you are done with the first pipeline, you should call StanfordCoreNLP.clearAnnotatorPool() to get rid of the models, or you won't solve the memory issue.
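
As a rough sketch of the two configurations involved: the class and method names below (StageConfig, firstStage, secondStage) are made up for illustration, and only java.util.Properties is used, so nothing here loads a model. The annotator lists are the ones from the question and its update.

```java
import java.util.Properties;

// Hypothetical helper (not CoreNLP API): builds one Properties object per stage.
final class StageConfig {

    // Stage 1: everything up to coreference resolution.
    static Properties firstStage() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,mention,coref");
        return props;
    }

    // Stage 2: only the annotators that have not run yet.
    // enforceRequirements is switched off because their prerequisites
    // are already present on the Annotation produced by stage 1.
    static Properties secondStage() {
        Properties props = new Properties();
        props.setProperty("annotators", "natlog,openie");
        props.setProperty("enforceRequirements", "false");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(firstStage().getProperty("annotators"));
        System.out.println(secondStage().getProperty("annotators"));
        System.out.println(secondStage().getProperty("enforceRequirements"));
    }
}
```

Each Properties object would then be passed to new StanfordCoreNLP(...), with StanfordCoreNLP.clearAnnotatorPool() called between the two stages. Building a fresh Properties object for the second stage, rather than reusing the first one, is just a design choice to keep leftover settings from stage 1 out of stage 2.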