0 votes

As mclapply() is not available on Windows, I am using library(parallel) to distribute NER of a large corpus. I am using this good tutorial by lmullen as a starting point.

First, I define a helper function that runs the annotation pipeline over a single document:

annotate_entities <- function(doc, annotation_pipeline) {
    # Run the annotators in sequence, then wrap the document with its annotations
    annotations <- annotate(doc, annotation_pipeline)
    AnnotatedPlainTextDocument(doc, annotations)
}

Then, the annotation pipeline:

itinerants_pipeline <- list(
  Maxent_Sent_Token_Annotator(),
  Maxent_Word_Token_Annotator(),
  Maxent_Entity_Annotator(kind = "person"),
  Maxent_Entity_Annotator(kind = "location")
)

The serial version of the corpus processing works, i.e.,

corpus_serial_NER <- lapply(corpus[1:100]$content, annotate_entities, itinerants_pipeline)

When I try to parallelize, however, I run into trouble:

library(parallel)

cl <- makePSOCKcluster(8)
setDefaultCluster(cl)

clusterExport(cl, c('annotate_entities',
                    'annotate',
                    'AnnotatedPlainTextDocument',
                    'Maxent_Sent_Token_Annotator',
                    'Maxent_Word_Token_Annotator',
                    'Maxent_Entity_Annotator'))

corpus_NER <- parLapply(cl, corpus[1:100]$content, function(k) {
  # Re-create the pipeline inside the worker function
  itinerants_pipeline <- list(
    Maxent_Sent_Token_Annotator(),
    Maxent_Word_Token_Annotator(),
    Maxent_Entity_Annotator(kind = "person"),
    Maxent_Entity_Annotator(kind = "location"))

  annotate_entities(k, itinerants_pipeline)
})

If I export only the annotator functions above, without re-creating the pipeline inside the worker function, the workers report them as missing. Searching around gave me the impression that this is because the references to the underlying Java objects are "cut" when the objects are serialized to the workers, and I suspect that my handling of this is what is causing the grief.
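To illustrate what I mean (a minimal sketch using rJava directly rather than the openNLP annotators, just to show the mechanism as I understand it): a Java-backed R object only wraps an external pointer into the master's JVM, so exporting it to a PSOCK worker copies the wrapper but not the Java object behind it.

# Sketch only: why Java-backed objects cannot simply be exported to PSOCK workers
library(parallel)
library(rJava)
.jinit()

s <- .jnew("java/lang/String", "hello")   # object lives in the master's JVM

cl <- makePSOCKcluster(1)
clusterEvalQ(cl, { library(rJava); .jinit() })
clusterExport(cl, "s")                    # only the R wrapper is serialized

# The worker's JVM never created this object, so using the reference there
# should fail rather than return the string length
try(clusterEvalQ(cl, .jcall(s, "I", "length")))

stopCluster(cl)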

For trivially small corpora (about 10 documents) this works, but above roughly 50 documents it blows up with the following message:

Error in checkForRemoteErrors(val) : 
  7 nodes produced errors; first error: java.lang.OutOfMemoryError: GC overhead limit exceeded
In addition: Warning messages:
1: In .Internal(gc(verbose, reset)) :
  closing unused connection 14 (<-NO92W:11748)
2: In .Internal(gc(verbose, reset)) : (.................)

I have read that this error message is from Java, and that it pertains to excessive garbage collection. However, I do not understand what causes this to happen within my parallel code (and not when I run it serially).

I would like to know what causes this behavior, but I am also interested in workarounds. It is not clear to me what the best way of doing a parallel lapply on R/Windows is; I used this approach because I was able to get it to work with other functions (ones that did not rely on Java).
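One workaround I have been considering, but have not tested on this corpus (so treat it as a sketch): each PSOCK worker starts its own JVM with the default heap size, so it may help to raise the heap on every worker before rJava is initialised, e.g.:

cl <- makePSOCKcluster(8)
clusterEvalQ(cl, {
  # java.parameters must be set before the JVM is started by library(rJava)/openNLP
  options(java.parameters = "-Xmx2g")
  library(NLP)
  library(openNLP)
})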

1
Shouldn't you also be loading the extra packages on the workers (through clusterEvalQ)? – Roman Luštrik
May well be. I will look into that angle. – user1603472
Is your corpus publicly available? What size is it? Which packages do you use? – F. Privé
No, not a public corpus. However, @RomanLuštrik's excellent tip was the solution; I will answer myself with a working solution. So it was likely a technical issue that could be reproduced with any corpus. – user1603472

1 Answer

1 vote

I am posting an answer because it provides a working solution to the exact problem above, thanks to Roman Luštrik's comment. Loading the packages on the workers and moving the creation of the pipeline out of the worker function resolved the issue. This is the working code:

cl <- makePSOCKcluster(7)
setDefaultCluster(cl)

# Load the NLP packages on every worker
clusterEvalQ(cl, library(NLP))
clusterEvalQ(cl, library(openNLP))
clusterEvalQ(cl, library(RWeka))
clusterEvalQ(cl, library(openNLPmodels.en))

# Create the pipeline on each worker, so the Java-backed annotators
# live in that worker's own JVM
clusterEvalQ(cl, itinerants_pipeline <- list(
    Maxent_Sent_Token_Annotator(),
    Maxent_Word_Token_Annotator(),
    Maxent_Entity_Annotator(kind = "person"),
    Maxent_Entity_Annotator(kind = "location")))

clusterExport(cl, c('annotate_entities'))

system.time(corpus_par_NER <- parLapply(cl,corpus[1:5000]$content, function(k) {
    annotate_entities(k, itinerants_pipeline)
}))

stopCluster(cl)

Crucially, the pipeline has to be created on the workers in this way. Exporting it via clusterExport (in the same list as 'annotate_entities') did not work.
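As a quick sanity check (a small sketch, not part of the original solution), you can confirm that each worker actually has the pipeline before launching the full job:

clusterEvalQ(cl, exists("itinerants_pipeline"))   # should return TRUE on every worker
clusterEvalQ(cl, length(itinerants_pipeline))     # 4 annotators per worker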