As mclapply() is not available on Windows, I am using library(parallel) with a PSOCK cluster to distribute NER over a large corpus. I am using this good tutorial by lmullen as a starting point.
First, I define a helper function:
library(NLP)
library(openNLP)

# annotate a document and wrap the result in an AnnotatedPlainTextDocument
annotate_entities <- function(doc, annotation_pipeline) {
  annotations <- annotate(doc, annotation_pipeline)
  AnnotatedPlainTextDocument(doc, annotations)
}
Then, the annotation pipeline:
itinerants_pipeline <- list(
  Maxent_Sent_Token_Annotator(),
  Maxent_Word_Token_Annotator(),
  Maxent_Entity_Annotator(kind = "person"),
  Maxent_Entity_Annotator(kind = "location")
)
The serial version of the corpus processing works, i.e.,
corpus_serial_NER <- lapply(corpus[1:100]$content, annotate_entities, itinerants_pipeline)
When I try to parallelize, however, I run into trouble:
library(parallel)
cl <- makePSOCKcluster(8)
setDefaultCluster(cl)
clusterExport(cl, c('annotate_entities',
                    'annotate',
                    'AnnotatedPlainTextDocument',
                    'Maxent_Sent_Token_Annotator',
                    'Maxent_Word_Token_Annotator',
                    'Maxent_Entity_Annotator'))
corpus_NER <- parLapply(cl, corpus[1:100]$content, function(k) {
  # re-create the Java-backed annotators on each worker
  itinerants_pipeline <- list(
    Maxent_Sent_Token_Annotator(),
    Maxent_Word_Token_Annotator(),
    Maxent_Entity_Annotator(kind = "person"),
    Maxent_Entity_Annotator(kind = "location")
  )
  annotate_entities(k, itinerants_pipeline)
})
If I only export the functions listed above, without re-creating the pipeline inside the worker function, the workers report the annotators as missing. Searching around gave me the impression that this happens because the references to the underlying Java objects are "cut" when the objects are serialized to the workers. I suspect my handling of this is what is causing the grief.
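As an aside: if the Java references really are lost in serialization, I assume the cleaner pattern is to build the annotators once on each worker with clusterEvalQ, rather than rebuilding them on every call. A sketch of what I mean (untested; it reuses cl from above):
clusterEvalQ(cl, {
  library(NLP)
  library(openNLP)
  # each worker constructs its own live Java-backed annotators
  itinerants_pipeline <- list(
    Maxent_Sent_Token_Annotator(),
    Maxent_Word_Token_Annotator(),
    Maxent_Entity_Annotator(kind = "person"),
    Maxent_Entity_Annotator(kind = "location")
  )
  NULL  # return NULL so the annotators are not serialized back to the master
})
clusterExport(cl, "annotate_entities")
corpus_NER <- parLapply(cl, corpus[1:100]$content, function(k) {
  annotate_entities(k, itinerants_pipeline)  # found in the worker's global env
})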
The parLapply version above works for trivially small corpora (10 documents), but above 50 documents it blows up with the following message:
Error in checkForRemoteErrors(val) :
  7 nodes produced errors; first error: java.lang.OutOfMemoryError: GC overhead limit exceeded
In addition: Warning messages:
1: In .Internal(gc(verbose, reset)) :
  closing unused connection 14 (<-NO92W:11748)
2: In .Internal(gc(verbose, reset)) : (.................)
I have read that this error message is from Java, and that it pertains to excessive garbage collection. However, I do not understand what causes this to happen within my parallel code (and not when I run it serially).
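My guess (possibly wrong) is that each of the eight PSOCK workers starts its own JVM with the default heap size, so they all compete for RAM and each one hits the GC overhead limit far sooner than the single JVM in the serial run. If that is right, giving each worker's JVM a bigger heap should help; the option has to be set before rJava starts the JVM, i.e. before NLP/openNLP are loaded on that worker. A sketch (the "-Xmx1g" value is a guess):
cl <- makePSOCKcluster(4)  # fewer workers also means fewer competing JVMs
clusterEvalQ(cl, {
  options(java.parameters = "-Xmx1g")  # must precede loading rJava/openNLP
  library(NLP)
  library(openNLP)
})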
I would like to know what causes this behavior, but I am also interested in workarounds. It is not clear to me what the best way of doing a parallel lapply on R/Windows is; I used this approach because I was able to get it to work with other functions (ones that did not call into Java).
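For completeness, the other workaround I am considering is feeding the corpus to the cluster in smaller chunks and forcing garbage collection on the workers between chunks, so that Java references released on the R side can be reclaimed by the JVM. An untested sketch (the chunk size of 25 is arbitrary):
docs <- corpus[1:100]$content
chunks <- split(docs, ceiling(seq_along(docs) / 25))
corpus_NER <- list()
for (ch in chunks) {
  corpus_NER <- c(corpus_NER, parLapply(cl, ch, function(k) {
    annotate_entities(k, itinerants_pipeline)
  }))
  clusterEvalQ(cl, gc())  # run finalizers so rJava can release Java objects
}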
Have you tried setting up the workers with clusterEvalQ()? – Roman Luštrik