1 vote

How do I avoid CoRB timeouts when running a large batch job pulling over 10 million PDF/XML documents? Do I need to reduce the thread count and batch size?

uris-module:

let $uris := cts:uris(
    (),
    (),
    cts:and-query((
        cts:collection-query("/sites"),
        cts:field-range-query("cdate", "<", "2019-10-01"),
        cts:not-query(
            cts:or-query((
                cts:field-word-query("dcax", "200")
                (: more code... :)
            ))
        )
    ))
)
return (fn:count($uris), $uris)

process.xqy:

declare variable $URI as xs:string external;

(: CoRB passes a delimited batch of URIs when BATCH-SIZE > 1 :)
let $uris := fn:tokenize($URI, ";")
let $outputJson := "/output/json/"
let $outputPdf := "/output/pdf/"
for $uri1 in $uris
let $accStr := fn:substring-before(fn:substring-after($uri1, "/sites/"), ".xml")
let $pdfUri := fn:concat("/pdf/iadb/", $accStr, ".pdf")
let $doc := fn:doc($uri1)
let $obj := json:object()
let $_ := map:put($obj, "PaginationOrMediaCount", fn:number($doc/rec/MediaCount))
let $_ := map:put($obj, "Abstract", fn:replace($doc/rec/Abstract/text(), "[^a-zA-Z0-9 ,.\-\r\n]", ""))
let $_ := map:put($obj, "Descriptors", json:to-array($doc/rec/Descriptor/text()))
let $_ := map:put($obj, "FullText", fn:replace($doc/rec/FullText/text(), "[^a-zA-Z0-9 ,.\-\r\n]", ""))
let $_ := xdmp:save(
    fn:concat($outputJson, $accStr, ".json"),
    xdmp:to-json($obj)
)
let $_ := if (fn:doc-available($pdfUri))
    then xdmp:save(
        fn:concat($outputPdf, $accStr, ".pdf"),
        fn:doc($pdfUri)
    )
    else ()
return $URI
It would be helpful if you provided more details. For instance, the code that is being executed and an indication of whether it is the URIs module execution that is timing out or the transform module executions that are timing out. - Mads Hansen
It is timing out at the transform module executions. I have also added the uris-module and process-module code. - thichxai
How large is your BATCH-SIZE? Have you tried reducing it, to do less per invocation of the process module? You may be trying to do too much in one module invocation. Also, verify that you can run xdmp:save() to that path and it isn't having trouble writing to the filesystem. - Mads Hansen
THREAD-COUNT=16 BATCH-SIZE=10000 - thichxai
Yeah, processing 10k docs at once is a lot. You are using a batch tool to spread the load. I would reduce the batch size (even just use a batch size of 1), and instead look to tune throughput by adjusting the thread count. You are only going to be able to write one file at a time in the process module, so you will get more concurrency by spreading out with more threads instead of larger batch sizes. - Mads Hansen
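Following the suggestion in the comments above, a CoRB options file for this job might look like the sketch below. The connection URI and module paths are placeholders for your environment; the key change is BATCH-SIZE=1 so that each process-module invocation handles a single URI, with THREAD-COUNT providing the concurrency:

```properties
# corb.properties -- sketch; adjust XCC-CONNECTION-URI and module paths for your environment
XCC-CONNECTION-URI=xcc://user:password@localhost:8000
URIS-MODULE=uris.xqy|ADHOC
PROCESS-MODULE=process.xqy|ADHOC
# process one document per invocation; tune throughput with threads instead
BATCH-SIZE=1
THREAD-COUNT=16
```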

1 Answer

0 votes

It would be easier to diagnose and suggest improvements if you shared the CoRB job options and the code for your URIS-MODULE and PROCESS-MODULE.

The general concept of a CoRB job is that it splits up the work into multiple module executions rather than trying to do all of the work in a single execution, in order to avoid timeout issues and excessive memory consumption.

For instance, if you wanted to download 10 million documents, the URIS-MODULE would select the URIs of all of those documents, and then each URI would be sent to the PROCESS-MODULE, which would be responsible for retrieving it. Depending upon the THREAD-COUNT, you could be downloading several documents at a time but they should all be returning very quickly.

Is the execution of the URIs module what is timing out, or the process module?

You can increase the timeout limit from the default limit up to the maximum timeout limit by using: xdmp:request-set-time-limit()
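For example, at the top of the process module (the 900-second value here is purely illustrative; the limit you set cannot exceed the App Server's configured maximum time limit):

```xquery
(: raise the timeout for this request; capped at the App Server's max time limit :)
xdmp:request-set-time-limit(900)
```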

Generally, the process modules should execute quickly and shouldn't be timing out. One possible reason would be performing too much work in the transform (i.e. setting BATCH-SIZE really large and doing too much at once) or maybe a misconfiguration or poorly written query (i.e. rather than fetching a single doc with the $URI value, performing a search and retrieving all of the docs each time that the process module is executed).
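To make that last point concrete, the process module should address the document directly by its URI rather than re-running a search on every invocation. A hypothetical sketch:

```xquery
declare variable $URI as xs:string external;

(: good: a direct lookup by URI retrieves exactly one doc and returns quickly.
   bad:  re-running a search each time, e.g.
         cts:search(fn:doc(), cts:collection-query("/sites"))
         which would retrieve all matching docs on every invocation :)
fn:doc($URI)
```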