
I am working on an application where up to 1 million documents need to be fetched within 2 hours.

We are using the Java Client API and a structured query to perform the search. However, the fetch is still slow.

The code is as follows:

def fetchPostMessages(dbParam: DbParam): Page = {

  val queryManager = dbClient.newQueryManager()
  val sqb: StructuredQueryBuilder = queryManager.newStructuredQueryBuilder()

  log.info("Fetching post messages from database for params: {}", dbParam)

  val modifiedQueryDef = dbParam.param.map {
    param => {
      sqb.and(
        sqb.word(sqb.jsonProperty(status), toBeReported),
        sqb.word(sqb.jsonProperty(dataCategory), "dataCategory1"),
        sqb.range(sqb.jsonProperty(creationDate), marklogicDateFormat.name, Operator.LE, DateUtil.printFpmlDateTime(param.messagesTime)))

    }
  }.getOrElse(sqb.and(sqb.word(sqb.jsonProperty(status.name), toBeReported.name)))

  modifiedQueryDef.setCollections(XmlConstants.ItracMessageTypes.OUTPUT_MESSAGE.name)
  modifiedQueryDef.setOptionsName(sortOption)
  search(modifiedQueryDef, dbParam.pageNum, dbParam.batchSize)
}


private def search(queryDef: QueryDefinition, startIndex: Int, batchSize: Int): Page = {

  val dataList: ListBuffer[Document] = new ListBuffer()
  val jsonDocManager = dbClient.newJSONDocumentManager()
  jsonDocManager.setMetadataCategories(Metadata.ALL)
  jsonDocManager.setPageLength(
    if (batchSize < pageLength) batchSize
    else pageLength)
  val documentPage = jsonDocManager.search(queryDef, startIndex)
  dataList ++= extractContent(documentPage)

  val totalSize = documentPage.getTotalSize
  log.info(s"Total documents to be reported: $totalSize")
  var pageSize = documentPage.getPageSize

  while (pageSize < batchSize && pageSize <= totalSize) {

    if (batchSize - pageSize < pageSize)
      jsonDocManager.setPageLength(batchSize - pageSize)

    val newDocPage = jsonDocManager.search(queryDef, pageSize + 1)
    dataList ++= extractContent(newDocPage)
    pageSize = pageSize + newDocPage.getPageSize
  }

  log.info("Total messages fetched are : {}", dataList.size)
  Page(startIndex, totalSize - batchSize, dataList.to[collection.immutable.Seq])

}

Sort options are:

<search:options xmlns:search="http://marklogic.com/appservices/search">
  <search:sort-order type="xs:string" direction="ascending">
    <search:json-property>subdomLvl1</search:json-property>
  </search:sort-order>
  <search:sort-order type="xs:string" direction="ascending">
    <search:json-property>trdId</search:json-property>
  </search:sort-order>
  <search:sort-order type="xs:string" direction="ascending">
    <search:json-property>validStartDate</search:json-property>
  </search:sort-order>
  <search:sort-order type="xs:string" direction="ascending">
    <search:json-property>ver</search:json-property>
  </search:sort-order>
  <search:sort-order type="xs:string" direction="ascending">
    <search:json-property>reportStatus</search:json-property>
  </search:sort-order>
</search:options>

Indexing on the database is as below:

Element range indexes are enabled for status, dataCategory, and creationDate, as well as for every property used in the sort options.

1 Answer


If the process doesn't need the document metadata, consider configuring with jsonDocManager.clearMetadataCategories() instead of jsonDocManager.setMetadataCategories(Metadata.ALL). That will reduce work on both the server and the client, as well as the amount of data transferred.

The loop can be simplified by testing newDocPage.hasNextPage() - see:

http://docs.marklogic.com/guide/java/bulk#id_21619
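As a rough illustration of the hasNextPage()-driven loop, here is a self-contained sketch. Doc, Page, and fetchPage are hypothetical stand-ins for the real MarkLogic DocumentRecord/DocumentPage types and the server round trip, so only the control flow carries over:

```scala
// Sketch of a hasNextPage-driven pagination loop. `fetchPage` is a stub
// that pretends a server holds `total` documents served `pageLength` at a
// time; in real code it would be jsonDocManager.search(queryDef, start).
object HasNextPageLoop {
  final case class Doc(uri: String)
  final case class Page(docs: List[Doc], start: Long, pageLength: Long, total: Long) {
    // Mirrors DocumentPage.hasNextPage(): true while results remain past this page.
    def hasNextPage: Boolean = start + pageLength <= total
  }

  // Stub server call: returns the documents from `start` up to the page boundary.
  def fetchPage(start: Long, pageLength: Long, total: Long): Page = {
    val end = math.min(start + pageLength - 1, total)
    Page((start to end).map(i => Doc(s"/doc/$i.json")).toList, start, pageLength, total)
  }

  // Accumulate every document by walking pages until hasNextPage is false.
  def fetchAll(pageLength: Long, total: Long): List[Doc] = {
    val buf = collection.mutable.ListBuffer.empty[Doc]
    var page = fetchPage(1, pageLength, total)
    buf ++= page.docs
    while (page.hasNextPage) {
      page = fetchPage(page.start + page.pageLength, pageLength, total)
      buf ++= page.docs
    }
    buf.toList
  }
}
```

This removes the manual pageSize/batchSize arithmetic from the original loop, which is where the double-counting risk lives.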

Instead of accumulating all million documents in a single list, could the client stream the documents to the consuming process as they arrive? That would certainly improve throughput.

You might also consider using the Data Movement SDK to read documents in multiple threads:

http://docs.marklogic.com/guide/java/data-movement#id_60613

http://docs.marklogic.com/javadoc/client/com/marklogic/client/datamovement/QueryBatcher.html
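For the multi-threaded route, a QueryBatcher job might look roughly like the following. This is an untested sketch written against the documented DMSDK API (it needs a live MarkLogic server to run); dbClient and modifiedQueryDef come from the question's code, and consume is a hypothetical callback your application would supply, so verify the details against the Javadoc:

```scala
// Untested sketch: read matching documents on multiple threads with the
// Data Movement SDK. `dbClient` and `modifiedQueryDef` are from the
// question's code; `consume` is a hypothetical application callback.
import com.marklogic.client.datamovement.ExportListener

val dmm = dbClient.newDataMovementManager()
val batcher = dmm
  .newQueryBatcher(modifiedQueryDef)       // the structured query built above
  .withBatchSize(100)                      // URIs per batch; tune for your cluster
  .withThreadCount(8)                      // parallel reader threads
  .onUrisReady(new ExportListener()
    .onDocumentReady(doc => consume(doc))) // invoked once per fetched document
  .onQueryFailure(failure => log.error("Batch failed", failure))
dmm.startJob(batcher)
batcher.awaitCompletion()                  // block until all batches are read
dmm.stopJob(batcher)
```

Because batches are fetched concurrently and handed to the listener as they arrive, this also gives you the streaming behavior mentioned above for free.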

Hoping that helps,