
I'm working on some XQuery code (using Saxon) to execute a simple XQuery file against a large XML file.

The XML file (located at this.referenceDataPath) contains 3 million "row" nodes of the form:

<row>
<ISRC_NUMBER>1234567890</ISRC_NUMBER>
</row>
<row>
<ISRC_NUMBER>1234567891</ISRC_NUMBER>
</row>
<row>
<ISRC_NUMBER>1234567892</ISRC_NUMBER>
</row>

etc...

The XQuery document (located at this.xqueryPath) is:

declare variable $isrc as xs:string external;
declare variable $refDocument external;
let $isrcNode:=$refDocument//row[ISRC_NUMBER=$isrc]
return count($isrcNode)

The Java code is:

private XQItem referenceDataItem;
private XQPreparedExpression xPrepExec;
private XQConnection conn;

//set connection string and xquery file
this.conn = new SaxonXQDataSource().getConnection();
InputStream queryFromFile = new FileInputStream(this.xqueryPath);

//Set the prepared expression 
InputStream is  = new FileInputStream(this.referenceDataPath);
this.referenceDataItem = conn.createItemFromDocument(is, null, null);
this.xPrepExec = conn.prepareExpression(queryFromFile);
xPrepExec.bindItem(new QName("refDocument"), this.referenceDataItem);   

//the code below is in a separate method and is called multiple times
public int getCount(String searchVal) throws XQException {

    xPrepExec.bindString(new QName("isrc"), searchVal, conn.createAtomicType(XQItemType.XQBASETYPE_STRING));

    XQSequence resultsFromFile = xPrepExec.executeQuery();
    int count = Integer.parseInt(resultsFromFile.getSequenceAsString(new Properties()));
    return count;

}

The getCount method is called many times in succession (e.g. 1,000,000 times) to validate the existence of many values in the XML file.

Each call to getCount currently takes about 500 milliseconds, which seems very slow considering that the XML document is already in memory and the query is a prepared expression.

The reason I'm using XQuery is as a proof of concept for future work where the XML file will have a more complex layout.

I'm running the code on an i7 with 8 GB of RAM, so memory is not an issue; I have also increased the allocated heap size for the program.

Any suggestions on how I can improve the speed of this code?

Thanks!


2 Answers


Zorba has a facility to parse and query large XML documents. Some documentation about it is available at http://www.zorba-xquery.com/html/entry/2012/05/31/XML_Streaming

For instance, in the following code snippet we parse a 700 MB document via HTTP, and the complete process happens in a streaming manner from top to bottom:

import module namespace http = "http://expath.org/ns/http-client";
import module namespace p = "http://www.zorba-xquery.com/modules/xml";
import schema namespace opt = "http://www.zorba-xquery.com/modules/xml-options";

let $raw-data as xs:string := http:send-request(<http:request href="http://cf.zorba-xquery.com.s3.amazonaws.com/forecasts.xml" method="GET" override-media-type="text/plain" />)[2]
let $data := p:parse($raw-data, <opt:options><opt:parse-external-parsed-entity opt:skip-root-nodes="1"/></opt:options>)
return
    subsequence($data, 1, 2) 

You can try this example live at http://www.zorba-xquery.com/html/demo#CGPfEyXKvDwDfgzek/VTOIAIrJ8=


The most obvious answer to the question of how to improve the speed is to try Saxon-EE, which has a more powerful optimizer and also uses bytecode generation. I haven't tried it, but I think Saxon-EE will detect that this query would benefit from building an index, and that the same index will be reused across repeated executions of the query.

The other suggestion I would make is to declare the type of the variable $refDocument - type information helps the optimizer to make more informed decisions. For example, if the optimizer knows that $refDocument is a single node, then it knows that $refDocument//X will automatically be in document order, without any need for a sort operation.
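
For example, the declaration could read as follows (a sketch; document-node() is assumed to be the appropriate type, since the variable is bound from a parsed document):

declare variable $refDocument as document-node() external;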

Replacing the "=" operator by "eq" is also worth trying.
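
Putting the two suggestions together, the query from the question might look like this (untested sketch):

declare variable $isrc as xs:string external;
declare variable $refDocument as document-node() external;
let $isrcNode := $refDocument//row[ISRC_NUMBER eq $isrc]
return count($isrcNode)

Note that "eq" requires each row to have a single ISRC_NUMBER value, which appears to be the case in the sample data.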