
I'm working on some XQuery code (using Saxon) to execute a simple XQuery file against a large XML file.

The XML file (located at this.referenceDataPath) contains 3 million "row" nodes of the form:

<row>
<ISRC_NUMBER>1234567890</ISRC_NUMBER>
</row>
<row>
<ISRC_NUMBER>1234567891</ISRC_NUMBER>
</row>
<row>
<ISRC_NUMBER>1234567892</ISRC_NUMBER>
</row>

etc...

The XQuery document (located at this.xqueryPath) is:

declare variable $isrc as xs:string external;
declare variable $refDocument external;
let $isrcNode:=$refDocument//row[ISRC_NUMBER=$isrc]
return count($isrcNode)

The Java code is:

private XQItem referenceDataItem;
private XQPreparedExpression xPrepExec;
private XQConnection conn;

//set connection string and xquery file
this.conn = new SaxonXQDataSource().getConnection();
InputStream queryFromFile = new FileInputStream(this.xqueryPath);

//Set the prepared expression 
InputStream is  = new FileInputStream(this.referenceDataPath);
this.referenceDataItem = conn.createItemFromDocument(is, null, null);
this.xPrepExec = conn.prepareExpression(queryFromFile);
xPrepExec.bindItem(new QName("refDocument"), this.referenceDataItem);   

//the code below is in a separate method and is called multiple times
public int getCount(String searchVal) throws XQException {

    xPrepExec.bindString(new QName("isrc"), searchVal, conn.createAtomicType(XQItemType.XQBASETYPE_STRING));

    XQSequence resultsFromFile = xPrepExec.executeQuery();
    int count = Integer.parseInt(resultsFromFile.getSequenceAsString(new Properties()));
    return count;

}

The getCount method is called many times in succession (e.g. 1,000,000 times) to validate the existence of many values in the XML file.

Each call to getCount currently takes about 500 milliseconds, which seems very slow considering that the XML document is already in memory and the query is a prepared expression.

The reason I'm using XQuery is as a proof of concept for future work where the XML file will have a more complex layout.

I'm running the code on an i7 with 8 GB of RAM, so memory is not an issue; I have also increased the allocated heap size for the program.

Any suggestions on how I can improve the speed of this code?

Thanks!


2 Answers


Zorba has a facility to parse and query large XML documents. Some documentation about it is available at http://www.zorba-xquery.com/html/entry/2012/05/31/XML_Streaming

For instance, in the following code snippet we parse a 700 MB document via HTTP, and the complete process happens in a streaming manner from top to bottom:

import module namespace http = "http://expath.org/ns/http-client";
import module namespace p = "http://www.zorba-xquery.com/modules/xml";
import schema namespace opt = "http://www.zorba-xquery.com/modules/xml-options";

let $raw-data as xs:string := http:send-request(<http:request href="http://cf.zorba-xquery.com.s3.amazonaws.com/forecasts.xml" method="GET" override-media-type="text/plain" />)[2]
let $data := p:parse($raw-data, <opt:options><opt:parse-external-parsed-entity opt:skip-root-nodes="1"/></opt:options>)
return
    subsequence($data, 1, 2) 

You can try this example live at http://www.zorba-xquery.com/html/demo#CGPfEyXKvDwDfgzek/VTOIAIrJ8=


The most obvious answer to the question of how to improve the speed is to try Saxon-EE, which has a more powerful optimizer and also uses bytecode generation. I haven't tried it, but I think Saxon-EE will detect that this query would benefit from building an index, and that the same index will be reused across repeated executions of the query.

The other suggestion I would make is to declare the type of the variable $refDocument - type information helps the optimizer to make more informed decisions. For example, if the optimizer knows that $refDocument is a single node, then it knows that $refDocument//X will automatically be in document order, without any need for a sort operation.
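
For example, the declaration could read as follows (a sketch; document-node() is assumed to be the appropriate type, since the variable is bound from a parsed document):

declare variable $refDocument as document-node() external;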

Replacing the "=" operator by "eq" is also worth trying.
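
Putting the two suggestions together, the query from the question might look like this (untested sketch):

declare variable $isrc as xs:string external;
declare variable $refDocument as document-node() external;
let $isrcNode := $refDocument//row[ISRC_NUMBER eq $isrc]
return count($isrcNode)

Note that "eq" requires each row to have a single ISRC_NUMBER value, which appears to be the case in the sample data.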