I'm facing an interesting problem, where I need to go through multiple big XML files (each is hundreds of MBs) and output specific data from each of the elements, and do it as fast as possible. Examples:
Points.xml:
<points>
<point>
<identifier>bb25c66c-27d0-447f-aaad-bd8290b332fd</identifier>
<name>A</name>
</point>
<point>
<identifier>f187cc74-2709-4464-995c-b3bdcae46b39</identifier>
<name>B</name>
</point>
</points>
Routes.xml:
<routes>
<route>
<pointLink xlink:href="urn:uuid:bb25c66c-27d0-447f-aaad-bd8290b332fd"/>
<name>1</name>
</route>
<route>
<pointLink xlink:href="urn:uuid:f187cc74-2709-4464-995c-b3bdcae46b39"/>
<name>2</name>
</route>
</routes>
There are tens of thousands of point/route elements in different documents and they are much more complicated than this, but for this example, this excerpt will suffice.
The output needs to be something like this:
1 - A
2 - B
I cannot modify the documents themselves, this is what I have to work with, and now the question is - how to link the elements based on the identifier as efficiently as possible? I'm talking about the way to do this, maybe a different way to think about this, not the actual code, since it will be quite different in the end anyway.
I have tried looping over the routes and then finding the points using FLWOR:
for $route in doc('routes.xml')/routes/route
return concat(
$route/name/text(),
' - ',
doc('points.xml')/points/point[./identifier/text() = substring-after($route/pointLink/@xlink:href, 'urn:uuid:')]/name/text()
)
That didn't work out really well (took almost an hour to complete). Similar story with this approach too:
for $route in doc('routes.xml')/routes/route,
$point in doc('points.xml')/points/point[./identifier/text() = substring-after($route/pointLink/@xlink:href, 'urn:uuid:')]
return concat(
$route/name/text(),
' - ',
$point/name/text()
)
In the end, I will need to use more sub elements from point/route in the output so I think I have to iterate over them with for and then concat the output, but maybe I am wrong, which is why I am asking here.
Is there something I am overlooking, or is there simply no faster way to do this?
point/identifier
should work faster. – Martin Honnen