2
votes

I'm facing an interesting problem, where I need to go through multiple big XML files (each is hundreds of MBs) and output specific data from each of the elements, and do it as fast as possible. Examples:

Points.xml:

<points>
  <point>
    <identifier>bb25c66c-27d0-447f-aaad-bd8290b332fd</identifier>
    <name>A</name>
  </point>
  <point>
    <identifier>f187cc74-2709-4464-995c-b3bdcae46b39</identifier>
    <name>B</name>
  </point>
</points>

Routes.xml:

<routes>
  <route>
    <pointLink xlink:href="urn:uuid:bb25c66c-27d0-447f-aaad-bd8290b332fd"/>
    <name>1</name>
  </route>
  <route>
    <pointLink xlink:href="urn:uuid:f187cc74-2709-4464-995c-b3bdcae46b39"/>
    <name>2</name>
  </route>
</routes>

There are tens of thousands of point/route elements in different documents and they are much more complicated than this, but for this example, this excerpt will suffice.

The output needs to be something like this:

1 - A
2 - B

I cannot modify the documents themselves, this is what I have to work with, and now the question is - how to link the elements based on the identifier as efficiently as possible? I'm talking about the way to do this, maybe a different way to think about this, not the actual code, since it will be quite different in the end anyway.

I have tried looping over the routes and then finding the points using FLWOR:

for $route in doc('routes.xml')/routes/route
  return concat(
    $route/name/text(),
    ' - ',
    doc('points.xml')/points/point[./identifier/text() = substring-after($route/pointLink/@xlink:href, 'urn:uuid:')]/name/text()
  )

That didn't work out really well (took almost an hour to complete). Similar story with this approach too:

for $route in doc('routes.xml')/routes/route,
    $point in doc('points.xml')/points/point[./identifier/text() = substring-after($route/pointLink/@xlink:href, 'urn:uuid:')]
  return concat(
    $route/name/text(),
    ' - ',
    $point/name/text()
  )

In the end, I will need to use more sub elements from point/route in the output so I think I have to iterate over them with for and then concat the output, but maybe I am wrong, which is why I am asking here.

Is there something I am overlooking, or is there simply no faster way to do this?

1
As you are using BaseX, have you inserted the documents into a database to make use of an index like a text index docs.basex.org/wiki/Indexes? That way any cross-reference to compare point/identifier should work faster.Martin Honnen
The documents are indeed in a database, but I have been using direct access to them via doc because in similar cases that I have already solved, it proved to be several times faster (BaseX optimizes doc to db:open-pre anyway). I'm also using exact paths in the actual query instead of // or *, so that shouldn't be problem either. However, from the link you posted, I do not see "Applying text index for .." anywhere in the Info panel so I will try to dive deeper into it, thanks!Lukáš Zemina

1 Answers

2
votes

The problem was indeed indexing, as Martin Honnen said in comment. Simply creating an attribute index (CREATE INDEX attribute) helped to reduce the query time from ~45 minutes to less than a second. Incredible.