
What is the best way to write a Source for the Python SDK that reads a nested XML file and splits its content into multiple rows? The existing sources all work at the line level, which is not what I need for my XML.

I have a set of XML files, and each file represents one transaction that has to be broken down into multiple records (order lines, payments, etc.).


1 Answer


You can use the source for reading TensorFlow records as a model for writing your own: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/tfrecordio.py

You can parse the XML into elements with Python's standard library, for example the xml.etree.ElementTree module.
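
As a rough illustration of the parsing step only (not of the Beam source itself), here is a minimal sketch using xml.etree.ElementTree. The transaction layout, the tag and attribute names, and the parse_transaction helper are all made up for the example; your real schema will differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical transaction document; tag and attribute names are assumptions.
SAMPLE_XML = """<transaction id="T1">
  <orderLine sku="A-100" qty="2"/>
  <orderLine sku="B-200" qty="1"/>
  <payment method="card" amount="59.90"/>
</transaction>"""


def parse_transaction(xml_text):
    """Yield one dict per direct child element of a transaction document."""
    root = ET.fromstring(xml_text)
    transaction_id = root.get("id")
    for child in root:
        # Start from the child's own attributes (values stay strings).
        row = dict(child.attrib)
        # Carry the parent's id so each row can be traced back to its file.
        row["transaction_id"] = transaction_id
        # Remember which kind of record this row came from.
        row["record_type"] = child.tag
        yield row


rows = list(parse_transaction(SAMPLE_XML))
for row in rows:
    print(row)
```

A function like this could be called once per file from whatever reads the file contents, so that one XML document turns into several rows.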

Please keep in mind that a source writes to a single PCollection, which must contain only one type of element, so your source cannot emit some payment records and some order records. You'll need to either emit a single transaction record per file, or create a wrapper around each record sub-type and filter on the contents later.
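
To make the wrapper option concrete, here is a small plain-Python sketch. The Record type, its field names, the is_type helper, and the sample values are all hypothetical; the point is only that every emitted element has the same shape and carries a tag you can filter on.

```python
from collections import namedtuple

# Hypothetical wrapper: one uniform element type for every record sub-type.
Record = namedtuple("Record", ["record_type", "transaction_id", "fields"])

# Sample data standing in for what a source might emit for one transaction.
records = [
    Record("orderLine", "T1", {"sku": "A-100", "qty": "2"}),
    Record("orderLine", "T1", {"sku": "B-200", "qty": "1"}),
    Record("payment", "T1", {"method": "card", "amount": "59.90"}),
]


def is_type(record, record_type):
    """Return True if the wrapped record is of the requested sub-type."""
    return record.record_type == record_type


# Split the mixed collection by looking at the wrapper's tag.
order_lines = [r for r in records if is_type(r, "orderLine")]
payments = [r for r in records if is_type(r, "payment")]
```

In a pipeline, the same predicate could presumably be applied with a filter transform such as beam.Filter(is_type, "payment") to get one PCollection per record sub-type, but treat that as a sketch to verify against the SDK version you are using.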