
I'm running batch pipelines in Google Cloud Dataflow. I need to read objects in one pipeline that another pipeline has previously written. The easiest way to serialize the objects is pickle / dill.

The writing works well, producing a number of files, each with a pickled object. When I download a file manually, I can unpickle it. Code for writing: beam.io.WriteToText('gs://{}', coder=coders.DillCoder())

But the reading breaks every time, with one of the errors below. Code for reading: beam.io.ReadFromText('gs://{}*', coder=coders.DillCoder())

Either...

  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 266, in load
    obj = pik.load()
  File "/usr/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
KeyError: '\x90'

...or...

  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 423, in find_class
    return StockUnpickler.find_class(self, module, name)
  File "/usr/lib/python2.7/pickle.py", line 1124, in find_class
    __import__(module)
ImportError: No module named measur

(the object's class sits in a path containing measure, though I'm not sure why the last character is dropped there)

I've tried using the default coder, and a BytesCoder, and pickling & unpickling as a custom task in the pipeline.

My working hypothesis is that the reader splits the file by line, and so treats a single pickle (which contains newlines) as multiple objects. If so, is there a way of avoiding that?
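(For what it's worth, a quick check shows that a binary pickle can indeed contain newline bytes, so the snippet below is only illustrating the hypothesis:)

import pickle

# The integer 10 pickles to '\x80\x02K\n.' under protocol 2: the byte 0x0A
# (newline) is the value itself, not a record separator.
data = pickle.dumps(10, protocol=2)
assert '\n' in data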

I could attempt to build a reader myself, but I'm hesitant since this seems like a well-solved problem (e.g. Beam already has a format to move objects from one pipeline stage to another).

Tangentially related: How to read blob (pickle) files from GCS in a Google Cloud DataFlow job?

Thank you!


2 Answers

1 vote

ReadFromText is designed to read newline-separated records in text files, so it is not suitable for your use case. Implementing FileBasedSource is not a good solution either, since it's designed for reading large files with multiple records (and usually splits those files into shards for parallel processing). So, in your case, the current best solution for the Python SDK is to implement a source yourself. This can be as simple as a ParDo that reads files and produces a PCollection of records (see the sketch below). If your ParDo produces a large number of records, consider adding an apache_beam.transforms.util.Reshuffle step after it, which will allow runners to parallelize the following steps better. For the Java SDK we have FileIO, which already provides transforms to make this a bit easier.
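A minimal sketch of such a ParDo-based source, assuming one pickled object per file; the class name, bucket path, and use of FileSystems/dill here are illustrative rather than prescribed by Beam:

import dill
import apache_beam as beam
from apache_beam.io.filesystems import FileSystems
from apache_beam.transforms.util import Reshuffle


class ReadPickledObjects(beam.DoFn):
    """Expands a glob, opens each matched file and yields the pickled object inside."""

    def process(self, file_pattern):
        match_result = FileSystems.match([file_pattern])[0]
        for metadata in match_result.metadata_list:
            with FileSystems.open(metadata.path) as f:
                yield dill.load(f)


with beam.Pipeline() as p:
    objects = (
        p
        | beam.Create(['gs://my-bucket/pickles/*'])   # hypothetical path
        | beam.ParDo(ReadPickledObjects())
        | Reshuffle()   # lets the runner rebalance work for the downstream steps
    )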

0 votes

Encoding as string_escape escapes the newlines, so the only newlines that Beam sees are those between pickles:

from apache_beam.coders import coder_impl
from apache_beam.coders.coders import DillCoder, maybe_dill_dumps, maybe_dill_loads


class DillMultiCoder(DillCoder):
    """
    Coder that allows multi-line pickles to be read.

    After an object is pickled, the bytes are encoded with `string_escape`
    (`unicode_escape` on Python 3), so newline characters (`\n`) don't appear
    in the string.

    Previously, the presence of newline characters confused the Dataflow
    reader, as it couldn't discriminate between a new object and a newline
    within a pickle string.
    """

    def _create_impl(self):
        return coder_impl.CallbackCoderImpl(
            maybe_dill_multi_dumps, maybe_dill_multi_loads)


def maybe_dill_multi_dumps(o):
    # in Py3 this needs to be `unicode_escape`
    return maybe_dill_dumps(o).encode('string_escape')


def maybe_dill_multi_loads(o):
    # in Py3 this needs to be `unicode_escape`
    return maybe_dill_loads(o.decode('string_escape'))
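For reference, the escaping round-trips cleanly on Python 2, which is what keeps the reader from seeing record boundaries inside a pickle (a toy illustration, not part of the coder):

# Python 2: string_escape replaces the raw newline byte with the two
# characters backslash and 'n', and decoding reverses it exactly.
raw = 'line one\nline two'
escaped = raw.encode('string_escape')     # 'line one\\nline two'
assert '\n' not in escaped
assert escaped.decode('string_escape') == raw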

For large pickles, I also needed to set the buffer size much higher, to 8MB. With the previous 8kB buffer, a 120MB file spun for 2 days of CPU time:

from apache_beam.io.filesystem import CompressionTypes
from apache_beam.io.textio import ReadFromText, _TextSource


class ReadFromTextPickle(ReadFromText):
    """
    Same as ReadFromText, but with a much bigger buffer. With the standard 8KB
    buffer, large files can be read in a loop and never finish.

    Also defaults the coder to DillMultiCoder.
    """

    def __init__(
            self,
            file_pattern=None,
            min_bundle_size=0,
            compression_type=CompressionTypes.AUTO,
            strip_trailing_newlines=True,
            coder=DillMultiCoder(),
            validate=True,
            skip_header_lines=0,
            **kwargs):
        # calling the parent __init__ needs commenting out, not sure why
        # super(ReadFromTextPickle, self).__init__(**kwargs)
        self._source = _TextSource(
            file_pattern,
            min_bundle_size,
            compression_type,
            strip_trailing_newlines=strip_trailing_newlines,
            coder=coder,
            validate=validate,
            skip_header_lines=skip_header_lines,
            buffer_size=8000000)
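A hypothetical usage, with the bucket path made up; it drops in where ReadFromText would normally go:

import apache_beam as beam

with beam.Pipeline() as p:
    pickles = p | 'ReadPickles' >> ReadFromTextPickle('gs://my-bucket/pickles/*')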

Another approach would be to implement a PickleFileSource inheriting from FileBasedSource and call pickle.load on the file, with each call yielding a new object. But there's a bunch of complication around offset_range_tracker that looked like more lift than strictly necessary; a rough sketch is below.
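The sketch assumes that marking the source non-splittable sidesteps most of the range-tracker complication; the class name and details are illustrative and untested, and dill.load is used since the files were written with DillCoder:

import dill
from apache_beam.io import filebasedsource
from apache_beam.io.filesystems import FileSystems


class PickleFileSource(filebasedsource.FileBasedSource):
    def __init__(self, file_pattern):
        # splittable=False: each file is one opaque pickle stream, so Beam
        # shouldn't try to split it at arbitrary byte offsets.
        super(PickleFileSource, self).__init__(file_pattern, splittable=False)

    def read_records(self, file_name, offset_range_tracker):
        # For a non-splittable source the whole file is a single range; claim
        # its start once, then stream objects out until EOF.
        if not offset_range_tracker.try_claim(offset_range_tracker.start_position()):
            return
        with FileSystems.open(file_name) as f:
            while True:
                try:
                    yield dill.load(f)
                except EOFError:
                    break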