I have a fairly simple Apache Beam pipeline in Python that I set up in a Jupyter notebook and would like to deploy to a Dataflow runner. I am fairly new to all three of these! I am using the Python 3 and Apache Beam 2.27.0 kernel.
My pipeline options look something like this:
from apache_beam.options.pipeline_options import (
    PipelineOptions, GoogleCloudOptions, SetupOptions)

options = PipelineOptions()
options.view_as(GoogleCloudOptions).project = 'inspired-studio-11111'
# Dataflow job names may only contain lowercase letters, digits, and hyphens
options.view_as(GoogleCloudOptions).job_name = 'dataflow-test-job2-' + jobid
options.view_as(GoogleCloudOptions).region = 'us-central1'
options.view_as(GoogleCloudOptions).staging_location = 'gs://bucket/staging'
options.view_as(GoogleCloudOptions).temp_location = 'gs://bucket/temp'
options.view_as(SetupOptions).save_main_session = True
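For completeness, this is roughly how I construct and run the pipeline (simplified; the GCS paths and transforms below are placeholders, not my actual ones):

import apache_beam as beam
from apache_beam.options.pipeline_options import StandardOptions

# Target the Dataflow service instead of the local/interactive runner
options.view_as(StandardOptions).runner = 'DataflowRunner'

with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.io.ReadFromText('gs://bucket/input/*.txt')
     | 'Count' >> beam.combiners.Count.Globally()
     | 'Write' >> beam.io.WriteToText('gs://bucket/output/counts'))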
The pipeline runs fine in the notebook and interacts with Google Cloud Storage. But when I submit it to run on the GCP Dataflow runner, it consistently fails with the following exception:
Error message from worker: Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 771, in run
    self._load_main_session(self.local_staging_directory)
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 512, in _load_main_session
    pickler.load_session(session_file)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/internal/pickler.py", line 318, in load_session
    return dill.load_session(file_path)
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 368, in load_session
    module = unpickler.load()
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 472, in load
    obj = StockUnpickler.load(self)
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 462, in find_class
    return StockUnpickler.find_class(self, module, name)
ModuleNotFoundError: No module named 'IPython'
Installing and importing IPython in my notebook did not help. Does this need to be configured somewhere on the GCP worker VM?
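For reference, the install attempt looked roughly like this in a notebook cell (exact commands may differ):

# Roughly what I ran in the notebook; this installs IPython locally,
# but the error above comes from the Dataflow worker, not the notebook VM
%pip install ipython
import IPython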