1 vote

I've installed the apache-beam Python SDK and Apache Airflow in a Docker container.
Python version: 3.5
Apache Airflow: 1.10.5

I'm trying to execute an Apache Beam pipeline using **DataflowPythonOperator**. When I run the DAG from the Airflow UI, I get:

Import Error: import apache_beam as beam. Module not found

With the same setup I tried **DataflowTemplateOperator** and it works perfectly fine.

When I tried the same Docker setup with Python 2 and Apache Airflow 1.10.3 two months back, the operator didn't return any error and worked as expected.

After SSHing into the Docker container I checked the installed libraries (using pip freeze), and I can see the installed versions of both packages:

apache-airflow==1.10.5
apache-beam==2.15.0
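An import error despite pip freeze listing the package usually means the process hitting the error runs a different interpreter than the one pip installed into. A minimal stdlib-only diagnostic, to be run with the exact interpreter Airflow uses inside the container (`module_location` is a helper name I made up):

```python
import importlib.util
import sys

def module_location(name):
    """Return the file a module would be loaded from, or None if it is
    not importable from this interpreter."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec else None

# Which interpreter is this, and can it see apache_beam?
print("interpreter:", sys.executable)
print("apache_beam:", module_location("apache_beam"))
```

If `apache_beam` prints `None` here while `pip freeze` lists it, the two are using different Python installations.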

Dockerfile:

RUN pip install --upgrade pip
RUN pip install --upgrade setuptools
RUN pip install apache-beam
RUN pip install apache-beam[gcp]
RUN pip install google-api-python-client
ADD . /home/beam


RUN pip install apache-airflow[gcp_api]

airflow operator:

new_task = DataFlowPythonOperator(
    task_id='process_details',
    py_file="path/to/file/filename.py",
    gcp_conn_id='google_cloud_default',
    dataflow_default_options={
        'project': 'xxxxx',
        'runner': 'DataflowRunner',
        'job_name': 'process_details',
        'temp_location': 'GCS/path/to/temp',
        'staging_location': 'GCS/path/to/staging',
        'input_bucket': 'bucket_name',
        'input_path': 'GCS/path/to/bucket',
        'input-files': 'GCS/path/to/file.csv'
    },
    dag=test_dag)
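For context, DataFlowPythonOperator launches the py_file as a separate process, passing the merged options to it as `--key=value` command-line flags, so `apache_beam` must be importable by whichever interpreter launches that script. A rough sketch of that flag flattening, for illustration only (`options_to_flags` is a hypothetical helper, not the operator's actual code):

```python
def options_to_flags(options):
    """Flatten an options dict into the --key=value flags a pipeline
    script would receive on its command line (illustrative sketch)."""
    return ["--{}={}".format(key, value)
            for key, value in sorted(options.items())]

# Example using a subset of the options from the DAG above.
flags = options_to_flags({
    "project": "xxxxx",
    "runner": "DataflowRunner",
    "temp_location": "GCS/path/to/temp",
})
print(flags)
```

This is why the ImportError only surfaces at DAG run time: the pipeline script is imported/executed by that subprocess, not by the Airflow scheduler itself.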
Can you share the full traceback? – Iain Shelvington
@IainShelvington Here is the traceback: link – N. L

3 Answers

1 vote

This looks like a known issue: https://github.com/GoogleCloudPlatform/DataflowPythonSDK/issues/46

It is tracked in Beam as https://issues.apache.org/jira/browse/BEAM-2964 and is being fixed upstream. In the meantime, try pinning six:

pip install six==1.10
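Before pinning, it can help to confirm whether the installed six is actually newer than the suggested pin, since `pip install six==1.10` would downgrade it. A small stdlib-only sketch (the helper names are mine):

```python
def version_tuple(version):
    """Parse a dotted version string like '1.12.0' into a comparable tuple."""
    return tuple(int(part) for part in version.split(".") if part.isdigit())

def needs_six_pin(installed):
    """True if the installed six is newer than the suggested 1.10 pin."""
    return version_tuple(installed) > version_tuple("1.10")

# Compare against whatever `pip freeze | grep six` reports, e.g.:
print(needs_six_pin("1.12.0"))
```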

0
votes

The following steps, from a related GitHub issue, should help you solve your problem:

  1. Read the following article on virtualenv; it will help with the later steps:

    https://www.dabapps.com/blog/introduction-to-pip-and-virtualenv-python/?utm_source=feedly

  2. Create a virtual environment (note: I created it in the cloudml-samples folder and named it env):

    titanium-vim-169612:~/cloudml-samples$ virtualenv env

  3. Activate the virtual environment:

    @titanium-vim-169612:~/cloudml-samples$ source env/bin/activate

  4. Install Cloud Dataflow by following this quickstart (this brings in apache_beam):

    https://cloud.google.com/dataflow/docs/quickstarts/quickstart-python

  5. Now you can check that apache_beam is present in env/lib/python2.7/site-packages/:

    @titanium-vim-169612:~/cloudml-samples/flowers$ ls ../env/lib/python2.7/site-packages/

Run the sample. At this point, I got an error about missing tensorflow. I installed tensorflow in my virtualenv using the link below (follow the installation steps for virtualenv):

https://www.tensorflow.org/install/install_linux#InstallingVirtualenv

The sample seems to work now.
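If you want to confirm the virtualenv is actually active before checking site-packages, a quick stdlib heuristic works (the function name is mine; a venv interpreter reports a prefix different from the base installation's):

```python
import sys

def in_virtualenv():
    """Heuristic: True when running inside a venv/virtualenv, where
    sys.prefix differs from the base interpreter's prefix."""
    base = getattr(sys, "base_prefix", None) or getattr(sys, "real_prefix", sys.prefix)
    return sys.prefix != base

print("virtualenv active:", in_virtualenv())
```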

0
votes

This may not be an option for you, but I was getting the same error with Python 2. Executing the same script with Python 3 resolved it.

I was running through the Dataflow tutorial: https://codelabs.developers.google.com/codelabs/cpb101-simple-dataflow-py/

and when I followed the instructions as specified:

python grep.py

I got the error from the title of your post. Running it with:

python3 grep.py 

it works as expected. I hope this helps; happy hunting if it doesn't. See the link for details on exactly what I was running.
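To fail fast when a bare `python` silently resolves to Python 2, a pipeline script can guard its entry point before importing anything heavy. A minimal sketch (the helper name is mine):

```python
import sys

def check_python(minimum=(3, 5)):
    """Raise a clear error if the running interpreter is older than
    `minimum` (e.g. when `python` resolves to Python 2)."""
    if sys.version_info < minimum:
        raise RuntimeError(
            "This pipeline needs Python %d.%d+, got %s"
            % (minimum[0], minimum[1], sys.version.split()[0])
        )

check_python()  # put this at the top of the script, before `import apache_beam`
```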