7 votes

Task: I need to run an ETL job that will Extract TIFF images from GCS, Transform those images to text with a combination of open-source computer vision tools such as OpenCV + Tesseract, and ultimately Load the data into BigQuery.

Problem: I am trying to use Dataflow to perform the ETL job because I have millions of images (each image is a separate file/blob) and I want to scale to hundreds of machines. However, I am running into some problems with Dataflow (which will be described in greater detail below) regarding the best means to download the images.

Questions: Ultimately I am trying to determine:

1) Is Dataflow the best solution for this? An alternative I have considered is running a multithreaded job on a single large machine. Are there any other alternatives that I should be considering?

2) If Dataflow is the best solution, how should I specifically handle downloading the millions of images (so that I can run them through a transformation)?

Technical challenges:

The following post (Recommended solution) recommends using beam.io.gcp.gcsio.GcsIO().open(filepath, 'r') in a DoFn to download images from GCS.
I've attempted going down this path, using beam.io.gcp.gcsio.GcsIO().open(filepath, 'r'), but I am having trouble opening the images. That issue is described here: io.BufferedReader issue.
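To make that concrete, the DoFn I am describing looks roughly like the following (ReadTiffFn and the element format are just placeholders):

    import apache_beam as beam
    from apache_beam.io.gcp import gcsio

    class ReadTiffFn(beam.DoFn):
        def process(self, gcs_path):
            # Open the blob for reading and pull the raw TIFF bytes.
            with gcsio.GcsIO().open(gcs_path, 'r') as f:
                yield gcs_path, f.read()

The idea is to feed a PCollection of gs:// paths into beam.ParDo(ReadTiffFn()) and run the OpenCV/Tesseract transformation downstream.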

When using the DirectRunner I can download the image files with the google.cloud.storage client (from google.cloud import storage), and I can open and pre-process the images with no problem. However, when using the Dataflow runner, I run into dependency issues: AttributeError: 'module' object has no attribute 'storage'.
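For reference, the download path that works under the DirectRunner is essentially this (bucket and blob names are placeholders, and a reasonably recent google-cloud-storage release is assumed):

    from google.cloud import storage

    def download_tiff_bytes(bucket_name, blob_name):
        # Standard GCS client; works fine locally with the DirectRunner.
        client = storage.Client()
        blob = client.bucket(bucket_name).blob(blob_name)
        return blob.download_as_bytes()

The AttributeError above only appears once the same code runs on Dataflow workers.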

That being said, if Dataflow is the best solution, what is the best method to download and process millions of images?


1 Answer

2 votes

You are doing the right thing. It seems that you ran into two problems:

  • With the io.BufferedReader issue, you need to add an interface that will allow you to seek into the TIFF file, as you correctly found in the question (see the sketch after this list).

  • It seems that the problem with using google.cloud.storage is that the dependency is not available in the Dataflow environment. To add this dependency, check out Managing Python pipeline dependencies in the Beam documentation.
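On the first point, a minimal sketch of the fix: read the blob fully and wrap the bytes in io.BytesIO, which gives the decoder the seekable interface that TIFF reading needs (Pillow is used here purely for illustration; the class name is a placeholder):

    import io
    import apache_beam as beam
    from apache_beam.io.gcp import gcsio
    from PIL import Image

    class TiffToImageFn(beam.DoFn):
        def process(self, gcs_path):
            # Read the whole blob into memory first.
            with gcsio.GcsIO().open(gcs_path, 'r') as f:
                data = f.read()
            # io.BytesIO is seekable, which TIFF decoding requires.
            image = Image.open(io.BytesIO(data))
            yield gcs_path, image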

On the second point, the main idea is that you can run your pipeline with --requirements_file reqs.txt passed in, where reqs.txt lists all the extra dependencies that you'd like to add.
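For example, if reqs.txt contains a line such as google-cloud-storage, the pipeline options could look roughly like this (project, bucket, and file names are placeholders):

    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        '--runner=DataflowRunner',
        '--project=my-project',                # placeholder project id
        '--temp_location=gs://my-bucket/tmp',  # placeholder bucket
        '--requirements_file=reqs.txt',        # extra dependencies for the workers
    ])

Dataflow then installs everything listed in reqs.txt on each worker, so imports like google.cloud.storage resolve at runtime.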