Task: I need to run an ETL job that will Extract TIFF images from GCS, Transform those images to text with a combination of open-source computer vision tools such as OpenCV + Tesseract, and ultimately Load the data into BigQuery.
Problem: I am trying to use Dataflow to perform the ETL job because I have millions of images (each image is a separate file/blob) and I want to scale to hundreds of machines. However, I am running into some problems with Dataflow (which will be described in greater detail below) regarding the best means to download the images.
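To make that concrete, the overall pipeline I have in mind looks roughly like the sketch below (the hard-coded file listing, the ExtractTextFn body, and the BigQuery table/schema are placeholders I made up for illustration, not working code):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


# Placeholder DoFn: download one TIFF, OCR it, and emit a row for BigQuery.
class ExtractTextFn(beam.DoFn):
    def process(self, gcs_path):
        # ... download the image, run OpenCV / Tesseract, yield a dict ...
        yield {'path': gcs_path, 'text': '...'}


options = PipelineOptions()  # e.g. --runner=DataflowRunner --project=... --temp_location=...
with beam.Pipeline(options=options) as p:
    (p
     | 'ListImages' >> beam.Create(['gs://my-bucket/images/img-0001.tiff'])  # placeholder listing
     | 'OcrImages' >> beam.ParDo(ExtractTextFn())
     | 'WriteToBQ' >> beam.io.WriteToBigQuery(
         'my-project:ocr_dataset.ocr_results',  # placeholder table
         schema='path:STRING,text:STRING',
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))
```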
Questions: Ultimately I am trying to determine:
1) Is Dataflow the best solution to do this? Alternatives that I have considered are running a multithreaded job on a large machine. Are there any other alternatives that I should be considering?
2) If dataflow is the best solution, then how should I specifically handle downloading the millions of images (so that I can run them through a transformation)?
Technical challenges:
The following post (Recommended solution) recommends using `beam.io.gcp.gcsio.GcsIO().open(filepath, 'r')` in a `DoFn` to download images from GCS.
I've attempted going down this path, using `beam.io.gcp.gcsio.GcsIO().open(filepath, 'r')`; however, I am having trouble opening the images. That issue is described here: IO.BufferReader issue.
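For reference, the shape of what I tried looks roughly like this (a minimal sketch; wrapping the downloaded bytes in `io.BytesIO` before handing them to PIL is just the workaround I have been experimenting with, not a confirmed fix):

```python
import io

import apache_beam as beam
from apache_beam.io.gcp import gcsio
from PIL import Image


class DownloadImageFn(beam.DoFn):
    def process(self, gcs_path):
        # GcsIO().open returns a file-like reader over the GCS blob.
        with gcsio.GcsIO().open(gcs_path, 'r') as f:
            data = f.read()
        # Wrap the raw bytes so PIL gets a plain seekable stream
        # instead of the GCS reader object directly.
        image = Image.open(io.BytesIO(data))
        yield gcs_path, image
```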
When using the DirectRunner, I can download the image files with the Cloud Storage client library (`from google.cloud import storage`) and I can open and pre-process the images without a problem. However, when using the DataflowRunner, I am running into dependency issues: `AttributeError: 'module' object has no attribute 'storage'`.
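The workaround I have been experimenting with for the DataflowRunner case is to defer the import into the `DoFn` and ship the dependency to the workers. This is only a sketch, assuming the error comes from `google.cloud.storage` not being importable on the workers; the `DownloadWithClientFn` name and the requirements file step are my own additions:

```python
import apache_beam as beam


class DownloadWithClientFn(beam.DoFn):
    def start_bundle(self):
        # Import inside the DoFn so the google.cloud namespace package is
        # resolved on the worker rather than pickled from the launch machine.
        from google.cloud import storage
        self.client = storage.Client()

    def process(self, gcs_path):
        # Split gs://bucket/path/to/blob into bucket and blob names.
        bucket_name, blob_name = gcs_path[len('gs://'):].split('/', 1)
        blob = self.client.bucket(bucket_name).blob(blob_name)
        # download_as_string() returns the raw bytes of the object
        # (newer client versions also offer download_as_bytes()).
        yield gcs_path, blob.download_as_string()
```

I combine this with launching the job with `--requirements_file requirements.txt`, where requirements.txt lists `google-cloud-storage`, so the package is installed on the Dataflow workers.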
That being said, if Dataflow is the best solution, what is the best method to download and process millions of images?