
I have written a very simple NiFi template that first lists and then fetches an object from a bucket on Google Cloud Storage. When fetching the object, NiFi downloads it from the bucket over the internet. My question is: if I want to ingest such an object into other Google Cloud services, such as Pub/Sub or Cloud Datastore, do I need to download the file to a separate node?

Could I not instead run a node inside Google Cloud, in the same IP range as Google Cloud Storage, so that the object is transferred within Google's network rather than downloaded over the internet?

Another question: do the default Dataflow templates for transferring files and objects from buckets to other services, such as Pub/Sub, follow the same principle? That is, do they use an internet connection to transfer an object from a bucket to Pub/Sub, or do they transfer it between nodes on the internal network?

1 Answer


Transfers between Google Cloud Platform services are made within Google's private network. As long as you have set the appropriate firewall rules, the services can communicate directly over that private network, so there is no need to download the files over the internet.

For example, suppose a job downloads an object from an external source to Cloud Storage and then transfers it from Cloud Storage to Cloud Datastore. The download to Cloud Storage goes over the internet, but the transfer to Cloud Datastore uses the internal private network.

The same applies to your second question: for Dataflow jobs, files and objects are transferred between nodes on the internal network rather than over the internet.

As described in the Dataflow Documentation - Regional endpoints:

You can minimize network latency and network transport costs by running a Cloud Dataflow job from the same region as its sources and/or sinks.


Notes about common Cloud Dataflow job sources:

Cloud Storage buckets can be regional or multi-regional resources: When using a Cloud Storage regional bucket as a source, Google recommends that you perform read operations in the same region.
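
To make the region advice above concrete, here is a minimal Python sketch that derives a Dataflow region from a bucket's location and builds the matching pipeline arguments. The helper names, the project and bucket names, and the fallback region are all hypothetical and not part of the original answer. The sketch assumes the bucket location is given in the uppercase form Cloud Storage reports (for example `EUROPE-WEST1`, or `US` for a multi-region), and it does not check whether Dataflow is actually available in that region.

```python
# Multi-region bucket locations as reported by Cloud Storage (uppercase).
MULTI_REGIONS = {"US", "EU", "ASIA"}


def dataflow_region_for_bucket(bucket_location, fallback_region):
    """Pick a Dataflow region co-located with the bucket where possible.

    Assumption: single-region locations contain a hyphen (e.g. "US-EAST1"),
    while multi-region and dual-region codes (e.g. "US", "NAM4") do not.
    """
    location = bucket_location.strip().upper()
    if location in MULTI_REGIONS or "-" not in location:
        # No single region to match, so use the caller's chosen fallback.
        return fallback_region
    return location.lower()


def build_pipeline_args(project, bucket_name, bucket_location,
                        fallback_region="us-central1"):
    """Build Apache Beam pipeline flags for a Dataflow job near the bucket."""
    region = dataflow_region_for_bucket(bucket_location, fallback_region)
    return [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location=gs://{bucket_name}/tmp",
    ]


if __name__ == "__main__":
    # Hypothetical project and bucket names, for illustration only.
    print(build_pipeline_args("my-project", "my-bucket", "EUROPE-WEST1"))
```

In a real job the location would typically come from the bucket's metadata rather than being hard-coded, and the resulting list would be passed to the pipeline's options when the job is launched.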