
I can configure the compute profile to run the pipeline on a custom cluster that I create; however, for a preview run I cannot specify the compute profile.

There are some custom transformations I need to use which require me to install an external JAR on the Dataproc cluster for the code to work. I would like to test this with a "preview run" before I deploy the pipeline.

Is there a way I can achieve this? I don't see any property that I can set to choose the compute profile at the time of a preview run.

Regarding testing your Data Fusion pipeline: you could set the plugin fields that reference the project ID and service account to macros, in order to define them globally and make the pipeline portable. Then, with the preview run, you can assign values to those fields and test your pipeline. Would that meet your needs? - Alexandre Moraes
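To picture what Alexandre is suggesting: Data Fusion macros are ${name} placeholders in plugin fields that get substituted with values supplied at runtime (including at preview time). The snippet below is only a conceptual sketch of that substitution, not Data Fusion's actual implementation, and the macro name and path are made up for illustration:

```python
import re

def resolve_macros(config: str, arguments: dict) -> str:
    """Replace ${name} placeholders with values supplied at runtime."""
    return re.sub(r"\$\{([\w.]+)\}", lambda m: arguments[m.group(1)], config)

# A plugin field that references the project via a macro:
path = "gs://${project.id}-staging/input/"
print(resolve_macros(path, {"project.id": "my-test-project"}))
# gs://my-test-project-staging/input/
```

Because the value is bound only when the run starts, the same pipeline can point at a test project during a preview run and a production project after deployment.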
Alexandre: It's not the GCP project that is the issue. I can parameterize the projects/paths/target in the pipeline and access them at runtime during the preview run. However, the preview does not run on the compute profile I have created for the environment (there are no Spark applications running on the cluster). In fact, I have no idea which machine the job runs on (I am very new to Data Fusion and am not aware of all its features). My question is whether we can also control that behaviour and let the job run on a Dataproc cluster I have specifically provisioned. - Trishit Ghosh
According to the documentation, when you submit your job you have to select the cluster. The command line would be: gcloud dataproc jobs submit job-command \ --cluster=cluster-name --region=region \ other-dataproc-flags \ -- job-args. Is that what you are asking for? - Alexandre Moraes
The issue is more related to the Data Fusion service of GCP. I am not submitting a Spark job to a Dataproc cluster directly; I use the Data Fusion service, which internally triggers jobs on the cluster. It's fine once I have deployed the pipeline and run it, since I can configure the compute profile for my pipeline to use. However, when I do a "preview run" before deploying the pipeline, I don't have any option for choosing the compute profile. - Trishit Ghosh
You are correct: there is no possibility to choose the compute profile for a preview run in Data Fusion. Also, regarding the cluster where the job will run: according to the documentation (here), Data Fusion provisions an ephemeral Dataproc cluster, which is deleted after the completion of the job. Did this information help you? - Alexandre Moraes

1 Answer


After our discussion in the chat and further investigation, I confirm that before deploying the pipeline it is not possible to select a compute profile within the Pipeline Studio. However, you do have some options available by clicking Configure, as shown below:

(screenshot: the Configure options in the Pipeline Studio)

If you click Configure, you can change the Pipeline Config, Engine Config, Resources, and Pipeline alerts. In addition, you can select Preview mode and then click Configure to change the runtime arguments and the Preview config (the number of records that will be shown).
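For completeness, a preview can also be started programmatically through the CDAP REST API that underlies Data Fusion, by POSTing to the previews endpoint of a namespace. The sketch below only builds the URL and request body; the endpoint path and the numOfRecords field follow the CDAP preview API, but you should verify the exact payload layout against your instance's CDAP version, and the endpoint hostname here is a placeholder:

```python
import json

def preview_request(endpoint: str, namespace: str,
                    pipeline_config: dict, num_records: int = 10):
    """Build the URL and JSON body for starting a pipeline preview.

    Field names follow the CDAP preview REST API; verify them
    against your CDAP/Data Fusion version before relying on this.
    """
    url = f"{endpoint}/v3/namespaces/{namespace}/previews"
    body = dict(pipeline_config)            # artifact + config of the pipeline
    body["preview"] = {"numOfRecords": num_records}
    return url, json.dumps(body)

url, body = preview_request("https://cdap.example.com", "default",
                            {"artifact": {"name": "cdap-data-pipeline"},
                             "config": {}})
print(url)
```

Note that even when started this way, the preview still executes inside the Data Fusion instance rather than on a compute profile, so this does not work around the limitation discussed above.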

If you need to select your compute profile to test your code, I would suggest deploying the pipeline, selecting the proper compute profile, and running it. If you then need to change anything within your pipeline, you can duplicate it, which takes you back to the Pipeline Studio where you can edit it. You can achieve this as follows: click the Actions button (located in the upper right corner of the Data Fusion pipeline console), then click Duplicate.
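If you iterate on this deploy-then-duplicate loop often, it can also be scripted against the CDAP REST API: fetch a deployed pipeline (a CDAP application) and redeploy its configuration under a new name. This is a sketch under the assumption that your instance exposes the standard CDAP app endpoints and accepts a bearer token; check the authentication mechanism and response field names for your setup before using it:

```python
import json
import urllib.request

def app_url(endpoint: str, namespace: str, app_name: str) -> str:
    """URL of a deployed pipeline (a CDAP application)."""
    return f"{endpoint}/v3/namespaces/{namespace}/apps/{app_name}"

def duplicate_pipeline(endpoint: str, namespace: str,
                       name: str, new_name: str, token: str) -> int:
    """Fetch a pipeline's configuration and redeploy it under a new name."""
    headers = {"Authorization": f"Bearer {token}",
               "Content-Type": "application/json"}
    # GET the existing app's detail, which includes its configuration.
    req = urllib.request.Request(app_url(endpoint, namespace, name),
                                 headers=headers)
    with urllib.request.urlopen(req) as resp:
        detail = json.load(resp)
    # PUT the same artifact + config back under the new name.
    body = json.dumps({"artifact": detail["artifact"],
                       "config": json.loads(detail["configuration"])}).encode()
    put = urllib.request.Request(app_url(endpoint, namespace, new_name),
                                 data=body, headers=headers, method="PUT")
    with urllib.request.urlopen(put) as resp:
        return resp.status
```

The UI's Duplicate action remains the simpler route for a one-off edit; a script like this mainly pays off when you repeat the test cycle many times.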

As an alternative, you can also file a feature request with Google at this link.