
I am trying to run a basic wordcount Beam pipeline from the Python SDK against a Flink YARN session on AWS EMR. I have used both the Flink runner and the portable runner and get the two different errors listed below. Jobs from both runners show up in the Flink UI, and both run successfully against a local Flink session on my laptop.

With the FlinkRunner, the job runs as BeamApp-hadoop-0617202523-14894e58 and fails with:

ERROR:root:java.lang.NoClassDefFoundError: Could not initialize class org.apache.beam.runners.core.construction.SerializablePipelineOptions

With the PortableRunner, the job runs as BeamApp-root-0617202248-36b0d306 (which I believe means the job is being submitted successfully from the Beam portable-runner Docker image) and fails with:

ERROR:root:java.util.ServiceConfigurationError: com.fasterxml.jackson.databind.Module: Provider com.fasterxml.jackson.module.jaxb.JaxbAnnotationModule not a subtype

I assumed these are dependency errors and tried placing the mentioned jars in the /usr/lib/flink/lib directory. The YARN container logs list the correct jars when the classpath is logged at application startup, but the errors persist.

Apache Beam version 2.22.0, Flink version 1.10.0, EMR version 5.30.0.


1 Answer


I ran into a similar issue with Apache Beam + AWS EMR + Flink and resolved it by excluding the jackson-core, jackson-annotations, and jackson-databind dependencies from FlinkPipelineOptions.filesToStage:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.stream.Collectors;

// Stage every classpath entry except the Jackson jars, which clash with
// the copies already present on the cluster.
options.setFilesToStage(Arrays.stream(System.getProperty("java.class.path").split(":"))
   .filter(d -> !d.contains("com.fasterxml.jackson.core"))
   .filter(d -> Files.exists(Paths.get(d)))
   .collect(Collectors.toList()));