I have a problem trying to sink a file into Azure Data Lake Gen 2 with Flink's StreamingFileSink. I'm using a core-site.xml together with the Hadoop bulk format, and I'm writing to my Data Lake with an abfss:// path (I also tried abfs://):
java.lang.UnsupportedOperationException: Recoverable writers on Hadoop are only supported for HDFS
[job-playground-job-cluster-0 flink-job-cluster] at org.apache.flink.runtime.fs.hdfs.HadoopRecoverableWriter.<init>(HadoopRecoverableWriter.java:61) ~[flink-dist_2.11-1.11.0.jar:1.11.0]
[job-playground-job-cluster-0 flink-job-cluster] at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.createRecoverableWriter(HadoopFileSystem.java:202) ~[flink-dist_2.11-1.11.0.jar:1.11.0]
[job-playground-job-cluster-0 flink-job-cluster] at org.apache.flink.core.fs.SafetyNetWrapperFileSystem.createRecoverableWriter(SafetyNetWrapperFileSystem.java:69) ~[flink-dist_2.11-1.11.0.jar:1.11.0]
[job-playground-job-cluster-0 flink-job-cluster] at org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink$BulkFormatBuilder.createBuckets(StreamingFileSink.java:371) ~[flink-dist_2.11-1.11.0.jar:1.11.0]
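For context, the sink is built roughly like this (a minimal sketch, not my real job; the container, account, schema, and stream are placeholders, and any bulk format hits the same code path because BulkFormatBuilder#createBuckets asks the filesystem for a recoverable writer):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

static void addAdlsSink(DataStream<GenericRecord> stream, Schema schema) {
    StreamingFileSink<GenericRecord> sink = StreamingFileSink
            .forBulkFormat(
                    // Placeholder path; this is where abfss:// (or abfs://) goes.
                    new Path("abfss://CONTAINER@ADLS_ACCOUNT_NAME.dfs.core.windows.net/output"),
                    ParquetAvroWriters.forGenericRecord(schema))
            .build();
    stream.addSink(sink);
}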
I read the official documentation and dug into the library, and the problem is here: https://github.com/apache/flink/blob/master/flink-filesystems/flink-hadoop-fs/src/main/java/org/apache/flink/runtime/fs/hdfs/HadoopRecoverableWriter.java#L60. The writer rejects every scheme except hdfs, so the abfs/abfss filesystems from the Azure connector fail this check:
public HadoopRecoverableWriter(org.apache.hadoop.fs.FileSystem fs) {
    this.fs = checkNotNull(fs);

    // This writer is only supported on a subset of file systems
    if (!"hdfs".equalsIgnoreCase(fs.getScheme())) {
        throw new UnsupportedOperationException(
                "Recoverable writers on Hadoop are only supported for HDFS");
    }
This is my core-site.xml:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.azure.account.auth.type.ADLS_ACCOUNT_NAME.dfs.core.windows.net</name>
    <value>SharedKey</value>
    <description>It is inferred by the URL</description>
  </property>
  <property>
    <name>fs.azure.account.key.ADLS_ACCOUNT_NAME.dfs.core.windows.net</name>
    <value>ADLS_KEY</value>
    <description></description>
  </property>
  <property>
    <name>fs.azure.createRemoteFileSystemDuringInitialization</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.azure.always.use.https</name>
    <value>true</value>
  </property>
</configuration>
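A quick way to confirm that these settings and the ABFS connector load correctly, independently of Flink (a sketch; the container and account are placeholders, and core-site.xml is assumed to be on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AbfsSmokeTest {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml from the classpath, including the SharedKey settings above.
        Configuration conf = new Configuration();
        Path root = new Path("abfss://CONTAINER@ADLS_ACCOUNT_NAME.dfs.core.windows.net/");
        FileSystem fs = root.getFileSystem(conf);
        // Listing the container root fails fast if auth or the abfss scheme is misconfigured.
        for (FileStatus status : fs.listStatus(root)) {
            System.out.println(status.getPath());
        }
    }
}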
Has anyone gotten past this problem, or is it an issue with the abfss/abfs scheme?
The missing piece is truncate(Path f, long newLength), which needs to be implemented and which only HDFS has done. Once it's in abfs, the Hadoop team can talk to the Flink folks about probing for this more elegantly, now that there's an API to ask whether an FS instance supports a specific feature. – stevel
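If I understand that comment correctly, the probing API is Hadoop's PathCapabilities (FileSystem#hasPathCapability, available in Hadoop 3.3+). A sketch of what such a check could look like, assuming the CommonPathCapabilities.FS_TRUNCATE capability and a placeholder path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.CommonPathCapabilities;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TruncateProbe {
    public static void main(String[] args) throws Exception {
        // Placeholder URI; the point is asking the FS instance, not matching schemes.
        Path path = new Path("abfss://CONTAINER@ADLS_ACCOUNT_NAME.dfs.core.windows.net/");
        FileSystem fs = path.getFileSystem(new Configuration());
        // True on HDFS today; would become true on ABFS once truncate lands there.
        boolean canTruncate = fs.hasPathCapability(path, CommonPathCapabilities.FS_TRUNCATE);
        System.out.println("truncate supported: " + canTruncate);
    }
}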