I'm using Flink SQL to read Debezium Avro data from Kafka and store it as Parquet files in S3. Here is my code:
import os
from pyflink.datastream import StreamExecutionEnvironment, FsStateBackend
from pyflink.table import TableConfig, DataTypes, BatchTableEnvironment, StreamTableEnvironment, \
    ScalarFunction
exec_env = StreamExecutionEnvironment.get_execution_environment()
exec_env.set_parallelism(1)
# start a checkpoint every 12 s
exec_env.enable_checkpointing(12000)
t_config = TableConfig()
t_env = StreamTableEnvironment.create(exec_env, t_config)
INPUT_TABLE = 'source'
KAFKA_TOPIC = os.environ['KAFKA_TOPIC']
KAFKA_BOOTSTRAP_SERVER = os.environ['KAFKA_BOOTSTRAP_SERVER']
OUTPUT_TABLE = 'sink'
S3_BUCKET = os.environ['S3_BUCKET']
OUTPUT_S3_LOCATION = os.environ['OUTPUT_S3_LOCATION']
ddl_source = f"""
    CREATE TABLE {INPUT_TABLE} (
        `event_time` TIMESTAMP(3) METADATA FROM 'timestamp' VIRTUAL,
        `id` BIGINT,
        `price` DOUBLE,
        `type` INT,
        `is_reinvite` INT
    ) WITH (
        'connector' = 'kafka',
        'topic' = '{KAFKA_TOPIC}',
        'properties.bootstrap.servers' = '{KAFKA_BOOTSTRAP_SERVER}',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'debezium-avro-confluent',
        'debezium-avro-confluent.schema-registry.url' = 'http://kafka-production-schema-registry:8081'
    )
"""
ddl_sink = f"""
    CREATE TABLE {OUTPUT_TABLE} (
        `event_time` TIMESTAMP,
        `id` BIGINT,
        `price` DOUBLE,
        `type` INT,
        `is_reinvite` INT
    ) WITH (
        'connector' = 'filesystem',
        'path' = 's3://{S3_BUCKET}/{OUTPUT_S3_LOCATION}',
        'format' = 'parquet'
    )
"""
t_env.sql_update(ddl_source)
t_env.sql_update(ddl_sink)
t_env.execute_sql(f"""
    INSERT INTO {OUTPUT_TABLE}
    SELECT *
    FROM {INPUT_TABLE}
""")
When I submit the job, I get the following error message:
pyflink.util.exceptions.TableException: Table sink 'default_catalog.default_database.sink' doesn't support consuming update and delete changes which is produced by node TableSourceScan(table=[[default_catalog, default_database, source]], fields=[id, price, type, is_reinvite, timestamp])
I'm using Flink 1.12.1. The source itself works correctly; I verified it by replacing the sink with a 'print' connector (a rough sketch of that test sink follows the sample output below). Here is a sample of the data, extracted from the task manager logs while the 'print' connector was used as the table sink:
-D(2021-02-20T17:07:27.298,14091764,26.0,9,0)
-D(2021-02-20T17:07:27.298,14099765,26.0,9,0)
-D(2021-02-20T17:07:27.299,14189806,16.0,9,0)
-D(2021-02-20T17:07:27.299,14189838,37.0,9,0)
-D(2021-02-20T17:07:27.299,14089840,26.0,9,0)
-D(2021-02-20T17:07:27.299,14089847,26.0,9,0)
-D(2021-02-20T17:07:27.300,14189859,26.0,9,0)
-D(2021-02-20T17:07:27.301,14091808,37.0,9,0)
-D(2021-02-20T17:07:27.301,14089911,37.0,9,0)
-D(2021-02-20T17:07:27.301,14099937,26.0,9,0)
-D(2021-02-20T17:07:27.302,14091851,37.0,9,0)
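The -D prefix on these rows shows that they arrive as DELETE changes. For reference, the 'print' sink I used for this test was declared roughly as follows (a sketch from memory, reusing the same OUTPUT_TABLE name and column list as the Parquet sink above); the 'print' connector accepts insert, update and delete rows, which is why this version runs:

# Sketch of the test sink: same schema as the Parquet sink, but written to
# stdout via the 'print' connector, which can consume update/delete changes.
ddl_print_sink = f"""
    CREATE TABLE {OUTPUT_TABLE} (
        `event_time` TIMESTAMP(3),
        `id` BIGINT,
        `price` DOUBLE,
        `type` INT,
        `is_reinvite` INT
    ) WITH (
        'connector' = 'print'
    )
"""
t_env.execute_sql(ddl_print_sink)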
How can I make my table sink work with the filesystem connector?