3 votes

Is there any way to use a Python user-defined function within a Java Flink job, or any way to communicate, for example, the result of a transformation done by Flink in Java to a Python user-defined function, in order to apply some machine learning?

I know that from PyFlink you can do something like this:

table_env.register_java_function("hash_code", "my.java.function.HashCode")
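
where my.java.function.HashCode would be a regular Table API scalar function on the Java side, something like:

package my.java.function;

import org.apache.flink.table.functions.ScalarFunction;

// a plain Java scalar function, shown here just for context
public class HashCode extends ScalarFunction {
    public int eval(String s) {
        return s.hashCode();
    }
}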

But I need to do something like that the other way around: register the Python function from Java. Or, how can I pass the result of a Java transformation directly to a Python UDF Flink job?

I hope these questions are not too crazy, but I need to know if there is some way to connect the Flink DataStream API with the Python Table API, with Java as the main language. This means that from Java I need to do: Source -> Transformations -> Sink, but some of these transformations could trigger a Python function, or a Python function would be waiting for some Java transformation to finish in order to do something with the stream result.

I hope someone understands what I'm trying to do here.

Kind regards!


2 Answers

1 vote

Support for Python UDFs (user-defined functions) was added in Flink 1.10 -- see PyFlink: Introducing Python Support for UDFs in Flink's Table API. For example, you can do this:

from pyflink.table import DataTypes
from pyflink.table.udf import udf

add = udf(lambda i, j: i + j, [DataTypes.BIGINT(), DataTypes.BIGINT()], DataTypes.BIGINT())
table_env.register_function("add", add)
my_table.select("add(a, b)")

For more examples, see the blog post linked above, or the stable documentation.

In Flink 1.11 (release expected next week), support has been added for vectorized Python UDFs, bringing interoperability with Pandas, NumPy, etc. This release also includes support for Python UDFs in SQL DDL and in the SQL Client. For documentation, see the master docs.

It sounds like you want to call out to Python from Java. The Stateful Functions API supports this more completely -- see remote functions. But to call out to Python from the Java DataStream API, I think your only option is to use the SQL DDL support added in Flink 1.11. See FLIP-106 and the docs.

FLIP-106 has this example:

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
BatchTableEnvironment tEnv = BatchTableEnvironment.create(env);
tEnv.getConfig().getConfiguration().setString("python.files", "/home/my/test1.py");
tEnv.getConfig().getConfiguration().setString("python.client.executable", "python3");

tEnv.sqlUpdate("create temporary system function func1 as 'test1.func1' language python");
Table table = tEnv.fromDataSet(env.fromElements("1", "2", "3")).as("str").select("func1(str)");
tEnv.toDataSet(table, String.class).collect();

which you should be able to convert to use the DataStream API instead.
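
As a rough, untested sketch of that conversion (using the Flink 1.11 streaming bridge API, with the paths and function name taken from the FLIP-106 example above):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

// ship the Python script and point Flink at a Python interpreter
tEnv.getConfig().getConfiguration().setString("python.files", "/home/my/test1.py");
tEnv.getConfig().getConfiguration().setString("python.client.executable", "python3");

// register the Python UDF via SQL DDL, then apply it to a stream instead of a data set
tEnv.executeSql("create temporary system function func1 as 'test1.func1' language python");

Table table = tEnv.fromDataStream(env.fromElements("1", "2", "3")).as("str").select("func1(str)");
DataStream<Row> result = tEnv.toAppendStream(table, Row.class);
result.print();

env.execute();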

1 vote

An example of this integration follows. This dependency is needed in your pom.xml, assuming that Flink 1.11 is the current version:

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-table-planner-blink_2.11</artifactId>
  <version>1.11.2</version>
  <scope>provided</scope>
</dependency>
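
Besides the planner, the Python bridge presumably needs to be on the classpath as well (artifact coordinates assuming Flink 1.11):

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-python_2.11</artifactId>
  <version>1.11.2</version>
</dependency>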

Create the Environments:

private StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

private StreamTableEnvironment tableEnv = getTableAPIEnv(env);

/* this SingleOutputStreamOperator will contain the result of the consumption from the defined source */
private SingleOutputStreamOperator<Event> stream;

public static StreamTableEnvironment getTableAPIEnv(StreamExecutionEnvironment env) {
    final StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
    /* the paths are placeholders: point them at your Python script and interpreter */
    tableEnv.getConfig().getConfiguration().setString("python.files", "path/to/function.py");
    tableEnv.getConfig().getConfiguration().setString("python.client.executable", "path/to/python");
    tableEnv.getConfig().getConfiguration().setString("python.executable", "path/to/python");
    /* the Python worker needs task off-heap memory, so reserve some */
    tableEnv.getConfig().getConfiguration().setString("taskmanager.memory.task.off-heap.size", "79mb");
    /* register here the function.py script and the name of the function inside it */
    tableEnv.executeSql("CREATE TEMPORARY SYSTEM FUNCTION FunctionName AS 'function.FunctionName' LANGUAGE PYTHON");
    return tableEnv;
}

Start with the transformations that you want to do, for example:

SingleOutputStreamOperator<EventProfile> profiles = createUserProfile(stream.keyBy(k -> k.id));

/* the result of the `createUserProfile()` ProcessFunction is sent to the Python function,
   which updates some values of the profile and returns them back to Java through a
   Flink operator: a map function, for example */
profiles = turnIntoTable(profiles).map((MapFunction<Row, EventProfile>) row -> {
    /* your custom mapping code here; it must rebuild an EventProfile from the Row */
    return EventProfile.fromRow(row); // hypothetical helper
});
profiles.addSink(new YourCustomSinkFunction());

/* this function processes the Event and creates the EventProfile class used in this example;
   you can also use other operators (map, flatMap, etc). A sketch of
   UserProfileProcessFunction follows below. */
private SingleOutputStreamOperator<EventProfile> createUserProfile(KeyedStream<Event, String> stream) {
    return stream.process(new UserProfileProcessFunction());
}
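
UserProfileProcessFunction could be a minimal sketch along these lines (assuming Event and EventProfile are POJOs with the fields used in this example):

import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

/* sketch only: a real implementation would typically keep keyed state, timers, etc. */
public class UserProfileProcessFunction
        extends KeyedProcessFunction<String, Event, EventProfile> {

    @Override
    public void processElement(Event event, Context ctx, Collector<EventProfile> out) {
        // the Event/EventProfile fields below are assumptions for illustration
        out.collect(new EventProfile(event.id, event.noOfHits, event.timestamp));
    }
}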


/* This function receives a SingleOutputStreamOperator, sends each record to the Python
   function through the Table API, and returns a Row of String (you can change the Row type)
   that will be mapped back into the EventProfile class.
   Requires: import static org.apache.flink.table.api.Expressions.$; */
@FunctionHint(output = @DataTypeHint("ROW<a STRING>"))
private DataStream<Row> turnIntoTable(SingleOutputStreamOperator<EventProfile> rowInput) {
    Table events = tableEnv.fromDataStream(rowInput,
            $("id"), $("noOfHits"), $("timestamp"))
            .select("FunctionName(id, noOfHits, timestamp)");
    return tableEnv.toAppendStream(events, Row.class);
}

And finally:

env.execute("Job Name");

An example of the Python function, called FunctionName, inside the function.py script:

from pyflink.table import DataTypes
from pyflink.table.udf import udf

@udf(
    input_types=[
        DataTypes.STRING(), DataTypes.INT(), DataTypes.TIMESTAMP(precision=3)
    ],
    result_type=DataTypes.STRING()
)
def FunctionName(id, noOfHits, timestamp):
    # function code here
    return f"{id}|{noOfHits}|{timestamp}"