I have written a Hive UDF that does decryption using an in-house API as follows:
    public Text evaluate(String customer) {
        String result = new String();
        if (customer == null) { return null; }
        try {
            result = com.voltage.data.access.Data.decrypt(customer.toString(), "name");
        } catch (Exception e) {
            return new Text(e.getMessage());
        }
        return new Text(result);
    }
and Data.decrypt does:
    public static String decrypt(String data, String type) throws Exception {
        configure();
        String FORMAT = new String();
        if (type.equals("ccn")) {
            FORMAT = "CC";
        } else if (type.equals("ssn")) {
            FORMAT = "SSN";
        } else if (type.equals("name")) {
            FORMAT = "AlphaNumeric";
        }
        return library.FPEAccess(identity, LibraryContext.getFPE_FORMAT_CUSTOM(),
                String.format("formatName=%s", FORMAT), authMethod, authInfo, data);
    }
where configure() creates a pretty expensive context object.
My question is: does Hive execute this UDF once for every row returned by the query? That is, if I'm selecting 10,000 rows, does the evaluate method run 10,000 times?
My gut instinct says yes. If so, here's a second question:
Is there a way to do one of the following?
a) run configure() once when the query first starts, then share the context object across all evaluate calls
b) have the UDF collect the encrypted strings into some Set instead of returning a decrypted string, and then do a bulk decrypt on that set
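To make option (a) concrete, here is a rough sketch of what I have in mind, not working Hive code. ExpensiveContext and CachedDecryptor are hypothetical stand-ins for whatever configure() builds and for my UDF class, and the decrypt body is only a placeholder. It uses a static holder so the context is built once per JVM; I assume that would mean once per task rather than once per query, since the tasks run in separate JVMs.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the expensive context that configure() creates.
final class ExpensiveContext {
    // Counts how many times the expensive setup actually runs.
    static final AtomicInteger BUILD_COUNT = new AtomicInteger();

    ExpensiveContext() {
        BUILD_COUNT.incrementAndGet();
    }

    // Placeholder for the real decrypt call; it only reverses the input.
    String decrypt(String data) {
        return new StringBuilder(data).reverse().toString();
    }
}

// Hypothetical stand-in for the UDF class; evaluate is called once per value.
final class CachedDecryptor {
    // Initialization-on-demand holder: the JVM initializes Holder exactly
    // once, in a thread-safe way, the first time CONTEXT is accessed.
    private static final class Holder {
        static final ExpensiveContext CONTEXT = new ExpensiveContext();
    }

    static String evaluate(String customer) {
        if (customer == null) {
            return null;
        }
        return Holder.CONTEXT.decrypt(customer);
    }
}

final class LazyContextDemo {
    public static void main(String[] args) {
        // Simulate many rows passing through evaluate.
        for (int i = 0; i < 10_000; i++) {
            CachedDecryptor.evaluate("row" + i);
        }
        System.out.println("contexts built: " + ExpensiveContext.BUILD_COUNT.get());
    }
}
```

The idea is that evaluate could still be called for every row, but the costly setup would no longer be repeated on each call.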
Thanks in advance