I am trying to write a Bloom filter builder in Pig using the builtin BuildBloom and Bloom UDFs. The syntax for calling the BuildBloom UDF is:
define bb BuildBloom('hash_type', 'vector_size', 'false_positive_rate');
where the vector size and false positive rate arguments are passed in as chararrays. I don't necessarily know the vector size beforehand, but it can always be computed within the script before the call to the BuildBloom UDF, so I want to use the builtin COUNT UDF instead of a hard-coded value. Something like:
records = LOAD '$input' using PigStorage();
records = FOREACH records GENERATE
    (long) $0 AS value_fld:long,
    (chararray) $1 AS filter_fld:chararray;
records_fltr = FILTER records by (filter_fld=='$filter_value') AND (value_fld is not null);
records_grp = GROUP records_fltr all;
records_count = FOREACH records_grp GENERATE (chararray) COUNT(records_fltr.value_fld) AS count:chararray;
n = FOREACH records_count GENERATE flatten(count);
define bb BuildBloom('jenkins', n, '$false_positive_rate');
The problem is that when I describe n I get: n: {count: chararray}. Predictably, the BuildBloom UDF call fails because it is given a relation where it expects a plain chararray. How can I pull out just the chararray (i.e. the integer returned by COUNT, cast to a chararray) and assign it to n for use in the call to BuildBloom(...)?
EDIT: Here is the resulting error when I attempted to pass N::count into the BuildBloom(...) UDF. describe N yields: N {count: chararray}. The offending line (line 40) reads: define bb BuildBloom('jenkins', N::count, '$fpr');
ERROR 1200: <file buildBloomFilter.pig, line 40, column 32> mismatched input 'N::count' expecting set null
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. <file buildBloomFilter.pig, line 40, column 32> mismatched input 'N::count' expecting set null
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1607)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1546)
at org.apache.pig.PigServer.registerQuery(PigServer.java:516)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:991)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:412)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
at org.apache.pig.Main.run(Main.java:604)
at org.apache.pig.Main.main(Main.java:157)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
Caused by: Failed to parse: <file buildBloomFilter.pig, line 40, column 32> mismatched input 'N::count' expecting set null
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:235)
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:177)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1599)
... 14 more
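For context on the parse error above: as far as I understand, DEFINE binds a UDF's constructor arguments when the script is parsed, so they have to be quoted string literals and cannot refer to a relation or field such as N::count, which only has a value at run time. Below is a minimal sketch of one possible workaround, assuming the count is computed in a separate step beforehand and handed to the script through Pig's parameter substitution. The parameter names n and output and the script name build_bloom.pig are hypothetical and are not part of the script above.

```pig
-- build_bloom.pig (hypothetical name). Assumed invocation:
--   pig -param input=<path> -param filter_value=<value> \
--       -param n=<count> -param false_positive_rate=<rate> \
--       -param output=<path> build_bloom.pig

-- '$n' is replaced by the preprocessor before parsing, so DEFINE
-- only ever sees a quoted string literal.
DEFINE bb BuildBloom('jenkins', '$n', '$false_positive_rate');

records = LOAD '$input' USING PigStorage()
    AS (value_fld:long, filter_fld:chararray);

records_fltr = FILTER records
    BY (filter_fld == '$filter_value') AND (value_fld IS NOT NULL);

-- BuildBloom is applied to the bag produced by grouping everything together.
records_grp = GROUP records_fltr ALL;
bloom = FOREACH records_grp GENERATE bb(records_fltr.value_fld);

STORE bloom INTO '$output';
```

The value for n would have to come from a first pass, for example the COUNT pipeline above stored to a file and read back by whatever shell script launches this one. That costs an extra job, but it keeps the DEFINE arguments as plain strings.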