
Can we write a Hive query inside a Spark UDF? For example, I have two tables, Table A and Table B,

where b1 contains the column names of A and b2 contains the values those columns take in A. Now I want to query the tables in such a way that I get the result described below.
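
For concreteness, this is roughly the setup I have in mind (table and column names are as above; the sample rows are purely illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("convert-example").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Table A: a1 is a key, a2 and a3 hold values that should be translated via B
val tableA = Seq((1, "x", "p"), (2, "y", "q")).toDF("a1", "a2", "a3")

// Table B: b1 = a column name of A, b2 = a value of that column, b3 = the replacement
val tableB = Seq(("a2", "x", "X1"), ("a2", "y", "Y1"), ("a3", "p", "P1"), ("a3", "q", "Q1"))
  .toDF("b1", "b2", "b3")

tableA.registerTempTable("A")
tableB.registerTempTable("B")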

Basically, replace the values of the columns in A with the matching b3 from B, based on the column name and the column's value. To achieve that I wrote a Spark UDF, e.g. convert, as below:

def convert(colname: String, colvalue: String) = {
  // looks up b3 in B for the given column name and value (issues a query per row)
  sqlContext.sql(s"SELECT b3 FROM B WHERE b1 = '$colname' AND b2 = '$colvalue'").first().getString(0)
}

I registered it as:

sqlContext.udf.register("conv", convert(_: String, _: String))

Now my main query is:

val result = sqlContext.sql("SELECT a1, conv('a2', a2), conv('a3', a3) FROM A")

result.take(2);

It gives me java.lang.NullPointerException.

Can someone please tell me whether this is supported in Spark/Hive? Any other approach is also welcome. Thanks!

you cannot use sqlContext inside a UDF - Raphael Roth
Thanks for the input. - MGM

2 Answers

1 vote

Here is the solution to your question. You can do it in Hive itself.

WITH a_plus_col
AS (SELECT a1
        ,'a2' AS col_name
        ,a2 AS col_value
    FROM A
    UNION ALL
    SELECT a1
        ,'a3' AS col_name
        ,a3 AS col_value
    FROM A)
SELECT a_plus_col.a1 AS r1
    ,MAX(CASE WHEN a_plus_col.col_name = 'a2' THEN B.b3 END) AS r2
    ,MAX(CASE WHEN a_plus_col.col_name = 'a3' THEN B.b3 END) AS r3
FROM a_plus_col
INNER JOIN B ON ( a_plus_col.col_name = b1 AND a_plus_col.col_value = b2)
GROUP BY a_plus_col.a1;
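
If you want to run this from Spark rather than the Hive CLI, a minimal sketch (assuming A and B are visible to your sqlContext, e.g. as Hive tables or registered temp tables, and that your Spark/Hive version supports WITH/CTE syntax):

// Run the same pivot-style query from above through Spark SQL.
val rewritten = sqlContext.sql("""
  WITH a_plus_col AS (
    SELECT a1, 'a2' AS col_name, a2 AS col_value FROM A
    UNION ALL
    SELECT a1, 'a3' AS col_name, a3 AS col_value FROM A)
  SELECT a_plus_col.a1 AS r1,
         MAX(CASE WHEN a_plus_col.col_name = 'a2' THEN B.b3 END) AS r2,
         MAX(CASE WHEN a_plus_col.col_name = 'a3' THEN B.b3 END) AS r3
  FROM a_plus_col
  INNER JOIN B ON (a_plus_col.col_name = b1 AND a_plus_col.col_value = b2)
  GROUP BY a_plus_col.a1
""")
rewritten.show()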

1 vote

No, a UDF doesn't permit running a query inside it. You can only pass data in as variables and apply transformations to get the final result back at row/column/table level.
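
One way to work within that constraint (a sketch, not a drop-in implementation): if B is small, collect it to the driver as a Map first, and let the UDF close over that plain Map instead of over sqlContext:

// Pull the lookup table B to the driver (fine only if B is small).
val lookup: Map[(String, String), String] = sqlContext.sql("SELECT b1, b2, b3 FROM B")
  .collect()
  .map(r => (r.getString(0), r.getString(1)) -> r.getString(2))
  .toMap

// The UDF now works only on plain values; no query is issued per row.
sqlContext.udf.register("conv", (colName: String, colValue: String) =>
  lookup.getOrElse((colName, colValue), colValue))

val result = sqlContext.sql("SELECT a1, conv('a2', a2) AS r2, conv('a3', a3) AS r3 FROM A")
result.take(2)

If B is large, the join approach in the other answer (or a broadcast variable) is the better fit.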