1
vote

I have a table as below:

user_id email
u1      e1, e2
u2      null

My goal is to convert this into the following format:

user_id email
u1      e1
u1      e2
u2      null

Hive SQL: select * FROM table LATERAL VIEW explode(split(email, ',')) email AS email_id

When the above query is executed in Hive I get the nulls, but when the same query is run in spark-sql I do not. This question and scenario have already been discussed here.

Spark sql:

  1. select * FROM table LATERAL VIEW OUTER explode(split(email, ',')) email AS email_id;
  2. select * from table lateral view POSEXPLODE_OUTER(split(email, ',')) email as email_id;

The second fails with a syntax error. I searched for LATERAL VIEW with posexplode_outer but could not find much; I want to get the nulls in spark-sql.

3
What's the problem with the first query in Spark SQL? The second function, posexplode_outer, isn't valid. – Vamsi Prabhala
@VamsiPrabhala: Even after running the first query I am not able to see NULLs in the result. "SELECT POSEXPLODE_OUTER(array(10,20)) AS (POS,SO);" does run in spark-sql. – Vinod
@sramalingam24: Yes, I had checked this earlier, but it was with respect to a dataframe; I need to implement it in a spark-sql query. – Vinod
Everything you write in Spark SQL is available as a dataframe function, so going from one to the other is just semantics. – sramalingam24

3 Answers

3
votes

Spark SQL doesn't use HiveQL. It is partially compatible with it, but shouldn't be mistaken for it. Instead of using LATERAL VIEW you can just call explode_outer in the SELECT list:

Seq(
  ("u1", "e1, e2"), ("u2", null)
).toDF("user_id", "email").createOrReplaceTempView("table")

spark.sql(
  "SELECT user_id, explode_outer(split(email, ',')) FROM table"
).show
// +-------+----+                               
// |user_id| col|
// +-------+----+
// |     u1|  e1|
// |     u1|  e2|
// |     u2|null|
// +-------+----+
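As the comment above notes, everything in Spark SQL has a dataframe-function counterpart, and the difference between explode and explode_outer is easy to see without Spark at all. The following is a minimal plain-Scala sketch of the two semantics; the function names and row shapes are illustrative only, not Spark's implementation:

```scala
// Sketch of explode vs explode_outer semantics (illustrative, no Spark).
// explode drops a row whose array is null; explode_outer keeps it with a null value.
def explode(rows: Seq[(String, Array[String])]): Seq[(String, String)] =
  rows.flatMap {
    case (_, null)   => Seq.empty                  // row disappears
    case (user, arr) => arr.map(e => (user, e))
  }

def explodeOuter(rows: Seq[(String, Array[String])]): Seq[(String, String)] =
  rows.flatMap {
    case (user, null) => Seq((user, null: String)) // row kept with a null email
    case (user, arr)  => arr.map(e => (user, e))
  }

val rows = Seq(("u1", "e1,e2".split(",")), ("u2", null: Array[String]))
explode(rows)      // u2 is lost
explodeOuter(rows) // u2 survives with a null email
```

This is exactly why the query above must use explode_outer rather than explode to keep the u2 row.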
1
vote

Adding coalesce after split seems to work:

with tmp_table as ( 
  select 'u1' as user, 'e1,e2' as email 
  union all 
  select 'u2' as user, NULL as email
)
select * FROM tmp_table 
LATERAL VIEW explode(coalesce(split(email, ','), array(NULL))) email AS email_id;

Output:

u1  e1,e2   e1
u1  e1,e2   e2
u2  NULL    NULL
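The trick works because split returns a null array for a null email, plain explode drops rows whose array is null, and coalesce(..., array(NULL)) swaps that null array for a one-element array containing null before explode sees it. A minimal plain-Scala model of this rewrite (function names are illustrative, not Spark's implementation):

```scala
// Model of the coalesce(split(email, ','), array(NULL)) rewrite (illustrative, no Spark).
def splitEmail(email: String): Array[String] =
  if (email == null) null else email.split(",")

def coalesce(arr: Array[String]): Array[String] =
  if (arr == null) Array[String](null) else arr   // the array(NULL) fallback

def explode(rows: Seq[(String, Array[String])]): Seq[(String, String)] =
  rows.flatMap {
    case (_, null)   => Seq.empty                  // plain explode drops null arrays
    case (user, arr) => arr.map(e => (user, e))
  }

val rows = Seq(("u1", "e1,e2"), ("u2", null: String))
explode(rows.map { case (u, e) => (u, splitEmail(e)) })           // u2 dropped
explode(rows.map { case (u, e) => (u, coalesce(splitEmail(e))) }) // u2 kept with null
```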
0
votes

LATERAL VIEW OUTER was added in Spark 2.2.0

e.g.

scala> spark.sql(
     |   "select * FROM table LATERAL VIEW OUTER explode ( split ( email ,',' ) ) email AS email_id"
     | ).show
+-------+------+--------+
|user_id| email|email_id|
+-------+------+--------+
|     u1|e1, e2|      e1|
|     u1|e1, e2|      e2|
|     u2|  null|    null|
+-------+------+--------+