2 votes

I am trying to classify the nodes of a BST using Spark SQL. This would be easy with a SELECT statement nested inside the main SELECT's column list, but Spark SQL does not support a subquery as a column.

Graphical Representation of BST is as below :

Binary Search Tree

Input is the row-column representation of the BST. The goal is to produce the output table from the input table, labelling each node as one of:

  • root : has no parent node

  • inner : has both a parent node and at least one child node

  • leaf : has a parent node but no child node

Data

In standard SQL this could be achieved easily with a SELECT within a SELECT; the statement below would do it:

SELECT t.node,
    CASE
        WHEN t.parent IS NULL THEN 'root'
        WHEN EXISTS (SELECT t1.parent FROM bst t1 WHERE t1.parent = t.node) THEN 'inner'        
        ELSE 'leaf'
    END
FROM bst t
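For reference, this correlated-subquery form is standard SQL and runs as-is on engines that support it. A quick check on the question's sample data using Python's built-in sqlite3 (the alias `kind` is used here instead of `desc`, since `desc` is a reserved word in SQLite):

```python
import sqlite3

# Sample (node, parent) rows from the question; parent NULL marks the root.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bst (node INTEGER, parent INTEGER)")
conn.executemany("INSERT INTO bst VALUES (?, ?)",
                 [(1, 2), (3, 2), (6, 8), (9, 8), (2, 5), (8, 5), (5, None)])

rows = conn.execute("""
    SELECT t.node,
           CASE
               WHEN t.parent IS NULL THEN 'root'
               WHEN EXISTS (SELECT 1 FROM bst t1 WHERE t1.parent = t.node) THEN 'inner'
               ELSE 'leaf'
           END AS kind
    FROM bst t
""").fetchall()
print(dict(rows))
```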

Since Spark SQL does not support this, I had to work around it by unioning three separate queries:

spark.sql("""
SELECT node, 'inner' AS desc
FROM bst t
WHERE EXISTS (SELECT 1
              FROM bst t1
              WHERE t1.parent = t.node)
  AND parent IS NOT NULL

UNION ALL

SELECT node, 'leaf' AS desc
FROM bst t
WHERE NOT EXISTS (SELECT 1
                  FROM bst t1
                  WHERE t1.parent = t.node)
  AND parent IS NOT NULL

UNION ALL

SELECT node, 'root' AS desc
FROM bst t
WHERE parent IS NULL
""").show()

Commands to quickly create the dummy data:

data = [(1, 2),
        (3, 2),
        (6, 8),
        (9, 8),
        (2, 5),
        (8, 5),
        (5, None)]
df = spark.createDataFrame(data, ["node", "parent"])
df.createOrReplaceTempView("bst")

I am using Spark version 2.1. Is there a better-optimized way to do this?


2 Answers

1 vote

You need to do a self join, but you can do it in a single statement (instead of several SELECTs combined with UNION ALL).

Here's an attempt:

select distinct 
  case 
    when n.parent is null then 'root' 
    when child.node is null then 'leaf'
    else 'inner' end
  ,n.node 
from bst n 
left outer join bst child
on n.node = child.parent
;
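This statement is standard SQL, so it can be sanity-checked on the question's sample data outside Spark, for example with Python's built-in sqlite3:

```python
import sqlite3

# Sample (node, parent) rows from the question; parent NULL marks the root.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bst (node INTEGER, parent INTEGER)")
conn.executemany("INSERT INTO bst VALUES (?, ?)",
                 [(1, 2), (3, 2), (6, 8), (9, 8), (2, 5), (8, 5), (5, None)])

# The self join duplicates nodes with multiple children, hence the DISTINCT.
rows = conn.execute("""
    SELECT DISTINCT
        CASE
            WHEN n.parent IS NULL THEN 'root'
            WHEN child.node IS NULL THEN 'leaf'
            ELSE 'inner'
        END AS kind,
        n.node
    FROM bst n
    LEFT OUTER JOIN bst child ON n.node = child.parent
""").fetchall()
print({node: kind for kind, node in rows})
```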
0 votes

Here is one way using the DataFrame API:

from pyspark.sql import functions

df.withColumn(
    "TYPE",
    functions.when(functions.col("parent").isNull(), "root")
             .when(functions.col("node").isin(
                 list(df.select(functions.col("parent"))
                        .distinct().rdd.map(lambda x: x[0]).collect())), "inner")
             .otherwise("leaf")) \
    .drop("parent") \
    .show()

+----+-----+
|node| TYPE|
+----+-----+
|   1| leaf|
|   3| leaf|
|   6| leaf|
|   9| leaf|
|   2|inner|
|   8|inner|
|   5| root|
+----+-----+
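On the optimization question: the `collect()` + `isin` step boils down to a membership test against the set of distinct parents, a set computed once up front. The logic can be checked on the sample data in plain Python:

```python
# Sample (node, parent) rows from the question; parent None marks the root.
rows = [(1, 2), (3, 2), (6, 8), (9, 8), (2, 5), (8, 5), (5, None)]

# isin(collected distinct parents) is just membership in this set.
parents = {p for _, p in rows if p is not None}

labels = {n: ("root" if p is None else "inner" if n in parents else "leaf")
          for n, p in rows}
print(labels)
```

In Spark itself, the same membership test could be expressed as a join against the distinct parents (essentially the self join from the first answer), which avoids shipping the parent list through the driver.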

If possible, could you help me optimize this?