I am trying to build BST using SparkSQL, This could be done easily using another SELECT statement within main SELECT statement, but SparkSQL does not support SELECT within SELECT as column.
Graphical Representation of BST is as below :
Input is the row-column representation of BST.Goal is to produce output table using input table data.
root : having no parent node
inner : having parent node as well as child node
- leaf : having only parent node and no child node
This could be achieved easily with select within select, below SQL should do this:
SELECT t.node,
CASE
WHEN t.parent IS NULL THEN 'root'
WHEN EXISTS (SELECT t1.parent FROM bst t1 WHERE t1.parent = t.node) THEN 'inner'
ELSE 'leaf'
END
FROM bst t
As SparkSQL does not have above mentioned feature, I had to do this with work around.
spark.sql("""SELECT node, 'inner' AS desc
FROM bst t
WHERE EXISTS (SELECT 1
FROM bst t1
WHERE t1.parent=t.node)
AND parent IS NOT NULL
UNION ALL
SELECT node,'leaf' AS desc
FROM bst t
WHERE NOT EXISTS (SELECT 1
FROM bst t1
WHERE t1.parent=t.node)
AND parent IS NOT NULL
UNION ALL
SELECT node,'root' AS desc
FROM bst t
WHERE parent IS NULL""").show()
Commands to quickly create dummy data
data = \
[(1, 2),
(3, 2),
(6, 8),
(9, 8),
(2, 5),
(8, 5),
(5, None)]
df = spark.createDataFrame(data, ["node", "parent"])
df.createOrReplaceTempView ("bst")
I am using Spark version 2.1, any other better optimized way to do this?

