Spark SQL add column/update-accumulate value

Question

I have the following DataFrame:

name,email,phone,country
------------------------------------------------
[Mike,[email protected],+91-9999999999,Italy]
[Alex,[email protected],+91-9999999998,France]
[John,[email protected],+1-1111111111,United States]
[Donald,[email protected],+1-2222222222,United States]
[Dan,[email protected],+91-9999444999,Poland]
[Scott,[email protected],+91-9111999998,Spain]
[Rob,[email protected],+91-9114444998,Italy]

exposed as temp table tagged_users:

resultDf.createOrReplaceTempView("tagged_users")

I need to add an additional column, tag, to this DataFrame and assign calculated tags according to different SQL conditions, which are described in the following map (key: tag name, value: condition for the WHERE clause):

val tags = Map(
  "big" -> "country IN (SELECT * FROM big_countries)",
  "medium" -> "country IN (SELECT * FROM medium_countries)",
  //2000 other different tags and conditions
  "sometag" -> "name = 'Donald' AND email = '[email protected]' AND phone = '+1-2222222222'"
  )

I have the following DataFrames (as data dictionaries), registered as temp views so that I can reference them in SQL queries:

Seq("Italy", "France", "United States", "Spain").toDF("country").createOrReplaceTempView("big_countries")
Seq("Poland", "Hungary", "Spain").toDF("country").createOrReplaceTempView("medium_countries")

I want to test each row in my tagged_users table and assign it the appropriate tags. I tried the following logic to achieve this:

tags.foreach {
  case (tag, tagCondition) => {
    resultDf = spark.sql(buildTagQuery(tag, tagCondition, "tagged_users"))
       .withColumn("tag", lit(tag).cast(StringType))
  }
}

def buildTagQuery(tag: String, tagCondition: String, table: String): String = {
    // s-interpolator: the f-interpolator would treat any '%' in a condition as a format specifier
    s"SELECT * FROM $table WHERE $tagCondition"
}

but I don't know how to accumulate the tags instead of overwriting them. As a result, I currently get the following DataFrame:

name,email,phone,country,tag
Dan,[email protected],+91-9999444999,Poland,medium
Scott,[email protected],+91-9111999998,Spain,medium

but I need something like:

name,email,phone,country,tag
Mike,[email protected],+91-9999999999,Italy,big
Alex,[email protected],+91-9999999998,France,big
John,[email protected],+1-1111111111,United States,big
Donald,[email protected],+1-2222222222,United States,(big|sometag)
Dan,[email protected],+91-9999444999,Poland,medium
Scott,[email protected],+91-9111999998,Spain,(big|medium)
Rob,[email protected],+91-9114444998,Italy,big

Please note that Donald should have 2 tags (big|sometag) and Scott should have 2 tags (big|medium).

Please show how to implement it.

UPDATED

val spark = SparkSession
  .builder()
  .appName("Java Spark SQL basic example")
  .config("spark.master", "local")
  .getOrCreate();

import spark.implicits._
import spark.sql

Seq("Italy", "France", "United States", "Spain").toDF("country").createOrReplaceTempView("big_countries")
Seq("Poland", "Hungary", "Spain").toDF("country").createOrReplaceTempView("medium_countries")

val df = Seq(
  ("Mike", "[email protected]", "+91-9999999999", "Italy"),
  ("Alex", "[email protected]", "+91-9999999998", "France"),
  ("John", "[email protected]", "+1-1111111111", "United States"),
  ("Donald", "[email protected]", "+1-2222222222", "United States"),
  ("Dan", "[email protected]", "+91-9999444999", "Poland"),
  ("Scott", "[email protected]", "+91-9111999998", "Spain"),
  ("Rob", "[email protected]", "+91-9114444998", "Italy")).toDF("name", "email", "phone", "country")

df.collect.foreach(println)

df.createOrReplaceTempView("tagged_users")

val tags = Map(
  "big" -> "country IN (SELECT * FROM big_countries)",
  "medium" -> "country IN (SELECT * FROM medium_countries)",
  "sometag" -> "name = 'Donald' AND email = '[email protected]' AND phone = '+1-2222222222'")

val sep_tag = tags.map((x) => { s"when array_contains(" + x._1 + ", country) then '" + x._1 + "' " }).mkString

val combine_sel_tag1 = tags.map((x) => { s" array_contains(" + x._1 + ",country) " }).mkString(" and ")

val combine_sel_tag2 = tags.map((x) => x._1).mkString(" '(", "|", ")' ")

val combine_sel_all = " case when " + combine_sel_tag1 + " then " + combine_sel_tag2 + sep_tag + " end as tags "

val crosqry = tags.map((x) => { s" cross join ( select collect_list(country) as " + x._1 + " from " + x._1 + "_countries) " + x._1 + "  " }).mkString

val qry = " select name,email,phone,country, " + combine_sel_all + " from tagged_users " + crosqry

spark.sql(qry).show

spark.stop()

This fails with the following exception:

Caused by: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'sometag_countries' not found in database 'default';
    at org.apache.spark.sql.catalyst.catalog.ExternalCatalog$class.requireTableExists(ExternalCatalog.scala:48)
    at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.requireTableExists(InMemoryCatalog.scala:45)
    at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.getTable(InMemoryCatalog.scala:326)
    at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getTable(ExternalCatalogWithListener.scala:138)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupRelation(SessionCatalog.scala:701)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:730)
    ... 74 more
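
The exception appears to follow from how crosqry is built: it appends "_countries" to every key in tags, so the sometag entry produces a reference to a view that was never registered. A minimal sketch of just that string building (plain Scala, no Spark needed; the sometag condition is shortened here and the names derivedViews, registeredViews and missingViews are only for illustration):

```scala
// Same keys as the tags map above; only the keys matter for this check.
val tags = Map(
  "big" -> "country IN (SELECT * FROM big_countries)",
  "medium" -> "country IN (SELECT * FROM medium_countries)",
  "sometag" -> "name = 'Donald'")

// crosqry derives a view name from every tag by appending "_countries".
val derivedViews = tags.keys.map(tag => s"${tag}_countries").toSet

// Only these two views were registered with createOrReplaceTempView.
val registeredViews = Set("big_countries", "medium_countries")

// Views the generated query references but that do not exist.
val missingViews = derivedViews -- registeredViews
println(missingViews) // Set(sometag_countries)
```

So the generated query only works for tags whose condition happens to be backed by a view named after the tag, which is not the case for arbitrary conditions such as sometag.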

Why not have a countries table with two columns, country name and description? Description could be a single value from (small, medium, big, big|medium, small|medium). Then you just need to join the two tables on country name. — Terry Dactyl
Because this is just one particular case. By design, users can configure through the UI as many tags as they need, each with its own name and condition. — alexanoid
The same is true for collections like big_countries and medium_countries. Users can configure through the UI as many collections as they need, with different names and elements, and reference them in their SQL queries. — alexanoid

Louis Thompson · Accepted Answer · 2018-11-09T14:39:33

If you need to aggregate the results rather than just execute each query, you could use map instead of foreach and then union the results:

import org.apache.spark.sql.functions.lit

val o = tags.map {
  case (tag, tagCondition) =>
    // Rows that satisfy this tag's condition, labelled with the tag name
    spark.sql(buildTagQuery(tag, tagCondition, "tagged_users"))
      .withColumn("tag", lit(tag))
}

// Union all per-tag results; reduce avoids adding the first DataFrame twice
val tagged = o.reduce(_ union _)
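
The union alone yields one row per (user, tag) pair, so a user matching two conditions appears twice. One possible way to get a single row per user is to group the unioned rows and join the collected tags with "|". The following is only a sketch under some assumptions: it runs on a local SparkSession, uses a reduced sample of users with placeholder example.com emails (the real addresses are not shown in the question), drops the email check from sometag, and emits big|sometag without the surrounding parentheses.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, concat_ws, lit, sort_array}

val spark = SparkSession.builder()
  .appName("tag-accumulation-sketch") // hypothetical app name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

Seq("Italy", "France", "United States", "Spain").toDF("country").createOrReplaceTempView("big_countries")
Seq("Poland", "Hungary", "Spain").toDF("country").createOrReplaceTempView("medium_countries")

// Reduced sample; the email values are placeholders, not the original data.
Seq(
  ("Donald", "donald@example.com", "+1-2222222222", "United States"),
  ("Dan", "dan@example.com", "+91-9999444999", "Poland"),
  ("Scott", "scott@example.com", "+91-9111999998", "Spain")
).toDF("name", "email", "phone", "country").createOrReplaceTempView("tagged_users")

val tags = Map(
  "big" -> "country IN (SELECT * FROM big_countries)",
  "medium" -> "country IN (SELECT * FROM medium_countries)",
  "sometag" -> "name = 'Donald' AND phone = '+1-2222222222'")

// One DataFrame per tag: the rows matching the condition, labelled with the tag.
val perTag = tags.toSeq.map { case (tag, condition) =>
  spark.sql(s"SELECT * FROM tagged_users WHERE $condition").withColumn("tag", lit(tag))
}

// Union everything, then collapse to one row per user with sorted, "|"-joined tags.
val combined = perTag.reduce(_ union _)
  .groupBy("name", "email", "phone", "country")
  .agg(concat_ws("|", sort_array(collect_list(col("tag")))).as("tag"))

// name -> tag, collected to the driver so it can be inspected after the session stops.
val result = combined.collect().map(r => r.getString(0) -> r.getString(4)).toMap
result.toSeq.sorted.foreach(println)

spark.stop()
```

With this sample, Donald should end up with big|sometag, Scott with big|medium and Dan with medium. Users matching no condition are dropped by this approach; keeping them would need a left join back to tagged_users. The sort_array call is there because collect_list does not guarantee an order.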
