
For each row in table 1, I am trying to count the rows in table 2 that meet a condition based on a value from table 1.

The age from table 1 should be between the StartAge and EndAge of table 2, or equal to either StartAge or EndAge.

Will this be possible using a UDF and withColumn? I've tried a couple of ways to do it, such as withColumn alone and withColumn with a UDF, but both ways fail.

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def counter(a):
    return table2.where((table2.StartAge <= a) & (table2.EndAge >= a)).count()

counter_udf = udf(counter, IntegerType())

table1 = table1.withColumn('Count', counter_udf('Age ID'))

Does this make sense? Thanks.

Example input and output:

[image: example input and output]

Please don't post images of code/data (or links to them); instead paste them as text so users can copy the data. – anky

2 Answers


Check this out: you can achieve it using Spark SQL.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .appName('SO') \
        .getOrCreate()

    df = spark.createDataFrame([(3,), (4,), (5,)], ["age"])

    df1 = spark.createDataFrame(
        [(0, 10), (7, 15), (5, 10), (3, 20), (5, 35), (4, 5)],
        ['age_start', 'age_end'])

    df.createTempView("table1")
    df1.createTempView("table2")

    spark.sql("""
        SELECT t1.age AS age_id, COUNT(*) AS count
        FROM table1 t1
        JOIN table2 t2
          ON t1.age >= t2.age_start AND t1.age <= t2.age_end
        GROUP BY t1.age
        ORDER BY count
    """).show()

    # +------+-----+
    # |age_id|count|
    # +------+-----+
    # |     3|    2|
    # |     4|    3|
    # |     5|    5|
    # +------+-----+

If you want to call a UDF from a Spark SQL expression, you must register it with Spark first.

Using this line of code should help fix your errors:

_ = spark.udf.register("counter_udf", counter_udf)
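Note that even a registered UDF cannot query another DataFrame from inside its body, so the function in the question would still fail at runtime. A UDF-based approach can only work if table2's ranges are first collected to the driver (e.g. `[(r.StartAge, r.EndAge) for r in table2.collect()]`) and captured by a plain Python function. A sketch, with the ranges hard-coded from the example data in place of a real `collect()`:

```python
# Stand-in for table2.collect(); in a real job these pairs would be
# collected from the table2 DataFrame first.
ranges = [(0, 10), (7, 15), (5, 10), (3, 20), (5, 35), (4, 5)]

def count_matches(age):
    """Count how many (start, end) ranges contain age, bounds inclusive."""
    return sum(1 for start, end in ranges if start <= age <= end)

# Hypothetical Spark usage, wrapping the pure function as a UDF:
# counter_udf = udf(count_matches, IntegerType())
# table1 = table1.withColumn('Count', counter_udf('Age ID'))
```

Because `count_matches` only closes over a plain Python list, Spark can serialize it and ship it to the workers, which it cannot do for a function that references a DataFrame.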