
I have a Spark DataFrame with several columns, 2 of which are named recorded_ts and ts. recorded_ts is a string representation of a date-time value (in the format 20180101101125) and ts is a true timestamp.

When I print out the data frame contents, there's a one-to-one match between the values in recorded_ts and ts.

I then compute the mean of a column as follows:

from pyspark.sql.functions import mean, window

meanDF = pv \
         .groupBy("p_id", "u_id", "r_id", "b_id", window("ts", "15 minutes", "1 minute")) \
         .agg(mean("value")) \
         .withColumnRenamed("avg(value)", "mean_15")

meanDF.createOrReplaceTempView("my_stats")

upsertDF = spark.sql("select p_id, u_id, r_id, b_id, \
           date_format(window['start'], 'yyyyMMddHHmmss') as recorded_dtm, mean_15 \
           from my_stats")


aggRows = upsertDF.collect()
for aggRow in aggRows:
    logger.info(aggRow)

Unfortunately, the values for recorded_dtm drop the seconds and print as 20180101101100 instead of 20180101101125.

I have tried replacing "15 minutes" with "900 seconds" and "1 minute" with "60 seconds", but that doesn't make any difference.

How do I preserve the seconds from the timestamp column?


1 Answer


This was my misunderstanding of the window function. window['start'] represents the start time of the window, and window starts are aligned to the slide interval (here 1 minute), so they can never carry the seconds from the original timestamp. There's a corresponding window['end'] that represents the end time of the window over which the aggregate is computed.
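
The alignment behaviour can be sketched in plain Python. This is a simplified, epoch-aligned model of how Spark's window() assigns a row to sliding windows, not Spark's actual implementation; sliding_window_starts is a hypothetical helper used only for illustration:

```python
from datetime import datetime, timezone

def sliding_window_starts(ts, window_sec=900, slide_sec=60):
    """Return the start times of every sliding window containing ts.

    Window start times fall on multiples of the slide interval, so the
    seconds of ts (here :25) never appear in any window start.
    """
    epoch = ts.timestamp()
    # Last slide boundary at or before ts.
    start = int(epoch // slide_sec) * slide_sec
    starts = []
    # Walk back through every earlier window that still covers ts.
    while start + window_sec > epoch:
        starts.append(datetime.fromtimestamp(start, tz=timezone.utc))
        start -= slide_sec
    return list(reversed(starts))

ts = datetime(2018, 1, 1, 10, 11, 25, tzinfo=timezone.utc)
latest = sliding_window_starts(ts)[-1]
print(latest.strftime("%Y%m%d%H%M%S"))  # 20180101101100 (seconds floored away)
```

So formatting window['start'] can only ever produce minute-aligned values like 20180101101100; to keep the original seconds, the original ts column itself has to be carried through (e.g. in the groupBy or via an aggregate such as min("ts")) rather than derived from the window struct.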