df = spark.read.parquet('xxx')
tmstmp = df['timestamp']

spark.conf.set("spark.sql.session.timeZone", "Singapore")

time_df = spark.createDataFrame([('tmstmp',)], ['unix_time'])
time_df.select(from_unixtime('unix_time').alias('ts')).collect()

df['timestamp'] = time_df

spark.conf.unset("spark.sql.session.timeZone")

There is an error at this line:

time_df.select(from_unixtime('unix_time').alias('ts')).collect()

with this exception message:

Exception: Python in worker has different version 2.7 than that in driver 3.7, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.


1 Answer


The exception you are receiving is clear in itself: you have a cluster with two or more nodes, and the server/node from which you submitted this command (the driver/master) has a different version of Python than your worker nodes.

Possible solutions:

  • Upgrade the Python version on your worker nodes, or set the PYSPARK_PYTHON environment variable on every worker to point to a Python 3.7 installation.
  • Change your driver's Python version to match the version on the worker nodes.
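For example, one way to align the two sides is to point both variables at the same interpreter from the driver script itself, before any SparkSession or SparkContext is created. This is only a sketch: the path `/usr/bin/python3.7` is an assumption and must be replaced with wherever the matching Python actually lives on your nodes.

```python
import os

# Hypothetical path -- adjust to the Python 3.7 interpreter installed
# on BOTH the driver and every worker node.
PYTHON_37 = "/usr/bin/python3.7"

# Driver and workers must run the same minor version of Python.
# These variables are read when the SparkContext starts, so they
# must be set before the SparkSession / SparkContext is created.
os.environ["PYSPARK_PYTHON"] = PYTHON_37
os.environ["PYSPARK_DRIVER_PYTHON"] = PYTHON_37
```

The same effect can be achieved by exporting the two variables in the shell (or in `spark-env.sh`) before running `spark-submit`.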