
I get an error when writing data to Elasticsearch from Spark. Most documents are written fine, but then I get this kind of exception:

org.elasticsearch.hadoop.rest.EsHadoopRemoteException: date_time_exception: date_time_exception: Invalid value for Year (valid values -999999999 - 999999999): -6220800000

  • The field mapping in Elasticsearch is "date".

  • The field type in PySpark is DateType, not TimestampType, which in my opinion should make it clear that this is a date without a time component. The value shown by Spark is "1969-10-21", a perfectly reasonable date.

(The field was originally a TimestampType, read from another Elasticsearch date field, but I converted it to a DateType hoping to fix this error. I get the exact same error message, with the exact same timestamp value, whether I send Elasticsearch a TimestampType or a DateType.)

My guess is that the timestamp sent to Elasticsearch carries three extra zeros (milliseconds rather than seconds), but I can't find any way to normalize it. Is there an option for the org.elasticsearch.hadoop connector?
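For reference, a quick check in plain Python (assuming, as the guess above suggests, that the value in the error is an epoch offset in milliseconds) shows that -6220800000 ms is exactly the date Spark displays, whereas the same number read as seconds would land in 1772:

import datetime

EPOCH = datetime.datetime(1970, 1, 1)

# -6220800000 ms = -6220800 s = exactly 72 days before the epoch
print(EPOCH + datetime.timedelta(milliseconds=-6220800000))  # 1969-10-21 00:00:00

# the same number read as seconds is nowhere near the intended date
print(EPOCH + datetime.timedelta(seconds=-6220800000))       # 1772-11-14 00:00:00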

(ELK version is 7.5.2, Spark is 2.4.4.)


1 Answer


Obvious workaround: use any type other than TimestampType or DateType.

For example, using this UDF to send a LongType instead (which demonstrates that it is indeed a timestamp-length issue):

import time

from pyspark.sql import functions as F
from pyspark.sql.types import LongType

# convert a Python datetime to an epoch timestamp in seconds (time.mktime uses local time)
def conv_ts(d):
    return time.mktime(d.timetuple())

ts_udf = F.udf(lambda z: int(conv_ts(z)), LongType())

(Note that in this snippet the Spark input is a TimestampType, not a DateType, so the UDF receives a Python datetime rather than a date, because I also tried messing around with time conversions.)
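For completeness, this is roughly how the UDF is applied before writing; df, the column my_date, the index name and the node address are only placeholders, and the es.* settings are the usual elasticsearch-hadoop write options:

# replace the problematic timestamp column by its Long (seconds) equivalent
df_out = df.withColumn("my_date", ts_udf(F.col("my_date")))

# then write through the connector as usual
(df_out.write
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "localhost")
    .option("es.port", "9200")
    .option("es.resource", "my_index")
    .mode("append")
    .save())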

Or (a much more efficient way) avoid the UDF entirely by sending a StringType field containing a formatted date instead of the long timestamp, thanks to the pyspark.sql.functions.date_format function.
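For instance (again a sketch, with df and my_date as placeholders):

from pyspark.sql import functions as F

# send the date as a plain "yyyy-MM-dd" string; the default Elasticsearch "date"
# mapping (strict_date_optional_time||epoch_millis) parses this form unambiguously
df_out = df.withColumn("my_date", F.date_format(F.col("my_date"), "yyyy-MM-dd"))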

This is a solution, but not a really satisfying one; I would rather understand why the connector doesn't handle TimestampType and DateType properly by adjusting the timestamp length accordingly.