I download logs with Python every day (via a cronjob, using IPython). The download server is CentOS 7 running a dockerized Ubuntu container.

(base) root@:/# uname -a
Linux f5210d345285 3.10.0-957.21.2.el7.x86_64 #1 SMP Wed Jun 5 14:26:44 UTC 2019 x86_64 GNU/Linux

The server's date/time is set to UTC.

cron process:
1. Download from Athena.
2. Make a pandas dataframe.
3. Get aggregation data from the MySQL server (on-premise).
4. Join & aggregate.
5. Make the final dataframe.
6. Upload to S3.

import os
import boto3

final_df.to_parquet(temp_filename, compression=None, engine='fastparquet')
with open(temp_filename, 'rb') as fh:
    data_bytes = fh.read()
os.remove(temp_filename)  # was os.remove(filename): wrong variable name
boto3.resource('s3').Object(bucketname, s3pathname).put(Body=data_bytes)
  1. Run the Glue crawler on it.
  2. Query the data from Athena and use it (e.g. with Tableau).
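As an aside, the same upload can stream the file from disk instead of buffering all of it in memory, using boto3's upload_file (same bucketname/s3pathname variables as above; in this variant, remove the temp file after the upload):

import boto3

# Streams the file in chunks via the S3 transfer manager instead of one big put()
boto3.resource('s3').Object(bucketname, s3pathname).upload_file(temp_filename)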

problem
1. The problem is the datetime column.
2. The datetime column is named 'reg_date', and it comes from a MariaDB column of type 'datetime'.
3. When I display the dataframe and operate on it, 'reg_date' works fine.
4. If I run the next code, it works just right too.

final_df.to_parquet(temp_filename, compression=None, engine='fastparquet')
read_df = pd.read_parquet(temp_filename)
display(read_df)
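A quick way to confirm the round trip really preserves the type, not just the display (pandas keeps datetimes as datetime64[ns], which is why everything looks fine from inside pandas):

print(read_df['reg_date'].dtype)      # expected: datetime64[ns]
print(type(read_df['reg_date'][0]))   # pandas Timestamp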
  1. And when I check the parquet file in the S3 console (S3 -> bucket -> path -> file -> Select from -> preview), it shows no problem:
    [
    {
        "id": "1251616",
        "press": "8",
        "reg_date": "2019-05-22T14:06:25.000Z", #this line
        "scenario_id": 5072,
        "scheduletype": "1",
        "url": "some url string",
        "user_id": "some id string",
        "writer": "some writer name string",
        "deleted": "0",
        "display": "1",
        "keyword": "some keyword string",
        "modifier": "some modifier string",
        "scenario_reg_date": "2019-05-15 15:04:24",
        "sentence": "some long string..",
        "standby_transmission": "1",
        "subject": "some long string..",
        "scenario_user_id": "some user id string",
        "press_name": "some string",
        "press_url": "some url string",
        "press_host": "some url string",
        "gaid": "some string",
        "article_number": 233235,
        "article_uid": "some string",
        "ga:adsenseadsclicks": 0,
        "ga:adsenseadsviewed": 11,
        "ga:adsensectr": 0,
        "ga:adsenseecpm": 0,
        "ga:adsenserevenue": 0,
        "ga:adsenseviewableimpressionpercent": 0,
        "ga:contentgroup1": "some string",
        "ga:contentgroup2": "some string",
        "ga:date": 20190704,
        "ga:hostname": "some string",
        "ga:hour": 12,
        "ga:keyword": "some string",
        "ga:pagepath": "some string",
        "ga:pagetitle": "some string",
        "ga:pageviews": 1,
        "ga:sessions": 1,
        "ga:source": "some string",
        "host": "some string",
        "adsenseattached": 1,
        "eventtime": "2019-07-04T12:00:00.000Z"
    },
    {
        "id": "1251616",
        "press": "8",
        "reg_date": "2019-05-22T14:06:25.000Z",  #and .. this line also
        "scenario_id": 5072,
        "scheduletype": "1",
    ....
    ....
  1. So I think the reg_date column is perfectly normal.
  2. But when I run the Glue crawler and preview it with Athena, it turns weird.
-- in Athena
SELECT "id","reg_date","scenario_reg_date" FROM "catalog-name"."table_name" limit 10

The result is:

idx|id     |reg_date                 |scenario_reg_date
1  |1251616|+51357-12-22 18:56:40.000|2019-05-15 15:04:24
2  |1361993|+51390-05-01 13:36:40.000|2019-05-15 15:04:24
3  |1461362|+51417-09-19 00:53:20.000|2019-05-15 15:04:24
4  |1461362|+51417-09-19 00:53:20.000|2019-05-15 15:04:24
5  |1461362|+51417-09-19 00:53:20.000|2019-05-15 15:04:24
  1. The 'reg_date' column turns totally weird!
  2. The Glue crawler is set up with all defaults; I only configured the IAM role, the Data Catalog name, and the S3 source.
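One thing I noticed about the magnitude (a rough back-of-the-envelope check, exact calendar math aside): the broken values sit about a factor of 1000 further from the epoch than the real ones, which happens to be the ratio between nanoseconds and microseconds:

from datetime import datetime, timezone

real = datetime(2019, 5, 22, 14, 6, 25, tzinfo=timezone.utc)
epoch_seconds = real.timestamp()               # ~1.5585e9 seconds

# If an int64 holding nanoseconds were decoded as microseconds,
# every instant would land 1000x further from the epoch:
misread_seconds = epoch_seconds * 1_000_000_000 / 1_000_000
years_after_1970 = misread_seconds / (365.2425 * 24 * 3600)
print(int(1970 + years_after_1970))            # ~51358, right around +51357-12-22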

The types are as follows:

type(result_df['reg_date'][0])
#pandas._libs.tslibs.timestamps.Timestamp

type(result_df['scenario_reg_date'][0])
#str
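To see what the writer actually recorded in the file footer for each column, the parquet schema can be inspected directly (a diagnostic sketch; assumes pyarrow is installed):

import pyarrow.parquet as pq

# Prints each column's physical and logical type as stored in the file,
# e.g. whether reg_date went out as an int64 timestamp and at what precision
print(pq.read_schema(temp_filename))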

I tried the following things.

  1. dataframe.to_parquet(engine='pyarrow')
    This saves the timestamps as bigint, so the Glue crawler recognizes the column as bigint too, and Athena queries also show bigint.

  2. dataframe['reg_date'] = dataframe['reg_date'].apply(lambda x: x.to_pydatetime().strftime("%Y-%m-%dT%H:%M:%S.%fZ"))
    In this case the S3 console preview looks normal, but the Glue crawler recognizes the column as 'string'... and so does Athena.

I suspect the parquet format is storing its schema in some way I'm not seeing.

I expect...

the Glue crawler to parse the datetime data into a normal timestamp:
not +51357-12-22 18:56:40.000, just 2019-05-22 14:06:25.000, like the scenario_reg_date column.

And I want to know why this problem occurs.

I drilled into it for over 5 hours, and it ruined my whole day.

How do I solve this problem?


1 Answer


Athena requires the Java TIMESTAMP format, YYYY-MM-DD HH:MM:SS.fffffffff, so you need to adjust your values to match this format. Source here
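If the root cause is that fastparquet writes 'reg_date' as int64 nanoseconds while Athena decodes the column at a different precision, then coercing the timestamps at write time should produce values Athena can read. A hedged sketch of two write-side options (times is fastparquet's keyword, coerce_timestamps is pyarrow's; both are passed through by DataFrame.to_parquet):

# Option 1: fastparquet, writing Spark/Hive-style int96 timestamps
final_df.to_parquet(temp_filename, compression=None,
                    engine='fastparquet', times='int96')

# Option 2: pyarrow, truncating to millisecond precision so the int64
# values match what Athena expects
final_df.to_parquet(temp_filename, engine='pyarrow',
                    coerce_timestamps='ms', allow_truncated_timestamps=True)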

Note that Glue crawlers often fail to detect timestamp columns and classify them as strings (like your scenario_reg_date). So if you wish to use date functionality on those columns later, you'll have to take the table's DDL > manually convert the data types > drop & recreate the table, as sketched below.
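A minimal sketch of that drop & recreate step driven from Python (the database, table, and S3 locations are hypothetical placeholders; take the real column list from SHOW CREATE TABLE first and fix the types there):

import boto3

athena = boto3.client('athena')

# Hypothetical DDL: the crawler-generated DDL with reg_date re-declared as timestamp
ddl = """
CREATE EXTERNAL TABLE `table_name` (
  `id` string,
  `reg_date` timestamp,
  `scenario_reg_date` timestamp
)
STORED AS PARQUET
LOCATION 's3://bucketname/path/'
"""

# Drop the crawler-created table, then recreate it with corrected types.
# (start_query_execution is async; in real code, poll get_query_execution
# until the DROP finishes before issuing the CREATE.)
for query in ('DROP TABLE IF EXISTS `table_name`', ddl):
    athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={'Database': 'catalog-name'},
        ResultConfiguration={'OutputLocation': 's3://bucketname/athena-results/'},
    )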