I download logs with Python every day (via a cronjob, using IPython). The download server is CentOS 7 running a dockerized Ubuntu container.

(base) root@:/# uname -a
Linux f5210d345285 3.10.0-957.21.2.el7.x86_64 #1 SMP Wed Jun 5 14:26:44 UTC 2019 x86_64 GNU/Linux

The server's date/time is set to UTC.

cron process:
1. Download from Athena.
2. Make a pandas dataframe.
3. Get aggregation data from the MySQL server (on-premise).
4. Join & aggregate.
5. Make the final dataframe.
6. Upload to S3.

import os
import boto3

final_df.to_parquet(temp_filename, compression=None, engine='fastparquet')
with open(temp_filename, 'rb') as fh:
    data_bytes = fh.read()
os.remove(temp_filename)  # was os.remove(filename): wrong variable name
boto3.resource('s3').Object(bucketname, s3pathname).put(Body=data_bytes)
  1. Run the Glue crawler on it.
  2. Query the data from Athena and use it (e.g. with Tableau).
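As an aside, the same upload can stream the file from disk instead of buffering all of it in memory, using boto3's upload_file (same bucketname/s3pathname variables as above; in this variant, remove the temp file after the upload):

import boto3

# Streams the file in chunks via the S3 transfer manager instead of one big put()
boto3.resource('s3').Object(bucketname, s3pathname).upload_file(temp_filename)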

problem
1. The problem is the datetime column.
2. The datetime column is named 'reg_date', and it comes from a MariaDB column of type 'datetime'.
3. When I display the dataframe and operate on it, 'reg_date' works fine.
4. If I run the next code, it works just right too.

final_df.to_parquet(temp_filename, compression=None, engine='fastparquet')
read_df = pd.read_parquet(temp_filename)
display(read_df)
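A quick way to confirm the round trip really preserves the type, not just the display (pandas keeps datetimes as datetime64[ns], which is why everything looks fine from inside pandas):

print(read_df['reg_date'].dtype)      # expected: datetime64[ns]
print(type(read_df['reg_date'][0]))   # pandas Timestamp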
  1. And when I check the parquet file in the S3 console (S3 -> bucket -> path -> file -> Select from -> preview), it shows no problem:
    [
    {
        "id": "1251616",
        "press": "8",
        "reg_date": "2019-05-22T14:06:25.000Z", #this line
        "scenario_id": 5072,
        "scheduletype": "1",
        "url": "some url string",
        "user_id": "some id string",
        "writer": "some writer name string",
        "deleted": "0",
        "display": "1",
        "keyword": "some keyword string",
        "modifier": "some modifier string",
        "scenario_reg_date": "2019-05-15 15:04:24",
        "sentence": "some long string..",
        "standby_transmission": "1",
        "subject": "some long string..",
        "scenario_user_id": "some user id string",
        "press_name": "some string",
        "press_url": "some url string",
        "press_host": "some url string",
        "gaid": "some string",
        "article_number": 233235,
        "article_uid": "some string",
        "ga:adsenseadsclicks": 0,
        "ga:adsenseadsviewed": 11,
        "ga:adsensectr": 0,
        "ga:adsenseecpm": 0,
        "ga:adsenserevenue": 0,
        "ga:adsenseviewableimpressionpercent": 0,
        "ga:contentgroup1": "some string",
        "ga:contentgroup2": "some string",
        "ga:date": 20190704,
        "ga:hostname": "some string",
        "ga:hour": 12,
        "ga:keyword": "some string",
        "ga:pagepath": "some string",
        "ga:pagetitle": "some string",
        "ga:pageviews": 1,
        "ga:sessions": 1,
        "ga:source": "some string",
        "host": "some string",
        "adsenseattached": 1,
        "eventtime": "2019-07-04T12:00:00.000Z"
    },
    {
        "id": "1251616",
        "press": "8",
        "reg_date": "2019-05-22T14:06:25.000Z",  #and .. this line also
        "scenario_id": 5072,
        "scheduletype": "1",
    ....
    ....
  1. So I think the reg_date column is perfectly normal.
  2. But when I run the Glue crawler and preview it with Athena, it turns weird.
-- in Athena
SELECT "id","reg_date","scenario_reg_date" FROM "catalog-name"."table_name" limit 10

The result is:

idx|id     |reg_date                 |scenario_reg_date
1  |1251616|+51357-12-22 18:56:40.000|2019-05-15 15:04:24
2  |1361993|+51390-05-01 13:36:40.000|2019-05-15 15:04:24
3  |1461362|+51417-09-19 00:53:20.000|2019-05-15 15:04:24
4  |1461362|+51417-09-19 00:53:20.000|2019-05-15 15:04:24
5  |1461362|+51417-09-19 00:53:20.000|2019-05-15 15:04:24
  1. The 'reg_date' column turns totally weird!
  2. The Glue crawler is set up with all defaults; I only configured the IAM role, the Data Catalog name, and the S3 source.
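One thing I noticed about the magnitude (a rough back-of-the-envelope check, exact calendar math aside): the broken values sit about a factor of 1000 further from the epoch than the real ones, which happens to be the ratio between nanoseconds and microseconds:

from datetime import datetime, timezone

real = datetime(2019, 5, 22, 14, 6, 25, tzinfo=timezone.utc)
epoch_seconds = real.timestamp()               # ~1.5585e9 seconds

# If an int64 holding nanoseconds were decoded as microseconds,
# every instant would land 1000x further from the epoch:
misread_seconds = epoch_seconds * 1_000_000_000 / 1_000_000
years_after_1970 = misread_seconds / (365.2425 * 24 * 3600)
print(int(1970 + years_after_1970))            # ~51358, right around +51357-12-22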

The types are as follows:

type(result_df['reg_date'][0])
#pandas._libs.tslibs.timestamps.Timestamp

type(result_df['scenario_reg_date'][0])
#str
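To see what the writer actually recorded in the file footer for each column, the parquet schema can be inspected directly (a diagnostic sketch; assumes pyarrow is installed):

import pyarrow.parquet as pq

# Prints each column's physical and logical type as stored in the file,
# e.g. whether reg_date went out as an int64 timestamp and at what precision
print(pq.read_schema(temp_filename))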

I tried the following things.

  1. dataframe.to_parquet(engine='pyarrow')
    This saves the timestamps as bigint, so the Glue crawler recognizes the column as bigint too, and Athena queries also show bigint.

  2. dataframe['reg_date'] = dataframe['reg_date'].apply(lambda x: x.to_pydatetime().strftime("%Y-%m-%dT%H:%M:%S.%fZ"))
    In this case the S3 console preview looks normal, but the Glue crawler recognizes the column as 'string'... and so does Athena.

I suspect the parquet format is storing its schema in some way I'm not seeing.

I expect...

the Glue crawler to parse the datetime data into a normal timestamp:
not +51357-12-22 18:56:40.000, just 2019-05-22 14:06:25.000, like the scenario_reg_date column.

And I want to know why this problem occurs.

I drilled into it for over 5 hours, and it ruined my whole day.

How do I solve this problem?


1 Answer


Athena requires the Java TIMESTAMP format, YYYY-MM-DD HH:MM:SS.fffffffff, so you need to adjust your values to match this format. Source here
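If the root cause is that fastparquet writes 'reg_date' as int64 nanoseconds while Athena decodes the column at a different precision, then coercing the timestamps at write time should produce values Athena can read. A hedged sketch of two write-side options (times is fastparquet's keyword, coerce_timestamps is pyarrow's; both are passed through by DataFrame.to_parquet):

# Option 1: fastparquet, writing Spark/Hive-style int96 timestamps
final_df.to_parquet(temp_filename, compression=None,
                    engine='fastparquet', times='int96')

# Option 2: pyarrow, truncating to millisecond precision so the int64
# values match what Athena expects
final_df.to_parquet(temp_filename, engine='pyarrow',
                    coerce_timestamps='ms', allow_truncated_timestamps=True)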

Note that Glue crawlers often fail to detect timestamp columns and classify them as strings (like your scenario_reg_date). So if you wish to use date functionality on those columns later, you'll have to take the table's DDL > manually convert the data types > drop & recreate the table, as sketched below.
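A minimal sketch of that drop & recreate step driven from Python (the database, table, and S3 locations are hypothetical placeholders; take the real column list from SHOW CREATE TABLE first and fix the types there):

import boto3

athena = boto3.client('athena')

# Hypothetical DDL: the crawler-generated DDL with reg_date re-declared as timestamp
ddl = """
CREATE EXTERNAL TABLE `table_name` (
  `id` string,
  `reg_date` timestamp,
  `scenario_reg_date` timestamp
)
STORED AS PARQUET
LOCATION 's3://bucketname/path/'
"""

# Drop the crawler-created table, then recreate it with corrected types.
# (start_query_execution is async; in real code, poll get_query_execution
# until the DROP finishes before issuing the CREATE.)
for query in ('DROP TABLE IF EXISTS `table_name`', ddl):
    athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={'Database': 'catalog-name'},
        ResultConfiguration={'OutputLocation': 's3://bucketname/athena-results/'},
    )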