The AWS Glue Data Catalog is supposed to hold meta information about the actual data, e.g. table schemas and the locations of partitions. Partitions are a way of restricting Athena to scan only certain prefixes in your S3 bucket, for speed and cost efficiency. When you query data located in an S3 bucket using Athena, it uses the table definitions stored in the Glue Data Catalog. This also means that when you execute DDL statements in Athena, the corresponding table is created in the Glue Data Catalog. So I am not sure what you mean by "Glue finds 0 rows".
If you created your table using Athena like this:
```sql
CREATE EXTERNAL TABLE `mytable`(
  `labels` array<string>,
  `userid` string)
PARTITIONED BY (
  `year` string,
  `month` string,
  `day` string,
  `hour` string)
ROW FORMAT SERDE
  'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'paths'='labels,userid')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://my-bucket/data/'
```
Note that `LOCATION` points to the place where your partitioning starts. Then adding partitions should look like:
```sql
ALTER TABLE mytable
ADD PARTITION (year='2019', month='06', day='27', hour='00')
LOCATION 's3://my-bucket/data/2019/06/27/00/';

ALTER TABLE mytable
ADD PARTITION (year='2019', month='06', day='28', hour='00')
LOCATION 's3://my-bucket/data/2019/06/28/00/';
```
Note that the partition values are quoted, since the partition columns were declared as `string`.
After these two DDL statements you should be able to see `mytable` in the Glue Data Catalog with two partitions under the View partitions tab. Now, if you run a query without a `WHERE` clause:
```sql
SELECT "day", COUNT(*)
FROM mytable
GROUP BY "day";
```
then all your data covered by the partitions will be scanned and you should get:
| day | count |
|-----|----------------|
| 27 | some number |
| 28 | another number |
Now, if you want to count records within a specific day, you need to include a `WHERE` clause:
```sql
SELECT "day", COUNT(*)
FROM mytable
WHERE "day" = '27'
GROUP BY "day";
```
Then only your data under `s3://my-bucket/data/2019/06/27/` will be scanned, and you should get something like:
| day | count |
|-----|----------------|
| 27 | some number |
Additional notes
- According to AWS, a table in a Glue catalog can have up to 10 million partitions, so 8,760 hourly partitions per year (24 × 365) would hardly be an issue.
- AWS does not charge you for DDL statements executed by Athena.
- If your paths in S3 adhere to the HIVE convention, i.e. `s3://my-bucket/data/year=2019/month=06/day=27/hour=00/`, then after you have defined the table you can simply run `MSCK REPAIR TABLE mytable` and all partitions will be added to the table in the Glue Data Catalog.
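If your existing paths do not follow that convention, one option is to rewrite the object keys into HIVE style (in S3, a "rename" is a copy plus delete). A minimal sketch of just the key rewrite, using a hypothetical helper of my own rather than any AWS API (keys shown relative to the bucket):

```python
import re

def to_hive_key(key: str) -> str:
    """Rewrite a plain date prefix like data/2019/06/27/00/x.json
    into HIVE-style data/year=2019/month=06/day=27/hour=00/x.json.
    Keys that don't match the pattern are returned unchanged."""
    return re.sub(
        r"^data/(\d{4})/(\d{2})/(\d{2})/(\d{2})/",
        r"data/year=\1/month=\2/day=\3/hour=\4/",
        key,
    )

print(to_hive_key("data/2019/06/27/00/00001.json"))
# data/year=2019/month=06/day=27/hour=00/00001.json
```

You would then copy each object to its rewritten key (e.g. with the S3 client's copy operations) and delete the original, after which `MSCK REPAIR TABLE` can pick everything up.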
For a big number of partitions it is not feasible to run `ALTER TABLE mytable ADD PARTITION ...` by hand. Instead, you could use:
- Glue Crawler. From my experience, this is only useful when you don't know much about your data and you have huge amounts of it. Here is AWS pricing.
- An AWS SDK such as boto3 for Python. It provides APIs for both Athena and Glue clients. With the Athena client you could generate `ALTER TABLE mytable ADD PARTITION ...` statements as strings and then send them for execution. Here is a post on Medium that can help you get started. You can also use the Glue client to do the same thing with the `batch_create_partition` or `create_partition` methods, but these require different inputs than the Athena client.
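A sketch of the SDK route (the helper functions and the database name `mydatabase` are my assumptions; `start_query_execution` and `batch_create_partition` are the actual boto3 Athena/Glue methods, but double-check the parameter shapes against the boto3 docs):

```python
from datetime import datetime, timedelta

def partition_ddl(table: str, bucket: str, start: datetime, hours: int) -> list:
    """Generate one ALTER TABLE statement per hourly partition (hypothetical helper)."""
    stmts = []
    for i in range(hours):
        t = start + timedelta(hours=i)
        y, m, d, h = t.strftime("%Y %m %d %H").split()
        stmts.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION "
            f"(year='{y}', month='{m}', day='{d}', hour='{h}') "
            f"LOCATION 's3://{bucket}/data/{y}/{m}/{d}/{h}/';"
        )
    return stmts

def partition_input(bucket: str, y: str, m: str, d: str, h: str) -> dict:
    """Build one PartitionInputList entry for glue.batch_create_partition.
    Values must follow the order of the table's partition keys."""
    return {
        "Values": [y, m, d, h],
        "StorageDescriptor": {
            "Location": f"s3://{bucket}/data/{y}/{m}/{d}/{h}/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {"SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"},
        },
    }

# Sending these requires AWS credentials, so this part is illustrative only:
# import boto3
# athena = boto3.client("athena")
# for stmt in partition_ddl("mytable", "my-bucket", datetime(2019, 6, 27), 24):
#     athena.start_query_execution(
#         QueryString=stmt,
#         QueryExecutionContext={"Database": "mydatabase"},  # assumed database name
#         ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
#     )
# glue = boto3.client("glue")
# glue.batch_create_partition(
#     DatabaseName="mydatabase",
#     TableName="mytable",
#     PartitionInputList=[partition_input("my-bucket", "2019", "06", "27", "00")],
# )

print(partition_ddl("mytable", "my-bucket", datetime(2019, 6, 27), 1)[0])
```

The Glue route skips Athena entirely and is faster for thousands of partitions, since `batch_create_partition` accepts many partitions per call, while each Athena DDL statement is a separate query execution.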
Update 2019-07-03
If your data has a structure like
```
s3://my-bucket/data/2019/06/27/00/00001.json
s3://my-bucket/data/2019/06/27/00/00002.json
s3://my-bucket/data/2019/06/27/01/00001.json
s3://my-bucket/data/2019/06/27/01/00002.json
...
s3://my-bucket/data/2019/06/28/00/00001.json
s3://my-bucket/data/2019/06/28/00/00002.json
s3://my-bucket/data/2019/06/28/01/00001.json
s3://my-bucket/data/2019/06/28/01/00002.json
```
but you want only 3 partition columns, i.e. year, month and day, then the definition of your table should take that into account:
```sql
CREATE EXTERNAL TABLE `mytable`(
  `labels` array<string>,
  `userid` string)
PARTITIONED BY ( -- Here we specify only three columns
  `year` string,
  `month` string,
  `day` string)
ROW FORMAT SERDE
  'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'paths'='labels,userid')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://my-bucket/data/'
```
then the DDL statements for adding partitions should be:
```sql
ALTER TABLE mytable
ADD PARTITION (year='2019', month='06', day='27')
LOCATION 's3://my-bucket/data/2019/06/27/'; -- Stop at day level

ALTER TABLE mytable
ADD PARTITION (year='2019', month='06', day='28')
LOCATION 's3://my-bucket/data/2019/06/28/'; -- Stop at day level
```
Remember that in S3 there is no such thing as folders or directories. This is how I see partitions and locations in the context of Athena, Glue and S3: a partition is an abstraction for a group of S3 objects, where the grouping is defined by filtering all objects with respect to a certain "prefix" <=> Location. Thus, when you specify `LOCATION`, stop at the "day level". You could stop at the "hour level" instead, e.g. `s3://my-bucket/data/2019/06/28/01/`, but then you would need to create partitions for all other hours if you want Athena to be able to scan them. On top of that, the combination of partition values must be unique (stopping at the hour level is equivalent to defining 4 partition columns), otherwise AWS won't allow you to create it.
I just tested this in my AWS account with data that resembles your S3 paths and was able to see partitions in the Glue Data Catalog pointing to the correct destinations.