2
votes

I defined several tables in AWS glue.

Over the past few weeks, I've had different issues with the table definition which I had to fix manually - I want to change column names, or types, or change the serialization lib. However, If i already have partitions created, the repairing of table doesn't change them, and so I have to delete all partitions manually and then repairing.

Is there a simple way to do this? Delete all partitions from an AWS Glue table? I'm using aws batch-delete-partition CLI command, but it's syntax is tricky, and there are some limitations on the amount of partitions you can delete in one go, the whole thing is cumbersome...

3
using boto3 with python might be the more elegant solution you are looking for, any reason for sticking with aws cli?Golammott
@Golammott No reason really... I really expected this to be an easy thing to do, by the AWS CLI or otherwise, without having to write code. I'd rather not write python scripts if I can avoid it. If you can post your python code it'll helpYossi Vainshtein

3 Answers

5
votes

For now, I found this command line solution, runinng aws glue batch-delete-partition iteratively for batches of 25 partitions using xargs

(here I am assuming there are max 1000 partitions):

aws glue get-partitions --database-name=<my-database> --table-name=<my-table> | jq -cr '[ { Values: .Partitions[].Values } ]' > partitions.json

seq 0 25 1000 | xargs -I _ bash -c "cat partitions.json | jq -c '.[_:_+25]'" | while read X; do aws glue batch-delete-partition --database-name=<my-database> --table-name=<my-table > --partitions-to-delete=$X; done

Hope it helps someone, but I'd prefer a more elegant solution

4
votes

Using python3 with boto3 looks a little bit nicer. Albeit not by much :)

Unfortunately AWS doesn't provide a way to delete all partitions without batching 25 requests at a time.

import boto3

glue_client = boto3.client("glue", "us-west-2")

def get_and_delete_partitions(database, table, batch=25):
    partitions = glue_client.get_partitions(
        DatabaseName=database,
        TableName=table)["Partitions"]

    for i in range(0, len(partitions), batch):
        to_delete = [{k:v[k]} for k,v in zip(["Values"]*batch, partitions[i:i+batch])]
        glue_client.batch_delete_partition(
            DatabaseName=database,
            TableName=table,
            PartitionsToDelete=to_delete)
0
votes

Here is a PowerShell version FWIW:

$database = 'your db name'
$table = 'your table name'
# Set the variables above

$batch_size = 25
Set-DefaultAWSRegion -Region eu-west-2
$partition_list = Get-GLUEPartitionList -DatabaseName $database -TableName $table

$selected_partitions = $partition_list
# Uncomment and edit predicate to select only certain partitions
# $selected_partitions = $partition_list | Where-Object {$_.Values[0] -gt '2020-07-20'} 

$selected_values = $selected_partitions | Select-Object -Property Values

for ($i = 0; $i -lt $selected_values.Count; $i += $batch_size) {
    $chunk = $selected_values[$i..($i + $batch_size - 1)]
    Remove-GLUEPartitionBatch -DatabaseName $database -TableName $table -PartitionsToDelete $chunk -Force
}

# Now run `MSCK REPAIR TABLE db_name.table_name` to add the partitions again