According to the AWS SageMaker documentation, to track metrics in CloudWatch for custom (non-built-in) ML algorithms, I have to define my estimator as below.

But I am not sure how to alter my training script so that the metric definitions declared in my estimator pick up these values.

    estimator = Estimator(
        image_name=ImageName,
        role='SageMakerRole',
        instance_count=1,
        instance_type='ml.c4.xlarge',
        k=10,
        sagemaker_session=sagemaker_session,
        metric_definitions=[
            {'Name': 'train:error', 'Regex': 'Train_error=(.*?);'},
            {'Name': 'validation:error', 'Regex': 'Valid_error=(.*?);'}
        ]
    )

In my training code, I have

    for epoch in range(1, args.epochs + 1):
        total_loss = 0
        model.train()
        for step, batch in enumerate(train_loader):
            b_input_ids = batch[0].to(device)
            b_input_mask = batch[1].to(device)
            b_labels = batch[2].to(device)
            model.zero_grad()

            outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
            loss = outputs[0]

            total_loss += loss.item()
            loss.backward()  # Backpropagate to compute the gradients
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # Clip the gradient norm to 1.0
            # Update the parameters based on their gradients, the learning rate, etc.
            optimizer.step()
        # Log the average training loss once per epoch
        logger.info("Average training loss: %f\n", total_loss / len(train_loader))

Here, I want the train:error metric to pick up total_loss / len(train_loader), but I am not sure how to wire the two together.

1 Answer

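The regex in a metric definition has to match the text your script actually logs. Your logger writes "Average training loss: <value>", so define a regex that captures that pattern. Try this:

    {'Name': 'Average training loss', 'Regex': 'Average training loss: ([0-9\.]+)'}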

You can paste the regex and a sample log line into a regex testing tool to see what it captures.
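If you would rather check this locally, here is a minimal sketch using Python's built-in re module. It does not call SageMaker at all; it only confirms that a regex captures the number from a sample log line. The names LOG_FORMAT, METRIC_DEFINITION, ORIGINAL_DEFINITION and extract_metric are hypothetical helpers made up for this illustration, and the loss value 0.4375 is an arbitrary sample. It also assumes your logger output ends up on the training container's stdout/stderr, which is where I would expect the metric regexes to be applied.

```python
import re

# Log format copied from the training script in the question.
LOG_FORMAT = "Average training loss: %f"

# Metric definition whose regex mirrors that log format.
METRIC_DEFINITION = {
    "Name": "train:error",
    "Regex": r"Average training loss: ([0-9\.]+)",
}

# The definition from the question, which expects "Train_error=<value>;".
ORIGINAL_DEFINITION = {"Name": "train:error", "Regex": "Train_error=(.*?);"}


def extract_metric(log_line, definition=METRIC_DEFINITION):
    """Return the first captured group as a float, or None if nothing matches."""
    match = re.search(definition["Regex"], log_line)
    return float(match.group(1)) if match else None


# Simulate the line the logger would emit for an average loss of 0.4375.
sample_line = LOG_FORMAT % 0.4375
print(sample_line)                  # Average training loss: 0.437500
print(extract_metric(sample_line))  # 0.4375

# Alternative: keep the original regex and change the log message instead.
alt_line = "Train_error=%f;" % 0.4375
print(extract_metric(alt_line, ORIGINAL_DEFINITION))  # 0.4375
```

So you can go either way: change the regex to match your existing log message, or keep the regex from the question and log something like "Train_error=%f;" from the training loop. What matters is that the two agree character for character, including the separator before the number.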