According to the AWS SageMaker documentation, to track metrics in CloudWatch for custom (non-built-in) ML algorithms, I have to define my estimator as shown below.
However, I am not sure how to alter my training script so that the metric definitions declared in the estimator pick up these values.
estimator = Estimator(image_name=ImageName,
                      role='SageMakerRole',
                      instance_count=1,
                      instance_type='ml.c4.xlarge',
                      k=10,
                      sagemaker_session=sagemaker_session,
                      metric_definitions=[
                          {'Name': 'train:error', 'Regex': 'Train_error=(.*?);'},
                          {'Name': 'validation:error', 'Regex': 'Valid_error=(.*?);'}
                      ])
In my training code, I have
for epoch in range(1, args.epochs + 1):
    total_loss = 0
    model.train()
    for step, batch in enumerate(train_loader):
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        model.zero_grad()
        outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
        loss = outputs[0]
        total_loss += loss.item()
        loss.backward()  # Backpropagate to compute the gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # Clip the gradient norm to guard against exploding gradients
        # Parameters are modified based on their gradients, the learning rate, etc.
        optimizer.step()  # Update the parameters
    logger.info("Average training loss: %f\n", total_loss / len(train_loader))
Here, I want train:error to pick up the value of total_loss / len(train_loader), but I am not sure how to pass that value to the metric definition.
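For reference, my understanding is that SageMaker does not receive metric values through any assignment in the script. It applies each Regex to the lines the training job writes to its logs, so the log line itself has to contain text the pattern can match. Here is a minimal sketch of what I think that would look like. The helper format_train_metric and the sample numbers are hypothetical, used only to check the pattern locally with Python's re module; only the regex comes from my estimator.

```python
import re

# Regex copied verbatim from the 'train:error' entry in metric_definitions
TRAIN_ERROR_REGEX = r"Train_error=(.*?);"


def format_train_metric(avg_loss: float) -> str:
    """Build a log line that the 'train:error' regex can match (hypothetical helper)."""
    return f"Train_error={avg_loss:.6f};"


# Hypothetical stand-ins for total_loss and len(train_loader)
total_loss, num_batches = 4.2, 8

line = format_train_metric(total_loss / num_batches)
print(line)

# Check locally that the pattern captures the numeric value from the line
match = re.search(TRAIN_ERROR_REGEX, line)
captured = float(match.group(1))
```

If this is right, the existing logger.info call would need to log something like "Train_error=%f;" instead of "Average training loss: %f", since the current message contains neither the "Train_error=" prefix nor the trailing semicolon that the regex expects.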