4 votes

I am trying to schedule a data-quality monitoring job in AWS SageMaker by following the steps in this AWS documentation page. I have enabled data capture for my endpoint, then trained a baseline on my training CSV file, and the statistics and constraints are available in S3 like this:

from sagemaker import get_execution_role
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

my_data_monitor = DefaultModelMonitor(
    role=get_execution_role(),
    instance_count=1,
    instance_type='ml.m5.large',
    volume_size_in_gb=30,
    max_runtime_in_seconds=3_600)

# base s3 directory
baseline_dir_uri = 's3://api-trial/data_quality_no_headers/'
# train data, that I have used to generate baseline
baseline_data_uri = baseline_dir_uri + 'ch_train_no_target.csv'
# directory in the s3 bucket where I have stored my baseline results
baseline_results_uri = baseline_dir_uri + 'baseline_results_try17/'


my_data_monitor.suggest_baseline(
    baseline_dataset=baseline_data_uri,
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri=baseline_results_uri,
    wait=True, logs=False, job_name='ch-dq-baseline-try21'
)

and the data is available in S3:

*(screenshot: the baseline statistics.json and constraints.json files in the S3 bucket)*
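To sanity-check the baseline outputs without the console, the suggested statistics and constraints can also be read back through the monitor object. A minimal sketch, assuming the `my_data_monitor` object from the snippet above (after `suggest_baseline` has finished):

```
# read back the outputs of the latest baselining job from S3
baseline_stats = my_data_monitor.baseline_statistics()
baseline_cons = my_data_monitor.suggested_constraints()

# body_dict holds the parsed statistics.json / constraints.json
print(baseline_stats.body_dict["dataset"]["item_count"])   # rows profiled
print(len(baseline_cons.body_dict["features"]))            # features constrained
```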

Then I tried to schedule my data-quality monitoring job by following this example notebook for model-quality monitoring in the sagemaker-examples GitHub repo, making the necessary modifications based on feedback from error messages.

Here's how I tried to schedule the data-quality monitoring job from SageMaker Studio:

from sagemaker import get_execution_role
from sagemaker.model_monitor import CronExpressionGenerator
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor import EndpointInput
from sagemaker.model_monitor.dataset_format import DatasetFormat

# base s3 directory
baseline_dir_uri = 's3://api-trial/data_quality_no_headers/'

# train data, that I have used to generate baseline
baseline_data_uri = baseline_dir_uri + 'ch_train_no_target.csv'

# directory in the s3 bucket where I have stored my baseline results
baseline_results_uri = baseline_dir_uri + 'baseline_results_try17/'
# s3 locations of baseline job outputs
baseline_statistics = baseline_results_uri + 'statistics.json'
baseline_constraints = baseline_results_uri + 'constraints.json'

# directory in s3 bucket that I would like to store results of monitoring schedules in
monitoring_outputs = baseline_dir_uri + 'monitoring_results_try17/'

# name of the deployed endpoint that has data capture enabled
myendpoint_name = 'ch-dq-nh-try21'

ch_dq_ep = EndpointInput(endpoint_name=myendpoint_name,
                         destination="/opt/ml/processing/input_data",
                         s3_input_mode="File",
                         s3_data_distribution_type="FullyReplicated")

monitor_schedule_name='ch-dq-monitor-schdl-try21'

# my_data_monitor is the DefaultModelMonitor constructed in the baseline step above
my_data_monitor.create_monitoring_schedule(endpoint_input=ch_dq_ep,
                                           monitor_schedule_name=monitor_schedule_name,
                                           output_s3_uri=baseline_dir_uri,
                                           constraints=baseline_constraints,
                                           statistics=baseline_statistics,
                                           schedule_cron_expression=CronExpressionGenerator.hourly(),
                                           enable_cloudwatch_metrics=True)

After an hour or so, when I check the status of the schedule like this:

import boto3
boto3_sm_client = boto3.client('sagemaker')
boto3_sm_client.describe_monitoring_schedule(MonitoringScheduleName='ch-dq-monitor-schdl-try21')

I get a Failed status, like below:

'MonitoringExecutionStatus': 'Failed',
  ...
  'FailureReason': 'Job inputs had no data'},

Entire Message:

```
{'MonitoringScheduleArn': 'arn:aws:sagemaker:ap-south-1:<my-account-id>:monitoring-schedule/ch-dq-monitor-schdl-try21',
 'MonitoringScheduleName': 'ch-dq-monitor-schdl-try21',
 'MonitoringScheduleStatus': 'Scheduled',
 'MonitoringType': 'DataQuality',
 'CreationTime': datetime.datetime(2021, 9, 14, 13, 7, 31, 899000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2021, 9, 14, 14, 1, 13, 247000, tzinfo=tzlocal()),
 'MonitoringScheduleConfig': {'ScheduleConfig': {'ScheduleExpression': 'cron(0 * ? * * *)'},
  'MonitoringJobDefinitionName': 'data-quality-job-definition-2021-09-14-13-07-31-483',
  'MonitoringType': 'DataQuality'},
 'EndpointName': 'ch-dq-nh-try21',
 'LastMonitoringExecutionSummary': {'MonitoringScheduleName': 'ch-dq-monitor-schdl-try21',
  'ScheduledTime': datetime.datetime(2021, 9, 14, 14, 0, tzinfo=tzlocal()),
  'CreationTime': datetime.datetime(2021, 9, 14, 14, 1, 9, 405000, tzinfo=tzlocal()),
  'LastModifiedTime': datetime.datetime(2021, 9, 14, 14, 1, 13, 236000, tzinfo=tzlocal()),
  'MonitoringExecutionStatus': 'Failed',
  'EndpointName': 'ch-dq-nh-try21',
  'FailureReason': 'Job inputs had no data'},
 'ResponseMetadata': {'RequestId': 'dd729244-fde9-44b5-9904-066eea3a49bb',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'dd729244-fde9-44b5-9904-066eea3a49bb',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '835',
   'date': 'Tue, 14 Sep 2021 14:27:53 GMT'},
  'RetryAttempts': 0}}
```
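Since no log groups were created for these executions (see point 2 below), the only other handle I could find is the processing job behind each execution. A sketch of how to pull that up with boto3, assuming the same schedule name:

```
import boto3

sm = boto3.client('sagemaker')

# list the executions behind the schedule; each run that actually started
# is backed by a processing job whose description can be inspected
executions = sm.list_monitoring_executions(
    MonitoringScheduleName='ch-dq-monitor-schdl-try21',
    SortBy='ScheduledTime',
    SortOrder='Descending',
)

for summary in executions['MonitoringExecutionSummaries']:
    print(summary['MonitoringExecutionStatus'], summary.get('FailureReason', ''))
    # ProcessingJobArn is only present if a processing job was launched
    if 'ProcessingJobArn' in summary:
        job_name = summary['ProcessingJobArn'].split('/')[-1]
        print(sm.describe_processing_job(ProcessingJobName=job_name)['ProcessingInputs'])
```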

Possible things that might have gone wrong on my side, or that might help fix the issue:

  1. Dataset used for baseline: I have tried to create a baseline with the dataset both with and without my target variable (the dependent variable, or y), and the error persisted both times. So, I think the error has a different origin.
  2. There are no log groups created for these jobs for me to look at and try to debug the issue. The baseline jobs do have log groups, so I presume there is no problem with the role used by the monitoring-schedule jobs lacking permission to create a log group or stream.
  3. Role: the role I have attached is defined by get_execution_role(), which points to a role with full access to SageMaker, CloudWatch, S3 and some other services.
  4. The data collected from my endpoint during inference: here's how one line of the .jsonl file saved to S3, which contains the data captured during inference, looks (decoded below):

```
{"captureData":{"endpointInput":{"observedContentType":"application/json","mode":"INPUT","data":"{\"longitude\": [-122.32, -117.58], \"latitude\": [37.55, 33.6], \"housing_median_age\": [50.0, 5.0], \"total_rooms\": [2501.0, 5348.0], \"total_bedrooms\": [433.0, 659.0], \"population\": [1050.0, 1862.0], \"households\": [410.0, 555.0], \"median_income\": [4.6406, 11.0567]}","encoding":"JSON"},"endpointOutput":{"observedContentType":"text/html; charset=utf-8","mode":"OUTPUT","data":"eyJtZWRpYW5faG91c2VfdmFsdWUiOiBbNDUyOTU3LjY5LCA0NjcyMTQuNF19","encoding":"BASE64"}},"eventMetadata":{"eventId":"9804d438-eb4c-4cb4-8f1b-d0c832b641aa","inferenceId":"ef07163d-ea2d-4730-92f3-d755bc04ae0d","inferenceTime":"2021-09-14T13:59:03Z"},"eventVersion":"0"}
```

I would like to know what went wrong in this entire process that led to no data being fed to my monitoring job.

Comments:

- "From the captured data, your encoding for input is JSON and for output BASE64. Since having different encodings for input and output isn't supported, how did you resolve that issue?" – Python coder
- "I think that is a different issue altogether. This question was about data not being found for the merge operation. The monitoring happens after that, for which scripts need to be written, and I failed doing so. More: github.com/aws/sagemaker-python-sdk/issues/2665" – Naveen Reddy Marthala
- "Are you able to successfully schedule a DataQuality Monitoring job? If yes, can you share your working code snippet?" – Python coder
- "@Pythoncoder, no, I haven't been able to, which is why the GitHub issue linked above is still open. If you manage to schedule a data-quality monitoring job successfully, I request you to post your working script on that GitHub issue." – Naveen Reddy Marthala

1 Answer

2 votes

This happens during the ground-truth-merge job, when Spark can't find any data in either the '/opt/ml/processing/groundtruth/' or the '/opt/ml/processing/input_data/' directory. That can happen when either you haven't sent any requests to the SageMaker endpoint or there are no ground truths.

I got this error because the folder /opt/ml/processing/input_data/ of the Docker volume mapped to the monitoring container had no data to process. That happened because the process that fetches the captured data from S3 couldn't find any. And that, in turn, happened because there was an extra slash (/) in the directory to which the endpoint's captured data is saved. To elaborate: while creating the endpoint, I had specified the directory as s3://<bucket-name>/<folder-1>/, when it should have been just s3://<bucket-name>/<folder-1>. So, when the job tried to copy that hour's data from S3 into the Docker volume, the directory it tried to read from was s3://<bucket-name>/<folder-1>//<endpoint-name>/<variant-name>/<year>/<month>/<date>/<hour> (notice the two slashes). When I created the endpoint configuration again with the trailing slash removed from the S3 directory, this error was gone and the ground-truth-merge operation succeeded as part of model-quality monitoring.
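For anyone else hitting this, here is roughly what the corrected data-capture setup looks like with the SDK. This is a sketch, not my exact code; the bucket and prefix names are placeholders:

```
from sagemaker.model_monitor import DataCaptureConfig

# placeholder bucket/prefix -- use your own; the important part is that the
# destination has NO trailing slash, otherwise the captured-data path gets a
# double slash and the monitoring job finds no data for the hour
capture_uri = 's3://my-bucket/my-capture-prefix'.rstrip('/')

data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri=capture_uri,
)

# passed at deployment time, e.g.:
# predictor = model.deploy(initial_instance_count=1,
#                          instance_type='ml.m5.large',
#                          endpoint_name='ch-dq-nh-try21',
#                          data_capture_config=data_capture_config)
```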

I am answering my own question because someone read it and upvoted it, meaning someone else has faced this problem too. So I have written up what worked for me, and also so that Stack Exchange doesn't think I am spamming the forum with questions.