2 votes

I have a setup where I am publishing messages to Google Cloud PubSub service.

I wish to get the size of each individual message that I am publishing to PubSub. For this, I identified the following approaches (note: I am using the Python clients for publishing and subscribing, following the line-by-line implementation presented in their documentation):

  • View the message size from the Google Cloud Console using the 'Monitoring' feature
  • Create a pull subscription client and view the size using message.size in the callback function for the messages that are being pulled from the requested topic.
  • Estimate the size of the messages before publishing by converting them to JSON as per the PubSub message schema and using sys.getsizeof()
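For the third approach, here is a minimal sketch (using the same sample payload as below) comparing `sys.getsizeof()` with the length of the serialized bytes. Note that `sys.getsizeof()` reports the in-memory size of the Python object, including interpreter overhead, so the encoded byte count is presumably the closer estimate of what goes over the wire:

```python
import json
import sys

# Same sample payload as the message below.
message = {
    "data": "Test_message",
    "attributes": {
        "dummyField1": "dummyFieldValue1",
        "dummyField2": "dummyFieldValue2",
    },
}

serialized = json.dumps(message)
print(sys.getsizeof(serialized))        # in-memory size of the str object
print(len(serialized.encode("utf-8")))  # bytes actually serialized
```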

For a sample message like the following, which I published using a Python publisher client:

{
  "data": "Test_message",
  "attributes": {
    "dummyField1": "dummyFieldValue1",
    "dummyField2": "dummyFieldValue2"
  }
}

I get 101 as the message.size output from the following callback function in the subscription client:

def callback(message):
    print(f"Received {message.data}.")
    if message.attributes:
        print("Attributes:")
        for key in message.attributes:
            value = message.attributes.get(key)
            print(f"{key}: {value}")
    print(message.size)
    message.ack()

However, the size displayed in Cloud Console Monitoring is around 79 B.

So these are my questions:

  • Why are the sizes different for the same message?
  • Is the output of message.size in bytes?
  • How do I view the size of a message before publishing using the python client?
  • How do I view the size of a single message on the Cloud Console, rather than an aggregated measure of size during a given timeframe, which is all I could find in the Monitoring section?
According to the documentation, message.size is an attribute that "Returns the size of the underlying message, in bytes". Regarding your question about message_sizes, this metric is the "Distribution of publish message sizes (in bytes)". It is sampled every 60 seconds, and after sampling the data is not visible for up to 240 seconds (link). Could you tell me the reason you want to check the message size before publishing? - Alexandre Moraes
Also, would message.size and `message_sizes` (as mentioned above) satisfy your needs? - Alexandre Moraes
@AlexandreMoraes I wish to know the size of the messages being published so I can estimate the data flow when messages are published at a specified rate for a specified number of days. This, in turn, is to estimate how much it would cost, and whether it would stay within the free tier. - Ishwar Venugopal
According to the Python library documentation, message.size is only available as a message attribute on the subscriber side. Otherwise, you will have to use Cloud Monitoring and alerts, which is very useful if you want to monitor your quota expenditure. Did all this information help you? - Alexandre Moraes
Yes, please. That would be fine. - Ishwar Venugopal

2 Answers

1 vote

In order to further contribute to the community, I am summarising our discussion as an answer.

  1. Regarding message.size: it is an attribute of a message in the subscriber client. According to the documentation, its definition is:

Returns the size of the underlying message, in bytes

Thus you would not be able to use it before publishing.

  2. On the other hand, message_sizes is a metric in Google Cloud Metrics, and it is used by Cloud Monitoring, here.

Finally, the last topic discussed was that your aim is to monitor your quota expenditure so you can stay within the free tier. For this reason, the best option would be to use Cloud Monitoring and set up alerts based on metrics such as pubsub.googleapis.com/topic/byte_cost. Here are some links where you can find more about it: Quota utilisation, Alert event based, Alert Policies.
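As an illustration, a Cloud Monitoring filter selecting that metric for Pub/Sub topics might look like the following (sketch only; check the filter syntax against the Monitoring documentation):

```
metric.type = "pubsub.googleapis.com/topic/byte_cost"
resource.type = "pubsub_topic"
```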

1 vote

Regarding your third question about viewing the message size before publishing: the billable message size is the sum of the message data, the attributes (key plus value), 20 bytes for the timestamp, and some bytes for the message_id. See the Cloud Pub/Sub Pricing guide. Note that a minimum of 1000 bytes is billed regardless of message size, so if your messages may be smaller than 1000 bytes, it is important to have good batch settings. The message_id is assigned server-side and is not guaranteed to be a particular size, but it is returned by the publish call as a future, so you can inspect examples of it. This should allow you to get a fairly accurate estimate of message cost within the publisher client. Note that you can also use the monitoring client library to read Cloud Monitoring metrics from within the Python client.
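Putting those pieces together, a rough estimator might look like this. The 16-byte message_id length is an assumption (the server-assigned ID is not a fixed size), and the 1000-byte floor reflects the billing minimum mentioned above:

```python
def estimate_billable_size(data: bytes, attributes: dict,
                           message_id_len: int = 16) -> int:
    """Rough per-message billable size: data + attribute keys/values
    + 20 bytes for the timestamp + an assumed message_id length."""
    attr_bytes = sum(len(k.encode("utf-8")) + len(v.encode("utf-8"))
                     for k, v in attributes.items())
    raw = len(data) + attr_bytes + 20 + message_id_len
    return max(raw, 1000)  # at least 1000 bytes are billed per message

print(estimate_billable_size(
    b"Test_message",
    {"dummyField1": "dummyFieldValue1", "dummyField2": "dummyFieldValue2"},
))  # → 1000 (raw estimate is 102 bytes, below the billing minimum)
```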

Regarding your fourth question, there's no way to extract single data points from a distribution metric (unless you have published only one message during the time period of the query, in which case the mean would tell you the size of that one message).