
I was testing stream processing with Google Cloud Pub/Sub: a publisher sends messages to a topic, an Apache Beam pipeline reads them from Pub/Sub, and I check them with beam.Map(print).

Reading the messages from Pub/Sub works, but an error occurs after all of the messages have been read.

- This code publishes messages from the publisher to the topic:

from google.cloud import pubsub_v1
from google.cloud import bigquery
import time

# TODO(developer)
project_id = "your-project-id"
topic_id = "your-topic-id"

# Construct a BigQuery client object.
client = bigquery.Client()

# Configure the batch to publish as soon as there are ten messages,
# one kilobyte of data, or one second has passed.
batch_settings = pubsub_v1.types.BatchSettings(
    max_messages=10,  # default 100
    max_bytes=1024,  # default 1 MB
    max_latency=1,  # default 10 ms
)
publisher = pubsub_v1.PublisherClient(batch_settings)
topic_path = publisher.topic_path(project_id, topic_id)

query = """
    SELECT *
    FROM `[bigquery-schema.bigquery-dataset.bigquery-tablename]`
    LIMIT 20
"""
query_job = client.query(query)

# Resolve the publish future in a separate thread.
def callback(topic_message):
    message_id = topic_message.result()
    print(message_id)

print("The query data:")
for row in query_job:
    data = u"category={}, language={}, count={}".format(row[0], row[1], row[2])
    print(data)
    data = data.encode("utf-8")
    time.sleep(1)
    topic_message = publisher.publish(topic_path, data=data)
    topic_message.add_done_callback(callback)

print("Published messages with batch settings.")
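For reference, each message published above is a UTF-8 byte string of the form `category=..., language=..., count=...`. A minimal sketch of how the reading side could turn such a payload back into a dictionary is shown here. `parse_payload` is a hypothetical helper, not part of Pub/Sub or Beam, and it assumes that no value contains `", "` or `"="`.

```python
def parse_payload(payload: bytes) -> dict:
    """Decode one published message into a dict of strings.

    Assumes the "key=value, key=value" layout produced by the publisher
    code and that values contain neither ", " nor "=".
    """
    text = payload.decode("utf-8")
    record = {}
    for field in text.split(", "):
        # partition() splits on the first "=" only.
        key, _, value = field.partition("=")
        record[key] = value
    return record


# Possible use inside a pipeline (hypothetical step name):
#   | "Parse" >> beam.Map(parse_payload)
```

All values stay strings, so `count` would still need an `int()` conversion if it is used numerically.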

- Apache Beam code for reading and processing data from Pub/Sub:

# Copyright 2019 Google LLC.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#       http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# [START pubsub_to_gcs]
import argparse
import datetime
import json
import logging
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
import apache_beam.transforms.window as window

pipeline_options = PipelineOptions(
    streaming=True,
    save_main_session=True,
    runner='DirectRunner',
    return_immediately=True,
    initial_rpc_timeout_millis=25000,
)

class GroupWindowsIntoBatches(beam.PTransform):
    """A composite transform that groups Pub/Sub messages based on publish
    time and outputs a list of dictionaries, where each contains one message
    and its publish timestamp.
    """

    def __init__(self, window_size):
        # Convert minutes into seconds.
        self.window_size = int(window_size * 60)

    def expand(self, pcoll):
        return (
            pcoll
            # Assigns window info to each Pub/Sub message based on its
            # publish timestamp.
            | "Window into Fixed Intervals"
            >> beam.WindowInto(window.FixedWindows(self.window_size))
            | "Add timestamps to messages" >> beam.ParDo(AddTimestamps())
            # Use a dummy key to group the elements in the same window.
            # Note that all the elements in one window must fit into memory
            # for this. If the windowed elements do not fit into memory,
            # please consider using `beam.util.BatchElements`.
            # https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.util.html#apache_beam.transforms.util.BatchElements
            | "Add Dummy Key" >> beam.Map(lambda elem: (None, elem))
            | "Groupby" >> beam.GroupByKey()
            | "Abandon Dummy Key" >> beam.MapTuple(lambda _, val: val)
        )


class AddTimestamps(beam.DoFn):
    def process(self, element, publish_time=beam.DoFn.TimestampParam):
        """Processes each incoming windowed element by extracting the Pub/Sub
        message and its publish timestamp into a dictionary. `publish_time`
        defaults to the publish timestamp returned by the Pub/Sub server. It
        is bound to each element by Beam at runtime.
        """

        yield {
            "message_body": element.decode("utf-8"),
            "publish_time": datetime.datetime.utcfromtimestamp(
                float(publish_time)
            ).strftime("%Y-%m-%d %H:%M:%S.%f"),
        }

class WriteBatchesToGCS(beam.DoFn):
    def __init__(self, output_path):
        self.output_path = output_path
    def process(self, batch, window=beam.DoFn.WindowParam):
        """Write one batch per file to a Google Cloud Storage bucket. """

        ts_format = "%H:%M"
        window_start = window.start.to_utc_datetime().strftime(ts_format)
        window_end = window.end.to_utc_datetime().strftime(ts_format)
        filename = "-".join([self.output_path, window_start, window_end])
        with beam.io.gcp.gcsio.GcsIO().open(filename=filename, mode="w") as f:
            for element in batch:
                f.write("{}\n".format(json.dumps(element)).encode("utf-8"))

class test_func(beam.DoFn):
    def __init__(self, delimiter=','):
        self.delimiter = delimiter
    def process(self, topic_message):
        print(topic_message)

def run(input_topic, output_path, window_size=1.0, pipeline_args=None):
    # `save_main_session` is set to true because some DoFn's rely on
    # globally imported modules.
    pipeline_options = PipelineOptions(
        pipeline_args, streaming=True, save_main_session=True
    )

    with beam.Pipeline(options=pipeline_options) as pipeline:
        (
            pipeline
            | "Read PubSub Messages"
            >> beam.io.ReadFromPubSub(topic=input_topic)
            | "Pardo" >> beam.ParDo(test_func(','))
        )

if __name__ == "__main__":  # noqa
    input_topic = 'projects/[project-id]/topics/[pub/sub-name]'
    output_path = 'gs://[bucket-name]/[file-directory]'
    run(input_topic, output_path, 2)
# [END pubsub_to_gcs]

As a temporary measure, I set return_immediately=True, but this is not a fundamental solution either. Thank you for reading.

Hello, could you clarify what you mean by "error occurred after reading the messages all"? Can you also provide the error message? Have you followed any documentation? Thank you! - aga
@muscat Hi, the error occurs once Apache Beam has read all of the messages from Pub/Sub. Here is the documentation related to the error: cloud.google.com/pubsub/docs/reference/error-codes Thank you! - Quack

1 Answer


This seems to be a known issue in the Pub/Sub client libraries that was reported in another SO thread. It looks like it was recently addressed in version 1.4.2, but that version is not yet included in the Beam dependencies, which still pin google-cloud-pubsub>=0.39.0,<1.1.0.
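To see which side of that pin an environment is on, a small check like the following may help. It is only a sketch: `within_beam_pin` and `installed_pubsub_version` are hypothetical helper names, the version parsing is deliberately simple (numeric components only), and the range is hard-coded from the pin quoted above rather than read from Beam itself.

```python
from importlib import metadata


def parse_version(version: str) -> tuple:
    """Turn '1.4.2' into (1, 4, 2); suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad to three components so '0.39' compares like '0.39.0'.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def within_beam_pin(version: str) -> bool:
    """True if the version satisfies >=0.39.0,<1.1.0."""
    return (0, 39, 0) <= parse_version(version) < (1, 1, 0)


def installed_pubsub_version():
    """Return the installed google-cloud-pubsub version, or None if absent."""
    try:
        return metadata.version("google-cloud-pubsub")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_pubsub_version()
    if found is None:
        print("google-cloud-pubsub is not installed")
    else:
        print(found, "within Beam pin:", within_beam_pin(found))
```

If the installed version falls inside the pinned range, the environment does not have the 1.4.2 fix mentioned above.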

I did some research and found that DataflowRunner appears to handle this error better than DirectRunner, which is maintained by the Apache Beam team. The issue has already been reported on the Beam site, and it is not resolved yet.
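If you want to try the same pipeline on DataflowRunner, one option is to pass the runner flags through the `pipeline_args` parameter that `run()` in the question already accepts. The sketch below only builds that argument list. `build_pipeline_args` is a hypothetical helper, and the project, region and bucket values are placeholders you would need to replace.

```python
def build_pipeline_args(runner="DirectRunner", project=None, region=None,
                        temp_location=None):
    """Build a list of Beam command-line flags for the chosen runner.

    --streaming is left out because run() in the question already passes
    streaming=True to PipelineOptions.
    """
    args = [f"--runner={runner}"]
    if runner == "DataflowRunner":
        required = {
            "project": project,
            "region": region,
            "temp_location": temp_location,
        }
        missing = [name for name, value in required.items() if not value]
        if missing:
            raise ValueError(f"DataflowRunner needs: {', '.join(missing)}")
        args += [f"--{name}={value}" for name, value in required.items()]
    return args


# Example call with placeholder values:
#   run(input_topic, output_path, 2,
#       pipeline_args=build_pipeline_args(
#           "DataflowRunner",
#           project="[project-id]",
#           region="[region]",
#           temp_location="gs://[bucket-name]/tmp",
#       ))
```

Running on Dataflow launches a billable streaming job, so it is worth cancelling the job once the test is done.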

Also, please note that the troubleshooting guide for the DEADLINE_EXCEEDED error can be found here. You can check whether any of its suggestions help in your case, such as upgrading to the latest version of the client library.