1
votes

I'm looking to create a publisher that streams tweets containing a certain hashtag and sends them to a Pub/Sub topic.

The tweets will then be ingested with Cloud Dataflow and loaded into a BigQuery table.

In the following article, they do something similar: the publisher is hosted in a Docker image on a Google Compute Engine instance.

Can anyone recommend alternative Google Cloud resources that could host the publisher code more simply and avoid the need to create a Dockerfile, etc.? The publisher would need to run constantly. Would Cloud Run, for example, be a suitable alternative?

1
Can you be more precise about your last statement, "The publisher would need to run constantly"? One of the big advantages of using a serverless component is that it is not necessarily required to run constantly; instead, it can spawn and scale as needed. Is there another actor that would trigger this component, or should this component scrape some data, for example, and decide when it is time to publish a message to a topic? – Neo Anderson
Thanks Neo, I meant it would need to run constantly (i.e. be listening 24/7). My idea was that as soon as relevant tweets are discovered, they should be passed to the Pub/Sub topic. – pablowilks2
If your module must be listening, then it could be an event-driven module (Cloud Functions would be the easiest way to implement something). If it must discover something, that is a different story and there are some other options. I hope I have now asked my question in a better manner. – Neo Anderson
Note that using Cloud Run also requires packaging and deploying a container, so you'll need to create a Dockerfile to use it. As Neo said, publishing using Cloud Functions would be a good choice if you have some event-driven way to trigger your Function to run. Otherwise, you could perhaps trigger your Function to run periodically using Cloud Scheduler: cloud.google.com/scheduler/docs/creating – Lauren

1 Answer

1
votes

There are some workarounds I can think of:

  1. A quick way to avoid a container architecture is to keep the on_data method inside a loop, for example by using something like while(True), or to start a Stream as explained in Create your Python script, and to run the code on a Compute Engine instance in the background with nohup python -u myscript.py. Alternatively, follow the steps described in Script on GCE to capture tweets, which uses tweepy.Stream to start the streaming (see the first sketch after this list).

  2. You might want to reconsider the Dockerfile option, since its configuration may not be that difficult. See Tweets & pipelines, where a script reads the data and publishes to Pub/Sub; only 9 lines are used for the Dockerfile, and it is deployed to App Engine using Cloud Build (a minimal Dockerfile sketch follows this list). Another implementation with a Dockerfile, in case it helps, is twitter-for-bigquery; it requires more specific steps and more configuration.

  3. Cloud Functions is another option; in the guide Serverless Twitter with Google Cloud you can check the Design section to see whether it fits your use case (a scheduled-function sketch follows this list).

  4. Airflow with Twitter Scraper could work for you, since Cloud Composer is a managed service for Airflow and you can create an Airflow environment quickly. It uses the Twint library; check the Technical section in the link for more details (a DAG skeleton is sketched after this list).

  5. Stream Twitter Data into BigQuery with Cloud Dataprep is a workaround that puts aside complex configurations. In this case the job won't run constantly, but it can be scheduled to run every few minutes.
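
For option 1, the constantly running publisher could look roughly like the sketch below. It is only a minimal illustration, assuming tweepy 3.x (StreamListener was removed in tweepy 4), the google-cloud-pubsub client library, and placeholder values such as my-project, the tweets topic, the hashtag and the credential environment variables:

```python
# publisher.py - sketch of a 24/7 streaming publisher (tweepy 3.x assumed).
# Project ID, topic name, hashtag and credential variables are placeholders.
import json
import os

import tweepy
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "tweets")


class PubSubListener(tweepy.StreamListener):
    def on_status(self, status):
        # Publish the raw tweet JSON to the Pub/Sub topic as soon as it arrives.
        publisher.publish(topic_path, json.dumps(status._json).encode("utf-8"))

    def on_error(self, status_code):
        # Returning False disconnects the stream, e.g. when rate limited (420).
        return status_code != 420


auth = tweepy.OAuthHandler(os.environ["API_KEY"], os.environ["API_SECRET"])
auth.set_access_token(os.environ["ACCESS_TOKEN"], os.environ["ACCESS_SECRET"])

stream = tweepy.Stream(auth=auth, listener=PubSubListener())
stream.filter(track=["#somehashtag"])  # blocks and listens until interrupted
```

You would then start it on the VM with nohup python -u publisher.py &, as in option 1.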
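
For option 2, the container image for such a publisher really can be that small. The following is only an illustrative sketch; publisher.py and requirements.txt are assumed file names from the sketch above, not files from the referenced repositories:

```dockerfile
# Minimal Dockerfile sketch for the publisher above (file names are assumptions).
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY publisher.py .
CMD ["python", "-u", "publisher.py"]
```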
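
For option 3 (and Lauren's Cloud Scheduler suggestion in the comments), the function could poll the search API on a schedule instead of listening 24/7. Again just a sketch under assumptions: an HTTP-triggered Python Cloud Function, tweepy 3.x (api.search), and placeholder environment variables, topic and hashtag:

```python
# main.py - sketch of an HTTP-triggered Cloud Function invoked periodically by
# Cloud Scheduler. Environment variables, topic and hashtag are placeholders.
import json
import os

import tweepy
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(os.environ["GCP_PROJECT"], "tweets")


def publish_tweets(request):
    auth = tweepy.OAuthHandler(os.environ["API_KEY"], os.environ["API_SECRET"])
    auth.set_access_token(os.environ["ACCESS_TOKEN"], os.environ["ACCESS_SECRET"])
    api = tweepy.API(auth)

    # Fetch recent tweets with the hashtag and publish each one to Pub/Sub.
    for status in api.search(q="#somehashtag", result_type="recent", count=100):
        publisher.publish(topic_path, json.dumps(status._json).encode("utf-8"))

    return "ok"
```

A Cloud Scheduler job would then hit the function's HTTP endpoint every few minutes.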
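
For option 4, the Composer/Airflow version boils down to a small DAG that runs the scraping-and-publishing step on a schedule. This is a skeleton only: scrape_and_publish is a hypothetical callable that would wrap the Twint search and the Pub/Sub publish, and the schedule and IDs are assumptions:

```python
# dags/twitter_to_pubsub.py - DAG skeleton (Airflow 1.10.x style imports,
# as used by Cloud Composer). All IDs and the schedule are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def scrape_and_publish(**kwargs):
    # Hypothetical task: run a Twint search for the hashtag and publish the
    # results to the Pub/Sub topic (see the Technical section of the linked post).
    pass


default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="twitter_to_pubsub",
    default_args=default_args,
    start_date=datetime(2020, 1, 1),
    schedule_interval="*/15 * * * *",  # every 15 minutes; adjust as needed
    catchup=False,
) as dag:
    PythonOperator(
        task_id="scrape_and_publish",
        python_callable=scrape_and_publish,
        provide_context=True,
    )
```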