Real-time data processing can be a tricky business. Your data pipelines can be subject to network issues, data arriving late, out of order or not at all, as well as wild swings in demand. You might need to collect this data from a plethora of devices and distribute it to many services for analysis, processing or notification purposes. At peak periods, you might find some components of your pipeline bursting at the seams, unable to cope with the proverbial data firehose pointed at them. If only there were some way to decouple your data producers and consumers, some way to easily collate all those producers and route them to your consumers, some way to control that pesky firehose so that it runs at a rate more palatable to your consumers. Well, wonder no more, fellow data wrangler – there is! Enter Cloud Pub/Sub.
What is Cloud Pub/Sub?
Cloud Pub/Sub is essentially a message broker for your applications and services. In Pub/Sub terminology, you can categorise these applications and services as either producers of data, known as publishers, or consumers of data, known as subscribers. In Cloud Pub/Sub, publishers send packets of data (or messages) to a topic, which stores these messages until they are consumed by one or more subscribers. Data is not consumed from the topic directly, but instead through a subscription, which either pushes messages to the subscriber or is pulled from by the subscriber. To ensure reliable delivery of messages, Cloud Pub/Sub requires subscribers to acknowledge that messages have been delivered. Only when all subscribers have acknowledged receipt of a message is the message removed from a Pub/Sub topic.
Cloud Pub/Sub provides at-least-once delivery. What this means is that in the event that an attempt to deliver a message goes unacknowledged, Pub/Sub will repeatedly attempt to re-deliver the message until it is acknowledged by the subscriber. A message is considered unacknowledged if no acknowledgement is received within the acknowledgement deadline time limit. The acknowledgement deadline can be configured to suit your use case, but the value must be between 10 and 600 seconds. Be aware, however, that Cloud Pub/Sub is not a storage service. Messages are not kept indefinitely, even if they are never acknowledged by subscribers. Unacknowledged messages are held for a maximum of 7 days before being deleted.
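To build some intuition for the at-least-once model, here is a minimal, self-contained Python sketch. This is a toy in-memory model, not the real Pub/Sub API: it shows how a message that isn't acknowledged within its deadline becomes eligible for redelivery, while an acknowledged message is removed for good.

```python
import time

class ToySubscription:
    """A toy model of at-least-once delivery; not the real Pub/Sub API."""

    def __init__(self, ack_deadline_secs):
        self.ack_deadline = ack_deadline_secs
        self.outstanding = {}  # message_id -> (data, delivery_time)
        self.backlog = []      # messages awaiting (re)delivery

    def publish(self, message_id, data):
        self.backlog.append((message_id, data))

    def pull(self):
        """Deliver backlog messages plus any whose ack deadline has lapsed."""
        now = time.monotonic()
        # Messages whose deadline expired go back into the backlog for redelivery.
        for mid, (data, delivered_at) in list(self.outstanding.items()):
            if now - delivered_at > self.ack_deadline:
                del self.outstanding[mid]
                self.backlog.append((mid, data))
        delivered, self.backlog = self.backlog, []
        for mid, data in delivered:
            self.outstanding[mid] = (data, now)
        return delivered

    def ack(self, message_id):
        # Only acknowledged messages are removed permanently.
        self.outstanding.pop(message_id, None)

sub = ToySubscription(ack_deadline_secs=0.1)
sub.publish("m1", "hello")
first = sub.pull()    # delivered once...
time.sleep(0.2)       # ...but never acknowledged within the deadline
second = sub.pull()   # so the same message is delivered again
sub.ack("m1")         # acknowledged: no further redelivery
```

The real service works at much larger deadlines (10 to 600 seconds, as above) and persists messages durably, but the acknowledge-or-be-redelivered contract is the same.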
When working with Pub/Sub, you should consider the implications of your subscribers receiving a message multiple times and whether the potential duplication of data is acceptable for your use case. If you need exactly-once delivery, then you could consider using a tool like Cloud Dataflow as an intermediary layer between your Pub/Sub topic and your chosen destination. Cloud Dataflow is capable of processing Pub/Sub messages in real-time and will also allow you to apply transformation logic to your data, so this setup is a good choice if you need to implement a real-time ETL pipeline.
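Because redelivery can produce duplicates, subscribers are often written to be idempotent. One common approach (a sketch, not part of the Pub/Sub client library) is to track the IDs of messages already processed and skip anything seen before:

```python
class IdempotentConsumer:
    """Processes each message at most once by remembering message IDs."""

    def __init__(self):
        self.seen_ids = set()  # in production this would be durable storage
        self.results = []

    def handle(self, message_id, data):
        if message_id in self.seen_ids:
            return False       # duplicate delivery: safely ignored
        self.seen_ids.add(message_id)
        self.results.append(data.upper())  # stand-in for real processing
        return True

consumer = IdempotentConsumer()
consumer.handle("m1", "hello")
consumer.handle("m1", "hello")  # a redelivered duplicate is a no-op
consumer.handle("m2", "world")
```

With a consumer like this, at-least-once delivery behaves like exactly-once from the point of view of your downstream data.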
Cloud Pub/Sub is a serverless service, so you won’t need to worry about managing any infrastructure as part of the deployment process. Because of this, you can easily scale Pub/Sub to deal with vast amounts of data. There are no hard limits on the number of publishers and subscribers, but each topic is restricted to a maximum of 10,000 subscriptions. There is also a project-wide cap of 10,000 topics and subscriptions, as well as a 1,000 MB/s publisher throughput limit and a 2,000 MB/s subscriber throughput limit.
Unlike many other services, you don’t need to specify a location when creating a Cloud Pub/Sub topic. Under the hood, Pub/Sub automatically handles the distribution of messages to data centers in a way that enables low latency message transfer, even when producers and consumers are globally distributed. As a result, messages in a topic may be stored across several regions, though any given message will only be present in one of those regions.
A useful feature of Cloud Pub/Sub is the ability to create snapshots. Snapshots allow you to replay any messages received after the snapshot was created, as well as any messages that were unacknowledged at that point. This feature can protect you from loss of data if, for example, you deploy a subscriber that malfunctions and fails to handle messages in an expected manner. In this situation, after applying any fixes, you can use the seek feature to revert to a snapshot taken before the faulty subscriber was deployed.
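Conceptually, a snapshot records which messages were unacknowledged at the moment it was taken, and seeking restores that state (plus anything published since). The following toy sketch, again an in-memory model rather than the real API, captures the idea:

```python
class ToySeekableSubscription:
    """Toy model of Pub/Sub snapshots and seek; not the real API."""

    def __init__(self):
        self.messages = []   # (message_id, data) in publish order
        self.unacked = set() # IDs not yet acknowledged

    def publish(self, message_id, data):
        self.messages.append((message_id, data))
        self.unacked.add(message_id)

    def pull(self):
        return [(mid, d) for mid, d in self.messages if mid in self.unacked]

    def ack(self, message_id):
        self.unacked.discard(message_id)

    def snapshot(self):
        # Capture which messages were unacked now, and where the publish
        # cursor was; anything published later is replayable too.
        return set(self.unacked), len(self.messages)

    def seek(self, snap):
        unacked_then, publish_cursor = snap
        # Everything unacked at snapshot time, plus everything published
        # since, becomes deliverable again.
        self.unacked = unacked_then | {
            mid for mid, _ in self.messages[publish_cursor:]
        }

sub = ToySeekableSubscription()
snap = sub.snapshot()   # taken before any messages arrive
sub.publish("m1", "hello")
sub.ack("m1")           # a buggy subscriber "handled" it incorrectly
assert sub.pull() == [] # the message is gone...
sub.seek(snap)          # ...until we seek back to the snapshot
```

After the seek, the previously acknowledged message is deliverable again, which is exactly the safety net described above.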
When Should I Use Cloud Pub/Sub?
You can use Cloud Pub/Sub when you need to stream data from one or more source applications to one or more destination applications. Messages sent via Pub/Sub must be smaller than 10MB in size, so you’ll need to consider another approach to decoupling your publishers and subscribers if this isn’t sufficient. Some example use cases for Cloud Pub/Sub include:
Event notification – You can use Pub/Sub to publish alerts to subscribers and then trigger some process in response to the notification, using a service such as Cloud Functions. For example, you could configure billing alerts on a Google Cloud project that send notifications to Pub/Sub when your budget has been reached. You could then trigger a Cloud Function to switch off billing for your project to prevent costs from rising further. Note that this will also halt any active resources, so don’t do this in a production environment.
Real-time Data Processing – Pub/Sub can push data to Cloud Dataflow or Cloud Functions for processing in near real-time. You could then stream that data into an analytical platform, such as BigQuery, to make it available for immediate consumption by end-users.
Logging – If your applications write logs in multiple places, you might want to consolidate them into a single, centralised platform. You can export your logs in real-time from multiple sources by publishing them to a Pub/Sub topic, and then subscribe to that log stream from your central logging platform.
The cost of running Cloud Pub/Sub depends on the volume of data published and delivered between publishers and subscribers. The first 10 GiB is free, and volume is then charged at $40 per TiB after that. Messages stored for seek purposes, such as snapshots, are charged at $0.27 per GB. Depending on where your publishers and subscribers are geographically located, you may also be subject to data egress charges. You can read more about Pub/Sub pricing here.
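As a rough sketch of the arithmetic, using the prices quoted above (which may well have changed since this post was written, so always check the pricing page for current figures):

```python
def monthly_throughput_cost_usd(volume_gib):
    """Estimate Pub/Sub message delivery cost.

    Uses the prices quoted in this post: first 10 GiB free,
    then $40 per TiB. These figures may be out of date.
    """
    FREE_TIER_GIB = 10
    PRICE_PER_TIB_USD = 40.0
    billable_gib = max(0, volume_gib - FREE_TIER_GIB)
    return billable_gib / 1024 * PRICE_PER_TIB_USD

monthly_throughput_cost_usd(10)    # entirely within the free tier: $0
monthly_throughput_cost_usd(1034)  # 1 TiB billable after the free 10 GiB: $40
```

Note this only covers throughput; snapshot storage and any egress charges would be billed on top.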
Getting Started With Cloud Pub/Sub
Setting Up Pub/Sub
Let’s open the Cloud Console and create our first Pub/Sub topic. From the navigation menu, scroll to the Big Data section and open the Pub/Sub browser. From here, ensure you have the Topics tab open, and click the Create Topic button. This will open a pop-up window asking for an identifier for your topic and your preferred encryption method. Cloud Pub/Sub offers end-to-end encryption of your messages using a Google-provided key by default, but you can supply your own encryption key if you prefer. For this example, you can give your topic an ID of your choice and leave the encryption method as default. Click the Create button to create your first topic.
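If you prefer the command line, the same topic can be created with the gcloud CLI, assuming it is installed and authenticated against your project (`my-first-topic` is just a placeholder name):

```shell
# Create a topic; encryption defaults to a Google-managed key.
gcloud pubsub topics create my-first-topic
```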
Having a Pub/Sub topic isn’t much use to us without a subscription. Be aware that if a message arrives at a topic that has no subscriptions, that message cannot be delivered, even if you add a subscription later on. You must create a subscription before your messages arrive if you want those messages to be available to a subscriber. To add a subscription, click the Subscriptions tab in the Pub/Sub browser and then click the Create Subscription button. In the subscription form that opens, enter the following:
- In the Subscription ID field, enter an ID of your choice.
- In the Select a Cloud Pub/Sub Topic field, select the topic you just created.
- Leave the Delivery Type as Pull.
- Under Subscription Expiry, allow your subscription to expire after 31 days (in case you forget to delete it).
- Set the Acknowledgement Deadline to 60 seconds.
- Leave the remaining settings as they are and click Create.
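The gcloud equivalent of the steps above would look something like this (`my-first-sub` and `my-first-topic` are placeholder names; substitute your own):

```shell
# Create a pull subscription with a 60-second ack deadline
# that expires after 31 days of inactivity.
gcloud pubsub subscriptions create my-first-sub \
  --topic=my-first-topic \
  --ack-deadline=60 \
  --expiration-period=31d
```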
To complete our setup, let’s go ahead and create a snapshot to protect ourselves from accidental data loss. From the Pub/Sub browser, open the Snapshots tab and click the Create Snapshot button. In the pop-up window that follows, choose the subscription that you just created from the drop-down field, and in the Snapshot ID field, name your snapshot my-first-snapshot. Any messages we publish to our topic will now be replayable by using the seek feature.
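From the command line, the same snapshot can be created against our subscription (again, `my-first-sub` is a placeholder for whatever subscription ID you chose):

```shell
# Snapshot the subscription so messages published from now on can be replayed.
gcloud pubsub snapshots create my-first-snapshot --subscription=my-first-sub
```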
Publishing & Subscribing
We’re now ready to publish some messages. In the Pub/Sub browser, navigate back to your topic by opening the Topics tab and clicking your topic in the list. In the topic details page that follows, click the Publish Message button to publish a message. In the publish message form you’ll notice you have the option to add a message body and, optionally, some attributes. In Pub/Sub, messages consist of a data payload and key-value metadata attributes that you can use to describe the message. In the Message Body field we’ll type the phrase ‘hello from message 1’ as our payload. Under message attributes, click Add an Attribute, then add a key-value pair with ‘language’ as the key and ‘english’ as the value. For good measure, repeat the above steps a few times, incrementing the number on the end of the hello message so we’ve got a few messages to work with.
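Publishing from the command line works too; the payload and attributes map onto flags (topic name is a placeholder as before):

```shell
# Publish a message with a data payload and a 'language' attribute.
gcloud pubsub topics publish my-first-topic \
  --message="hello from message 1" \
  --attribute=language=english
```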
Next, we’ll act as a subscriber by pulling our messages from the subscription we created for our Pub/Sub topic. Navigate to your subscription by opening the Subscriptions tab and clicking your subscription. From here, click the View Messages button to open the messages panel. Check the Enable Ack Messages box so that we can acknowledge receipt of our messages, and then press the Pull button. In the message list below, you should see the messages you published earlier. To remove these messages from the topic, click the Ack button next to each one. If you wait too long, the acknowledgement deadline will expire and you’ll have to pull your messages again before you can acknowledge them.
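The CLI equivalent pulls and acknowledges in one step (subscription name is a placeholder):

```shell
# Pull up to 5 messages and acknowledge them automatically.
gcloud pubsub subscriptions pull my-first-sub --auto-ack --limit=5
```

Note that `--auto-ack` acknowledges everything it pulls, so only use it when you’re happy for those messages to be removed.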
Working with Snapshots
Once you’ve acknowledged all your messages, try pulling from the subscription again. You’ll notice that you won’t get any more messages. This is because they are removed from the topic once they have been acknowledged by the subscriber. Let’s say that something went wrong with our subscriber processing logic and we didn’t manage to process those messages correctly. Fortunately, we created a snapshot before publishing our messages, so in this instance we can replay our messages once we’ve fixed our error. Close the messages panel and click the Replay Messages button. In the replay messages pop-up, we need to tell Pub/Sub to seek the state of our topic at the point that our snapshot was created. Under the seek section, choose To a snapshot, and then choose our my-first-snapshot snapshot. Finally, click Seek to revert the state of the topic.
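From the command line, the seek looks like this (both names are placeholders matching the ones used earlier):

```shell
# Seek the subscription back to the snapshot, making every message
# unacknowledged at (or published since) that point deliverable again.
gcloud pubsub subscriptions seek my-first-sub --snapshot=my-first-snapshot
```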
Re-open the messages panel by clicking the View Messages button and hit the Pull button once more. You should see that your messages are once again displayed in the list. Thanks to our snapshot, we’ve successfully managed to mitigate a potential loss of data. Finish up by deleting all the resources we’ve created. Flick through each tab in the Pub/Sub browser, click the checkbox next to each resource and then click the Delete button at the top of the page.
We’ve now covered the fundamentals of working with Cloud Pub/Sub. Let’s recap some of the key points of this post to wrap things up:
- Cloud Pub/Sub is a messaging service for your applications and services.
- Publishers are applications/services that produce data. They publish data as messages to a Pub/Sub topic.
- Subscribers are applications/services that consume data. They consume data via a subscription to a Pub/Sub topic.
- A single Pub/Sub topic can receive messages from many publishers and deliver them to many subscribers.
- Subscribers must acknowledge receipt of messages within the acknowledgement deadline time limit. If messages are not acknowledged in this time, they are resent.
- Unacknowledged messages are deleted from a topic after 7 days.
- Snapshots can capture the acknowledgment state of messages at a given point in time. Any messages unacknowledged or published after a snapshot is created can be resent by seeking to the snapshot.
- Pub/Sub messages must be smaller than 10MB in size.
- Cloud Pub/Sub can be used for processing data in (near) real-time.