Design a Notification System with AWS Serverless — Notes and Highlights

Building a scalable, highly available, and reliable notification system using AWS Serverless, EventBridge, and SQS.

In recent years, the notification function has emerged as a prominent feature in many applications. Personally, I’ve taken on the challenge of developing notification systems for several products. One of them is the Salon Manager application.

Building a scalable system capable of dispatching millions of notifications daily is no small feat. This is precisely why I find it valuable to document key insights and lessons from my experiences.

Achieving this task demands a profound comprehension of the notification ecosystem; without it, the requirements can easily become nebulous and open-ended.

Understand the notification system and establish the design scope

A notification alerts a user to important information such as product updates, event reminders, and offers. It has become a standard part of an application's feature list.

A notification is more than just a mobile push notification. Depending on the recipient, there are generally five notification formats:

  • Mobile push notification
  • SMS message
  • Email
  • Web push notification
  • 3rd-party app notification (Slack-like apps)

To clarify the requirements and the system context, here are the requirements I have worked with in practice.

  • The system supports push notifications, SMS messages, email, and 3rd-party app notifications.
  • It’s a near real-time system. We want users to receive notifications as soon as possible. However, if the system is under a high workload, a slight delay is acceptable.
  • Supported devices: mobile devices (iOS and Android) and laptops/desktops.
  • Notifications can be triggered by a client application event or scheduled on the server side.
  • Users can opt out if they don’t want to receive notifications in the future.
  • Roughly, I want to send out 10M push notifications, 5M emails, and 1M SMS messages each day.
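
To get a feel for what these daily volumes mean for the design, here is a quick back-of-the-envelope calculation. It assumes traffic is spread evenly over the day, which real traffic never is, so treat the numbers as a floor for peak capacity rather than a target.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Daily volumes taken from the requirements above.
DAILY_VOLUME = {
    "push": 10_000_000,
    "email": 5_000_000,
    "sms": 1_000_000,
}


def average_rate_per_second(daily_count: int) -> float:
    """Average messages per second if traffic were spread evenly over a day."""
    return daily_count / SECONDS_PER_DAY


AVERAGE_RATES = {
    channel: average_rate_per_second(count)
    for channel, count in DAILY_VOLUME.items()
}
TOTAL_AVERAGE_RATE = sum(AVERAGE_RATES.values())

if __name__ == "__main__":
    for channel, rate in AVERAGE_RATES.items():
        print(f"{channel}: {rate:.1f} msg/s on average")
    print(f"total: {TOTAL_AVERAGE_RATE:.1f} msg/s on average")
```

That works out to roughly 116 push notifications, 58 emails, and 12 SMS messages per second, or about 185 messages per second in total.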

Propose high-level design

First, we need to figure out a high-level design that supports various notification types: SMS messages, Email, iOS push notifications, Android push notifications, and Slack app notifications.

Then the system should be structured with the following components:

  • Configuration of different notification types
  • Gathering contact information flow
  • Notification sending and receiving flow

High-level view of the notification types on AWS

We start by looking at how each notification type works at a high level.

SMS Messaging with AWS SNS

We primarily need four components to send an SMS message:

  • Producer: builds and sends notification requests to the SMS service. To construct an SMS notification request, the producer provides the user’s phone number with country code and the SMS subject/content as a JSON dictionary payload.
  • SMS Service: an AWS Lambda function that runs custom business logic and triggers SMS sending.
  • AWS SNS or 3rd-party SMS service: AWS SNS is the AWS service for sending SMS. To improve availability and resilience, I added a 3rd-party SMS service option such as Twilio or Nexmo. By default, the SMS service invokes AWS SNS, but if something goes wrong, we can easily switch to another SMS service.
  • SMS device: the end client that receives the SMS.
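
The "SNS first, 3rd-party second" idea can be sketched as a small fallback loop. This is a minimal illustration, not the production code: the provider callables are hypothetical wrappers (one around SNS, one around a 3rd-party API) that I assume raise an exception on failure.

```python
from typing import Callable, List, Sequence, Tuple

# A provider is any callable that sends one SMS and raises on failure.
# In a real Lambda the first provider would wrap SNS and the second a
# 3rd-party API; both wrappers are hypothetical and not shown here.
SmsProvider = Callable[[str, str], None]


class AllProvidersFailed(Exception):
    """Raised when every configured SMS provider rejected the message."""


def send_sms_with_fallback(
    providers: Sequence[Tuple[str, SmsProvider]],
    phone_number: str,
    message: str,
) -> str:
    """Try providers in order and return the name of the one that succeeded."""
    errors: List[str] = []
    for name, send in providers:
        try:
            send(phone_number, message)
            return name
        except Exception as exc:  # broad on purpose: any error triggers fallback
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

Keeping the provider list as plain configuration is what makes it easy to switch a provider on or off without touching the sending logic.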

Email notification with AWS SES

The producer should provide the user’s email address and email content to the Email Service function.

Although I set up AWS SES as the default email sender, there is an option to switch to a 3rd-party service such as SendGrid or Mailchimp when a specific requirement calls for it.

iOS push notification with SNS + APNS

The Producer will provide user information such as device tokens and notification content to the Mobile Push Service.

The Mobile Push Service builds notification requests and sends them to SNS. An iOS push notification request should contain the following data:

  • Device token: a unique identifier used to address the push notification
  • Payload: a JSON dictionary in the format defined by APNS

APNS: a remote service provided by Apple that propagates push notifications to iOS devices.

The iOS device is the end client that receives the push notifications.

Android push notification with SNS + FCM

Android has a similar notification flow. Instead of using APNS, Firebase Cloud Messaging (FCM) is used to send push notifications to Android devices.
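
As a rough sketch of what the Mobile Push Service hands to SNS, the function below builds one message that carries both an APNS and an FCM payload. SNS expects a JSON document with a "default" entry plus one JSON-encoded string per platform, and it uses the key "GCM" for FCM endpoints. The function name and the choice of title/body fields are my own assumptions for illustration.

```python
import json


def build_sns_push_message(title: str, body: str) -> str:
    """Build the JSON string passed as Message with MessageStructure="json"."""
    apns_payload = {
        "aps": {"alert": {"title": title, "body": body}, "sound": "default"}
    }
    fcm_payload = {"notification": {"title": title, "body": body}}
    return json.dumps(
        {
            "default": body,  # fallback text, required for JSON-structured messages
            "APNS": json.dumps(apns_payload),  # each platform value is a JSON string
            "GCM": json.dumps(fcm_payload),  # SNS uses the "GCM" key for FCM
        }
    )
```

With boto3, the result would typically be sent with something like `sns.publish(TargetArn=endpoint_arn, Message=message, MessageStructure="json")`, where `endpoint_arn` is the SNS platform endpoint created from the device token.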

Slack app notification

The producer provides the message content and the target topic/channel address to the 3rd-party App Push Service.

AWS SQS acts as a message queue that lets us control the outgoing call rate, because many 3rd-party APIs enforce rate limits. We need to call those APIs politely.
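
The queue only buffers the messages; the worker that drains it still has to pace its calls. One simple way to do that, shown here as a sketch under my own assumptions (the class name and the injectable clock and sleep functions are hypothetical), is to enforce a minimum interval between consecutive API calls.

```python
import time
from typing import Callable, Optional


class MinIntervalThrottle:
    """Ensure at least min_interval seconds pass between consecutive calls."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None

    def wait(self) -> float:
        """Block until the next call is allowed and return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            delay = max(0.0, self._last_call + self.min_interval - now)
        if delay > 0:
            self._sleep(delay)
        self._last_call = now + delay  # the moment the call is actually made
        return delay
```

A worker would call `throttle.wait()` right before each request to the 3rd-party API. In Lambda, capping the function's concurrency is a complementary knob.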

Gathering contact information flow

In order to send notifications, we need to gather various contact details such as mobile device tokens, email addresses, phone numbers, and 3rd-party channel information.

This is a simplified database schema for storing contact info. It’s a single DynamoDB (NoSQL) table with email, phone, device tokens, and external channels.

Contacts table schema:

In particular, the device_tokens field should be stored in JSON format. Here is an example:

[
 {
   "deviceToken": "[device token uuid]",
   "platform": "apns"
 },
 {
   "deviceToken": "[device token uuid]",
   "platform": "fcm"
 }
]

And here is an example of the external_channels field:

[
  {
      "platform": "slack",
      "url": "[unique url to channel]",
      "status": true
  },
  {
      "platform": "another-service",
      "url": "...",
      "status": false
  }
]

A user can have multiple devices and 3rd-party channels, so a single notification can be sent to all of the user’s devices and channels.
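
To illustrate the fan-out, the sketch below flattens one contact record into a list of delivery targets. It uses the field names from the two JSON examples above and assumes that `"status": true` means the channel is switched on; the function name and the shape of the returned targets are hypothetical.

```python
from typing import Any, Dict, List


def delivery_targets(contact: Dict[str, Any]) -> List[Dict[str, str]]:
    """Flatten one contact record into a list of channel/address targets."""
    targets: List[Dict[str, str]] = []
    for device in contact.get("device_tokens", []):
        targets.append(
            {"channel": device["platform"], "address": device["deviceToken"]}
        )
    for channel in contact.get("external_channels", []):
        if channel.get("status"):  # assumption: false means switched off
            targets.append(
                {"channel": channel["platform"], "address": channel["url"]}
            )
    return targets
```

Each returned target can then become its own notification event, so one failing device does not block the others.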

Notification sending and receiving flow

I will present the initial design and then highlight some optimizations.

Notification System:

The best way to read the diagram is from left to right:

External Producers 1 to N: the different services that want to send notifications through the APIs provided by the Notification System. For example, a billing service sends an SMS to remind customers of a payment that is due, or a shopping website sends delivery messages to its customers.

The API Gateway exposes an API to the producers and routes their requests, with the relevant parameters, to the Notification Service (Lambdas).

The Notification Service acts as the backend service. It provides the following functionality:

  • Carry out basic validations to verify email, phone numbers, device tokens, etc.
  • Query the database to fetch the data needed to generate a notification event.
  • Push notification data to the event bus for parallel processing.
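
The basic validation step could look something like the sketch below. The request shape (`channel`, `phone`, `email`, `device_token`, `content`) is an assumption of mine, and the email check is deliberately loose: it only catches obviously malformed addresses.

```python
import re
from typing import Dict, List

# E.164: a plus sign, a non-zero first digit, at most 15 digits in total.
E164_PHONE = re.compile(r"^\+[1-9]\d{1,14}$")
# Loose sanity check only, not a full RFC 5322 validator.
SIMPLE_EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_request(request: Dict[str, str]) -> List[str]:
    """Return a list of validation errors; an empty list means valid."""
    errors: List[str] = []
    channel = request.get("channel")
    if channel == "sms":
        if not E164_PHONE.match(request.get("phone", "")):
            errors.append("phone must be in E.164 format, e.g. +14155550100")
    elif channel == "email":
        if not SIMPLE_EMAIL.match(request.get("email", "")):
            errors.append("email address looks malformed")
    elif channel == "push":
        if not request.get("device_token"):
            errors.append("device_token is required for push notifications")
    else:
        errors.append(f"unsupported channel: {channel!r}")
    if not request.get("content"):
        errors.append("content must not be empty")
    return errors
```

Rejecting bad requests here keeps malformed events out of the event bus, where they would otherwise fail later and end up in a retry loop.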

The Contacts DB: a DynamoDB table that stores data about users, contact info, settings, etc.

EventBridge is the AWS service we use as the event bus. We also need to define event rules that route each event to the right queue.

Here is an example of a notification event. Each detail-type maps to one notification type, so the EventBridge rules match events on this attribute pattern and deliver them to the corresponding SQS queue.

{
  "id": "<required::uuid>",
  "source": "payment_request_event",
  "detail-type": "payment_notification_sms",
  "resources": ["payments"],
  "detail": {...},
  "time": "<required>",
  "region": "<required>",
  "account": "<required>"
}
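
To make the routing more concrete, here is a hedged sketch of the two sides: the entry the Notification Service would pass to the EventBridge PutEvents API, and the rule pattern that selects SMS events. The bus name is hypothetical, and the `matches` helper is a simplified stand-in for EventBridge's own pattern matching (which also supports nested fields, prefixes, and more).

```python
import json
from typing import Any, Dict, List

EVENT_BUS_NAME = "notification-bus"  # hypothetical bus name

# Rule pattern: select SMS payment notifications for the SMS queue.
# In a pattern, each field lists the values it is allowed to take.
SMS_RULE_PATTERN: Dict[str, List[str]] = {
    "source": ["payment_request_event"],
    "detail-type": ["payment_notification_sms"],
}


def build_put_events_entry(detail_type: str, detail: Dict[str, Any]) -> Dict[str, Any]:
    """Build one entry for the Entries list of the PutEvents API."""
    return {
        "EventBusName": EVENT_BUS_NAME,
        "Source": "payment_request_event",
        "DetailType": detail_type,
        "Resources": ["payments"],
        "Detail": json.dumps(detail),  # PutEvents expects Detail as a JSON string
    }


def matches(pattern: Dict[str, List[str]], event: Dict[str, Any]) -> bool:
    """Simplified matcher: every pattern field must allow the event's value."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())
```

With boto3, the entry would be sent with `events.put_events(Entries=[entry])`. Fields such as id, time, region, and account are filled in by EventBridge on delivery.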

The Message Queues: we use them to decouple components. SQS queues serve as buffers when high volumes of notifications have to be sent out. Each notification event type is assigned to its own message queue so that an outage in one sending service does not affect the other notification types.

The Workers: a set of Lambda functions that consume notification events from the SQS queues and send them to the corresponding service.
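
A worker can be sketched as a Lambda handler that walks through the SQS batch and reports only the failed messages back. This assumes the event source mapping has partial batch responses (ReportBatchItemFailures) enabled and that each message body is the EventBridge event as JSON; the `make_handler` factory and the `send` callable are hypothetical names.

```python
import json
from typing import Any, Callable, Dict, List


def make_handler(send: Callable[[Dict[str, Any]], None]):
    """Create a Lambda handler that sends each queued event with send."""

    def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
        failures: List[Dict[str, str]] = []
        for record in event.get("Records", []):
            try:
                notification = json.loads(record["body"])  # the EventBridge event
                send(notification.get("detail", {}))
            except Exception:
                # Report only this message so SQS redelivers it on its own.
                failures.append({"itemIdentifier": record["messageId"]})
        return {"batchItemFailures": failures}

    return handler
```

Returning the failed message ids instead of raising means one bad notification does not force the whole batch to be retried.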

The SNS or 3rd-party services: these services are responsible for delivering notifications to consumers. When integrating with 3rd-party services, we need to pay attention to extensibility and high availability. Extensibility means the system is flexible enough to switch a third-party service on or off easily. High availability matters because a third-party service might become unavailable; when that happens, we should be able to switch to another service and limit the impact on the business as much as possible.

Design deep dive and optimizations

In the high-level design, we discussed the 3 main parts of the notification system: the different types of notifications, the contact-gathering flow, and the notification sending/receiving flow. I would like to highlight the following topics:

  • Security of events and push notifications
  • Notification template and settings
  • Reliability and resilience
  • Retry mechanism
  • Rate limiting
  • Monitoring queued notifications and event tracking

Security of events and push notifications

  • When storing sensitive data, we should use DynamoDB encryption at rest and integrate with AWS Key Management Service (AWS KMS) to manage the keys that encrypt the tables. Use IAM roles to control access to DynamoDB.
  • Apply the principle of least privilege when granting access to resources.
  • Protect EventBridge data in transit by using SSL/TLS to communicate with AWS resources. TLS 1.3 is recommended.
  • For iOS and Android apps, appKey and appSecret are used to secure the push notification APIs. Only authenticated or verified clients are allowed to send push notifications through the APIs. These credentials should be stored encrypted in AWS Secrets Manager or Parameter Store.

Notification template and settings

  • We should create a notification template for each notification type, since notifications of the same type follow a similar format. A template can be reused, so we avoid building every notification from scratch.
  • A notification template is preformatted notification content from which unique notifications are created by customizing parameters, tracking links, etc. We can store these templates in S3 buckets under defined prefixes.
  • To give users fine-grained control over notification settings, we can store the settings in a separate notification settings table. Before any notification is sent to a user, we first check whether the user has opted in to this type of notification.
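
Put together, templating and the settings check might look like this minimal sketch. The template text, the parameter names, and the "opted in unless stated otherwise" default are all assumptions for illustration; in the real system the template would be loaded from S3 and the settings from the settings table.

```python
from string import Template
from typing import Dict, Optional

# Stand-in for a template object that would normally live in S3.
PAYMENT_REMINDER = Template("Hi $name, your payment of $amount is due on $due_date.")


def render_if_opted_in(
    template: Template,
    params: Dict[str, str],
    settings: Dict[str, bool],
    notification_type: str,
) -> Optional[str]:
    """Return the rendered text, or None if the user opted out of this type."""
    if not settings.get(notification_type, True):  # assumption: opted in by default
        return None
    return template.substitute(params)  # raises KeyError if a parameter is missing
```

Using `substitute` rather than `safe_substitute` is a deliberate choice here: a missing parameter fails loudly instead of sending a half-filled message to a customer.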

Reliability and resilience

  • Prevent data loss: one of the most important non-functional requirements of a notification system is that it cannot lose data. Notifications can be delayed or re-ordered, but they should never be lost. To satisfy this requirement, the notification system persists notification data in a separate log table and implements a retry mechanism.
  • Will a notification be received exactly once? No, that cannot be guaranteed. According to the SLAs of the 3rd-party service providers, a notification is delivered exactly once most of the time, but the distributed nature of the system can produce duplicates. We can only reduce how often duplication occurs by introducing a deduplication mechanism and handling failures carefully.
  • Here is a simplified logic: when a notification event first arrives, we check whether it has been delivered before by looking up its eventId. If it was delivered successfully before, it is discarded. Otherwise, we send out the notification.
  • Resilient infrastructure: we should consider deploying across multiple Availability Zones, so that applications and databases fail over automatically between zones without interruption. A multi-AZ deployment is more highly available, fault-tolerant, and scalable than a traditional single or multiple data center infrastructure.
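
The eventId check described above can be sketched in a few lines. The in-memory set is only a stand-in for the durable log table, and the class and method names are hypothetical. Note that "check, then send, then record" is not atomic: two workers could both pass the check at the same time, so a real implementation would more likely claim the eventId with a conditional write first.

```python
from typing import Callable, Set


class Deduplicator:
    """Skip events whose id has already been delivered successfully."""

    def __init__(self) -> None:
        self._delivered: Set[str] = set()  # stand-in for the durable log table

    def deliver_once(self, event_id: str, send: Callable[[], None]) -> bool:
        """Send unless event_id was delivered before; return True if sent."""
        if event_id in self._delivered:
            return False  # duplicate: discard
        send()  # if this raises, the id is not recorded, so a retry can resend
        self._delivered.add(event_id)
        return True
```

Recording the id only after a successful send favors "at least once" over "at most once", which matches the requirement that notifications must never be lost.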

Retry mechanism

  • When an SNS/3rd-party service fails to send a notification, the notification is added to the dead-letter queue for retrying. If the issue persists, an alert is sent to the developers in charge.
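
One possible reading of this flow, sketched below, is: retry a few times with exponential backoff and jitter, and only then park the notification in the dead-letter queue. The list standing in for the dead-letter queue, the attempt count, and the delays are assumptions of mine, not values from the production system.

```python
import random
import time
from typing import Any, Callable, Dict, List


def send_with_retry(
    send: Callable[[Dict[str, Any]], None],
    notification: Dict[str, Any],
    dead_letters: List[Dict[str, Any]],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to send; after max_attempts failures, park it in dead_letters."""
    for attempt in range(1, max_attempts + 1):
        try:
            send(notification)
            return True
        except Exception:
            if attempt == max_attempts:
                break
            # Exponential backoff with full jitter: 0 .. base_delay * 2^(attempt-1)
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
    dead_letters.append(notification)  # kept for inspection, replay, and alerting
    return False
```

With SQS itself, the same effect is usually configured rather than coded: a redrive policy moves a message to the dead-letter queue after it has been received a set number of times.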

Rate limiting

  • We should send notifications politely, so that users are not overwhelmed by too many of them. By using SQS and limiting the number of notifications a user can receive within a period of time, we make the notification system more polite.
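
A per-user limit can be sketched as a sliding window over recent send times. This version keeps its state in memory, which would not survive across Lambda invocations; a shared store would be needed in practice. The class name, the limits, and the injectable clock are hypothetical.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class PerUserRateLimiter:
    """Allow at most max_per_window notifications per user per window."""

    def __init__(
        self,
        max_per_window: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self._clock = clock
        self._sent: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, user_id: str) -> bool:
        """Record and allow the send, or return False if the user is at the limit."""
        now = self._clock()
        timestamps = self._sent[user_id]
        # Drop send times that have fallen out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_per_window:
            return False
        timestamps.append(now)
        return True
```

What to do with a rejected notification (drop it, delay it, or merge it into a digest) is a product decision that this sketch leaves open.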

Monitoring queued notifications and event tracking

  • We should use AWS CloudWatch metrics to monitor the notification system. The key metrics are the total number of events in EventBridge and the total number of queued notifications. If these two metrics are large, the workers are not processing notification events fast enough, which means we should scale out and add more workers.
  • Event tracking: custom metrics such as open rate, click rate, and engagement are important for understanding customer behavior. We should assign each event a status: created → pending → sent → opened → clicked, or error, unsubscribed. With event statuses integrated into the notification system, we can trace notification events.
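
The status flow can be enforced with a small transition table. The statuses come from the list above, but exactly which transitions are legal (for example, from which states error or unsubscribed can be reached) is my own assumption.

```python
from typing import Dict, Set

# Assumed transition table; the status names come from the article.
ALLOWED_TRANSITIONS: Dict[str, Set[str]] = {
    "created": {"pending", "error"},
    "pending": {"sent", "error"},
    "sent": {"opened", "unsubscribed", "error"},
    "opened": {"clicked", "unsubscribed"},
    "clicked": {"unsubscribed"},
    "error": set(),
    "unsubscribed": set(),
}


def advance(current: str, new: str) -> str:
    """Return the new status, or raise ValueError on an illegal transition."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal status transition: {current} -> {new}")
    return new
```

Rejecting illegal transitions keeps late or duplicated tracking callbacks from, say, marking a notification as clicked before it was ever sent, which would skew the open and click rates.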

Updated high-level architecture

Optimized notification system with AWS:

Final notes

While this article may be on the longer side, it underscores the indispensability of notifications in keeping us informed of crucial information.

In this piece, I aim to elucidate the blueprint for a scalable, highly available, and reliable notification system that accommodates various notification types, including mobile push notifications, SMS messaging, email, and 3rd-party app notifications.

To achieve this, I opted for an event-driven architecture, leveraging EventBridge and SQS queues to decouple system components.

In our design, we made extensive use of AWS serverless services, a choice that keeps the system efficient and helps minimize both infrastructure and operational costs.

The design adheres to the principles of the Twelve-Factor App, treating backing services as attached resources, storing configurations in the environment, and treating logs as event streams, among other considerations.

I trust that this information will prove valuable in your own design endeavors. Thank you for taking the time to read this far!
