AWS Architecting: The New Way
Not the usual basic stuff…repeated again and again…but:
- a simple definition and then
- the strange things and
- a few interesting examples.
Just for fun. With an accent on hybrid environments. So, advanced topics made easy, with useful ways of avoiding traps.
What’s the trick: if you’re a beginner you have to go and look up the items you don’t know… but you can learn far faster.
VMs: EC2
Amazon EC2 (Elastic Compute Cloud) is the backbone of AWS. It provides on-demand, resizable virtual servers in the cloud. Instead of buying a physical box (server), you “rent” a slice of a high-performance server (an Instance) and pay for it by the second or hour.
What’s new: Nitro Systems
Nitro Systems (C5, M5, or R5 families): the new EC2 architecture, built on dedicated hardware chips that handle networking, storage, and security. This frees up nearly 100% of the server’s CPU and memory for your applications. In a traditional cloud, the main CPU spends about 30% of its power just managing the “housekeeping” of virtualization. Nitro fixes this by offloading these tasks to specialized Nitro Cards:
- Networking (VPC): handles all packet routing and encryption, meaning zero lag from the host CPU.
- Storage (EBS): High-speed encryption and data transfer for your disks handled by hardware.
- Lightweight Hypervisor: the remaining software hypervisor is tiny.
- Bare Metal Support: Nitro allows you to rent the entire physical server while still using AWS features like VPC and EBS.
- Nitro Security Chip: monitors every system boot to ensure no unauthorized code (like a rootkit) has been installed at the hardware level. It checks digital signatures.
- Confidential Computing (Nitro Enclaves): you can create a “black box” inside your instance: an isolated environment with no persistent storage, no interactive access, and no external networking, used to process highly sensitive data.
What’s new: IMDSv2
VMs need to ask for information about themselves; this is done through the Instance Metadata Service (IMDS), reachable at http://169.254.169.254.
The original version (v1) operates on a simple Request/Response pattern. Any process on the instance — or any process that can trick the instance into making an outbound call — can access the metadata. This opened the door to Server-Side Request Forgery (SSRF) attacks that scrape IAM security credentials in plain text.
IMDSv2 introduces a session-oriented flow. It requires two steps to get data:
- First, the client must initiate a session by sending an HTTP PUT request to create a token. This request must include a header specifying the “Time to Live” for that token.
- Second, every subsequent GET request for metadata must include that specific token in the header.
It utilizes a specific network trick: the response to the token PUT is sent with an IP header hop limit (TTL) of 1, so the token can never travel beyond the instance itself; a misconfigured container or an attacker one network hop away never receives it.
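A minimal sketch of the two-step flow in Python, using only the standard library (the endpoint and header names are the documented IMDSv2 ones):

```python
import urllib.request

IMDS = "http://169.254.169.254"

# Step 1: a PUT request creates a session token; the TTL header is mandatory.
req = urllib.request.Request(
    f"{IMDS}/latest/api/token",
    method="PUT",
    headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},  # 6 hours (the max)
)
token = urllib.request.urlopen(req).read().decode()

# Step 2: every subsequent GET must carry the token; without it, IMDSv2 refuses.
req = urllib.request.Request(
    f"{IMDS}/latest/meta-data/instance-id",
    headers={"X-aws-ec2-metadata-token": token},
)
print(urllib.request.urlopen(req).read().decode())
```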
What’s important: Placement Groups
When you deploy instances, AWS spreads them across the underlying hardware to maximize availability.
This can be a performance killer for nodes that need to talk to each other constantly. So, a Cluster Placement Group packs your instances as close together as possible within a single Availability Zone. Technically, this often means they are on the same physical rack or even the same network spine. However, the risk here is a “single point of rack failure.”
Instead, the Partition Placement Group ensures that your instances are spread across different logical partitions (and therefore different physical racks) within a single AZ. It is perfect for Kafka or Cassandra: AWS guarantees that no two partitions share the same hardware, so if Rack A fails, your database remains operational because the other nodes are physically isolated. In hybrid environments you need to care about asynchronous hybrid consistency. So, use a Cluster Placement Group inside AWS for the high-performance “hot” data layer and use the hybrid link only for asynchronous replication or “warm” standby. Never put a synchronous “wait” operation across a hybrid tunnel. If you do, you have created a system that is bound by the slowest fiber optic cable in your provider’s network.
Spread Placement Groups is the most granular level of isolation, ensuring that every single instance is on a totally distinct rack with its own network and power source.
What’s important: Spot Instances
Spot Instances are AWS’s “spare capacity” sold at a massive discount (up to 90% off). The catch? AWS can reclaim them with only a 2-minute notice — Spot Instance Interruption Notice, if they need the capacity back for full-paying customers.
You need to use this signal to trigger a proactive “Draining” phase using Amazon EventBridge. Create a rule that listens for the EC2 Spot Instance Interruption Notice event. This rule triggers an AWS Lambda function (a minimal sketch follows this list) that handles:
- Draining the Target Group: it immediately tells the Elastic Load Balancer (ALB/NLB) to set the instance state to draining. This stops new connections while allowing existing ones to finish.
- State Checkpoint: if the worker is processing a message from an SQS queue or a Kinesis stream, the Lambda sends a “SIGTERM” to the application process.
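A minimal sketch of that draining Lambda, assuming an EventBridge rule on the aws.ec2 “EC2 Spot Instance Interruption Warning” detail-type (the event behind the interruption notice) and a hypothetical TARGET_GROUP_ARN environment variable:

```python
import os
import boto3

elbv2 = boto3.client("elbv2")

# Hypothetical: the target group this worker fleet is registered in.
TARGET_GROUP_ARN = os.environ["TARGET_GROUP_ARN"]

def handler(event, context):
    # EventBridge rule pattern (for reference):
    #   {"source": ["aws.ec2"],
    #    "detail-type": ["EC2 Spot Instance Interruption Warning"]}
    instance_id = event["detail"]["instance-id"]

    # Deregistering starts connection draining on the ALB/NLB:
    # no new connections, existing ones are allowed to finish.
    elbv2.deregister_targets(
        TargetGroupArn=TARGET_GROUP_ARN,
        Targets=[{"Id": instance_id}],
    )
    return {"drained": instance_id}
```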
To ensure this works, use AWS Fault Injection Service (FIS) to simulate Spot interruptions during peak hours.
What’s important: Performance Traps
EBS Optimized bottleneck. Many architects select an instance based on CPU and RAM but forget that storage throughput and network bandwidth often share the same pipes.
Elastic IP (EIP) obsession. Static IPs may represent a trap that kills scalability.
- use PrivateLink or Transit Gateway to handle cross-environment communication. If you find yourself manually managing more than a handful of EIPs, your architecture is brittle.
- use DNS-based service discovery or Global Accelerator to abstract the entry points, ensuring that if an instance dies or a region fails, your hybrid tunnels don’t require a manual update of firewall rules.
Serverless: Lambda
Lambda is actually a distributed event-processor that can either be a scaling asset or a bottleneck. In a hybrid world, Lambda acts as the “glue” that connects legacy on-prem signals to modern cloud-native responses.
A Lambda only exists when there is work to do. You pay only for the execution time, which, when combined with Provisioned Concurrency, can handle sudden bursts of traffic — like a morning login surge — without the 5-minute spin-up time of an Auto Scaling Group.
What’s important: VPC Cold Start
When your Lambda needs to talk to a private resource (like an on-prem database via Direct Connect), it must attach to a VPC. Historically, this caused a massive delay. While AWS has improved this with Hyperplane, the “New Way” requires careful subnet planning. If you exhaust your IP space in a small subnet, your Lambdas will fail to scale, effectively DOS-ing your own application.
AWS Hyperplane is a massive, internally distributed scale-out tier that powers AWS’s most high-performance networking services like Network Load Balancer (NLB), NAT Gateway, and PrivateLink (the same machinery behind Lambda’s VPC networking). Under the hood it is organized, Kubernetes-style, into a control plane and a distributed state layer.
What’s important: Recursive Loop
If you have a Lambda that writes to S3 → S3 triggers the same Lambda… it can execute millions of times in minutes. Without a kill-switch, this doesn’t just crash a server — it burns through your entire bank account (or your job) at cloud speed. Your defenses: Lineage Tracking (X-Ray Tracing), the 16-Limit Rule (Lambda’s built-in recursive-loop detection, which breaks the chain after 16 recursive invocations) and Emergency Alerts (AWS Health Dashboard notifications).
Horrible experiences:
The “Cross-Account” Blind Spot: If your loop jumps between two different AWS accounts or through a third-party API (like a Slack webhook or Stripe), the tracing metadata is often lost.
- Experience: A developer sets up a loop where Account A sends a message to Account B, which sends it back. AWS doesn’t “see” the recursion across the boundary, and the bill grows unchecked.
The “Slow-Burn” Trap: Loop detection is designed to catch rapid fires. A loop that triggers once every 5 minutes won’t hit the “16-times-fast” threshold.
- Experience: A cleanup script accidentally recreates a 1GB file every hour. It stays under the radar for a full year, quietly costing thousands in “zombie” storage and execution fees.
The “Downstream Chaos” Trap: Even if AWS stops the Lambda, the Action B might have already caused damage (e.g., sending 16,000 unintentional emails to customers or deleting 16 versions of a database record).
Best Practices
Never use a single bucket for everything — the Two-Bucket Rule:
- Bucket A (Ingest): Lambda is triggered by ObjectCreated events here.
- Bucket B (Output): Lambda writes the finished file here. No triggers exist on this bucket.
The Prefix (Folder) Partition: if you must use one bucket, use strict Prefix Filtering (see the sketch after this list).
- Trigger: only on the uploads/ prefix.
- Action: Lambda processes the file and writes to processed/.
- Trap: if you accidentally set the trigger to the root of the bucket, the prefix filter is ignored, and the loop begins.
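A minimal boto3 sketch of that prefix-filtered trigger (bucket name and Lambda ARN are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Fire the Lambda only for objects landing under uploads/ —
# files written back to processed/ never re-trigger it.
s3.put_bucket_notification_configuration(
    Bucket="my-ingest-bucket",  # placeholder
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [{
            "LambdaFunctionArn": "arn:aws:lambda:eu-west-1:123456789012:function:process-upload",
            "Events": ["s3:ObjectCreated:*"],
            "Filter": {"Key": {"FilterRules": [
                {"Name": "prefix", "Value": "uploads/"},
            ]}},
        }]
    },
)
```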
Idempotency ensures that if an event is delivered twice (a common occurrence in distributed systems), the second execution does nothing.
Metadata Guardrails: Before processing, the Lambda checks the S3 file’s metadata for a flag (e.g., x-amz-meta-status: completed). If the flag exists, the Lambda exits immediately.
The “Receipt” Database (DynamoDB): The Lambda stores the requestID or file_hash in a DynamoDB table with a short TTL (Time to Live).
- Experience: If S3 sends the same event again due to a network flicker, the Lambda sees the “receipt” in DynamoDB and stops before doing expensive work.
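A minimal sketch of the receipt check with boto3, assuming a hypothetical receipts table keyed on the event ID:

```python
import time
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("receipts")  # hypothetical table

def already_processed(event_id: str, ttl_seconds: int = 3600) -> bool:
    """Atomically record the receipt; True means this is a duplicate delivery."""
    try:
        table.put_item(
            Item={"pk": event_id, "expires_at": int(time.time()) + ttl_seconds},
            ConditionExpression="attribute_not_exists(pk)",  # fail on replay
        )
        return False
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return True  # receipt exists: skip the expensive work
        raise
```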
The “Renaming” Disaster: Developers often think, “I’ll just process the file and overwrite the original to save space.”
- Experience: Overwriting a file counts as a New Object Created event. The Lambda sees the “new” file, processes it again, and repeats until the AWS account is throttled or the budget is gone.
The “S3 Batch” Explosion: Using S3 Batch Operations to trigger Lambdas on millions of files.
- Experience: If your Lambda has a logic error that creates a new file for every processed one, you don’t just have a loop; you have a chain reaction that can generate millions of invocations per second.
Memory as a Power Lever: Lambda scales CPU and Network bandwidth linearly with Memory. If your function is slow, don’t just look at the code; increase the memory to 1769 MB (which equals 1 full vCPU).
The /tmp Space Strategy: For big data or machine learning tasks, remember that Lambda now supports up to 10GB of ephemeral /tmp storage. This allows you to pull large datasets from your on-premise data lake, process them locally in the Lambda environment, and push only the results to DynamoDB, bypassing expensive network egress for raw data.
Database Connection Pooling: Lambda is short-lived, but RDS connections are “heavy.” If 1,000 Lambdas fire at once, they will crash a traditional database by opening too many concurrent connections. The professional solution is RDS Proxy, for pooling connections.
Environment Variable Leak
Environment variables cannot be rotated without a redeployment. So, if a DBA rotates a password on an on-premise SQL server, your entire Lambda fleet might crash simultaneously because it is still using the cached, hardcoded string. Solution: Lambda Extensions → sidecars for Lambdas that run in parallel with your function. Instead of your function code calling an API like AWS Secrets Manager or HashiCorp Vault every time it runs — which adds latency and cost — an extension runs as a separate process within the Lambda execution environment. This extension can pre-fetch secrets, cache them locally, and even handle the background logic of refreshing them before they expire. Your application code simply queries a local HTTP endpoint or a shared file path.
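A minimal sketch of what the function code looks like with the AWS Parameters and Secrets Lambda Extension enabled (port 2773 is the extension’s default; the secret name is a placeholder):

```python
import os
import json
import urllib.request

def get_secret(secret_id: str) -> dict:
    # The extension listens on localhost and authenticates callers
    # via the function's session token.
    req = urllib.request.Request(
        f"http://localhost:2773/secretsmanager/get?secretId={secret_id}",
        headers={"X-Aws-Parameters-Secrets-Token": os.environ["AWS_SESSION_TOKEN"]},
    )
    payload = json.loads(urllib.request.urlopen(req).read())
    return json.loads(payload["SecretString"])

db_creds = get_secret("prod/on-prem-sql")  # placeholder secret name
```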
IAM Roles Anywhere: gives an on-premise server the same “identity” as an EC2 instance. Local servers exchange certificates issued by a local Certificate Authority (CA) for temporary AWS IAM credentials. No long-lived user access keys.
Storage: S3 — Tips
The Request Cost Trap: The $0.005 Hidden Tax
The cost of API requests. S3 storage is cheap, but PUT and LIST requests are not. If you have a microservice that writes thousands of tiny 1KB files instead of one 1MB file, your “Request” costs will eventually exceed your “Storage” costs. In a hybrid environment, if an on-premise script performs a LIST operation on a bucket with millions of objects every few minutes to check for new data, you are essentially burning your own budget. The professional move is to use S3 Event Notifications with SQS or EventBridge to "push" updates to your consumers rather than forcing them to "poll" the bucket.
The Performance Trap: The Prefix Bottleneck
S3 can handle roughly 3,500 PUT/COPY/POST/DELETE and 5,500 GET requests per second per prefix. If all your data is dumped into s3://my-bucket/data/, you will hit a "503 Slow Down" error once you scale. The "New Way" is to introduce entropy or date-based prefixes (e.g., s3://my-bucket/2026/04/13/unique-id/) to spread the load across S3’s internal index shards. This allows you to scale horizontally to millions of requests per second without touching a single configuration setting.
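A minimal sketch of building such keys (the exact layout is an example, not a rule):

```python
import datetime
import uuid

def build_key(filename: str) -> str:
    """Date-based prefix plus a random component spreads writes
    across S3's internal index partitions."""
    today = datetime.date.today()
    return f"{today:%Y/%m/%d}/{uuid.uuid4()}/{filename}"

print(build_key("events.parquet"))
# e.g. 2026/04/13/1f2e8c5a-.../events.parquet
```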
The Retrieval Trap: The “Cold Data” Penalty
Moving data to S3 Glacier or Infrequent Access (IA) sounds like a great way to save money for a large company. However, if you move data to IA and then a developer runs a search tool that scans the entire bucket, you will be hit with a “Retrieval Fee” that can be 10x the storage savings.
Use S3 Intelligent-Tiering. It is the only storage class that automatically moves data between “Frequent” and “Infrequent” tiers based on actual access patterns without any retrieval fees or operational overhead.
Security: The “Public by Accident” Defense
In a hybrid mesh, you might need to share S3 data with a partner or an on-premise application. The old way was using IAM Users and Access Keys.
Use S3 Access Points. Instead of one giant, complex Bucket Policy you create specific “Access Points” for different applications (e.g., finance-access-point, devops-access-point). Each has its own mini-policy. Combine it with Block Public Access enforced at the Account level.
Multipart Uploads
Always implement a Lifecycle Policy to delete Incomplete Multipart Uploads. When a large file upload from your on-premise server fails halfway through, the “parts” stay in S3 forever, invisible to a normal ls command, but you are still billed for them. One simple rule to "Abort incomplete multipart uploads after 7 days" can do the job.
A multipart upload can be started explicitly via the CLI or an SDK, not from the Console. However, the AWS Console automatically uses multipart uploads behind the scenes: when you drag and drop a file larger than 160MB into it, the browser initiates a multipart upload in parallel.
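A minimal boto3 sketch of that “abort after 7 days” lifecycle rule (the bucket name is a placeholder):

```python
import boto3

s3 = boto3.client("s3")

# One rule: abort incomplete multipart uploads after 7 days,
# so abandoned parts stop accumulating invisible storage costs.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",  # placeholder
    LifecycleConfiguration={"Rules": [{
        "ID": "abort-stale-multipart",
        "Status": "Enabled",
        "Filter": {"Prefix": ""},  # apply to the whole bucket
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
    }]},
)
```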
S3 and Data Lakes
In a modern enterprise, S3 is no longer just “storage.” It is the physical storage layer for your Data Lake.
The Rename/Move Bottleneck
In a traditional file system (HDFS), moving a folder is a metadata operation — it is nearly instantaneous. In S3, there is no such thing as a “folder.” If Spark finishes a task and needs to move data from a temporary directory to a final production path, S3 has to copy every single object and then delete the originals.
If your Spark job generates 100,000 small files, this “commit” phase will take longer than the actual data processing. This is the S3 Consistency/Rename Trap. The professional solution is to use the S3A Committer or the AWS Glue Optimized Committer, which write directly to the final destination using multipart uploads and only finalize them once the job succeeds.
The Partitioning Trap: Over-partitioning
Often data lakes are over-partitioned by every possible column (e.g., /year/month/day/hour/minute/user_id/). While this seems organized, it creates a "Small File Problem." Spark has to make an HTTP HEAD request for every single file to build its execution plan.
If your data lake has millions of tiny files, Spark will spend 90% of its time just “listing” the files before it even starts reading data. The “New Way” is to aim for file sizes between 128MB and 512MB. Use tools like coalesce() or repartition() in your Spark code to merge small fragments into substantial objects that take advantage of S3’s high-throughput sequential reads.
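A minimal PySpark sketch of the compaction step (paths and the target partition count are assumptions you would derive from your own data volume):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact").getOrCreate()

df = spark.read.parquet("s3://my-bucket/raw/2026/04/13/")  # placeholder path

# Merge thousands of small fragments into a handful of ~128-512MB files.
# coalesce() only narrows partitions (no shuffle); use repartition()
# instead if you also need to redistribute skewed data.
(df.coalesce(16)
   .write.mode("overwrite")
   .parquet("s3://my-bucket/compacted/2026/04/13/"))
```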
Parquet and Columnar Pruning
To make S3 and Spark work in harmony, you must use a columnar format like Apache Parquet or ORC.
When Spark queries a Parquet file on S3, it uses Columnar Pruning and Predicate Pushdown. Instead of downloading the entire 1GB file, Spark reads the Parquet footer, identifies exactly which “chunks” of data contain the relevant columns or rows, and uses HTTP Range Gets to fetch only those specific bytes. This reduces your network traffic and S3 GET costs by up to 99%.
S3 Select: Offloading Compute to Storage
The ultimate “out-of-the-box” trick for Spark developers is S3 Select.
You send a SQL query to S3 (e.g., SELECT * FROM s3object WHERE age > 21), and S3 filters the data internally, returning only the matching rows to Spark. This saves CPU cycles on your Spark nodes and drastically reduces the amount of data traveling over the network—critical if your Spark cluster is in a different VPC or even a different region than your data.
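A minimal boto3 sketch (bucket, key, and the CSV layout are placeholders):

```python
import boto3

s3 = boto3.client("s3")

resp = s3.select_object_content(
    Bucket="my-bucket",   # placeholder
    Key="users.csv",      # placeholder
    ExpressionType="SQL",
    Expression="SELECT * FROM s3object s WHERE CAST(s.age AS INT) > 21",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"JSON": {}},
)

# S3 filters internally and streams back only the matching rows.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())
```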
What’s new: Mountpoint for S3
Common practice is to force S3 to act like a local drive using FUSE-based tools like s3fs.
The Mountpoint for S3 is the “New Way.” It is a specialized open-source file client developed by AWS that maps S3 buckets to a local Linux directory. Unlike its predecessors, it doesn’t try to be a general-purpose file system; it is built specifically for high-throughput, read-heavy workloads like machine learning training, rendering, and large-scale data ingestion.
The “No-Write/No-Append” Reality
Mountpoint is NOT a drop-in replacement for an EBS volume or an EFS share. To achieve its speed, Mountpoint makes significant trade-offs:
- Sequential Writes Only: You can create new files, but you cannot “append” to an existing file. If your application tries to open an existing log file and add a line, the operation will fail.
- No File Locking: If two servers mount the same bucket, there is no “flock” mechanism to prevent them from overwriting each other.
- Limited Metadata: traditional Linux commands like chown or chmod do not work. The permissions are inherited from the IAM role of the EC2 instance or the credentials provided by your on-premise server.
- IAM Roles Anywhere Integration: to use Mountpoint on-premise securely, don’t use static Access Keys. Combine Mountpoint with IAM Roles Anywhere. This allows your local Linux daemon to exchange its local hardware certificate for short-lived tokens, which Mountpoint then uses to authenticate with S3.
- Predictive Prefetching: Mountpoint is “smart.” If it detects your application is reading a file sequentially (like a Spark job or a Video Transcoder), it starts pre-fetching the next blocks into memory before the application even asks for them. This effectively hides the “speed of light” latency between your data center and the AWS region.
Apache Iceberg and Apache Hudi
These technologies introduce a Metadata Layer between your processing engine (Spark/Flink) and your physical data (S3). They treat the data lake not as a file system, but as a transactional database.
The Architecture: Commits without Renames
The core reason Iceberg and Hudi solve the “Rename” problem is that they are immutable. When Spark finishes a task, it doesn’t move files from a /tmp folder to a /final folder. Instead, it writes new data files to S3 and then creates a new Snapshot file (the metadata).
This snapshot file explicitly lists exactly which data files make up the current “state” of the table. To “commit” the data, Spark simply points the table’s metadata to the new snapshot. It is an atomic operation. For an architect, this means your data is never in a “partial” state. If a hybrid sync fails halfway through, the metadata remains pointed at the old snapshot, and your users never see corrupted or incomplete data.
Apache Iceberg: The Hidden Partitioning King
The biggest advantage for a programmer using Iceberg is Hidden Partitioning. In the old way, if you partitioned by day, your query had to explicitly include WHERE date = '2026-04-13'. If a developer forgot that clause and just queried a timestamp, Spark would scan the entire bucket.
Iceberg handles the relationship between the physical data and the logical column. You can tell Iceberg to partition by “day,” but the user just queries a standard timestamp. Iceberg automatically “prunes” the files, reading only the relevant S3 prefixes.
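A minimal Spark SQL sketch of hidden partitioning (catalog and table names are placeholders; the catalog must be configured as an Iceberg catalog in your session):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-demo").getOrCreate()

spark.sql("""
    CREATE TABLE my_catalog.db.events (
        id      BIGINT,
        payload STRING,
        ts      TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(ts))  -- the transform is hidden from query authors
""")

# Users filter on the plain timestamp; Iceberg maps the predicate to the
# day partitions and prunes the S3 prefixes automatically.
spark.sql("""
    SELECT * FROM my_catalog.db.events
    WHERE ts >= TIMESTAMP '2026-04-13 00:00:00'
      AND ts <  TIMESTAMP '2026-04-14 00:00:00'
""").show()
```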
Apache Hudi: The Incremental Powerhouse
If your on-premise database is constantly sending updates (Change Data Capture — CDC) to the cloud, Hudi is your best tool.
Hudi supports Merge-On-Read. Instead of rewriting a whole Parquet file every time a single row changes (which is TERRIBLE in S3), Hudi writes a small “delta” log. When a user reads the table, Hudi merges the base file and the log file on the fly.
The Trap: Metadata Bloat
The danger with these formats is that every “commit” creates new metadata files. Over time, your S3 bucket will be filled with thousands of tiny metadata snapshots. If you don’t run Maintenance Jobs (like expire_snapshots in Iceberg or clustering in Hudi), your query performance will degrade because the engine has to read too much metadata before it even gets to the data.
Time Travel for Free
Because these formats keep old snapshots for a defined period, they enable Time Travel. A programmer can run a query like SELECT * FROM table FOR SYSTEM_TIME AS OF 'yesterday'. This is a lifesaver in hybrid environments: if an on-premise ingestion script accidentally corrupts a table, you don’t have to restore from a backup. You simply point the metadata back to the snapshot from one hour ago.
This is because in Iceberg or Hudi, data is never overwritten — it is only added. If your script “updates” 1,000 rows, Iceberg doesn’t change the existing Parquet files. Instead:
- It writes new Parquet files containing the corrected data.
- It creates a new Metadata Snapshot that points to these new files and ignores the old ones.
This “Time Travel” capability only works if you haven’t run your Maintenance/Vacuum jobs yet. To save money, “Expire Snapshots” tasks can be used to delete old, unused physical files.
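A minimal sketch of such a maintenance call using Iceberg’s Spark procedure (catalog and table names are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-maintenance").getOrCreate()

# Delete snapshots older than a week, but always keep the last 10,
# so recent Time Travel still works.
spark.sql("""
    CALL my_catalog.system.expire_snapshots(
        table => 'db.events',
        older_than => TIMESTAMP '2026-04-06 00:00:00',
        retain_last => 10
    )
""")
```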
S3 and Athena
Athena works by integrating directly with the AWS Glue Data Catalog. When you run a SQL query against an Iceberg table, Athena doesn’t “scan” the S3 bucket blindly: it reads the Iceberg metadata snapshots first.
The “Select *” Financial Disaster
The biggest trap for a programmer using Athena is the Scan-Based Pricing. You are billed per Terabyte of data scanned. If you have a table with 100 columns and you run SELECT * FROM iceberg_table, Athena has to pull all 100 columns from S3.
In a high-scale environment, a single “lazy” query from a developer can cost $50 or $100. The professional best practice is Columnar Projection. Since your data is in Parquet/Iceberg format, you should only ever select the specific columns you need.
Best Practice: Partition Projection and Iceberg
Even with Iceberg, if your table grows to millions of files, the Glue Data Catalog can become a bottleneck for metadata lookups.
The “New Way” involves using Partition Projection and Iceberg’s Manifest Caching. Iceberg keeps a “Manifest File” (a list of all data files) already written on S3.
- Partition Projection tells Athena the mathematical path to the metadata.
- Manifest Caching provides Athena with a pre-made list of files.
Best Practices with Athena
- Workgroups for Budgeting: In a large company, you must use Athena Workgroups. You can create a “Dev” workgroup and a “Prod” workgroup, setting a hard limit on the amount of data a single query can scan (e.g., 10TB). If a developer writes a rogue Cartesian join, Athena will kill the query before it drains the department’s budget.
- CTAS (Create Table As Select): if you have a massive “raw” Iceberg table that is messy, use Athena to create a “curated” version. The CTAS command allows you to transform data, re-partition it, and save it as a new Iceberg table in one step. It’s the easiest way to perform ETL (Extract, Transform, Load) without writing a single line of Python or Spark code.
- Athena Federated Query: in your hybrid/multicloud environment, you likely have data in on-premise PostgreSQL or RDS. Athena can use Lambda Connectors to run a single SQL query that joins an S3 Iceberg table with a live on-premise database.
Cross-Region Disaster Recovery: The Metadata Mirror
In a traditional disaster recovery (DR) setup, you would use S3 Cross-Region Replication (CRR) to copy your files from Region A to Region B. However, if you don’t replicate the Metadata and the Glue Data Catalog, your data lake in the secondary region is just a “swamp” of unreadable Parquet files.
The Trap: The Sync-Gap Hallucination
The biggest trap is assuming that because S3 is replicating, your table is safe. S3 CRR is asynchronous. If your primary region fails, you might have the data files for Snapshot #100 in your DR region, but your Glue Catalog might still think the latest state is Snapshot #98.
In a large company, this leads to Data Inconsistency. Your standby Athena engine will query a table that technically has missing pieces, causing your automated DR reports to fail or, worse, provide incorrect financial figures.
For maximum resilience, you should use S3 Multi-Region Access Points (MRAP). This provides a single global endpoint for your applications. If a region goes dark, AWS can automatically failover the traffic to the secondary region.
But for the tables themselves, we use a two-pronged approach:
- S3 Replication with Versioning: You enable CRR for the S3 bucket. This handles the physical “heavy lifting” of moving the Parquet and Metadata files.
- Glue Catalog Replication: You use the AWS Glue Cross-Region Replication feature. This ensures that the table definitions, partitions, and — crucially — the pointers to the Iceberg metadata files are synchronized between regions.
The “Pilot Light” vs. “Warm Standby”
- Pilot Light: You replicate the data but keep your Athena workgroups or EMR clusters in the DR region turned off. This is the most cost-effective strategy for large companies. You only pay for storage and replication. If a disaster occurs, your “Automation Engine” (via Terraform or CloudFormation) spins up the compute layer in minutes.
- Warm Standby: You have a small Athena workgroup or a “thinned out” Spark cluster always running in the secondary region. You use this to run Data Integrity Checks on the replicated Iceberg tables every hour. This ensures that in a real crisis, you know for a fact that your DR site is 100% consistent.
Databases
Aurora DB Clusters:
An Aurora DB Cluster consists of one or more DB Instances and a single Cluster Volume: separation of compute and storage. In traditional RDS, each replica has its own copy of the data (EBS volumes). In Aurora, all instances — the 1 Writer and up to 15 Readers — share the same virtual storage volume.
This volume is physically replicated 6 ways across 3 Availability Zones. This means if an entire AZ goes down, your data is still available in two other locations.
1. Horizontal Scaling (Read Replicas)
Since all replicas share the same storage, adding a new reader doesn’t require a long “data sync” or “snapshot restore.” You can spin up a new reader in about 60 seconds, and it is immediately ready to serve traffic. You can have up to 15 replicas, and with Reader Endpoints, Aurora load-balances your read traffic across them.
2. Vertical Scaling (The Instance Class)
If your writer is struggling with complex joins, you scale it vertically by changing the instance class (e.g., from r6g.large to r6g.4xlarge). Because the storage is detached, the “downtime” for a vertical scale is just the time it takes to reboot the instance—typically less than 30 seconds.
3. The Invisible Scale (Storage)
In traditional RDS, you have to “provision” disk space (e.g., 500GB). If you run out, the DB crashes. In Aurora, storage is Elastic. It starts at 10GB and automatically grows in 10GB increments up to 128TB.
Aurora Serverless v2: Scaling at the Micro-Level
For workloads with unpredictable spikes, Aurora Serverless v2 lets you define a range of Aurora Capacity Units (ACUs) (e.g., 0.5 to 16 ACUs).
Aurora monitors CPU, memory, and network pressure in real-time and scales the instance up or down in fractions of a second. It doesn’t reboot; it simply adds or removes RAM and CPU to the running process.
The Trap: The Writer Endpoint Bottleneck
The biggest trap for architects is the Single Writer Limit. While you can scale reads almost infinitely with 15 replicas, you still only have one writer instance for the cluster. If your hybrid application performs massive write bursts (millions of INSERTS/second), scaling the instance class might not be enough. Solutions:
- Aurora Multi-Master (for specific MySQL versions): allows multiple instances to write to the same volume.
- Write-Splitting: ensuring the application logic sends all SELECT queries to the Reader Endpoint and only INSERT/UPDATE to the Writer Endpoint.
In a Multi-Master cluster, all nodes share the same underlying 6-way replicated storage. If Writer A dies, your application simply catches the connection error and immediately retries the write on Writer B. There is zero database downtime.
Specific architectural constraints that you must understand:
- Conflict Resolution: if Writer A and Writer B try to update the exact same row at the exact same millisecond, the storage layer will allow the first one to finish and reject the second one. Your application must be written to handle these Conflict Exceptions and retry.
- Single-Region Only: unlike Aurora Global Database, Multi-Master is strictly confined to a single AWS Region. It is designed for Availability Zone (AZ) resilience, not global geographic distribution.
- Scale Limits: Currently, Multi-Master clusters are limited to 4 writer nodes. This is not for scaling to millions of writes; it is for ensuring that writes never stop.
- No Read Replicas: In a Multi-Master cluster, you don’t have dedicated “Reader” nodes. Every node is a writer. This means you can’t use the standard “Reader Endpoint” to scale out read-heavy workloads.
Aurora Global Database
It uses the storage layer to replicate data to another region in less than 1 second. If your primary region fails, you can promote the secondary region to full read/write status in less than a minute.
DynamoDB
Access Patterns: DynamoDB is a distributed hash table that provides single-digit-millisecond performance at any scale.
Data is physically spread across different storage partitions based on a Partition Key (PK).
- Partition Key (PK): This is the “Address.” DynamoDB runs the PK through a hash function to decide which physical server the data lives on.
- Sort Key (SK): within that address, the SK allows you to model one-to-many relationships (e.g., PK: USER#123, SK: ORDER#2026-04-14).
“Hot Partition” Nightmare
In a large company, the most common failure is choosing a PK with low cardinality (like Country or Status). If 90% of your users are in the "USA," 90% of your traffic will hit a single physical partition. Even though DynamoDB is "infinite," a single partition has a hard limit of 3,000 Read Capacity Units (RCUs) and 1,000 Write Capacity Units (WCUs).
If you hit this limit, your application will receive ProvisionedThroughputExceededException. Ensure your PK is highly distributed, like a UUID, Email, or DeviceID, to spread the load evenly across the entire backend fleet.
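If you are stuck with a low-cardinality key, a classic workaround is to append a random shard suffix. A minimal sketch (the shard count is an assumption to tune against your write rate):

```python
import random

SHARDS = 10  # spreads one logical key across 10 physical partitions

def shard_key(base: str) -> str:
    """'USA' -> 'USA#7': writes fan out across partitions."""
    return f"{base}#{random.randrange(SHARDS)}"

def all_shard_keys(base: str) -> list[str]:
    """Reads must query every shard and merge the results."""
    return [f"{base}#{i}" for i in range(SHARDS)]
```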
Joins: Single-Table Design
In DynamoDB, Joins do not exist. So, you put different types of data in the same table, using the PK and SK to group related items together. When you perform a Query operation on a PK, you can fetch the user profile and all their orders in a single network round-trip.
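A minimal boto3 sketch of that single round-trip (table name and key values are placeholders):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("app-table")  # hypothetical table

# One round-trip fetches the USER#123 profile item and all its orders,
# because they share the same partition key.
everything = table.query(
    KeyConditionExpression=Key("PK").eq("USER#123")
)["Items"]

# Narrow to just the orders by filtering on the sort-key prefix:
orders = table.query(
    KeyConditionExpression=Key("PK").eq("USER#123")
    & Key("SK").begins_with("ORDER#")
)["Items"]
```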
Indexing: GSI vs. LSI
When you need to query data by a different attribute (e.g., finding a user by PhoneNumber instead of UserID), you use a Global Secondary Index (GSI).
- GSI: Creates a shadow copy of your data with a different PK/SK. It is eventually consistent but works across the entire table.
- LSI: Must be created at table birth and shares the same PK.
DynamoDB has a hard limit of 400KB per item.
The Fix: Use the Claim Check Pattern. Store the large file in S3 and store only the S3 URL in DynamoDB.
- On-Demand vs. Provisioned: For unpredictable hybrid traffic, use On-Demand Mode. You pay only for what you use. For steady, baseline enterprise workloads, use Provisioned Mode with Auto Scaling to save up to 70% in costs. Here, you specify the number of Read Capacity Units (RCUs) and Write Capacity Units (WCUs) you want. You don’t have to change these numbers manually. You set a “Target Utilization” (e.g., 70%). AWS will automatically increase or decrease your provisioned units as your traffic changes. So, it calculates the new capacity needed to bring you back to 70% utilization.
- DynamoDB Streams: Use this to trigger a Lambda every time data changes. Hybrid synchronization — when a cloud record is updated, the stream triggers a Lambda that pushes the update to your on-premise legacy system.
- TTL (Time to Live): Tell DynamoDB to automatically delete items after a certain date (e.g., session tokens). It costs zero and doesn’t consume your write capacity. It’s the ultimate “out-of-the-box” cleanup tool.
Global Tables for Multi-Region Multi-Master
For your cross-region DR, DynamoDB offers Global Tables. You can have a writer in New York and a writer in Milan. DynamoDB handles the multi-master replication in the background. If a region fails, your app simply points to the other region. There is no “promotion” or “failover” needed; it is truly active-active.
More on Scaling: Partition Split
DynamoDB stores your data on Partitions (physical storage nodes). A single partition can hold roughly 10GB of data and support 3,000 RCUs / 1,000 WCUs. As your data grows or your throughput requirements increase, DynamoDB performs a Partition Split.
- Storage Split: When a partition exceeds 10GB, AWS splits it in half and moves the data to two new nodes.
- Throughput Split: If you provision more than 3,000 RCUs for the table, AWS splits the data across multiple partitions to handle the load.
Adaptive Capacity. DynamoDB automatically reallocates throughput to your “hot” partitions. If Partition A is doing 90% of the work, DynamoDB will “borrow” the unused capacity from the idle partitions to keep Partition A running. However, this is not a magic bullet. If your total provisioned capacity is exceeded, or if a single partition hits the hard physical limit (1,000 WCUs), you will still face throttling.
Best Practices for High-Scale Scaling
- Avoid “Hot” Keys: As we discussed in the main concepts, scaling is only effective if your Partition Key is well-distributed.
- Burst Capacity: DynamoDB retains a small amount of “unused” capacity (up to 5 minutes) to handle very brief spikes.
- GSI Scaling: Remember that Global Secondary Indexes (GSIs) scale independently from the main table. If your main table is scaled to 10,000 WCUs but your GSI is stuck at 100 WCUs, the writes to your main table will be throttled because DynamoDB must update the GSI synchronously.
Scalability through Caching
For ultra-high-read workloads, use DAX (DynamoDB Accelerator). It is an in-memory cache that sits in front of your table, reducing read latency from milliseconds to microseconds. DAX is a Regional Service that lives inside your VPC (Virtual Private Cloud).
DynamoDB is a Public API service (living outside your VPC), while DAX is a VPC-resident service.
ElastiCache and RDS
In a high-performance AWS architecture, ElastiCache sits in the same VPC as your RDS instances.
The Trap: The “Lazy Writing” Desync
Most developers use the Cache-Aside pattern: read from ElastiCache; if it’s not there, read from RDS and write to ElastiCache. The trap occurs during an Update. If your application updates RDS but the ElastiCache “write-back” fails due to a brief network hiccup or a race condition between two Lambda functions, your database says “Balance is $100” while your cache still says “Balance is $500.”
The best practice is to Invalidate first: always delete the cache key before updating the database. If the database update fails, the next read will simply fetch the old (but correct) data from RDS and repopulate the cache.
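A minimal sketch of the invalidate-first pattern with the redis-py client (the cache endpoint and the db data-access object are hypothetical):

```python
import json
import redis  # assumes the redis-py client

cache = redis.Redis(host="my-cache.abc123.euw1.cache.amazonaws.com")  # placeholder

def update_balance(user_id: str, new_balance: float, db) -> None:
    key = f"balance:{user_id}"
    cache.delete(key)                         # 1. invalidate FIRST
    db.update_balance(user_id, new_balance)   # 2. then write to RDS (hypothetical DAO)
    # No cache write here: the next read repopulates from the source of truth.

def get_balance(user_id: str, db) -> float:
    key = f"balance:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    value = db.read_balance(user_id)          # hypothetical DAO call
    cache.set(key, json.dumps(value), ex=300) # short TTL as a safety net
    return value
```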
The Multi-AZ “Ghost” Read
ElastiCache Redis supports Multi-AZ with Auto-Failover. You have a Primary node and several Read Replicas. The trap: Redis replication is asynchronous. If your application writes to the Primary and immediately tries to read that same key from a Replica in a different AZ, the data might not have arrived yet.
The Security Group “Silent Kill”
Unlike DynamoDB (which uses IAM), ElastiCache uses Security Groups and standard Redis AUTH. A common SysOps trap is configuring the RDS security group perfectly but forgetting that the Application Server needs a specific outbound rule to hit the ElastiCache port (6379 for Redis). Because Redis clients often have long “connect” timeouts, your application might appear to hang for 30 seconds before failing.
ElastiCache Global Datastore. It replicates your Redis cluster across regions with less than 1 second of latency. This is the perfect companion to Aurora Global Database.
RDS and Redis: SQL and noSQL
Redis is a Key-Value store; it doesn’t understand “JOINs” or “Foreign Keys.”
- The RDS Side: you have a query that joins Users, Orders, Products, and Discounts. It takes 200ms of CPU time on RDS.
- The Redis Side: you store the final JSON result of that query under a single key (e.g., user_dashboard_123).
Reference Data and Session State
Every large company has “Reference Data” — things like currency exchange rates, country codes, or tax rules. You can cache these small, static values in Redis with a long TTL (Time To Live). Sessions are high-write and short-lived. Use Redis for the session. If the session expires, Redis deletes it automatically via TTL.
The “Shape-Shifting” Trap
If you fetch a row from RDS, turn it into a JSON string, and save it to Redis, your application has to “Parse” that JSON every time it reads from the cache.
Only cache what is truly “expensive” to compute. If a query is a simple SELECT * FROM users WHERE id = 1, and your RDS has a primary key index, don't cache it.
AWS DMS: The Hybrid Data Bridge for ETL and DWH
AWS Database Migration Service (DMS) is more than a migration tool: it is a high-speed, log-based Change Data Capture (CDC) engine that bridges your high-velocity on-premises databases with your cloud-native Data Warehouse (DWH).
DMS excels at the “E” (Extract) and “L” (Load) parts of the pipeline, moving data with sub-second latency from legacy systems into S3 or Redshift.
The core value of DMS in a DWH context is its ability to perform Full Load + Ongoing Replication.
- Full Load: It takes a snapshot of your on-premises database and moves it to the DWH.
- CDC: It reads the transaction logs (e.g., binlogs in MySQL, WAL in Postgres) without putting any query load on your production CPU. It streams only the “changes” (deltas) to the DWH in real-time.
“S3 Staging” Bottleneck
When you use Amazon Redshift as a DWH target, DMS doesn’t write directly to Redshift’s internal storage.
- The Process: DMS writes data to an S3 Staging Bucket first and then issues a COPY command to move it into Redshift.
- The Trap: if you don’t configure your S3 cleanup rules, that staging bucket will grow indefinitely, costing you thousands in “ghost” storage.
- The Fix: use the S3DataDelivery settings to ensure DMS cleans up these temporary files immediately after they are successfully loaded into the DWH.
The “Zero-ETL” Evolution
For specific pairs (like Aurora to Redshift), AWS now offers Zero-ETL integrations. This uses the storage-layer replication we discussed in the Aurora chapter to stream data into the DWH with zero infrastructure management.
Amazon S3 Tables
S3 Tables are purpose-built for tabular data, providing a native Apache Iceberg experience directly within a new type of bucket: the Table Bucket.
Standard S3 buckets are “General Purpose” — they store images, logs, and zip files. Table Buckets are specialized. They don’t have “Folders”; they have Namespaces and Tables.
- Namespaces: Think of these as Schemas in a relational database. They group related tables together.
- Tables: These are native Iceberg tables. S3 manages the entire lifecycle of the table for you.
The primary bottleneck in traditional S3-based data lakes was the Request Rate Limit (3,500 PUTs / 5,500 GETs per prefix). If you tried to write 100,000 small files to a single partition, S3 would throttle you.
S3 Tables bypass these limits:
- 10x Higher Transactions: S3 Tables are designed to handle up to 10x higher transactions per second compared to self-managed Iceberg on standard S3.
- Built-in Maintenance (The “Janitor”): In a self-managed lake, you have to run manual Spark jobs to “compact” small files (vacuuming). S3 Tables do this automatically in the background. It merges small files, removes unreferenced snapshots, and cleans up orphaned data without you paying for a Spark cluster to do it.
- 3x Faster Query Planning: Because the metadata is handled natively by S3, engines like Athena and Spark can “plan” the query — identifying which files to read — up to 3x faster.
Because S3 Tables are built on Apache Iceberg, they inherit ID-based Schema Evolution: you can rename columns, drop fields, or reorder data without ever rewriting the underlying data files.
The Trap: The Price of Convenience: While S3 Tables simplify everything, there is a “Managed Service” cost.
- The Trap: You are billed not just for storage, but for the Automatic Maintenance tasks (Compaction and Snapshot Management).
- The Decision: If you have a small, static dataset, a standard S3 bucket is cheaper. But if you have a massive, streaming data lake with constant updates and deletes, the performance gains and the elimination of manual “Cleanup” jobs usually result in a lower Total Cost of Ownership (TCO).
- Even though S3 Tables use a new bucket type, they are fully accessible via the Iceberg REST Catalog.
Beyond Flow Logs
Reachability Analyzer: a configuration analysis tool that performs a virtual walk from a source to a destination without sending a single real packet.
- You pick a source (e.g., an EC2 instance) and a destination (e.g., an RDS instance or an Internet Gateway).
- It analyzes your Route Tables, Security Groups, NACLs, and Gateways. If the path is blocked, it tells you exactly where the “No” came from (e.g., “Security Group sg-123 is missing an inbound rule for port 5432”).
Network Access Analyzer: While Reachability Analyzer asks “Can I get there?”, the Access Analyzer asks “Who else can get here?”
- The Difference: It uses formal reasoning to find all possible paths that match your criteria. For example, you can create a “Scope” that says: “Find every path from the Public Internet to my Private Subnets.”
- The Result: It will flag any unintended “backdoors” created by a misconfigured Transit Gateway or an accidentally attached Internet Gateway.
Network Manager & Route Analyzer: If your problem involves a Transit Gateway (TGW) or a Direct Connect back to your on-premise data center, you use the Route Analyzer within Network Manager.
- It analyzes the routing logic across multiple VPCs and on-premise segments. It can identify “Blackhole” routes in your TGW Route Tables — a common issue when an on-premise BGP session drops and the route isn’t withdrawn properly.
- Visual Monitoring: Network Manager gives you a global dashboard to see the health of your VPN tunnels and Direct Connect links in real-time.
Vertical Scaling (Scaling “Up”)
Vertical scaling means adding more power (CPU, RAM, Disk) to an existing resource.
- You change the instance type (e.g., moving an RDS database from a t3.medium to an r6g.large).
- Example: your Aurora database is hitting 90% CPU during end-of-month reporting. You “scale up” the writer instance to a larger class with more memory to handle the complex joins.
It almost always requires a reboot (downtime). Also, there is a “ceiling” — once you reach the largest available instance (like a u-24tb1.metal), you can’t go any higher.
Horizontal Scaling (Scaling “Out”)
Horizontal scaling means adding more instances of the same resource. This is the foundation of high availability.
- Instead of one giant server, you have five small servers behind a Load Balancer.
- Example: A web application gets a sudden spike in traffic. The Auto Scaling Group (ASG) detects the load and launches three additional EC2 instances. The Load Balancer immediately starts sending traffic to the new “targets.”
- No downtime. If one server fails, the others keep running. There is practically no “ceiling.”
The Three Smart Scaling Methods
In the “New Way,” we don’t just add servers; we choose a Policy that dictates how the system “breathes.”
A. Target Tracking Scaling (The “Cruise Control”)
You pick a metric (like 70% CPU) and tell AWS: “Keep the fleet at this level.”
- Example: “I want my ASG to stay at 70% CPU.” If traffic spikes and CPU hits 90%, AWS adds instances until the average across the fleet drops back to 70%. If it drops to 30%, it removes instances.
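A minimal boto3 sketch of the “cruise control” (the ASG name is a placeholder):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Keep the fleet's average CPU at 70%; AWS adds or removes
# instances whenever the metric drifts from the target.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-fleet",  # placeholder
    PolicyName="cpu-target-70",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        "TargetValue": 70.0,
    },
)
```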
B. Step Scaling (The “Manual Gears”)
You define specific “steps” based on the size of the problem.
- Example: if CPU is between 70% and 80%, add 2 instances; if CPU is > 80%, add 5 instances.
Best for high-velocity spikes where you need to react aggressively to big jumps in load.
C. Predictive Scaling
AWS uses Machine Learning to look at your traffic history from the last 14 days.
- Example: Your company always has a massive login surge every Monday at 9:00 AM. Reactive scaling would wait for the crash at 9:01 AM. Predictive Scaling sees the pattern and starts booting instances at 8:45 AM so they are “Warm” and ready when the users arrive.
The Scaling “Anti-Patterns” (The Traps)
- The “Flapping” Trap: If your scale-in policy is too aggressive, the system might add a server, then delete it 2 minutes later, then add it again. This wastes money and destabilizes the app. The Fix: Use “Cooldown periods” to let the metrics settle.
- The “Zombie” Instance: an instance scales out but fails its health check. The Load Balancer doesn’t send it traffic, but the ASG thinks it’s “On.” The Fix: connect your ASG health checks to the ELB health checks so failed instances are replaced automatically.
- The Database Bottleneck: you scale your web servers to 100 instances, but they all try to connect to a single RDS instance. The database crashes. The Fix: pool connections (e.g., RDS Proxy) or scale reads with replicas.
Spot Instances within an Auto Scaling Group (ASG): you can achieve up to 90% savings compared to On-Demand, but you need to use Mixed Instances Policies and Capacity Rebalancing to make Spot nearly as reliable as On-Demand.
Never use 100% Spot for a production workload.
Example Scenario: The 70/30 Split
You have a web fleet that needs at least 10 instances to handle baseline traffic.
- On-Demand Base: 3 instances (Your “Always On” core).
- On-Demand Percentage Above Base: 0% (All remaining capacity will be Spot).
- Spot Percentage: 100%.
- Result: The first 3 servers are reliable On-Demand. Any scaling beyond that (the “burst”) uses Spot. If Spot capacity is unavailable, the ASG can be configured to temporarily fall back to On-Demand to keep the site alive.
The Setup: Key Configuration Steps
A. Attribute-Based Instance Selection (ABS)
You tell AWS: "I need any instance with 2 vCPUs and 8GB of RAM."
- Why: this gives the ASG access to dozens of “Spot Pools.” If t3.large is out of stock, it might give you an m5.large or a c5.large at the same price. Flexibility is the only way to avoid interruptions.
B. Allocation Strategy: “Price Capacity Optimized”
AWS looks at both the price and the real-time capacity of the Spot pools. It will choose the instance type that is least likely to be interrupted, giving you the best balance of cost and stability.
C. Capacity Rebalancing
- The Problem: Normally, AWS gives you a 2-minute warning before a Spot instance is taken back.
- The Fix: When Capacity Rebalancing is on, AWS monitors “at-risk” instances. If it sees a pool is getting tight, it sends a signal 5–10 minutes early.
- The Action: The ASG proactively starts a new instance in a healthier pool. Once the new one passes health checks, it gracefully drains and terminates the at-risk instance before the 2-minute warning even hits.
The Setup Checklist (CLI/Console)
- Launch Template: Create a template with your AMI and user data. Don’t specify an instance type here if using Attribute-Based Selection.
- ASG Mixed Instances Policy:
- Set Base On-Demand Capacity (e.g., 2).
- Set On-Demand Percentage Above Base (e.g., 20%).
- Instance Requirements: Define your vCPU and Memory range.
- Enable Capacity Rebalancing: toggle this to On.
- Termination Policy: set to AllocationStrategy to ensure the ASG removes the most expensive or at-risk instances first during a scale-in. (A minimal sketch of the whole policy follows.)
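A minimal boto3 sketch pulling the checklist together (names, subnets, and the vCPU/memory ranges are placeholders):

```python
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-fleet",  # placeholder
    MinSize=10, MaxSize=50, DesiredCapacity=10,
    VPCZoneIdentifier="subnet-aaa,subnet-bbb,subnet-ccc",  # placeholders
    CapacityRebalance=True,  # proactive replacement of at-risk Spot
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "web-template",  # no instance type inside
                "Version": "$Latest",
            },
            # Attribute-based selection instead of a fixed type list:
            "Overrides": [{
                "InstanceRequirements": {
                    "VCpuCount": {"Min": 2, "Max": 4},
                    "MemoryMiB": {"Min": 8192},
                },
            }],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": 2,
            "OnDemandPercentageAboveBaseCapacity": 20,
            "SpotAllocationStrategy": "price-capacity-optimized",
        },
    },
)
```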
Never put stateful data on Spot. If your application saves files to a local drive (not S3/EFS) or maintains a local database, those files will vanish when the Spot instance is interrupted.
NAT Gateway is often called the “hidden tax” of AWS because its pricing is tiered in a way that catches architects off guard. It isn’t just one fee; it’s a triple-tax on your data.
NAT Gateway: The Triple-Tax Model
When you send 1 GB of data to the internet through a NAT Gateway, you aren’t paying $0.045. You are likely paying 3x to 4x that amount due to how the charges stack:
- The “Idle” Tax (~$32/month): you pay ~$32/month per gateway just for it to sit there. In a standard 3-AZ (Availability Zone) high-availability setup, you need three gateways, costing you ~$97/month before a single byte is moved.
- The “Processing” Tax ($0.045/GB): AWS charges you for every gigabyte that touches the gateway, both inbound and outbound.
- The “Egress” Tax ($0.09/GB): On top of the NAT processing fee, you still have to pay the standard AWS Data Transfer Out fee to the internet.
- Total Cost: For 1 GB of internet traffic, you are paying roughly $0.135/GB.
If you try to save money by using only one NAT Gateway for your entire VPC (instead of one per AZ), you will run into another problem.
If an EC2 instance in AZ-b talks to a NAT Gateway in AZ-a, you pay an additional $0.01/GB for cross-AZ data transfer in both directions. Often, the cost of this "inter-AZ" traffic is higher than the $32/month you saved by deleting the extra gateway.
The S3 & DynamoDB “Silent Leak”
The most common reason for a massive NAT bill is misconfiguration.
By default, if your private instances talk to S3 or DynamoDB, that traffic goes through the NAT Gateway. You are paying $0.045/GB to talk to another AWS service that should be free to access.
The Fix: Use VPC Gateway Endpoints. They are 100% free and keep your S3/DynamoDB traffic off the NAT Gateway entirely.
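A minimal boto3 sketch (region, VPC, and route table IDs are placeholders):

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# A Gateway Endpoint adds an S3 route to the chosen route tables,
# so private-subnet traffic to S3 bypasses the NAT Gateway entirely.
ec2.create_vpc_endpoint(
    VpcId="vpc-0123456789abcdef0",            # placeholder
    ServiceName="com.amazonaws.eu-west-1.s3",
    VpcEndpointType="Gateway",
    RouteTableIds=["rtb-0123456789abcdef0"],  # placeholder
)
```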
Smart Ways to Cut the Bill
- VPC Interface Endpoints: For services like ECR (container images) or Secrets Manager, use Interface Endpoints (PrivateLink). While they have an hourly cost, their data processing fee (~$0.01/GB) is 75% cheaper than NAT Gateway.
- fck-nat (NAT Instances): for dev/test environments, many senior sysops are moving back to “NAT Instances” using open-source AMIs like fck-nat. Running a tiny t4g.nano instance as a NAT can cost $3/month instead of $32, with zero per-GB processing fees.
- IPv6-Only Subnets: in 2026, the “Pro” move is to go IPv6-only. IPv6 traffic uses an Egress-Only Internet Gateway, which is completely free and has no data processing charges.
CloudFront: The Global Perimeter
Amazon CloudFront is more than a CDN: it is a Global Front Door. It is a security perimeter, a serverless compute platform, and a private high-speed highway that connects your users to AWS.
CloudFront is the first line of defense in the AWS “Defense in Depth” model. Because it sits at the Edge Locations (public colocation centers), it can stop attacks before they ever reach your VPC or your expensive databases. It can perform SSL Termination at the CloudFront Edge Location:
- Policy: You should set this to “Redirect HTTP to HTTPS” or “HTTPS Only.”
- Certificates: You use AWS Certificate Manager (ACM) to issue a free SSL certificate.
- The Trap: to use an ACM certificate with CloudFront, the certificate must be requested in the us-east-1 (N. Virginia) region, regardless of where your origin or users are located.
- Like an ALB, a distribution is normally reachable only via DNS, but you can request dedicated static IP addresses for it. The Trap: AWS charges $600 per month for this.
Additional features:
- WAF Integration: includes a native, one-click setup for AWS WAF (Web Application Firewall). It automatically applies managed rules to block SQL injection, Cross-Site Scripting (XSS), and known botnets.
- Automatic DDoS Protection: Every CloudFront distribution is protected by AWS Shield Standard for free. It mitigates Layer 3 and 4 attacks (like SYN floods) at the edge.
Field-Level Encryption: You can encrypt sensitive data (like a credit card number) at the Edge using a public key. Instead of encrypting the whole request, CloudFront “snipes” specific fields (like credit_card_number or social_security_number) and encrypts them individually using Asymmetric Encryption (RSA).
- Public Key Upload: You upload a Public Key to CloudFront. You keep the Private Key locked in a secure vault (like a hardware security module or a specific isolated microservice).
- Profile & Config: You create a “Field-Level Encryption Profile” where you tell CloudFront: “Look for a POST field named ‘CC’ and encrypt it using this Public Key.”
- Edge Encryption: when a user submits a form, the CloudFront Edge location intercepts the POST request. It finds the CC field and scrambles it into a cipher-text string before forwarding the request to your origin.
- Encrypted Transit: the request travels through your Load Balancer, Web Server, and Logging systems.
- Secure Decryption: Only the final “Payment Service” (which has the Private Key) can decrypt and use the original value.
Global Interconnectivity
When a user in Milan talks to a server in Virginia, their data usually hops across the “Dirty Public Internet,” which is slow and unpredictable.
The “New Way”: CloudFront uses the AWS Global Network Backbone.
- TCP Optimization: CloudFront optimizes the “Handshake” between the user and the Edge. Once the data reaches the Edge, it travels over AWS’s private, high-speed fiber directly to your origin (S3, ALB, etc.).
- Origin Shield: for massive global apps, you can enable a centralized “super-cache” layer. So, three levels of cache:
- Edge Locations (PoPs): Hundreds of sites physically close to users.
- Regional Edge Caches (REC): Larger caches that serve as a middle tier (Free/Default).
- Origin Shield (The New Tier): A specific, high-capacity cache you choose in the AWS Region closest to your origin
CloudFront Functions vs. Lambda@Edge
You can move logic out of the data center and into the network itself.
CloudFront Functions (2026) are lightweight, ultra-fast JavaScript (ES6) handlers that you can use for URL rewrites, header manipulation, and JWT validation.
Lambda@Edge runs full Node.js / Python, which you can use for complex SEO, image resizing, A/B testing, and DB lookups.
Relation with Other Services (The Ecosystem)
- S3 (The Classic): Use Origin Access Control (OAC) to ensure your S3 bucket is completely private. The only way to see the data is through CloudFront’s secure, cached URLs.
- Application Load Balancer (ALB): CloudFront can act as a proxy for dynamic content. It keeps the connection to the ALB “warm,” drastically reducing the Time to First Byte (TTFB).
- API Gateway: You can place CloudFront in front of your APIs to provide global caching and DDoS protection for your REST or GraphQL endpoints.
- VPC Origins: New for 2025/2026, you can now connect CloudFront directly to resources in a private subnet (like an internal ALB) without exposing them to the public internet at all.
Security Savings Bundle
CloudFront Security Savings Bundle. You commit to a monthly CloudFront spend, and in exchange, AWS gives you AWS WAF for free (up to a certain usage).