Zero Trust on AWS: Moving Beyond the Perimeter Before a Breach Forces You To

"Zero Trust" gets thrown around in vendor marketing so often that it's started to lose meaning. Most teams hear it and think it means buying a new product. It doesn't. Zero Trust is an architectural philosophy — and on AWS, it's achievable today with services you're probably already paying for but not using to their full potential.

The core principle is simple: never trust, always verify. Don't trust a request because it came from inside your VPC. Don't trust a role because it was assumed by an EC2 instance you provisioned. Don't trust network location as a substitute for authentication. Verify identity, enforce least privilege, and assume breach. Then build your architecture around those assumptions.

This article walks through how I implement Zero Trust architecture in AWS environments: the network design, the IAM layer, the monitoring, and the organizational changes that make it stick.

The perimeter model assumes everything inside the network is safe. Modern attacks — credential theft, supply chain compromise, insider threats — all operate from inside the perimeter. Zero Trust is the correct response to how attacks actually happen.

1. Why This Is a Business Decision, Not Just a Security One

Before getting into the technical implementation, it's worth addressing the business case — because Zero Trust architecture has real ROI, not just compliance checkboxes.

IBM's 2024 Cost of a Data Breach Report put the global average cost of a breach at $4.88 million. Much of that cost comes not from the initial compromise but from what follows it: lateral movement. An attacker compromises one credential, one container, one Lambda function, and from there moves through a flat network until they find what they're looking for. A Zero Trust architecture doesn't prevent the initial compromise; it constrains the lateral movement. The blast radius of any single compromised identity or workload is contained to what that identity was permitted to access, and nothing more.

For teams working toward SOC 2 Type II, PCI-DSS v4, FedRAMP Moderate, or HIPAA, Zero Trust architecture directly maps to control requirements around network segmentation, least-privilege access, and continuous monitoring. Implementing it properly means you're not just more secure — you're auditable.

2. VPC Architecture: Segmentation Is the Foundation

A flat VPC where every workload can reach every other workload is the opposite of Zero Trust. The goal is to design network topology such that a compromised workload can only communicate with the specific resources it needs — and nothing else.

Multi-tier subnet design

Start with strict subnet segmentation by function:

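# Terraform: foundational VPC structure
# Availability zones referenced by the subnet resources below
data "aws_availability_zones" "available" {
  state = "available"
}
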
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = { Name = "prod-vpc-zero-trust" }
}

# Public subnets — load balancers and NAT gateways ONLY
resource "aws_subnet" "public" {
  count             = 3
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet("10.0.0.0/16", 8, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = false  # Never auto-assign public IPs

  tags = { Name = "public-${count.index}", Tier = "public" }
}

# Application subnets — no direct internet access
resource "aws_subnet" "app" {
  count             = 3
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet("10.0.0.0/16", 8, count.index + 10)
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = { Name = "app-${count.index}", Tier = "application" }
}

# Data subnets — isolated, no NAT gateway route
resource "aws_subnet" "data" {
  count             = 3
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet("10.0.0.0/16", 8, count.index + 20)
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = { Name = "data-${count.index}", Tier = "data" }
}

# Management subnets — bastion/SSM access only
resource "aws_subnet" "mgmt" {
  count             = 3
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet("10.0.0.0/16", 8, count.index + 30)
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = { Name = "mgmt-${count.index}", Tier = "management" }
}

The data subnets have no route to the internet — not through a NAT gateway, not through a Transit Gateway to on-premises. The only way in or out is through the application tier, and that path is controlled by security groups that reference specific security group IDs, not CIDR ranges.
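The "no route to the internet" property lives in the route tables, not in the subnet resources themselves. A minimal sketch of how that could look, assuming the `aws_subnet.data` resources above; the route table and association names here are illustrative, not part of the original configuration:

```hcl
# Sketch only: the resource names "data" below are illustrative.
# A route table with no route blocks contains only the implicit local route
# for the VPC CIDR, so nothing in these subnets can reach a NAT gateway,
# an internet gateway, or a Transit Gateway.
resource "aws_route_table" "data" {
  vpc_id = aws_vpc.main.id

  tags = { Name = "data-rt", Tier = "data" }
}

# Associate every data subnet explicitly so none falls back to the VPC's
# main route table, which may carry a default route.
resource "aws_route_table_association" "data" {
  count          = length(aws_subnet.data)
  subnet_id      = aws_subnet.data[count.index].id
  route_table_id = aws_route_table.data.id
}
```

The explicit association matters: a subnet without one silently inherits the main route table, and that is a common way an "isolated" tier ends up with a path out.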

Security groups as micro-perimeters

Stop using CIDR blocks in your security group rules. The moment you write 10.0.0.0/8 as an ingress source, you've given access to everything in that range — including workloads you haven't deployed yet. Use security group references instead:

# Application tier can only receive from the load balancer SG
resource "aws_security_group_rule" "app_ingress_from_alb" {
  type                     = "ingress"
  from_port                = 8080
  to_port                  = 8080
  protocol                 = "tcp"
  security_group_id        = aws_security_group.app.id
  source_security_group_id = aws_security_group.alb.id
  description              = "Allow app traffic on 8080 from ALB only"
}

# Data tier can only receive from the application tier SG
resource "aws_security_group_rule" "db_ingress_from_app" {
  type                     = "ingress"
  from_port                = 5432
  to_port                  = 5432
  protocol                 = "tcp"
  security_group_id        = aws_security_group.data.id
  source_security_group_id = aws_security_group.app.id
  description              = "Allow PostgreSQL from app tier only"
}

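# No egress from the data tier. Security groups are allow-only, so there is
# no "deny" rule to write: you get deny-all by defining no egress rules.
# Terraform removes the default allow-all egress rule that AWS adds to new
# security groups, so this group ends up with none.
resource "aws_security_group" "data" {
  name        = "data-tier"
  description = "Data tier: ingress from app tier only, no egress"
  vpc_id      = aws_vpc.main.id
}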
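The same micro-perimeter idea applies to outbound traffic from the application tier. The rules below are a sketch of scoped egress, assuming the `aws_security_group.app`, `aws_security_group.data`, and `aws_security_group.vpc_endpoints` groups referenced elsewhere in this article and the S3 gateway endpoint defined in section 4; the rule names are illustrative:

```hcl
# App tier may open connections to the data tier on 5432 only.
# For an egress rule, source_security_group_id names the destination group.
resource "aws_security_group_rule" "app_egress_to_db" {
  type                     = "egress"
  from_port                = 5432
  to_port                  = 5432
  protocol                 = "tcp"
  security_group_id        = aws_security_group.app.id
  source_security_group_id = aws_security_group.data.id
  description              = "Allow PostgreSQL to data tier only"
}

# HTTPS to interface endpoints (SSM, Secrets Manager, and similar).
resource "aws_security_group_rule" "app_egress_to_endpoints" {
  type                     = "egress"
  from_port                = 443
  to_port                  = 443
  protocol                 = "tcp"
  security_group_id        = aws_security_group.app.id
  source_security_group_id = aws_security_group.vpc_endpoints.id
  description              = "Allow HTTPS to VPC interface endpoints"
}

# HTTPS to S3 through the gateway endpoint, addressed by its prefix list
# rather than a CIDR range.
resource "aws_security_group_rule" "app_egress_to_s3" {
  type              = "egress"
  from_port         = 443
  to_port           = 443
  protocol          = "tcp"
  security_group_id = aws_security_group.app.id
  prefix_list_ids   = [aws_vpc_endpoint.s3.prefix_list_id]
  description       = "Allow HTTPS to S3 via gateway endpoint"
}
```

With these in place, the app tier has no 0.0.0.0/0 egress either; anything else it needs to reach becomes an explicit, reviewable rule.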

3. Eliminate SSH: AWS Systems Manager Session Manager

In a Zero Trust model, there is no SSH. No bastion hosts. No open port 22 anywhere in your security groups. Instead, all administrative access goes through AWS Systems Manager Session Manager — authenticated by IAM, logged to CloudTrail and optionally to S3, and requiring no inbound network access to the instance.

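# IAM policy for developers: SSM access scoped to tagged instances.
# The tag condition applies to the instance being connected to, so it lives
# on StartSession only. Terminate/Resume act on session resources, which do
# not carry the instance tags. The session ARN pattern follows AWS's
# Session Manager quickstart policy; verify it against how your IdP names sessions.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "StartSessionOnTaggedInstances",
      "Effect": "Allow",
      "Action": "ssm:StartSession",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "StringEquals": {
          "ssm:resourceTag/Environment": "${aws:PrincipalTag/Environment}",
          "ssm:resourceTag/AllowSSM": "true"
        }
      }
    },
    {
      "Sid": "ManageOwnSessions",
      "Effect": "Allow",
      "Action": [
        "ssm:TerminateSession",
        "ssm:ResumeSession"
      ],
      "Resource": "arn:aws:ssm:*:*:session/${aws:userid}-*"
    },
    {
      "Sid": "DescribeSessions",
      "Effect": "Allow",
      "Action": [
        "ssm:DescribeSessions",
        "ssm:GetConnectionStatus"
      ],
      "Resource": "*"
    },
    {
      "Sid": "DenyEC2InstanceConnectKeyPush",
      "Effect": "Deny",
      "Action": "ec2-instance-connect:SendSSHPublicKey",
      "Resource": "*"
    }
  ]
}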

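The condition keys here are doing serious work: a developer tagged as Environment=staging can only start SSM sessions on instances tagged Environment=staging and AllowSSM=true. Production access requires a production-tagged identity, and that tag comes from your IdP as a session tag, not from a manually managed IAM policy. The final statement blocks EC2 Instance Connect key pushes; SSH itself is blocked by having no port 22 rule in any security group.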
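The developer policy only covers the human side. The instance also needs an identity that lets the SSM Agent register with Systems Manager. A minimal sketch using the AWS managed policy `AmazonSSMManagedInstanceCore`; the role and profile names are assumptions for illustration:

```hcl
# Role the EC2 instance assumes so the SSM Agent can register and open
# Session Manager channels. The name is illustrative.
resource "aws_iam_role" "ssm_instance" {
  name = "ec2-ssm-managed-instance"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

# AWS managed policy with the baseline permissions the SSM Agent needs.
resource "aws_iam_role_policy_attachment" "ssm_core" {
  role       = aws_iam_role.ssm_instance.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

# Instance profile to attach to instances or launch templates.
resource "aws_iam_instance_profile" "ssm_instance" {
  name = "ec2-ssm-managed-instance"
  role = aws_iam_role.ssm_instance.name
}
```

Instances launched with this profile still need the `Environment` and `AllowSSM` tags before the developer policy above will let anyone connect to them.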

4. AWS PrivateLink: Keep Your Traffic Off the Internet

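Every time your application in a private subnet calls s3.amazonaws.com, the request leaves your VPC through a NAT gateway and reaches S3 at a public endpoint. AWS keeps that traffic on its own network, but the path still depends on public IP addressing and a default route out of the subnet, which is exactly the kind of open egress a compromised workload can abuse. VPC endpoints remove that dependency: gateway endpoints for S3 and DynamoDB, and PrivateLink-backed interface endpoints for most other services. Treat them as a Zero Trust requirement for any data-sensitive workload.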

# Gateway endpoints — S3 and DynamoDB (no cost)
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = aws_route_table.private[*].id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = "*"
      Action    = ["s3:GetObject", "s3:PutObject", "s3:ListBucket"]
      Resource  = [
        "arn:aws:s3:::${var.app_bucket}",
        "arn:aws:s3:::${var.app_bucket}/*"
      ]
    }]
  })
}

# Interface endpoints — for services like SSM, ECR, Secrets Manager
resource "aws_vpc_endpoint" "ssm" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.${var.region}.ssm"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.app[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}

resource "aws_vpc_endpoint" "secrets_manager" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.${var.region}.secretsmanager"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.app[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}

With private_dns_enabled = true, your application code doesn't change — it still calls secretsmanager.us-east-1.amazonaws.com, but that DNS name now resolves to a private IP inside your VPC. Traffic never leaves. And the endpoint policy lets you restrict which S3 buckets are reachable through this endpoint, preventing a compromised workload from exfiltrating data to an attacker-controlled S3 bucket.
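If the goal is to run Session Manager from section 3 with no NAT path at all, the `ssm` endpoint alone is unlikely to be enough: the agent also talks to the `ssmmessages` service, and older agent versions use `ec2messages`. The sketch below adds both and opens port 443 on the endpoint security group to the app tier. The local value and rule name are hypothetical, and `aws_security_group.vpc_endpoints` is assumed to be defined elsewhere, as in the snippets above:

```hcl
locals {
  # Hypothetical local: extra interface endpoints Session Manager relies on.
  session_manager_services = ["ssmmessages", "ec2messages"]
}

resource "aws_vpc_endpoint" "session_manager" {
  for_each = toset(local.session_manager_services)

  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.${var.region}.${each.value}"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.app[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}

# Interface endpoints are reached over HTTPS; allow it from the app tier only.
resource "aws_security_group_rule" "endpoints_ingress_from_app" {
  type                     = "ingress"
  from_port                = 443
  to_port                  = 443
  protocol                 = "tcp"
  security_group_id        = aws_security_group.vpc_endpoints.id
  source_security_group_id = aws_security_group.app.id
  description              = "Allow HTTPS from app tier to interface endpoints"
}
```

Each interface endpoint is billed per hour and per GB, so it is worth listing only the services your workloads actually call.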

5. IAM Condition Keys: Enforcing Zero Trust at the Identity Layer

Network controls stop lateral movement at the infrastructure layer. IAM condition keys enforce Zero Trust at the identity layer — ensuring that even a valid identity can only act from expected contexts.

# Restrict S3 access to requests originating inside the VPC
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyAccessOutsideVPC",
      "Effect": "Deny",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::prod-sensitive-data",
        "arn:aws:s3:::prod-sensitive-data/*"
      ],
      "Condition": {
        "StringNotEquals": {
          "aws:SourceVpc": "vpc-0abc123def456"
        }
      }
    },
    {
      "Sid": "RequireSSLInTransit",
      "Effect": "Deny",
      "Action": "s3:*",
      "Resource": "*",
      "Condition": {
        "Bool": {
          "aws:SecureTransport": "false"
        }
      }
    },
    {
      "Sid": "DenyUnencryptedObjectUploads",
      "Effect": "Deny",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::prod-sensitive-data/*",
      "Condition": {
        "StringNotEquals": {
          "s3:x-amz-server-side-encryption": "aws:kms"
        }
      }
    }
  ]
}

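The aws:SourceVpc condition is the Zero Trust layer for S3: even if an attacker steals the IAM credentials for your application role and tries to use them from their own environment, the request is denied. Note that aws:SourceVpc is only present on requests that arrive through a VPC endpoint, so this policy depends on the S3 gateway endpoint from section 4. Requests routed through a NAT gateway, the console, or a CI runner outside the VPC are denied as well, so plan an explicit exception for any administrative path you still need.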
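One way to avoid hardcoding the VPC ID is to attach the first statement as a bucket policy from Terraform. This is a sketch under assumptions: `var.admin_role_arns` is a hypothetical list of roles, such as the Terraform deployment role and a break-glass role, that must keep access from outside the VPC. Without such an exemption, a pipeline running outside the VPC would lock itself out of the bucket on the next plan.

```hcl
# Hypothetical variable: principals exempt from the VPC restriction.
variable "admin_role_arns" {
  type        = list(string)
  description = "Role ARNs allowed to reach the bucket from outside the VPC"
}

resource "aws_s3_bucket_policy" "sensitive" {
  bucket = "prod-sensitive-data"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid       = "DenyAccessOutsideVPC"
      Effect    = "Deny"
      Principal = "*"
      Action    = "s3:*"
      Resource = [
        "arn:aws:s3:::prod-sensitive-data",
        "arn:aws:s3:::prod-sensitive-data/*"
      ]
      # Both conditions must hold for the Deny to apply: the request did not
      # come through this VPC AND the caller is not an exempt admin role.
      Condition = {
        StringNotEquals = { "aws:SourceVpc" = aws_vpc.main.id }
        ArnNotLike      = { "aws:PrincipalArn" = var.admin_role_arns }
      }
    }]
  })
}
```

Keep the exempt list as short as possible, since every ARN on it is a path around the network control.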

Require MFA for sensitive operations

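# Require recent MFA for any destructive action in production.
# Two statements are needed: condition operators inside one statement are
# ANDed, and aws:MultiFactorAuthAge is absent when MFA was not used, so a
# single combined statement would never match.
{
  "Sid": "DenyDestructiveActionsWithoutMFA",
  "Effect": "Deny",
  "Action": [
    "ec2:TerminateInstances",
    "rds:DeleteDBInstance",
    "s3:DeleteBucket",
    "iam:DeleteRole",
    "iam:DetachRolePolicy",
    "kms:ScheduleKeyDeletion"
  ],
  "Resource": "*",
  "Condition": {
    "BoolIfExists": {
      "aws:MultiFactorAuthPresent": "false"
    }
  }
},
{
  "Sid": "DenyDestructiveActionsWithStaleMFA",
  "Effect": "Deny",
  "Action": [
    "ec2:TerminateInstances",
    "rds:DeleteDBInstance",
    "s3:DeleteBucket",
    "iam:DeleteRole",
    "iam:DetachRolePolicy",
    "kms:ScheduleKeyDeletion"
  ],
  "Resource": "*",
  "Condition": {
    "NumericGreaterThan": {
      "aws:MultiFactorAuthAge": "3600"
    }
  }
}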

6. Continuous Verification with GuardDuty and Security Hub

Zero Trust isn't a state you achieve — it's a continuous process. You need to know when something deviates from expected behavior, and you need to respond automatically when possible.

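# Enable GuardDuty in the provider's region. The detector is a regional
# resource: repeat it per region with provider aliases to cover all regions.
# Recent AWS provider versions mark the datasources block as deprecated in
# favor of aws_guardduty_detector_feature; check your provider version.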
resource "aws_guardduty_detector" "main" {
  enable = true

  datasources {
    s3_logs { enable = true }
    kubernetes { audit_logs { enable = true } }
    malware_protection {
      scan_ec2_instance_with_findings {
        ebs_volumes { enable = true }
      }
    }
  }
}

# EventBridge rule — auto-respond to high-severity findings
resource "aws_cloudwatch_event_rule" "guardduty_high" {
  name        = "guardduty-high-severity"
  description = "Capture GuardDuty HIGH findings for auto-response"

  event_pattern = jsonencode({
    source      = ["aws.guardduty"]
    detail-type = ["GuardDuty Finding"]
    detail = {
      severity = [{ numeric = [">=", 7] }]
    }
  })
}

resource "aws_cloudwatch_event_target" "isolate_instance" {
  rule      = aws_cloudwatch_event_rule.guardduty_high.name
  target_id = "IsolateCompromisedInstance"
  arn       = aws_lambda_function.incident_response.arn
}

The Lambda function triggered by this rule does two things: it attaches an isolation security group to the flagged instance (removing all ingress and egress except to your security tooling), and it pages your on-call engineer with the full finding details. This is automated incident containment — the blast radius is limited within seconds of detection, not hours.
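Two supporting pieces are implied by that description but not shown: EventBridge needs permission to invoke the function, and the isolation security group has to exist before the Lambda can attach it. A sketch, assuming the rule and function names used above; `var.security_tooling_sg_id` and the isolation group are hypothetical names introduced here:

```hcl
# Hypothetical variable: security group of your forensics/EDR tooling.
variable "security_tooling_sg_id" {
  type        = string
  description = "Security group ID of the security tooling endpoints"
}

# Without this permission the EventBridge target exists but every
# invocation attempt fails.
resource "aws_lambda_permission" "allow_eventbridge" {
  statement_id  = "AllowGuardDutyRuleInvoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.incident_response.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.guardduty_high.arn
}

# Quarantine group: no ingress rules, and no egress except the one below.
resource "aws_security_group" "isolation" {
  name        = "incident-isolation"
  description = "Quarantine group applied by the incident response Lambda"
  vpc_id      = aws_vpc.main.id
}

resource "aws_security_group_rule" "isolation_egress_to_tooling" {
  type                     = "egress"
  from_port                = 443
  to_port                  = 443
  protocol                 = "tcp"
  security_group_id        = aws_security_group.isolation.id
  source_security_group_id = var.security_tooling_sg_id
  description              = "Allow HTTPS to security tooling only"
}
```

For the isolation to take effect, the Lambda should replace the instance's existing security groups with this one rather than add it alongside them, because security group rules are additive.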

7. The Organizational Layer: What Technology Alone Can't Fix

The technical controls above are straightforward to implement. The harder part is the organizational discipline that makes them stick. Three things matter:

Infrastructure as Code for everything. Every security group rule, every IAM policy, every VPC endpoint must be in Terraform or CloudFormation. Back that up with an SCP that denies changes to those resources from any principal other than your deployment pipeline's role; an SCP cannot tell a console click from an API call, but it can tell who is making the call. If it's not in code, it doesn't exist, and it will drift. In every environment I've inherited, the worst security findings were resources created manually two years ago that nobody remembered.

Security review as part of the PR process. Use tools like Checkov, tfsec, or Semgrep in your CI pipeline to catch misconfigurations before they reach production. A security group rule opening port 22 to 0.0.0.0/0 should fail the pipeline, not get caught in a quarterly review.

Incident response playbooks tested before you need them. Your automated isolation Lambda is worth nothing if nobody knows what to do after it fires. Document the response procedure, run a tabletop exercise, and make sure the security team can operate the tooling under pressure.

Zero Trust architecture built on top of a manual, console-driven workflow will erode within six months. The discipline that makes it permanent is treating infrastructure change the same way you treat code change: reviewed, tested, versioned, and deployed through a pipeline.
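The "Infrastructure as Code for everything" rule can be enforced with a service control policy along these lines. This is a sketch, not a complete guardrail: the action list is a starting point, and `var.pipeline_role_arn_pattern` and `var.workload_ou_id` are hypothetical inputs you would replace with your own pipeline role and organizational unit.

```hcl
# Hypothetical inputs for illustration.
variable "pipeline_role_arn_pattern" {
  type        = string
  description = "ARN pattern of the only role allowed to change guarded resources"
  default     = "arn:aws:iam::*:role/terraform-deploy"
}

variable "workload_ou_id" {
  type        = string
  description = "Organizational unit the SCP is attached to"
}

resource "aws_organizations_policy" "iac_only" {
  name = "deny-out-of-band-network-and-iam-changes"
  type = "SERVICE_CONTROL_POLICY"

  content = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid    = "DenyOutOfBandChanges"
      Effect = "Deny"
      Action = [
        "ec2:AuthorizeSecurityGroupIngress",
        "ec2:AuthorizeSecurityGroupEgress",
        "ec2:RevokeSecurityGroupIngress",
        "ec2:RevokeSecurityGroupEgress",
        "ec2:CreateVpcEndpoint",
        "ec2:DeleteVpcEndpoints",
        "iam:PutRolePolicy",
        "iam:AttachRolePolicy",
        "iam:DetachRolePolicy",
      ]
      Resource = "*"
      # The Deny applies to every principal except the pipeline role.
      Condition = {
        ArnNotLike = { "aws:PrincipalArn" = var.pipeline_role_arn_pattern }
      }
    }]
  })
}

resource "aws_organizations_policy_attachment" "iac_only" {
  policy_id = aws_organizations_policy.iac_only.id
  target_id = var.workload_ou_id
}
```

Roll a policy like this out to a sandbox OU first. An SCP that is too broad will also block legitimate automation, including any break-glass procedure you have not exempted.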

Where to Start If Your Environment Is Already Established

If you're reading this with an existing AWS environment and wondering where to begin, the priority order I use in engagements is:

  1. Audit your current state first. Run AWS Trusted Advisor, Security Hub, and an IAM credential report. Understand what you have before changing anything.
  2. Eliminate 0.0.0.0/0 ingress rules on everything except internet-facing load balancers. This is usually the highest-impact, lowest-risk first step. Replace them with security group references wherever possible.
  3. Enable GuardDuty and Security Hub in all regions. You need visibility before you can enforce anything.
  4. Add VPC Gateway Endpoints for S3 and DynamoDB. Free, low-risk, immediate reduction in internet exposure.
  5. Migrate to SSM Session Manager and close port 22. In my experience this takes about a day per environment and eliminates an entire attack surface.
  6. Build out IAM condition keys and SCPs to enforce the network and identity controls that can't be enforced at the VPC layer alone.

Each of these steps is individually deployable without a massive architectural overhaul. Done in sequence over a quarter, they transform a conventionally secured AWS account into one that can genuinely claim Zero Trust posture — and defend it in front of an auditor.

If you're working through this and want a structured assessment of where your environment stands, or if you need someone to design and implement this architecture from scratch, reach out. This is the core of what I do for clients.
