AWS at Scale #5: Cloud Platform in Action.
You may do AWS, but do you do AWS at Scale?
Introduction
Luckily, in preparation for this post (and to lay some foundations), we’ve already covered a number of key topics:
An introduction to AWS at Scale and what it means to me.
Ensuring that you make the right career choice and nail it.
Understanding AWS platform concepts at scale.
An introduction to the provider-to-consumer platform model.
This post will build on these key topics to frame what a cloud platform looks like at scale (and in action).
If you can execute this vision, then the excellent benefits and efficiencies delivered are as follows:
✅ A comprehensive architectural engagement model for net new AWS demand.
✅ Fast automated AWS account vending.
✅ Reduced distributed DevOps resources, enabling a centralised model.
✅ Centralised platform team with expert consumer onboarding & support.
✅ Centralised architectural, DevSecOps and platform principles, standards, guidelines, governance, compliance and ways of working.
✅ Reduced cost of supplier consulting through design and initial implementation.
✅ Faster bootstrapping.
✅ Reduced TCO of platform management.
✅ Fast innovation.
✅ Confidence in governance & compliance when building products and services on AWS, at scale.
AWS Cloud Platform in Action
What does it look like?
This is an initial snapshot of the example platform; we’ll talk a lot more about the detail later in this post.
It’s a pretty complex diagram, but when you break it down, it’s pretty simple (and we’ll do that later).
Before we get to that though, let’s take a look at some requirements that we might want to define, upfront, to commission a project (and its funding) to deliver a platform at this scale:
AWS (at Scale) Platform Requirements
1. Establish and build a provider to consumer model
The provider (typically a cloud platform capability) must deliver a consistent opinionated AWS cloud platform configuration covering core platform level functions, such as:
➡️ A ‘provider’ repo with PR workflows for all account, infosec, general hardening & VPC vending purposes, plus overall landing zone configurations (everything as code; more on that later, and see the vending sketch after this list).
➡️ Provide platform support (core landing zone, Control Tower, supporting services and ecosystem tooling)
➡️ Platform monitoring (platform uptime, region uptime, core AWS services, core landing zone functionality) and notifications to all stakeholders and workload teams when services are degraded.
➡️ Automated governance and compliance, implemented during vending and during runtime.
➡️ Mandatory tagging, captured and delivered through a net new demand engagement model, then provisioned through AWS account vending and subsequent resource provisioning.
➡️ Centralised InfoSec tooling (code & vulnerability scanning, centralised logging, privileged access management, core platform policies, RBAC, MFA etc).
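To make the vending piece concrete, here’s a minimal Terraform sketch of vending an account into an OU with the mandatory tags injected. It’s illustrative only: the variable names are hypothetical, and a real implementation would more likely sit behind Control Tower Account Factory or a dedicated vending pipeline.

```hcl
# Hypothetical vending sketch: create a workload account in the target OU
# and inject the mandatory tags captured during the engagement process.
resource "aws_organizations_account" "workload" {
  name      = var.account_name  # e.g. "payments-prod" (illustrative)
  email     = var.account_email # unique root email per account
  parent_id = var.target_ou_id  # OU chosen from the account pattern

  tags = {
    CostCode       = var.cost_code
    TechnicalOwner = var.technical_owner
    Tier           = var.workload_tier
  }
}
```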
The consumers (typically the builders) are then expected to behave as compliant platform guests by:
☑️ Aligning their workload architecture with opinionated architecture standards.
☑️ Assessing their workloads for tiering alignment and business criticality.
☑️ Consuming centralised modules as intended.
☑️ Participating in inner sourcing of modules.
☑️ Doing everything as code and avoiding ClickOps.
☑️ Providing regular and honest consumer feedback.
☑️ Ensuring that their project has funding and a valid cost code.
☑️ Ensuring that their mandatory tag values are kept up to date.
2. Be Consumer Focused
Focus on the consumer and build a formidable platform brand and incredible developer experience.
➡️ An entry point for net new demand, typically through an ITSM workflow that then triggers an architectural engagement process (more on that later).
➡️ A separate ‘consumer’ repo with PR workflows for provisioning AWS resources within the vended AWS account. Consumers can start building very quickly.
➡️ Supported IaC modules that consumers can quickly use for provisioning the most popular commoditised AWS resources; again, consumers can start building very quickly (see the module sketch after this list).
➡️ A delegated FinOps capability for financial tracking with guidance and guardrails on cost savings.
➡️ A set of observability/SRE standards and acceptable tiers for workload availability.
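As a rough illustration of the supported-modules point above, the consumer side might look something like this (the module source and inputs are hypothetical, not a real registry path):

```hcl
# Hypothetical consumer usage of a supported, centrally published module.
module "app_bucket" {
  source = "git::https://github.example.com/platform/terraform-aws-s3.git?ref=v3.2.0"

  bucket_name = "payments-artefacts"
  tier        = "tier-2" # could drive backup/replication defaults inside the module
}
```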
3. Everything as Code
Weed out those ClickOps!
❇️ All platform landing zone, infrastructure and workload configuration needs to be auditable, both by humans and by machines, for governance and compliance.
❇️ Ability to vend a new landing zone, as code, to support any additional consumer landing zones identified during future M&A activities.
❇️ Ability to detect drift at the workload and platform landing zone layers.
❇️ Ability to detect drift between the dev and consumer landing zones.
❇️ All platform landing zone and workload changes pass through PR and approval workflows for change auditability and traceability.
❇️ All AWS resources in every workload account are to be provisioned as code wherever possible.
❇️ Everything scannable, at rest and at runtime (see the sketch after this list).
❇️ Eradicate ClickOps unless explicitly required.
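As one sketch of the ‘scannable at runtime’ point, AWS Config can record every configuration change in a vended account. This is a simplified illustration: the recorder start, bucket policy and IAM role are assumed to exist elsewhere.

```hcl
# Sketch: record all resource configuration changes so the estate stays
# auditable by humans and machines at runtime.
resource "aws_config_configuration_recorder" "this" {
  name     = "platform-recorder"
  role_arn = var.config_role_arn # assumed pre-provisioned IAM role

  recording_group {
    all_supported                 = true
    include_global_resource_types = true
  }
}

resource "aws_config_delivery_channel" "this" {
  name           = "platform-delivery"
  s3_bucket_name = var.central_log_bucket # assumed centralised logging bucket
  depends_on     = [aws_config_configuration_recorder.this]
}
```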
4. Integrate Software Delivery Life Cycle Management at the Platform Level
At the platform level, build a test and release process for landing-zone-level integrations and changes. This isn’t always an obvious requirement, but we need a dev platform-level landing zone (defined as code) as part of a promotional workflow to test and promote landing-zone-level changes and feature releases into production.
These are the platform-level functions that need an SDLC mindset and a functional approach to testing and releasing into the production landing zone:
👉 VPC & security baseline vending mechanics
👉 Primary provider repo
👉 Consumer repo vending process
👉 Core transit gateway changes
👉 East-west inspection VPC changes
👉 Centralised egress inspection VPC changes
👉 Region promotion and demotion
👉 Control Tower & Landing Zone upgrades
5. Integrate Software Delivery Life Cycle Management at the Workload Level
Introduce SDLC environment segregation at the workload level as a clear marker for environment and workload separation. This leads to the following AWS account patterns, which generally look like this (a vending-input sketch follows the list):
🔆 SDLC account structure (dev, stage and prod). A full software delivery lifecycle injected into critical workloads by default, no environment sharing, no workload sharing.
🔆 Standalone accounts. Primarily selected for non-critical workloads, utilities or third-party tooling. These accounts allow for tooling and other services to be deployed outside of the SDLC account structure where needed.
🔆 Sandbox accounts. An area for learning, hypothesis testing and general sandbox play.
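As a sketch of the vending input mentioned above, the requested pattern could be validated as code at vend time (the variable name is illustrative):

```hcl
# Illustrative vending input: the request must name one of the three
# supported account patterns described above.
variable "account_pattern" {
  type        = string
  description = "One of: sdlc, standalone, sandbox"

  validation {
    condition     = contains(["sdlc", "standalone", "sandbox"], var.account_pattern)
    error_message = "account_pattern must be sdlc, standalone or sandbox."
  }
}
```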
6. AWS Region Usage Types
Automated Full, Lite and Non-LZ AWS region management
Be smart about how you deploy centralised egress and east-west inspection VPCs. One of the suggested requirements is everything as code; to achieve that, you need a native AWS product that you can deploy, test and manage through code.
You also want to be smart about the economics of how you deploy workloads into regions, so you can split all required AWS regions into three categories (both from a core networking perspective and in terms of the regions you’ll allow consumers to deploy resources into):
🌍 Full Regions:
The most used regions. These are fully supported, so they have full east-west inspection and local centralised egress.
🌍 Lite Regions:
These regions host a limited number of workloads. Workloads get east-west inspection as normal; however, centralised egress is remote (it passes through the nearest full region).
🌍 Non-LZ Regions:
These regions have no workloads and are not supported. There is no east-west traffic inspection and deploying resources here isn’t supported.
You want the ability to manage the region categorisation as code. This means that if a requirement comes up to promote a non-LZ region to a lite region, the process would be:
✔️ Update the code through a PR and approval workflow > deploy to dev LZ
✔️ Test
✔️ Promote change through approval workflow to consumer LZ.
The same process should work for Lite > Non-LZ, Lite > Full and so on, all managed via code, as sketched below. This approach also ensures firewall costs are kept down.
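A minimal sketch of what region categorisation as code might look like, assuming a simple Terraform map that downstream networking modules consume (the region assignments are illustrative):

```hcl
# Sketch: region categories held as data, so promoting a region from
# "non" to "lite" is a one-line change reviewed and promoted via PR.
locals {
  region_categories = {
    "eu-west-1"    = "full" # local egress + east-west inspection
    "eu-west-2"    = "full"
    "eu-central-1" = "lite" # east-west inspection, egress via nearest full region
    "eu-north-1"   = "non"  # no workloads, no inspection, not supported
  }

  full_regions = [for r, c in local.region_categories : r if c == "full"]
}
```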
7. PAM for all AWS Accounts
A robust ‘privileged access model’ for all web access into AWS accounts
There are two key elements to this requirement that would not only improve our security posture but also drive an important change in behaviour, attitude and ways of working.
✋ No persistent elevated web console administrative access to any AWS account.
⏳ Limited time bound access to the AWS web console and CLI, available through a request and approval workflow.
We want to ensure that we are opinionated and driving the correct behaviour based on the requirement of 'everything as code'. Non-persistent elevated console/CLI access, along with time-bound access privileges, ensures that consumer teams are building their workloads as code with the provided repo, PR workflow and infrastructure-as-code modules.
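One way to express the time-bound part as code is an IAM Identity Center permission set with a short session duration; assignment would then be gated by the PAM request/approval workflow (not shown here, and the names are illustrative):

```hcl
# Sketch: sessions granted through this permission set expire after one hour;
# the PAM workflow controls who gets assigned to it, and for how long.
resource "aws_ssoadmin_permission_set" "elevated" {
  name             = "TimeBoundAdmin"
  instance_arn     = var.sso_instance_arn # assumed Identity Center instance
  session_duration = "PT1H"               # ISO 8601: access expires after an hour
}
```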
8. Automate Everything
We wanted to take full advantage of the 'everything as code' requirement, so we set out to automate everything within our vending process and beyond, into future demand management via ServiceNow.
Initially, this is what we set out to automate:
1️⃣ All Landing Zone patterns
→ Dev landing zone as described above.
→ Consumer 1 landing zone (the first production landing zone)
→ A potential consumer 2 landing zone (any future new landing zone required due to M&A activities)
2️⃣ Advanced, opinionated VPC reference architecture that should cater for existing workloads, the future of microservices and modern networking plus a tiering / business criticality model.
→ This should be optional during the vend and post vend.
→ Consumer teams cannot change the configuration of the VPC.
→ Consumers must have a VPC lookup capability for the deployment of resources when needed.
3️⃣ Core Transit Gateway configuration and automated VPC attach for private/backend subnet:
→ AWS region promotion and demotion driven by workload demand.
→ All inspection VPC configurations defined as code.
→ All AWS account patterns as detailed above.
→ All AWS security baselines, detailed in 4️⃣ below.
4️⃣ InfoSec Baselines
→ Guardduty configuration
→ SecurityHub configuration
→ Centralised logging and threat detection.
→ Control Tower controls
→ Everything auditable
→ Everything scannable
→ All AWS roles available only through time-bound privileged access management.
5️⃣ Other Useful Stuff
→ Mandatory tagging keys and injected values enforced through automated account vending (see the tag policy sketch after this list)
→ AWS resource naming standards enforced through automated account vending
→ AWS resource backup policies available for consumption through tagging.
→ AWS account enrolment into AWS Private Marketplace.
→ Automated AWS live diagramming.
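To ground the mandatory tagging item, here’s a minimal sketch of an AWS Organizations tag policy applied to a workloads OU. It’s deliberately shortened: a real policy would cover the full taxonomy, and tag policies must be enabled on the organisation first.

```hcl
# Sketch: enforce the mandatory tag key taxonomy via an Organizations tag policy.
resource "aws_organizations_policy" "mandatory_tags" {
  name = "mandatory-tagging"
  type = "TAG_POLICY"

  content = jsonencode({
    tags = {
      costcode = {
        tag_key = { "@@assign" = "CostCode" } # enforce the canonical key casing
      }
    }
  })
}

resource "aws_organizations_policy_attachment" "workloads" {
  policy_id = aws_organizations_policy.mandatory_tags.id
  target_id = var.workloads_ou_id # assumed OU containing workload accounts
}
```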
Principles/Standards
In order to achieve some of these requirements, we need to set out and define a number of useful principles/standards (specific to AWS cloud) that should be agreed by all senior architecture stakeholders across the business through some form of architecture review board.
Here are nine worth mentioning:
1. Apply Tiering
The rationale underpinning the adoption of this standard is focused on facilitating an understanding of both the business and its supporting services. By identifying critical components early in the development life-cycle, the organisation can make informed decisions regarding strategic investments in availability, instrumentation, recoverability, SRE support, and incident/service management.
2. Integrate Business Continuity into the Pipeline
This standard strategically emphasises the utilisation of Continuous Integration/Continuous Deployment (CI/CD) pipelines as a fundamental mechanism for recovery and rebuilding into alternate AWS accounts or regions, if required.
Examples include:
Ensuring that all cloud resources are provisioned seamlessly within the CI/CD pipeline.
Recovering a platform to a new AWS account under circumstances where the original account is inaccessible, suspended, or compromised.
Deploying to alternative AWS regions in response to regional service failures, such as those affecting S3, Lambda or other regional AWS services (see the sketch after this list).
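As a sketch of the alternate-region point, pipeline-driven IaC makes this mostly a provider retarget; the module path and variable here are hypothetical:

```hcl
# Sketch: the same workload module, retargeted at a recovery region
# without changing the module itself.
provider "aws" {
  alias  = "recovery"
  region = var.recovery_region # e.g. "eu-west-2" when the home region is impaired
}

module "workload_recovery" {
  source    = "../modules/workload" # hypothetical module path
  providers = { aws = aws.recovery }
}
```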
3. Pipeline-Driven Change Management
This standard strategically emphasises the utilisation of Continuous Integration/Continuous Deployment (CI/CD) pipelines as a fundamental mechanism for change management.
Examples include:
Ensuring that all cloud resources are provisioned seamlessly within the CI/CD pipeline.
Ensuring that any feature releases and improvements specific to IaC follow an SDLC approach through segmented environments before promotion to production.
Ensuring that any IaC specific hot/break fixes are integrated into the CI/CD pipeline.
The rationale behind the adoption of this method is focused on elevating the resilience, reliability, recoverability and portability of platforms and services. By embracing this approach, there’s a heightened commitment to the increased adoption of infrastructure as code, fostering a robust framework for platform recovery, and instilling greater confidence in the resilience of the overall platforms.
4. Everything as Code
This standard strategically emphasises the utilisation of infrastructure as code for all provisioning, permitting ‘ClickOps’ (manual, console-driven tasks) only by exception.
The rationale supporting this standard includes, but is not limited to:
Unifies ways of working across the DevSecOps, Observability, Security, Networks, Cloud Platform and Architecture practices.
Everything is portable and reproducible, especially when required to rebuild a landing zone or workload, or to reproduce our landing zone on an alternative cloud platform.
Everything is traceable and auditable.
Everything is version controlled.
The rationale behind the adoption of this standard is centred on elevating the resilience, auditability, reliability, recoverability and portability of platforms and services. By embracing this approach, there is a heightened commitment to the increased adoption of infrastructure as code, fostering a robust framework for platform recovery, and instilling greater confidence in the resilience, recovery and compliance of the platforms.
Adopting this standard will also set the stage for a future multi-cloud adoption.
5. Provisioning over Configuration
This standard strategically emphasizes the utilisation of native cloud-based distributed services, including but not limited to S3, Lambda, DynamoDB, ECS, EKS, Aurora, DSQL and OpenSearch, for a diverse array of business solutions that demand:
Automated, cost-effective scaling up and down of services based on demand.
Automated fail-over between zones and regions to meet tiering requirements.
A contemporary and modular approach to systems architecture, characterized by the decoupling of components into more manageable, independent entities.
Provisioning over configuration to support immutable resources, specifically containers and serverless architectures over virtual machine OS management.
The rationale for adopting this strategic approach is comprehensive and includes imperatives such as workload modernisation, the assurance of predictable scalability, heightened reliability, robust security measures and seamless integration capabilities. Furthermore, this approach mitigates the risk of adverse impact during infrastructure outages, affirming its role in fortifying system resilience.
If you’ve ever come across the term ‘cattle, not pets’, this is basically the same thing.
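A minimal sketch of the idea, assuming a pipeline that builds the artefact: the function is provisioned as an immutable, disposable unit, with no OS to patch and scaling handled by the service (all names are illustrative).

```hcl
# Sketch: a serverless function provisioned (not configured) as a disposable unit.
resource "aws_iam_role" "fn" {
  name = "orders-fn-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "lambda.amazonaws.com" }
    }]
  })
}

resource "aws_lambda_function" "orders" {
  function_name = "orders-handler"
  role          = aws_iam_role.fn.arn
  runtime       = "python3.12"
  handler       = "app.handler"
  filename      = "build/orders.zip" # artefact produced by the CI/CD pipeline
}
```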

6. Tag Appropriately
This standard strategically emphasises the utilisation of mandatory tags. The rationale supporting this standard includes, but is not limited to:
Clear ownership of AWS resources.
The continued enablement and support of a FinOps capability.
Enhanced data provided to service & incident management.
The ability to target more operational automation.
Support for the CMDB/CSDM and a Technology Data Model.
By adopting this standard, there is a heightened commitment in the DevOps, Developer and Cloud Platform groups to the increased adoption of tagging at resource provisioning time with a unified taxonomy, fostering a robust framework for data management, and instilling greater confidence in reporting, incident management, data modelling and FinOps.
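One simple way to back this standard with code is provider-level default tags, so every resource a pipeline provisions carries the taxonomy even when an individual resource block forgets it (a sketch; the tag keys and variables are illustrative):

```hcl
# Sketch: default tags applied to every resource this provider creates.
provider "aws" {
  region = var.region

  default_tags {
    tags = {
      CostCode       = var.cost_code
      TechnicalOwner = var.technical_owner
      Tier           = var.workload_tier
    }
  }
}
```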
7. Adopt Instrumentation & Observability Tooling to Facilitate Successful Operations
This standard strategically champions the adoption of instrumentation and observability tooling to facilitate load testing, observability, monitoring, accurate costs, and analysis.
Key examples include:
Simulation of end-user/machine load testing.
Simulation of programmatic load testing.
Simulation of data movement.
Simulation of alerting and notifications against baseline criteria.
The rationale for embracing this approach is centered on cultivating a predictable cost and sizing model through simulated load testing, with the added benefit of continuous measurement throughout the entire product life cycle. Moreover, by incorporating instrumentation and observability from the inception of the platform, this standard aims to fortify the foundation for comprehensive analysis and continuous improvement throughout its lifecycle.
8. Apply Segregation
This standard strategically advocates for the systematic organization of cloud platform-specific services and functionality into distinct, modularized components. Key examples include:
AWS Workload Accounts: Ensuring the segregation of environments for production and non-production purposes and across workloads.
Infrastructure as Code Terraform Modules: Employing modularized, iterated upon, reviewed, tested, and code-scanned modules for the management of infrastructure.
VPC/Workload Network Segmentation: Implementing explicit network segmentation within Virtual Private Clouds (VPCs) to curtail default any-to-any access.
Infrastructure as Code State Management: Segregating infrastructure states for production and non-production, as well as service-specific segregation.
The rationale underpinning the adoption of this method is multifaceted. Each area of segregation functions akin to a bulkhead, effectively containing the impact of misconfigurations or compromises. This systematic approach not only bolsters security but also enhances the overall reliability and availability of the cloud platforms.
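To make the state-segregation point concrete, here’s a minimal sketch of a per-environment, per-service backend; the bucket and table names are illustrative:

```hcl
# Sketch: state segregated per environment and per service, so non-production
# state can never touch production.
terraform {
  backend "s3" {
    bucket         = "org-tfstate-prod"                   # separate bucket per environment
    key            = "payments/network/terraform.tfstate" # separate key per service
    region         = "eu-west-1"
    dynamodb_table = "tfstate-locks-prod"                 # separate lock table too
  }
}
```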
9. Principle of Least Privilege
This standard strategically emphasizes the adherence to the Principle of Least Privilege, a practice that restricts access rights for users, accounts, and machine access to the minimum necessary for the successful execution of their tasks. Examples include:
Implementation of approval workflows governing time-bound privileged access management, particularly concerning any elevated human access related to production workloads.
Provisioning of resources exclusively through a Continuous Integration/Continuous Deployment (CI/CD) pipeline, incorporating approval workflows and restricting access rights to tasks that are essential.
The rationale for the adoption of this principle is comprehensively centered on operational efficiency, auditability, and compliance with regulatory standards. Additionally, this strategic approach aligns with industry best practices, ensuring the integrity of our systems, and fostering a secure digital environment for all stakeholders involved.
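A minimal sketch of the pipeline side of this principle: a deploy policy scoped to exactly the actions and resources the pipeline needs (the ARNs and names are illustrative):

```hcl
# Sketch: a narrowly scoped policy for a deploy pipeline.
data "aws_iam_policy_document" "deploy" {
  statement {
    effect    = "Allow"
    actions   = ["s3:PutObject", "s3:GetObject"]
    resources = ["arn:aws:s3:::payments-artefacts/*"]
  }
}

resource "aws_iam_policy" "deploy" {
  name   = "payments-deploy-minimal"
  policy = data.aws_iam_policy_document.deploy.json
}
```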
The Outcome
An AWS Platform in Action:
After all those requirements and standards, this is what we end up with. An AWS Cloud platform, at scale and in action!
Let me explain what’s going on here:
We’re going to break this down into 5 columns and go through each one, step by step.
The Engagement Column
This column is used to illustrate the entry point to the platform for a workload request.
The intention here is that your cloud platform team and architects are involved early enough in the process to ensure that workload/consumer teams submit their request with all the required information and have considered the following attributes as part of their planning, way upfront; no asking for AWS accounts at the last minute.
Engagement also ensures that a relationship is established between the provider and consumer as early as possible; this increases the chances that the consumer will align their workload to platform standards much earlier in the development cycle.
During the engagement process, the platform team and architects have a chance to set the stage by setting out the standards and guardrails required to build successfully on the platform: ensuring that consumers remain good tenants by not violating any policies, that tiering, DR and business criticality are considered, that they have the skills available to do everything as code, and that the required AWS account patterns have been chosen correctly.
The following data points should also be requested as part of a workload request:
👉 Requester Information
👉 Technical Owner Information
👉 Cost Codes
👉 Predicted Annual Spend
👉 Security Contact Information
👉 Workload Request Type (sandbox, standalone or SDLC)
👉 Unique key / identifier of the workload
👉 Details of Vendors, Partners, Solutions Architects and Enterprise Architects aligned on project
👉 Tier of Workload
👉 Region for Deployment
👉 VPC required or not
👉 Resources that are likely to be deployed
👉 Architectural schematics and artefacts
👉 Business case documentation
These data points ensure that the right values are captured for inserting mandatory tags during the AWS account and VPC vending process, as well as ensuring that the right AWS account patterns, regions and modules are available for consumption.
The ITSM Column
This is the IT Service Management layer. Initially, the request is completed through something like ServiceNow, with a native approval process specific to cost code holders and line managers, before the vending is passed off to GitHub for platform provider vending of the required accounts and VPC configurations.
The data points captured above are typically inserted into a ServiceNow form to drive the automated vending process once the request has been approved.
The AWS Column
Once approved through the ITSM process, AWS accounts are vended into the required AWS Organisational Unit (OU). The AWS account patterns on offer are:
Sandbox AWS Accounts
A personal sandbox area (not meant to be shared with a team, but it can be if needed) for investigation, validating ideas, early product testing etc. A sandbox account is non-persistent: resources within the account will be purged, and it should never be promoted to production.
Standalone AWS Accounts
A standalone account is for shared services that are local to your workload. Examples fall into two groups:
Localised standalone accounts:
Logging account for developer access
Localised integrated services for instrumentation and monitoring
Standby DR account
ECR account for building & maintaining container images
Core platform functionality standalone accounts:
Private marketplace
Cloud Intelligence Dashboards
Core Networking
Logging
Auditing
Shared Services
ECR and container storage
SDLC (Dev, Staging & Production Accounts)
Why have 3 accounts and why are they segmented as separate environments?
Dev:
This is a development area for the workload team to test the implementation of new workload features. Your workload should be provisioned here using the same tools as production, and it should be as close to production as possible.
Staging:
This is where your workload is tested with consumers, capacity tested, threat modelled and tested against the original budget; it should mirror production.
Production:
This is where your production workload is hosted live.
Key Points:
Terragrunt is integrated out of the box (see the sketch after this list)
A consumer mono repo with a promotional PR workflow allowing for promotion between dev, stage and prod is vended for all Infrastructure as Code (spanning all 3 accounts).
VPCs are deployed.
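A rough sketch of what one environment’s Terragrunt configuration in the vended mono repo might look like; the module source and inputs are hypothetical, and promotion is simply bumping `ref` through dev, stage and prod via PRs:

```hcl
# live/prod/payments/terragrunt.hcl (illustrative path)
terraform {
  source = "git::https://github.example.com/platform/terraform-aws-workload.git?ref=v1.4.0"
}

include "root" {
  path = find_in_parent_folders() # shared remote state and provider config
}

inputs = {
  environment = "prod"
  tier        = "tier-1"
}
```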
The Automated Control Plane Column
Once the selected AWS accounts, VPCs and consumer repos are vended, an attachment is made to the core network and the majority of the components highlighted in this column are in place by default. This then triggers the majority of compliance and platform governance at scale:
AWS Organizations applies the required Service Control Policies (SCPs)
AWS Control Tower applies the required controls
The AWS VPC (if selected at vend) connects itself to the AWS Transit Gateway for centralised egress and east-west traffic inspection (for shared services only)
Accounts are automatically enrolled into AWS private marketplace
Events are triggered to configure Single Sign-On (SSO) for time bound, role based Console/CLI access based on Privileged Access Management (PAM) model.
Centralised logging and auditing is configured
AWS Budgets are set with anomaly detection configured (see the sketch after this list)
Mandatory tagging taxonomy applied
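As a sketch of the budgets item, vending could drop in a budget seeded from the predicted-spend data point, plus a cost anomaly monitor (the variable names are illustrative):

```hcl
# Sketch: per-account budget and anomaly monitor configured at vend time.
resource "aws_budgets_budget" "monthly" {
  name         = "workload-monthly"
  budget_type  = "COST"
  limit_amount = var.predicted_monthly_spend # string, e.g. "10000", from the request
  limit_unit   = "USD"
  time_unit    = "MONTHLY"
}

resource "aws_ce_anomaly_monitor" "services" {
  name              = "workload-services"
  monitor_type      = "DIMENSIONAL"
  monitor_dimension = "SERVICE"
}
```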
The Common Developer Experience Column
This is essentially the ‘consumer experience’.
Once the AWS environment has been fully vended, a welcome pack goes out to the consumer and onboarding sessions are scheduled. From there, the consumer experience should include:
Provider and platform documentation.
How-to videos.
Recommended training materials to build muscle memory.
Recommended certification.
Regular platform comms.
A standardised CI/CD pipeline & repo structure from product to product.
300+ Infrastructure as Code modules to get started building with AWS services immediately.
Reference architectures.
Design patterns.
Provider support via a Cloud Platform team.
Architecture support, governance & acceptance criteria.
The biggest takeaway by far is a consistent developer and DevOps experience across all workloads that pass through the platform and into production.
If you nail this, moving from product to product is consistent for all engineers, suppliers know what is required of them during initial scoping stages, and the cost of supplier work, DevOps and general support will hit a level of predictability that you may never have experienced before.
Conclusion
There we have it, an AWS cloud platform at scale. If you enjoyed this post please share it on your socials and tag me in @leewynne
Next up we have building a workload on our cloud platform.