AWS DevOps Interview Preparation Guide
Comprehensive Reference for Interview Preparation and Daily Use
Table of Contents
1. Section 1: DevOps Tools & Practices
Infrastructure as Code (Terraform)
IAM Policies & Permissions
CI/CD Design & Build Management
Scripting & Automation
Containerization & Orchestration
Version Control & Repositories
2. Section 2: AWS Services & Concepts
Core AWS Services
DevOps on AWS
Security & Compliance
3. Quick Reference Checklists
SECTION 1: DevOps Tools & Practices
🧱 Infrastructure as Code (IaC) - Terraform
Core Concepts
Modules
Reusable infrastructure components that encapsulate resources
Root modules contain main configuration; child modules are called by root
Use input variables and outputs for flexibility and data flow
Organize by logical grouping (networking, compute, database)
State Files
terraform.tfstate tracks current infrastructure state
JSON format containing resource metadata and dependencies
Critical for determining what changes need to be applied
Never manually edit state files
Use terraform state commands for state manipulation
Remote Backends
Store state files remotely for team collaboration
Enables state locking to prevent concurrent modifications
Common backends: S3 + DynamoDB (AWS), Terraform Cloud, Azure Storage
Configuration example:
hcl
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
Resource Provisioning and Lifecycle
Resource Creation
Define resources in .tf files using HCL syntax
Run terraform init to initialize providers
Use terraform plan to preview changes
Execute terraform apply to create resources
Resource Updates
Modify resource configuration in .tf files
Terraform detects changes and determines update method
In-place updates preserve resource identity
Replacement creates new resource and destroys old one
Resource Destruction
terraform destroy removes all managed resources
Remove resource block from config and apply to destroy specific resources
Use prevent_destroy lifecycle argument to protect critical resources
Lifecycle Meta-Arguments
create_before_destroy: Create replacement before destroying the original
prevent_destroy: Block resource destruction
ignore_changes: Ignore specific attribute changes
replace_triggered_by: Force replacement when specific resources change
AWS Provider Configuration and Best Practices
Provider Configuration
hcl
provider "aws" {
  region = var.aws_region

  default_tags {
    tags = {
      Environment = var.environment
      ManagedBy   = "Terraform"
    }
  }
}
Best Practices
Use workspaces for environment separation (dev, staging, prod)
Implement state locking with DynamoDB to prevent conflicts
Version control all .tf files; exclude .tfstate from git
Use variables for configurable values; avoid hardcoding
Structure projects with modules for reusability
Run terraform fmt to maintain consistent formatting
Use terraform validate to check configuration syntax
Always review terraform plan output before applying
Tag all resources for cost tracking and management
Use data sources to reference existing infrastructure
Implement CI/CD for automated terraform deployments
🔐 IAM Policies & Permissions
Role-Based Access Control (RBAC)
IAM Users
Individual identities with long-term credentials (access keys)
Use for permanent human users or legacy applications
Enable MFA for enhanced security
Avoid using root user for daily operations
IAM Roles
Temporary security credentials assumed by trusted entities
No long-term credentials (access keys)
Can be assumed by users, applications, or AWS services
Support cross-account access and federated identities
Automatically rotate credentials
IAM Groups
Collection of users with shared permissions
Attach policies to groups rather than individual users
Simplifies permission management at scale
Users inherit all permissions from groups they belong to
Service Roles
Allow AWS services to perform actions on your behalf
EC2 instances assume roles via instance profiles
Lambda functions use execution roles
ECS tasks use task roles for granular permissions
Policy Creation and Evaluation
Policy Structure
json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::my-bucket/*",
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": "203.0.113.0/24"
        }
      }
    }
  ]
}
Policy Types
Identity-based policies: Attached to users, groups, or roles
Resource-based policies: Attached to resources (S3 buckets, SQS queues)
Permission boundaries: Set maximum permissions for entities
Service Control Policies (SCPs): Organization-wide restrictions
Session policies: Temporary policies for assumed role sessions
Policy Evaluation Logic
1. Explicit Deny: Always takes precedence (immediate deny)
2. Explicit Allow: Required from at least one policy
3. Implicit Deny: Default if no explicit allow exists
Evaluation Flow
Check for explicit deny in all policies (SCPs, permission boundaries, identity/resource policies)
If explicit deny found, request is denied
Check for explicit allow in applicable policies
If no explicit allow, request is denied (implicit deny)
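As a mental model, the evaluation order above can be sketched in a few lines of Python. This is a toy illustration of the deny-overrides logic, not how AWS actually implements evaluation:

```python
def evaluate_request(policies):
    """Toy model of IAM policy evaluation order.

    `policies` is a list of statements shaped like
    {"Effect": "Allow" | "Deny", "applies": bool},
    where "applies" means the statement matches the request.
    """
    applicable = [p for p in policies if p["applies"]]
    # 1. An explicit deny anywhere always wins
    if any(p["Effect"] == "Deny" for p in applicable):
        return "Deny (explicit)"
    # 2. Otherwise, at least one explicit allow is required
    if any(p["Effect"] == "Allow" for p in applicable):
        return "Allow"
    # 3. With no matching allow, the request is implicitly denied
    return "Deny (implicit)"
```

Note that an allow and a deny matching the same request resolves to deny, and an empty policy set resolves to implicit deny.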
Evaluation Context
AWS Organizations SCPs (if applicable)
Permission boundaries (if set)
Identity-based policies
Resource-based policies
Session policies (for assumed roles)
Permission Boundaries and Trust Relationships
Permission Boundaries
Advanced feature for delegating permissions management
Sets maximum permissions an IAM entity can have
Used to prevent privilege escalation
Common in multi-account or team-based scenarios
Does not grant permissions, only limits them
Trust Relationships (Trust Policies)
Defines who or what can assume a role
Attached to IAM roles, not users or groups
Specifies trusted principals (AWS accounts, services, federated users)
Example trust policy:
json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Service": "ec2.amazonaws.com"
    },
    "Action": "sts:AssumeRole"
  }]
}
Cross-Account Access
Trust policy in target account allows source account
Source account user needs permission to assume role
External ID for enhanced security with third parties
Use roles instead of sharing credentials
Best Practices
Grant least privilege - only required permissions
Use managed policies for common use cases
Create custom policies for specific requirements
Regularly review and audit permissions
Enable MFA for sensitive operations
Use conditions to add additional security controls
Test policies with IAM policy simulator
🔁 CI/CD Design & Build Management
Pipeline Tools Overview
Jenkins
Self-hosted, open-source automation server
Extensive plugin ecosystem (2000+ plugins)
Groovy-based pipeline definitions (Jenkinsfile)
Declarative and Scripted pipeline syntax
Distributed builds with controller-agent (formerly master-agent) architecture
Blue Ocean UI for modern visualization
GitHub Actions
Cloud-native CI/CD integrated with GitHub
YAML-based workflow definitions
Event-driven (push, PR, schedule, manual)
Matrix builds for parallel testing
Marketplace with pre-built actions
Free tier for public repositories
Concourse CI
Container-based pipeline execution
Pipeline as code with YAML configuration
Reproducible builds (every task in fresh container)
Resource-oriented architecture
Strong isolation between pipeline steps
Visual pipeline representation
Pipeline Creation, Triggers, and Stages
Common Pipeline Stages
1. Source Stage
Checkout code from version control (Git, CodeCommit)
Trigger on commit, PR, or schedule
Clone repository with specific branch/tag
2. Build Stage
Compile source code
Run unit tests
Static code analysis (SonarQube, linting)
Build artifacts (JAR, WAR, Docker images)
3. Test Stage
Integration tests
Security scanning (SAST, DAST, dependency checks)
Performance testing
Quality gates for code coverage and complexity
4. Deploy Stage
Deploy to target environment (dev → staging → prod)
Blue-green or canary deployments
Database migrations
Configuration management
5. Verify Stage
Smoke tests post-deployment
Health checks and monitoring
Rollback on failure
Pipeline Triggers
SCM polling (check for changes periodically)
Webhooks (immediate notification on push)
Scheduled (cron-based)
Manual approval gates
Upstream/downstream pipeline dependencies
Integration with Testing Tools
Unit testing frameworks (JUnit, PyTest, Jest)
Integration testing (Selenium, Postman/Newman)
Security tools (OWASP ZAP, Snyk, Trivy)
Code quality (SonarQube, CodeClimate)
Apache Maven
Build Lifecycle Phases
1. validate: Validate project structure and configuration
2. compile: Compile source code
3. test: Run unit tests
4. package: Package compiled code (JAR, WAR)
5. verify: Run integration tests
6. install: Install package to local repository
7. deploy: Deploy package to remote repository
Running Maven Commands
bash
# Clean and build
mvn clean install
# Skip tests during build
mvn clean install -DskipTests
# Run specific phase
mvn package
# Run specific goal
mvn dependency:tree
Dependency Management
Central Repository (Maven Central) for public artifacts
POM (pom.xml) defines project dependencies
Transitive dependencies automatically resolved
Dependency scopes: compile, provided, runtime, test, system
Dependency version management with properties
Exclude transitive dependencies when conflicts arise
Plugin Usage
Plugins extend Maven functionality
Common plugins:
maven-compiler-plugin: Configure Java version
maven-surefire-plugin: Run unit tests
maven-failsafe-plugin: Run integration tests
maven-assembly-plugin: Create distribution packages
maven-shade-plugin: Create uber JAR
Configure plugins in <build><plugins> section
Bind plugin goals to lifecycle phases
POM Structure
xml
<project>
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>my-app</artifactId>
  <version>1.0.0</version>
  <packaging>jar</packaging>

  <dependencies>
    <dependency>
      <groupId>org.springframework.boot</groupId>
      <artifactId>spring-boot-starter-web</artifactId>
      <version>2.7.0</version>
    </dependency>
  </dependencies>
</project>
Best Practices
Use parent POM for multi-module projects
Define versions in properties for consistency
Use dependency management section for version control
Leverage profiles for environment-specific builds
Keep dependencies up to date
Use Maven wrapper (mvnw) for consistent builds
🧪 Scripting & Automation
Writing Reusable Scripts
Script Design Principles
Single Responsibility: Each script should do one thing well
Parameterization: Accept inputs via arguments or environment variables
Idempotency: Safe to run multiple times without adverse effects
Error Handling: Gracefully handle failures with meaningful messages
Logging: Provide visibility into script execution
Documentation: Include usage instructions and examples
Modular Code Structure
Break complex scripts into functions
Separate configuration from logic
Use libraries and modules to avoid duplication
Create reusable utility functions
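A minimal sketch of these principles together — parameterized via argparse, idempotent, and logged. The `ensure_directory` task here is a stand-in for real work, chosen because it illustrates idempotency cleanly:

```python
import argparse
import logging
from pathlib import Path

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

def ensure_directory(path: str) -> bool:
    """Idempotent: creating a directory that already exists is a no-op.

    Returns True if anything changed, False otherwise.
    """
    target = Path(path)
    if target.is_dir():
        logger.info("Directory %s already exists, nothing to do", path)
        return False
    target.mkdir(parents=True)
    logger.info("Created directory %s", path)
    return True

def main() -> None:
    # Inputs arrive as arguments, never hardcoded
    parser = argparse.ArgumentParser(description="Ensure a directory exists")
    parser.add_argument("path", help="Directory to create if missing")
    ensure_directory(parser.parse_args().path)

if __name__ == "__main__":
    main()
```

Running it twice with the same argument is safe: the second run logs that there is nothing to do and changes nothing.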
AWS SDK Usage (boto3 for Python)
Basic boto3 Usage
python
import boto3
from botocore.exceptions import ClientError
# Create service client
s3 = boto3.client('s3', region_name='us-east-1')
# Using resource interface (higher-level)
s3_resource = boto3.resource('s3')
bucket = s3_resource.Bucket('my-bucket')
Common Operations
S3 Operations
python
# Upload file
s3.upload_file('local.txt', 'bucket-name', 'key.txt')
# Download file
s3.download_file('bucket-name', 'key.txt', 'local.txt')
# List objects
response = s3.list_objects_v2(Bucket='bucket-name', Prefix='folder/')
EC2 Operations
python
ec2 = boto3.client('ec2')
# Describe instances
response = ec2.describe_instances(
    Filters=[{'Name': 'tag:Environment', 'Values': ['production']}]
)
# Start instances
ec2.start_instances(InstanceIds=['i-1234567890abcdef0'])
Session Management
python
# Using specific credentials
session = boto3.Session(
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
    region_name='us-east-1'
)
s3 = session.client('s3')

# Assume role for cross-account access
sts = boto3.client('sts')
response = sts.assume_role(
    RoleArn='arn:aws:iam::123456789012:role/MyRole',
    RoleSessionName='session-name'
)
Python, Bash, PowerShell Scripts
Python Best Practices
python
import logging
import sys
from typing import List, Dict

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

def process_items(items: List[str]) -> Dict:
    """Process list of items with error handling."""
    results = {'success': [], 'failed': []}
    for item in items:
        try:
            # Process item
            logger.info(f"Processing {item}")
            results['success'].append(item)
        except Exception as e:
            logger.error(f"Failed to process {item}: {str(e)}")
            results['failed'].append(item)
    return results

if __name__ == '__main__':
    items = sys.argv[1:]
    results = process_items(items)
    logger.info(f"Processed: {len(results['success'])} successful")
Bash Scripting Best Practices
bash
#!/bin/bash
set -euo pipefail  # Exit on error, undefined variables, pipe failures

# Configuration
readonly SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
readonly LOG_FILE="${SCRIPT_DIR}/script.log"

# Logging function
log() {
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] $*" | tee -a "${LOG_FILE}"
}

# Error handling
error_exit() {
    log "ERROR: $1"
    exit 1
}

# Main function
main() {
    log "Starting script execution"
    # Check prerequisites
    command -v aws >/dev/null 2>&1 || error_exit "AWS CLI not found"
    # Script logic
    aws s3 ls || error_exit "Failed to list S3 buckets"
    log "Script completed successfully"
}

# Cleanup on exit
cleanup() {
    log "Cleaning up temporary files"
}
trap cleanup EXIT

main "$@"
PowerShell for Windows Automation
powershell
[CmdletBinding()]
param(
    [Parameter(Mandatory=$true)]
    [string]$Environment
)

# Error handling
$ErrorActionPreference = "Stop"

# Logging
function Write-Log {
    param([string]$Message)
    $timestamp = Get-Date -Format "yyyy-MM-dd HH:mm:ss"
    Write-Host "[$timestamp] $Message"
}

try {
    Write-Log "Starting deployment to $Environment"
    # AWS operations
    $instances = aws ec2 describe-instances --filters "Name=tag:Environment,Values=$Environment" | ConvertFrom-Json
    Write-Log "Found $($instances.Reservations.Count) instances"
} catch {
    Write-Log "ERROR: $_"
    exit 1
}
Error Handling and Logging
Error Handling Strategies
Use try-catch blocks for exception handling
Validate inputs before processing
Implement retry logic with exponential backoff
Provide specific error messages for debugging
Log errors with context (timestamp, operation, inputs)
Logging Best Practices
Use structured logging (JSON format) for machine parsing
Include log levels: DEBUG, INFO, WARNING, ERROR, CRITICAL
Log to both console and file
Rotate log files to manage disk space
Include correlation IDs for distributed systems
Sanitize sensitive data before logging
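A minimal sketch of structured JSON logging with a correlation ID, using only the standard library. The `correlation_id` field name is an arbitrary choice, not a logging-module convention:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log record for machine parsing."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Attach a correlation ID when the caller supplies one via `extra=`
        if hasattr(record, "correlation_id"):
            entry["correlation_id"] = record.correlation_id
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Each line of output is a self-contained JSON document
logger.info("deployment started", extra={"correlation_id": "req-123"})
```

The same formatter could also be attached to a rotating file handler to cover the "log to both console and file" practice above.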
Retry Logic Example
python
import time
import logging

from botocore.exceptions import ClientError

logger = logging.getLogger(__name__)

def retry_with_backoff(func, max_retries=3):
    for attempt in range(max_retries):
        try:
            return func()
        except ClientError as e:
            if e.response['Error']['Code'] == 'ThrottlingException':
                wait_time = 2 ** attempt  # 1s, 2s, 4s, ...
                logger.warning(f"Throttled, retrying in {wait_time}s")
                time.sleep(wait_time)
            else:
                raise
    raise Exception("Max retries exceeded")
📦 Containerization & Orchestration
Docker
Image Creation and Dockerfiles
Dockerfile Structure
dockerfile
# Use specific version tags, not 'latest'
FROM node:16-alpine
# Set working directory
WORKDIR /app
# Copy dependency files first (layer caching)
COPY package*.json ./
# Install dependencies
RUN npm ci --only=production && \
    npm cache clean --force
# Copy application code
COPY . .
# Non-root user for security
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001
USER nodejs
# Expose port
EXPOSE 3000
# Health check
HEALTHCHECK --interval=30s --timeout=3s \
    CMD node healthcheck.js || exit 1
# Start command
CMD ["node", "server.js"]
Multi-Stage Builds
dockerfile
# Build stage
FROM node:16 AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
# Production stage
FROM node:16-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY package*.json ./
RUN npm ci --only=production
CMD ["node", "dist/server.js"]
Docker Best Practices
Use official base images from trusted sources
Pin specific image versions (avoid latest tag)
Minimize layer count and image size
Leverage build cache by ordering Dockerfile instructions
Use .dockerignore to exclude unnecessary files
Run containers as non-root users
Scan images for vulnerabilities (Trivy, Snyk)
Use multi-stage builds to reduce final image size
Volumes
Persist data outside container lifecycle
Named volumes: Managed by Docker
Bind mounts: Mount host directories
tmpfs mounts: In-memory storage
bash
docker run -v myvolume:/app/data myimage
docker run -v /host/path:/container/path myimage
Networking
Bridge: Default network, containers can communicate
Host: Container uses host network stack
Overlay: Multi-host networking for Swarm/Kubernetes
None: Disable networking
bash
docker network create mynetwork
docker run --network mynetwork myimage
Kubernetes
Pods
Smallest deployable unit in Kubernetes
Contains one or more containers
Shares network namespace and storage volumes
Ephemeral by design
yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod
spec:
  containers:
  - name: nginx
    image: nginx:1.21
    ports:
    - containerPort: 80
Deployments
Manage ReplicaSets and Pods
Declarative updates and rollbacks
Rolling updates with zero downtime
Self-healing (restart failed pods)
yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.21
        resources:
          requests:
            memory: "64Mi"
            cpu: "250m"
          limits:
            memory: "128Mi"
            cpu: "500m"
Services
Expose pods to network traffic
Load balancing across pod replicas
Service discovery via DNS
Service Types
ClusterIP: Internal access only (default)
NodePort: Expose on each node's IP at static port
LoadBalancer: External load balancer (cloud provider)
ExternalName: Map to external DNS name
yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx-service
spec:
  type: LoadBalancer
  selector:
    app: nginx
  ports:
  - port: 80
    targetPort: 80
ConfigMaps and Secrets
ConfigMaps: Non-sensitive configuration
yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  database_url: "postgres://db:5432"
  log_level: "info"
Secrets: Sensitive data (base64 encoded)
yaml
apiVersion: v1
kind: Secret
metadata:
  name: db-secret
type: Opaque
data:
  username: YWRtaW4=
  password: cGFzc3dvcmQ=
Using ConfigMaps and Secrets
yaml
containers:
- name: app
  envFrom:
  - configMapRef:
      name: app-config
  - secretRef:
      name: db-secret
Helm Charts and Cluster Management
Helm Charts
Package manager for Kubernetes
Templated YAML manifests
Versioning and rollback support
Values file for customization
Chart Structure
mychart/
├── Chart.yaml # Chart metadata
├── values.yaml # Default configuration values
├── charts/ # Dependency charts
└── templates/ # Kubernetes manifest templates
├── deployment.yaml
├── service.yaml
└── ingress.yaml
Helm Commands
bash
# Install chart
helm install myapp ./mychart -f custom-values.yaml
# Upgrade release
helm upgrade myapp ./mychart
# Rollback to previous version
helm rollback myapp 1
# List releases
helm list
# Uninstall release
helm uninstall myapp
Cluster Management
Namespaces
Logical isolation within cluster
Resource quotas per namespace
RBAC policies scoped to namespace
bash
kubectl create namespace production
kubectl get pods -n production
Resource Quotas and Limits
yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: production
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
RBAC (Role-Based Access Control)
ServiceAccounts for pods
Roles and ClusterRoles for permissions
RoleBindings and ClusterRoleBindings to assign
yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
subjects:
- kind: ServiceAccount
  name: app-sa
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
Health Checks
Liveness Probe: Restart container if unhealthy
Readiness Probe: Remove from service if not ready
Startup Probe: Delay other probes until app starts
yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
Cluster Best Practices
Use namespaces for environment separation
Set resource requests and limits on all containers
Implement network policies for security
Use Horizontal Pod Autoscaler (HPA) for scaling
Regular cluster and component updates
Monitor cluster health with Prometheus/Grafana
Backup etcd data regularly
Use Pod Disruption Budgets for availability
📁 Version Control & Repositories
Git and GitHub
Branching Strategies
Git Flow
main: Production-ready code
develop: Integration branch for features
feature/*: New features, branch from develop
release/*: Release preparation, branch from develop
hotfix/*: Emergency fixes, branch from main
Workflow:
1. Create feature branch from develop
2. Complete feature, merge back to develop
3. Create release branch when ready for production
4. Test and fix in release branch
5. Merge release to main and develop
6. Tag release in main
7. Hotfixes branch from main, merge to main and develop
Trunk-Based Development
Single main branch (trunk)
Short-lived feature branches (1-2 days max)
Frequent integration to main
Feature flags for incomplete features
Continuous Integration/Deployment
Requires strong automated testing
GitHub Flow
Simplified workflow for continuous deployment
main branch always deployable
Create feature branch from main
Open Pull Request early for discussion
Deploy from branch for testing
Merge to main and deploy immediately
Pull Requests, Hooks, and Collaboration
Pull Request Best Practices
Small, focused changes (easier to review)
Clear title and description
Reference related issues
Self-review before requesting others
Respond to feedback promptly
Keep PRs up to date with base branch
Code Review Guidelines
Review within 24 hours
Focus on logic, not style (use linters)
Ask questions, don't demand changes
Approve when concerns are addressed
Use review comments for discussion
Request changes for blocking issues
Branch Protection Rules
Require pull request reviews (1-2 reviewers)
Require status checks to pass (CI tests)
Require branches to be up to date
Restrict who can push to branch
Prevent force pushes
Require linear history (no merge commits)
Git Hooks
Client-side hooks run on local machine
Server-side hooks run on Git server
Common Hooks
pre-commit: Run linters, formatters before commit
commit-msg: Validate commit message format
pre-push: Run tests before pushing
post-merge: Update dependencies after pull
Example pre-commit hook:
bash
#!/bin/bash
# Run linting before commit
npm run lint
if [ $? -ne 0 ]; then
    echo "Linting failed. Fix errors before committing."
    exit 1
fi
Commit Message Conventions
Use conventional commits format
Format: <type>(<scope>): <subject>
Types: feat, fix, docs, style, refactor, test, chore
Keep subject under 50 characters
Use imperative mood ("Add feature" not "Added feature")
Examples:
feat(auth): add OAuth2 login support
fix(api): handle null response from database
docs(readme): update installation instructions
refactor(utils): simplify date formatting function
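The convention above is straightforward to enforce in a commit-msg hook. A small Python sketch — the type list and 50-character subject limit follow the rules above; the scope pattern is an assumption:

```python
import re

# Conventional commit header: <type>(<scope>): <subject>
# Scope is optional; subject is capped at 50 characters.
COMMIT_RE = re.compile(
    r'^(feat|fix|docs|style|refactor|test|chore)(\([a-z0-9-]+\))?: .{1,50}$'
)

def valid_commit_message(message: str) -> bool:
    """Validate only the first line (the header) of a commit message."""
    return bool(COMMIT_RE.match(message.splitlines()[0]))
```

A commit-msg hook would read the message file passed as its first argument, call this function, and exit non-zero on failure.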
Collaboration Workflows
Fork and Pull Request
Fork repository to personal account
Clone forked repository
Create feature branch
Make changes and push to fork
Open pull request to original repository
Maintainers review and merge
Shared Repository
All contributors have write access
Create branches in same repository
Open pull requests for review
Merge after approval
Git Best Practices
Commit frequently with logical changes
Write meaningful commit messages
Keep commits focused (one logical change per commit)
Pull/fetch regularly to stay updated
Use .gitignore to exclude generated files
Never commit secrets or credentials
Use git rebase for clean history
Tag releases with semantic versioning
Create .gitattributes for consistent line endings
SECTION 2: AWS Services & Concepts
☁️ Core AWS Services
EC2 (Elastic Compute Cloud)
Instance Types
General Purpose (T3, M5): Balanced CPU/memory, web servers, small databases
Compute Optimized (C5, C6): High CPU, batch processing, gaming servers
Memory Optimized (R5, X1): Large datasets, in-memory databases, caching
Storage Optimized (I3, D2): High IOPS, data warehousing, log processing
Accelerated Computing (P3, G4): GPU instances for ML, graphics rendering
Instance Sizing
nano, micro, small, medium, large, xlarge, 2xlarge, etc.
T3 instances: Burstable performance (credits system)
Use AWS Compute Optimizer for right-sizing recommendations
Auto Scaling
Auto Scaling Groups (ASG)
Maintain desired capacity of instances
Scale based on metrics or schedule
Distribute instances across Availability Zones
Health checks and automatic replacement
Integration with ELB for load balancing
Launch Templates/Configurations
Define instance configuration (AMI, instance type, key pair)
Launch templates support versioning
Include user data for initialization scripts
Specify IAM instance profile for permissions
Scaling Policies
Target Tracking: Maintain metric at target value (e.g., 70% CPU)
Step Scaling: Scale in steps based on threshold breaches
Simple Scaling: Single scaling action per alarm
Scheduled Scaling: Scale at specific times (predictable load)
Predictive Scaling: ML-based forecasting of capacity needs
Security Groups
Virtual firewall for instances
Stateful (return traffic automatically allowed)
Rules specify protocol, port, and source/destination
Default deny all inbound, allow all outbound
Can reference other security groups as source
Best practice: Separate security groups by tier (web, app, database)
Example security group rules:
Allow HTTP (port 80) from 0.0.0.0/0
Allow HTTPS (port 443) from 0.0.0.0/0
Allow SSH (port 22) from corporate IP range
Allow PostgreSQL (port 5432) from application security group
S3 (Simple Storage Service)
Storage Classes
S3 Standard: Frequently accessed data, low latency, high durability
S3 Intelligent-Tiering: Automatic cost optimization, moves between tiers
S3 Standard-IA: Infrequent access, lower cost, retrieval fee
S3 One Zone-IA: Single AZ, 20% cheaper than Standard-IA
S3 Glacier Instant Retrieval: Archive with millisecond retrieval
S3 Glacier Flexible Retrieval: Archive, retrieval minutes to hours
S3 Glacier Deep Archive: Lowest cost, 12-hour retrieval
Bucket Policies
JSON-based resource policies attached to buckets
Control access for principals (users, accounts, services)
Can allow or deny actions (GetObject, PutObject, DeleteObject)
Use conditions for fine-grained control (IP address, MFA, encryption)
Example bucket policy:
json
{
"Version": "2012-10-17",
"Statement": [{
"Sid": "PublicReadGetObject",
"Effect": "Allow",
"Principal": "*",
"Action": "s3:GetObject",
"Resource": "arn:aws:s3:::my-bucket/*",
"Condition": {
"IpAddress": {"aws:SourceIp": "203.0.113.0/24"}
}
}]
}
Lifecycle Rules
Automate transition between storage classes
Expire (delete) objects after specified time
Delete old versions in versioned buckets
Abort incomplete multipart uploads
Reduces storage costs automatically
Example lifecycle rule:
Transition to Standard-IA after 30 days
Transition to Glacier after 90 days
Expire after 365 days
Versioning
Keep multiple variants of objects
Protect against accidental deletion
Enable MFA Delete for additional protection
Previous versions retained until explicitly deleted
Suspended versioning stops creating new versions
Once enabled, cannot fully disable (only suspend)
S3 Best Practices
Enable versioning for critical data
Use lifecycle policies to optimize costs
Enable server access logging for audit trails
Implement bucket policies with least privilege
Use S3 Transfer Acceleration for faster uploads
Enable default encryption at bucket level
Use S3 Object Lock for compliance (WORM)
Monitor with CloudWatch metrics and S3 Storage Lens
VPC (Virtual Private Cloud)
VPC Components
Subnets
Subdivide VPC CIDR block
Public Subnet: Has route to Internet Gateway
Private Subnet: No direct internet access
Each subnet in one Availability Zone
Plan CIDR blocks carefully (cannot change after creation)
Route Tables
Control traffic routing for subnets
Each subnet associated with one route table
Routes define destination and target
Most specific route takes precedence
Local route (VPC CIDR) automatically created
Example routes:
10.0.0.0/16 → local (VPC traffic)
0.0.0.0/0 → igw-xxx (internet traffic)
192.168.0.0/16 → vgw-xxx (VPN traffic)
Internet Gateway (IGW)
Allows VPC resources to access internet
Horizontally scaled, redundant, highly available
One IGW per VPC
Requires route in route table (0.0.0.0/0 → IGW)
Requires public IP or Elastic IP on instance
NAT Gateway
Enable private subnet instances to access internet (outbound only)
Managed service, scales automatically
Deploy in public subnet
Create route in private subnet route table (0.0.0.0/0 → NAT Gateway)
Charged per hour and per GB processed
For multi-AZ, create NAT Gateway in each AZ
NAT Instance (legacy approach)
EC2 instance configured as NAT
Cheaper than NAT Gateway but requires management
Single point of failure unless configured for HA
Must disable source/destination check
VPC Peering
Private connection between two VPCs
Works across accounts and regions
Non-transitive (A-B and B-C doesn't mean A-C)
CIDR blocks must not overlap
Update route tables in both VPCs
No bandwidth bottleneck or single point of failure
VPC Endpoints
Private connection to AWS services without IGW/NAT
Gateway Endpoints: S3 and DynamoDB (no cost)
Interface Endpoints: Other services via PrivateLink (charged)
Traffic doesn't leave AWS network
Improve security by avoiding public internet
Network ACLs (NACLs)
Stateless firewall at subnet level
Rules evaluated in number order
Support allow and deny rules
Default NACL allows all traffic
Custom NACLs deny all by default
Separate inbound and outbound rules
VPC Flow Logs
Capture IP traffic information
Can be created at VPC, subnet, or ENI level
Publish to CloudWatch Logs or S3
Useful for troubleshooting connectivity issues
Analyze security group and NACL effectiveness
VPC Design Best Practices
Use multiple Availability Zones for high availability
Separate public and private subnets
Plan CIDR blocks for future growth
Use security groups as primary access control
Implement defense in depth (SG + NACL)
Use VPC endpoints to reduce data transfer costs
Enable VPC Flow Logs for security monitoring
Tag all VPC resources for cost allocation
RDS (Relational Database Service)
Supported Database Engines
PostgreSQL
MySQL
MariaDB
Oracle
SQL Server
Amazon Aurora (MySQL and PostgreSQL compatible)
Deployment Options
Single-AZ
One database instance in single Availability Zone
Lower cost, suitable for dev/test
Downtime during maintenance
Multi-AZ
Primary in one AZ, standby replica in another
Synchronous replication for high availability
Automatic failover (1-2 minutes)
No performance impact from replication
Use for production databases
Same endpoint after failover
Read Replicas
Asynchronous replication from primary
Read-only copies for scaling read traffic
Can be in same region or cross-region
Up to 15 read replicas per primary for MySQL, MariaDB, and PostgreSQL (fewer for other engines)
Can be promoted to standalone database
PostgreSQL and MySQL support cascading replication
Backups
Automated Backups
Daily full backup during backup window
Transaction logs backed up every 5 minutes
Point-in-time recovery to any second
Retention period: 0-35 days (default 7)
Stored in S3, no additional charge
Deleted when DB instance deleted
Manual Snapshots
User-initiated database snapshots
Retained until explicitly deleted
Can copy across regions
Can share with other AWS accounts
Use for major changes or pre-production snapshots
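Cross-region snapshot copies map onto the RDS CopyDBSnapshot API; a minimal boto3-style sketch, with the request built as a plain dict and all identifiers hypothetical:

```python
def copy_snapshot_params(source_arn, target_id, kms_key_id=None):
    """Build a CopyDBSnapshot request for a cross-region copy (pure, testable)."""
    params = {
        "SourceDBSnapshotIdentifier": source_arn,  # ARN required for cross-region copies
        "TargetDBSnapshotIdentifier": target_id,
    }
    if kms_key_id:
        # A KMS key in the destination region is required when the source is encrypted.
        params["KmsKeyId"] = kms_key_id
    return params

def copy_snapshot(rds_client, source_arn, target_id, kms_key_id=None):
    # rds_client would be boto3.client("rds", region_name="<destination-region>")
    return rds_client.copy_db_snapshot(
        **copy_snapshot_params(source_arn, target_id, kms_key_id))
```

Note the call is issued in the destination region; the source snapshot is referenced by ARN.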
Monitoring
CloudWatch Metrics
CPU Utilization
Database Connections
Free Storage Space
Read/Write IOPS
Read/Write Latency
Network Throughput
Enhanced Monitoring
Real-time OS metrics (1-60 second intervals)
Process and thread information
More granular than CloudWatch
Agent runs on DB instance
Additional cost per instance
Performance Insights
Visualize database load
Identify performance bottlenecks
Top SQL queries by load
Wait event analysis
Free for 7-day retention; longer retention is charged
RDS Best Practices
Enable Multi-AZ for production databases
Use read replicas to offload read traffic
Enable automated backups with appropriate retention
Create manual snapshots before major changes
Monitor CloudWatch metrics and set alarms
Use parameter groups for database configuration
Apply patches during maintenance windows
Enable encryption at rest for sensitive data
Use IAM database authentication where supported
Implement connection pooling in applications
ECS/EKS (Container Services)
ECS (Elastic Container Service)
Launch Types
EC2: Run containers on managed EC2 instances
More control over infrastructure
Can use Reserved Instances for cost savings
Requires cluster management
Fargate: Serverless container execution
No infrastructure management
Pay per task
Simpler to operate
Core Concepts
Task Definitions
Blueprint for application
Specifies Docker images, CPU, memory
Container port mappings
Environment variables and secrets
Volume definitions
IAM task role for permissions
Versioned (immutable once created)
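The task-definition fields above map onto a RegisterTaskDefinition request; a minimal Fargate-shaped sketch, where the family name, image, and port are hypothetical examples:

```python
def task_definition(family, image, cpu="256", memory="512", port=80, task_role_arn=None):
    """Build a minimal Fargate RegisterTaskDefinition request (pure, testable)."""
    td = {
        "family": family,                       # each register call creates a new revision
        "requiresCompatibilities": ["FARGATE"],
        "networkMode": "awsvpc",
        "cpu": cpu, "memory": memory,           # task-level limits, passed as strings
        "containerDefinitions": [{
            "name": family,
            "image": image,
            "portMappings": [{"containerPort": port, "protocol": "tcp"}],
            "essential": True,
        }],
    }
    if task_role_arn:
        td["taskRoleArn"] = task_role_arn       # permissions for the app, not the host
    return td

# ecs_client.register_task_definition(**task_definition("web", "<account>.dkr.ecr.us-east-1.amazonaws.com/web:1.4"))
```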
Services
Maintain desired count of tasks
Integrate with load balancers (ALB/NLB)
Auto scaling based on metrics
Rolling updates with deployment configurations
Service discovery via AWS Cloud Map
Clusters
Logical grouping of tasks and services
Can contain EC2 instances or Fargate tasks
Use namespaces for resource isolation
ECS Best Practices
Use Fargate for simplicity, EC2 for cost optimization
Define resource limits (CPU, memory) accurately
Use task IAM roles instead of container credentials
Implement health checks in task definitions
Use service auto scaling for variable load
Enable container insights for monitoring
Use secrets management (Secrets Manager, Parameter Store)
Implement blue/green deployments for zero downtime
EKS (Elastic Kubernetes Service)
Architecture
Managed Kubernetes control plane
Deploy worker nodes as EC2 or Fargate
Integrates with AWS services (IAM, VPC, ELB)
Multi-AZ control plane for high availability
AWS patches the control plane; Kubernetes version upgrades are customer-initiated
Node Groups
Managed Node Groups: AWS manages EC2 instances
Automated updates and patching
Auto Scaling Group integration
One-click updates
Self-Managed Nodes: You manage EC2 instances
More control and customization
Use when specific configurations needed
Fargate Profile: Serverless pod execution
No node management
Pay per pod
Networking
Uses AWS VPC CNI plugin
Pods get IP addresses from VPC
Native AWS networking integration
Supports security groups for pods
IAM Integration
IAM roles for service accounts (IRSA)
Fine-grained permissions for pods
No need for node-level IAM credentials
Service Discovery
AWS Cloud Map for DNS-based discovery
CoreDNS for in-cluster service discovery
External DNS for external services
EKS Best Practices
Use managed node groups for easier operations
Implement cluster autoscaler for scaling nodes
Use IAM roles for service accounts (IRSA)
Spread worker nodes across multiple AZs (the control plane is multi-AZ automatically)
Enforce Pod Security Standards (PodSecurityPolicy was removed in Kubernetes 1.25)
Use namespaces for workload isolation
Monitor with Container Insights and Prometheus
Regularly update cluster and node versions
Use AWS Load Balancer Controller for ingress
Implement network policies for pod-to-pod security
🔧 DevOps on AWS
CodePipeline
End-to-End CI/CD Orchestration
Automate release process from source to production
Visual pipeline designer in AWS Console
Integrates with AWS and third-party tools
Manual approval actions for gates
Parallel and sequential action execution
Pipeline Structure
Stages
Sequential phases (Source, Build, Test, Deploy)
Can have multiple actions per stage
Transitions between stages can be disabled
Failed stages stop pipeline execution
Actions
Individual tasks within stages
Run sequentially or in parallel
Input/output artifacts passed between actions
Action providers: CodeCommit, GitHub, Jenkins, CodeBuild, CodeDeploy
Source Stage Providers
AWS CodeCommit
GitHub/GitHub Enterprise
Amazon S3
Bitbucket Cloud
AWS ECR (Docker images)
Build Stage Providers
AWS CodeBuild
Jenkins
CloudBees
TeamCity
Deploy Stage Providers
AWS CodeDeploy
AWS Elastic Beanstalk
AWS ECS/EKS
AWS CloudFormation
AWS S3 (static websites)
Third-party deployment tools
Pipeline Execution
Triggered automatically on source changes
Manual execution via console or CLI
Scheduled execution via EventBridge (formerly CloudWatch Events)
Webhook triggers from external systems
Artifacts
Stored in S3 bucket
Passed between stages
Versioned automatically
Encrypted at rest
Best Practices
Use separate pipelines for environments
Implement automated testing at multiple stages
Use manual approval for production deployments
Enable artifact encryption
Monitor pipeline execution with CloudWatch
Use parameter overrides for environment-specific configs
Implement rollback mechanisms
Tag pipeline resources for cost tracking
CodeBuild
Fully Managed Build Service
Scales automatically (no build queue)
Pay per build minute
Pre-configured build environments
Custom Docker images supported
No build servers to manage
Buildspec File
YAML file defining build instructions
Named buildspec.yml in the source root by default
An alternate buildspec name or path can be configured per project
Defines build phases and commands
Build Phases
yaml
version: 0.2
phases:
  install:
    runtime-versions:
      nodejs: 16
    commands:
      - echo Installing dependencies
  pre_build:
    commands:
      - echo Running tests
      - npm test
  build:
    commands:
      - echo Building application
      - npm run build
  post_build:
    commands:
      - echo Build completed
artifacts:
  files:
    - '**/*'
  base-directory: dist
cache:
  paths:
    - 'node_modules/**/*'
Build Environment
Compute types: small (3 GB), medium (7 GB), large (15 GB)
Operating systems: Ubuntu, Amazon Linux 2, Windows Server
Pre-installed tools: Docker, Git, AWS CLI, language runtimes
Custom Docker images for specific requirements
Environment Variables
Plaintext: Defined in build project
Parameter Store: Retrieve from Systems Manager
Secrets Manager: Retrieve sensitive values
Available to build commands
Caching
Cache dependencies to S3
Speeds up subsequent builds
Specify cache paths in buildspec
Local caching for Docker layers
Build Artifacts
Output files from build
Stored in S3
Can be encrypted
Used by subsequent pipeline stages
Integration with Testing Tools
Run unit tests in pre_build phase
Integration tests in post_build
Security scanning (SAST, dependency checking)
Code coverage reports
Publish test results to CodeBuild
Best Practices
Cache dependencies to speed up builds
Use appropriate compute size for build
Implement parallel builds for faster execution
Use VPC configuration for private resource access
Enable CloudWatch Logs for debugging
Use secrets management for credentials
Implement build badges for status visibility
Set timeout to prevent runaway builds
CodeDeploy
Automated Deployment Service
Deploy to EC2, Lambda, or ECS
Multiple deployment strategies
Automatic rollback on failure
Integration with load balancers
Deployment Strategies
In-Place Deployment
Update existing instances
Application stopped, new version installed
Brief downtime during deployment
Suitable for dev/test environments
Not available for Lambda or ECS deployments (EC2/on-premises only)
Blue/Green Deployment
Deploy to new set of instances (Green)
Test green environment
Shift traffic from old (Blue) to new (Green)
Keep blue environment for rollback
Zero downtime
Double resources during deployment
Deployment Configurations
EC2/On-Premises
CodeDeployDefault.AllAtOnce: Deploy to all instances simultaneously
CodeDeployDefault.HalfAtATime: 50% at a time
CodeDeployDefault.OneAtATime: One instance at a time
Custom: Define percentage or count
Lambda
Canary: Small percentage, then all at once
Linear: Traffic shifted in equal increments
All-at-once: Immediate shift
ECS
Linear: Shift traffic in equal increments
Canary: Shift percentage, then remaining
All-at-once: Immediate shift
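The linear and canary configurations above differ only in how the cumulative traffic percentage steps over time; a sketch of both schedules, with the 10% step chosen to mirror configurations like Linear10PercentEvery1Minute (interval handling omitted):

```python
def linear_shifts(step_percent, total=100):
    """Cumulative traffic percentages for a linear shift (equal increments)."""
    shifts, current = [], 0
    while current < total:
        current = min(current + step_percent, total)
        shifts.append(current)
    return shifts

def canary_shifts(canary_percent, total=100):
    """Canary: one small step, a bake period, then everything at once."""
    return [canary_percent, total]

print(linear_shifts(10))   # [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(canary_shifts(10))   # [10, 100]
```

The trade-off is visible in the lists: linear exposes failures gradually at every step, while canary gets one early signal and then commits.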
AppSpec File
Defines deployment instructions
Lifecycle event hooks
Different format for EC2, Lambda, ECS
EC2 AppSpec Example
yaml
version: 0.0
os: linux
files:
  - source: /
    destination: /var/www/html
hooks:
  BeforeInstall:
    - location: scripts/install_dependencies.sh
      timeout: 300
  ApplicationStart:
    - location: scripts/start_server.sh
      timeout: 300
  ApplicationStop:
    - location: scripts/stop_server.sh
      timeout: 300
Lifecycle Event Hooks
ApplicationStop: Stop application
DownloadBundle: Download revision
BeforeInstall: Pre-installation tasks
Install: Copy files (automatic)
AfterInstall: Post-installation tasks
ApplicationStart: Start application
ValidateService: Verify deployment
Rollback
Automatic rollback on:
Deployment failure
CloudWatch alarm threshold breach
Manual rollback via console or CLI
Redeploys last known good revision
Integration with Load Balancers
Deregister instances during deployment
Health checks validate successful deployment
Re-register instances after deployment
Works with ALB, NLB, Classic Load Balancer
Best Practices
Use blue/green for production deployments
Implement comprehensive health checks
Set appropriate timeout values
Use lifecycle hooks for validation
Enable automatic rollback on failure
Test AppSpec scripts thoroughly
Monitor deployments with CloudWatch
Use deployment groups for logical grouping
Implement gradual traffic shift for Lambda/ECS
Tag deployment groups for organization
Additional AWS DevOps Services
AWS CloudFormation
Infrastructure as Code using JSON/YAML templates
Declarative resource provisioning
Stack management (create, update, delete)
Change sets preview modifications
Drift detection identifies manual changes
Cross-stack references for dependencies
Nested stacks for modular templates
StackSets for multi-account/region deployment
AWS Systems Manager
Parameter Store
Store configuration data and secrets
Hierarchical key structure (/prod/db/password)
String, StringList, SecureString types
Free for standard parameters
Versioning and change history
IAM-based access control
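The hierarchical key structure lets an application load all of its configuration with one GetParametersByPath call; a boto3-style sketch where the client is passed in and the /env/service convention is an assumption, not an AWS requirement:

```python
def config_path(env, service):
    """Hierarchical parameter prefix, e.g. /prod/db."""
    return f"/{env}/{service}"

def load_config(ssm_client, env, service):
    # ssm_client would be boto3.client("ssm");
    # WithDecryption=True transparently decrypts SecureString values via KMS.
    out = {}
    paginator = ssm_client.get_paginator("get_parameters_by_path")
    for page in paginator.paginate(Path=config_path(env, service),
                                   Recursive=True, WithDecryption=True):
        for p in page["Parameters"]:
            # Keep only the leaf name: /prod/db/password -> password
            out[p["Name"].rsplit("/", 1)[-1]] = p["Value"]
    return out
```

Scoping IAM policies to the path prefix (e.g. /prod/*) is what makes the hierarchy useful for access control, not just organization.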
Session Manager
Browser-based shell access to instances
No SSH keys or bastion hosts required
Session logging to S3 or CloudWatch
IAM-based access control
Works with on-premises servers
Automation
Pre-defined runbooks for common tasks
Custom automation documents
Scheduled or event-triggered execution
Patch management and maintenance windows
AWS CloudWatch
Metrics
Collect and track metrics from AWS services
Custom metrics from applications
5-minute standard or 1-minute detailed granularity; custom metrics support 1-second high resolution
Metric math for calculations
Detailed monitoring for EC2 (1-minute intervals)
Logs
Centralized log management
Log groups and log streams
Filter patterns for searching
Metric filters to create metrics from logs
Subscription filters for real-time processing
Export to S3 for archival
Alarms
Monitor metrics and trigger actions
States: OK, ALARM, INSUFFICIENT_DATA
Actions: SNS notifications, Auto Scaling, EC2 actions
Composite alarms for complex conditions
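An alarm definition ties together metric, statistic, period, and threshold; a PutMetricAlarm request sketch for sustained EC2 CPU, with the name, threshold, and evaluation window as illustrative choices:

```python
def cpu_alarm_params(instance_id, threshold=80.0, sns_topic_arn=None):
    """Build a PutMetricAlarm request for sustained high CPU (pure, testable)."""
    params = {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,                  # seconds per datapoint
        "EvaluationPeriods": 3,         # 3 consecutive breaches -> ALARM state
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }
    if sns_topic_arn:
        params["AlarmActions"] = [sns_topic_arn]  # notify on transition to ALARM
    return params

# cloudwatch_client.put_metric_alarm(**cpu_alarm_params("i-0abc1234"))
```

Requiring three consecutive 5-minute breaches is the usual guard against alerting on a single transient spike.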
Dashboards
Visual representation of metrics
Multiple widgets (graphs, numbers, text)
Share across accounts
Real-time or historical data
AWS X-Ray
Distributed tracing for microservices
Request flow visualization
Identify performance bottlenecks
Analyze errors and exceptions
Service map showing dependencies
Trace sampling to control cost
Integration with Lambda, ECS, Elastic Beanstalk
🔐 Security & Compliance
IAM Best Practices
Account Security
Enable MFA for root account and privileged users
Never use root account for daily operations
Create IAM users with minimum necessary permissions
Use strong password policy
Rotate credentials regularly (90 days recommended)
Delete unused credentials and users
Roles and Permissions
Use IAM roles instead of long-term access keys
Grant least privilege - only required permissions
Use managed policies for common scenarios
Create custom policies for specific requirements
Review permissions regularly
Remove unused permissions
Service Control and Monitoring
Enable CloudTrail for all API activity logging
Use IAM Access Analyzer to identify unintended access
Monitor IAM credential usage reports
Set up alerts for suspicious activity
Use AWS Organizations for centralized control
Implement Service Control Policies (SCPs)
Application Security
Use IAM roles for EC2 instances (instance profiles)
Implement temporary credentials via STS
Use IAM roles for service accounts in EKS (IRSA)
Avoid embedding credentials in code
Use AWS SDK credential chain
Implement credential rotation for applications
Cross-Account Access
Use IAM roles for cross-account access
Implement external ID for third-party access
Require MFA for sensitive cross-account operations
Audit cross-account access regularly
Policy Management
Use policy conditions for fine-grained control
Implement permission boundaries for delegation
Test policies with IAM policy simulator
Version control policy documents
Document policy decisions and exceptions
Encryption at Rest and in Transit
Encryption at Rest
EBS Volumes
Encrypt via KMS when creating volume
Enable encryption by default for region
Encrypted volumes produce encrypted snapshots
Can copy unencrypted snapshot as encrypted
No performance impact from encryption
S3 Buckets
SSE-S3: S3-managed keys (AES-256)
SSE-KMS: KMS-managed keys (audit trail, key rotation)
SSE-C: Customer-provided keys (you manage keys)
Client-side encryption: Encrypt before upload
Enable default encryption at bucket level
Enforce encryption via bucket policy
RDS Databases
Enable encryption when creating DB instance
Cannot encrypt an existing unencrypted DB in place (restore an encrypted snapshot copy instead)
Transparent Data Encryption (TDE) for Oracle/SQL Server
Encrypted DB creates encrypted snapshots
Read replicas must have same encryption status
DynamoDB
Encryption at rest enabled by default
Uses AWS owned keys or KMS customer managed keys
Encrypts tables, indexes, streams, backups
EFS (Elastic File System)
Enable encryption at rest when creating file system
Uses KMS for key management
Cannot enable after creation
AWS KMS (Key Management Service)
Create and manage encryption keys
Customer managed keys (CMK) or AWS managed keys
Optional automatic key rotation for customer managed keys (yearly by default)
Key policies for access control
CloudTrail logging of key usage
Envelope encryption for large data
Import your own keys (BYOK) supported
Encryption in Transit
TLS/SSL
Use HTTPS for all API communications
AWS services support TLS 1.2+
Use ACM (AWS Certificate Manager) for SSL certificates
Automatic certificate renewal
Integration with CloudFront, ALB, API Gateway
VPN Connections
Site-to-Site VPN for on-premises connectivity
IPsec encryption by default
Client VPN for remote user access
Uses TLS-based VPN protocol
Application-Level Encryption
Encrypt sensitive data in application code
Use AWS Encryption SDK for client-side encryption
Implement field-level encryption for specific data
Use HTTPS for web applications
Best Practices
Enable encryption by default where available
Use KMS customer managed keys for audit requirements
Rotate encryption keys regularly
Use separate keys for different data classifications
Implement encryption in all environments (dev, prod)
Document encryption standards and key management
Test disaster recovery with encrypted data
Secrets Management
AWS Secrets Manager
Store, retrieve, and rotate secrets
Automatic rotation for RDS, Redshift, DocumentDB
Custom Lambda function for other secret types
Versioning of secrets (AWSCURRENT, AWSPREVIOUS)
Fine-grained IAM policies for access control
Encryption using KMS
Cross-region secret replication
Charged per secret and API call
Use Cases
Database credentials
API keys and tokens
SSH keys
Application configuration
Automatic Rotation
Configure rotation schedule (days)
Uses Lambda function to update secret
Simultaneous use of old and new versions during rotation
RDS integration updates database and secret atomically
Retrieving Secrets
python
import boto3

# Fetch the current (AWSCURRENT) version of a secret
client = boto3.client('secretsmanager')
response = client.get_secret_value(SecretId='prod/db/password')
secret = response['SecretString']  # binary secrets are under 'SecretBinary'
AWS Systems Manager Parameter Store
Store configuration data and secrets
Free tier available (standard parameters)
Advanced parameters for higher throughput
SecureString type encrypted with KMS
Hierarchical structure for organization
Versioning and change history
No automatic rotation (manual update)
Integration with CloudFormation and other services
Comparison: Secrets Manager vs Parameter Store
Secrets Manager: Automatic rotation, higher cost, designed for secrets
Parameter Store: No rotation, lower cost (free tier), general configuration
Best Practices
Never hardcode secrets in source code
Use Secrets Manager for secrets requiring rotation
Use Parameter Store for non-sensitive configuration
Implement least privilege access to secrets
Enable CloudTrail logging for secret access
Use VPC endpoints for private access
Rotate secrets regularly (even without automatic rotation)
Use different secrets for different environments
Implement secret scanning in CI/CD pipelines
Document secret naming conventions
Compliance & Auditing
AWS CloudTrail
Records all AWS API calls in account
Who, what, when, where for every action
Logs delivered to S3 bucket
Optional CloudWatch Logs integration
Enable for all regions
Validate log file integrity with digest files
Key Information Captured
Identity of API caller
Time of API call
Source IP address
Request parameters
Response elements
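The who/what/when/where fields above come straight out of each CloudTrail record; a sketch that summarizes one trimmed record (real events carry many more fields, and the values here are invented):

```python
import json

# A trimmed, invented CloudTrail record for illustration.
record = json.loads("""{
  "eventTime": "2024-05-01T12:00:00Z",
  "eventName": "TerminateInstances",
  "eventSource": "ec2.amazonaws.com",
  "sourceIPAddress": "203.0.113.10",
  "userIdentity": {"type": "IAMUser", "userName": "alice"}
}""")

def summarize(event):
    """Who did what, when, and from where."""
    return (f'{event["userIdentity"].get("userName", "unknown")} called '
            f'{event["eventName"]} on {event["eventSource"]} '
            f'at {event["eventTime"]} from {event["sourceIPAddress"]}')

print(summarize(record))
```

This is the same extraction a CloudWatch Logs metric filter or SIEM rule performs when flagging sensitive calls like TerminateInstances.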
Best Practices
Enable in all regions (multi-region trail)
Enable log file validation
Encrypt logs with KMS
Set S3 bucket lifecycle policies for retention
Use CloudWatch Logs for real-time monitoring
Create metric filters for security events
Monitor for unusual API activity
Integrate with SIEM tools
AWS Config
Track resource configuration changes over time
Evaluate compliance against rules
Configuration history and snapshots
Relationship tracking between resources
Remediation actions for non-compliant resources
Config Rules
AWS managed rules for common checks
Custom rules using Lambda functions
Evaluate on configuration change or periodic
Examples: encrypted volumes, approved AMIs, required tags
Compliance Monitoring
Continuous compliance assessment
Dashboard showing compliance status
Aggregate data across accounts (Config aggregator)
Generate compliance reports
AWS GuardDuty
Intelligent threat detection service
Analyzes CloudTrail, VPC Flow Logs, DNS logs
Machine learning for anomaly detection
Identifies compromised instances, reconnaissance
Findings prioritized by severity
Integration with Security Hub and EventBridge
Threat Detection
Unusual API calls
Unauthorized deployments
Compromised instances (cryptocurrency mining, malware)
Account compromise
Reconnaissance activity
AWS Security Hub
Centralized security findings
Aggregates from GuardDuty, Inspector, Macie, IAM Access Analyzer
Security standards compliance (CIS, PCI-DSS)
Automated remediation with EventBridge
Cross-account aggregation
Priority-based findings
AWS Compliance Programs
SOC 1, 2, 3 reports
PCI DSS compliance
HIPAA eligible services
GDPR compliance support
ISO certifications
FedRAMP authorization
Audit Best Practices
Enable CloudTrail in all accounts
Configure AWS Config for compliance monitoring
Enable GuardDuty for threat detection
Use Security Hub for centralized view
Implement automated remediation
Regular security assessments
Document compliance controls
Train teams on security best practices
Conduct periodic security reviews
Use AWS Artifact for compliance reports
Network Security
Security Groups
Stateful firewall at instance level
Allow rules only (no deny rules)
Evaluate all rules before deciding
Can reference other security groups
Separate rules for inbound and outbound
Changes take effect immediately
Best Practices
Separate security groups by tier (web, app, db)
Use descriptive names and descriptions
Reference security groups instead of CIDR when possible
Minimize use of 0.0.0.0/0 for inbound rules
Regularly audit security group rules
Remove unused security groups
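"Reference security groups instead of CIDR" looks like this in an AuthorizeSecurityGroupIngress request; a sketch with hypothetical group IDs and port, built as a plain dict:

```python
def app_from_web_rule(app_sg_id, web_sg_id, port=8080):
    """Ingress rule letting the web tier's SG reach the app tier's SG (pure, testable).

    Referencing the web tier's security group rather than a CIDR keeps the rule
    valid as web instances are replaced by auto scaling.
    """
    return {
        "GroupId": app_sg_id,
        "IpPermissions": [{
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "UserIdGroupPairs": [{"GroupId": web_sg_id, "Description": "web tier"}],
        }],
    }

# ec2_client.authorize_security_group_ingress(**app_from_web_rule("sg-app123", "sg-web456"))
```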
Network ACLs (NACLs)
Stateless firewall at subnet level
Allow and deny rules
Rules evaluated in number order (lowest first)
Explicit deny or allow
Separate rules for inbound and outbound
Apply to all instances in subnet
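The lowest-rule-number-wins evaluation can be sketched as a first-match scan over sorted rules, with the implicit catch-all deny at the end (rule numbers and port ranges here are illustrative):

```python
def nacl_decision(rules, port):
    """First match in ascending rule-number order wins; the '*' rule denies the rest."""
    for number, action, (low, high) in sorted(rules):
        if low <= port <= high:
            return action
    return "deny"  # the implicit catch-all '*' rule

rules = [
    (100, "allow", (443, 443)),   # HTTPS
    (200, "deny",  (0, 65535)),   # everything else explicitly denied
]
print(nacl_decision(rules, 443))  # allow
print(nacl_decision(rules, 22))   # deny
```

Because NACLs are stateless, a real subnet needs matching rules in both directions, including the ephemeral return ports; the sketch checks one direction only.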
NACL vs Security Groups
NACLs: Subnet level, stateless, allow/deny
Security Groups: Instance level, stateful, allow only
Use both for defense in depth
AWS WAF (Web Application Firewall)
Protect web applications from exploits
Deploy on CloudFront, ALB, API Gateway, AppSync
Customizable rules for filtering traffic
Managed rule groups from AWS and partners
Protection Against
SQL injection
Cross-site scripting (XSS)
Geo-blocking
Rate limiting
IP reputation lists
Bot control
Rule Types
IP match conditions
String match conditions
Geo match conditions
Size constraint conditions
SQL injection match
Cross-site scripting match
AWS Shield
DDoS protection service
Shield Standard: Automatic, free, layer 3/4 protection
Shield Advanced: Enhanced protection, DDoS Response Team, cost protection
Shield Advanced Features
Advanced DDoS detection and mitigation
24/7 access to DDoS Response Team (DRT)
DDoS cost protection (scaling charges waived)
Integration with WAF at no extra cost
Protection for EC2, ELB, CloudFront, Route 53, Global Accelerator
VPN and Private Connectivity
Site-to-Site VPN for on-premises connection
AWS Direct Connect for dedicated connection
VPN over Direct Connect for encrypted dedicated line
Client VPN for remote user access
VPC peering for private VPC-to-VPC
Network Security Best Practices
Implement defense in depth (SG + NACL + WAF)
Use VPC Flow Logs for traffic analysis
Enable GuardDuty for threat detection
Implement least privilege network access
Segment network with multiple subnets
Use private subnets for resources without internet access
Enable Shield Advanced for critical applications
Regular security group audits
Monitor for unusual network patterns
Use AWS Firewall Manager for centralized rule management
Quick Reference Checklists
📋 Pre-Deployment Checklist
Code Quality
Code reviewed and approved
All tests passing (unit, integration, security)
No critical security vulnerabilities
Code coverage meets requirements
Linting and formatting checks passed
Dependencies updated and scanned
Infrastructure
Infrastructure code validated (terraform plan)
Resource limits and quotas checked
Auto-scaling policies configured
Monitoring and alerting set up
Log aggregation configured
Backup and restore tested
Security
Security groups reviewed and minimized
IAM policies follow least privilege
Secrets stored securely (no hardcoded credentials)
Encryption enabled (at rest and in transit)
SSL certificates valid and renewed
Compliance requirements met
Documentation
Deployment runbook updated
Rollback procedure documented
Architecture diagrams current
Configuration changes documented
Known issues and workarounds listed
Contact information for escalation
Communication
Stakeholders notified of deployment window
Change management ticket approved
Deployment announcement sent
On-call team informed
Rollback decision-makers identified
Validation
Smoke tests prepared
Health check endpoints verified
Performance baseline established
Disaster recovery plan tested
Monitoring dashboards ready
Success criteria defined
🔒 Security Checklist
Identity and Access
MFA enabled for all privileged accounts
Root account not used for daily operations
IAM users follow least privilege principle
Unused IAM credentials removed
Cross-account access properly configured
Service roles used instead of access keys
Data Protection
Encryption at rest enabled (EBS, S3, RDS)
Encryption in transit enforced (TLS/SSL)
KMS keys properly managed
Secrets stored in Secrets Manager or Parameter Store
No credentials in source code or logs
Data backup and retention policies configured
Network Security
Security groups follow least privilege
NACLs configured for subnet protection
VPC Flow Logs enabled
WAF rules configured for web applications
DDoS protection enabled (Shield)
Private subnets for internal resources
Monitoring and Compliance
CloudTrail enabled in all regions
CloudWatch alarms for security events
AWS Config rules for compliance
GuardDuty enabled for threat detection
Security Hub aggregating findings
Regular security audits scheduled
Application Security
Input validation implemented
SQL injection prevention in place
XSS protection configured
CSRF tokens used
Rate limiting implemented
Security headers configured
🚀 Deployment Best Practices
Version Control
Use semantic versioning (MAJOR.MINOR.PATCH)
Tag releases in Git
Maintain changelog for each release
Branch protection rules enforced
Code review required before merge
Testing Strategy
Unit tests (>80% coverage)
Integration tests for critical paths
Security scanning (SAST/DAST)
Performance testing under load
Chaos engineering for resilience
Deployment Strategies
Blue/Green: Zero downtime, easy rollback, double cost temporarily
Canary: Gradual rollout, risk mitigation, requires monitoring
Rolling: Incremental updates, maintains capacity, slower deployment
Rollback Plan
Keep previous version artifacts
Database migration rollback scripts
Quick rollback procedure documented
Monitoring for rollback triggers
Communication plan for failures
Post-Deployment
Monitor error rates and latency
Verify all features working
Check logs for anomalies
Validate database migrations
Confirm backup success
Update documentation
🎯 Common Interview Topics
Technical Deep Dives
CI/CD Pipeline Design
Explain stages: Source → Build → Test → Deploy
Discuss artifact management between stages
Describe automated testing strategy
Explain deployment strategies (blue/green, canary, rolling)
Detail rollback mechanisms
Blue-Green vs Rolling Deployment
Blue/Green: Deploy to new environment, switch traffic, instant rollback
Rolling: Update instances incrementally, maintains capacity, gradual rollout
Trade-offs: Cost, downtime, rollback speed, complexity
Terraform State Management
State file tracks infrastructure
Remote backend (S3 + DynamoDB) for collaboration
State locking prevents concurrent modifications
Sensitive data in state requires encryption
State file should never be manually edited
Docker Multi-Stage Builds
Separate build and runtime stages
Reduces final image size
Only necessary files in production image
Improves security and performance
Example: Build stage with dev dependencies, runtime stage with only production files
Kubernetes Scaling Strategies
Horizontal Pod Autoscaler (HPA): Scale pods based on CPU/memory/custom metrics
Vertical Pod Autoscaler (VPA): Adjust pod resource requests/limits
Cluster Autoscaler: Scale worker nodes based on pending pods
Manual Scaling: kubectl scale for testing or planned events
IAM Policy Evaluation
Explicit Deny wins (immediate rejection)
Explicit Allow required from at least one policy
Implicit Deny if no explicit allow
Evaluation order: SCPs → Permission Boundaries → Identity/Resource Policies
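The deny-overrides logic above can be sketched as a pure function; a deliberately simplified model with exact-match statements only (no wildcards, conditions, SCPs, or permission boundaries):

```python
def evaluate(policies, action, resource):
    """Explicit Deny wins, then explicit Allow, else implicit deny (simplified sketch)."""
    decision = "ImplicitDeny"
    for policy in policies:
        for stmt in policy["Statement"]:
            if action in stmt["Action"] and resource in stmt["Resource"]:
                if stmt["Effect"] == "Deny":
                    return "ExplicitDeny"      # short-circuit: deny always wins
                decision = "Allow"             # keep scanning in case a deny follows
    return decision

allow = {"Statement": [{"Effect": "Allow", "Action": ["s3:GetObject"],
                        "Resource": ["arn:aws:s3:::logs/app.log"]}]}
deny = {"Statement": [{"Effect": "Deny", "Action": ["s3:GetObject"],
                       "Resource": ["arn:aws:s3:::logs/app.log"]}]}

print(evaluate([allow], "s3:GetObject", "arn:aws:s3:::logs/app.log"))        # Allow
print(evaluate([allow, deny], "s3:GetObject", "arn:aws:s3:::logs/app.log"))  # ExplicitDeny
print(evaluate([], "s3:GetObject", "arn:aws:s3:::logs/app.log"))             # ImplicitDeny
```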
VPC Design Patterns
Multi-tier architecture (public, private, database subnets)
Multi-AZ deployment for high availability
NAT Gateway per AZ for redundancy
VPC endpoints for AWS service access
Network segmentation with security groups and NACLs
RDS High Availability
Multi-AZ for automatic failover (1-2 minutes)
Read replicas for read scaling (not for HA)
Automated backups with point-in-time recovery
Manual snapshots before major changes
Cross-region replication for disaster recovery
Scenario-Based Questions
"How would you design a highly available web application?"
Multi-AZ deployment across at least 2 AZs
Application Load Balancer distributing traffic
Auto Scaling Group maintaining instance count
RDS Multi-AZ for database high availability
ElastiCache for session management and caching
S3 for static content with CloudFront CDN
Route 53 for DNS with health checks
CloudWatch monitoring and alarms
"Explain your approach to zero-downtime deployments"
Use blue/green or rolling deployment strategy
Implement health checks at load balancer
Connection draining during instance replacement
Database migrations backward compatible
Feature flags for gradual feature rollout
Comprehensive monitoring and automated rollback
Canary deployments to test with small traffic percentage
"How do you secure secrets in CI/CD pipelines?"
Store in AWS Secrets Manager or Parameter Store
Use IAM roles for service access (no hardcoded credentials)
Encrypt secrets at rest with KMS
Rotate secrets regularly
Audit secret access with CloudTrail
Use environment-specific secrets
Never commit secrets to version control
Implement secret scanning in pipelines
"Describe troubleshooting a production incident"
Check monitoring dashboards and alarms
Review recent deployments and changes
Analyze application and system logs
Check CloudWatch metrics for anomalies
Verify infrastructure health (EC2, RDS, load balancers)
Use distributed tracing (X-Ray) for request flow
Check VPC Flow Logs for network issues
Escalate to appropriate teams if needed
Document findings and resolution
"How would you optimize AWS costs?"
Right-size instances using Compute Optimizer
Use Reserved Instances or Savings Plans
Implement auto-scaling to match demand
S3 lifecycle policies to cheaper storage classes
Delete unused resources (EBS volumes, snapshots)
Use Spot Instances for fault-tolerant workloads
CloudWatch to identify underutilized resources
Cost allocation tags for visibility
Scheduled scaling for predictable patterns
📚 Key Concepts Summary
Infrastructure as Code
Terraform for multi-cloud provisioning
CloudFormation for AWS-native IaC
State management critical for team collaboration
Modules for reusable infrastructure components
Version control all infrastructure code
Container Orchestration
Docker for containerization and portability
Kubernetes for complex orchestration needs
ECS/Fargate for AWS-native container services
Helm for Kubernetes package management
Health checks and resource limits essential
CI/CD Principles
Automate everything (build, test, deploy)
Fast feedback loops for developers
Automated testing at multiple stages
Deployment strategies for risk mitigation
Monitoring and observability built-in
AWS Security
Least privilege IAM policies
Encryption everywhere (at rest and in transit)
Secrets management with dedicated services
Network isolation with VPCs and security groups
Continuous monitoring and compliance
High Availability
Multi-AZ deployments for redundancy
Auto Scaling for capacity management
Load balancing for traffic distribution
Database replication and backups
Disaster recovery planning and testing
Monitoring and Observability
CloudWatch for metrics and logs
X-Ray for distributed tracing
Custom metrics for business KPIs
Alarms for proactive issue detection
Dashboards for visibility
💡 Pro Tips for Interviews
Technical Communication
Start with high-level overview, then dive into details
Use diagrams when explaining architecture
Mention trade-offs for design decisions
Relate answers to real-world scenarios
Ask clarifying questions before answering
Demonstrating Experience
Share specific examples from past projects
Explain challenges faced and how you solved them
Discuss lessons learned from failures
Mention tools and technologies you've used
Show understanding of best practices
Problem-Solving Approach
Clarify requirements and constraints
Consider multiple solutions
Evaluate pros and cons
Recommend solution with justification
Discuss implementation steps
AWS-Specific Tips
Know the differences between similar services
Understand pricing models and cost optimization
Be familiar with AWS Well-Architected Framework
Stay updated on new AWS services and features
Mention AWS documentation and best practices
Common Mistake Patterns to Avoid
Don't just list features; explain use cases
Don't ignore security considerations
Don't forget monitoring and observability
Don't overlook cost implications
Don't skip disaster recovery planning
🔧 Troubleshooting Guide
EC2 Instance Issues
Cannot Connect via SSH
Check security group allows port 22 from your IP
Verify instance has public IP (if accessing from internet)
Confirm key pair is correct
Check NACL rules allow SSH traffic
Verify instance is in running state
Check route table for internet gateway route
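The security-group step in the checklist above can be automated. A minimal sketch, assuming ingress rules in the boto3 `describe_security_groups` shape (the function name is ours):

```python
import ipaddress

def ssh_open_from(ingress_rules, client_ip):
    """Return True if any ingress rule permits TCP port 22 from client_ip.

    ingress_rules follows the boto3 describe_security_groups shape, e.g.
    [{"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
      "IpRanges": [{"CidrIp": "203.0.113.0/24"}]}]
    """
    ip = ipaddress.ip_address(client_ip)
    for rule in ingress_rules:
        proto = rule.get("IpProtocol")
        if proto not in ("tcp", "-1"):  # "-1" means all protocols
            continue
        if proto == "tcp":
            # Skip rules whose port range does not include 22
            if not (rule.get("FromPort", 0) <= 22 <= rule.get("ToPort", 65535)):
                continue
        for ip_range in rule.get("IpRanges", []):
            if ip in ipaddress.ip_network(ip_range["CidrIp"]):
                return True
    return False
```

Feed it the `IpPermissions` list from the instance's security groups; if it returns False for your public IP, the security group is the blocker.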
High CPU Utilization
Check CloudWatch metrics for spike patterns
Review running processes (top, htop)
Analyze application logs for errors
Consider instance type upgrade
Implement auto-scaling if variable load
Check for resource-intensive queries or operations
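To spot spike patterns programmatically, you can filter CloudWatch datapoints. A hedged sketch, assuming the boto3 `get_metric_statistics` datapoint shape (the helper name and threshold are ours):

```python
def cpu_spikes(datapoints, threshold=80.0):
    """Return timestamps of CPUUtilization datapoints whose Average
    exceeds threshold.

    Datapoint shape (boto3): {"Timestamp": ..., "Average": 93.5, "Unit": "Percent"}
    """
    spikes = [d["Timestamp"] for d in datapoints if d.get("Average", 0.0) > threshold]
    return sorted(spikes)
```

If spikes cluster at regular times, look for scheduled jobs; if they track traffic, auto-scaling is the better fix than a bigger instance.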
Instance Status Checks Failing
System status check failure: underlying AWS infrastructure issue; stop and start the instance to move it to healthy hardware
Instance status check failure: OS or instance-level issue; reboot or investigate the guest OS
Review system logs via console
Check for kernel panics or system errors
Container Issues
Docker Container Won't Start
Check container logs (docker logs)
Verify image exists and is accessible
Check port conflicts with host
Verify sufficient disk space
Review Dockerfile for errors
Check resource limits (memory, CPU)
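The checklist above can be partly mechanized by reading the `State` object from `docker inspect`. A sketch, assuming Docker's inspect field names (the diagnosis strings are ours):

```python
def diagnose_container(state):
    """Map a `docker inspect` State object to a likely cause.

    Example shape: {"Status": "exited", "ExitCode": 137,
                    "OOMKilled": True, "Error": ""}
    """
    if state.get("OOMKilled"):
        return "out of memory: raise the container memory limit"
    if state.get("Error"):
        return f"daemon error: {state['Error']}"
    code = state.get("ExitCode", 0)
    if code == 126:
        return "command found but not executable: check ENTRYPOINT permissions"
    if code == 127:
        return "command not found: check ENTRYPOINT/CMD in the Dockerfile"
    if code != 0:
        return f"application exited with code {code}: check `docker logs`"
    return "container exited cleanly"
```

Pipe `docker inspect --format '{{json .State}}' <container>` through `json.loads` to get the input dict.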
Kubernetes Pod CrashLoopBackOff
Describe pod for error messages (kubectl describe pod)
Check pod logs (kubectl logs)
Verify image pull secrets for private registries
Check resource requests and limits
Verify ConfigMaps and Secrets exist
Review liveness and readiness probes
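The describe/logs steps above boil down to finding the last termination reason. A sketch that walks `kubectl get pod -o json` output (the field names follow the Kubernetes Pod API; the function name is ours):

```python
def crashloop_reason(pod):
    """Given `kubectl get pod -o json` output as a dict, return
    (container, reason, exitCode) for the first container stuck in
    CrashLoopBackOff, or None if there is none."""
    for cs in pod.get("status", {}).get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting", {})
        if waiting.get("reason") == "CrashLoopBackOff":
            last = cs.get("lastState", {}).get("terminated", {})
            return (cs["name"], last.get("reason"), last.get("exitCode"))
    return None
```

An `OOMKilled`/137 result points at resource limits; `Error` with a nonzero code points at the application or its config (missing ConfigMaps/Secrets, failing probes).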
ECS Task Failing to Start
Check task definition for errors
Verify IAM task role has required permissions
Check security group allows required ports
Review CloudWatch logs for errors
Verify sufficient resources in cluster
Check ECR repository permissions
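ECS usually states why a task stopped in `stoppedReason`. A sketch over the boto3 `describe_tasks` response shape (the helper name is ours):

```python
def ecs_stop_reasons(tasks):
    """From an ECS describe_tasks response, collect (taskArn, stoppedReason)
    for stopped tasks. A reason like "CannotPullContainerError: ..."
    points at ECR permissions or a missing image tag."""
    return [(t["taskArn"], t.get("stoppedReason", ""))
            for t in tasks
            if t.get("lastStatus") == "STOPPED"]
```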
Database Issues
RDS Connection Timeouts
Verify security group allows port 3306 (MySQL) or 5432 (PostgreSQL)
Check NACL rules
Confirm instance is available
Verify endpoint and port are correct
Check VPC routing tables
Ensure client is in correct VPC/subnet
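A plain TCP probe separates network-path problems from database-level ones. A minimal sketch using only the standard library:

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    If this fails, suspect security groups, NACLs, or routing; if it
    succeeds but queries still time out, look at the database itself.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run it from a client in the same VPC/subnet as your application, e.g. `can_connect("mydb.xxxx.us-east-1.rds.amazonaws.com", 5432)`.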
RDS High CPU or IOPS
Identify slow queries with Performance Insights
Review missing indexes
Check for lock contention
Consider read replicas for read-heavy workloads
Optimize queries and database schema
Scale up instance class if needed
Database Backup Failures
Check backup window configuration
Verify sufficient storage for backups
Review IAM permissions for RDS
Check automated backup retention period
Review RDS events for error messages
Network Issues
Cannot Reach Internet from Private Subnet
Verify NAT Gateway exists in public subnet
Check route table has 0.0.0.0/0 → NAT Gateway
Confirm NAT Gateway has Elastic IP
Verify security groups allow outbound traffic
Check NACL rules allow return traffic
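The route-table check above can be expressed directly against the boto3 `describe_route_tables` shape (the function name is ours):

```python
def has_nat_default_route(routes):
    """Return True if the route list contains an active 0.0.0.0/0 route
    targeting a NAT gateway.

    Route shape (boto3): {"DestinationCidrBlock": "0.0.0.0/0",
                          "NatGatewayId": "nat-0abc...", "State": "active"}
    """
    for route in routes:
        if (route.get("DestinationCidrBlock") == "0.0.0.0/0"
                and route.get("NatGatewayId", "").startswith("nat-")
                and route.get("State") == "active"):
            return True
    return False
```

A `State` of `blackhole` means the NAT gateway the route points at no longer exists, which also breaks internet egress.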
ALB Returns 502 Bad Gateway
Check target group health checks
Verify targets are healthy and registered
Confirm security groups allow traffic
Review target application logs
Check for connection timeouts
Verify target group protocol and port
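For the target-health step, AWS reports a machine-readable reason per target. A sketch over the `elbv2 describe_target_health` response shape (the helper name is ours):

```python
def unhealthy_targets(descriptions):
    """From TargetHealthDescriptions, list (target-id, reason) for every
    target that is not healthy, e.g. ("i-0abc", "Target.Timeout")."""
    out = []
    for d in descriptions:
        health = d.get("TargetHealth", {})
        if health.get("State") != "healthy":
            out.append((d["Target"]["Id"], health.get("Reason", "unknown")))
    return out
```

`Target.Timeout` suggests security groups or a hung application; `Target.ResponseCodeMismatch` suggests the health-check path or expected status code is wrong.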
VPC Peering Not Working
Verify peering connection is active
Check route tables in both VPCs
Confirm CIDR blocks don't overlap
Verify security groups allow traffic
Check NACLs in both VPCs
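The CIDR-overlap check is one line with the standard library:

```python
import ipaddress

def cidrs_overlap(cidr_a, cidr_b):
    """VPC peering requires non-overlapping CIDR blocks; True means the
    two ranges share addresses and peering routes cannot work."""
    return ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))
```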
CI/CD Pipeline Issues
CodeBuild Failing
Review build logs in CodeBuild console
Check buildspec.yml syntax
Verify IAM permissions for CodeBuild role
Check environment variables and secrets
Verify sufficient build timeout
Review VPC configuration if accessing private resources
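When checking `buildspec.yml` syntax, compare against a known-good minimal file. A hedged example for a Node.js project (project commands and `dist` output directory are assumptions):

```yaml
version: 0.2
phases:
  install:
    runtime-versions:
      nodejs: 18
  pre_build:
    commands:
      - npm ci
  build:
    commands:
      - npm test
      - npm run build
artifacts:
  files:
    - '**/*'
  base-directory: dist
```

Common pitfalls: tabs instead of spaces, a missing `version: 0.2`, and phase names that CodeBuild does not recognize.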
CodeDeploy Deployment Failing
Review deployment logs
Check AppSpec file syntax
Verify IAM permissions for CodeDeploy
Check instance tags match deployment group
Review lifecycle hook scripts for errors
Verify CodeDeploy agent is running on instances
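For the AppSpec check, a minimal known-good `appspec.yml` for EC2/on-premises deployments helps (the script paths and destination are example values):

```yaml
version: 0.0
os: linux
files:
  - source: /
    destination: /var/www/app
hooks:
  ApplicationStop:
    - location: scripts/stop.sh
      timeout: 60
  AfterInstall:
    - location: scripts/install_deps.sh
      timeout: 300
  ApplicationStart:
    - location: scripts/start.sh
      timeout: 60
```

Hook scripts must be executable and exit 0; a nonzero exit in any lifecycle hook fails the deployment for that instance.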
Terraform Apply Failing
Review error message carefully
Check for resource conflicts or duplicates
Verify AWS credentials and permissions
Ensure state file is not locked
Review terraform plan output
Check for API rate limiting
📖 Additional Resources
Official Documentation
AWS Documentation: https://docs.aws.amazon.com
Terraform Documentation: https://www.terraform.io/docs
Kubernetes Documentation: https://kubernetes.io/docs
Docker Documentation: https://docs.docker.com
AWS Training and Certification
AWS Skill Builder: Free digital training
AWS Solutions Architect Associate
AWS DevOps Engineer Professional
AWS Certified Developer Associate
Best Practice Frameworks
AWS Well-Architected Framework
12-Factor App Methodology
CNCF Cloud Native Principles
DevOps Research and Assessment (DORA) Metrics
Community Resources
AWS re:Post (community Q&A)
Stack Overflow
GitHub repositories and examples
AWS Blog and What's New feed
Continuous Learning
Stay updated with AWS announcements
Participate in AWS events and webinars
Practice with hands-on labs
Build personal projects
Contribute to open-source projects
📝 Final Notes
Interview Preparation Strategy
1. Week 1-2: Review core concepts and services
2. Week 3-4: Hands-on practice with AWS console and CLI
3. Week 5-6: Build sample projects demonstrating skills
4. Week 7: Mock interviews and scenario practice
5. Week 8: Review weak areas and final preparation
During the Interview
Listen carefully to questions
Think before answering
Admit if you don't know something
Show problem-solving approach
Ask questions about the role and team
Be enthusiastic and genuine
After the Interview
Send thank-you email within 24 hours
Reflect on questions you struggled with
Continue learning and improving
Follow up appropriately on timeline
Key Success Factors
Strong fundamentals: Understanding core concepts deeply
Hands-on experience: Practical application of knowledge
Problem-solving skills: Ability to troubleshoot and debug
Communication: Clearly explaining technical concepts
Continuous learning: Staying current with technologies
🎓 Conclusion
This guide covers the essential topics for AWS DevOps interviews and serves as a daily reference for common
tasks and best practices. Remember that interviews assess not just technical knowledge but also:
Problem-solving approach
Communication skills
Collaborative mindset
Learning agility
Cultural fit
Key Takeaways:
1. Master the fundamentals before diving into advanced topics
2. Practice hands-on with real AWS services
3. Understand the "why" behind best practices
4. Be prepared to discuss trade-offs in design decisions
5. Share real-world experiences and lessons learned
Good luck with your interviews and DevOps journey!
Document Version: 1.0
Last Updated: November 2025
For personal use and interview preparation