Skip to main content

AWS Interview Questions

Comprehensive collection of AWS interview questions organized by difficulty level, from Cloud Practitioner fundamentals through Solutions Architect advanced scenarios. Each answer includes key concepts and CLI examples where relevant.


Beginner Level (1-20)

1. What is AWS and why should companies use it?

AWS (Amazon Web Services) is a comprehensive cloud computing platform offering on-demand infrastructure, platforms, and software as services. Companies benefit from reduced capital expenditure (no need to buy hardware), scalability (pay only for what you use), global infrastructure (deploy in multiple regions), and access to cutting-edge technologies. AWS handles hardware management, security patching, and infrastructure scaling, letting teams focus on business logic rather than infrastructure operations.

2. What are the three main types of cloud computing models?

The three cloud models are: IaaS (Infrastructure as a Service) where you manage applications, data, runtime, and middleware while AWS handles virtualization, servers, storage, and networking (e.g., EC2); PaaS (Platform as a Service) where you manage applications and data while AWS handles everything else (e.g., Elastic Beanstalk); SaaS (Software as a Service) where AWS manages everything and you just use the application (e.g., Workmail, Chime).

3. What is AWS Global Infrastructure and what are its components?

AWS Global Infrastructure consists of Regions (geographic areas with multiple data centers), Availability Zones (isolated data centers within regions), Edge Locations (cache points for CloudFront CDN), and Local Zones (extend compute and database services closer to users). This distributed architecture enables low-latency access globally and provides high availability through redundancy across multiple zones.

4. What is an Availability Zone and why are they important?

An Availability Zone (AZ) is a separate data center within an AWS Region with independent power, cooling, and networking. Using multiple AZs provides high availability and disaster recovery because if one AZ fails, your application can continue running in another. AWS charges no data transfer fees between AZs within a region, making redundancy practical. Most production deployments span at least 2 AZs.

5. What is IAM and what are its core components?

Identity and Access Management (IAM) is AWS's authentication and authorization service. Core components include Users (individual identities), Groups (collections of users), Roles (sets of permissions for services or cross-account access), Policies (JSON documents defining permissions), and Access Keys (credentials for programmatic access). IAM enables fine-grained access control following the principle of least privilege.

# Create user and attach policy
aws iam create-user --user-name developer
aws iam attach-user-policy --user-name developer --policy-arn arn:aws:iam::aws:policy/AmazonEC2FullAccess

6. What is the difference between IAM Users and IAM Roles?

IAM Users have permanent credentials (access keys, passwords) and are intended for people or applications that need long-term access. IAM Roles have temporary credentials automatically rotated by AWS and are intended for services (EC2, Lambda) or cross-account access. Roles are more secure for service-to-service communication because credentials expire and are refreshed automatically without manual rotation.

7. What are EC2 instance types and can you name several?

EC2 instance types are categorized by purpose: General Purpose (t3, m5) for balanced compute/memory/network; Compute Optimized (c5) for batch processing and HPC; Memory Optimized (r5) for in-memory databases; Storage Optimized (i3) for NoSQL and data warehousing; Accelerated Computing (g4, p3) for GPU/FPGA workloads. Instance sizes range from nano to 24xlarge, with t-series providing burstable performance at lower cost for variable workloads.

8. What are S3 storage classes and when would you use each?

S3 storage classes optimize cost based on access patterns: Standard for frequent access; Standard-IA (Infrequent Access) for data accessed less than once monthly; Glacier for long-term archival (retrieval in hours); Glacier Deep Archive for rarely accessed backups (retrieval in 12 hours); Intelligent-Tiering for unknown access patterns (moves objects between tiers automatically). Use lifecycle policies to automatically transition objects between classes based on age.

9. What is a VPC and why is it important?

A Virtual Private Cloud (VPC) is an isolated network environment in AWS where you define IP address ranges, create subnets, configure route tables, and apply security controls. Every AWS resource must run within a VPC. VPCs provide network isolation (multi-tenancy security), allow you to segment applications into public and private subnets, enable custom routing, and support connectivity to on-premises networks. Each AWS account has a default VPC, but you can create custom VPCs for production workloads.

# Create VPC
aws ec2 create-vpc --cidr-block 10.0.0.0/16

10. What is the difference between Security Groups and Network ACLs?

Security Groups are stateful firewalls at the instance level (like application firewalls) that allow or deny traffic based on rules; if you allow inbound traffic, return traffic is automatically allowed. Network ACLs (NACLs) are stateless firewalls at the subnet level; you must explicitly allow both inbound and outbound traffic. Security Groups are less restrictive and easier to manage, while NACLs provide additional subnet-level protection. Most deployments rely primarily on Security Groups with NACLs as secondary defense.

11. What is a public subnet vs. a private subnet?

A public subnet has a route to an Internet Gateway in its route table, allowing instances to reach the internet and be reached from the internet (if they have public IPs). A private subnet has no direct internet route; instances can reach the internet through a NAT Gateway/Instance in a public subnet but are not directly reachable from the internet. Private subnets are used for databases, caches, and other backend services that should not be internet-facing.

12. What is Route 53 and what are its main functions?

Route 53 is AWS's DNS service that translates domain names to IP addresses. Its main functions include Domain Registration (buy/manage domain names), DNS Management (create DNS records), Health Checks (monitor endpoint availability), and Traffic Routing (route traffic based on latency, geolocation, or failover). Route 53 integrates with ELB and CloudFront and enables zero-downtime deployments through weighted or failover routing policies.

13. What is CloudWatch and what does it monitor?

CloudWatch is AWS's monitoring and logging service that collects metrics from all AWS resources. It monitors EC2 instances (CPU, disk, network), application performance, costs, and custom metrics. CloudWatch Logs aggregate application and system logs, CloudWatch Alarms trigger notifications or auto-scaling actions when thresholds are exceeded, and CloudWatch Dashboards visualize metrics. Default metric retention is 15 months; detailed monitoring requires additional configuration.

14. What is the difference between CloudWatch and CloudTrail?

CloudWatch monitors resource performance metrics and logs application behavior in real-time (CPU, memory, latency, errors). CloudTrail is an audit logging service that records all AWS API calls and resource changes for compliance and troubleshooting, showing who did what and when. CloudWatch answers "What is happening now?", while CloudTrail answers "What happened and who did it?" Both are essential for operations and security.

15. What is EBS and how does it differ from Instance Store?

EBS (Elastic Block Storage) is persistent block storage that persists after instance termination, can be detached and reattached to other instances, and supports snapshots. Instance Store is temporary storage directly attached to the physical host; it offers higher performance but is lost when the instance stops or terminates. Use EBS for production databases and important data, Instance Store for caching and temporary data. EBS volumes are replicated across multiple servers within an AZ automatically.

16. What is an Elastic IP and why would you need one?

An Elastic IP is a static public IP address that persists across instance stop/start cycles. Unlike public IPs that change when an instance is stopped, Elastic IPs remain associated with your account. Use Elastic IPs for DNS records, emergency failover to standby instances, or when you need to whitelist a specific IP. AWS charges for unused Elastic IPs, so deallocate them when not in use. You can have a maximum of 5 per region (request increases via support).

17. What is Auto Scaling and how does it work?

Auto Scaling automatically adjusts the number of EC2 instances based on demand or schedules. Dynamic Scaling adds/removes instances based on CloudWatch alarms (e.g., CPU greater than 70% adds instances, less than 30% removes them). Scheduled Scaling launches instances at specific times. Predictive Scaling uses machine learning to forecast demand. Auto Scaling Groups define minimum, desired, and maximum instance counts. Combined with load balancers, Auto Scaling enables cost-efficient, highly available applications.

# Create Auto Scaling Group
aws autoscaling create-auto-scaling-group --auto-scaling-group-name my-asg --launch-configuration-name my-lc --min-size 1 --max-size 5 --desired-capacity 2

18. What is an Elastic Load Balancer and what are the types?

An Elastic Load Balancer distributes incoming traffic across multiple targets (instances, containers, IPs). Classic Load Balancer (CLB) routes layer 4 traffic (older, being phased out); Application Load Balancer (ALB) routes layer 7 traffic by hostname, path, or hostname; Network Load Balancer (NLB) handles extreme performance and millions of requests per second. ALB is recommended for most new applications; NLB for gaming, IoT, non-HTTP protocols; CLB for legacy applications.

19. What is CloudFormation and what are its main benefits?

CloudFormation is Infrastructure as Code (IaC) service where you define AWS resources in YAML or JSON templates. Benefits include version control (treat infrastructure like code), consistency (reproduce identical environments), automation (create/update entire stacks in seconds), templates reuse (parameter-driven templates), and cost tracking (tag stacks by cost center). CloudFormation handles resource dependencies automatically and can rollback changes if deployment fails.

20. What is the AWS Shared Responsibility Model?

The Shared Responsibility Model divides security between AWS and customers. AWS is responsible for infrastructure security (physical datacenters, hardware, virtualization), region/AZ availability, and service-level security. Customers are responsible for identity/access management, network configuration, OS patching, application security, and data encryption. Understanding this model is critical: AWS secures the cloud infrastructure; you secure your application and data within the cloud.


Intermediate Level (21-40)

21. How does VPC Peering work and what are its limitations?

VPC Peering creates a network connection between two VPCs allowing them to communicate as if they were on the same network using private IPs. Both VPCs must explicitly accept the peering connection. Peering is non-transitive (if VPC A peers with B and B peers with C, A cannot reach C), doesn't support overlapping CIDR blocks, and requires route table entries on both sides. Use VPC Peering for connecting multiple applications or cross-account scenarios instead of VPN for lower latency and higher bandwidth.

22. What is the difference between ALB and NLB?

ALB (Application Load Balancer) operates at layer 7 (application), routes based on hostname, URL path, or HTTP headers, has lower performance (1 million requests/sec), and is suitable for microservices and web applications. NLB (Network Load Balancer) operates at layer 4 (transport), handles millions of requests/second, supports non-HTTP protocols (TCP, UDP), offers ultra-low latency, and is used for gaming, real-time applications, and extreme performance scenarios. ALB is more flexible for routing; NLB is more performant.

23. What is a NAT Gateway and how is it different from a NAT Instance?

A NAT Gateway is an AWS-managed service allowing private subnet instances to initiate outbound internet connections while blocking inbound connections. It's highly available, automatically scales, and requires an Elastic IP. A NAT Instance is an EC2 instance running NAT software; it's cheaper but requires manual management, patching, and failover configuration. NAT Gateways are recommended for production; NAT Instances are legacy. NAT Gateways incur per-hour and data transfer charges.

24. How does S3 Lifecycle work and what are common use cases?

S3 Lifecycle policies automatically transition objects between storage classes or delete them based on age or conditions. Common policies: transition to Standard-IA after 30 days, to Glacier after 90 days, to Deep Archive after 1 year, and delete after 7 years (useful for compliance retention). Lifecycle policies reduce storage costs significantly for data with predictable access patterns. They can also remove incomplete multipart uploads after 7 days to free space.

# Example lifecycle policy transition
aws s3api put-bucket-lifecycle-configuration --bucket my-bucket --lifecycle-configuration file://lifecycle.json

25. What are Lambda cold starts and how can you minimize them?

A Lambda cold start occurs when a function hasn't been invoked for a while and AWS must provision new infrastructure, causing additional latency (typically 100-500ms). Warm starts reuse existing infrastructure and execute much faster. Minimize cold starts by: provisioned concurrency (pre-warmed instances), keeping function bundle small (less code and dependencies), using Ephemeral Storage strategically, and regular invocations via CloudWatch events. Languages like Go and Python have faster cold starts than Java.

26. What is the difference between Reserved Instances and Spot Instances?

Reserved Instances offer 30-72% discounts for 1 or 3-year commitments; you pay upfront or partially, ideal for baseline/predictable workloads, and can be resold on the Savings Plan marketplace. Spot Instances offer 70-90% discounts for unused capacity but can be interrupted with 2-minute notice; ideal for fault-tolerant batch jobs, big data processing, and non-critical workloads. Typical strategies: use Reserved Instances for baseline capacity, Spot for scaling beyond baseline, On-Demand for temporary spikes. Savings Plans offer similar discounts to Reserved Instances with more flexibility.

27. What is the difference between Multi-AZ and Read Replicas?

Multi-AZ Deployments in RDS create a synchronous standby replica in a different AZ for high availability and automatic failover; you pay for two instances but get single DNS endpoint; failover is automatic but causes brief interruption. Read Replicas are asynchronous copies of a database in the same or different region used for scaling read traffic; failover is manual and requires application changes. Use Multi-AZ for high availability and disaster recovery; use Read Replicas for read scaling and reducing load on primary database.

28. What is the difference between ECS and EKS?

ECS (Elastic Container Service) is AWS's proprietary container orchestration platform managing Docker containers on EC2 or Fargate, tightly integrated with AWS services, simpler to learn with less overhead. EKS (Elastic Kubernetes Service) is AWS's managed Kubernetes service, industry-standard orchestration tool, more complex with richer features, portable across cloud providers. Choose ECS for simpler deployments and tight AWS integration; choose EKS for Kubernetes expertise, multi-cloud strategies, or complex microservices architectures.

29. What is CloudFormation Drift Detection and why is it important?

Drift Detection compares your CloudFormation stack's current state against the template definition, identifying resources that were manually changed after stack creation. This detects configuration drift caused by manual modifications, mistakes, or security changes that bypass IaC. Run drift detection periodically to maintain infrastructure consistency and ensure everything matches version control. Resolving drift involves either updating the template or using update-stack to restore desired state.

# Detect stack drift
aws cloudformation detect-stack-drift --stack-name my-stack

30. How does IAM Role assume mechanism work for cross-account access?

Cross-account access requires two IAM Roles: a Source Role in Account A with permissions to call sts:AssumeRole, and a Target Role in Account B with a trust policy allowing Account A's role to assume it. When Account A's role assumes Account B's role, temporary credentials are returned via STS (Security Token Service), allowing temporary access to Account B resources. This is used for: cross-account deployments, delegating administration, and multi-account strategies without sharing long-term credentials.

31. What are IAM Policies vs. IAM Roles?

IAM Policies are JSON documents defining what actions are allowed or denied on specific resources; they define permissions. IAM Roles are sets of permissions assigned to users, services, or cross-account principals; they are the containers for policies. Policies are reusable across multiple roles; roles are assumed by identities. A role might have multiple policies attached. Policies follow the principle of least privilege by granting only required permissions. Use managed policies for common use cases, custom policies for specific requirements.

32. What is the AWS Organizations service and what are SCPs?

AWS Organizations consolidates multiple AWS accounts under one management account, enabling centralized billing, consolidated CloudTrail logging, and simplified account management. Service Control Policies (SCPs) are organization-level policies that set boundaries on what actions members can perform (even if they have IAM permissions). SCPs are filters: they don't grant permissions but prevent certain actions across an organization. Use SCPs for preventing service usage, restricting regions, or enforcing compliance across all accounts.

33. How does DynamoDB Provisioned vs. On-Demand mode differ?

Provisioned Mode requires specifying read/write capacity units; you pay for reserved capacity regardless of usage; suitable for predictable workloads. On-Demand Mode auto-scales capacity and you pay per request; suitable for unpredictable or variable workloads; costs more at high scale but useful for new applications or burstable traffic. Use Provisioned for steady-state production workloads; use On-Demand for development, testing, or applications with unknown capacity requirements. You can switch modes once per 24 hours.

34. What is API Gateway and what are its main features?

API Gateway is a managed service for creating, publishing, and managing REST and WebSocket APIs. Features include: request/response transformation, authentication/authorization (IAM, Cognito, Lambda authorizers), rate limiting and throttling, CORS configuration, request/response logging, caching, versioning, and deployment to stages. API Gateway integrates with Lambda (serverless APIs), EC2, and other services. It supports OpenAPI documentation generation and can transform requests to match backend requirements.

35. What is the difference between RDS and DynamoDB?

RDS (Relational Database Service) is a managed SQL database supporting PostgreSQL, MySQL, MariaDB, Oracle, SQL Server; ideal for structured data with complex queries and transactions; you choose instance sizes and manage scaling. DynamoDB is a managed NoSQL database ideal for unstructured data, high-throughput scenarios, and key-value access patterns; scales automatically in on-demand mode; no complex queries or joins. Use RDS for traditional applications with relational data; use DynamoDB for mobile apps, IoT, real-time analytics with simple access patterns.

36. How does Lambda function versioning and aliases work?

Lambda Versions are immutable snapshots of your function code and configuration; each publish creates a new version ($LATEST is always mutable). Aliases are pointers to specific versions enabling traffic shifting; you can point an alias at multiple versions with defined weights for canary or blue/green deployments. Use aliases in production instead of $LATEST to ensure consistent function invocation. CloudWatch logs and metrics are aggregated at alias level, making troubleshooting easier.

# Create alias pointing to version 5
aws lambda create-alias --function-name my-function --name prod --function-version 5

# Update alias to shift 10% traffic to version 6
aws lambda update-alias --function-name my-function --name prod --function-version 5 --routing-config AdditionalVersionWeight={"6"=0.1}

37. What is ElastiCache and what are its use cases?

ElastiCache is a managed in-memory data store supporting Redis and Memcached, providing sub-millisecond latency. Memcached is simpler (key-value), horizontally scalable, good for object caching. Redis supports data structures (lists, sets, sorted sets), persistence, replication, pub/sub messaging. Use cases: session storage, real-time leaderboards, recommendation engines, rate limiting, pub/sub messaging, and caching database queries. ElastiCache with read replicas across AZs provides high availability and automatic failover.

38. How do you approach scaling a monolithic application on AWS?

Strategies include: add a load balancer to distribute traffic, implement caching (ElastiCache) for frequent queries, separate read/write databases with read replicas, decouple components with SQS/SNS messaging, containerize with ECS/EKS, break into microservices gradually, use CDN (CloudFront) for static assets, implement auto-scaling for compute, optimize database with indexing and query analysis. Scaling is typically database-limited first, then CPU, then memory. Monitor with CloudWatch and identify bottlenecks before scaling architecture.

39. What is AWS Well-Architected Framework?

The Well-Architected Framework defines best practices across five pillars: Operational Excellence (monitoring, automation, IaC), Security (least privilege, encryption, compliance), Reliability (failover, redundancy, disaster recovery), Performance Efficiency (appropriate services, caching, optimization), and Cost Optimization (removing waste, choosing right services, scaling). AWS provides Well-Architected Review process and tools to assess architectures against these pillars. Use it to identify improvements and make architectural decisions aligned with AWS best practices.

40. What is the difference between stateful and stateless applications on AWS?

Stateless applications store no session data in memory; multiple instances are interchangeable; easily horizontally scalable (each request is independent). Stateful applications maintain session state in memory; requests must return to same instance; harder to scale (stickiness required). Make applications stateless by moving session storage to: databases (RDS), cache (ElastiCache), or external session managers. Load balancers support session stickiness, but stateless design is better for auto-scaling and high availability.


Advanced Level (41-60+)

41. Design a highly available multi-region AWS architecture for an e-commerce platform.

A resilient multi-region architecture includes: Route 53 with health checks and failover/geolocation routing to direct users to nearest healthy region; Global Accelerator for improved availability; VPC in each region with multi-AZ subnets; ALB/NLB distributing traffic across instances in each AZ; Auto Scaling Groups for compute elasticity; RDS Multi-Master (cross-region read replicas with bidirectional replication) or DynamoDB Global Tables for databases; S3 Cross-Region Replication for static assets; CloudFront for global edge caching; Lambda in each region for serverless components; Secrets Manager replication for credentials; backup strategy with automated snapshots replicated cross-region. Failover from primary to secondary region happens automatically via health checks (seconds to minutes). Use CloudWatch dashboards across regions and implement centralized logging via CloudWatch Logs insights.

42. How does CAP theorem apply to DynamoDB and what are the tradeoffs?

The CAP theorem states distributed systems can guarantee at most two of: Consistency (all nodes see same data), Availability (system always responds), Partition tolerance (system works despite network failures). DynamoDB prioritizes Availability and Partition Tolerance (AP system). DynamoDB offers eventual consistency by default (reads might see stale data) with faster, cheaper reads; strong consistency available at higher latency/cost. For financial systems requiring consistency, choose strong consistency and design idempotent operations; for social feeds/recommendations, eventual consistency is acceptable. Use DynamoDB Streams with Lambda for cross-region synchronization.

43. How would you migrate a monolithic application to microservices on AWS?

Approach: Phase 1 - Assessment: identify service boundaries and dependencies, evaluate current performance bottlenecks, estimate effort. Phase 2 - Strangler Pattern: incrementally extract services behind API Gateway/ALB while keeping monolith running; route specific request types to new services. Phase 3 - Data Migration: decouple databases (eventually each microservice owns its database); use CDC (Change Data Capture) via DMS or Debezium for consistency. Phase 4 - Communication: implement synchronous APIs (REST/gRPC), asynchronous messaging (SQS/SNS), and service discovery. Phase 5 - Deployment: containerize services with ECS/EKS, implement CI/CD with CodePipeline, use feature flags to control rollouts. Phase 6 - Observability: centralized logging (CloudWatch Logs), distributed tracing (X-Ray), metrics dashboards. Plan for 6-12 month migration per large monolith; microservices introduce operational complexity (orchestration, debugging, testing) offset by scalability and team agility benefits.

44. Explain blue/green deployment with CodeDeploy and zero-downtime strategies.

Blue/Green Deployment maintains two identical production environments: Blue (current version) and Green (new version). Traffic shift: CodeDeploy gradually shifts traffic from Blue to Green (or instantly), if issues occur, rollback is as simple as shifting traffic back. Implementation: two ALB target groups with identical EC2 instances, CodeDeploy updates Green instances, CloudWatch alarms monitor Green for failures, Route 53 or ALB weighted rules control traffic percentage. Zero-downtime requires: pre-warmed instances (avoid cold starts), database migrations backward-compatible, graceful shutdown of old instances, no breaking API changes. CodeDeploy hooks (BeforeInstall, ApplicationStart, ValidateService) verify health. Blue/green avoids data loss and enables instant rollback unlike canary deployments which gradually shift traffic.

45. How would you implement auto-scaling for a Lambda-based microservices architecture?

Lambda auto-scales automatically without configuration, but you control: Concurrency (max simultaneous executions), Reserved Concurrency (guaranteed capacity for critical functions), Provisioned Concurrency (pre-warmed instances for low latency). Set Reserved Concurrency for baseline, Provisioned for bursty traffic reduction. Event Source Scaling: SQS (Lambda polls, auto-scales based on queue depth), API Gateway (scales automatically, no configuration needed), DynamoDB Streams (Lambda auto-scales, respects max batch window). Monitoring: CloudWatch metrics (Throttles, Duration, Errors), X-Ray for tracing, CloudWatch Alarms for anomalies. Cost optimization: use Compute Savings Plans, analyze invocation patterns, consider reserved capacity for predictable workloads. Challenges: cold starts (mitigate with Provisioned Concurrency), concurrency limits (request increases), cross-function coordination (use SQS/SNS instead of direct calls).

46. How would you design a data lake on AWS and what services are involved?

Data Lake Architecture: Ingestion (AWS Glue, Kinesis, DMS, DataSync), Storage (S3 with partitioned Parquet/ORC format), Cataloging (Glue Data Catalog for metadata), Processing (Spark via EMR, Lambda for simple transforms, Glue jobs for ETL), Analytics (Athena for SQL queries, Redshift for complex analytics), Visualization (QuickSight dashboards). Key patterns: zone architecture (raw zone for unprocessed data, processed zone for cleaned data, curated zone for specific use cases), partitioning by date/source/region for query performance, Glue Crawlers to auto-discover schema. Data Governance: tag data by sensitivity, use Lake Formation for access control, enable S3 versioning for audit trail, implement data quality checks in Glue jobs. Cost optimization: use S3 Intelligent-Tiering, Athena query optimization (partition pruning, columnar format), Redshift Spectrum for large datasets. Data lakes enable analytics at scale but require careful governance to avoid "data swamp" (unusable data).

47. Explain VPC Endpoint types and their use cases.

Gateway Endpoints (S3, DynamoDB) redirect traffic to AWS services without leaving AWS network; free, no additional configuration beyond route table entries. Interface Endpoints (1000+ services like RDS, ElastiCache, Kinesis) create private network interfaces (ENIs) within your VPC; traffic uses PrivateLink; compatible with on-premises connections via VPN. Use cases: S3 Gateway Endpoint prevents data egress charges for private instances, DynamoDB Gateway for private database access, RDS Interface Endpoint for private database connections, Secrets Manager Interface for retrieving secrets privately. Interface Endpoints enable zero-trust architectures where private instances never traverse internet. Combined with VPC Flow Logs and CloudWatch, endpoints provide observability while maintaining isolation.

48. How does AWS Organizations SCPs work and what are common use cases?

Service Control Policies (SCPs) are organization-level permission boundaries (filters, not grants). Evaluation: SCPs AND IAM permissions AND resource policies; if SCP denies, access is denied regardless of IAM permissions. Example SCP: prevent any EC2 launch in non-approved regions, prevent public S3 bucket creation, enforce encryption on EBS volumes. Common policies: prevent account deletion, enforce MFA, restrict service usage by department, enforce compliance tagging. Limitations: SCPs don't apply to AWS root account, don't grant permissions (only restrict), take 15-30 seconds to propagate. Implementation: start permissive (FullAWSAccess), gradually restrict via targeted policies; test on non-production OU first. SCPs enforce governance at organization scale without requiring every account to implement identical IAM policies.

49. How would you implement a real-time data streaming architecture with Kinesis and Lambda?

Architecture: Producers (application servers, IoT devices, logs) send data to Kinesis Data Streams; partition key determines shard assignment; Lambda consumers poll Kinesis shards (event source mapping auto-creates) and process records; results written to DynamoDB (real-time aggregations), S3 (analytics), or Firehose (automated buffering). Key design: choose partition key to distribute evenly across shards, set shard count based on throughput (1MB/sec per shard input), configure batch size and window for Lambda triggers. Exactly-once processing: write to DynamoDB with composite keys including sequence numbers, check idempotency before processing. Monitoring: CloudWatch metrics (GetRecords.IteratorAgeMilliseconds indicates processing lag), X-Ray for latency analysis, set alarms on Lambda duration. Alternatives: Kinesis Data Firehose (simpler, auto-scaling, less control), SNS/SQS (simple events), EventBridge (event routing with transformations). Kinesis excels at high-throughput ordered streams (financial transactions, clickstreams, sensor data).

50. How would you implement zero-downtime deployments with minimal data loss?

Strategies: Blue/Green (described earlier, fastest rollback), Canary (gradually shift 5% traffic, monitor metrics, progressively increase), Rolling (gradually replace instances, maintain connection draining), Database migrations: backward-compatible schema changes (add columns as nullable, migrate data asynchronously), use DMS for zero-downtime database migrations. Application level: implement graceful shutdown (drain in-flight requests, close DB connections), session affinity (sticky sessions if needed), versioned APIs (support old and new simultaneously). Infrastructure: implement connection draining on load balancers (wait for existing connections), use deregistration delay (30-300 seconds), health checks (remove unhealthy instances before directing traffic). Data consistency: implement idempotent operations, use transactions for critical updates, replicate data cross-region before cutover. Validation: use CloudWatch synthetic canaries to test endpoints, runbooks for rollback, on-call rotation during deployments. Complex deployments benefit from infrastructure as code and automated testing to minimize manual errors.

51. How would you design a disaster recovery strategy for a production database?

RTO/RPO Planning: RTO (Recovery Time Objective) = maximum acceptable downtime (1 hour, 4 hours); RPO (Recovery Point Objective) = maximum data loss acceptable (1 hour, 15 minutes). Strategies by RTO/RPO: Passive (RTO 24h, RPO 24h): automated daily snapshots stored in different region, manual restore when needed; cheapest. Active-Passive (RTO 1h, RPO 15min): read replica in different region, automatic promotion via automated DNS failover; moderate cost. Active-Active (RTO minutes, RPO near-zero): multi-master replication (DynamoDB Global Tables, Aurora Global Database), constant synchronization; highest cost. Implementation: use AWS Backup for automated snapshots across regions, RDS automated backups (35 days retention), cross-region read replicas, DMS for continuous replication of on-premises databases. Testing: monthly disaster recovery drills (restore from backup, validate data), document playbooks, identify gaps. Monitoring: track replication lag, backup completion, RTO/RPO metrics. Combine backup strategy with multi-AZ/multi-region deployment for comprehensive resilience.

52. How would you handle secrets rotation at scale across an organization?

Use AWS Secrets Manager for centralized secrets management. Features: automatic rotation (Lambda-triggered on schedule), versioning (multiple secret versions simultaneously), encryption with KMS, audit logging via CloudTrail, fine-grained IAM permissions. Implementation: (1) create rotation Lambda function (get current secret, generate new credential, update application/resource, validate), (2) enable automatic rotation (30-90 day intervals), (3) configure target resources (RDS database users, API keys, OAuth tokens). Organization-wide: use AWS Secrets Manager Resource Policy to allow cross-account access, replicate secrets to secondary regions, enforce secret naming conventions, tag secrets by application/environment. Challenges: application compatibility (not all apps support runtime credential refresh), coordinated rotation (rolling updates to distributed systems), emergency access (manual override procedures). Alternatives: Parameter Store for non-sensitive configuration, Systems Manager for patch management. Secrets Manager eliminates hardcoded credentials and reduces insider threats but requires application changes for live rotation support.

53. How would you optimize costs for a large AWS deployment?

Strategies: (1) Instance types: use burstable t3 for non-critical, reserved/savings plans for stable workloads, spot for batch jobs; consolidate small instances to fewer larger ones (better utilization). (2) Storage: use S3 Intelligent-Tiering for unknown access patterns, Glacier for archival, implement lifecycle policies (transition old data). (3) Database: choose right service (DynamoDB for scalability, RDS for relational), use read replicas instead of oversizing primary, enable query caching. (4) Network: use VPC endpoints to eliminate data transfer charges, consolidate resources in fewer AZs when possible, optimize data transfer (CloudFront for geo-distributed access). (5) Compute: auto-scale to avoid idle capacity, delete unused resources (EBS snapshots, Elastic IPs, data transfer), use Lambda for sporadic workloads. (6) Monitoring: use AWS Cost Explorer to identify top spending services, Compute Optimizer for right-sizing recommendations, Trusted Advisor for quick wins. (7) Governance: implement tagging strategy (tag all resources by cost center/project), set budgets with alerts, use SCPs to prevent expensive services in non-production. Typical cost savings: 40-60% through optimization, payback within 3-6 months of effort.

54. How would you implement compliance and governance with AWS?

Foundation: implement AWS Config to track resource configurations and compliance status, CloudTrail for API audit logs, CloudWatch for operational monitoring. Access Control: use IAM with least privilege, MFA enforcement, Temporary credentials via STS. Data Protection: encryption at rest (S3 server-side encryption, RDS encryption, EBS encryption), encryption in transit (TLS/HTTPS), KMS for key management. Compliance Frameworks: AWS provides compliance certifications (SOC 2, ISO 27001, PCI-DSS, HIPAA, FedRAMP), use AWS Artifact to access compliance documents, implement security baselines (CIS Benchmarks). Organization Level: use AWS Organizations with SCPs to enforce policies, AWS Security Hub for centralized security findings. Monitoring: GuardDuty for threat detection, VPC Flow Logs for network monitoring, S3 Access Logs for bucket auditing. Automation: implement Lambda for remediation (disable public access, enforce encryption), Systems Manager for patch management. Documentation: maintain runbooks, incident response procedures, architecture diagrams. Compliance is continuous: regular audits, penetration testing, security training, and staying updated with AWS security advisories.

55. How would you architect a real-time dashboard for monitoring microservices?

Data Collection: CloudWatch for resource metrics, X-Ray for distributed tracing (correlates requests across services), application logs sent to CloudWatch Logs, custom metrics via CloudWatch API. Processing: CloudWatch Insights for log analysis, CloudWatch Anomaly Detector for automatic baseline learning, Lambda for metric aggregation. Visualization: CloudWatch Dashboards (free, AWS-native), QuickSight (self-service BI, embedded dashboards), Grafana (open-source, fine control). Architecture: applications emit metrics/logs, aggregated by CloudWatch/Kinesis, visualized by dashboard, alarms trigger SNS notifications. Advanced: ServiceLens in X-Ray for service map visualization, Lambda Insights for serverless observability, Container Insights for ECS/EKS clusters. Best Practices: alert on business metrics (conversion rate, latency p99) not just infrastructure (CPU); correlate metrics with deployments (via CloudWatch Events), maintain alerting fatigue via alert tuning. Challenges: high cardinality metrics (millions of unique values) cause cost/performance issues; design metrics carefully (aggregate or sample if needed).

56. How would you implement a secure, scalable API on AWS?

API Gateway: use REST API (simpler) or HTTP API (lower latency/cost), enable WAF for web attack protection, implement rate limiting and request throttling. Authentication: IAM for AWS service-to-service, Cognito for user authentication (built-in MFA, social login), Lambda authorizers for custom logic. Backend: Lambda for serverless (auto-scaling, pay-per-use), ECS/EKS for containers (more control), RDS/DynamoDB for data (multi-AZ for HA). Security: VPC endpoints for private backends, encryption in transit (TLS 1.2+), encryption at rest (KMS), API keys for rate limiting, CORS properly configured. Monitoring: X-Ray for tracing, CloudWatch for metrics/logs, WAF logs for security analysis. Versioning/Deployment: use stages (dev/test/prod), canary deployments via weighted routing, automated testing via CodePipeline. Documentation: OpenAPI specs (auto-generated from API Gateway), SDK generation. Cost optimization: cache responses (CloudFront or API Gateway caching), batch operations, use appropriate database (DynamoDB for read-heavy, RDS for complex queries). Production APIs require: authentication, rate limiting, monitoring, documentation, disaster recovery plan.

57. How would you design a cost-effective logging strategy for distributed systems?

Log aggregation: stream logs to CloudWatch Logs (centralized, integrated with AWS), S3 (long-term archival, cost-effective), or third-party (DataDog, Splunk if required). Log levels: emit ERROR/WARNING in production, enable DEBUG in development only (reduces volume). Sampling: log 100% of errors, 10% of successes (statistical significance with cost reduction). Retention: delete logs older than 90 days automatically (CloudWatch Logs retention policy), archive to Glacier for compliance. Structured logging: use JSON format with context (request ID, user ID, timestamp), enables CloudWatch Insights queries. Cost reduction: (1) agent-side filtering (CloudWatch agent pre-filters logs before shipping), (2) VPC endpoints for logs (no NAT gateway charges), (3) batch writes (higher throughput, fewer API calls). Querying: CloudWatch Insights for ad-hoc analysis, CloudWatch Logs Insights (unlimited queries after 15 minute lag), export to Athena for complex analysis, QuickSight for dashboards. Challenges: log explosion from noisy services (implement circuit breakers), structured logging adoption (requires code changes), correlating logs across services (use trace IDs). CloudWatch typically costs 50 cents per GB ingested; third-party solutions are more expensive but offer richer features.

58. How would you approach migrating legacy applications to containers?

Assessment Phase: identify application dependencies (databases, file systems, OS requirements), determine container readiness (stateless preferred, persistent data requires volumes), estimate effort. Containerization: create Dockerfile (multi-stage builds to minimize image size), use lightweight base images (Alpine, distroless), include health checks, configure logging to stdout. Registry: push images to ECR (AWS-native, integrated with IAM), enable image scanning for vulnerabilities. Orchestration: ECS for Fargate (simplest, AWS-managed, limited control), ECS on EC2 (more control, still AWS-managed), EKS (Kubernetes, multi-cloud portable, more complex). Deployment: use Blue/Green or rolling deployments for zero downtime, implement health checks (readiness/liveness probes), configure auto-scaling. Data Migration: containerize databases separately, use DMS for zero-downtime migration, implement application-level retry logic (databases down during migration). Networking: use Service Discovery (Route 53 auto naming), implement service mesh (Istio) for traffic management only if needed. Challenges: persistent storage (EBS volumes for databases), logging/monitoring (send to CloudWatch, X-Ray), compliance (container image vulnerability scanning). Expect 2-3 month migration per moderate application; benefit from isolation, consistent deployments, and scaling.

59. How would you implement a multi-tenancy architecture on AWS?

Isolation Levels: Dedicated accounts per tenant (highest isolation, highest cost, recommended for large enterprises), VPC per tenant (strong isolation, moderate cost), subnet per tenant (network isolation, shared compute), database isolation via row-level security (shared infrastructure, cheapest, requires careful security). Data Isolation: store tenant ID in every record, implement row-level security in database (PostgreSQL RLS, application-enforced), separate S3 prefixes per tenant (with IAM policies enforcing prefix access). Identity/Authentication: Cognito user pools (separate pool per tenant or shared with tenant identifier in custom claims), SAML/OIDC for enterprise customers with SSO. Billing/Metering: tag resources per tenant, use Cost Allocation Tags, implement metering (API call count, storage usage, compute hours) for chargeback. Deployment: shared infrastructure (one deployment) for multi-tenant (cost-effective) vs. separate deployments (complex, isolated). Monitoring: tenant-specific dashboards in CloudWatch, CloudTrail logs tagged by tenant, alerting for tenant-specific anomalies. Challenges: noisy neighbor problem (one tenant's high usage impacts others; solve with reserved capacity or dedicated infrastructure), data isolation compliance (ensure encryption keys per tenant), performance variability. Multi-tenancy benefits from shared infrastructure costs but requires careful security design to prevent cross-tenant data leaks.

60. How would you design a self-healing infrastructure with AWS?

Health Monitoring: CloudWatch Alarms on metrics (CPU, disk, network), ECS/EKS health checks (restart failed containers), RDS automated failover (Multi-AZ), Route 53 health checks (automatic DNS failover). Auto-Remediation: Lambda triggered by SNS alerts (restart service, scale cluster, invoke runbook), Systems Manager Automation for complex runbooks, Auto Scaling Groups (replace unhealthy instances), CodeDeploy for emergency patching. Self-Healing Patterns: (1) circuit breakers (fail fast, prevent cascading failures), (2) exponential backoff (retry with increasing delays), (3) bulkheads (isolate failure domains), (4) graceful degradation (degrade service rather than fail completely). Chaos Engineering: regularly test failure scenarios (AWS Fault Injection Simulator), break things intentionally to learn resilience, document lessons learned. Automation Examples: (1) Lambda detects high CPU alarm, triggers Auto Scaling Group increase, (2) ECS detects crashed container, automatically restarts, (3) RDS detects primary failure, promotes read replica in seconds. Limits: self-healing handles transient failures well (network blips, resource exhaustion) but cannot fix application bugs (require code fixes) or data corruption (require backups). Design systems to be resilient (multiple replicas, redundant paths), then layer self-healing automation on top. Requires robust testing, runbooks, and on-call procedures.


Tips for AWS Interviews

Technical Preparation

  1. Master the fundamentals: understand VPCs, IAM, EC2, S3, RDS, and Lambda deeply before advanced topics
  2. Practice CLI commands: be comfortable running aws commands without consulting documentation
  3. Understand tradeoffs: every architectural decision has cost, complexity, performance, and security implications; discuss them
  4. Know AWS services deeply: for each service, understand: when to use it, cost model, scaling characteristics, integration with other services
  5. Study real AWS architectures: review AWS reference architectures, Well-Architected reviews, and case studies from similar industries

Communication Tips

  1. Clarify requirements first: ask about scalability needs, availability requirements, data sensitivity, budget constraints before proposing solutions
  2. Draw diagrams: use text-based diagrams or sketches to visualize architecture, reference, data flows
  3. Discuss tradeoffs explicitly: "Option A is cheaper but has higher latency, Option B is more expensive but better performance for real-time scenarios"
  4. Provide estimates: CPU, memory, storage, bandwidth requirements; calculate costs roughly
  5. Mention monitoring/observability: every architecture needs CloudWatch, alarms, and incident response procedures

Common Pitfalls to Avoid

  1. Single point of failure: always mention multi-AZ, failover, redundancy
  2. Over-engineering: avoid proposing overly complex solutions for simple problems (use managed services)
  3. Ignoring costs: discuss cost implications of design choices, propose optimizations
  4. Forgetting security: always mention encryption, IAM, network isolation, audit logging
  5. No monitoring plan: how will you know if the system is healthy? What metrics matter?

Behavioral Tips

  1. Be honest about gaps: if you haven't used a service, say so, but discuss how you'd approach learning it
  2. Ask clarifying questions: don't assume requirements; confirm before proposing solutions
  3. Listen actively: the interviewer may drop hints about preferred solutions
  4. Think out loud: show your reasoning process, discuss alternatives as you talk
  5. Be prepared to pivot: if interviewer suggests a different approach, adapt your solution

Additional Preparation Resources

  • AWS Well-Architected Framework (whitepaper)
  • AWS Architecture Reference Architectures (AWS documentation)
  • AWS Certified Solutions Architect Associate exam study guide
  • Real-world case studies from AWS blog
  • Practice designing systems for companies you admire (Netflix's microservices, Airbnb's scale, Uber's real-time systems)

Good luck with your AWS interviews! Remember: the interviewer wants to understand your thinking process, not whether you remember every API parameter.