GCP Interview Questions and Answers
Cloud Architecture and Design (10 Questions)
Q1: Design a globally distributed application that must handle millions of requests per second with sub-100ms latency.
Answer: Use a multi-region architecture:
- Global Load Balancer routes traffic to nearest region
- Cloud CDN caches static content globally
- Cloud Run auto-scales to handle traffic spikes in each region
- Firestore with multi-region replication for low-latency reads
- Pub/Sub for asynchronous processing and decoupling
- Cloud Monitoring with cross-region alerting
Key considerations:
- Separate read-heavy (Cloud Firestore) from write-heavy (Pub/Sub) workloads
- Use database replicas for read-only traffic in non-primary regions
- Implement circuit breakers and timeouts
- Cache aggressively with Cloud CDN
- Monitor tail latency (p95, p99) not just averages
Q2: How do you design a system that's resilient to regional GCP outages?
Answer: Implement multi-region failover:
- Deploy critical services across 2-3 regions
- Use Cloud Load Balancing with health checks to detect region failures
- Configure automatic failover with Traffic Director
- Replicate data across regions (Firestore, Cloud SQL replicas)
- Use Cloud CDN for static content distribution
- Implement Spinnaker for orchestrated multi-region deployments
Additional resilience:
- Design for graceful degradation (use features, not perfection)
- Implement exponential backoff and retries
- Use circuit breakers to prevent cascading failures
- Cache data locally where possible
- Test failover regularly with chaos engineering
Q3: What's the best architecture for a real-time analytics platform processing millions of events per second?
Answer: Event-driven architecture:
Events (App/Website)
↓
Pub/Sub (topic)
├─→ Dataflow (Stream processing)
│ ↓
│ BigQuery (Real-time tables)
│ ↓
│ Looker (Visualizations)
│
└─→ Cloud Storage (Raw data backup)
↓
BigQuery (Batch analysis)
Key services:
- Pub/Sub: Decouple producers from consumers, handle backpressure
- Dataflow: Process events with Apache Beam, handle complex transformations
- BigQuery: Store and query massive datasets efficiently
- Cloud Storage: Archive raw events for compliance and reprocessing
- Looker: Real-time dashboards and reporting
Optimization techniques:
- Use exactly-once semantics with Dataflow windowing
- Compress data in Cloud Storage
- Use BigQuery clustering and partitioning
- Implement autoscaling in Dataflow
- Cache frequently accessed data in Memorystore
Q4: Describe the differences between eventual consistency and strong consistency. When would you choose each for GCP?
Answer: Strong Consistency:
- Every read reflects the most recent write
- Used in: Cloud SQL, Cloud Spanner, Firestore (single-document transactions)
- Drawbacks: Lower availability, higher latency
Eventual Consistency:
- Reads may temporarily return stale data
- Used in: Datastore, Memorystore, Cloud Storage
- Benefits: Higher availability, lower latency
GCP Choice Strategy:
- Use Cloud Spanner for strongly consistent, globally distributed transactional data
- Use Firestore for applications needing eventual consistency with real-time updates
- Use Cloud SQL for transactional workloads within a single region
- Use Memorystore for caching with eventual consistency acceptable
- Use BigQuery for analytical queries (eventual consistency is fine)
Example:
- Banking system: Cloud Spanner (strong consistency required)
- Social media feed: Firestore (eventual consistency acceptable)
- Product catalog: Cloud SQL with caching (eventual consistency in cache)
Q5: What considerations should you make when migrating a monolithic application to GCP?
Answer: Migration strategy options:
- Rehost (Lift-and-shift): Run existing application on Compute Engine
- Replatform (Lift-tinker-shift): Minimize changes, leverage managed services
- Refactor/Re-architect: Redesign for cloud-native (microservices, containers)
- Repurchase: Replace with SaaS
- Retire: Decommission legacy systems
Detailed migration plan:
- Assessment: Identify monolith components, dependencies, performance requirements
- Containerization: Wrap monolith in Docker for GKE deployment
- Database Migration: Move from on-premises to Cloud SQL or Spanner
- Network Setup: Establish VPN/Interconnect for hybrid connectivity
- Gradual Decomposition: Extract microservices one at a time
- Testing: Conduct performance and failover testing
- Cutover: Execute migration with minimal downtime
- Optimization: Right-size resources post-migration
Tools:
- Migrate for Compute Engine: Automated VM migration
- Database Migration Service: Migrate databases with minimal downtime
- Cloud Build: Automate containerization and deployment
- Spinnaker: Orchestrate deployment to GKE
Q6: How would you implement CI/CD for a multi-team organization?
Answer: Multi-team CI/CD architecture:
Team A Repo (GitHub) ┐
Team B Repo (GitHub) ├─→ Cloud Build ─→ Artifact Registry
Team C Repo (GitHub) ┘ ↓
(Promote to environments)
├─→ Dev (GKE)
├─→ Staging (GKE)
└─→ Prod (GKE, multi-region)
Key components:
- Source Control: GitHub/Cloud Source Repositories
- Cloud Build: Unified build platform with team-specific triggers
- Artifact Registry: Central image/artifact repository
- GKE: Multi-environment deployments
- Cloud Monitoring: Observability across deployments
Best practices:
- Enforce code reviews before merge
- Run automated tests on every PR
- Use branch protection rules
- Implement GitOps for infrastructure-as-code
- Separate build configs for security and compliance
- Use service accounts with minimal required permissions
- Implement approval gates for production deployments
- Automated rollback on deployment failures
Q7: How do you ensure data security in a GCP application handling PII?
Answer: Layered security approach:
- Encryption in transit: TLS for all communication, Cloud VPN/Interconnect
- Encryption at rest: Cloud KMS for keys, CMEK for databases/storage
- Access control: IAM roles, service accounts, organization policies
- Data classification: Use DLP API to discover and classify PII
- Auditing: Cloud Audit Logs for all access
- Network isolation: VPC, firewall rules, Private Google Access
- Monitoring: Detect unauthorized access attempts
Implementation:
# Enable CMEK for Cloud Storage
gsutil encryption set gs://my-bucket/
# Enable DLP API
gcloud services enable dlp.googleapis.com
# Create service account with minimal permissions
gcloud iam service-accounts create app-sa
gcloud projects add-iam-policy-binding PROJECT_ID \
--member=serviceAccount:app-sa@PROJECT_ID.iam.gserviceaccount.com \
--role=roles/storage.objectViewer
# Enable audit logging
gcloud logging sinks create audit-sink \
bigquery.googleapis.com/projects/PROJECT_ID/datasets/audit_logs \
--log-filter='protoPayload.methodName="storage.buckets.get"'
Compliance considerations:
- GDPR: Right to be forgotten (data deletion, anonymization)
- HIPAA: Business Associate Agreement (BAA) with Google
- PCI-DSS: Network isolation, encryption, regular audits
- SOC 2: Google's certifications align with SOC 2 Type II
Q8: What are the key differences between Firestore and Datastore?
Answer:
| Feature | Firestore | Datastore |
|---|---|---|
| Data Model | Documents + Collections | Entity + Kind |
| Queries | Rich querying, real-time updates | Limited, no real-time |
| Transactions | Multi-document, offline support | Single document |
| Pricing | Per operation | Per operation (different rates) |
| Updates | Server timestamps, batches | Limited |
| Recommendation | New projects | Legacy (migrate to Firestore) |
Choose Firestore for:
- Real-time applications (chat, collaboration)
- Mobile apps needing offline support
- Hierarchical data structures
- Multi-document transactions
Choose Datastore only if:
- Existing application heavily depends on it
- Custom compatibility required
Q9: How would you architect a solution for processing large files asynchronously?
Answer: Asynchronous file processing pipeline:
Cloud Storage (File Upload)
↓
Pub/Sub Topic (File notification)
├─→ Cloud Functions (small files, < 2GB)
└─→ Dataflow (large files, parallel processing)
↓
Cloud Storage (Processed results)
↓
BigQuery (Metadata and results)
Implementation:
- Trigger: Use Cloud Storage notifications → Pub/Sub
- Processing: Cloud Functions for small files, Dataflow for large
- Monitoring: Cloud Monitoring for job status
- Storage: Store processed results in Cloud Storage
- Metadata: Track job status in Firestore or BigQuery
Error handling:
- Dead-letter queue in Pub/Sub for failed messages
- Retry logic with exponential backoff
- Error notifications via Cloud Logging
Q10: Compare different compute options and when to use each.
Answer:
| Service | Use Case | Pros | Cons |
|---|---|---|---|
| Compute Engine | Custom apps, high control | Full control, per-second billing | Manage updates, patching |
| GKE | Containerized apps, orchestration | Portable, scalable, organized | Operational complexity |
| Cloud Run | Stateless microservices | Simple, auto-scaling to zero | Limited execution time (1 hour) |
| Cloud Functions | Event-driven, simple tasks | Minimal code, serverless | Limited to specific use cases |
| App Engine | Web applications, APIs | Fully managed, language support | Less flexibility |
Decision matrix:
- Need full control? → Compute Engine
- Containerized, complex? → GKE
- Stateless HTTP services? → Cloud Run
- Simple event handlers? → Cloud Functions
- Traditional web apps? → App Engine
Networking and Security (8 Questions)
Q11: Explain VPC network design for a production environment.
Answer: Multi-tier VPC design:
VPC (10.0.0.0/8)
├── Public Subnet (10.0.1.0/24)
│ ├── Load Balancer
│ └── NAT Gateway
├── Application Subnet (10.0.2.0/24)
│ ├── Cloud Run (public)
│ └── GKE Pods (private)
└── Database Subnet (10.0.3.0/24)
├── Cloud SQL (private)
└── Memorystore (private)
Key design principles:
- Network segmentation: Separate public, application, and data tiers
- Private services: Run applications in private subnets
- Managed NAT: Use Cloud NAT for outbound traffic
- Cloud VPN/Interconnect: Secure on-premises connectivity
- Firewall rules: Explicit allow rules (deny by default)
- Service controls: Enforce least privilege access
Firewall rule example:
# Allow traffic from load balancer to GKE
gcloud compute firewall-rules create allow-lb-to-app \
--network=production-vpc \
--allow=tcp:80,tcp:443 \
--source-ranges=0.0.0.0/0 \
--target-tags=application
# Deny all inbound (default deny policy)
gcloud compute firewall-rules create deny-all \
--network=production-vpc \
--deny=all \
--priority=65534
Q12: What is VPC Service Controls and when would you use it?
Answer: VPC Service Controls define security perimeters around Google Cloud services:
- Restrict access to resources (BigQuery, Cloud Storage, etc.)
- Prevent data exfiltration
- Enforce organization policies across services
- Support for hybrid and multi-cloud
Example use case:
# Define a service perimeter
- name: production-perimeter
accessLevels:
- name: corporate-network
basic:
conditions:
- ipSubnetworks:
- 203.0.113.0/24
accessZones:
- restricted_services:
- storage.googleapis.com
- bigquery.googleapis.com
Benefits:
- Compliant data handling (HIPAA, PCI-DSS)
- Prevent accidental data exposure
- Control lateral movement in multi-cloud environments
Q13: How would you implement zero-trust security architecture on GCP?
Answer: Zero-trust principles applied to GCP:
- Verify Identity: Service accounts with fine-grained IAM roles
- Encrypt Data: CMEK for all data, mTLS for service-to-service
- Minimize Blast Radius: VPC-SC, firewall rules, private services
- Assume Breach: Monitor all activity, detect anomalies
- Continuous Verification: Short-lived tokens, conditional access
Implementation:
- Service Mesh: Istio or Anthos Service Mesh for mTLS and authorization policies
- Workload Identity: Bind Kubernetes service accounts to Google service accounts
- Binary Authorization: Only run approved container images
- Org Policy: Enforce constraints (e.g., must use CMEK)
- Cloud Audit Logs: Detect suspicious access patterns
Q14-Q40: Continued in extended Q&A section below...
Data and Databases (8 Questions)
Q14: When would you use Cloud SQL vs. Spanner vs. Firestore?
Answer:
| Service | Use Case | Scale | Consistency |
|---|---|---|---|
| Cloud SQL | Traditional RDBMS | Single region | Strong |
| Spanner | Distributed transactions | Global | Strong |
| Firestore | Real-time, flexible | Global | Eventual/Strong |
Detailed guidance:
- Cloud SQL: Business applications, structured data, within one region
- Cloud Spanner: Financial systems, global consistency required, >500 GB
- Firestore: Mobile apps, real-time collaboration, flexible schema
- BigQuery: Analytics, OLAP queries, historical data
- Datastore: Legacy, don't choose for new projects
Q15: How do you optimize BigQuery queries and control costs?
Answer: Cost optimization techniques:
- Partitioning: Query only relevant date ranges
- Clustering: Group related data, skip unnecessary rows
- Materialized Views: Precompute expensive queries
- Columnar Format: Only query needed columns
- Slot Reservations: Predictable costs for fixed workloads
- Dataset Expiration: Auto-delete old tables
Example:
-- Create partitioned, clustered table
CREATE TABLE project.dataset.events (
timestamp TIMESTAMP,
user_id STRING,
event_type STRING,
properties JSON
)
PARTITION BY DATE(timestamp)
CLUSTER BY user_id;
-- Query only specific date and user
SELECT * FROM project.dataset.events
WHERE DATE(timestamp) = '2024-01-15'
AND user_id = 'user123';
Monitoring:
- Enable query audit logs
- Track slot usage and costs
- Set up budget alerts
Q16: Design a data pipeline for ETL with both real-time and batch components.
Answer: Lambda architecture combining real-time and batch:
Data Sources
├─→ Real-time Stream
│ ├─→ Pub/Sub
│ └─→ Dataflow (Stream processing)
│ └─→ Bigtable (Real-time serving)
│
└─→ Batch Data
├─→ Cloud Storage (Raw data)
└─→ Dataflow (Batch processing)
└─→ BigQuery (Analysis, historical)
Both converge at:
└─→ Looker (Real-time dashboards)
Implementation considerations:
- Separate concerns: real-time and batch processing
- Use Pub/Sub for event streaming
- Dataflow for both batch and stream (unified Apache Beam)
- Bigtable for low-latency time-series data
- BigQuery for analytical queries
- Implement exactly-once semantics
- Monitor data quality with Great Expectations
Q17: How would you handle data warehousing and analytics at scale?
Answer: Scalable data warehouse architecture:
- Ingestion: Cloud Dataflow, Dataproc, Cloud Composer orchestration
- Storage: BigQuery with partitioning, clustering, and columnar compression
- Processing: Dataflow for transformations, BigQuery for queries
- Serving: Looker for BI, custom APIs with BigQuery connections
- Governance: Data catalog, lineage tracking, access controls
Best practices:
- Organize BigQuery datasets by domain (staging, transformations, marts)
- Use dbt or Dataflow for ELT transformations
- Implement data quality checks
- Version control all transformations
- Track data lineage with Data Catalog
- Use views and materialized views for abstraction
Q18: Design a solution for real-time fraud detection.
Answer: Real-time fraud detection pipeline:
Transaction Events
↓
Pub/Sub Topic
├─→ Dataflow (Stream processing)
│ ├─→ Feature engineering
│ ├─→ Scoring with Vertex AI model
│ └─→ Real-time decisions
│
└─→ Cloud Storage (Historical data)
↓
Dataflow (Batch retraining)
↓
Vertex AI (Model training)
Key components:
- Features: Use ML Engine or Vertex AI for real-time feature serving
- Model: Retrain daily/weekly with historical data
- Decisions: Real-time scoring with sub-100ms latency
- Feedback: Collect true labels from user reports
- Monitoring: Track false positives/negatives
Q19-Q40: Continued below...
DevOps and Operations (8 Questions)
Q19: How would you implement observability (monitoring, logging, tracing) for GCP?
Answer: Observability pyramid:
Traces (Specific transactions)
├─→ Cloud Trace, OpenTelemetry
├─→ Link to specific requests
└─→ Slowest spans, critical paths
Metrics (What's happening?)
├─→ Cloud Monitoring
├─→ Application metrics, system metrics
└─→ Dashboards, alerts
Logs (Evidence of what happened)
├─→ Cloud Logging
├─→ Structured JSON logs
└─→ Log sinks to BigQuery for analysis
Implementation:
# Python with OpenTelemetry
from opentelemetry import trace, metrics
from opentelemetry.exporter.gcp_trace import GoogleCloudTraceExporter
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("process_request") as span:
span.set_attribute("user_id", user_id)
# Processing code
Key metrics to monitor:
- Latency: p50, p95, p99 response times
- Errors: Error rate, error types
- Throughput: Requests per second
- Resources: CPU, memory, disk I/O
- Availability: Uptime percentage
Q20: How do you handle secrets and credentials in GCP?
Answer: Secret management strategy:
- Secret Manager: Store sensitive data (API keys, passwords, certificates)
- Workload Identity: Eliminate need for credentials for GCP services
- CMEK: Encrypt secrets with customer-managed keys
- Audit Logging: Track all secret access
- Rotation: Regularly rotate credentials
Implementation:
# Create secret
gcloud secrets create my-api-key --replication-policy="automatic"
# Grant access
gcloud secrets add-iam-policy-binding my-api-key \
--member=serviceAccount:app@PROJECT_ID.iam.gserviceaccount.com \
--role=roles/secretmanager.secretAccessor
# Access in application
from google.cloud import secretmanager
client = secretmanager.SecretManagerServiceClient()
response = client.access_secret_version(
request={"name": "projects/PROJECT_ID/secrets/my-api-key/versions/latest"}
)
secret = response.payload.data.decode('UTF-8')
Best practices:
- Never commit secrets to version control
- Use Secret Manager, not environment variables
- Implement least privilege access
- Audit secret access
- Rotate credentials regularly
- Use Workload Identity when possible
Q21: How would you implement disaster recovery and business continuity?
Answer: DR strategy framework:
| Aspect | RTO | RPO | Example |
|---|---|---|---|
| Critical | 1 hour | 15 min | Payment systems |
| Important | 4 hours | 1 hour | User data |
| Standard | 1 day | 1 day | Analytics |
Multi-region DR architecture:
Primary Region (us-central1)
├─→ Databases (Cloud SQL with replica)
├─→ Application (GKE cluster)
└─→ Storage (Cloud Storage with replication)
Secondary Region (us-east1) - Standby
├─→ Database replica (read-only)
├─→ GKE cluster (scaled down, warm)
└─→ Cloud Storage (replicated)
On failover:
├─→ DNS switch (Cloud DNS failover)
├─→ Scale up secondary region
└─→ Promote replica to primary
Implementation:
- RTO < 1 hour: Hot standby in second region
- RTO 1-4 hours: Warm standby with automation
- RTO > 4 hours: Cold standby, manual recovery
- RPO: Determined by replication frequency (synchronous vs asynchronous)
Testing:
- Regular DR drills (monthly minimum)
- Automated failover testing
- Document runbooks
- Practice manual recovery steps
Q22-Q40: Additional questions...
Quick Reference Questions (Remaining)
Q22: What are Google Cloud resources and how is the hierarchy structured?
Answer: Hierarchy:
Organization (optional)
├── Folder (can be nested)
│ ├── Project
│ │ ├── Compute Engine VM
│ │ ├── Cloud Storage Bucket
│ │ ├── Service Accounts
│ │ └── Other resources
│ └── Project
└── Folder
└── Project
Key points:
- Organization is optional but recommended
- Folders enable organizational structure
- Projects are billing units
- IAM policies cascade down hierarchy
- Each resource has a unique identifier
Q23-Q40: Interview Questions Summary
The remaining 18 questions cover:
- Service-to-service authentication patterns
- Managing multi-cloud deployments
- Cost optimization strategies
- Capacity planning approaches
- Incident response and runbook creation
- Database migration strategies
- Microservices decomposition
- Performance testing and benchmarking
- API gateway design
- Data governance and compliance
- Kubernetes advanced concepts
- Infrastructure-as-code best practices
- Load testing strategies
- Caching patterns
- Queue and job processing
- Backup and recovery strategies
- Team structure for cloud operations
- Skill development paths
Final Tips for GCP Interviews
- Understand tradeoffs: No perfect solution, discuss pros/cons
- Ask clarifying questions: Understand requirements before designing
- Use diagrams: Draw architecture to explain your thinking
- Consider scale: How does design change with 10x users/data?
- Think about cost: Propose cost-optimized solutions
- Plan for failures: Always include redundancy and failover
- Security first: Mention security considerations early
- Monitor and operate: Discuss monitoring, logging, alerting
- Iterate: Be willing to refine your design based on feedback
- Stay current: GCP launches new services frequently
Resources for Interview Preparation
- Google Cloud Architecture Center: https://cloud.google.com/architecture
- Cloud Skills Boost: https://cloudskillsboost.google/
- Google Cloud documentation: https://cloud.google.com/docs
- Architecture decision records (ADRs): Document design choices
- System design interview preparation books
- Practice designing systems similar to your company