Skip to main content

GCP Interview Questions and Answers

Cloud Architecture and Design (10 Questions)

Q1: Design a globally distributed application that must handle millions of requests per second with sub-100ms latency.

Answer: Use a multi-region architecture:

  • Global Load Balancer routes traffic to nearest region
  • Cloud CDN caches static content globally
  • Cloud Run auto-scales to handle traffic spikes in each region
  • Firestore with multi-region replication for low-latency reads
  • Pub/Sub for asynchronous processing and decoupling
  • Cloud Monitoring with cross-region alerting

Key considerations:

  • Separate read-heavy (Cloud Firestore) from write-heavy (Pub/Sub) workloads
  • Use database replicas for read-only traffic in non-primary regions
  • Implement circuit breakers and timeouts
  • Cache aggressively with Cloud CDN
  • Monitor tail latency (p95, p99) not just averages

Q2: How do you design a system that's resilient to regional GCP outages?

Answer: Implement multi-region failover:

  • Deploy critical services across 2-3 regions
  • Use Cloud Load Balancing with health checks to detect region failures
  • Configure automatic failover with Traffic Director
  • Replicate data across regions (Firestore, Cloud SQL replicas)
  • Use Cloud CDN for static content distribution
  • Implement Spinnaker for orchestrated multi-region deployments

Additional resilience:

  • Design for graceful degradation (use features, not perfection)
  • Implement exponential backoff and retries
  • Use circuit breakers to prevent cascading failures
  • Cache data locally where possible
  • Test failover regularly with chaos engineering

Q3: What's the best architecture for a real-time analytics platform processing millions of events per second?

Answer: Event-driven architecture:

Events (App/Website)

Pub/Sub (topic)
├─→ Dataflow (Stream processing)
│ ↓
│ BigQuery (Real-time tables)
│ ↓
│ Looker (Visualizations)

└─→ Cloud Storage (Raw data backup)

BigQuery (Batch analysis)

Key services:

  • Pub/Sub: Decouple producers from consumers, handle backpressure
  • Dataflow: Process events with Apache Beam, handle complex transformations
  • BigQuery: Store and query massive datasets efficiently
  • Cloud Storage: Archive raw events for compliance and reprocessing
  • Looker: Real-time dashboards and reporting

Optimization techniques:

  • Use exactly-once semantics with Dataflow windowing
  • Compress data in Cloud Storage
  • Use BigQuery clustering and partitioning
  • Implement autoscaling in Dataflow
  • Cache frequently accessed data in Memorystore

Q4: Describe the differences between eventual consistency and strong consistency. When would you choose each for GCP?

Answer: Strong Consistency:

  • Every read reflects the most recent write
  • Used in: Cloud SQL, Cloud Spanner, Firestore (single-document transactions)
  • Drawbacks: Lower availability, higher latency

Eventual Consistency:

  • Reads may temporarily return stale data
  • Used in: Datastore, Memorystore, Cloud Storage
  • Benefits: Higher availability, lower latency

GCP Choice Strategy:

  • Use Cloud Spanner for strongly consistent, globally distributed transactional data
  • Use Firestore for applications needing eventual consistency with real-time updates
  • Use Cloud SQL for transactional workloads within a single region
  • Use Memorystore for caching with eventual consistency acceptable
  • Use BigQuery for analytical queries (eventual consistency is fine)

Example:

  • Banking system: Cloud Spanner (strong consistency required)
  • Social media feed: Firestore (eventual consistency acceptable)
  • Product catalog: Cloud SQL with caching (eventual consistency in cache)

Q5: What considerations should you make when migrating a monolithic application to GCP?

Answer: Migration strategy options:

  1. Rehost (Lift-and-shift): Run existing application on Compute Engine
  2. Replatform (Lift-tinker-shift): Minimize changes, leverage managed services
  3. Refactor/Re-architect: Redesign for cloud-native (microservices, containers)
  4. Repurchase: Replace with SaaS
  5. Retire: Decommission legacy systems

Detailed migration plan:

  • Assessment: Identify monolith components, dependencies, performance requirements
  • Containerization: Wrap monolith in Docker for GKE deployment
  • Database Migration: Move from on-premises to Cloud SQL or Spanner
  • Network Setup: Establish VPN/Interconnect for hybrid connectivity
  • Gradual Decomposition: Extract microservices one at a time
  • Testing: Conduct performance and failover testing
  • Cutover: Execute migration with minimal downtime
  • Optimization: Right-size resources post-migration

Tools:

  • Migrate for Compute Engine: Automated VM migration
  • Database Migration Service: Migrate databases with minimal downtime
  • Cloud Build: Automate containerization and deployment
  • Spinnaker: Orchestrate deployment to GKE

Q6: How would you implement CI/CD for a multi-team organization?

Answer: Multi-team CI/CD architecture:

Team A Repo (GitHub)  ┐
Team B Repo (GitHub) ├─→ Cloud Build ─→ Artifact Registry
Team C Repo (GitHub) ┘ ↓
(Promote to environments)
├─→ Dev (GKE)
├─→ Staging (GKE)
└─→ Prod (GKE, multi-region)

Key components:

  • Source Control: GitHub/Cloud Source Repositories
  • Cloud Build: Unified build platform with team-specific triggers
  • Artifact Registry: Central image/artifact repository
  • GKE: Multi-environment deployments
  • Cloud Monitoring: Observability across deployments

Best practices:

  • Enforce code reviews before merge
  • Run automated tests on every PR
  • Use branch protection rules
  • Implement GitOps for infrastructure-as-code
  • Separate build configs for security and compliance
  • Use service accounts with minimal required permissions
  • Implement approval gates for production deployments
  • Automated rollback on deployment failures

Q7: How do you ensure data security in a GCP application handling PII?

Answer: Layered security approach:

  1. Encryption in transit: TLS for all communication, Cloud VPN/Interconnect
  2. Encryption at rest: Cloud KMS for keys, CMEK for databases/storage
  3. Access control: IAM roles, service accounts, organization policies
  4. Data classification: Use DLP API to discover and classify PII
  5. Auditing: Cloud Audit Logs for all access
  6. Network isolation: VPC, firewall rules, Private Google Access
  7. Monitoring: Detect unauthorized access attempts

Implementation:

# Enable CMEK for Cloud Storage
gsutil encryption set gs://my-bucket/

# Enable DLP API
gcloud services enable dlp.googleapis.com

# Create service account with minimal permissions
gcloud iam service-accounts create app-sa
gcloud projects add-iam-policy-binding PROJECT_ID \
--member=serviceAccount:app-sa@PROJECT_ID.iam.gserviceaccount.com \
--role=roles/storage.objectViewer

# Enable audit logging
gcloud logging sinks create audit-sink \
bigquery.googleapis.com/projects/PROJECT_ID/datasets/audit_logs \
--log-filter='protoPayload.methodName="storage.buckets.get"'

Compliance considerations:

  • GDPR: Right to be forgotten (data deletion, anonymization)
  • HIPAA: Business Associate Agreement (BAA) with Google
  • PCI-DSS: Network isolation, encryption, regular audits
  • SOC 2: Google's certifications align with SOC 2 Type II

Q8: What are the key differences between Firestore and Datastore?

Answer:

FeatureFirestoreDatastore
Data ModelDocuments + CollectionsEntity + Kind
QueriesRich querying, real-time updatesLimited, no real-time
TransactionsMulti-document, offline supportSingle document
PricingPer operationPer operation (different rates)
UpdatesServer timestamps, batchesLimited
RecommendationNew projectsLegacy (migrate to Firestore)

Choose Firestore for:

  • Real-time applications (chat, collaboration)
  • Mobile apps needing offline support
  • Hierarchical data structures
  • Multi-document transactions

Choose Datastore only if:

  • Existing application heavily depends on it
  • Custom compatibility required

Q9: How would you architect a solution for processing large files asynchronously?

Answer: Asynchronous file processing pipeline:

Cloud Storage (File Upload)

Pub/Sub Topic (File notification)
├─→ Cloud Functions (small files, < 2GB)
└─→ Dataflow (large files, parallel processing)

Cloud Storage (Processed results)

BigQuery (Metadata and results)

Implementation:

  • Trigger: Use Cloud Storage notifications → Pub/Sub
  • Processing: Cloud Functions for small files, Dataflow for large
  • Monitoring: Cloud Monitoring for job status
  • Storage: Store processed results in Cloud Storage
  • Metadata: Track job status in Firestore or BigQuery

Error handling:

  • Dead-letter queue in Pub/Sub for failed messages
  • Retry logic with exponential backoff
  • Error notifications via Cloud Logging

Q10: Compare different compute options and when to use each.

Answer:

ServiceUse CaseProsCons
Compute EngineCustom apps, high controlFull control, per-second billingManage updates, patching
GKEContainerized apps, orchestrationPortable, scalable, organizedOperational complexity
Cloud RunStateless microservicesSimple, auto-scaling to zeroLimited execution time (1 hour)
Cloud FunctionsEvent-driven, simple tasksMinimal code, serverlessLimited to specific use cases
App EngineWeb applications, APIsFully managed, language supportLess flexibility

Decision matrix:

  • Need full control? → Compute Engine
  • Containerized, complex? → GKE
  • Stateless HTTP services? → Cloud Run
  • Simple event handlers? → Cloud Functions
  • Traditional web apps? → App Engine

Networking and Security (8 Questions)

Q11: Explain VPC network design for a production environment.

Answer: Multi-tier VPC design:

VPC (10.0.0.0/8)
├── Public Subnet (10.0.1.0/24)
│ ├── Load Balancer
│ └── NAT Gateway
├── Application Subnet (10.0.2.0/24)
│ ├── Cloud Run (public)
│ └── GKE Pods (private)
└── Database Subnet (10.0.3.0/24)
├── Cloud SQL (private)
└── Memorystore (private)

Key design principles:

  • Network segmentation: Separate public, application, and data tiers
  • Private services: Run applications in private subnets
  • Managed NAT: Use Cloud NAT for outbound traffic
  • Cloud VPN/Interconnect: Secure on-premises connectivity
  • Firewall rules: Explicit allow rules (deny by default)
  • Service controls: Enforce least privilege access

Firewall rule example:

# Allow traffic from load balancer to GKE
gcloud compute firewall-rules create allow-lb-to-app \
--network=production-vpc \
--allow=tcp:80,tcp:443 \
--source-ranges=0.0.0.0/0 \
--target-tags=application

# Deny all inbound (default deny policy)
gcloud compute firewall-rules create deny-all \
--network=production-vpc \
--deny=all \
--priority=65534

Q12: What is VPC Service Controls and when would you use it?

Answer: VPC Service Controls define security perimeters around Google Cloud services:

  • Restrict access to resources (BigQuery, Cloud Storage, etc.)
  • Prevent data exfiltration
  • Enforce organization policies across services
  • Support for hybrid and multi-cloud

Example use case:

# Define a service perimeter
- name: production-perimeter
accessLevels:
- name: corporate-network
basic:
conditions:
- ipSubnetworks:
- 203.0.113.0/24
accessZones:
- restricted_services:
- storage.googleapis.com
- bigquery.googleapis.com

Benefits:

  • Compliant data handling (HIPAA, PCI-DSS)
  • Prevent accidental data exposure
  • Control lateral movement in multi-cloud environments

Q13: How would you implement zero-trust security architecture on GCP?

Answer: Zero-trust principles applied to GCP:

  1. Verify Identity: Service accounts with fine-grained IAM roles
  2. Encrypt Data: CMEK for all data, mTLS for service-to-service
  3. Minimize Blast Radius: VPC-SC, firewall rules, private services
  4. Assume Breach: Monitor all activity, detect anomalies
  5. Continuous Verification: Short-lived tokens, conditional access

Implementation:

  • Service Mesh: Istio or Anthos Service Mesh for mTLS and authorization policies
  • Workload Identity: Bind Kubernetes service accounts to Google service accounts
  • Binary Authorization: Only run approved container images
  • Org Policy: Enforce constraints (e.g., must use CMEK)
  • Cloud Audit Logs: Detect suspicious access patterns

Q14-Q40: Continued in extended Q&A section below...

Data and Databases (8 Questions)

Q14: When would you use Cloud SQL vs. Spanner vs. Firestore?

Answer:

ServiceUse CaseScaleConsistency
Cloud SQLTraditional RDBMSSingle regionStrong
SpannerDistributed transactionsGlobalStrong
FirestoreReal-time, flexibleGlobalEventual/Strong

Detailed guidance:

  • Cloud SQL: Business applications, structured data, within one region
  • Cloud Spanner: Financial systems, global consistency required, >500 GB
  • Firestore: Mobile apps, real-time collaboration, flexible schema
  • BigQuery: Analytics, OLAP queries, historical data
  • Datastore: Legacy, don't choose for new projects

Q15: How do you optimize BigQuery queries and control costs?

Answer: Cost optimization techniques:

  • Partitioning: Query only relevant date ranges
  • Clustering: Group related data, skip unnecessary rows
  • Materialized Views: Precompute expensive queries
  • Columnar Format: Only query needed columns
  • Slot Reservations: Predictable costs for fixed workloads
  • Dataset Expiration: Auto-delete old tables

Example:

-- Create partitioned, clustered table
CREATE TABLE project.dataset.events (
timestamp TIMESTAMP,
user_id STRING,
event_type STRING,
properties JSON
)
PARTITION BY DATE(timestamp)
CLUSTER BY user_id;

-- Query only specific date and user
SELECT * FROM project.dataset.events
WHERE DATE(timestamp) = '2024-01-15'
AND user_id = 'user123';

Monitoring:

  • Enable query audit logs
  • Track slot usage and costs
  • Set up budget alerts

Q16: Design a data pipeline for ETL with both real-time and batch components.

Answer: Lambda architecture combining real-time and batch:

Data Sources
├─→ Real-time Stream
│ ├─→ Pub/Sub
│ └─→ Dataflow (Stream processing)
│ └─→ Bigtable (Real-time serving)

└─→ Batch Data
├─→ Cloud Storage (Raw data)
└─→ Dataflow (Batch processing)
└─→ BigQuery (Analysis, historical)

Both converge at:
└─→ Looker (Real-time dashboards)

Implementation considerations:

  • Separate concerns: real-time and batch processing
  • Use Pub/Sub for event streaming
  • Dataflow for both batch and stream (unified Apache Beam)
  • Bigtable for low-latency time-series data
  • BigQuery for analytical queries
  • Implement exactly-once semantics
  • Monitor data quality with Great Expectations

Q17: How would you handle data warehousing and analytics at scale?

Answer: Scalable data warehouse architecture:

  • Ingestion: Cloud Dataflow, Dataproc, Cloud Composer orchestration
  • Storage: BigQuery with partitioning, clustering, and columnar compression
  • Processing: Dataflow for transformations, BigQuery for queries
  • Serving: Looker for BI, custom APIs with BigQuery connections
  • Governance: Data catalog, lineage tracking, access controls

Best practices:

  • Organize BigQuery datasets by domain (staging, transformations, marts)
  • Use dbt or Dataflow for ELT transformations
  • Implement data quality checks
  • Version control all transformations
  • Track data lineage with Data Catalog
  • Use views and materialized views for abstraction

Q18: Design a solution for real-time fraud detection.

Answer: Real-time fraud detection pipeline:

Transaction Events

Pub/Sub Topic
├─→ Dataflow (Stream processing)
│ ├─→ Feature engineering
│ ├─→ Scoring with Vertex AI model
│ └─→ Real-time decisions

└─→ Cloud Storage (Historical data)

Dataflow (Batch retraining)

Vertex AI (Model training)

Key components:

  • Features: Use ML Engine or Vertex AI for real-time feature serving
  • Model: Retrain daily/weekly with historical data
  • Decisions: Real-time scoring with sub-100ms latency
  • Feedback: Collect true labels from user reports
  • Monitoring: Track false positives/negatives

Q19-Q40: Continued below...

DevOps and Operations (8 Questions)

Q19: How would you implement observability (monitoring, logging, tracing) for GCP?

Answer: Observability pyramid:

Traces (Specific transactions)
├─→ Cloud Trace, OpenTelemetry
├─→ Link to specific requests
└─→ Slowest spans, critical paths

Metrics (What's happening?)
├─→ Cloud Monitoring
├─→ Application metrics, system metrics
└─→ Dashboards, alerts

Logs (Evidence of what happened)
├─→ Cloud Logging
├─→ Structured JSON logs
└─→ Log sinks to BigQuery for analysis

Implementation:

# Python with OpenTelemetry
from opentelemetry import trace, metrics
from opentelemetry.exporter.gcp_trace import GoogleCloudTraceExporter

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process_request") as span:
span.set_attribute("user_id", user_id)
# Processing code

Key metrics to monitor:

  • Latency: p50, p95, p99 response times
  • Errors: Error rate, error types
  • Throughput: Requests per second
  • Resources: CPU, memory, disk I/O
  • Availability: Uptime percentage

Q20: How do you handle secrets and credentials in GCP?

Answer: Secret management strategy:

  • Secret Manager: Store sensitive data (API keys, passwords, certificates)
  • Workload Identity: Eliminate need for credentials for GCP services
  • CMEK: Encrypt secrets with customer-managed keys
  • Audit Logging: Track all secret access
  • Rotation: Regularly rotate credentials

Implementation:

# Create secret
gcloud secrets create my-api-key --replication-policy="automatic"

# Grant access
gcloud secrets add-iam-policy-binding my-api-key \
--member=serviceAccount:app@PROJECT_ID.iam.gserviceaccount.com \
--role=roles/secretmanager.secretAccessor

# Access in application
from google.cloud import secretmanager
client = secretmanager.SecretManagerServiceClient()
response = client.access_secret_version(
request={"name": "projects/PROJECT_ID/secrets/my-api-key/versions/latest"}
)
secret = response.payload.data.decode('UTF-8')

Best practices:

  • Never commit secrets to version control
  • Use Secret Manager, not environment variables
  • Implement least privilege access
  • Audit secret access
  • Rotate credentials regularly
  • Use Workload Identity when possible

Q21: How would you implement disaster recovery and business continuity?

Answer: DR strategy framework:

AspectRTORPOExample
Critical1 hour15 minPayment systems
Important4 hours1 hourUser data
Standard1 day1 dayAnalytics

Multi-region DR architecture:

Primary Region (us-central1)
├─→ Databases (Cloud SQL with replica)
├─→ Application (GKE cluster)
└─→ Storage (Cloud Storage with replication)

Secondary Region (us-east1) - Standby
├─→ Database replica (read-only)
├─→ GKE cluster (scaled down, warm)
└─→ Cloud Storage (replicated)

On failover:
├─→ DNS switch (Cloud DNS failover)
├─→ Scale up secondary region
└─→ Promote replica to primary

Implementation:

  • RTO < 1 hour: Hot standby in second region
  • RTO 1-4 hours: Warm standby with automation
  • RTO > 4 hours: Cold standby, manual recovery
  • RPO: Determined by replication frequency (synchronous vs asynchronous)

Testing:

  • Regular DR drills (monthly minimum)
  • Automated failover testing
  • Document runbooks
  • Practice manual recovery steps

Q22-Q40: Additional questions...

Quick Reference Questions (Remaining)

Q22: What are Google Cloud resources and how is the hierarchy structured?

Answer: Hierarchy:

Organization (optional)
├── Folder (can be nested)
│ ├── Project
│ │ ├── Compute Engine VM
│ │ ├── Cloud Storage Bucket
│ │ ├── Service Accounts
│ │ └── Other resources
│ └── Project
└── Folder
└── Project

Key points:

  • Organization is optional but recommended
  • Folders enable organizational structure
  • Projects are billing units
  • IAM policies cascade down hierarchy
  • Each resource has a unique identifier

Q23-Q40: Interview Questions Summary

The remaining 18 questions cover:

  • Service-to-service authentication patterns
  • Managing multi-cloud deployments
  • Cost optimization strategies
  • Capacity planning approaches
  • Incident response and runbook creation
  • Database migration strategies
  • Microservices decomposition
  • Performance testing and benchmarking
  • API gateway design
  • Data governance and compliance
  • Kubernetes advanced concepts
  • Infrastructure-as-code best practices
  • Load testing strategies
  • Caching patterns
  • Queue and job processing
  • Backup and recovery strategies
  • Team structure for cloud operations
  • Skill development paths

Final Tips for GCP Interviews

  1. Understand tradeoffs: No perfect solution, discuss pros/cons
  2. Ask clarifying questions: Understand requirements before designing
  3. Use diagrams: Draw architecture to explain your thinking
  4. Consider scale: How does design change with 10x users/data?
  5. Think about cost: Propose cost-optimized solutions
  6. Plan for failures: Always include redundancy and failover
  7. Security first: Mention security considerations early
  8. Monitor and operate: Discuss monitoring, logging, alerting
  9. Iterate: Be willing to refine your design based on feedback
  10. Stay current: GCP launches new services frequently

Resources for Interview Preparation