Challenges
Defend a real DevOps decision against a senior engineer, then get scored against a verified rubric. No memorizing, just reasoning.
Pick a real call
Each challenge is a true-to-life DevOps decision with a clear question to answer.
Defend it
A senior-engineer AI pushes back on your reasoning. Hold your ground or adjust.
Get scored
You are graded against a verified rubric, so you learn what a complete answer covers.
Free for everyone. You only sign in when you start a challenge, so your progress is saved.
Beginner (9)
1.6GB Docker Image, 8-Minute Builds. Add Runners?
Your image is huge and every code change rebuilds from scratch. A teammate wants to just add more CI runners. Defend how you actually fix it.
502 Bad Gateway Right After a Deploy
The site returns 502 from nginx after a deploy. A teammate wants to tweak the nginx config. Defend where you actually look first.
Access Keys Are Hardcoded on the EC2 Box. Just Rotate Them?
An app on EC2 authenticates to AWS with long-lived IAM user keys in a config file. One leaked. Defend the real fix.
Disk Is at 100% and the App Is Down. rm -rf?
A server is out of disk and the app is down. A teammate wants to rm -rf the logs to free space fast. Defend what you actually do first.
Nightly Backups Run. Are You Actually Covered?
Automated backups have run for months. A teammate says you are fully covered for disasters. Defend whether you agree.
Tests Pass Locally, Fail in CI. Just Retry?
Your suite is green on your laptop but red in CI. A teammate wants to add a retry and re-run until green. Defend what you actually do.
The t3 Instance Got Slow. Just Size It Up?
An app on a t3 instance crawls under steady load and a teammate wants a bigger instance. Defend what is actually throttling it.
You Pushed AWS Keys to a Public Repo
Live cloud keys just landed in a public GitHub repo. A teammate says delete the file and force-push. Defend what you do first.
Your S3 Website Returns 403. Make the Bucket Public?
Objects in S3 return 403 and a teammate wants to turn off Block Public Access to fix it fast. Defend the right way to grant access.
Intermediate (11)
3 AM Outage: Roll Back or Fix Forward?
A deploy 20 minutes ago correlates with a spike in 500s. You are on call. Defend whether to roll back or fix forward.
A Traffic Spike Caused an Outage. Autoscaling Did Not Save You.
A spike took the service down even though autoscaling was on. A teammate wants to crank the max replicas. Defend the real fix.
Add a NOT NULL Column to a 50M-Row Prod Table
A migration adds a NOT NULL column to a huge live table. A teammate wants to run the ALTER directly. Defend the safe path.
Lambda Is Exhausting Your RDS Connections. Raise the Limit?
Under load, Lambda functions fail with too many connections to RDS. A teammate wants to bump max_connections. Defend the real fix.
On-Call Is Drowning in Alerts. Silence Them?
A CPU alert pages constantly with no real impact. A teammate wants to silence it. Defend how you fix alerting properly.
Pod Stuck in CrashLoopBackOff After a Deploy
A new deploy leaves a pod in CrashLoopBackOff. A teammate wants to bump CPU and memory. Defend how you actually diagnose it.
Pods Keep OOMKilling: Scale Up or Fix the Leak?
A service OOMKills every few hours. Adding replicas or raising memory limits buys time. Defend whether to scale around it or fix the root cause.
Someone Changed Infra by Hand. Terraform Plan Is a Mess.
A manual console change put Terraform's state out of sync, and the next plan wants to undo it. Defend how you reconcile it.
SQS Is Delivering the Same Message Twice. Is the Queue Broken?
Consumers occasionally process the same SQS message twice, causing double charges. A teammate says switch everything to FIFO. Defend the real fix.
The NAT Gateway Bill Exploded. Add More Gateways?
NAT Gateway data processing is now your biggest line item. A teammate thinks it is just bandwidth. Defend where the cost really comes from.
The Pipeline Is Slow. Skip Tests to Ship?
A 40-minute pipeline is blocking releases. A teammate wants to skip the test stage to ship faster. Defend your call.
Advanced (10)
A Container in Prod Is Mining Crypto
A pod is pegging CPU running a crypto miner. A teammate wants to kubectl delete it and move on. Defend the real response.
An EKS Pod Needs AWS Access. Attach It to the Node Role?
A pod needs to read S3, and a teammate wants to add the permission to the EKS node role to unblock it. Defend the least-privilege approach.
Cross-Account S3 Access Still Fails. Just Open It Up?
A partner account still gets AccessDenied on your encrypted S3 objects after you added a bucket policy. A teammate wants to make the bucket and key public. Defend the right fix.
DynamoDB Is Throttling but Capacity Looks Idle. Crank It Up?
DynamoDB throttles some requests while table capacity looks underused. A teammate wants to raise capacity or flip to on-demand. Defend the real cause.
One Giant Terraform State Runs Everything
A single Terraform state holds all your infra; every plan is slow and scary. A teammate wants to just keep applying carefully. Defend how you reduce the blast radius.
p99 Latency Spiked Across a Microservice Chain
p99 latency jumped across a request that touches six services. A teammate wants to scale up the slowest-looking one. Defend how you find the real culprit.
Service A Intermittently Cannot Reach Service B
One in twenty calls between two services fails. A teammate wants to add retries and move on. Defend how you actually find the cause.
Terraform Wants to Replace the Database. Apply Now?
A routine terraform plan unexpectedly wants to destroy and recreate the production database. Defend whether to apply, and how.
Your CI Has Keys to Prod and Pulls Unpinned Deps
CI pulls unpinned dependencies, runs them, and holds prod credentials. A teammate says just add a vulnerability scanner. Defend a real hardening plan.
Your Primary Region Is Down. Fail Over Now?
The primary region is degraded. A teammate wants to flip DNS to the standby immediately. Defend whether and how you fail over.