Challenges

Defend a real DevOps decision against a senior engineer, then get scored against a verified rubric. No memorizing, just reasoning.

1

Pick a real call

Each challenge is a true-to-life DevOps decision with a clear question to answer.

2

Defend it

A senior-engineer AI pushes back on your reasoning. Hold your ground or adjust.

3

Get scored

You are graded against a verified rubric, so you learn what a complete answer covers.

Free for everyone. You only sign in when you start a challenge, so your progress is saved.

Beginner

9
beginner

1.6GB Docker Image, 8-Minute Builds. Add Runners?

Your image is huge and every code change rebuilds from scratch. A teammate wants to just add more CI runners. Defend how you actually fix it.

docker · ci-cd · ~8 min
beginner

502 Bad Gateway Right After a Deploy

The site returns 502 from nginx after a deploy. A teammate wants to tweak the nginx config. Defend where you actually look first.

nginx · incident-response · ~8 min
beginner

Access Keys Are Hardcoded on the EC2 Box. Just Rotate Them?

An app on EC2 authenticates to AWS with long-lived IAM user keys in a config file. One leaked. Defend the real fix.

aws · iam · ~10 min
beginner

Disk Is at 100% and the App Is Down. rm -rf?

A server is out of disk and the app is down. A teammate wants to rm -rf the logs to free space fast. Defend what you actually do first.

linux · incident-response · ~8 min
beginner

Nightly Backups Run. Are You Actually Covered?

Automated backups have run for months. A teammate says you are fully covered for disasters. Defend whether you agree.

backups · reliability · ~8 min
beginner

Tests Pass Locally, Fail in CI. Just Retry?

Your suite is green on your laptop but red in CI. A teammate wants to add a retry and re-run until green. Defend what you actually do.

ci-cd · testing · ~8 min
beginner

The t3 Instance Got Slow. Just Size It Up?

An app on a t3 instance crawls under steady load and a teammate wants a bigger instance. Defend what is actually throttling it.

aws · ec2 · ~9 min
beginner

You Pushed AWS Keys to a Public Repo

Live cloud keys just landed in a public GitHub repo. A teammate says delete the file and force-push. Defend what you do first.

git · security · ~8 min
beginner

Your S3 Website Returns 403. Make the Bucket Public?

Objects in S3 return 403 and a teammate wants to turn off Block Public Access to fix it fast. Defend the right way to grant access.

aws · s3 · ~9 min

Intermediate

11
intermediate

3 AM Outage: Roll Back or Fix Forward?

A deploy 20 minutes ago correlates with a spike in 500s. You are on call. Defend whether to roll back or fix forward.

incident-response · deployment · ~12 min
intermediate

A Traffic Spike Caused an Outage. Autoscaling Did Not Save You.

A spike took the service down even though autoscaling was on. A teammate wants to crank the max replicas. Defend the real fix.

kubernetes · scaling · ~13 min
intermediate

Add a NOT NULL Column to a 50M-Row Prod Table

A migration adds a NOT NULL column to a huge live table. A teammate wants to run the ALTER directly. Defend the safe path.

database · postgres · ~14 min
intermediate

Lambda Is Exhausting Your RDS Connections. Raise the Limit?

Under load, Lambda functions fail with too many connections to RDS. A teammate wants to bump max_connections. Defend the real fix.

aws · rds · ~12 min
intermediate

On-Call Is Drowning in Alerts. Silence Them?

A CPU alert pages constantly with no real impact. A teammate wants to silence it. Defend how you fix alerting properly.

observability · reliability · ~12 min
intermediate

Pod Stuck in CrashLoopBackOff After a Deploy

A new deploy leaves a pod in CrashLoopBackOff. A teammate wants to bump CPU and memory. Defend how you actually diagnose it.

kubernetes · incident-response · ~12 min
intermediate

Pods Keep OOMKilling: Scale Up or Fix the Leak?

A service OOMKills every few hours. Adding replicas or raising memory limits buys time. Defend whether to scale around it or fix the root cause.

kubernetes · reliability · ~12 min
intermediate

Someone Changed Infra by Hand. Terraform Plan Is a Mess.

A manual console change put Terraform's state out of sync, and the next plan wants to undo it. Defend how you reconcile it.

terraform · infrastructure · ~12 min
intermediate

SQS Is Delivering the Same Message Twice. Is the Queue Broken?

Consumers occasionally process the same SQS message twice, causing double charges. A teammate says switch everything to FIFO. Defend the real fix.

aws · sqs · ~13 min
intermediate

The NAT Gateway Bill Exploded. Add More Gateways?

NAT Gateway data processing is now your biggest line item. A teammate thinks it is just bandwidth. Defend where the cost really comes from.

aws · networking · ~12 min
intermediate

The Pipeline Is Slow. Skip Tests to Ship?

A 40-minute pipeline is blocking releases. A teammate wants to skip the test stage to ship faster. Defend your call.

ci-cd · reliability · ~12 min

Advanced

10
advanced

A Container in Prod Is Mining Crypto

A pod is pegging CPU running a crypto miner. A teammate wants to kubectl delete it and move on. Defend the real response.

security · kubernetes · ~15 min
advanced

An EKS Pod Needs AWS Access. Attach It to the Node Role?

A pod needs to read S3, and a teammate wants to add the permission to the EKS node role to unblock it. Defend the least-privilege approach.

aws · eks · ~15 min
advanced

Cross-Account S3 Access Still Fails. Just Open It Up?

A partner account still gets AccessDenied on your encrypted S3 objects after you added a bucket policy. A teammate wants to make the bucket and key public. Defend the right fix.

aws · iam · ~15 min
advanced

DynamoDB Is Throttling but Capacity Looks Idle. Crank It Up?

DynamoDB throttles some requests while table capacity looks underused. A teammate wants to raise capacity or flip to on-demand. Defend the real cause.

aws · dynamodb · ~15 min
advanced

One Giant Terraform State Runs Everything

A single Terraform state holds all your infra; every plan is slow and scary. A teammate wants to just keep applying carefully. Defend how you reduce the blast radius.

terraform · infrastructure · ~15 min
advanced

p99 Latency Spiked Across a Microservice Chain

p99 latency jumped across a request that touches six services. A teammate wants to scale up the slowest-looking one. Defend how you find the real culprit.

observability · performance · ~15 min
advanced

Service A Intermittently Cannot Reach Service B

One in twenty calls between two services fails. A teammate wants to add retries and move on. Defend how you actually find the cause.

kubernetes · networking · ~15 min
advanced

Terraform Wants to Replace the Database. Apply Now?

A routine terraform plan unexpectedly wants to destroy and recreate the production database. Defend whether to apply, and how.

terraform · infrastructure · ~14 min
advanced

Your CI Has Keys to Prod and Pulls Unpinned Deps

CI pulls unpinned dependencies, runs them, and holds prod credentials. A teammate says just add a vulnerability scanner. Defend a real hardening plan.

ci-cd · security · ~15 min
advanced

Your Primary Region Is Down. Fail Over Now?

The primary region is degraded. A teammate wants to flip DNS to the standby immediately. Defend whether and how you fail over.

disaster-recovery · database · ~15 min
Have a question about a challenge? Ask in the private Discord.