Linux for DevOps: The Fundamentals
Everything you actually need in Linux to work in DevOps - the command line, permissions, processes, systemd, networking, and disk troubleshooting. Not a cheat sheet of trivia, but the commands you'll reach for every day and at 2am.
Almost every server you'll touch in DevOps runs Linux. If you can't move around the command line quickly, read logs, fix a permission error, restart a service, and figure out why a port isn't responding, everything else - Kubernetes, Terraform, CI/CD - sits on shaky ground.
This is the one guide that covers what you genuinely need. Not every flag of every command, and not the obscure trivia that shows up on cheat sheets. The commands that come up over and over when you're operating real systems. Most cloud servers run Ubuntu or Amazon Linux; everything here applies to both. The shell basics (navigation, text processing, permissions, signals) carry over to macOS too, but systemd, /proc, ss, ip, and free are Linux-only.
Read it once to see the shape of things, then keep it open as a reference until the muscle memory sticks.
How Linux Is Laid Out
Before the commands, the mental model: everything is a file. Config, logs, devices, even running processes - all exposed as files in one tree starting at /. Learn the layout and you stop hunting:
- /etc - configuration. Nginx, SSH, systemd units, DNS settings all live here.
- /var - variable data that grows: /var/log (logs), /var/lib (databases, container state).
- /usr - installed programs and their libraries (/usr/bin, /usr/local/bin).
- /home - user home directories. /root for the root user.
- /tmp - ephemeral scratch space, typically cleared on reboot (the exact cleanup policy depends on the distribution).
- /proc and /sys - virtual filesystems the kernel exposes; live process and system state as files.
A few foundations that everything else builds on:
- Standard streams: every process has stdin (0), stdout (1), stderr (2). 2>&1 merges stderr into stdout - essential for capturing all output in logs and CI.
- Pipes (|) and redirects: a pipe sends one command's stdout into the next command's stdin. Redirects send output to files: > overwrites, >> appends, 2> captures stderr. These are the building blocks of every shell script.
- Exit codes: 0 means success, anything else is failure. $? holds the last exit code. && runs the next command only if the previous succeeded; || runs it only if the previous failed.
- Globs vs regex: *.log is a glob (the shell expands it into filenames). .*\.log is a regex (used inside grep, sed, awk). They look alike but behave differently - don't mix them up.
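The foundations above are easier to trust once you have watched them happen. The following is a minimal sketch that runs entirely inside a scratch directory; `noisy` is a hypothetical stand-in command invented for the demo, not a real tool.

```shell
#!/usr/bin/env bash
# Scratch directory so nothing outside it is touched.
work=$(mktemp -d)

# Hypothetical stand-in command: one line to each stream, then exit code 3.
noisy() {
    echo "to stdout"
    echo "to stderr" >&2
    return 3
}

# Split the two streams into separate files, then read the exit code from $?.
noisy > "$work/out.txt" 2> "$work/err.txt"
status=$?
echo "exit code: $status"

# 2>&1 sends stderr wherever stdout is already going: both lines land in one file.
noisy > "$work/both.txt" 2>&1

# > truncates before writing, >> appends.
echo "first"  >  "$work/log.txt"
echo "second" >> "$work/log.txt"    # log.txt now holds two lines
echo "third"  >  "$work/log.txt"    # log.txt is back to a single line

# && runs the right-hand side on success, || runs it on failure.
true  && echo "ran because the previous command succeeded"
false || echo "ran because the previous command failed"

# Glob: the shell expands *.log into filenames before ls even starts.
touch "$work/app.log" "$work/db.log" "$work/notes.txt"
( cd "$work" && ls *.log )

# Regex: grep interprets the pattern itself, one input line at a time.
ls "$work" | grep -E '.*\.log$'
```

Note the order in `> file 2>&1`: stdout is pointed at the file first, then stderr is pointed at wherever stdout now goes. Writing it the other way round leaves stderr on the terminal.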
Navigating and Working With Files
# Where am I and what's here?
pwd # print working directory
ls -lah # long listing, human sizes, including hidden files
cd - # jump back to the previous directory
# Find things
find /var/log -name "*.log" -mtime -1 # .log files modified in the last 24h
find /etc -type f -name "*.conf" # all config files under /etc
# Move, copy, delete
cp -r src/ dest/ # recursive copy
mv oldname newname # move or rename
rm -rf dir/ # delete a directory (no confirmation, no trash bin)
ln -s /etc/nginx/nginx.conf ~/nginx.conf # symlink
# Look inside files
less /var/log/syslog # scrollable view (q to quit, / to search)
tail -f /var/log/app.log # follow a log live as it's written
head -50 access.log # first 50 lines
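To see what find's -mtime filter and a symlink actually do, here is a small sketch you can run safely in a scratch directory. The file names are made up for the demo, and `touch -d` with a relative date assumes GNU coreutils.

```shell
#!/usr/bin/env bash
logs=$(mktemp -d)

touch "$logs/fresh.log" "$logs/notes.txt"
# Backdate one file. Relative dates with -d are a GNU touch feature.
touch -d '3 days ago' "$logs/stale.log"

# -mtime -1 keeps files modified less than 24 hours ago;
# -type f restricts the match to regular files.
recent=$(find "$logs" -type f -name '*.log' -mtime -1)
echo "recent: $recent"

# A symlink is just a pointer; readlink prints where it points.
ln -s "$logs/fresh.log" "$logs/current.log"
target=$(readlink "$logs/current.log")
echo "current.log -> $target"
```

Only fresh.log should be reported as recent: stale.log is too old and notes.txt does not match the name pattern.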
Searching and Processing Text
This is where you'll spend a lot of your time - logs, configs, command output. grep, sed, and awk are the core trio.
# grep - find matching lines
grep -r "ERROR" /var/log/ # recursive search through a directory
grep -i "timeout" app.log # case-insensitive
grep -v "health" access.log # invert: show lines that DON'T match
grep -E "ERROR|WARN" app.log # extended regex - match either word
grep -c "404" access.log # count matching lines
# Pipes doing real work
ps aux | grep nginx # find nginx processes
cat access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -10
# ^ top 10 IPs hitting your server, most frequent first
# sed - find and replace
sed 's/localhost/10.0.0.1/g' config.template > config.conf # write a new file
sed -i 's/DEBUG/INFO/g' app.conf # edit the file in place
# awk - column extraction
awk -F: '{print $1}' /etc/passwd # first colon-delimited column (usernames)
awk '$9 == 500 {print $7}' access.log # request paths that returned HTTP 500
# Redirects in practice
./deploy.sh > deploy.log 2>&1 # capture stdout AND stderr to one file
./check.sh >> /var/log/checks.log # append, don't overwrite
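The awk one-liners above depend on field positions, so it helps to try them on a log you control. This sketch writes five invented lines in the common Nginx/Apache access-log layout (an assumption; your log format may place the status code in a different field) and runs the same pipelines against them.

```shell
#!/usr/bin/env bash
log=$(mktemp)

# Sample data, invented for the demo. With whitespace splitting:
# $1 = client IP, $7 = request path, $9 = HTTP status.
cat > "$log" <<'EOF'
10.0.0.1 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326
10.0.0.2 - - [10/Oct/2024:13:55:37 +0000] "GET /api/orders HTTP/1.1" 500 512
10.0.0.1 - - [10/Oct/2024:13:55:38 +0000] "GET /health HTTP/1.1" 200 2
10.0.0.1 - - [10/Oct/2024:13:55:39 +0000] "POST /api/pay HTTP/1.1" 500 98
10.0.0.3 - - [10/Oct/2024:13:55:40 +0000] "GET /health HTTP/1.1" 200 2
EOF

# Requests per IP, busiest first.
awk '{print $1}' "$log" | sort | uniq -c | sort -rn

# The single busiest IP: take the first row and drop the count column.
top_ip=$(awk '{print $1}' "$log" | sort | uniq -c | sort -rn | head -n 1 | awk '{print $2}')
echo "top ip: $top_ip"

# Paths that returned HTTP 500.
failed_paths=$(awk '$9 == 500 {print $7}' "$log")
echo "$failed_paths"

# Count the lines that are NOT health checks.
non_health=$(grep -vc "health" "$log")
echo "non-health requests: $non_health"
```

`uniq -c` only collapses adjacent duplicates, which is why the pipeline sorts before counting and then sorts again by count.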
Handy shortcuts that make you faster: Ctrl+R (reverse-search command history), Ctrl+A / Ctrl+E (jump to start/end of line), and !! (repeat the last command - sudo !! reruns it with sudo).
Permissions and Ownership
Permission errors are one of the most common things that break deployments: a config the app can't read, a script that won't execute, an SSH key that's "too open." The quick version: every file carries read/write/execute bits for three audiences - owner, group, and others - which ls -l prints as -rwxr-xr-- and which you'll often see written as octal (755, 644, 600).
ls -la /var/www/html/ # read the -rwxr-xr-x string in column one
chmod 755 /usr/local/bin/myapp # rwxr-xr-x - executable
chmod 644 /etc/app/config.yml # rw-r--r-- - config file
chmod 600 ~/.ssh/id_ed25519 # rw------- - private keys MUST be 600
chown www-data:www-data /var/www/html/ # set owner:group
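The single most common SSH gotcha: your private key must not be readable by group or others (600) and your ~/.ssh directory should be 700. If the key file is too open, the ssh client prints an "UNPROTECTED PRIVATE KEY FILE" warning and ignores the key. If the permissions are too loose on the server side (~/.ssh or authorized_keys), sshd quietly rejects the key and all you see is a confusing Permission denied (publickey).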
There's more to it - why permissions aren't cumulative, how the octal math works, what rwx means on a directory versus a file, and the special bits. It's a model worth having properly, so it gets its own guide: Linux File Permissions, Properly Understood.
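A quick way to connect the octal numbers to the rwx strings is to set a mode and read it back. This sketch does that on a throwaway directory rather than your real ~/.ssh; `id_demo` is a hypothetical empty file, and `stat -c` assumes GNU coreutils (macOS stat uses different flags).

```shell
#!/usr/bin/env bash
home=$(mktemp -d)               # stand-in for a home directory
mkdir "$home/.ssh"
touch "$home/.ssh/id_demo"      # hypothetical key file, empty

# Start from a mode that is too open for a private key.
chmod 644 "$home/.ssh/id_demo"
# %a = octal mode, %A = the rwx string ls -l shows, %n = file name.
stat -c '%a %A %n' "$home/.ssh/id_demo"

# Tighten both the directory and the key.
chmod 700 "$home/.ssh"
chmod 600 "$home/.ssh/id_demo"

dir_mode=$(stat -c '%a' "$home/.ssh")
key_mode=$(stat -c '%a' "$home/.ssh/id_demo")
key_bits=$(stat -c '%A' "$home/.ssh/id_demo")
echo "dir=$dir_mode key=$key_mode ($key_bits)"
```

Running the same two stat commands against a real ~/.ssh is a fast first check when a key is being rejected.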
Processes and Signals
A process is a running program. When something breaks - pegged CPU, a hung service, a memory leak - you need to find it fast.
# Snapshot of processes
ps aux # everything, human-readable
ps aux --sort=-%cpu | head # top CPU consumers
ps aux --sort=-%mem | head # top memory consumers
top # live monitor (press M for mem, P for cpu, q to quit)
htop # nicer top, if installed
# Find a process
pgrep -a nginx # PIDs + full command for anything named nginx
# Send signals
kill -15 <pid> # SIGTERM - ask it to stop gracefully (default)
kill -9 <pid> # SIGKILL - force kill, no cleanup (last resort)
kill -1 <pid> # SIGHUP - traditionally "reload config"
pkill -f "python worker.py" # kill by matching the full command line
Always try SIGTERM (15) before SIGKILL (9). kill -9 gives the process no chance to flush buffers, close files, or release locks - databases and web servers can leave corrupt state behind. Wait a few seconds after SIGTERM before escalating.
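The "SIGTERM, wait, then SIGKILL" routine is easy to wrap in a helper. The function below is one possible sketch, not a standard tool: `stop_gracefully` and its five-second default are assumptions chosen for the demo, and it is written for bash.

```shell
#!/usr/bin/env bash
# stop_gracefully PID [GRACE_SECONDS]
# Returns 0 if the process exited after SIGTERM (or was not running),
# 1 if it had to be force-killed.
stop_gracefully() {
    local pid=$1 grace=${2:-5} waited=0

    if ! kill -TERM "$pid" 2>/dev/null; then
        echo "pid $pid: not running (or not ours to signal)"
        return 0
    fi

    # kill -0 sends no signal; it only tests whether the PID can still be signalled.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$grace" ]; then
            echo "pid $pid: still alive after ${grace}s, sending SIGKILL"
            kill -KILL "$pid" 2>/dev/null
            return 1
        fi
        sleep 1
        waited=$(( waited + 1 ))
    done

    echo "pid $pid: exited after SIGTERM"
    return 0
}

# Demo: a background sleep honours SIGTERM, so no escalation is needed.
sleep 300 &
demo_pid=$!
stop_gracefully "$demo_pid" 5
demo_status=$?
```

For a real service, pick a grace period that matches how long it needs to drain connections or flush data; a few seconds is fine for a worker script and far too short for a busy database.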
Managing Services With systemd
On modern Linux, systemd (PID 1) manages almost every service. Knowing how to read its status and logs is the difference between a 5-minute fix and a 2-hour outage.
# Service control
systemctl status nginx # running? failed? recent log lines?
systemctl start nginx
systemctl stop nginx
systemctl restart nginx
systemctl reload nginx # reload config without a full restart (if supported)
systemctl enable nginx # start automatically on boot
systemctl disable nginx
systemctl list-units --type=service --state=running
# journalctl - reading service logs
journalctl -u nginx # all logs for the nginx unit
journalctl -u nginx -f # follow live
journalctl -u nginx --since "1 hour ago"
journalctl -u nginx -p err # priority "err" and more severe (crit, alert, emerg)
journalctl -b # everything since the last boot
journalctl --disk-usage # how much space the logs are eating
Reach for systemctl status <service> first whenever a service misbehaves - it tells you whether it's running, when it last restarted, and shows the last several log lines in one shot.
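That "status first, then logs" habit can be captured in a few lines. The helper below is a rough sketch under stated assumptions: `triage_service` is a made-up name, the one-hour window and 20-line limit are arbitrary, and reading another unit's journal may require sudo on your system.

```shell
#!/usr/bin/env bash
# triage_service UNIT
# Returns 0 if the unit is active, 1 if it is not, 2 if systemd is unavailable.
triage_service() {
    local unit=$1

    if ! command -v systemctl >/dev/null 2>&1; then
        echo "systemctl not found: this host does not appear to use systemd" >&2
        return 2
    fi

    # is-active --quiet prints nothing and reports the state via its exit code.
    if systemctl is-active --quiet "$unit"; then
        echo "$unit is active"
        return 0
    fi

    echo "$unit is NOT active; status and recent errors follow"
    systemctl status "$unit" --no-pager
    journalctl -u "$unit" --since "1 hour ago" -p err --no-pager | tail -n 20
    return 1
}

# Usage (not run here, since it depends on the host):
#   triage_service nginx
```

The --no-pager flags matter in scripts and CI: without them, systemctl and journalctl may open a pager and wait for a keypress.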
Networking and Connectivity
Network problems are frustrating because they fail at many layers - DNS, routing, firewall, the app itself. A systematic toolkit lets you eliminate possibilities instead of guessing.
# What's listening, and what's connected?
ss -tlnp # TCP ports being listened on, with process names (run with sudo to see other users' processes)
ss -tulnp # add UDP
lsof -i :8080 # which process is using port 8080?
# DNS
dig example.com # full DNS answer
dig @8.8.8.8 example.com # ask a specific resolver
cat /etc/resolv.conf # which DNS servers this machine uses
cat /etc/hosts # local overrides - check here when DNS acts weird
# Interfaces and routing
ip addr show # interfaces and their IPs
ip route show # the routing table (is there a default gateway?)
# Testing reachability
curl -v https://api.example.com/health # full HTTP debug, including TLS
curl -o /dev/null -s -w "%{http_code}\n" https://example.com # just the status code
nc -zv 10.0.0.5 5432 # is the PostgreSQL port reachable?
Two distinctions that save hours: "Connection refused" means something answered and said no - the host is reachable but nothing is listening on that port (check ss -tlnp). A firewall drop gives you a timeout, not a refusal. And 0.0.0.0 in ss output means "listening on all interfaces," while 127.0.0.1 means loopback only - not reachable from outside the machine.
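The refused-versus-timeout distinction can be tested without installing anything, using bash's built-in /dev/tcp redirection and the coreutils `timeout` command. This is a sketch rather than a replacement for nc: `probe_port` is a hypothetical helper, and the three-second default is an arbitrary choice.

```shell
#!/usr/bin/env bash
# probe_port HOST PORT [WAIT_SECONDS]
# Prints one word: open, refused, or timeout.
probe_port() {
    local host=$1 port=$2 wait=${3:-3}

    # bash opens a TCP connection when a redirection targets /dev/tcp/HOST/PORT.
    # Inside bash -c, $0 is the host and $1 is the port.
    timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null

    case $? in
        0)   echo "open" ;;      # something accepted the connection
        124) echo "timeout" ;;   # timeout(1) gave up: packets are probably being dropped
        *)   echo "refused" ;;   # failed immediately: normally "connection refused",
                                 # though an unresolvable hostname also ends up here
    esac
}

# Demo: port 1 on loopback is almost never in use, so expect "refused".
probe_port 127.0.0.1 1 2
```

A "refused" answer sends you to `ss -tlnp` on the target host; a "timeout" sends you to firewall rules and security groups.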
Disk, Memory, and the 2am Checklist
When a server is "slow" or a deploy fails for no obvious reason, it's often a full disk or exhausted memory. Check these first.
# Disk
df -h # free space per filesystem (watch for 100% Use%)
du -sh /var/log/* # what's eating space in a directory
du -sh /* 2>/dev/null | sort -rh | head # biggest top-level directories
# Memory
free -h # total/used/free memory and swap
ps aux --sort=-%mem | head # who's hogging RAM
# Package management
apt update && apt install -y htop # Debian/Ubuntu
yum install -y htop # RHEL/Amazon Linux (or dnf on newer systems)
A disk at 100% breaks things in confusing ways - services can't write logs, databases refuse writes, deploys fail mid-copy. The usual culprit is runaway logs in /var/log. df -h then du -sh /var/log/* finds it fast.
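If you would rather be told about a filling disk than discover it, the df check is simple to script. The function below is a minimal sketch: `flag_full_filesystems` and the 90% default are assumptions, the sample numbers are invented, and mount points containing spaces would need more careful parsing.

```shell
#!/usr/bin/env bash
# flag_full_filesystems [THRESHOLD_PERCENT]
# Reads `df -P` output on stdin and prints "mountpoint usage" for every
# filesystem at or above the threshold.
flag_full_filesystems() {
    local threshold=${1:-90}
    awk -v limit="$threshold" '
        NR > 1 {                    # skip the header row
            use = $5                # capacity column, e.g. "95%"
            sub(/%/, "", use)
            if (use + 0 >= limit) print $6, $5
        }'
}

# Demo on invented df -P output, so the result is predictable.
full=$(flag_full_filesystems 90 <<'EOF'
Filesystem     1024-blocks     Used Available Capacity Mounted on
/dev/sda1         41152736 39000000   2152736      95% /
tmpfs              4018240        0   4018240       0% /dev/shm
/dev/sdb1        103081248 99999999   3081249      98% /var/log
EOF
)
echo "$full"

# On a real machine:
#   df -P | flag_full_filesystems 90
```

The -P flag asks df for the POSIX output format, which keeps each filesystem on a single line and makes the columns safe to parse.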
Common Mistakes
rm -rf without checking the path twice. There's no trash bin. ls the target first. In scripts, use rm -rf "${DIR:?}" - the :? makes the shell abort if DIR is unset or empty, so a missing variable can't turn rm -rf "$DIR/" into a catastrophic rm -rf /.
Reaching for chmod 777 to fix a permission error. It "works" by removing all protection - it's a security hole, not a fix. Identify which user needs access (ps aux | grep <process>), then grant the minimum: usually chmod 640 plus chown appuser:appgroup.
> vs >>. > truncates the file before writing. Point it at a log another process is reading and you've wiped it. Use >> to append to existing files.
Jumping straight to kill -9. Try SIGTERM first - give the process a chance to shut down cleanly. kill -9 is the last resort, not the first move.
Trusting ping as a connectivity test. ICMP is frequently blocked by firewalls and cloud security groups. A host that ignores ping may still serve HTTP fine. Test the actual port with curl or nc.
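The ${DIR:?} guard from the first mistake is worth seeing in action once. This sketch, confined to a scratch directory, shows both the normal case and the abort; the error message text is an arbitrary example.

```shell
#!/usr/bin/env bash
scratch=$(mktemp -d)
mkdir "$scratch/build"

# Normal case: DIR is set, so ${DIR:?} expands exactly like $DIR
# and the directory is removed.
DIR="$scratch/build"
rm -rf "${DIR:?}"
[ -e "$scratch/build" ] || echo "build directory removed"

# Failure case: DIR is empty. The expansion fails before rm is ever started.
# The ( ... ) subshell contains the abort; in a plain script the whole
# script would exit at this line.
DIR=""
( rm -rf "${DIR:?DIR is unset or empty, refusing to delete}" ) 2>/dev/null
guard_status=$?
echo "guard exit status: $guard_status"
```

A non-zero guard status means rm never ran. The same :? form works for any variable a destructive command depends on, such as a backup target or a deploy path.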
Where to Go From Here
The only way this sticks is by doing it. Spin up a cheap server (a DigitalOcean droplet or an AWS EC2 free-tier instance) and, using only the terminal:
- Create a non-root user, give it sudo, and lock down SSH key permissions.
- Install and run Nginx as a systemd service - then read its logs with journalctl.
- Deliberately break it (wrong permissions on the web root, a config typo) and use systemctl status, journalctl, and ss to diagnose and fix it.
- Fill up a disk with dd, watch a service fail, and clean it up.
Do that loop a few times and the commands stop being a list to memorize - they become reflexes. That's the whole goal. From here, Docker, Kubernetes, and Terraform all make far more sense, because they're just orchestrating the Linux primitives you now understand.