Linux Process and Job Management
Find what is eating CPU, kill a stuck process the right way, and run things in the background. The process model, signals, ps and top, kill and pkill, and job control - the commands you actually reach for when a box is misbehaving.
When a server misbehaves, it is almost always a process: one is pinning the CPU, one is leaking memory, one is stuck and holding a port, or the one you need is not running at all. Knowing how to see processes, understand their state, and stop them cleanly is the difference between a calm fix and a guess that makes things worse.
This is the practical version - the model you need plus the handful of commands that come up over and over.
What a process actually is
A process is a running program. Every process has:
- A PID (process ID) - a unique number the kernel assigns it.
- A PPID (parent PID) - the process that started it. Processes form a tree; the root is init/systemd (PID 1).
- An owner - the user it runs as, which decides what it can touch and who can signal it.
- A state - running, sleeping, stopped, or zombie (more on these below).
New processes are created when a parent forks a copy of itself and usually execs a new program in it. That parent/child link matters: when you kill a parent, you often want its children to go too, and orphaned children get re-parented to PID 1.
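To see the parent/child link for yourself, here is a minimal sketch. ppid_of is a hypothetical helper name (not a standard command), and sleep 30 stands in for any child process.

```shell
# ppid_of (hypothetical helper): print the parent PID of the given PID.
ppid_of() {
    ps -o ppid= -p "$1" | tr -d ' '
}

echo "this shell: PID $$, parent PID $PPID"

sleep 30 &                      # the shell forks, and the child execs sleep
child=$!                        # $! holds the PID of the last background job
echo "child: PID $child, parent PID $(ppid_of "$child")"

kill "$child"                   # clean up the demo process
```

The child's parent PID should be the PID of the shell that started it.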
Seeing what is running: ps
ps takes a snapshot of processes. The one invocation worth memorizing:
ps aux # every process, with user, CPU%, MEM%, and the command
The columns that matter: USER, PID, %CPU, %MEM, STAT (state), and COMMAND. To find a specific process, pipe it to grep:
ps aux | grep nginx # find nginx processes
ps aux | grep -v grep | grep nginx # drop the grep line itself
A few more forms you will use:
ps -ef # same idea, different columns; shows PPID (the parent)
ps -p 1234 -o user,pid,ppid,%cpu,%mem,stat,cmd # one PID, chosen columns
pgrep -a nginx # just the PIDs (and command) matching a name
pgrep is the clean way to get PIDs for scripting - no grep | awk dance.
The live view: top and htop
ps is a snapshot; top updates continuously and is how you catch a process in the act.
top # live, sorted by CPU by default
Inside top: press P to sort by CPU, M to sort by memory, k to kill a PID, and q to quit. The header line shows load average - three numbers for the last 1, 5, and 15 minutes. A rough rule: if load average is consistently above your core count, the box is overloaded.
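The load-versus-cores rule can be checked without opening top. This is a rough sketch, assuming a Linux box with /proc/loadavg and the coreutils nproc command; load_per_core is a hypothetical helper name, and the 1.0 threshold is the rule of thumb above, not a hard limit.

```shell
# load_per_core (hypothetical helper): 1-minute load average divided by CPU count.
load_per_core() {
    cores=$(nproc)                          # CPUs available to this process
    read -r load1 _rest < /proc/loadavg     # first field is the 1-minute average
    awk -v l="$load1" -v c="$cores" 'BEGIN { printf "%.2f\n", l / c }'
}

ratio=$(load_per_core)
echo "1-minute load per core: $ratio"

# A ratio that stays above 1.00 suggests more runnable work than cores.
if awk -v r="$ratio" 'BEGIN { exit !(r > 1.0) }'; then
    echo "possibly overloaded"
fi
```

A single reading means little; it is the trend across the 1, 5, and 15 minute numbers that tells you whether the load is rising or draining.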
htop is the friendlier version if it is installed - colored bars, scrolling, and you can click to select. Install it (apt install htop or dnf install htop); on a box you control it is worth it.
htop # nicer interactive process viewer
Signals: how you actually talk to a process
You do not "kill" a process directly - you send it a signal, and the process decides how to respond. The three that matter in practice:
- SIGTERM (15) - the polite "please shut down." The default. The process can catch it, flush its work, close connections, and exit cleanly. Always try this first.
- SIGKILL (9) - the unconditional "die now." The kernel terminates the process immediately; it cannot catch or ignore it. The process gets no chance to clean up: the kernel closes its file descriptors, but half-written files, in-flight work, and lock files are left as they are. A last resort.
- SIGHUP (1) - originally "the terminal hung up." Many daemons repurpose it to mean "reload your config without restarting" (nginx, sshd). Check the docs before assuming.
The instinct to reach for kill -9 immediately is the most common mistake here. SIGKILL gives the process no chance to clean up, which is exactly how you get corrupted files and stale locks. Send SIGTERM, wait a few seconds, and only escalate to SIGKILL if it is genuinely stuck.
Stopping processes: kill, pkill, killall
kill 1234 # send SIGTERM (the default) to PID 1234
kill -TERM 1234 # the same thing, explicit
kill -9 1234 # SIGKILL - only if SIGTERM did not work
kill -HUP 1234 # SIGHUP - often "reload config"
By name instead of PID:
pkill nginx # SIGTERM every process whose name matches
pkill -9 -f "python worker.py" # match the FULL command line, then SIGKILL
killall nginx # SIGTERM all processes named exactly "nginx"
pkill -f matches against the whole command line, which is how you target one specific script when several share an interpreter (several python processes, for example). Be careful: a loose pattern can kill more than you meant. Check first with pgrep -af what would match.
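One way to make "check first" a habit is to wrap the two steps together. The sketch below is an illustration, not a standard tool: confirm_pkill is a hypothetical helper that shows what pgrep -af matches and only then offers to send SIGTERM.

```shell
# confirm_pkill (hypothetical helper): preview the matches for a -f pattern,
# then ask before sending SIGTERM to them.
confirm_pkill() {
    pattern=$1
    # pgrep -a prints PID and full command line; -f matches the full command line
    if ! pgrep -af -- "$pattern"; then
        echo "no process matches: $pattern" >&2
        return 1
    fi
    printf 'Send SIGTERM to the processes above? [y/N] '
    read -r answer
    case $answer in
        y|Y) pkill -f -- "$pattern" ;;
        *)   echo "aborted, nothing was signalled" ;;
    esac
}

# Usage (pattern taken from the example above):
#   confirm_pkill "python worker.py"
```

If the preview lists anything you did not expect, tighten the pattern before answering y.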
The correct escalation when something is stuck:
kill 1234 # 1. ask nicely
sleep 5 # 2. give it a moment
kill -9 1234 # 3. only if it is still there
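A fixed sleep either waits too long or not long enough. A small polling loop does the same escalation but stops waiting as soon as the process is gone. This is a sketch under stated assumptions: stop_gracefully is a hypothetical function name, and the 5 second default is an arbitrary choice you should adapt to how long the service needs to shut down.

```shell
# stop_gracefully PID [TIMEOUT_SECONDS]  (hypothetical helper)
# Sends SIGTERM, polls once per second, and escalates to SIGKILL only if the
# process is still there when the timeout runs out.
stop_gracefully() {
    pid=$1
    timeout=${2:-5}

    # If the signal cannot be delivered, the process is already gone
    # (or is not ours to signal); either way there is nothing to wait for.
    kill -TERM "$pid" 2>/dev/null || return 0

    i=0
    while [ "$i" -lt "$timeout" ]; do
        # kill -0 sends no signal; it only tests whether the PID still exists
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        i=$((i + 1))
    done

    if kill -0 "$pid" 2>/dev/null; then
        echo "PID $pid ignored SIGTERM for ${timeout}s; sending SIGKILL" >&2
        kill -KILL "$pid"
    fi
}

# Usage:
#   stop_gracefully 1234 10
```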
Background and foreground: job control
When you run a command in a terminal it holds the foreground until it finishes. Job control lets you push work into the background and get your prompt back.
long-running-task & # start it in the background
jobs # list jobs started from this shell
fg %1 # bring job 1 back to the foreground
bg %1 # resume a stopped job in the background
The key sequence: press Ctrl+Z to suspend the foreground process (it stops, does not die), then bg to let it keep running in the background, or fg to resume it in front. Ctrl+C is different - it sends SIGINT to stop the process entirely.
The catch most people hit: a backgrounded job still belongs to your shell, so when you log out or close the terminal, it gets a SIGHUP and dies. To survive a logout:
nohup long-task & # ignore SIGHUP; output goes to nohup.out
disown %1 # detach an already-running job from the shell
For anything that genuinely needs to outlive your session - a long migration, a training run - use tmux or screen (a persistent terminal you can detach from and reconnect to) or, better for a real service, a systemd unit. nohup is fine for a one-off; it is not how you run a service.
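For the one-off case, it helps to send the output somewhere deliberate and record the PID so you can find the job later. A minimal sketch: the temporary directory, the task.log and task.pid file names, and the short sh -c command standing in for a real long-running task are all assumptions made for the example.

```shell
workdir=$(mktemp -d)            # scratch directory for the log and PID file

# Redirecting output explicitly keeps it out of the default nohup.out.
nohup sh -c 'sleep 1; echo "task finished"' > "$workdir/task.log" 2>&1 &
echo $! > "$workdir/task.pid"   # remember the PID for later

echo "started PID $(cat "$workdir/task.pid"), logging to $workdir/task.log"

# Here we simply wait so the demo can show the log; in real use you would
# log out and later inspect the PID file and the log instead.
wait "$(cat "$workdir/task.pid")"
cat "$workdir/task.log"
```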
Process states and the zombie myth
The STAT column in ps tells you what a process is doing:
- R - running or runnable (on or waiting for the CPU).
- S - sleeping, waiting for something (the normal idle state for most processes).
- D - uninterruptible sleep, usually waiting on disk or network I/O. A process stuck in D cannot even be killed until the I/O returns; lots of D processes points at a storage or network problem.
- T - stopped (suspended, for example by Ctrl+Z).
- Z - zombie.
A zombie is not a process eating resources - it is a dead process whose parent has not yet read its exit status. It holds nothing but a slot in the process table. You cannot kill a zombie (it is already dead); you fix it by getting the parent to reap it, or if the parent is broken, by killing the parent so PID 1 adopts and reaps the zombie. A pile of zombies means a parent with a bug, not a resource leak.
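Because the fix is always on the parent's side, the useful listing is zombies together with their PPID. A sketch, assuming the procps version of ps; list_zombies is a hypothetical helper name, and the demo deliberately creates one short-lived zombie so there is something to see.

```shell
# list_zombies (hypothetical helper): print PID, PPID and command name of
# every process whose state starts with Z.
list_zombies() {
    ps -e -o pid= -o ppid= -o stat= -o comm= | awk '$3 ~ /^Z/ { print $1, $2, $4 }'
}

# Demo only: the subshell starts a child that exits at once, then replaces
# itself with "sleep 3". sleep never reads the child's exit status, so the
# child stays a zombie until sleep exits.
(sleep 0 & exec sleep 3) &
parent=$!
sleep 1

echo "zombie PID, parent PID, command:"
list_zombies

wait "$parent"    # once the parent exits, the zombie is re-parented and reaped
```

The second column is the process to investigate; the zombie itself needs no action.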
Priorities: nice and renice
Every process has a niceness from -20 (greediest) to 19 (most willing to yield). Higher niceness means lower priority. Use it to stop a heavy batch job from starving everything else:
nice -n 10 ./backup.sh # start a job at lower priority
renice -n 5 -p 1234 # change a running process's priority
Only root can raise priority, that is, lower a process's niceness or set a negative one; any user can make their own processes nicer. For most DevOps work you will reach for this rarely - but when a backup or import is hammering a box that also serves traffic, a higher niceness keeps the important work responsive.
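You can watch niceness change in the NI column of ps. A small sketch, with sleep 30 standing in for a real batch job. Whether renice -n treats its argument as an absolute value or as an increment differs between renice implementations, so the example checks the result instead of assuming it.

```shell
nice                            # niceness of the current shell (usually 0)
nice -n 10 nice                 # run "nice" itself 10 steps nicer and print the result

nice -n 10 sleep 30 &           # stand-in batch job at lower priority
job=$!
ps -o pid,ni,comm -p "$job"     # NI column shows the niceness

renice -n 15 -p "$job"          # making a process nicer needs no privileges
ps -o pid,ni,comm -p "$job"     # check what renice actually did

kill "$job"                     # clean up the demo process
```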
Finding what holds a port or a file
A classic incident: a service will not start because "address already in use." Something is already on the port. Find it:
ss -tlnp # listening TCP sockets with the owning process
ss -tlnp | grep :8080 # who is on port 8080
lsof -i :8080 # same question, lsof's view
lsof -p 1234 # every file (and socket) PID 1234 has open
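ss is the modern replacement for netstat; -tlnp means TCP, listening, numeric, with the process. Run it as root, or the process column stays empty for sockets owned by other users. Once you have the PID you can decide whether to stop it or move your service. lsof also works in the other direction: given a path instead of a PID (lsof /path/to/file), it lists the processes holding a file you cannot delete, and given a mount point it lists the open files keeping a mount busy when you cannot unmount it.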
The /proc view
Most of what the tools above report comes from /proc, a virtual filesystem where each process is a directory named by its PID:
cat /proc/1234/cmdline | tr '\0' ' ' # the exact command line
ls -l /proc/1234/cwd # its working directory
cat /proc/1234/status # state, memory, the owning UID
ls -l /proc/1234/fd # every open file descriptor
You rarely need this directly, but when ps is not telling you enough - what directory is this process actually running in, what files does it have open - /proc has the ground truth.
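The four lookups combine naturally into one small report. This is a sketch for Linux only; proc_summary is a hypothetical helper name, and reading another user's process this way generally requires root.

```shell
# proc_summary PID (hypothetical helper): print key facts about one process,
# read straight from /proc.
proc_summary() {
    pid=$1
    dir=/proc/$pid
    [ -d "$dir" ] || { echo "no such process: $pid" >&2; return 1; }

    # cmdline is NUL-separated, so turn the separators into spaces
    printf 'cmdline: %s\n' "$(tr '\0' ' ' < "$dir/cmdline")"
    # cwd is a symlink to the working directory
    printf 'cwd:     %s\n' "$(readlink "$dir/cwd")"
    # the State line looks like "State:  S (sleeping)"
    printf 'state:   %s\n' "$(awk '/^State:/ { print $2, $3 }' "$dir/status")"
    # one entry per open file descriptor
    printf 'fds:     %s\n' "$(ls "$dir/fd" | wc -l)"
}

proc_summary "$$"    # inspect the current shell
```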
Putting it together: a runaway process
The whole loop in practice:
1. top (or htop) - find the PID pinning the CPU. Note the user and command.
2. ps -p <pid> -o user,ppid,cmd - confirm what it is and who started it.
3. kill <pid> - ask it to stop, then wait a few seconds.
4. kill -9 <pid> - only if it ignored the polite request.
5. Find out why it ran away before you call it fixed - a cron job, a stuck loop, a bad deploy. Killing the symptom without finding the cause just means it comes back.
That sequence handles the large majority of "the box is slow" pages. Practice it on a test machine: start a busy loop (yes > /dev/null &), watch it in top, then find and stop it. Do it once and it becomes reflex.
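The practice run can also be scripted end to end, which is handy for checking that the commands behave the same on a new box. A sketch, assuming the procps ps (the --sort option is specific to it); ps sorted by CPU stands in for the interactive top step.

```shell
yes > /dev/null &                 # start a busy loop that pins one core
busy=$!
sleep 2                           # let it accumulate some CPU time

# Non-interactive stand-in for top: the heaviest CPU consumers first
ps -eo pid,user,%cpu,stat,comm --sort=-%cpu | head -n 5

ps -p "$busy" -o user,ppid,cmd    # confirm what it is and who started it

kill "$busy"                      # ask nicely (SIGTERM)
wait "$busy"                      # reap it; the exit status encodes the signal
status=$?
echo "exit status: $status"       # 128 + 15 (SIGTERM) = 143 for a clean kill
```

No SIGKILL is needed here because yes does not catch SIGTERM; a process that ignores the polite request is where the escalation step comes in.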