Linux Process and Job Management

Find what is eating CPU, kill a stuck process the right way, and run things in the background. The process model, signals, ps and top, kill and pkill, and job control - the commands you actually reach for when a box is misbehaving.


When a server misbehaves, it is almost always a process: one is pinning the CPU, one is leaking memory, one is stuck and holding a port, or the one you need is not running at all. Knowing how to see processes, understand their state, and stop them cleanly is the difference between a calm fix and a guess that makes things worse.

This is the practical version - the model you need plus the handful of commands that come up over and over.

What a process actually is

A process is a running program. Every process has:

  • A PID (process ID) - a unique number the kernel assigns it.
  • A PPID (parent PID) - the process that started it. Processes form a tree; the root is init/systemd (PID 1).
  • An owner - the user it runs as, which decides what it can touch and who can signal it.
  • A state - running, sleeping, stopped, or zombie (more on these below).

New processes are created when a parent forks a copy of itself and usually execs a new program in it. That parent/child link matters: when you kill a parent, you often want its children to go too, and orphaned children get re-parented to PID 1.
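To see the parent/child link for yourself, here is a minimal sketch, assuming a Linux box with the procps version of ps. The sleep command is only a stand-in for any child program, and the variable names are made up for the example.

```shell
# The shell's own PID is available as $$.
echo "shell PID: $$"

# Start a child in the background: the shell forks, and the child execs sleep.
sleep 30 >/dev/null 2>&1 &
child_pid=$!

# Ask ps for the child's parent PID (the trailing = suppresses the header).
child_ppid=$(ps -o ppid= -p "$child_pid" | tr -d ' ')
echo "child PID: $child_pid, its PPID: $child_ppid"

# Clean up the demo child.
kill "$child_pid"
wait "$child_pid" 2>/dev/null || true
```

The PPID reported for the child should be the PID of the shell that launched it.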

Seeing what is running: ps

ps takes a snapshot of processes. The one invocation worth memorizing:

ps aux              # every process, with user, CPU%, MEM%, and the command

The columns that matter: USER, PID, %CPU, %MEM, STAT (state), and COMMAND. To find a specific process, pipe it to grep:

ps aux | grep nginx          # find nginx processes
ps aux | grep -v grep | grep nginx   # drop the grep line itself

Three more forms you will use:

ps -ef              # same idea, different columns; shows PPID (the parent)
ps -p 1234 -o user,pid,ppid,%cpu,%mem,stat,cmd   # one PID, chosen columns
pgrep -a nginx      # just the PIDs (and command) matching a name

pgrep is the clean way to get PIDs for scripting - no grep | awk dance. Drop the -a when a script needs bare PIDs.
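As a sketch of what that looks like in a script: the helper name is_running is hypothetical, and sleep again stands in for a real service. It assumes pgrep from procps.

```shell
# Hypothetical helper: succeed if any process has exactly this name.
is_running() {
    # -x requires the whole process name to match, not just a substring.
    pgrep -x "$1" >/dev/null
}

# Demo target.
sleep 30 >/dev/null 2>&1 &
demo_pid=$!
sleep 0.2    # give the child a moment to exec

if is_running sleep; then
    found_pids=$(pgrep -x sleep)
    echo "sleep is running as PID(s): $(echo "$found_pids" | tr '\n' ' ')"
fi

# Clean up the demo process.
kill "$demo_pid"
wait "$demo_pid" 2>/dev/null || true
```

Because pgrep sets its exit status, it drops straight into if and && without any text parsing.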

The live view: top and htop

ps is a snapshot; top updates continuously and is how you catch a process in the act.

top                 # live, sorted by CPU by default

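Inside top: press P to sort by CPU, M to sort by memory, k to kill a PID, and q to quit. The header line shows load average - three numbers for the last 1, 5, and 15 minutes. A rough rule: if load average is consistently above your core count, the box has more work queued than it can run. On Linux the figure also counts processes in uninterruptible I/O wait, so a high load with idle CPUs points at storage or network rather than compute.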
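The load-versus-cores rule of thumb can be checked without opening top. This is a rough sketch, assuming Linux (/proc/loadavg) and nproc from GNU coreutils; the variable names and the "overloaded" label are illustrative, not a standard.

```shell
# /proc/loadavg starts with the 1-, 5-, and 15-minute load averages.
read -r load1 load5 load15 _rest < /proc/loadavg

# nproc prints the number of CPUs available to this process.
cores=$(nproc)

# The shell has no floating-point arithmetic, so let awk compare.
if awk -v l="$load15" -v c="$cores" 'BEGIN { exit !(l > c) }'; then
    verdict="overloaded"
else
    verdict="ok"
fi
echo "load $load1 $load5 $load15 on $cores cores: $verdict"
```

Comparing the 15-minute figure rather than the 1-minute one is a choice made here to match "consistently above"; a short spike will not trip it.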

htop is the friendlier version if it is installed - colored bars, scrolling, and you can click to select. Install it (apt install htop or dnf install htop); on a box you control it is worth it.

htop                # nicer interactive process viewer

Signals: how you actually talk to a process

You do not "kill" a process directly - you send it a signal, and the process decides how to respond. The three that matter in practice:

  • SIGTERM (15) - the polite "please shut down." The default. The process can catch it, flush its work, close connections, and exit cleanly. Always try this first.
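  • SIGKILL (9) - the unconditional "die now." The kernel terminates the process immediately; it cannot catch or ignore it. The kernel still closes its file descriptors and frees its memory, but none of the process's own cleanup runs - buffered writes are lost, in-flight work is abandoned, and lock files, PID files, and temp files are left behind. A last resort.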
  • SIGHUP (1) - originally "the terminal hung up." Many daemons repurpose it to mean "reload your config without restarting" (nginx, sshd). Check the docs before assuming.

The instinct to reach for kill -9 immediately is the most common mistake here. SIGKILL gives the process no chance to clean up, which is exactly how you get corrupted files and stale locks. Send SIGTERM, wait a few seconds, and only escalate to SIGKILL if it is genuinely stuck.
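The difference is easy to demonstrate. In this sketch a stand-in "service" holds a lock file and removes it in a SIGTERM handler; start_worker, lock_status, and the lock files are all hypothetical names invented for the example.

```shell
# Stand-in service: removes its lock file when it receives SIGTERM.
start_worker() {
    sh -c '
        trap "rm -f \"$1\"; exit 0" TERM
        # Short sleeps so the handler runs soon after the signal arrives.
        while :; do sleep 0.2; done
    ' worker "$1" >/dev/null 2>&1 &
    worker_pid=$!
}

lock_status() {
    if [ -e "$1" ]; then echo "lock left behind"; else echo "lock cleaned up"; fi
}

# Round 1: SIGTERM. The handler gets to run.
lock_a=$(mktemp)
start_worker "$lock_a"
sleep 1                          # let the worker install its handler
kill -TERM "$worker_pid"
wait "$worker_pid" 2>/dev/null || true
term_result=$(lock_status "$lock_a")

# Round 2: SIGKILL. The handler never runs.
lock_b=$(mktemp)
start_worker "$lock_b"
sleep 1
kill -KILL "$worker_pid"
wait "$worker_pid" 2>/dev/null || true
kill_result=$(lock_status "$lock_b")

echo "after SIGTERM: $term_result"
echo "after SIGKILL: $kill_result"
rm -f "$lock_a" "$lock_b"        # tidy up whatever remains
```

The leftover lock after SIGKILL is the small-scale version of the stale locks and half-written files you get from kill -9 on a real service.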

Stopping processes: kill, pkill, killall

kill 1234           # send SIGTERM (the default) to PID 1234
kill -TERM 1234     # the same thing, explicit
kill -9 1234        # SIGKILL - only if SIGTERM did not work
kill -HUP 1234      # SIGHUP - often "reload config"

By name instead of PID:

pkill nginx         # SIGTERM every process whose name matches
pkill -9 -f "python worker.py"   # match the FULL command line, then SIGKILL
killall nginx       # SIGTERM all processes named exactly "nginx"

pkill -f matches against the whole command line, which is how you target one specific script when several share an interpreter (several python processes, for example). Be careful: a loose pattern can kill more than you meant. Check first with pgrep -af what would match.
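Here is a small sketch of the preview-then-kill habit, using two sleep processes that differ only in their arguments. The durations 301 and 302 are arbitrary markers chosen for the demo. The pattern is anchored with ^ and $ so it cannot match a longer command line that merely contains the text.

```shell
# Two demo processes that share a program but differ in their arguments.
sleep 301 >/dev/null 2>&1 &
keep_pid=$!
sleep 302 >/dev/null 2>&1 &
target_pid=$!
sleep 0.2    # give both a moment to exec

# Preview: which processes would this pattern hit?
pgrep -af '^sleep 302$'

# Only after checking the list, send the signal with the same pattern.
pkill -f '^sleep 302$'
wait "$target_pid" 2>/dev/null || true

# Confirm that only the intended process went away.
if kill -0 "$keep_pid" 2>/dev/null; then keep_alive=yes; else keep_alive=no; fi
if kill -0 "$target_pid" 2>/dev/null; then target_alive=yes; else target_alive=no; fi
echo "kept process alive: $keep_alive, target alive: $target_alive"

# Clean up the survivor.
kill "$keep_pid"
wait "$keep_pid" 2>/dev/null || true
```

Using the identical pattern for pgrep and pkill is the point: what you previewed is what gets signalled.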

The correct escalation when something is stuck:

kill 1234           # 1. ask nicely
sleep 5             # 2. give it a moment
kill -0 1234 2>/dev/null && kill -9 1234   # 3. only if it is still there
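For scripts, the same escalation can be wrapped in a function that polls instead of sleeping a fixed time. This is one possible sketch, not a standard tool: stop_gracefully is a hypothetical name, and the demo target is a shell loop that deliberately ignores SIGTERM.

```shell
# Hypothetical helper: SIGTERM first, SIGKILL only after a timeout.
# Usage: stop_gracefully PID [TIMEOUT_SECONDS]
stop_gracefully() {
    pid=$1
    timeout=${2:-5}

    kill -TERM "$pid" 2>/dev/null || return 0   # already gone

    # kill -0 sends no signal; it only tests whether the PID still exists.
    while [ "$timeout" -gt 0 ] && kill -0 "$pid" 2>/dev/null; do
        sleep 1
        timeout=$((timeout - 1))
    done

    if kill -0 "$pid" 2>/dev/null; then
        echo "PID $pid ignored SIGTERM, sending SIGKILL" >&2
        kill -KILL "$pid"
    fi
}

# Demo: a process that ignores SIGTERM, so the helper has to escalate.
sh -c 'trap "" TERM; while :; do sleep 0.2; done' >/dev/null 2>&1 &
stubborn_pid=$!
sleep 1                                  # let it install the ignore
stop_gracefully "$stubborn_pid" 2
wait "$stubborn_pid" 2>/dev/null || true
```

A well-behaved process exits during the polling loop and never sees SIGKILL; only the stubborn one pays the full timeout.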

Background and foreground: job control

When you run a command in a terminal it holds the foreground until it finishes. Job control lets you push work into the background and get your prompt back.

long-running-task &          # start it in the background
jobs                         # list jobs started from this shell
fg %1                        # bring job 1 back to the foreground
bg %1                        # resume a stopped job in the background

The key sequence: press Ctrl+Z to suspend the foreground process (it sends SIGTSTP, so the process stops but does not die), then bg to let it keep running in the background, or fg to resume it in front. Ctrl+C is different - it sends SIGINT, which by default terminates the process.
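Underneath, suspend and resume are just signals, which you can send by hand. The sketch below uses SIGSTOP rather than the SIGTSTP that Ctrl+Z sends, because SIGSTOP cannot be ignored and so behaves the same inside a script. It assumes procps ps for reading the state.

```shell
sleep 30 >/dev/null 2>&1 &
job_pid=$!
sleep 0.2

# Suspend it, as Ctrl+Z would (SIGSTOP is the uncatchable variant).
kill -STOP "$job_pid"
sleep 0.2
state_stopped=$(ps -o stat= -p "$job_pid" | tr -d ' ')

# Resume it; bg and fg both continue a stopped job with SIGCONT.
kill -CONT "$job_pid"
sleep 0.2
state_resumed=$(ps -o stat= -p "$job_pid" | tr -d ' ')

echo "while stopped: $state_stopped, after resume: $state_resumed"

# Clean up.
kill "$job_pid"
wait "$job_pid" 2>/dev/null || true
```

The stopped state shows up as T in the STAT column; after SIGCONT the process returns to its normal sleeping state.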

The catch most people hit: a backgrounded job still belongs to your shell's session, so when you close the terminal or the connection drops, it is normally sent SIGHUP and dies. To survive a logout:

nohup long-task &            # ignore SIGHUP; output goes to nohup.out
disown %1                    # detach an already-running job from the shell

For anything that genuinely needs to outlive your session - a long migration, a training run - use tmux or screen (a persistent terminal you can detach from and reconnect to) or, better for a real service, a systemd unit. nohup is fine for a one-off; it is not how you run a service.
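To see what nohup actually buys you, send SIGHUP by hand to a protected and an unprotected process. A minimal sketch with sleep as the stand-in task; the durations are arbitrary. Output is redirected explicitly here, so no nohup.out is created.

```shell
nohup sleep 30 >/dev/null 2>&1 &
protected_pid=$!
sleep 31 >/dev/null 2>&1 &
plain_pid=$!
sleep 0.5    # let nohup set SIGHUP to ignored and exec the command

# Simulate the hangup a closing terminal would deliver.
kill -HUP "$protected_pid" "$plain_pid"
sleep 0.5
wait "$plain_pid" 2>/dev/null || true

if kill -0 "$protected_pid" 2>/dev/null; then protected_alive=yes; else protected_alive=no; fi
if kill -0 "$plain_pid" 2>/dev/null; then plain_alive=yes; else plain_alive=no; fi
echo "nohup process alive: $protected_alive, plain process alive: $plain_alive"

# Clean up: SIGTERM still works on the nohup process.
kill "$protected_pid"
wait "$protected_pid" 2>/dev/null || true
```

Note that nohup only shields the process from SIGHUP; it is still an ordinary process that SIGTERM stops as usual.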

Process states and the zombie myth

The STAT column in ps tells you what a process is doing:

  • R - running or runnable (on or waiting for the CPU).
  • S - sleeping, waiting for something (the normal idle state for most processes).
  • D - uninterruptible sleep, usually waiting on disk or network I/O. A process stuck in D cannot even be killed until the I/O returns; lots of D processes points at a storage or network problem.
  • T - stopped (suspended, for example by Ctrl+Z).
  • Z - zombie.

A zombie is not a process eating resources - it is a dead process whose parent has not yet read its exit status. It holds nothing but a slot in the process table. You cannot kill a zombie (it is already dead); you fix it by getting the parent to reap it, or if the parent is broken, by killing the parent so PID 1 adopts and reaps the zombie. A pile of zombies means a parent with a bug, not a resource leak.
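You can make a zombie on purpose to see what one looks like. In this sketch the inner shell starts a short-lived child and then replaces itself with sleep, a program that never collects its children's exit status. It assumes procps ps (for the --ppid option); the variable names are illustrative.

```shell
# The inner shell forks "sleep 0.1", then execs "sleep 5" in its own PID.
# The long sleep never calls wait(), so the finished child stays a zombie.
sh -c 'sleep 0.1 & exec sleep 5' >/dev/null 2>&1 &
parent_pid=$!
sleep 1

# List the children of that parent together with their state.
zombie_stat=$(ps -o stat= --ppid "$parent_pid" | tr -d ' ')
echo "child state: $zombie_stat"

# Signalling the zombie would do nothing. Removing the parent is the fix:
# the orphaned zombie is then adopted and reaped.
kill "$parent_pid"
wait "$parent_pid" 2>/dev/null || true
```

The child shows a state beginning with Z while its parent is alive, and disappears from the process table once the parent is gone.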

Priorities: nice and renice

Every process has a niceness from -20 (greediest) to 19 (most willing to yield). Higher niceness means lower priority. Use it to stop a heavy batch job from starving everything else:

nice -n 10 ./backup.sh       # start a job at lower priority
renice -n 5 -p 1234          # change a running process's priority

Only root can raise priority, which means setting a lower (including negative) niceness; by default an ordinary user can only make their own processes nicer. For most DevOps work you will reach for this rarely - but when a backup or import is hammering a box that also serves traffic, a higher niceness keeps the important work responsive.
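A quick sketch to confirm that the niceness took effect, reading it back from the NI column of ps. It assumes procps ps and the util-linux renice, where -n sets an absolute value; sleep stands in for the batch job.

```shell
# Start a stand-in batch job with 10 added to the current niceness.
nice -n 10 sleep 30 >/dev/null 2>&1 &
batch_pid=$!
sleep 0.2
ni_start=$(ps -o ni= -p "$batch_pid" | tr -d ' ')

# Make it nicer still while it runs. Going in this direction needs no root.
renice -n 15 -p "$batch_pid" >/dev/null
ni_after=$(ps -o ni= -p "$batch_pid" | tr -d ' ')

echo "niceness at start: $ni_start, after renice: $ni_after"

# Clean up.
kill "$batch_pid"
wait "$batch_pid" 2>/dev/null || true
```

nice -n is relative to the launching shell's own niceness, so the starting value is 10 only when the shell itself runs at 0.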

Finding what holds a port or a file

A classic incident: a service will not start because "address already in use." Something is already on the port. Find it:

ss -tlnp                     # listening TCP sockets with the owning process
ss -tlnp | grep :8080        # who is on port 8080
lsof -i :8080                # same question, lsof's view
lsof -p 1234                 # every file (and socket) PID 1234 has open

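ss is the modern replacement for netstat; -tlnp means TCP, listening, numeric, with the process (run it as root to see processes owned by other users). Once you have the PID you can decide whether to stop it or move your service. To go the other way, from a file or mount to the process holding it, give lsof the path instead of a PID: lsof /path/to/file shows who has a file open that you cannot delete, and lsof /mountpoint or fuser -vm /mountpoint shows what is keeping a mount busy.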

The /proc view

Most of what the tools above show comes from /proc, a virtual filesystem where each process is a directory named by its PID:

cat /proc/1234/cmdline | tr '\0' ' '   # the exact command line
ls -l /proc/1234/cwd                   # its working directory
cat /proc/1234/status                  # state, memory, the owning UID
ls -l /proc/1234/fd                    # every open file descriptor

You rarely need this directly, but when ps is not telling you enough - what directory is this process actually running in, what files does it have open - /proc has the ground truth.
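As a sketch of pulling specific facts out of /proc in a script, the example below inspects the current shell (PID $$) so that it works without special permissions. Substitute any PID you are allowed to read. The variable names are made up for the example.

```shell
pid=$$    # inspect this shell itself

# cmdline is NUL-separated, so turn the separators into spaces.
proc_cmd=$(tr '\0' ' ' < "/proc/$pid/cmdline")

# cwd is a symlink to the working directory; readlink resolves it.
proc_cwd=$(readlink "/proc/$pid/cwd")

# status holds "Key:<tab>value" lines; pick out fields with awk.
proc_state=$(awk '/^State:/ { print $2 }' "/proc/$pid/status")
proc_rss=$(awk '/^VmRSS:/ { print $2, $3 }' "/proc/$pid/status")

# One entry per open file descriptor.
proc_fds=$(ls "/proc/$pid/fd" | wc -l)

echo "command: $proc_cmd"
echo "cwd:     $proc_cwd"
echo "state:   $proc_state   rss: $proc_rss   open fds: $proc_fds"
```

Reading another user's /proc/PID/cwd or fd directory generally requires root, which is why the demo targets the shell's own PID.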

Putting it together: a runaway process

The whole loop in practice:

  1. top (or htop) - find the PID pinning the CPU. Note the user and command.
  2. ps -p <pid> -o user,ppid,cmd - confirm what it is and who started it.
  3. kill <pid> - ask it to stop, then wait a few seconds.
  4. kill -9 <pid> - only if it ignored the polite request.
  5. Find out why it ran away before you call it fixed - a cron job, a stuck loop, a bad deploy. Killing the symptom without finding the cause just means it comes back.

That sequence handles the large majority of "the box is slow" pages. Practice it on a test machine: start a busy loop (yes > /dev/null &), watch it in top, then find and stop it. Do it once and it becomes reflex.
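The practice run can also be scripted end to end. This sketch swaps the interactive top for a sorted ps listing so it can run unattended; it assumes procps ps (for --sort), and the variable names are illustrative.

```shell
# Start a deliberately busy process.
yes >/dev/null 2>&1 &
runaway_pid=$!
sleep 1    # let it accumulate some CPU time

# Non-interactive stand-in for top: header plus the five biggest CPU users.
ps -eo pid,user,%cpu,cmd --sort=-%cpu | head -n 6

# Confirm what it is and who started it.
ps -p "$runaway_pid" -o user,ppid,cmd

# Ask it to stop. yes does not ignore SIGTERM, so no escalation is needed.
kill "$runaway_pid"
wait "$runaway_pid" 2>/dev/null || true

if kill -0 "$runaway_pid" 2>/dev/null; then
    runaway_state="still running"
else
    runaway_state="stopped"
fi
echo "runaway process: $runaway_state"
```

On a real incident the last step is still the manual one: working out why the process ran away in the first place.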