Linux Performance Monitoring and Troubleshooting

Diagnose a slow or overloaded Linux box methodically: read load average correctly, find whether CPU, memory, or IO is the bottleneck, and use strace, lsof, and ltrace to see what a process is actually doing. The systematic approach interviewers look for.

"The server is slow." It is the vaguest page you will ever get, and how you respond separates strong engineers from the rest. The weak move is to restart things and hope. The strong move - the one interviewers are listening for - is a methodical sweep: is this CPU, memory, IO, or the network? Find the resource that is saturated, find the process responsible, and only then dig into why. This guide is that sweep, plus the tools to go deep when you need to.

Load average, read correctly

Every glance at a busy box starts with load average (uptime, the top line of top, or cat /proc/loadavg):

uptime
# 14:32:01 up 9 days, load average: 3.40, 2.10, 1.05

Three numbers: the average count of tasks that are running or waiting, over the last 1, 5, and 15 minutes. The crucial detail almost everyone gets wrong: load is not CPU usage. On Linux, load counts tasks that are running, waiting for CPU, AND in uninterruptible sleep (usually blocked on disk IO, including network-backed storage such as NFS). A load of 8 can mean eight CPU-hungry processes, or it can mean eight processes all stuck waiting on a slow disk while the CPUs sit idle.
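
To see which of those two cases you are in, you can count process states directly. The sketch below is a rough approximation: count_states is a hypothetical helper name, the sample input is invented, and ps lists processes while load counts individual threads, so the numbers will not match load exactly.

```shell
# count_states: read one process state per line on stdin (as printed by
# `ps -eo state=`) and count runnable (R) and uninterruptible (D) entries.
# Hypothetical helper for illustration; not a standard tool.
count_states() {
    awk '{ s = substr($1, 1, 1); n[s]++ }
         END { printf "R=%d D=%d\n", n["R"], n["D"] }'
}

# Invented sample: two runnable, three blocked in uninterruptible sleep.
printf 'R\nD\nD\nS\nR\nD\n' | count_states

# Live usage on a Linux box:
# ps -eo state= | count_states
```

A high D count alongside idle CPUs points at IO; a high R count points at CPU contention.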

Read it relative to your core count (nproc): a load of 4 is full utilization on a 4-core box and half-load on an 8-core box. And read the trend across the three numbers - 3.40 2.10 1.05 is a load that is climbing (recent higher than older), which is a spike in progress; 1.05 2.10 3.40 is a problem that is resolving. Load tells you something is busy; it does not tell you what. That is the next step.
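
The core-count comparison is simple arithmetic, sketched here. The function name load_per_core is made up for this example, and the sample figures are the 1-minute load from the uptime output above on an assumed 4-core box.

```shell
# load_per_core LOAD CORES: print load divided by core count, two decimals.
# Hypothetical helper name; values near or above 1.00 mean the box is
# at or past one waiting-or-running task per core.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# 1-minute load of 3.40 on an assumed 4-core box.
load_per_core 3.40 4

# Live usage on a Linux box:
# load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```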

Which resource is the bottleneck?

The single most useful starting command is top (or htop). The header lines tell you almost everything if you know how to read the CPU breakdown:

%Cpu(s):  12.3 us,  4.1 sy,  0.0 ni, 80.2 id,  3.1 wa,  0.0 hi,  0.3 si

us (user) - time running application code. High us = the CPU is genuinely busy doing work. Find the process with P (sort by CPU).
sy (system) - time in the kernel. High sy often means heavy syscalls (lots of IO, context switching, or networking).
id (idle) - doing nothing. High idle WITH high load means processes are blocked, not CPU-bound. Look at IO next.
wa (iowait) - CPU idle while tasks wait for disk IO (including network-backed storage) to finish. Waits on ordinary network sockets do not show up here. High wa is the giveaway for an IO bottleneck - the CPU is not the problem, the storage is.

That one line routes you: high us -> CPU-bound (find the busy process); high wa -> IO-bound (find what is hitting disk); neither, but memory pressure -> check memory next.
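
As a sketch of that routing, the snippet below parses a CPU line in the format shown above and prints a hint. The function name classify_cpu_line and the thresholds (20% wa, 50% us) are arbitrary assumptions for illustration, not standard cutoffs, and older top versions format this line differently.

```shell
# classify_cpu_line: read one "%Cpu(s):" line from top on stdin and print
# a rough routing hint. Thresholds are illustrative assumptions.
classify_cpu_line() {
    awk '{
        us = 0; wa = 0
        # Fields arrive as value/label pairs, e.g. "12.3" "us," ... "3.1" "wa,"
        for (i = 2; i <= NF; i++) {
            label = $i; sub(/,$/, "", label)
            if (label == "us") us = $(i - 1)
            if (label == "wa") wa = $(i - 1)
        }
        if (wa >= 20)      print "IO-bound: check iostat/iotop"
        else if (us >= 50) print "CPU-bound: sort top by CPU"
        else               print "neither: check memory and load"
    }'
}

# The sample line from above: low us, low wa.
echo '%Cpu(s):  12.3 us,  4.1 sy,  0.0 ni, 80.2 id,  3.1 wa,  0.0 hi,  0.3 si' | classify_cpu_line

# Live usage (batch mode, one iteration):
# top -bn1 | grep '%Cpu' | classify_cpu_line
```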

Memory: the "available" trap

free -h
#               total        used        free      shared  buff/cache   available
# Mem:           15Gi       4.2Gi       300Mi        80Mi        11Gi        10Gi

The number people panic over is free, and it is the wrong one. Linux deliberately uses almost all free RAM as disk cache (buff/cache) because empty RAM is wasted RAM - and it instantly gives that cache back to any process that needs it. So free being tiny is normal and healthy.

The number that actually matters is available - how much memory a new process could use, accounting for reclaimable cache. If available is healthy, you have memory regardless of what free says. The real warning signs:

free -h                      # is `available` low AND swap filling up?
vmstat 1                     # watch `si`/`so` columns: swap in/out per second
dmesg | grep -i oom          # did the OOM killer fire? (it kills a process to free RAM)

Sustained swap activity (si/so consistently non-zero in vmstat) means you are out of real memory and paging to disk, which is catastrophically slow and shows up as the whole box grinding. And if a process mysteriously vanished, dmesg | grep -i oom tells you whether the kernel's Out-Of-Memory killer reaped it - the fix is adding memory or fixing a memory leak in the app, not a restart.
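
If you want the available figure as a single number for a script or an alert, a minimal sketch is to read it from /proc/meminfo. The function name available_pct is hypothetical, and the sample values are invented to roughly match the free -h output above.

```shell
# available_pct: read /proc/meminfo-style text on stdin and print
# MemAvailable as a whole-number percentage of MemTotal (truncated).
# Hypothetical helper for illustration.
available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", avail * 100 / total }'
}

# Invented sample in kB: 15 GiB total, 300 MiB free, 10 GiB available.
printf 'MemTotal: 15728640 kB\nMemFree: 307200 kB\nMemAvailable: 10485760 kB\n' | available_pct

# Live usage on Linux:
# available_pct < /proc/meminfo
```

Note that the sample box has only about 2% of RAM free yet roughly two thirds available, which is exactly the gap the free column hides.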

IO: when the disk is the bottleneck

When top shows high wa, the disk (or network storage) is the suspect. Measure it:

iostat -xz 1                 # per-device IO stats, refreshed each second
iotop                        # which PROCESS is doing the IO (like top, for disk)
df -h                        # is a filesystem simply full? (a classic "slow/broken" cause)

In iostat -xz 1, the columns that matter are %util (how busy the device is - near 100% means saturated on a single disk, though devices that serve requests in parallel, such as SSDs and RAID arrays, can sit near 100% with headroom left) and await (average IO latency in ms - high values mean slow storage). Read the two together. iotop then names the process responsible. And never skip df -h: a full disk presents as a thousand confusing symptoms (failed writes, stuck services, errors everywhere), and it is a five-second check.
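
The full-disk check is easy to automate. A minimal sketch, assuming POSIX-format df output (df -P) and mount points without spaces; full_filesystems is a hypothetical name and the sample rows are invented.

```shell
# full_filesystems THRESHOLD: read `df -P` output on stdin and print each
# mount point whose usage is at or above THRESHOLD percent.
# Assumes mount points contain no spaces. Hypothetical helper.
full_filesystems() {
    awk -v limit="$1" 'NR > 1 {
        pct = $5; sub(/%$/, "", pct)
        if (pct + 0 >= limit + 0) print $6, pct "%"
    }'
}

# Invented sample: root is nearly full, /data is fine.
printf '%s\n' \
    'Filesystem 1024-blocks Used Available Capacity Mounted on' \
    '/dev/sda1 41152736 39095100 2057636 95% /' \
    '/dev/sdb1 103081248 20616250 82464998 20% /data' | full_filesystems 90

# Live usage:
# df -P | full_filesystems 90
```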

The systematic sweep

Put it together into a repeatable method, top of the stack down. This is the answer to "walk me through debugging a slow server":

uptime - is load high, and rising or falling? Sets the scale of the problem.
top (header) - read the CPU line. High us -> CPU. High wa -> IO. High idle + high load -> blocked processes, look at IO/locks.
free -h - is available low and swap active? Memory pressure can cause everything else.
df -h - is any filesystem full? (Cheap, catches a lot.)
Name the process - top sorted by CPU (P) or memory (M), or iotop for disk. Now you have a PID and a culprit.
Go deep on that process - this is where strace/lsof come in (below).

The discipline is identifying the saturated resource BEFORE touching anything. "I see iowait at 40% and one process pinning the disk in iotop, so this is IO, not CPU" is the sentence that wins interviews and fixes incidents.
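
The read-only part of the sweep can be wrapped in a small script so you run it the same way every time. This is a sketch, not a finished tool: step and sweep are hypothetical names, iotop is left out because it usually needs root, and the head -n 5 cut assumes the default top header layout.

```shell
# step LABEL CMD [ARGS...]: print a header, then run CMD if it is installed.
# Hypothetical helper for this sketch.
step() {
    label=$1; shift
    printf '== %s ==\n' "$label"
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        printf '(%s not installed, skipping)\n' "$1"
    fi
}

# sweep: the read-only checks, in the order listed above.
sweep() {
    step "load average"  uptime
    step "cpu breakdown" sh -c 'top -bn1 | head -n 5'
    step "memory"        free -h
    step "swap activity" vmstat 1 3
    step "filesystems"   df -h
}

# Run it with: sweep
```

Nothing here changes system state, which is the point: you gather the evidence first and decide what to touch afterwards.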

Going deep: `strace`, `lsof`, `ltrace`

Once you have the guilty process, these three answer "what is it actually doing?"

`strace` - the system calls a process makes

strace shows every system call (kernel interaction) a process makes - every file it opens, every network call, every read and write. It is how you find what a stuck or slow process is waiting on:

strace -p 1234               # attach to a running process and watch its syscalls
strace -p 1234 -f            # follow child threads/processes too
strace -c -p 1234            # summary: count and TIME spent per syscall (start here)
strace -e trace=openat ./app # only show file opens - find a missing/wrong path

strace -c is the one to know: let it run for a few seconds, Ctrl+C, and it prints a per-syscall table of call counts, errors, and time. By default that time is CPU time spent in the kernel, so a call that merely sleeps barely registers; add -w (strace -c -w -p 1234) to total wall-clock time instead. With -w, a process stuck in read or poll is waiting on IO or network; one hammering stat/openat on a missing file is failing in a loop, and the errors column gives it away. strace -e trace=openat is the fastest way to find "it cannot find its config" - you literally watch it try paths and get ENOENT. (Note: strace pauses the target and adds overhead, so use it deliberately, not on a hot production process at peak.)

`lsof` - open files, sockets, and ports

Everything in Linux is a file, so "list open files" answers a surprising range of questions:

lsof -p 1234                 # every file, socket, and library this process has open
lsof -i :8080                # which process is using port 8080
lsof +D /var/log             # who has files open under this directory
lsof /var/lib/app/data.db    # who has THIS file open (cannot unmount/delete it?)
lsof -nP | grep deleted      # files deleted but still held open (disk not freeing up!)

That last one is gold: when df says the disk is full but you cannot find the space, it is usually a deleted-but-still-open file (a process holding a log it "rotated"). lsof -nP | grep deleted finds it; the space is released only when the process closes the file, which in practice usually means restarting it.
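
You can reproduce the deleted-but-open situation safely to see what it looks like. The sketch below is Linux-only because it reads /proc; the temp file and the choice of file descriptor 3 are arbitrary.

```shell
# Reproduce a deleted-but-still-open file using only /proc (Linux-only).
tmp=$(mktemp)
exec 3>"$tmp"          # hold the file open on fd 3, as a long-running daemon would
echo "log data" >&3
rm "$tmp"              # unlink it; the blocks stay allocated while fd 3 is open

# The kernel appends "(deleted)" to the fd target, which is what lsof reports.
held=$(readlink "/proc/$$/fd/3")
echo "$held"

exec 3>&-              # closing the descriptor is what finally frees the space
```

The same /proc/PID/fd listing works on any process you have permission to inspect, which helps on boxes where lsof is not installed.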

`ltrace` - the library calls a process makes

ltrace is strace's cousin for library (not kernel) calls - it shows calls into shared libraries like libc (malloc, strcmp, crypto functions). It is more niche, but when a process is burning CPU inside a library rather than in syscalls, ltrace -c -p PID shows where:

ltrace -c -p 1234            # count time spent in each library function

You will reach for strace ninety percent of the time, lsof for the file/port questions, and ltrace occasionally when the work is happening in user-space library code rather than syscalls.

A worked example

Page: "checkout is slow." You SSH in:

uptime -> load 9 on a 4-core box, climbing. Something is saturated.
top -> %wa at 35%, idle 50%, and one postgres process near the top. High iowait + idle CPU = this is IO, not CPU.
iotop -> that postgres backend is reading the disk hard.
iostat -xz 1 -> the data disk is at %util 99%, await 40ms. The storage is saturated.
Now you know the shape: a query is doing heavy disk reads (likely a missing index causing a full scan), and the fix is in the query/schema, not "restart postgres." You would confirm with the database's own tools, but the Linux sweep already told you it is IO-bound and pointed at the exact process.

Compare that to restarting the app and watching it slow down again ten minutes later. The method is the whole point.

What to keep

Load average tells you something is busy but not what (and it includes IO waits, not just CPU). top's CPU line routes you to the right resource - us for CPU, wa for IO. free -h's available (not free) is the real memory number. df -h catches full disks. Then strace -c shows what a process is waiting on, lsof answers every "who has this file/port" question, and ltrace covers the rare in-library case. Identify the saturated resource first, name the process, then go deep - in that order, every time.
