
Performance Tools

Brendan Gregg's 60-Second Analysis

When troubleshooting performance issues, run these commands in order:

1. uptime                    # Load averages
2. dmesg | tail             # Kernel errors
3. vmstat 1                 # Overall stats
4. mpstat -P ALL 1          # CPU balance
5. pidstat 1                # Process usage
6. iostat -xz 1             # Disk I/O
7. free -m                  # Memory usage
8. sar -n DEV 1             # Network I/O
9. sar -n TCP,ETCP 1        # TCP stats
10. top                      # Overview
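
The checklist above can be wrapped in a small script. This is a minimal sketch, not Gregg's own tooling: the names `run_step` and `sixty_second_check` are made up for illustration, the sample count of 5 is an arbitrary choice so that the interval commands terminate, and tools that are not installed (several come from the sysstat package) are skipped rather than treated as errors.

```shell
# run_step TOOL COMMAND: print a header, then run COMMAND if TOOL is installed.
run_step() {
    printf '== %s ==\n' "$2"
    if command -v "$1" >/dev/null 2>&1; then
        sh -c "$2" 2>&1
    else
        printf 'skipped: %s not installed\n' "$1"
    fi
}

# The ten checks in order. Interval commands get a count of 5 so they exit.
sixty_second_check() {
    run_step uptime  'uptime'
    run_step dmesg   'dmesg | tail'
    run_step vmstat  'vmstat 1 5'
    run_step mpstat  'mpstat -P ALL 1 5'
    run_step pidstat 'pidstat 1 5'
    run_step iostat  'iostat -xz 1 5'
    run_step free    'free -m'
    run_step sar     'sar -n DEV 1 5'
    run_step sar     'sar -n TCP,ETCP 1 5'
    run_step top     'top -b -n 1 | head -n 20'
}
```

Source the file and call `sixty_second_check`. On systems that restrict the kernel log, `dmesg` may need root.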

Process Monitoring

top

Interactive process viewer.

top

# Key commands while running:
# P - Sort by CPU
# M - Sort by memory
# k - Kill process
# r - Renice
# 1 - Show individual CPUs
# H - Show threads

Understanding output:

%Cpu(s):  5.2 us,  2.1 sy,  0.0 ni, 92.5 id,  0.1 wa,  0.0 hi,  0.1 si,  0.0 st
         USER    SYS    NICE   IDLE   IOWAIT  HW-INT  SW-INT  STOLEN
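
Since everything except `id` is time the CPU spent doing something, a single "busy" figure is 100 minus the idle value. The sketch below extracts it from the summary line; `cpu_busy` is a hypothetical helper name, and it assumes the procps-style `%Cpu(s):` layout shown above and a locale that uses a decimal point.

```shell
# Read top's summary output on stdin and print 100 minus the idle percentage.
cpu_busy() {
    awk '/^%Cpu/ {
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^id,?$/) {
                idle = $(i - 1)
                sub(/.*,/, "", idle)   # handles fused fields such as "ni,100.0"
                printf "%.1f\n", 100 - idle
            }
        }
    }'
}
```

Typical use is `top -b -n 1 | cpu_busy`. For the sample line above it prints 7.5. With per-CPU display enabled it prints one value per CPU line.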

htop

Enhanced interactive viewer (often not installed by default).

htop

# Features:
# - Color coding
# - Mouse support
# - Tree view (F5)
# - Filter (F4)

ps

Process snapshot.

# All processes
ps aux

# Process tree
ps auxf
ps -ejH

# Custom format
ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%cpu

# Threads
ps -eLf
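
Custom `ps` formats are easy to post-process. As an illustration, the sketch below sums `%mem` per command name, which helps when a service runs many worker processes. `mem_by_command` is a hypothetical helper; it expects lines of the form `<%mem> <command>` on stdin.

```shell
# Sum the first column (%mem) per command name and list the largest first.
mem_by_command() {
    awk '{
        pct = $1
        $1 = ""                 # drop the number, keep the command name
        sub(/^ +/, "")
        total[$0] += pct
    }
    END { for (c in total) printf "%.1f %s\n", total[c], c }' | sort -rn
}
```

A possible invocation is `ps -eo %mem=,comm= | mem_by_command | head`, where the trailing `=` suppresses the header row. Shared pages are counted once per process, so the totals overstate real usage.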

pidstat

Per-process statistics.

# CPU usage
pidstat 1

# Memory
pidstat -r 1

# I/O
pidstat -d 1

# Context switches
pidstat -w 1

# Specific process
pidstat -p <PID> 1

Memory

free

Memory usage summary.

free -h

#               total        used        free      shared  buff/cache   available
# Mem:           15Gi       5.2Gi       8.1Gi       234Mi       2.3Gi        9.8Gi
# Swap:         2.0Gi          0B       2.0Gi

Key fields:

- available: Memory available for applications (includes reclaimable cache)
- buff/cache: File system cache (can be reclaimed)
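
For alerting, the available figure is usually the one to watch. A minimal sketch that turns it into a percentage, assuming the `MemTotal` and `MemAvailable` fields of `/proc/meminfo` (the latter is absent on very old kernels); `mem_available_pct` is a hypothetical helper name.

```shell
# Read /proc/meminfo-style text on stdin and print MemAvailable as a
# percentage of MemTotal. Prints nothing if MemTotal is missing.
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%.1f\n", 100 * avail / total }'
}
```

Run it as `mem_available_pct < /proc/meminfo`.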

vmstat

Virtual memory statistics.

vmstat 1

# r: runnable processes
# b: blocked processes
# swpd: swap used
# free: free memory
# buff, cache: memory used as buffers / page cache
# si: swap in
# so: swap out
# bi: blocks in (read)
# bo: blocks out (write)
# in: interrupts/sec
# cs: context switches/sec
# us/sy/id/wa: CPU percentages
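
Two of these columns carry most of the signal: `r` above the CPU count suggests CPU saturation, and non-zero `si`/`so` means the system is actively swapping. The sketch below flags both. `vmstat_flags` is a hypothetical helper; it looks columns up by header name rather than position, and takes the CPU count as its only argument.

```shell
# Read vmstat output on stdin; report samples where the run queue exceeds
# the given CPU count or where swap-in/swap-out is non-zero.
vmstat_flags() {
    awk -v ncpu="${1:-1}" '
        $1 == "r" { for (i = 1; i <= NF; i++) col[$i] = i; next }   # header row
        !("r" in col) || $1 !~ /^[0-9]+$/ { next }                  # skip banners
        {
            r = $(col["r"]); si = $(col["si"]); so = $(col["so"])
            if (r + 0 > ncpu + 0) print "run queue " r " exceeds " ncpu " CPUs"
            if (si + so > 0)      print "swapping: si=" si " so=" so
        }'
}
```

For example: `vmstat 1 10 | vmstat_flags "$(nproc)"`. The first sample reports averages since boot, so treat it with care.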

pmap

Process memory map.

pmap <PID>
pmap -x <PID>  # Extended

Disk I/O

iostat

I/O statistics.

iostat -xz 1

# Key metrics:
# r/s, w/s: Read/write ops per second
# rkB/s, wkB/s: KB read/written per second
# %util: Device utilization
# await: Average wait time (ms)
# svctm: Service time (deprecated, ignore)

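Interpreting the output:

- High %util: The device is busy most of the time and may be saturated. Devices that serve requests in parallel (SSDs, RAID arrays) can still have headroom near 100%.
- High await: High latency; requests take a long time to complete, including time spent queued.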
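
To pick out busy devices from a long report, the `%util` column can be filtered. This is a sketch under the assumption that the report has a header row starting with `Device` and containing `%util`; the position of that column differs between sysstat versions, so it is looked up by name. `busy_disks` is a hypothetical helper and the default threshold of 80 is arbitrary.

```shell
# Read `iostat -x` output on stdin and print devices whose %util exceeds
# the threshold given as the first argument (default 80).
busy_disks() {
    awk -v limit="${1:-80}" '
        $1 ~ /^Device:?$/ {
            for (i = 1; i <= NF; i++) if ($i == "%util") u = i
            next
        }
        u && NF >= u && $u + 0 > limit + 0 { print $1, $u }'
}
```

For example: `iostat -xz 1 5 | busy_disks 90`.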

iotop

Top-like I/O monitor.

iotop

# -o: Only show active I/O
# -a: Accumulated I/O

df / du

Disk space.

# Filesystem usage
df -h

# Directory size
du -sh /var/log
du -h --max-depth=1 /var

# Find large files
du -ah /var | sort -rh | head -20
find / -xdev -type f -size +100M 2>/dev/null  # stay on one filesystem, hide permission errors
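
A quick way to spot filesystems that are nearly full is to filter the capacity column. The sketch below assumes the POSIX layout produced by `df -P`, where capacity is the fifth column and the mount point the sixth; mount points containing spaces are not handled. `full_filesystems` is a hypothetical helper.

```shell
# Read `df -P` output on stdin and print mount points at or above the
# given usage percentage (default 90).
full_filesystems() {
    awk -v limit="${1:-90}" 'NR > 1 {
        pct = $5
        sub(/%$/, "", pct)
        if (pct + 0 >= limit + 0) print $6, $5
    }'
}
```

For example: `df -P | full_filesystems 85`.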

Network

netstat (legacy)

Network connections and statistics.

# All connections
netstat -an

# Listening
netstat -tln

# With programs
netstat -tlnp

# Statistics
netstat -s

# Routing
netstat -rn

ss

Modern socket statistics.

# All TCP
ss -tan

# Listening
ss -tln

# With process
ss -tlnp

# Filter
ss state established
ss dst 192.168.1.1
ss sport :22
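
A per-state connection count is often more useful than the raw list, for example to spot a pile-up of TIME-WAIT or CLOSE-WAIT sockets. A minimal sketch, assuming the state is the first column of `ss -tan` output and the first line is a header; `tcp_state_counts` is a hypothetical helper.

```shell
# Read `ss -tan` output on stdin and print the number of sockets per state,
# most common state first.
tcp_state_counts() {
    awk 'NR > 1 { count[$1]++ }
         END { for (s in count) print count[s], s }' | sort -rn
}
```

For example: `ss -tan | tcp_state_counts`.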

tcpdump

Packet capture (covered in Network Stack).

tcpdump -i eth0
tcpdump -i eth0 port 80
tcpdump -i eth0 host 192.168.1.1

iftop

Bandwidth usage per connection.

iftop -i eth0

ping / traceroute / mtr

Connectivity and path.

# Reachability
ping -c 4 8.8.8.8

# Path
traceroute google.com

# Continuous traceroute
mtr google.com

System

uptime

System uptime and load.

uptime
# 10:30:45 up 5 days,  2:15,  3 users,  load average: 1.23, 0.85, 0.45
#                                                      1min  5min  15min

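Load average interpretation (compare each value against the number of CPUs):

- < # CPUs: Underutilized
- = # CPUs: Fully utilized
- > # CPUs: Overloaded (processes waiting)

On Linux, the load average also counts tasks in uninterruptible sleep (usually waiting on disk I/O), so a high value does not always mean CPU contention.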
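
The comparison against the CPU count is simple arithmetic: divide the load average by the number of CPUs and read values above 1.00 as more demand than capacity. A minimal sketch; `load_per_cpu` is a hypothetical helper that takes the load and the CPU count as arguments.

```shell
# Print LOAD divided by CPUS with two decimals; fail if CPUS is not positive.
load_per_cpu() {
    awk -v load="$1" -v cpus="$2" 'BEGIN {
        if (cpus + 0 <= 0) exit 1
        printf "%.2f\n", load / cpus
    }'
}
```

`load_per_cpu 5.0 4` prints 1.25, meaning roughly 25% more runnable work than four CPUs can serve. On a live system: `load_per_cpu "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"`.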

uname

System information.

uname -a
# Kernel version, hostname, architecture

dmesg

Kernel ring buffer (logs).

dmesg
dmesg | tail
dmesg | grep -i error
dmesg -T  # Human-readable timestamps

journalctl

systemd journal.

# All logs
journalctl

# Since boot
journalctl -b

# Kernel logs
journalctl -k

# Specific service
journalctl -u nginx

# Follow (tail -f)
journalctl -f

# Priority
journalctl -p err

# Time range
journalctl --since "2025-01-15 10:00" --until "2025-01-15 11:00"

Advanced Tools

strace

System call tracer.

# Trace program
strace ls

# Attach to process
strace -p <PID>

# Specific syscalls (modern libc opens files via openat)
strace -e trace=openat,read ./program

# Timing
strace -T ./program

# Summary
strace -c ./program

# Follow forks
strace -f ./program

lsof

List open files.

# All open files
lsof

# By process
lsof -p <PID>

# By file
lsof /var/log/syslog

# Network
lsof -i
lsof -i :80

# By user
lsof -u username

perf

Performance profiling.

# Record CPU profile
perf record -a -g sleep 10
perf report

# Top functions
perf top

# Specific event
perf stat -e cycles,instructions ./program

# Trace system calls
perf trace ./program

sar

System Activity Reporter.

# CPU usage (historical)
sar

# Memory
sar -r

# Network
sar -n DEV

# Disk
sar -d

# All
sar -A

Performance Metrics

Key Metrics to Check

CPU:

- % utilization
- Load average
- Context switches
- Interrupts

Memory:

- Used vs available
- Swap usage
- Page faults
- Cache hit rate

Disk:

- IOPS (r/s, w/s)
- Throughput (MB/s)
- Latency (await)
- % utilization

Network:

- Throughput (Mbps)
- Packet rate
- Errors/drops
- Connections

Practice Questions

  1. How do you identify which process is consuming the most CPU?
  2. What does a load average of 5.0 mean on a 4-CPU system?
  3. How do you find which process has a file open?
  4. Explain the difference between 'free' and 'available' memory.
  5. How would you trace system calls made by a running process?
  6. What tool would you use to find network bandwidth usage?
  7. How do you identify disk I/O bottlenecks?

Further Reading

  • Brendan Gregg's blog and books
  • man pages for each tool
  • "Systems Performance" by Brendan Gregg