Performance Tools¶
Brendan Gregg's 60-Second Analysis¶
When troubleshooting performance issues, run these commands in order:
1. uptime # Load averages
2. dmesg | tail # Kernel errors
3. vmstat 1 # Overall stats
4. mpstat -P ALL 1 # CPU balance
5. pidstat 1 # Process usage
6. iostat -xz 1 # Disk I/O
7. free -m # Memory usage
8. sar -n DEV 1 # Network I/O
9. sar -n TCP,ETCP 1 # TCP stats
10. top # Overview
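Several commands in this checklist (mpstat, pidstat, iostat, sar) ship in the sysstat package on most distributions and may not be installed by default. A minimal sketch of a pre-flight check follows; `missing_tools` is a hypothetical helper name, not part of any package.

```shell
# Print the name of every listed command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Usage: check the checklist tools before an incident, not during one.
# missing_tools uptime dmesg vmstat mpstat pidstat iostat free sar top
```

Empty output means every tool was found.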
Process Monitoring¶
top¶
Interactive process viewer.
top
# Key commands while running:
# P - Sort by CPU
# M - Sort by memory
# k - Kill process
# r - Renice
# 1 - Show individual CPUs
# H - Show threads
Understanding output:
%Cpu(s): 5.2 us, 2.1 sy, 0.0 ni, 92.5 id, 0.1 wa, 0.0 hi, 0.1 si, 0.0 st
# us: user, sy: system, ni: nice, id: idle, wa: I/O wait
# hi: hardware interrupts, si: software interrupts, st: stolen (by the hypervisor)
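Total CPU busy time is 100 minus the idle value. A minimal sketch, assuming the summary line looks like the sample above (value first, then the label, separated by whitespace); `cpu_busy` is a hypothetical helper name.

```shell
# Read top's "%Cpu(s):" summary line on stdin and print the non-idle percentage.
cpu_busy() {
    awk '/^%Cpu/ {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^id/) { printf "%.1f\n", 100 - $(i - 1); exit }
    }'
}

# Usage (batch mode, one iteration):
# top -bn1 | cpu_busy
```

For the sample line above this prints 7.5. Some top builds drop the space before a value of 100.0, which this sketch does not handle.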
htop¶
Interactive viewer with color, scrolling, and a tree view (often not installed by default).
ps¶
Process snapshot.
# All processes
ps aux
# Process tree
ps auxf
ps -ejH
# Custom format
ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%cpu
# Threads
ps -eLf
pidstat¶
Per-process statistics.
# CPU usage
pidstat 1
# Memory
pidstat -r 1
# I/O
pidstat -d 1
# Context switches
pidstat -w 1
# Specific process
pidstat -p <PID> 1
Memory¶
free¶
Memory usage summary.
free -h
#        total   used    free    shared   buff/cache   available
# Mem:   15Gi    5.2Gi   8.1Gi   234Mi    2.3Gi        9.8Gi
# Swap:  2.0Gi   0B      2.0Gi
Key fields:
- available: Memory available for applications (includes reclaimable cache)
- buff/cache: File system cache (can be reclaimed)
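The available column is usually the number to alert on. A minimal sketch that turns it into a percentage of total, assuming the column layout shown above (available is the 7th field of the Mem: line, which older versions of free do not print); `mem_available_pct` is a hypothetical helper name.

```shell
# Read `free -m` output on stdin and print available memory as a percentage of total.
mem_available_pct() {
    awk '/^Mem:/ { printf "%.1f\n", $7 / $2 * 100 }'
}

# Usage:
# free -m | mem_available_pct
```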
vmstat¶
Virtual memory statistics.
vmstat 1
# r: runnable processes
# b: blocked processes
# swpd: swap used
# free: free memory
# buff, cache: memory used for buffers and page cache
# si: swap in
# so: swap out
# bi: blocks in (read)
# bo: blocks out (write)
# in: interrupts/sec
# cs: context switches/sec
# us/sy/id/wa: CPU percentages
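Sustained non-zero si/so values indicate active swapping. A minimal sketch that filters vmstat samples for them, assuming the default column order listed above (si is the 7th field, so the 8th); `swap_activity` is a hypothetical helper name.

```shell
# Read vmstat output on stdin; for each sample with swap traffic, print its si and so values.
swap_activity() {
    # Header lines are skipped because their first field is not a number.
    awk '$1 ~ /^[0-9]+$/ && ($7 > 0 || $8 > 0) { print "si=" $7, "so=" $8 }'
}

# Usage: ten one-second samples.
# vmstat 1 10 | swap_activity
```

The first vmstat sample reports averages since boot, so treat it separately from the rest.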
pmap¶
Process memory map.
Disk I/O¶
iostat¶
I/O statistics.
iostat -xz 1
# Key metrics:
# r/s, w/s: Read/write ops per second
# rkB/s, wkB/s: KB read/written per second
# %util: Device utilization
# await: Average time per request in ms, including queue time (newer sysstat versions report r_await and w_await separately)
# svctm: Service time (deprecated, ignore)
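Understanding output:
- High %util: Device busy most of the time; likely saturated on a single-queue disk, less conclusive on SSDs and RAID arrays that serve requests in parallel
- High await: High latency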
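iostat column positions differ between sysstat versions, so a filter is safer if it finds %util by name in the header. A minimal sketch under that assumption; `busy_devices` is a hypothetical helper name and the default threshold of 80 is an arbitrary starting point.

```shell
# Read `iostat -x` output on stdin; print devices whose %util is at or above a threshold.
busy_devices() {
    awk -v limit="${1:-80}" '
        # Locate the %util column from the header line ("Device" or "Device:").
        $1 ~ /^Device/ { for (i = 1; i <= NF; i++) if ($i == "%util") col = i; next }
        # Data lines: print device name and utilization when over the limit.
        col && NF >= col && $col + 0 >= limit + 0 { print $1, $col }
    '
}

# Usage: one extended report after a 1-second interval.
# iostat -xz 1 2 | busy_devices 90
```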
iotop¶
Top-like I/O monitor.
df / du¶
Disk space.
# Filesystem usage
df -h
# Directory size
du -sh /var/log
du -h --max-depth=1 /var
# Find large files
du -ah /var | sort -rh | head -20
find / -xdev -type f -size +100M 2>/dev/null  # -xdev: stay on the root filesystem
Network¶
netstat (legacy)¶
Network connections and statistics.
# All connections
netstat -an
# Listening
netstat -tln
# With programs
netstat -tlnp
# Statistics
netstat -s
# Routing
netstat -rn
ss¶
Modern socket statistics.
# All TCP
ss -tan
# Listening
ss -tln
# With process
ss -tlnp
# Filter
ss state established
ss dst 192.168.1.1
ss sport :22
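A quick way to spot connection pile-ups (for example many TIME-WAIT or CLOSE-WAIT sockets) is to count sockets per state. A minimal sketch, assuming the state is the first column of `ss -tan` output and the first line is a header; `tcp_state_counts` is a hypothetical helper name.

```shell
# Read `ss -tan` output on stdin and print a count per TCP state, most common first.
tcp_state_counts() {
    awk 'NR > 1 { count[$1]++ } END { for (s in count) print count[s], s }' | sort -rn
}

# Usage:
# ss -tan | tcp_state_counts
```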
tcpdump¶
Packet capture (covered in Network Stack).
iftop¶
Bandwidth usage per connection.
ping / traceroute / mtr¶
Connectivity and path.
# Reachability
ping -c 4 8.8.8.8
# Path
traceroute google.com
# Continuous traceroute
mtr google.com
System¶
uptime¶
System uptime and load.
Load average interpretation:
- < # CPUs: Underutilized
- = # CPUs: Fully utilized
- > # CPUs: Overloaded (processes waiting)
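The comparison above amounts to dividing the load average by the CPU count. A minimal sketch; `load_per_cpu` is a hypothetical helper name. On Linux the load average also counts tasks in uninterruptible sleep (typically waiting on I/O), so a ratio above 1 does not always mean CPU contention.

```shell
# Print load average divided by CPU count, to two decimals.
load_per_cpu() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

# Usage on Linux (assumes /proc/loadavg and nproc are available):
# load_per_cpu "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)"
```

A load of 5.0 on 4 CPUs gives 1.25, meaning on average about one task more than the CPUs can run at once.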
uname¶
System information.
dmesg¶
Kernel ring buffer (logs).
journalctl¶
systemd journal.
# All logs
journalctl
# Since boot
journalctl -b
# Kernel logs
journalctl -k
# Specific service
journalctl -u nginx
# Follow (tail -f)
journalctl -f
# Priority
journalctl -p err
# Time range
journalctl --since "2025-01-15 10:00" --until "2025-01-15 11:00"
Advanced Tools¶
strace¶
System call tracer.
# Trace program
strace ls
# Attach to process
strace -p <PID>
# Specific syscalls (modern glibc opens files with openat rather than open)
strace -e trace=openat,read ./program
# Timing
strace -T ./program
# Summary
strace -c ./program
# Follow forks
strace -f ./program
lsof¶
List open files.
# All open files
lsof
# By process
lsof -p <PID>
# By file
lsof /var/log/syslog
# Network
lsof -i
lsof -i :80
# By user
lsof -u username
perf¶
Performance profiling.
# Record CPU profile
perf record -a -g sleep 10
perf report
# Top functions
perf top
# Specific event
perf stat -e cycles,instructions ./program
# Trace system calls
perf trace ./program
sar¶
System Activity Reporter.
Performance Metrics¶
Key Metrics to Check¶
CPU:
- % utilization
- Load average
- Context switches
- Interrupts
Memory:
- Used vs available
- Swap usage
- Page faults
- Cache hit rate
Disk:
- IOPS (r/s, w/s)
- Throughput (MB/s)
- Latency (await)
- % utilization
Network:
- Throughput (Mbps)
- Packet rate
- Errors/drops
- Connections
Practice Questions¶
- How do you identify which process is consuming the most CPU?
- What does a load average of 5.0 mean on a 4-CPU system?
- How do you find which process has a file open?
- Explain the difference between 'free' and 'available' memory.
- How would you trace system calls made by a running process?
- What tool would you use to find network bandwidth usage?
- How do you identify disk I/O bottlenecks?
Further Reading¶
- Brendan Gregg's blog and books
- Man pages for each tool
- "Systems Performance" by Brendan Gregg