Troubleshooting
Systematic Approach
Follow a structured methodology when troubleshooting:
- Define the problem: What exactly is failing?
- Gather information: Logs, metrics, symptoms
- Form hypotheses: What could cause this?
- Test hypotheses: Use tools to verify
- Implement fix: Make targeted changes
- Verify: Confirm the problem is solved
- Document: Record findings and solution
Common Scenarios
High CPU Usage
Symptoms: System slow, high load average
Investigation:
# 1. Identify process
top # Press P to sort by CPU
ps aux --sort=-%cpu | head
# 2. Check what it's doing
strace -p <PID>
perf top -p <PID>
# 3. Check threads
top -H -p <PID>
ps -L -p <PID> -o pid,tid,pcpu,comm
# 4. Look for patterns
mpstat -P ALL 1 # CPU balance?
vmstat 1 # Context switches?
Common causes:
- Infinite loop
- CPU-intensive computation
- Too many threads/processes
- Inefficient algorithm
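The first investigation step can be scripted. The sketch below assumes a procps-style `ps`; the helper name `cpu_hogs` is hypothetical. It filters `PID %CPU COMMAND` rows by a threshold, so the same logic works on live or saved output.

```shell
# cpu_hogs THRESHOLD: read "PID %CPU COMMAND" rows on stdin and print
# the rows whose %CPU is at or above THRESHOLD (hypothetical helper).
# Only the first word of the command name is kept.
cpu_hogs() {
    awk -v limit="$1" '$2 + 0 >= limit + 0 { print $1, $2, $3 }'
}

# Live usage (assumes procps ps; the trailing "=" suppresses the header):
#   ps -eo pid=,pcpu=,comm= | cpu_hogs 50
```

Reading from stdin rather than calling `ps` inside the function keeps the filter usable on output captured during an incident.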
High Memory Usage
Symptoms: Low available memory, swapping, OOM kills
Investigation:
# 1. Check overall memory
free -h
vmstat 1
# 2. Find memory hogs
ps aux --sort=-%mem | head
top # Press M
# 3. Check process details
pmap -x <PID>
grep -i '^Vm' /proc/<PID>/status
cat /proc/<PID>/smaps
# 4. Check for leaks
valgrind --leak-check=full ./program
# 5. OOM killer logs
dmesg | grep -i oom
journalctl -k | grep -i oom
Common causes:
- Memory leak
- Misconfigured cache
- Too many processes
- Insufficient RAM
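To turn the overall memory check into a single number, the following sketch computes available memory as a percentage of total. It assumes a Linux kernel that exposes `MemAvailable` in `/proc/meminfo`; the helper name `mem_available_pct` is hypothetical.

```shell
# mem_available_pct: read /proc/meminfo-formatted text on stdin and print
# MemAvailable as a whole-number percentage of MemTotal.
# Exits non-zero if MemTotal is missing from the input.
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END {
             if (total <= 0) exit 1
             printf "%d\n", avail * 100 / total
         }'
}

# Live usage (Linux only):
#   mem_available_pct < /proc/meminfo
```

A low value alongside non-zero `si`/`so` columns in `vmstat 1` suggests real memory pressure rather than memory held by the page cache.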
High Disk I/O Wait
Symptoms: System sluggish, high %wa in top
Investigation:
# 1. Confirm I/O wait
top # Check %wa
vmstat 1 # Check wa column
# 2. Identify processes
iotop
pidstat -d 1
# 3. Check disk stats
iostat -xz 1
# Look for high %util, await
# 4. Check what's being accessed
lsof -p <PID>
strace -e trace=open,openat,read,write -p <PID> # modern libc opens files via openat
# 5. Check disk health
smartctl -a /dev/sda
dmesg | grep -i error
Common causes:
- Slow disk
- Too many I/O requests
- No file system caching
- Disk failure
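The `%wa` figure shown by `top` can also be derived directly from `/proc/stat`. This is a minimal sketch, assuming the Linux `cpu` line layout (user, nice, system, idle, iowait, irq, softirq, steal); the helper names `iowait_pct` and `sample_iowait` are hypothetical.

```shell
# iowait_pct: read two aggregate "cpu ..." lines (earlier sample first)
# on stdin and print the share of elapsed CPU time spent in iowait.
iowait_pct() {
    awk '$1 == "cpu" {
             total = 0
             # Fields 2-9: user nice system idle iowait irq softirq steal.
             # guest/guest_nice (10-11) are already counted in user/nice.
             for (i = 2; i <= 9; i++) total += $i
             n++; t[n] = total; w[n] = $6
         }
         END {
             if (n < 2 || t[2] == t[1]) exit 1
             printf "%.1f\n", (w[2] - w[1]) * 100 / (t[2] - t[1])
         }'
}

# sample_iowait SECONDS: take two live samples SECONDS apart (Linux only).
sample_iowait() {
    { grep '^cpu ' /proc/stat; sleep "$1"; grep '^cpu ' /proc/stat; } | iowait_pct
}
```

Calling `sample_iowait 5` gives the average over a five-second window, which is less noisy than a single one-second reading.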
Network Connectivity Issues
Symptoms: Can't reach server, timeouts
Investigation:
# 1. Test connectivity
ping <target>
# 2. Check route
traceroute <target>
mtr <target>
# 3. DNS resolution
nslookup <hostname>
dig <hostname>
cat /etc/resolv.conf
# 4. Check listening ports
ss -tln
netstat -tln
# 5. Firewall rules
iptables -L -n -v
# 6. Check interfaces
ip addr show
ip link show
# 7. Packet capture
tcpdump -i eth0 host <target> # replace eth0 with the interface found in step 6
Common causes:
- Network down
- Firewall blocking
- DNS failure
- Routing issue
- Service not listening
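The "service not listening" cause is easy to confirm mechanically. The sketch below assumes the column layout printed by `ss -tln` (state in column 1, local address and port in column 4); the helper name `is_listening` is hypothetical.

```shell
# is_listening PORT: read `ss -tln` output on stdin and succeed if any
# socket in LISTEN state has a local address ending in :PORT.
# Assumes the State column is first and Local Address:Port is fourth.
is_listening() {
    awk -v port="$1" '
        $1 == "LISTEN" {
            # Split on ":" and take the last piece so IPv6 forms
            # such as [::]:80 are handled as well.
            n = split($4, part, ":")
            if (part[n] == port) found = 1
        }
        END { exit !found }'
}

# Live usage:
#   ss -tln | is_listening 443 && echo "something is listening on 443"
```

If the port is listening locally but remote clients still time out, the firewall and routing checks are the next places to look.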
DNS Resolution Failures
Symptoms: "Host not found", can ping IP but not hostname
Investigation:
# 1. Test DNS
nslookup google.com
dig google.com
# 2. Check resolver config
cat /etc/resolv.conf
cat /etc/nsswitch.conf
# 3. Check /etc/hosts
cat /etc/hosts
# 4. Test specific nameserver
dig @8.8.8.8 google.com
# 5. Check network
ping 8.8.8.8
# 6. Trace DNS query
dig +trace google.com
Common causes:
- Nameserver unreachable
- Wrong nameserver configured
- Firewall blocking port 53
- DNS server issue
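Testing each configured nameserver in turn separates "wrong nameserver configured" from "nameserver unreachable". The following sketch extracts the servers from resolver configuration; the helper name `list_nameservers` is hypothetical, and the commented loop assumes `dig` is installed.

```shell
# list_nameservers: read resolv.conf-formatted text on stdin and print
# one configured nameserver address per line, skipping comments.
list_nameservers() {
    awk '$1 == "nameserver" && $2 != "" { print $2 }'
}

# Live usage (assumes dig is available; google.com is only a test name):
#   for ns in $(list_nameservers < /etc/resolv.conf); do
#       dig +short +time=2 +tries=1 @"$ns" google.com >/dev/null \
#           || echo "no answer from $ns"
#   done
```

On hosts running a local stub resolver, the file may list only a loopback address, in which case the upstream servers have to be read from that resolver's own configuration.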
Process Consuming Too Much Memory
Investigation:
# 1. Identify process
ps aux --sort=-%mem | head
top
# 2. Detailed memory breakdown
pmap -x <PID>
cat /proc/<PID>/status
cat /proc/<PID>/smaps_rollup
# 3. Check memory type
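# VSZ: Virtual memory size (address space mapped, not necessarily in RAM)
# RSS: Resident set size (physical memory currently in use)
# SHR: Shared memory (shared with other processes, as reported by top)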
# 4. Monitor over time
watch -n 1 'ps -p <PID> -o pid,vsz,rss'
# 5. Check for leaks
valgrind --leak-check=full ./program
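Monitoring over time is more useful when the samples are reduced to a trend. This is a sketch assuming Linux `/proc`; the helper names `rss_kb` and `rss_growth_kb` are hypothetical.

```shell
# rss_kb PID: print the resident set size of PID in kB (Linux /proc).
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# rss_growth_kb: read one RSS sample (kB) per line on stdin and print
# the last sample minus the first. Exits non-zero on empty input.
rss_growth_kb() {
    awk 'NR == 1 { first = $1 }
         { last = $1 }
         END {
             if (NR == 0) exit 1
             print last - first
         }'
}

# Live usage: five samples, one minute apart
#   for i in 1 2 3 4 5; do rss_kb <PID>; sleep 60; done | rss_growth_kb
```

Growth that continues across a long window under steady load hints at a leak; a one-off jump is more often a cache warming up.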
System Running Out of PIDs
Symptoms: "Cannot fork: Resource temporarily unavailable"
Investigation:
# 1. Check PID limit
cat /proc/sys/kernel/pid_max
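# 2. Count processes and threads (each thread consumes a PID)
ps -eL --no-headers | wc -l
# 3. Find processes with the most threads, then parents with the most children
ps -eLf | awk 'NR > 1 {print $2}' | sort | uniq -c | sort -rn | head
ps -eo ppid= | sort | uniq -c | sort -rn | head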
# 4. Check for fork bomb
ps auxf
Solution:
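One possible remediation sequence is sketched below. It is an assumption about what fits here, not a prescribed fix: the helper name `pid_usage_pct` is hypothetical, the commented commands need root, and `<PPID>` stands for whichever parent the investigation identified.

```shell
# pid_usage_pct COUNT MAX: print COUNT as a whole-number percentage of MAX.
# Exits non-zero if MAX is not positive.
pid_usage_pct() {
    awk -v n="$1" -v max="$2" 'BEGIN {
        if (max <= 0) exit 1
        printf "%d\n", n * 100 / max
    }'
}

# Live usage (Linux; every thread consumes a PID, hence -L):
#   pid_usage_pct "$(ps -eL --no-headers | wc -l)" "$(cat /proc/sys/kernel/pid_max)"
#
# Possible remediation, as root, once the offending parent is known:
#   kill -STOP <PPID>                  # stop it from forking further
#   pkill -KILL -P <PPID>              # kill its direct children
#   kill -KILL <PPID>                  # then the parent itself
#   sysctl -w kernel.pid_max=4194304   # raise the limit (64-bit maximum)
#
# If usage is far below pid_max, check the per-user limit instead:
#   ulimit -u
```

Stopping the parent before killing its children avoids a race in which it respawns them faster than they are removed.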
Application Crash Investigation
For SIGSEGV (segmentation fault):
# 1. Enable core dumps
ulimit -c unlimited
echo "/tmp/core.%e.%p" > /proc/sys/kernel/core_pattern # requires root
# 2. Reproduce crash
# 3. Analyze core dump
gdb ./program /tmp/core.program.12345
(gdb) bt # Backtrace
(gdb) info registers
(gdb) frame 0 # Examine frame
(gdb) print var # Print variable
# 4. Check kernel logs
dmesg | tail
journalctl -xe
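Before reproducing the crash, it helps to confirm that a core file will actually be written and where it will land. The sketch below assumes a simple file-based `core_pattern` like the one set in step 1; the helper names `core_limit_ok` and `expand_core_pattern` are hypothetical.

```shell
# core_limit_ok: succeed if this shell's core file size limit is non-zero.
core_limit_ok() {
    [ "$(ulimit -c)" != "0" ]
}

# expand_core_pattern PATTERN EXE PID: substitute %e (executable name)
# and %p (PID) in PATTERN. Other % specifiers are left untouched, and
# EXE must not contain "/" or "&" because it is passed through sed.
expand_core_pattern() {
    printf '%s\n' "$1" | sed -e "s/%e/$2/g" -e "s/%p/$3/g"
}

# Live usage (Linux only):
#   core_limit_ok || echo "core dumps disabled: run ulimit -c unlimited"
#   expand_core_pattern "$(cat /proc/sys/kernel/core_pattern)" program 12345
```

If `core_pattern` begins with `|`, the kernel pipes the dump to a handler instead of writing a file; on systemd-based systems `coredumpctl gdb` is then the usual way to open it.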
Useful Commands for Troubleshooting
# Check system resources
top, htop, atop
# Check logs
journalctl -xe
tail -f /var/log/syslog
dmesg
# Network
ss, netstat, tcpdump, traceroute
# Disk
df -h, du -sh, iostat, iotop
# Process
ps, pgrep, pidof, kill, pkill
# Files
lsof, fuser
# Performance
vmstat, iostat, sar, perf
Best Practices
- Don't panic: Stay calm and systematic
- Gather data first: Don't make assumptions
- One change at a time: So you know what fixed it
- Document everything: For future reference
- Check logs: They often contain the answer
- Ask questions: In an interview scenario, clarify the details first
- Think out loud: Especially in interviews
Interview Tips
When given a troubleshooting scenario:
- Ask clarifying questions:
  - What changed recently?
  - When did it start?
  - Is it intermittent or constant?
  - What error messages?
- Propose investigation steps:
  - Start with broad checks
  - Narrow down based on findings
  - Explain why each step
- Form hypotheses:
  - Based on symptoms
  - Explain reasoning
  - Test systematically
- Consider multiple causes:
  - Don't fixate on first idea
  - Be ready to pivot
Practice Questions
- How would you troubleshoot high CPU usage?
- System is slow and swapping heavily. What do you check?
- Application crashes with SIGSEGV. How do you investigate?
- Users can't reach web server. How do you diagnose?
- Disk is full. How do you find what's using space?
- DNS isn't working. What are your debugging steps?
- System load is high but the CPU is idle. Why?
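As a starting point for the last question: on Linux, load average counts tasks in uninterruptible sleep (state `D`, usually waiting on I/O) as well as runnable tasks, so load can be high while the CPU sits idle. The sketch below counts such tasks, assuming a procps-style `ps`; the helper name `count_state` is hypothetical.

```shell
# count_state STATE: read one process state string per line on stdin
# (as printed by `ps -eo state=`) and print how many begin with STATE.
count_state() {
    awk -v s="$1" 'substr($1, 1, 1) == s { n++ } END { print n + 0 }'
}

# Live usage:
#   ps -eo state= | count_state D
```

A persistently non-zero count points the investigation toward disk or network file system latency rather than CPU.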
Further Reading
- "Systems Performance" by Brendan Gregg
- "The Practice of System and Network Administration" by Thomas A. Limoncelli, Christina J. Hogan, and Strata R. Chalup
- Linux man pages