Troubleshooting

Systematic Approach

Follow a structured methodology when troubleshooting:

  1. Define the problem: What exactly is failing?
  2. Gather information: Logs, metrics, symptoms
  3. Form hypotheses: What could cause this?
  4. Test hypotheses: Use tools to verify
  5. Implement fix: Make targeted changes
  6. Verify: Confirm problem is solved
  7. Document: Record findings and solution

Common Scenarios

High CPU Usage

Symptoms: System slow, high load average

Investigation:

# 1. Identify process
top  # Press P to sort by CPU
ps aux --sort=-%cpu | head

# 2. Check what it's doing
strace -p <PID>
perf top -p <PID>

# 3. Check threads
top -H -p <PID>
ps -L -p <PID> -o pid,tid,pcpu,comm

# 4. Look for patterns
mpstat -P ALL 1  # CPU balance?
vmstat 1         # Context switches?

Common causes:

  • Infinite loop
  • CPU-intensive computation
  • Too many threads/processes
  • Inefficient algorithm
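A high load average is easier to judge relative to the number of cores. The helper below is a minimal sketch of that normalization; the function name `load_per_core` is hypothetical and not part of any standard tool.

```shell
#!/bin/sh
# Hypothetical helper: divide a load average by the core count.
# Values well above 1.00 suggest more runnable (or uninterruptible)
# tasks than there are cores to serve them.
load_per_core() {
    # $1 = load average (e.g. 8.0), $2 = number of cores (e.g. 4)
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Live usage on a Linux host (1-minute load average):
# load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

On Linux the load average also counts tasks in uninterruptible sleep, so a high value with idle CPUs usually points at I/O rather than computation.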

High Memory Usage

Symptoms: Low available memory, swapping, OOM kills

Investigation:

# 1. Check overall memory
free -h
vmstat 1

# 2. Find memory hogs
ps aux --sort=-%mem | head
top  # Press M

# 3. Check process details
pmap -x <PID>
grep -i '^vm' /proc/<PID>/status
cat /proc/<PID>/smaps

# 4. Check for leaks
valgrind --leak-check=full ./program

# 5. OOM killer logs
dmesg | grep -i oom
journalctl -k | grep -i oom

Common causes:

  • Memory leak
  • Misconfigured cache
  • Too many processes
  • Insufficient RAM
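To turn the `free -h` check into a single number, one option is to compute the share of memory the kernel reports as available. This is a sketch only: the function name `mem_available_pct` is hypothetical, and it assumes a kernel that exposes `MemAvailable` in `/proc/meminfo`.

```shell
#!/bin/sh
# Hypothetical helper: print MemAvailable as a whole-number percentage of MemTotal.
mem_available_pct() {
    # $1 = path to a meminfo-format file (defaults to /proc/meminfo)
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END {
             if (total > 0) printf "%d\n", avail * 100 / total
             else exit 1
         }' "${1:-/proc/meminfo}"
}

# Live usage:
# mem_available_pct
```

`MemAvailable` already accounts for reclaimable page cache, so it is a better low-memory signal than `MemFree` alone.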

High Disk I/O Wait

Symptoms: System sluggish, high %wa in top

Investigation:

# 1. Confirm I/O wait
top  # Check %wa
vmstat 1  # Check wa column

# 2. Identify processes
iotop
pidstat -d 1

# 3. Check disk stats
iostat -xz 1
# Look for high %util, await

# 4. Check what's being accessed
lsof -p <PID>
strace -e trace=open,openat,read,write -p <PID>  # modern libc opens files via openat

# 5. Check disk health
smartctl -a /dev/sda
dmesg | grep -i error

Common causes:

  • Slow disk
  • Too many I/O requests
  • No file system caching
  • Disk failure
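The %wa figure that `top` and `vmstat` report can also be derived by hand from two samples of the aggregate `cpu` line in `/proc/stat`. This is a rough sketch: the function name `iowait_pct` is hypothetical, and it assumes the usual field order (user, nice, system, idle, iowait, irq, softirq, steal).

```shell
#!/bin/sh
# Hypothetical helper: percentage of CPU time spent in iowait between two samples.
iowait_pct() {
    # $1, $2 = two "cpu ..." lines from /proc/stat, taken some seconds apart
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9; i++) total += $i   # user through steal
            t[NR] = total
            w[NR] = $6                             # field 6 = iowait
        }
        END {
            dt = t[2] - t[1]
            if (dt <= 0) exit 1
            printf "%.1f\n", (w[2] - w[1]) * 100 / dt
        }'
}

# Live usage:
# a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat)
# iowait_pct "$a" "$b"
```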

Network Connectivity Issues

Symptoms: Can't reach server, timeouts

Investigation:

# 1. Test connectivity
ping <target>

# 2. Check route
traceroute <target>
mtr <target>

# 3. DNS resolution
nslookup <hostname>
dig <hostname>
cat /etc/resolv.conf

# 4. Check listening ports
ss -tln
netstat -tln  # legacy equivalent from net-tools, if ss is unavailable

# 5. Firewall rules
iptables -L -n -v

# 6. Check interfaces
ip addr show
ip link show

# 7. Packet capture
tcpdump -ni eth0 host <target>  # replace eth0 with the interface from "ip link show"

Common causes:

  • Network down
  • Firewall blocking
  • DNS failure
  • Routing issue
  • Service not listening
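For the "service not listening" case, the listening-port check can be scripted by filtering `ss -tln` output for one port. This is a sketch: the function name `listeners_on_port` is hypothetical, and it assumes the local address is the fourth column, which is the layout `ss -tln` prints.

```shell
#!/bin/sh
# Hypothetical helper: print the local addresses listening on a given TCP port.
listeners_on_port() {
    # $1 = TCP port; stdin = output of "ss -tln"
    awk -v port="$1" '
        $1 == "LISTEN" {
            n = split($4, parts, ":")   # local address:port is field 4
            if (parts[n] == port) print $4
        }'
}

# Live usage:
# ss -tln | listeners_on_port 443
```

Empty output means nothing is listening on that port. An address of 127.0.0.1 means the service is bound to loopback only and cannot be reached from other hosts.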

DNS Resolution Failures

Symptoms: "Host not found", can ping IP but not hostname

Investigation:

# 1. Test DNS
nslookup google.com
dig google.com

# 2. Check resolver config
cat /etc/resolv.conf
cat /etc/nsswitch.conf

# 3. Check /etc/hosts
cat /etc/hosts

# 4. Test specific nameserver
dig @8.8.8.8 google.com

# 5. Check network
ping 8.8.8.8

# 6. Trace DNS query
dig +trace google.com

Common causes:

  • Nameserver unreachable
  • Wrong nameserver configured
  • Firewall blocking port 53
  • DNS server issue
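Steps 2 and 4 can be combined by querying each configured nameserver in turn, which helps separate "wrong nameserver configured" from "nameserver unreachable". This is a minimal sketch: the function names `list_nameservers` and `check_nameservers` are hypothetical, and `dig` is assumed to be installed.

```shell
#!/bin/sh
# Hypothetical helper: print the nameserver addresses from a resolv.conf-format file.
list_nameservers() {
    # $1 = path to the file (defaults to /etc/resolv.conf)
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Hypothetical helper: ask each configured nameserver to resolve one hostname.
check_nameservers() {
    # $1 = hostname to resolve, $2 = optional resolv.conf path
    for ns in $(list_nameservers "$2"); do
        if dig +short +time=2 +tries=1 "@$ns" "$1" | grep -q .; then
            echo "$ns: answered"
        else
            echo "$ns: no answer"
        fi
    done
}

# Live usage:
# check_nameservers google.com
```

If only some nameservers answer, the failure is likely intermittent from the client's point of view, since the resolver tries them in order.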

Process Consuming Too Much Memory

Investigation:

# 1. Identify process
ps aux --sort=-%mem | head
top

# 2. Detailed memory breakdown
pmap -x <PID>
cat /proc/<PID>/status
cat /proc/<PID>/smaps_rollup

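# 3. Check memory type
# VSZ: Virtual address space size (reserved, not necessarily in RAM)
# RSS: Resident set size (physical memory currently in use)
# SHR: Shared memory (the SHR column in top; counted in RSS)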

# 4. Monitor over time
watch -n 1 'ps -p <PID> -o pid,vsz,rss'

# 5. Check for leaks
valgrind --leak-check=full ./program
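To monitor a process over time without `watch`, RSS can be sampled straight from `/proc/<PID>/status` and logged with a timestamp. This is a sketch under that assumption; the function names `vmrss_kb` and `watch_rss` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the VmRSS value (in kB) from a status-format file.
vmrss_kb() {
    # $1 = path to the file, e.g. /proc/1234/status
    awk '/^VmRSS:/ { print $2 }' "$1"
}

# Hypothetical helper: log RSS for one process at a fixed interval.
watch_rss() {
    # $1 = PID, $2 = number of samples, $3 = seconds between samples
    i=0
    while [ "$i" -lt "$2" ]; do
        echo "$(date +%T) $(vmrss_kb "/proc/$1/status") kB"
        i=$((i + 1))
        sleep "$3"
    done
}

# Live usage: 10 samples, 60 seconds apart
# watch_rss 1234 10 60
```

RSS that climbs steadily under a constant workload is a hint of a leak. RSS that plateaus is more likely a cache or a working set that is simply large.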

System Running Out of PIDs

Symptoms: "Cannot fork: Resource temporarily unavailable"

Investigation:

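# 1. Check limits (fork fails with EAGAIN when any of these is hit)
cat /proc/sys/kernel/pid_max      # system-wide PID/TID ceiling
cat /proc/sys/kernel/threads-max  # system-wide thread ceiling
ulimit -u                         # per-user process limit

# 2. Count processes and threads (threads consume PIDs too)
ps -e --no-headers | wc -l
ps -eL --no-headers | wc -l

# 3. Find parents with the most children
ps -eo ppid= | sort | uniq -c | sort -rn | head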

# 4. Check for fork bomb
ps auxf

Solution:

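# Increase limit (check the current value first; some distributions already set it higher)
sysctl -w kernel.pid_max=65536

# Stop offending process: try SIGTERM first, SIGKILL only if it does not exit
kill <PID>
kill -9 <PID>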
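To find which process is spawning the children, it helps to count processes per parent PID. This is a sketch: the function name `top_parents` is hypothetical, and it expects one parent PID per line on stdin, such as the output of `ps -eo ppid=`.

```shell
#!/bin/sh
# Hypothetical helper: rank parent PIDs by number of children.
top_parents() {
    # stdin = one parent PID per line; $1 = how many parents to print (default 5)
    awk 'NF { count[$1]++ }
         END { for (p in count) print count[p], p }' |
        sort -k1,1nr -k2,2n | head -n "${1:-5}"
}

# Live usage: each output line is "<child count> <parent PID>"
# ps -eo ppid= | top_parents 5
```

A parent with hundreds of children that is not an init system or a known worker pool is a good candidate for closer inspection with `ps -fp <PID>`.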

Application Crash Investigation

For SIGSEGV (segmentation fault):

# 1. Enable core dumps (writing core_pattern requires root)
ulimit -c unlimited
echo "/tmp/core.%e.%p" | sudo tee /proc/sys/kernel/core_pattern

# 2. Reproduce crash

# 3. Analyze core dump
gdb ./program /tmp/core.program.12345
(gdb) bt           # Backtrace
(gdb) info registers
(gdb) frame 0      # Examine frame
(gdb) print var    # Print variable

# 4. Check kernel logs
dmesg | tail
journalctl -xe
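On x86 Linux, the kernel log line for a segfault ends with an `error <code>` field, which is the page-fault error code printed in hexadecimal. The sketch below decodes its low bits. The function name `decode_segv_error` is hypothetical, and the bit meanings assume the x86 page-fault error code layout, so it does not apply to other architectures.

```shell
#!/bin/sh
# Hypothetical helper: decode the hex "error" field of an x86 segfault log line.
decode_segv_error() {
    # $1 = hex digits after "error", e.g. 4, 6, 7, 14, 15
    code=$((0x$1))
    # bit 2: fault happened in user mode
    if [ $((code & 4)) -ne 0 ]; then mode="user-mode"; else mode="kernel-mode"; fi
    # bit 4: instruction fetch; bit 1: write (otherwise read)
    if [ $((code & 16)) -ne 0 ]; then
        access="instruction fetch"
    elif [ $((code & 2)) -ne 0 ]; then
        access="write"
    else
        access="read"
    fi
    # bit 0: page was present (protection violation) or not mapped
    if [ $((code & 1)) -ne 0 ]; then cause="protection violation"; else cause="page not mapped"; fi
    echo "$mode $access, $cause"
}

# Usage: for a log line ending in "error 6 in program[...]"
# decode_segv_error 6
```

A read or write of an unmapped page at a very low address typically indicates a NULL pointer dereference. A protection violation on write often means writing to read-only memory such as a string literal.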

Useful Commands for Troubleshooting

# Check system resources
top, htop, atop

# Check logs
journalctl -xe
tail -f /var/log/syslog
dmesg

# Network
ss, netstat, tcpdump, traceroute

# Disk
df -h, du -sh, iostat, iotop

# Process
ps, pgrep, pidof, kill, pkill

# Files
lsof, fuser

# Performance
vmstat, iostat, sar, perf

Best Practices

  1. Don't panic: Stay calm and systematic
  2. Gather data first: Don't make assumptions
  3. One change at a time: So you know what fixed it
  4. Document everything: For future reference
  5. Check logs: Often have the answer
  6. Ask questions: In an interview scenario, clarify the details before investigating
  7. Think out loud: Especially in interviews

Interview Tips

When given a troubleshooting scenario:

  1. Ask clarifying questions:
       • What changed recently?
       • When did it start?
       • Is it intermittent or constant?
       • What error messages appear?
  2. Propose investigation steps:
       • Start with broad checks
       • Narrow down based on findings
       • Explain the reason for each step
  3. Form hypotheses:
       • Base them on the symptoms
       • Explain your reasoning
       • Test systematically
  4. Consider multiple causes:
       • Don't fixate on the first idea
       • Be ready to pivot

Practice Questions

  1. How would you troubleshoot high CPU usage?
  2. System is slow and swapping heavily. What do you check?
  3. Application crashes with SIGSEGV. How do you investigate?
  4. Users can't reach web server. How do you diagnose?
  5. Disk is full. How do you find what's using space?
  6. DNS isn't working. What are your debugging steps?
  7. System load is high but CPU idle. Why?

Further Reading

  • "Systems Performance" by Brendan Gregg
  • "The Practice of System and Network Administration"
  • Linux man pages