Troubleshooting¶

Systematic Approach¶

Follow a structured methodology when troubleshooting:

Define the problem: What exactly is failing?
Gather information: Logs, metrics, symptoms
Form hypotheses: What could cause this?
Test hypotheses: Use tools to verify
Implement fix: Make targeted changes
Verify: Confirm problem is solved
Document: Record findings and solution
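
The "gather information" step is easier to repeat under pressure if it is scripted. The following is a minimal sketch, not a prescribed tool: the helper name `capture` and the output layout are hypothetical, and commands that are not installed are skipped rather than treated as errors.

```shell
# capture DIR NAME CMD [ARGS...]
# Runs CMD and saves its stdout and stderr to DIR/NAME.txt.
# If CMD is not installed, records that fact and returns success.
capture() {
  dir=$1; name=$2; shift 2
  mkdir -p "$dir" || return 1
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "skipped: $1 not found" > "$dir/$name.txt"
    return 0
  fi
  "$@" > "$dir/$name.txt" 2>&1
}

# Example: snapshot a few basics into a timestamped directory.
# out="/tmp/triage-$(date +%Y%m%d-%H%M%S)"
# capture "$out" uptime uptime
# capture "$out" memory free -h
# capture "$out" disk   df -h
# capture "$out" kernel dmesg
```

Keeping the snapshot in one directory also covers the "document" step, since the raw evidence can be attached to the incident notes.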

Common Scenarios¶

High CPU Usage¶

Symptoms: System slow, high load average

Investigation:

# 1. Identify process
top  # Press P to sort by CPU
ps aux --sort=-%cpu | head

# 2. Check what it's doing
strace -p <PID>
perf top -p <PID>

# 3. Check threads
top -H -p <PID>
ps -L -p <PID> -o pid,tid,pcpu,comm

# 4. Look for patterns
mpstat -P ALL 1  # CPU balance?
vmstat 1         # Context switches?

Common causes:

- Infinite loop
- CPU-intensive computation
- Too many threads/processes
- Inefficient algorithm
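
Whether a load average counts as "high" depends on how many CPUs the machine has. As a rough sketch, the hypothetical helper `load_per_core` below divides a load average by a core count; on Linux the inputs would typically come from /proc/loadavg and `nproc`. A result well above 1.0 suggests more runnable or blocked tasks than CPUs.

```shell
# load_per_core LOAD CORES
# Prints LOAD divided by CORES with two decimals.
load_per_core() {
  awk -v load="$1" -v cores="$2" 'BEGIN {
    if (cores <= 0) { print "cores must be > 0" > "/dev/stderr"; exit 1 }
    printf "%.2f\n", load / cores
  }'
}

# Typical Linux usage (assumes /proc/loadavg and nproc are available):
# load_per_core "$(cut -d" " -f1 /proc/loadavg)" "$(nproc)"
```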

High Memory Usage¶

Symptoms: Low available memory, swapping, OOM kills

Investigation:

# 1. Check overall memory
free -h
vmstat 1

# 2. Find memory hogs
ps aux --sort=-%mem | head
top  # Press M

# 3. Check process details
pmap -x <PID>
grep -i '^vm' /proc/<PID>/status
cat /proc/<PID>/smaps

# 4. Check for leaks
valgrind --leak-check=full ./program

# 5. OOM killer logs
dmesg | grep -i oom
journalctl -k | grep -i oom

Common causes:

- Memory leak
- Misconfigured cache
- Too many processes
- Insufficient RAM
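
The "free" column of `free -h` understates usable memory because it excludes reclaimable cache. The kernel's own estimate is the MemAvailable field in /proc/meminfo (present since Linux 3.14). The sketch below, with the hypothetical helper name `mem_available_pct`, turns that field into a percentage of total memory.

```shell
# mem_available_pct [FILE]
# Prints MemAvailable as a percentage of MemTotal.
# FILE defaults to /proc/meminfo; any file in the same format works.
mem_available_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total <= 0) exit 1
      printf "%.1f\n", avail * 100 / total
    }
  ' "${1:-/proc/meminfo}"
}
```

A persistently low percentage together with non-zero si/so columns in `vmstat 1` points to real memory pressure rather than ordinary cache use.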

High Disk I/O Wait¶

Symptoms: System sluggish, high %wa in top

Investigation:

# 1. Confirm I/O wait
top  # Check %wa
vmstat 1  # Check wa column

# 2. Identify processes
iotop
pidstat -d 1

# 3. Check disk stats
iostat -xz 1
# Look for high %util, await

# 4. Check what's being accessed
lsof -p <PID>
strace -e trace=open,openat,read,write -p <PID>  # Modern libc opens files via openat

# 5. Check disk health
smartctl -a /dev/sda
dmesg | grep -i error

Common causes:

- Slow disk
- Too many I/O requests
- Ineffective file system caching (for example, memory pressure shrinking the page cache)
- Disk failure
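
On Linux, processes blocked on I/O sit in uninterruptible sleep (state D). They count toward the load average without using CPU, which is why load can be high while the CPU looks idle. The sketch below filters `ps` output for such processes; the helper name `d_state_procs` is hypothetical.

```shell
# d_state_procs
# Reads "STATE PID COMMAND" lines on stdin (as printed by
# `ps -eo state=,pid=,comm=`) and prints the PID and command of
# every process whose state starts with D.
d_state_procs() {
  awk '$1 ~ /^D/ { $1 = ""; sub(/^ /, ""); print }'
}

# Typical usage on Linux:
# ps -eo state=,pid=,comm= | d_state_procs
```

Running it a few times in a row helps separate processes that are briefly waiting on a read from ones that are stuck.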

Network Connectivity Issues¶

Symptoms: Can't reach server, timeouts

Investigation:

# 1. Test connectivity
ping <target>

# 2. Check route
traceroute <target>
mtr <target>

# 3. DNS resolution
nslookup <hostname>
dig <hostname>
cat /etc/resolv.conf

# 4. Check listening ports
ss -tln
netstat -tln

# 5. Firewall rules
iptables -L -n -v  # On nftables-based systems: nft list ruleset

# 6. Check interfaces
ip addr show
ip link show

# 7. Packet capture
tcpdump -i eth0 host <target>  # Replace eth0 with your interface (see `ip link show`)

Common causes:

- Network down
- Firewall blocking
- DNS failure
- Routing issue
- Service not listening
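
The checks above work best when run in order, from the local interface outward, stopping at the first layer that fails. One way to sketch that idea is a small runner that takes labelled commands and reports the first failure. The helper name `first_failing_check` is hypothetical, and the addresses and hostname in the example are placeholders.

```shell
# first_failing_check "LABEL=COMMAND" ...
# Runs each COMMAND in order; prints the LABEL of the first one that fails
# and returns 1, or prints "all checks passed" and returns 0.
first_failing_check() {
  for entry in "$@"; do
    label=${entry%%=*}   # text before the first "="
    cmd=${entry#*=}      # text after the first "="
    if ! sh -c "$cmd" >/dev/null 2>&1; then
      echo "$label"
      return 1
    fi
  done
  echo "all checks passed"
}

# Example (addresses and hostname are placeholders, adjust to your network):
# first_failing_check \
#   "interface up=ip link show up | grep -q 'state UP'" \
#   "gateway reachable=ping -c 1 -W 2 192.0.2.1" \
#   "dns resolves=getent hosts example.com" \
#   "port 443 open=nc -z -w 2 example.com 443"
```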

DNS Resolution Failures¶

Symptoms: "Host not found", can ping IP but not hostname

Investigation:

# 1. Test DNS
nslookup google.com
dig google.com

# 2. Check resolver config
cat /etc/resolv.conf
cat /etc/nsswitch.conf

# 3. Check /etc/hosts
cat /etc/hosts

# 4. Test specific nameserver
dig @8.8.8.8 google.com

# 5. Check network
ping 8.8.8.8

# 6. Trace DNS query
dig +trace google.com

Common causes:

- Nameserver unreachable
- Wrong nameserver configured
- Firewall blocking port 53
- DNS server issue
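
When several nameservers are configured, it helps to query each one directly instead of relying on the resolver's fallback order. The sketch below extracts the configured servers from a resolv.conf-style file; the helper name `nameservers` is hypothetical. Note that on systems using systemd-resolved the file may list only the local stub address 127.0.0.53.

```shell
# nameservers [FILE]
# Prints the nameserver addresses listed in a resolv.conf-style file.
# FILE defaults to /etc/resolv.conf.
nameservers() {
  awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Query each configured resolver directly to see which one fails
# (assumes dig is installed; example.com is a placeholder name):
# for ns in $(nameservers); do
#   printf '%s: ' "$ns"
#   dig +short +time=2 +tries=1 "@$ns" example.com || echo "query failed"
# done
```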

Process Consuming Too Much Memory¶

Investigation:

# 1. Identify process
ps aux --sort=-%mem | head
top

# 2. Detailed memory breakdown
pmap -x <PID>
cat /proc/<PID>/status
cat /proc/<PID>/smaps_rollup

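# 3. Interpret the memory columns
# VSZ: Virtual address space reserved by the process (not all of it is in RAM)
# RSS: Resident set size, the pages currently held in physical memory
# SHR: Resident pages shared with other processes (top's SHR column)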

# 4. Monitor over time
watch -n 1 'ps -p <PID> -o pid,vsz,rss'

# 5. Check for leaks
valgrind --leak-check=full ./program
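
A leak usually shows up as RSS that climbs steadily and never falls back. As a minimal sketch of the "monitor over time" step, the hypothetical helper `vmrss_kb` below reads the VmRSS field from a /proc/<PID>/status-style file so the value can be logged at intervals.

```shell
# vmrss_kb FILE
# Prints the VmRSS value (in kB) from a /proc/<PID>/status-style file.
vmrss_kb() {
  awk '$1 == "VmRSS:" { print $2 }' "$1"
}

# Log RSS once a minute to spot steady growth (PID is a placeholder):
# while sleep 60; do
#   printf '%s %s\n' "$(date +%T)" "$(vmrss_kb /proc/PID/status)"
# done
```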

System Running Out of PIDs¶

Symptoms: "Cannot fork: Resource temporarily unavailable"

Investigation:

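# 1. Check the system-wide PID limit and the per-user process limit
cat /proc/sys/kernel/pid_max
ulimit -u

# 2. Count processes and threads (each thread also consumes a PID)
ps -e --no-headers | wc -l
ps -eL --no-headers | wc -l

# 3. Find processes with the most threads, and parents with the most children
ps -eLo pid= | sort | uniq -c | sort -rn | head
ps -eo ppid= | sort | uniq -c | sort -rn | head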

# 4. Check for fork bomb
ps auxf

Solution:

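# Raise the limit if the current value is lower (maximum is 4194304 on 64-bit kernels)
sysctl -w kernel.pid_max=65536

# Stop the offending process: try SIGTERM first, SIGKILL only if it is ignored
kill <PID>
kill -9 <PID>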
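
Threads draw from the same ID pool that `pid_max` bounds, so a single process with a runaway thread count can exhaust PIDs without many processes being visible. The sketch below tallies threads per process from `ps` output; the helper name `threads_per_pid` is hypothetical.

```shell
# threads_per_pid
# Reads one PID per line on stdin (one line per thread, as printed by
# `ps -eLo pid=`) and prints "COUNT PID", highest count first.
threads_per_pid() {
  awk '{ count[$1]++ } END { for (pid in count) print count[pid], pid }' |
    sort -k1,1nr -k2,2n
}

# Typical usage on Linux, compared against the limit:
# ps -eLo pid= | threads_per_pid | head
# cat /proc/sys/kernel/pid_max
```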

Application Crash Investigation¶

For SIGSEGV (segmentation fault):

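# 1. Enable core dumps (writing core_pattern requires root)
ulimit -c unlimited
echo "/tmp/core.%e.%p" > /proc/sys/kernel/core_pattern
# On systemd systems, cores may instead go to systemd-coredump (see coredumpctl)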

# 2. Reproduce crash

# 3. Analyze core dump
gdb ./program /tmp/core.program.12345
(gdb) bt           # Backtrace
(gdb) info registers
(gdb) frame 0      # Examine frame
(gdb) print var    # Print variable

# 4. Check kernel logs
dmesg | tail
journalctl -xe
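
Even without a core dump, the kernel log usually records the faulting address and instruction pointer. The sketch below pulls those fields out of segfault records. It assumes the common x86 Linux message layout (`name[pid]: segfault at ADDR ip IP sp SP error N in ...`); the helper name `segfault_summary` is hypothetical, and other architectures or kernel versions may format the line differently.

```shell
# segfault_summary
# Reads kernel log lines on stdin and, for each "segfault" record, prints
# the process name[pid], the faulting address, and the instruction pointer.
segfault_summary() {
  awk '{
    for (i = 2; i + 4 <= NF; i++)
      if ($i == "segfault" && $(i + 1) == "at" && $(i + 3) == "ip") {
        proc = $(i - 1)
        sub(/:$/, "", proc)   # drop the trailing colon after name[pid]
        printf "proc=%s addr=%s ip=%s\n", proc, $(i + 2), $(i + 4)
      }
  }'
}

# Typical usage:
# dmesg | segfault_summary
```

A faulting address of 0 or a very small value usually indicates a NULL pointer dereference, which narrows down what to look for in the backtrace.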

Useful Commands for Troubleshooting¶

# Check system resources
top, htop, atop

# Check logs
journalctl -xe
tail -f /var/log/syslog
dmesg

# Network
ss, netstat, tcpdump, traceroute

# Disk
df -h, du -sh, iostat, iotop

# Process
ps, pgrep, pidof, kill, pkill

# Files
lsof, fuser

# Performance
vmstat, iostat, sar, perf
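
The disk tools above are typically combined when a filesystem fills up: `df -h` shows which filesystem is full, and `du` shows where the space went. The following is a minimal sketch of that second step; the helper name `largest_dirs` is hypothetical.

```shell
# largest_dirs DIR [N]
# Prints the N largest directories under DIR (cumulative size in kB),
# staying on DIR's filesystem (-x). N defaults to 10.
largest_dirs() {
  du -xk "$1" 2>/dev/null | sort -rn | head -n "${2:-10}"
}

# Typical usage:
# largest_dirs /var 15
```

If `df` reports a full filesystem but `du` cannot account for the space, deleted files still held open by a process are a likely cause; `lsof +L1` lists open files with a link count below one.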

Best Practices¶

Don't panic: Stay calm and systematic
Gather data first: Don't make assumptions
One change at a time: So you know what fixed it
Document everything: For future reference
Check logs: They often contain the answer
Ask questions: In an interview scenario, clarify the details
Think out loud: Especially in interviews

Interview Tips¶

When given a troubleshooting scenario:

Ask clarifying questions:
- What changed recently?
- When did it start?
- Is it intermittent or constant?
- What error messages appear?
Propose investigation steps:
- Start with broad checks
- Narrow down based on findings
- Explain why you take each step
Form hypotheses:
- Base them on the symptoms
- Explain your reasoning
- Test them systematically
Consider multiple causes:
- Don't fixate on the first idea
- Be ready to pivot

Practice Questions¶

How would you troubleshoot high CPU usage?
System is slow and swapping heavily. What do you check?
Application crashes with SIGSEGV. How do you investigate?
Users can't reach web server. How do you diagnose?
Disk is full. How do you find what's using space?
DNS isn't working. What are your debugging steps?
System load is high but the CPU is idle. Why?
