Troubleshooting Scenarios¶
Practice scenarios to prepare for interview troubleshooting questions.
Scenario 1: High CPU Usage¶
Situation: Production web server showing 95% CPU usage. Application is slow.
Your investigation:
- "What recent changes were made?"
- Check which process: `top`, `ps aux --sort=-%cpu` → Multiple Apache workers at 100% CPU
- Check threads: `top -H -p <PID>`
- System call trace: `strace -p <PID>` → Many `accept()` calls and processing
- Check connections: `ss -tan | wc -l` → 10,000 connections
- Check Apache config: MaxClients too high? Attack?
- Check logs: `/var/log/apache2/access.log` → Many requests from the same IPs (see the sketch below)
- Diagnosis: DDoS attack
Actions: Rate limiting, block IPs, scale capacity
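For the log-analysis step, a minimal sketch of counting requests per client IP; the log path and line count are assumptions from this scenario, and it presumes the default combined log format:

```bash
#!/bin/bash
# Count requests per client IP over the most recent portion of the access log.
# Assumes the default combined format, where the client IP is the first field.
LOG=/var/log/apache2/access.log
tail -n 100000 "$LOG" | awk '{print $1}' | sort | uniq -c | sort -rn | head -20
```

IPs that dominate this count are candidates for rate limiting or blocking once confirmed malicious.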
Scenario 2: Memory Leak¶
Situation: Application memory usage growing over time, eventually OOM killed.
Investigation:
- Monitor memory: `watch 'ps -p <PID> -o pid,vsz,rss'` → Memory increases linearly
- Check memory breakdown: `pmap -x <PID>` → Large heap allocation
- Run with debugging: `valgrind --leak-check=full ./app` → Shows memory allocated but not freed
- Diagnosis: Memory leak in code
Actions: Fix leak, monitor, restart process periodically as workaround
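A small sketch of the monitoring step: sample the process's memory at an interval and append to a CSV so the growth trend is easy to plot. The 60-second interval and output filename are arbitrary choices:

```bash
#!/bin/bash
# Sample a process's memory every 60 seconds so growth over time can be plotted.
# Pass the target PID as the first argument.
PID=$1
OUT="mem_${PID}.csv"
echo "timestamp,vsz_kb,rss_kb" > "$OUT"
while kill -0 "$PID" 2>/dev/null; do          # loop while the process still exists
    read -r vsz rss < <(ps -p "$PID" -o vsz=,rss=)
    echo "$(date +%s),$vsz,$rss" >> "$OUT"
    sleep 60
done
```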
Scenario 3: Disk Full¶
Situation: Server stops working, errors about disk space.
Investigation:
- Check filesystems: `df -h` → /var at 100%
- Find large files: `du -sh /var/*` → /var/log is 50GB
- Find specific files: `du -ah /var/log | sort -rh | head -20` → Old rotated logs not deleted
- Check logrotate config: `/etc/logrotate.d/` → Logrotate not running? Check cron
- Diagnosis: Logrotate misconfigured
Actions: Delete old logs, fix logrotate, add monitoring
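A hedged sketch of the cleanup and verification steps; the 14-day retention and the /var/log scope are assumptions, so review the candidate list before deleting anything:

```bash
#!/bin/bash
# Review and clear old rotated logs, then check that logrotate is actually scheduled.
find /var/log -name "*.gz" -mtime +14 -print        # list deletion candidates first
# find /var/log -name "*.gz" -mtime +14 -delete     # run only after reviewing the list
logrotate -d /etc/logrotate.conf                    # -d = debug/dry run: shows what would rotate
systemctl list-timers 2>/dev/null | grep -i logrotate   # on systemd, confirm a logrotate timer exists
```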
Scenario 4: Network Connectivity¶
Situation: Can't SSH to server, but it's pingable.
Investigation:
- Ping works: `ping 192.168.1.10` ✓
- SSH timeout: `ssh user@192.168.1.10`
- Check if SSH listening: `nmap -p 22 192.168.1.10` → Filtered
- Try from a different location → Same result
- Log in via console/KVM
- Check SSH status: `systemctl status sshd` → Running
- Check listening: `ss -tln | grep :22` → Listening on 0.0.0.0:22
- Check firewall: `iptables -L -n -v` → INPUT chain DROPs port 22
- Diagnosis: Firewall rule blocking SSH
Actions: Fix firewall rule, investigate who changed it
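From the console session, a hedged sketch of the firewall fix; the rule handling is illustrative, so review the full ruleset before deleting or reordering anything:

```bash
#!/bin/bash
# Locate the rule dropping SSH, then either delete it or accept SSH ahead of it.
iptables -L INPUT -n -v --line-numbers | grep -E 'dpt:22|DROP'   # find the offending rule number
# iptables -D INPUT <rule-number>                 # option 1: delete the bad rule
iptables -I INPUT 1 -p tcp --dport 22 -j ACCEPT   # option 2: accept SSH before any DROP
# Persist the change; the mechanism (iptables-save, netfilter-persistent, etc.) varies by distro.
```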
Scenario 5: Slow DNS Resolution¶
Situation: Applications slow, DNS lookups taking 5-10 seconds.
Investigation:
- Test DNS: `time nslookup google.com` → 8 seconds
- Check resolv.conf: `cat /etc/resolv.conf` → nameserver 8.8.8.8
- Ping nameserver: `ping 8.8.8.8` → Timeout; network issue to Google DNS
- Try local resolver: `dig @192.168.1.1 google.com` → Fast
- Check route: `traceroute 8.8.8.8` → Times out at firewall
- Diagnosis: Firewall blocking outbound DNS to 8.8.8.8
Actions: Use local DNS, fix firewall, redundant DNS servers
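A minimal sketch for timing the same lookup against several resolvers side by side; the resolver addresses here are assumptions, so substitute the ones from /etc/resolv.conf and your LAN:

```bash
#!/bin/bash
# Compare DNS response times across resolvers to see which one is slow.
for ns in 8.8.8.8 1.1.1.1 192.168.1.1; do
    # +time=2 +tries=1 keeps an unreachable resolver from stalling the loop
    ms=$(dig @"$ns" google.com +time=2 +tries=1 | awk '/Query time/ {print $4}')
    echo "resolver $ns: ${ms:-no answer} msec"
done
```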
Scenario 6: Process Won't Die¶
Situation: Process won't terminate even with kill -9.
Investigation:
- Try SIGKILL: `kill -9 <PID>` → Still running
- Check process state: `ps aux | grep <PID>` → State 'D' ('D' = uninterruptible sleep, usually I/O)
- Check what it's doing: `cat /proc/<PID>/wchan` → Shows kernel function; stuck in kernel doing I/O
- Check disk: `iostat -x 1` → Device /dev/sdb has high await, 100% util
- Check dmesg: `dmesg | tail` → Disk errors
- Diagnosis: Failing disk, process stuck in I/O
Actions: Can't kill (kernel operation), fix disk, may need reboot
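For the "what is it stuck on" step, a small sketch that lists every D-state process together with the kernel function it is blocked in:

```bash
#!/bin/bash
# List uninterruptible (D-state) processes and the kernel function they are blocked in.
# wchan is a standard ps output keyword; :32 just widens the column.
ps -eo pid,stat,wchan:32,comm --no-headers | awk '$2 ~ /^D/'
```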
Scenario 7: Web Server 502 Errors¶
Situation: Nginx returning 502 Bad Gateway errors.
Investigation:
- Check Nginx logs: `/var/log/nginx/error.log` → "Connection refused" to backend (127.0.0.1:8080)
- Check backend status: `systemctl status app` → Failed
- Check why it failed: `journalctl -u app -n 100` → "bind: Address already in use"
- Check who's on 8080: `ss -tlnp | grep :8080` → Different process; rogue process on backend port
- Diagnosis: Another process took the backend port
Actions: Kill rogue process, start backend, investigate how it happened
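A hedged sketch of confirming what owns the backend port before restarting the real service; the port and the "app" unit name come from this scenario and will differ in your environment:

```bash
#!/bin/bash
# Find out what owns the backend port, then restart the real service.
PORT=8080
ss -tlnp "sport = :$PORT"                       # listener plus owning pid/program (needs root for -p)
pid=$(lsof -t -iTCP:"$PORT" -sTCP:LISTEN)       # alternative lookup via lsof
echo "Port $PORT is held by PID ${pid:-<none>}"
# Only after confirming it is safe to stop:
# kill "$pid" && systemctl restart app
```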
Scenario 8: High Load but Low CPU¶
Situation: Load average 20.0 on 4-CPU system, but CPU idle.
Investigation:
- Check load and CPU: `top` → Load 20, CPU 95% idle (load average includes D-state processes)
- Check D state: `ps aux | grep ' D '` → 16 processes in D state, all doing I/O
- Check I/O: `iostat -x 1` → %iowait high, %util 100%; disk saturated
- Check what I/O: `iotop` → Database writes
- Diagnosis: Disk I/O bottleneck
Actions: Faster disks, optimize queries, add caching
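A quick sketch that ties the pieces of this scenario together: is the load coming from runnable (CPU-bound) tasks or uninterruptible (I/O-bound) ones?

```bash
#!/bin/bash
# Split the load between runnable and uninterruptible tasks, then show iowait.
echo "Load average: $(cut -d' ' -f1-3 /proc/loadavg)"
echo "Runnable (R) tasks:        $(ps -eo stat --no-headers | grep -c '^R')"
echo "Uninterruptible (D) tasks: $(ps -eo stat --no-headers | grep -c '^D')"
vmstat 1 2 | tail -1   # last sample: the "wa" column is CPU time spent waiting on I/O
```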
How to Approach Interview Scenarios¶
- Ask questions:
  - When did it start?
  - What changed?
  - Intermittent or constant?
  - Error messages?
- Start broad, narrow down (a first-pass sketch follows this list):
  - System-level stats first
  - Then process-specific
  - Finally deep dive
- Explain your reasoning:
  - Why each command
  - What you're looking for
  - How it helps
- Form hypotheses:
  - Based on symptoms
  - Test each one
  - Adjust based on findings
- Document findings:
  - What you discovered
  - Root cause
  - Fix applied
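In the "broad before narrow" spirit, a first-pass triage sketch; it only reads system-wide state, and the 80% disk threshold is an arbitrary cut-off:

```bash
#!/bin/bash
# Broad first pass: overall load, memory, disk, then the biggest CPU/memory consumers.
uptime                                  # load averages
free -h                                 # memory and swap
df -h | awk 'NR==1 || $5+0 > 80'        # header plus any filesystem over 80% full
ps aux --sort=-%cpu | head -6           # top CPU consumers
ps aux --sort=-%mem | head -6           # top memory consumers
```

Whatever stands out here decides which process-specific tools (strace, pmap, iostat, journalctl) to reach for next.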
Practice Exercise¶
Create your own scenarios:
- Pick a symptom
- Work backward to root cause
- List investigation steps
- Test on actual system if possible