
Troubleshooting Scenarios

Practice scenarios to prepare for interview troubleshooting questions.

Scenario 1: High CPU Usage

Situation: Production web server showing 95% CPU usage; the application is slow.

Your investigation:

  1. "What recent changes were made?"
  2. Check which process: top, ps aux --sort=-%cpu
  3. Multiple Apache workers at 100% CPU
  4. Check threads: top -H -p <PID>
  5. System call trace: strace -p <PID>
  6. See many accept() calls and processing
  7. Check connections: ss -tan | wc -l → 10,000 connections
  8. Check Apache config: MaxClients too high? Attack?
  9. Check logs: /var/log/apache2/access.log → Many requests from same IPs
  10. Diagnosis: DDoS attack

Actions: Rate limiting, block IPs, scale capacity
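
A minimal command sketch for this investigation, assuming Apache logs in the combined format at /var/log/apache2/access.log and iptables for blocking (the IP shown is hypothetical):

    # Busiest processes by CPU
    ps aux --sort=-%cpu | head -15

    # Rough count of open TCP connections
    ss -tan | wc -l

    # Top client IPs in the access log (log path assumed)
    awk '{print $1}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -20

    # Temporarily drop traffic from one offending IP (example address)
    iptables -I INPUT -s 203.0.113.50 -j DROP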

Scenario 2: Memory Leak

Situation: Application memory usage growing over time, eventually OOM killed.

Investigation:

  1. Monitor memory: watch 'ps -p <PID> -o pid,vsz,rss'
  2. Memory increases linearly
  3. Check memory breakdown: pmap -x <PID>
  4. Large heap allocation
  5. Run with debugging: valgrind --leak-check=full ./app
  6. Shows memory allocated but not freed
  7. Diagnosis: Memory leak in code

Actions: Fix leak, monitor, restart process periodically as workaround
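
A rough sketch of confirming the growth and locating the leak; $PID and the log path are assumptions, and ./app is the binary from the scenario, run under valgrind in a test environment:

    # Sample memory every minute to confirm steady RSS growth
    while true; do
        date +%T >> /tmp/memlog.txt
        ps -p "$PID" -o pid=,vsz=,rss= >> /tmp/memlog.txt
        sleep 60
    done

    # Break down the address space; [heap] and anon mappings usually show the growth
    pmap -x "$PID" | tail -20

    # Re-run under valgrind to find allocations that are never freed
    valgrind --leak-check=full ./app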

Scenario 3: Disk Full

Situation: Server stops working, errors about disk space.

Investigation:

  1. Check filesystems: df -h → /var at 100%
  2. Find large files: du -sh /var/* → /var/log is 50GB
  3. Find specific files: du -ah /var/log | sort -rh | head -20
  4. Old rotated logs not deleted
  5. Check logrotate config: /etc/logrotate.d/
  6. Logrotate not running? Check cron
  7. Diagnosis: Logrotate misconfigured

Actions: Delete old logs, fix logrotate, add monitoring
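
A minimal sketch of the disk-space hunt and the logrotate check, assuming the standard /etc/logrotate.conf entry point:

    # Which filesystem is full
    df -h

    # Largest consumers under /var, then the biggest files in /var/log
    du -sh /var/* 2>/dev/null | sort -rh | head
    du -ah /var/log 2>/dev/null | sort -rh | head -20

    # Dry-run logrotate to see what it would rotate and why it may be skipping files
    logrotate -d /etc/logrotate.conf

    # Once the config is corrected, force a rotation
    logrotate -f /etc/logrotate.conf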

Scenario 4: Network Connectivity

Situation: Can't SSH to server, but it's pingable.

Investigation:

  1. Ping works: ping 192.168.1.10
  2. SSH timeout: ssh user@192.168.1.10
  3. Check if SSH listening: nmap -p 22 192.168.1.10 → Filtered
  4. Try from different location → Same result
  5. Login via console/KVM
  6. Check SSH status: systemctl status sshd → Running
  7. Check listening: ss -tln | grep :22 → Listening on 0.0.0.0:22
  8. Check firewall: iptables -L -n -v
  9. INPUT chain DROPs port 22
  10. Diagnosis: Firewall rule blocking SSH

Actions: Fix firewall rule, investigate who changed it
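
A sketch of confirming and removing the offending rule from the console session, assuming plain iptables is in use (rule number 3 is hypothetical; use whatever --line-numbers shows):

    # List INPUT rules with packet counters and rule numbers
    iptables -L INPUT -n -v --line-numbers

    # Delete the rule that drops port 22, by its number
    iptables -D INPUT 3

    # Verify SSH is reachable again from another host
    nc -zv 192.168.1.10 22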

Scenario 5: Slow DNS Resolution

Situation: Applications slow, DNS lookups taking 5-10 seconds.

Investigation:

  1. Test DNS: time nslookup google.com → 8 seconds
  2. Check resolv.conf: cat /etc/resolv.conf → nameserver 8.8.8.8
  3. Ping nameserver: ping 8.8.8.8 → Timeout
  4. Network issue to Google DNS
  5. Try local resolver: dig @192.168.1.1 google.com → Fast
  6. Check route: traceroute 8.8.8.8 → Times out at firewall
  7. Diagnosis: Firewall blocking outbound DNS to 8.8.8.8

Actions: Use local DNS, fix firewall, redundant DNS servers
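
A quick sketch of comparing resolvers and confirming where queries die; the local resolver address 192.168.1.1 is taken from the scenario:

    # Time a lookup against the configured resolver and the local one
    time dig @8.8.8.8 google.com +short
    time dig @192.168.1.1 google.com +short

    # See where packets toward 8.8.8.8 stop
    traceroute -n 8.8.8.8

    # Check what resolv.conf currently points at (it may be managed by
    # systemd-resolved or NetworkManager, so change the managing service, not the file)
    grep nameserver /etc/resolv.conf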

Scenario 6: Process Won't Die

Situation: Process won't terminate even with kill -9.

Investigation:

  1. Try SIGKILL: kill -9 <PID> → Still running
  2. Check process state: ps aux | grep <PID> → State 'D'
  3. 'D' = uninterruptible sleep (usually I/O)
  4. Check what it's doing: cat /proc/<PID>/wchan → Shows kernel function
  5. Stuck in kernel doing I/O
  6. Check disk: iostat -x 1 → Device /dev/sdb has high await, 100% util
  7. Check dmesg: dmesg | tail → Disk errors
  8. Diagnosis: Failing disk, process stuck in I/O

Actions: Process can't be killed while blocked in the kernel; fix or replace the disk, a reboot may be needed
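
A sketch of confirming that the process is stuck in the kernel and that the disk is to blame; reading /proc/<PID>/stack requires root and kernel support for stack traces:

    # Process state and the kernel function it is waiting in
    ps -o pid,stat,wchan:32,cmd -p "$PID"

    # Kernel stack of the stuck task (root required)
    cat /proc/"$PID"/stack

    # Disk health and saturation
    dmesg | tail -50
    iostat -x 1 5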

Scenario 7: Web Server 502 Errors

Situation: Nginx returning 502 Bad Gateway errors.

Investigation:

  1. Check Nginx logs: /var/log/nginx/error.log
  2. "Connection refused" to backend (127.0.0.1:8080)
  3. Check backend status: systemctl status app → Failed
  4. Check why failed: journalctl -u app -n 100
  5. "bind: Address already in use"
  6. Check who's on 8080: ss -tlnp | grep :8080 → Different process
  7. Rogue process on backend port
  8. Diagnosis: Another process took backend port

Actions: Kill rogue process, start backend, investigate how it happened
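
A minimal sketch of identifying the port squatter and restoring the backend; the service name app and port 8080 come from the scenario, the rest is standard tooling:

    # Who owns port 8080 right now
    ss -tlnp '( sport = :8080 )'

    # Same question via fuser; add -k only after confirming what you would kill
    fuser -v 8080/tcp

    # After removing the rogue process, restart the backend and verify end to end
    systemctl restart app
    curl -I http://127.0.0.1:8080/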

Scenario 8: High Load but Low CPU

Situation: Load average 20.0 on 4-CPU system, but CPU idle.

Investigation:

  1. Check load and CPU: top → Load 20, CPU 95% idle
  2. On Linux, load average counts processes in uninterruptible (D) sleep as well as runnable ones
  3. Check D state: ps aux | grep ' D ' → 16 processes in D state
  4. All doing I/O
  5. Check I/O: iostat -x 1 → %iowait high, %util 100%
  6. Disk saturated
  7. Check what I/O: iotop → Database writes
  8. Diagnosis: Disk I/O bottleneck

Actions: Faster disks, optimize queries, add caching
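
A short sketch for separating "busy CPU" from "waiting on disk"; iotop needs root:

    # Tasks in uninterruptible (D) sleep
    ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'

    # Per-device latency (await) and utilization
    iostat -x 1 5

    # Which processes are generating the I/O (root required)
    iotop -oPa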

How to Approach Interview Scenarios

  1. Ask questions:

    • When did it start?
    • What changed?
    • Intermittent or constant?
    • Error messages?
  2. Start broad, narrow down (see the checklist sketch after this list):

    • System-level stats first
    • Then process-specific
    • Finally deep dive
  3. Explain your reasoning:

    • Why each command
    • What you're looking for
    • How it helps
  4. Form hypotheses:

    • Based on symptoms
    • Test each one
    • Adjust based on findings
  5. Document findings:

    • What you discovered
    • Root cause
    • Fix applied
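
One way to make the "start broad" pass concrete is a fixed first-minute checklist of system-wide commands; a rough sketch along these lines:

    uptime              # load averages: rising or falling?
    dmesg | tail -20    # recent kernel messages: OOM kills, disk or network errors
    vmstat 1 5          # run queue, swap, I/O wait, CPU split
    free -h             # memory and swap headroom
    iostat -x 1 5       # per-disk latency and utilization
    ss -s               # socket summary
    top                 # then narrow down to the suspect process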

Practice Exercise

Create your own scenarios:

  1. Pick a symptom
  2. Work backward to root cause
  3. List investigation steps
  4. Test on actual system if possible