Linux Network Stack
Overview
The Linux network stack moves packets between NIC hardware and applications through a series of layers: device driver, network (IP), transport (TCP/UDP), and the socket API.
High-Level Network Stack
---
config:
fontFamily: Monospace
layout: elk
wrap: false
---
graph TD
%% USER SPACE
A[User Space] -->|syscalls| B["Socket API<br/>(socket(), bind(), connect(), send(), recv(), etc.)"]
%% KERNEL SPACE
subgraph K[Kernel Space]
direction TB
%% Namespaces and isolation
NS["Network Namespace (netns)"]:::ns
CG["Cgroups (net_cls, net_prio)"]:::ctrl
B --> NS
B --> CG
%% Protocol families
NS --> PF["Protocol Families<br/>(AF_INET, AF_INET6, AF_UNIX, AF_PACKET, AF_NETLINK)"]
%% Socket layers
PF --> SL["Socket Layer<br/>(struct sock, sk_buff, socket buffers)"]
SL --> TL[Transport Layer]
%% Transport protocols
TL --> TCP["TCP<br/>(Congestion control, retransmission, SACK)"]
TL --> UDP[UDP]
TL --> SCTP[SCTP]
TL --> DCCP[DCCP]
%% eBPF hooks: socket-level
SL --> EBPF_SOCK["eBPF Socket Filters<br/>(SO_ATTACH_BPF, cgroup/bpf hooks)"]
%% Network layer
TL --> NL[Network Layer]
NL --> IP["IPv4 / IPv6<br/>(Routing, Fragmentation, Reassembly)"]
IP --> ROUTE["Routing Subsystem<br/>(fib_trie, fib_rules, policy routing)"]
IP --> NF["Netfilter Hooks<br/>(PREROUTING, INPUT, FORWARD, OUTPUT, POSTROUTING)"]
NF --> NFT[nftables / iptables Chains]
%% ICMP, ARP, Neighbor
IP --> ICMP[ICMP / ICMPv6]
IP --> ARP["ARP / NDISC<br/>(Neighbor Cache, nd_tbl)"]
%% Tunneling and overlays
IP --> TUNNEL["Tunneling / VPN<br/>(IPIP, GRE, SIT, GENEVE, VXLAN, WireGuard)"]
%% eBPF hooks at XDP / TC
IP --> XDP["eBPF / XDP (Express Data Path)<br/>(runs in driver RX path)"]
IP --> TC["Traffic Control (tc)<br/>(qdisc, classifier, action)"]
TC --> QDISC["qdisc: fq_codel, HTB, pfifo_fast"]
QDISC --> CLS["classifier (cls_bpf, u32, flower)"]
CLS --> ACT["action (mirred, drop, redirect)"]
%% Virtual devices and bridges
IP --> VDEV["Virtual Devices<br/>(veth, bridge, bond, team, VLAN, MACVLAN, VXLAN)"]
VDEV --> BR["Bridge (br_netfilter, STP, FDB)"]
BR --> ETH["net_device Interface<br/>(struct net_device)"]
%% Device driver layer
ETH --> NAPI["NAPI / GRO / GSO<br/>(Packet batching and offload)"]
NAPI --> DRIVER["Network Device Driver<br/>(e1000e, ixgbe, virtio-net)"]
end
%% HARDWARE
DRIVER --> HW["Hardware (NIC, PHY, DMA, TX/RX rings, Interrupts)"]
Packet Flow
Receiving Packets
graph TD
A[NIC Hardware] --> B[Driver: Interrupt/NAPI]
B --> C[sk_buff allocated]
C --> D[IP Layer Processing]
D --> E[TCP/UDP Layer]
E --> F[Socket Buffer]
F --> G[Application]
Detailed steps:
- NIC receives packet: DMA to ring buffer
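- Interrupt: NIC raises an IRQ; the driver schedules NAPI, which then polls the ring in softirq context
- Driver: Wraps the received data in an sk_buff (some drivers copy small packets; most attach the DMA buffer directly)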
- Netfilter: iptables PREROUTING
- IP Layer: Routing decision, forwarding/local
- Netfilter: INPUT chain
- TCP/UDP: Checksum, sequence numbers, socket lookup
- Socket buffer: Data queued for application
- Application: read()/recv() copies data
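The checksum step in the list above can be illustrated with the Internet checksum from RFC 1071, which IPv4, TCP, and UDP all use. This is a minimal userspace sketch: `inet_checksum` is a hypothetical helper name, and the kernel itself relies on optimized routines and, frequently, NIC checksum offload instead.

```c
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 Internet checksum over `len` bytes, returned in host byte order.
 * Hypothetical helper for illustration; sized for packet-length inputs. */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    /* Add the data as big-endian 16-bit words. */
    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }
    /* A trailing odd byte is padded with a zero low byte. */
    if (len == 1)
        sum += (uint32_t)data[0] << 8;

    /* Fold the carries back into the low 16 bits (one's-complement addition). */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```

A receiver can verify a packet by running the same function over the data with the checksum field included: an intact packet yields 0.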
Sending Packets
graph TD
A[Application: write/send] --> B[Socket Layer]
B --> C[TCP/UDP Layer]
C --> D[IP Layer: Routing]
D --> E[Netfilter: OUTPUT]
E --> F[Device Queue]
F --> G[NIC: Transmit]
Low-Level Packet Flow
---
config:
fontFamily: Monospace
layout: elk
wrap: false
---
flowchart TD
%% USER SPACE
A1["User Space<br/>Applications: curl, ssh, nginx, etc."] -->|"send()/recv() syscalls"| B1[Socket Layer]
%% TRANSMIT PATH
subgraph TX["TX Path (Outgoing Packet)"]
direction TB
B1 --> C1["Transport Layer<br/>(TCP, UDP, SCTP)"]
C1 --> D1["IP Layer (IPv4/IPv6)<br/>Builds headers, checksum, etc."]
D1 -->|Routing decision| E1["Routing Subsystem<br/>(fib_lookup, policy rules)"]
E1 -->|Netfilter OUTPUT hook| F1["Netfilter / nftables<br/>(OUTPUT, POSTROUTING)"]
F1 -->|Optional NAT / filtering| G1["Traffic Control (tc) egress<br/>(qdisc, classifier, action)"]
G1 -->|Optional eBPF tc hook| H1["Virtual Device Layer<br/>(veth, bridge, bond, VLAN, etc.)"]
H1 -->|net_device ops| I1["Driver Queue (NAPI TX ring)"]
I1 --> J1["NIC Hardware<br/>(DMA → wire)"]
end
%% RECEIVE PATH
subgraph RX["RX Path (Incoming Packet)"]
direction TB
J2["NIC Hardware<br/>(Interrupt, DMA RX ring)"] --> K2["NAPI Poll Loop<br/>(GRO, checksum, offloads)"]
K2 -->|"XDP hook (optional eBPF)"| L2["XDP / eBPF Fast Path"]
L2 -->|if not dropped| M2["net_device RX handler"]
M2 -->|Netfilter PREROUTING| N2["Netfilter / nftables<br/>(PREROUTING)"]
N2 -->|Routing decision| O2["Routing Subsystem<br/>(fib_lookup)"]
O2 -->|Local destination?| P2{"Is packet for local host?"}
P2 -->|Yes| Q2["Netfilter INPUT hook"]
Q2 --> R2["Transport Layer Demux<br/>(TCP, UDP, ICMP)"]
R2 --> S2["Socket Receive Queue<br/>(sk_buff queued to app)"]
S2 --> T2["User Space read()<br/>(recv(), recvmsg())"]
P2 -->|No| U2["Forwarding Path"]
U2 -->|Netfilter FORWARD hook| V2["Netfilter Forward Decision"]
V2 -->|Allowed| X2["Egress Device Routing"]
X2 -->|Netfilter POSTROUTING| Y2["Netfilter NAT / Postrouting"]
Y2 --> W2["Traffic Control (tc) egress"]
W2 --> Z2["Driver Queue (TX)"]
Z2 --> AA2["NIC Hardware → Outgoing Interface"]
end
%% RELATION BETWEEN TX AND RX
J1 -.-> J2
Socket API
Creating Sockets
#include <sys/socket.h>
#include <netinet/in.h> // IPPROTO_TCP
// TCP socket
int tcp_sock = socket(AF_INET, SOCK_STREAM, 0);
// UDP socket
int udp_sock = socket(AF_INET, SOCK_DGRAM, 0);
// Raw socket (requires CAP_NET_RAW, typically root)
int raw_sock = socket(AF_INET, SOCK_RAW, IPPROTO_TCP);
Address Families:
- AF_INET: IPv4
- AF_INET6: IPv6
- AF_UNIX: Unix domain sockets
- AF_PACKET: Raw packets
Socket Types:
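- SOCK_STREAM: Reliable byte stream (TCP over AF_INET/AF_INET6)
- SOCK_DGRAM: Unreliable datagrams (UDP over AF_INET/AF_INET6)
- SOCK_RAW: Raw IP packets
- SOCK_SEQPACKET: Reliable, sequenced datagrams that preserve message boundaries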
TCP Server
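#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
// Each call below returns -1 on error and sets errno; production code must check.
// 1. Create socket
int listen_sock = socket(AF_INET, SOCK_STREAM, 0);
// 2. Bind to address
struct sockaddr_in addr;
memset(&addr, 0, sizeof(addr)); // Clear padding before filling in fields
addr.sin_family = AF_INET;
addr.sin_port = htons(8080); // Port in network byte order
addr.sin_addr.s_addr = htonl(INADDR_ANY); // All local IPv4 addresses
bind(listen_sock, (struct sockaddr*)&addr, sizeof(addr));
// 3. Listen
listen(listen_sock, 128); // Backlog = 128 (capped by net.core.somaxconn)
// 4. Accept connections (blocks until a client connects)
struct sockaddr_in client_addr;
socklen_t addr_len = sizeof(client_addr);
int client_sock = accept(listen_sock, (struct sockaddr*)&client_addr, &addr_len);
// 5. Communicate
char buf[1024];
ssize_t n = recv(client_sock, buf, sizeof(buf), 0); // 0 means the peer closed the connection
send(client_sock, "Hello", 5, 0);
// 6. Close
close(client_sock);
close(listen_sock);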
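On a stream socket, `send()` may accept fewer bytes than requested, so code that must deliver a whole message usually loops. The following is a minimal sketch of such a loop; `send_all` is a hypothetical helper name, and `MSG_NOSIGNAL` is a Linux-specific flag assumed to be available.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Send exactly `len` bytes, retrying on short writes and EINTR.
 * Returns 0 on success, -1 on error (errno is set by send()). */
int send_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        /* MSG_NOSIGNAL (Linux): report EPIPE instead of raising SIGPIPE. */
        ssize_t n = send(fd, p, len, MSG_NOSIGNAL);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted by a signal: retry */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```

In the server above, `send_all(client_sock, "Hello", 5)` would take the place of the bare `send()` call.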
TCP Client
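#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
// 1. Create socket
int sock = socket(AF_INET, SOCK_STREAM, 0);
// 2. Connect
struct sockaddr_in addr;
memset(&addr, 0, sizeof(addr));
addr.sin_family = AF_INET;
addr.sin_port = htons(8080);
inet_pton(AF_INET, "192.168.1.1", &addr.sin_addr); // Returns 1 on success
connect(sock, (struct sockaddr*)&addr, sizeof(addr)); // Returns -1 on failure (e.g., ECONNREFUSED)
// 3. Communicate
send(sock, "Hello", 5, 0);
char buf[1024];
ssize_t n = recv(sock, buf, sizeof(buf), 0); // Bytes received; may be fewer than sizeof(buf)
// 4. Close
close(sock);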
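The client ignores the return value of `inet_pton()`, which silently leaves the address unset when the string is malformed. A small wrapper makes the failure visible; `make_ipv4_addr` is a hypothetical helper name used here for illustration only.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/* Fill `out` from a dotted-quad string and a host-order port.
 * Returns 0 on success, -1 if `ip` is not a valid IPv4 address. */
int make_ipv4_addr(const char *ip, uint16_t port, struct sockaddr_in *out)
{
    memset(out, 0, sizeof(*out));
    out->sin_family = AF_INET;
    out->sin_port = htons(port);   /* network byte order */
    /* inet_pton() returns 1 on success, 0 for an unparsable string. */
    return inet_pton(AF_INET, ip, &out->sin_addr) == 1 ? 0 : -1;
}
```

The resulting structure can be passed straight to `connect()` or `bind()` after casting to `struct sockaddr *`.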
Socket Options
// Reuse address (avoid "Address already in use")
int opt = 1;
setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));
// Keep-alive
setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &opt, sizeof(opt));
// Send/receive buffer sizes
int bufsize = 65536;
setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize));
setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof(bufsize));
// TCP_NODELAY (disable Nagle's algorithm); needs <netinet/tcp.h>
setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &opt, sizeof(opt));
// Timeout
struct timeval tv = {5, 0}; // 5 seconds
setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
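The size requested with `SO_RCVBUF` is not necessarily the size the socket ends up with: on Linux, socket(7) documents that the kernel doubles the value to leave room for bookkeeping and caps the request at `net.core.rmem_max`. A minimal sketch for reading the effective value back, using a hypothetical helper name `set_and_get_rcvbuf`:

```c
#include <sys/socket.h>

/* Request a receive buffer size and return the size the kernel reports,
 * or -1 on error. */
int set_and_get_rcvbuf(int fd, int requested)
{
    int effective = 0;
    socklen_t len = sizeof(effective);

    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested)) < 0)
        return -1;
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &effective, &len) < 0)
        return -1;
    return effective;
}
```

Comparing the return value with the request is a quick way to notice that a system limit, rather than the application, is deciding the buffer size.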
Kernel Network Structures
sk_buff
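The fundamental network packet structure. The listing below is a simplified view; the real definition in include/linux/skbuff.h has more fields and a different layout.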
/**
* struct sk_buff - socket buffer
*
* The fundamental data structure for network packets in Linux kernel.
* Contains packet data, metadata, and various flags for packet processing.
*/
struct sk_buff {
/* ========== List Management ========== */
struct sk_buff *next; /* Next buffer in list */
struct sk_buff *prev; /* Previous buffer in list */
/* ========== Device & Socket ========== */
struct net_device *dev; /* Network device */
struct sock *sk; /* Associated socket */
/* ========== Timestamps ========== */
ktime_t tstamp; /* Packet timestamp */
/* ========== Packet Data Pointers ========== */
unsigned char *head; /* Start of allocated buffer */
unsigned char *data; /* Start of actual data */
sk_buff_data_t tail; /* End of data */
sk_buff_data_t end; /* End of allocated buffer */
/* ========== Data Length ========== */
unsigned int len; /* Total data length */
unsigned int data_len; /* Non-linear data length */
unsigned int truesize; /* Total buffer size (including overhead) */
/* ========== Protocol Headers ========== */
__u16 mac_header; /* Link layer header offset */
__u16 network_header; /* Network layer header offset (IP) */
__u16 transport_header; /* Transport layer header offset (TCP/UDP) */
__be16 protocol; /* Packet protocol (ETH_P_IP, ETH_P_IPV6, etc.) */
/* ========== Checksum ========== */
__u8 ip_summed:2; /* Checksum status (NONE, UNNECESSARY, COMPLETE, PARTIAL) */
union {
__wsum csum; /* Checksum value */
struct {
__u16 csum_start; /* Checksum start offset */
__u16 csum_offset; /* Checksum field offset */
};
};
/* ========== VLAN ========== */
union {
u32 vlan_all;
struct {
__be16 vlan_proto; /* VLAN protocol (ETH_P_8021Q, ETH_P_8021AD) */
__u16 vlan_tci; /* VLAN TCI (tag control information) */
};
};
/* ========== QoS & Routing ========== */
__u32 priority; /* Packet priority */
__u32 mark; /* Packet mark (for routing/filtering) */
__u16 queue_mapping; /* TX queue mapping */
__u32 hash; /* Flow hash */
int skb_iif; /* Input interface index */
/* ========== Clone & Reference ========== */
__u8 cloned:1; /* Buffer is cloned */
refcount_t users; /* Reference count */
/* ========== Packet Type ========== */
__u8 pkt_type:3; /* Packet class (PACKET_HOST, PACKET_BROADCAST, etc.) */
/* ========== Control Buffer ========== */
char cb[48] __aligned(8); /* Control buffer (layer-specific data) */
/* ========== Extensions ========== */
struct skb_ext *extensions; /* Optional extensions (SEC, TC, etc.) */
};
/**
* Key Concepts:
*
* Buffer Layout:
* head --> [ headroom | data | tailroom ] <-- end
*                     ^      ^
*                     data   tail
*
* Header Offsets:
* [ Ethernet | IP | TCP/UDP | Payload ]
*   ^          ^    ^
*   |          |    +-- transport_header
*   |          +------- network_header
*   +------------------ mac_header
*
* Common Operations:
* - skb_put() : Add data to tail
* - skb_push() : Add data to head (prepend)
* - skb_pull() : Remove data from head
* - skb_reserve() : Reserve headroom
*/
Key functions:
- alloc_skb(): Allocate new sk_buff
- skb_put(): Add data at tail
- skb_push(): Add data at head
- skb_pull(): Remove data from head
- kfree_skb(): Free sk_buff
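The pointer arithmetic behind these functions is easier to see in a userspace model. The sketch below is not kernel code: `struct mini_skb` and the `mini_*` functions are hypothetical names that only mimic how `head`, `data`, `tail`, and `end` move. One deliberate difference is that the kernel's `skb_put()` and `skb_push()` panic when they run out of room, whereas this model returns NULL.

```c
#include <stddef.h>

/* Userspace model of the sk_buff data pointers (hypothetical, not kernel code). */
struct mini_skb {
    unsigned char *head;  /* start of allocated buffer */
    unsigned char *data;  /* start of packet data */
    unsigned char *tail;  /* end of packet data */
    unsigned char *end;   /* end of allocated buffer */
    unsigned int len;     /* bytes between data and tail */
};

void mini_init(struct mini_skb *skb, unsigned char *buf, size_t size)
{
    skb->head = skb->data = skb->tail = buf;
    skb->end = buf + size;
    skb->len = 0;
}

/* Like skb_reserve(): create headroom in an empty buffer. */
void mini_reserve(struct mini_skb *skb, unsigned int n)
{
    skb->data += n;
    skb->tail += n;
}

/* Like skb_put(): extend the data area at the tail; returns the old tail. */
unsigned char *mini_put(struct mini_skb *skb, unsigned int n)
{
    unsigned char *old_tail = skb->tail;

    if (n > (size_t)(skb->end - skb->tail))
        return NULL;            /* not enough tailroom */
    skb->tail += n;
    skb->len += n;
    return old_tail;
}

/* Like skb_push(): prepend n bytes by moving data into the headroom. */
unsigned char *mini_push(struct mini_skb *skb, unsigned int n)
{
    if (n > (size_t)(skb->data - skb->head))
        return NULL;            /* not enough headroom */
    skb->data -= n;
    skb->len += n;
    return skb->data;
}

/* Like skb_pull(): strip n bytes from the front, e.g. a parsed header. */
unsigned char *mini_pull(struct mini_skb *skb, unsigned int n)
{
    if (n > skb->len)
        return NULL;
    skb->data += n;
    skb->len -= n;
    return skb->data;
}
```

A transmit path would typically reserve headroom first, put the payload, and then push each header in front of it; a receive path pulls headers off as each layer finishes with them.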
Socket Buffer
Each socket has a receive buffer and a send buffer; their default and maximum sizes are set by sysctls under net.core.
# View socket buffer sizes
cat /proc/sys/net/core/rmem_default
cat /proc/sys/net/core/wmem_default
cat /proc/sys/net/core/rmem_max
cat /proc/sys/net/core/wmem_max
# Per-socket view
ss -tm # Memory info
Netfilter / iptables
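Netfilter is the kernel's packet filtering framework: a set of hooks in the network stack that iptables and nftables attach rules to for filtering, NAT, and packet mangling.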
Chains and Tables
Tables:
- filter: Packet filtering (default)
- nat: Network address translation
- mangle: Packet modification
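- raw: Runs before connection tracking (used to exempt packets from it)
- security: Mandatory access control marking (e.g., SELinux)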
Chains:
- PREROUTING: Just arrived
- INPUT: Destined for local
- FORWARD: Routed through
- OUTPUT: Locally generated
- POSTROUTING: About to leave
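Which chains a packet visits depends only on where it came from and where it is going. The sketch below encodes that as a small function; the enum and function names are hypothetical and exist purely to illustrate the ordering, not to mirror any kernel interface.

```c
#include <stddef.h>

enum chain { CH_PREROUTING, CH_INPUT, CH_FORWARD, CH_OUTPUT, CH_POSTROUTING };

enum packet_kind {
    PKT_LOCAL_IN,   /* arrived on an interface, addressed to this host */
    PKT_FORWARDED,  /* arrived on an interface, routed to another host */
    PKT_LOCAL_OUT   /* generated by a local process */
};

/* Write the chains traversed, in order, into out[] (room for 3); return how many. */
size_t chain_path(enum packet_kind kind, enum chain out[3])
{
    size_t n = 0;

    if (kind == PKT_LOCAL_OUT) {
        out[n++] = CH_OUTPUT;
    } else {
        out[n++] = CH_PREROUTING;
        if (kind == PKT_LOCAL_IN) {
            out[n++] = CH_INPUT;
            return n;              /* delivered to a local socket */
        }
        out[n++] = CH_FORWARD;
    }
    out[n++] = CH_POSTROUTING;     /* everything leaving the host ends here */
    return n;
}
```

Note that locally generated traffic never sees PREROUTING and inbound local traffic never sees POSTROUTING, which is why DNAT rules for local processes go in OUTPUT rather than PREROUTING.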
Packet Flow
flowchart TB
Start([📡 Packet Arrives<br/>Network Interface])
PreRouting{{"🔍 PREROUTING<br/>netfilter hook<br/>(nat, mangle, raw)"}}
RouteDec1{{"🧭 Routing Decision<br/>Is packet for<br/>local machine?"}}
Input{{"📥 INPUT<br/>netfilter hook<br/>(filter, nat, mangle)"}}
LocalProc[["💻 Local Process<br/>(Application Layer)<br/>Socket receive/send"]]
Output{{"📤 OUTPUT<br/>netfilter hook<br/>(filter, nat, mangle, raw)"}}
Forward{{"↔️ FORWARD<br/>netfilter hook<br/>(filter, mangle)"}}
RouteDec2{{"🧭 Routing Decision<br/>Determine<br/>outgoing interface"}}
PostRouting{{"📮 POSTROUTING<br/>netfilter hook<br/>(nat, mangle)"}}
Out1([🌐 Send to Network<br/>Interface])
Out2([🌐 Send to Network<br/>Interface])
Drop1[/❌ DROP/]
Drop2[/❌ DROP/]
Drop3[/❌ DROP/]
Drop4[/❌ DROP/]
Drop5[/❌ DROP/]
Start --> PreRouting
PreRouting -->|"ACCEPT"| RouteDec1
PreRouting -.->|"DROP/REJECT"| Drop1
RouteDec1 -->|"Destination:<br/>Local IP"| Input
RouteDec1 -->|"Destination:<br/>Other IP<br/>(Forwarding enabled)"| Forward
Input -->|"ACCEPT"| LocalProc
Input -.->|"DROP/REJECT"| Drop2
LocalProc -->|"Application<br/>sends data"| Output
Output -.->|"DROP/REJECT"| Drop3
Output -->|"ACCEPT"| RouteDec2
Forward -->|"ACCEPT"| RouteDec2
Forward -.->|"DROP/REJECT"| Drop4
RouteDec2 --> PostRouting
PostRouting -->|"ACCEPT<br/>(forwarded traffic)"| Out1
PostRouting -.->|"DROP/REJECT"| Drop5
PostRouting -->|"ACCEPT<br/>(locally generated traffic)"| Out2
Detailed Hook Information
graph LR
subgraph Tables["iptables Tables (processed in order)"]
direction TB
Raw["1️⃣ raw<br/>Connection tracking bypass"]
Mangle["2️⃣ mangle<br/>Packet alteration (TOS, TTL)"]
Nat["3️⃣ nat<br/>Address translation"]
Filter["4️⃣ filter<br/>Packet filtering (allow/deny)"]
Security["5️⃣ security<br/>SELinux rules"]
end
subgraph Hooks["Netfilter Hooks & Available Tables"]
direction TB
subgraph H1["PREROUTING"]
P1["✓ raw<br/>✓ mangle<br/>✓ nat"]
end
subgraph H2["INPUT"]
P2["✓ mangle<br/>✓ filter<br/>✓ nat<br/>✓ security"]
end
subgraph H3["FORWARD"]
P3["✓ mangle<br/>✓ filter<br/>✓ security"]
end
subgraph H4["OUTPUT"]
P4["✓ raw<br/>✓ mangle<br/>✓ nat<br/>✓ filter<br/>✓ security"]
end
subgraph H5["POSTROUTING"]
P5["✓ mangle<br/>✓ nat"]
end
end
subgraph Actions["Common Actions"]
direction TB
Accept["✅ ACCEPT<br/>Allow packet"]
Drop["❌ DROP<br/>Silently discard"]
Reject["🚫 REJECT<br/>Discard + send error"]
Log["📝 LOG<br/>Log and continue"]
Masq["🎭 MASQUERADE<br/>Dynamic SNAT"]
DNAT["🎯 DNAT<br/>Destination NAT"]
SNAT["📤 SNAT<br/>Source NAT"]
end
iptables Examples
# List rules
iptables -L -n -v
# Allow SSH
iptables -A INPUT -p tcp --dport 22 -j ACCEPT
# Block IP
iptables -A INPUT -s 192.168.1.100 -j DROP
# NAT/Masquerade
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
# Port forward
iptables -t nat -A PREROUTING -p tcp --dport 80 -j DNAT --to-destination 192.168.1.10:8080
# Save rules (path used by Debian's iptables-persistent; adjust for your distribution)
iptables-save > /etc/iptables/rules.v4
# Restore rules
iptables-restore < /etc/iptables/rules.v4
Network Namespaces
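Network namespaces isolate network stacks: each namespace has its own interfaces, routing tables, netfilter rules, and sockets (covered in Fundamentals).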
Network Diagnostics
tcpdump
Capture packets for analysis.
# Capture on interface
tcpdump -i eth0
# Save to file
tcpdump -i eth0 -w capture.pcap
# Read from file
tcpdump -r capture.pcap
# Filter by host
tcpdump host 192.168.1.1
# Filter by port
tcpdump port 80
# TCP flags
tcpdump 'tcp[tcpflags] & tcp-syn != 0'
# Verbose output
tcpdump -i eth0 -nn -vv
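The `tcp[tcpflags] & tcp-syn != 0` filter is a byte test: `tcpflags` is offset 13 in the TCP header and `tcp-syn` is the mask 0x02. The same check can be sketched in C as follows; the macro and function names are hypothetical, and the caller is assumed to pass a pointer to at least 14 bytes of TCP header.

```c
#include <stdbool.h>
#include <stdint.h>

#define TCP_FLAGS_OFFSET 13   /* byte offset of the flags field in the TCP header */
#define TCP_FLAG_SYN     0x02
#define TCP_FLAG_ACK     0x10

/* Equivalent of the filter 'tcp[tcpflags] & tcp-syn != 0'. */
bool tcp_has_syn(const uint8_t *tcp_header)
{
    return (tcp_header[TCP_FLAGS_OFFSET] & TCP_FLAG_SYN) != 0;
}

/* Stricter variant: SYN set and ACK clear, i.e. a connection-opening segment. */
bool tcp_is_initial_syn(const uint8_t *tcp_header)
{
    uint8_t flags = tcp_header[TCP_FLAGS_OFFSET];

    return (flags & (TCP_FLAG_SYN | TCP_FLAG_ACK)) == TCP_FLAG_SYN;
}
```

The first function also matches SYN-ACK replies, just as the tcpdump filter does; the second shows how the mask would change to see only connection attempts.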
ss (socket statistics)
Modern replacement for netstat.
# All TCP connections
ss -tan
# Listening sockets
ss -tln
# With process info
ss -tlnp
# Socket memory
ss -tm
# Filter by state
ss state established
# Filter by port
ss -tan sport :22
netstat (legacy)
# All connections
netstat -an
# Listening
netstat -tln
# Routing table
netstat -rn
# Interface statistics
netstat -i
Network Performance Tuning
TCP Tuning
# TCP buffer sizes (bytes; tcp_rmem/tcp_wmem take "min default max")
sysctl -w net.core.rmem_max=134217728
sysctl -w net.core.wmem_max=134217728
sysctl -w net.ipv4.tcp_rmem="4096 87380 67108864"
sysctl -w net.ipv4.tcp_wmem="4096 65536 67108864"
# TCP window scaling
sysctl -w net.ipv4.tcp_window_scaling=1
# Congestion control (bbr requires BBR support, e.g. the tcp_bbr module)
sysctl -w net.ipv4.tcp_congestion_control=bbr
# SYN cookies (mitigate SYN floods)
sysctl -w net.ipv4.tcp_syncookies=1
# Connection tracking table size (requires nf_conntrack to be loaded)
sysctl -w net.netfilter.nf_conntrack_max=1048576
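Maximum buffer sizes like the ones above are usually chosen from the bandwidth-delay product (BDP): the number of bytes that must be in flight to keep a link full. The text does not say which link the 67108864-byte (64 MiB) maximum targets, so the 10 Gbit/s and 50 ms figures in the usage note are assumptions, and `bdp_bytes` is a hypothetical helper name.

```c
#include <stdint.h>

/* Bandwidth-delay product in bytes: (bits per second / 8) * round-trip time.
 * The RTT is given in milliseconds to keep the arithmetic in integers. */
uint64_t bdp_bytes(uint64_t bits_per_sec, uint64_t rtt_ms)
{
    return bits_per_sec / 8 * rtt_ms / 1000;
}
```

For example, `bdp_bytes(10000000000ULL, 50)` returns 62500000, so a 10 Gbit/s path with a 50 ms round trip needs roughly 62.5 MB of buffer, which fits just under a 64 MiB maximum.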
Monitoring
# Network stats
netstat -s
nstat
# Per-interface stats
ip -s link
# Network throughput
iftop
nethogs
Practice Questions
- Explain the path of a packet from NIC to application.
- What is the difference between bind() and listen()?
- How does SO_REUSEADDR work?
- Explain the iptables packet flow through chains.
- What is an sk_buff?
- When would you use TCP_NODELAY?
- How do you capture packets on a specific port with tcpdump?
Further Reading
- man 7 socket, man 2 socket
- man 7 tcp, man 7 udp
- man 8 iptables
- Kernel source: net/ directory