Linux Network Stack

Overview

The Linux network stack moves packets between NIC hardware and user-space applications through a series of layers: device drivers, the network (IP) layer, the transport layer, and the socket API.

High-Level Network Stack

---
config:
    fontFamily: Monospace
    layout: elk
    wrap: false
---
graph TD
    %% USER SPACE
    A[User Space] -->|syscalls| B["Socket API<br/>(socket(), bind(), connect(), send(), recv(), etc.)"]

    %% KERNEL SPACE
    subgraph K[Kernel Space]
        direction TB

        %% Namespaces and isolation
        NS["Network Namespace (netns)"]:::ns
        CG["Cgroups (net_cls, net_prio)"]:::ctrl
        B --> NS
        B --> CG

        %% Protocol families
        NS --> PF["Protocol Families<br/>(AF_INET, AF_INET6, AF_UNIX, AF_PACKET, AF_NETLINK)"]

        %% Socket layers
        PF --> SL["Socket Layer<br/>(struct sock, sk_buff, socket buffers)"]
        SL --> TL[Transport Layer]

        %% Transport protocols
        TL --> TCP["TCP<br/>(Congestion control, retransmission, SACK)"]
        TL --> UDP[UDP]
        TL --> SCTP[SCTP]
        TL --> DCCP[DCCP]

        %% eBPF hooks: socket-level
        SL --> EBPF_SOCK["eBPF Socket Filters<br/>(SO_ATTACH_BPF, cgroup/bpf hooks)"]

        %% Network layer
        TL --> NL[Network Layer]
        NL --> IP["IPv4 / IPv6<br/>(Routing, Fragmentation, Reassembly)"]
        IP --> ROUTE["Routing Subsystem<br/>(fib_trie, fib_rules, policy routing)"]
        IP --> NF["Netfilter Hooks<br/>(PREROUTING, INPUT, FORWARD, OUTPUT, POSTROUTING)"]
        NF --> NFT[nftables / iptables Chains]

        %% ICMP, ARP, Neighbor
        IP --> ICMP[ICMP / ICMPv6]
        IP --> ARP["ARP / NDISC<br/>(Neighbor Cache, nd_tbl)"]

        %% Tunneling and overlays
        IP --> TUNNEL["Tunneling / VPN<br/>(IPIP, GRE, SIT, GENEVE, VXLAN, WireGuard)"]

        %% eBPF hooks at XDP / TC
        IP --> XDP["eBPF / XDP (Express Data Path)<br/>(runs in driver RX path)"]
        IP --> TC["Traffic Control (tc)<br/>(qdisc, classifier, action)"]
        TC --> QDISC["qdisc: fq_codel, HTB, pfifo_fast"]
        QDISC --> CLS["classifier (cls_bpf, u32, flower)"]
        CLS --> ACT["action (mirred, drop, redirect)"]

        %% Virtual devices and bridges
        IP --> VDEV["Virtual Devices<br/>(veth, bridge, bond, team, VLAN, MACVLAN, VXLAN)"]
        VDEV --> BR["Bridge (br_netfilter, STP, FDB)"]
        BR --> ETH["net_device Interface<br/>(struct net_device)"]

        %% Device driver layer
        ETH --> NAPI["NAPI / GRO / GSO<br/>(Packet batching and offload)"]
        NAPI --> DRIVER["Network Device Driver<br/>(e1000e, ixgbe, virtio-net)"]
    end

    %% HARDWARE
    DRIVER --> HW["Hardware (NIC, PHY, DMA, TX/RX rings, Interrupts)"]

Packet Flow

Receiving Packets

graph TD
    A[NIC Hardware] --> B[Driver: Interrupt/NAPI]
    B --> C[sk_buff allocated]
    C --> D[IP Layer Processing]
    D --> E[TCP/UDP Layer]
    E --> F[Socket Buffer]
    F --> G[Application]

Detailed steps:

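  1. NIC receives packet: DMA into the RX ring buffer
  2. Interrupt: CPU notified; NAPI then masks further RX interrupts and polls the ring
  3. Driver: Wraps the received data in an sk_buff (small packets may be copied) and hands it to the stack
  4. Netfilter: PREROUTING hook (iptables/nftables)
  5. IP Layer: Header validation and routing decision (local delivery or forwarding)
  6. Netfilter: INPUT hook (locally destined packets only)
  7. TCP/UDP: Checksum verification and socket lookup; TCP also processes sequence numbers and ACKs
  8. Socket buffer: Data queued on the socket's receive queue
  9. Application: read()/recv() copies data to user space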
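
Step 7 mentions checksum verification. IPv4, TCP, and UDP all use the 16-bit one's-complement Internet checksum (RFC 1071). The sketch below is a plain user-space version for illustration only: the name `inet_checksum` is hypothetical, and the kernel relies on optimized per-architecture routines and NIC offload instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative helper (not a kernel API): 16-bit one's-complement
 * Internet checksum over `len` bytes, returned in host byte order.
 * Intended for packet-sized buffers (up to 64 KiB). */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    /* Sum the buffer as big-endian 16-bit words. */
    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }

    /* An odd trailing byte is padded with a zero low byte. */
    if (len == 1)
        sum += (uint32_t)data[0] << 8;

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;
}
```

A sender computes this with the checksum field set to zero and stores the result in the header. A receiver runs the same sum over the header including the stored checksum; a result of 0 means the header arrived intact.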

Sending Packets

graph TD
    A[Application: write/send] --> B[Socket Layer]
    B --> C[TCP/UDP Layer]
    C --> D[IP Layer: Routing]
    D --> E[Netfilter: OUTPUT]
    E --> F[Device Queue]
    F --> G[NIC: Transmit]

Low-Level Packet Flow

---
config:
    fontFamily: Monospace
    layout: elk
    wrap: false
---
flowchart TD

    %% USER SPACE
    A1["User Space<br/>Applications: curl, ssh, nginx, etc."] -->|"send()/recv() syscalls"| B1[Socket Layer]

    %% TRANSMIT PATH
    subgraph TX["TX Path (Outgoing Packet)"]
        direction TB
        B1 --> C1["Transport Layer<br/>(TCP, UDP, SCTP)"]
        C1 --> D1["IP Layer (IPv4/IPv6)<br/>Builds headers, checksum, etc."]
        D1 -->|Routing decision| E1["Routing Subsystem<br/>(fib_lookup, policy rules)"]
        E1 -->|Netfilter OUTPUT hook| F1["Netfilter / nftables<br/>(OUTPUT, POSTROUTING)"]
        F1 -->|Optional NAT / filtering| G1["Traffic Control (tc) egress<br/>(qdisc, classifier, action)"]
        G1 -->|Optional eBPF tc hook| H1["Virtual Device Layer<br/>(veth, bridge, bond, VLAN, etc.)"]
        H1 -->|net_device ops| I1["Driver Queue (NAPI TX ring)"]
        I1 --> J1["NIC Hardware<br/>(DMA → wire)"]
    end

    %% RECEIVE PATH
    subgraph RX["RX Path (Incoming Packet)"]
        direction TB
        J2["NIC Hardware<br/>(Interrupt, DMA RX ring)"] --> K2["NAPI Poll Loop<br/>(GRO, checksum, offloads)"]
        K2 -->|"XDP hook (optional eBPF)"| L2["XDP / eBPF Fast Path"]
        L2 -->|if not dropped| M2["net_device RX handler"]
        M2 -->|Netfilter PREROUTING| N2["Netfilter / nftables<br/>(PREROUTING)"]
        N2 -->|Routing decision| O2["Routing Subsystem<br/>(fib_lookup)"]
        O2 -->|Local destination?| P2{"Is packet for local host?"}
        P2 -->|Yes| Q2["Netfilter INPUT hook"]
        Q2 --> R2["Transport Layer Demux<br/>(TCP, UDP, ICMP)"]
        R2 --> S2["Socket Receive Queue<br/>(sk_buff queued to app)"]
        S2 --> T2["User Space read()<br/>(recv(), recvmsg())"]
        P2 -->|No| U2["Forwarding Path"]
        U2 -->|Netfilter FORWARD hook| V2["Netfilter Forward Decision"]
        V2 -->|Allowed| X2["Egress Device Routing"]
        X2 -->|Netfilter POSTROUTING| Y2["Netfilter NAT / Postrouting"]
        Y2 --> W2["Traffic Control (tc) egress"]
        W2 --> Z2["Driver Queue (TX)"]
        Z2 --> AA2["NIC Hardware → Outgoing Interface"]
    end

    %% RELATION BETWEEN TX AND RX
    J1 -.-> J2
    A1 --> B1

Socket API

Creating Sockets

#include <sys/socket.h>
#include <netinet/in.h>  // IPPROTO_TCP

// TCP socket
int tcp_sock = socket(AF_INET, SOCK_STREAM, 0);

// UDP socket
int udp_sock = socket(AF_INET, SOCK_DGRAM, 0);

// Raw socket (requires root or CAP_NET_RAW)
int raw_sock = socket(AF_INET, SOCK_RAW, IPPROTO_TCP);

Address Families:

  • AF_INET: IPv4
  • AF_INET6: IPv6
  • AF_UNIX: Unix domain sockets
  • AF_PACKET: Raw link-layer (Layer 2) packets

Socket Types:

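  • SOCK_STREAM: Reliable, connection-oriented byte stream (TCP for AF_INET/AF_INET6)
  • SOCK_DGRAM: Connectionless datagrams (UDP for AF_INET/AF_INET6)
  • SOCK_RAW: Raw IP packets
  • SOCK_SEQPACKET: Reliable, ordered, connection-oriented datagrams that preserve message boundaries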
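
To see how the family and type arguments fit together, the following sketch creates a socket and reads its type back from the kernel with the `SO_TYPE` socket option. The helper name `socket_type_of` is hypothetical and exists only for this illustration.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a socket of the given family and type, then
 * ask the kernel which type it recorded. Returns the SO_TYPE value
 * (e.g. SOCK_STREAM), or -1 on error. */
int socket_type_of(int family, int type)
{
    int fd = socket(family, type, 0);   /* protocol 0 = default for this pair */
    if (fd < 0)
        return -1;

    int actual = -1;
    socklen_t len = sizeof(actual);
    if (getsockopt(fd, SOL_SOCKET, SO_TYPE, &actual, &len) < 0)
        actual = -1;

    close(fd);
    return actual;
}
```

For example, `socket_type_of(AF_UNIX, SOCK_SEQPACKET)` should return `SOCK_SEQPACKET` on Linux, while an unsupported family and type combination makes `socket()` fail and the helper return -1.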

TCP Server

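#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    // 1. Create socket
    int listen_sock = socket(AF_INET, SOCK_STREAM, 0);
    if (listen_sock < 0) { perror("socket"); return 1; }

    // 2. Bind to address
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));            // Clear padding bytes
    addr.sin_family = AF_INET;
    addr.sin_port = htons(8080);               // Port in network byte order
    addr.sin_addr.s_addr = htonl(INADDR_ANY);  // All local IPv4 addresses
    if (bind(listen_sock, (struct sockaddr*)&addr, sizeof(addr)) < 0) { perror("bind"); return 1; }

    // 3. Listen
    if (listen(listen_sock, 128) < 0) { perror("listen"); return 1; }  // Backlog = 128

    // 4. Accept connections (blocks until a client connects)
    struct sockaddr_in client_addr;
    socklen_t addr_len = sizeof(client_addr);
    int client_sock = accept(listen_sock, (struct sockaddr*)&client_addr, &addr_len);
    if (client_sock < 0) { perror("accept"); return 1; }

    // 5. Communicate
    char buf[1024];
    ssize_t n = recv(client_sock, buf, sizeof(buf), 0);  // 0 = peer closed, < 0 = error
    if (n > 0)
        send(client_sock, "Hello", 5, 0);

    // 6. Close
    close(client_sock);
    close(listen_sock);
    return 0;
}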

TCP Client

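#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    // 1. Create socket
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0) { perror("socket"); return 1; }

    // 2. Connect
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(8080);
    inet_pton(AF_INET, "192.168.1.1", &addr.sin_addr);  // Text to binary address
    if (connect(sock, (struct sockaddr*)&addr, sizeof(addr)) < 0) { perror("connect"); return 1; }

    // 3. Communicate
    send(sock, "Hello", 5, 0);
    char buf[1024];
    ssize_t n = recv(sock, buf, sizeof(buf), 0);  // May return fewer bytes than requested
    if (n > 0)
        printf("received %zd bytes\n", n);

    // 4. Close
    close(sock);
    return 0;
}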
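
The server and client steps can be exercised together without a second machine. The sketch below runs both ends in one process over the loopback address; it assumes the loopback interface is up, and the function name `loopback_roundtrip` is hypothetical. It binds to port 0 so the kernel picks a free port, then uses `getsockname()` to learn which one.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical demo: server and client steps in one process over 127.0.0.1.
 * Returns 0 if "Hello" arrives intact, -1 on any failure. */
int loopback_roundtrip(void)
{
    int rc = -1, cli = -1, conn = -1;
    char buf[16] = {0};
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);

    int srv = socket(AF_INET, SOCK_STREAM, 0);
    if (srv < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(0);                       /* 0 = kernel picks a port */
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);  /* 127.0.0.1 */

    if (bind(srv, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(srv, 1) < 0 ||
        getsockname(srv, (struct sockaddr *)&addr, &len) < 0)  /* learn the port */
        goto out;

    cli = socket(AF_INET, SOCK_STREAM, 0);
    /* The kernel completes the handshake against the listen backlog,
     * so connect() can return before accept() is called. */
    if (cli < 0 || connect(cli, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;

    conn = accept(srv, NULL, NULL);
    if (conn < 0 || send(cli, "Hello", 5, 0) != 5)
        goto out;

    /* MSG_WAITALL: block until all 5 bytes have arrived. */
    if (recv(conn, buf, 5, MSG_WAITALL) == 5 && memcmp(buf, "Hello", 5) == 0)
        rc = 0;

out:
    if (conn >= 0)
        close(conn);
    if (cli >= 0)
        close(cli);
    close(srv);
    return rc;
}
```

Calling `loopback_roundtrip()` from a test program is a quick way to confirm that the local TCP stack and loopback interface are working before debugging a real client and server pair.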

Socket Options

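#include <netinet/in.h>
#include <netinet/tcp.h>   // TCP_NODELAY
#include <sys/socket.h>
#include <sys/time.h>      // struct timeval

// Reuse address (avoid "Address already in use"); set before bind()
int opt = 1;
setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));

// Keep-alive (probe idle connections to detect dead peers)
setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &opt, sizeof(opt));

// Send/receive buffer sizes (Linux doubles the requested value, capped by wmem_max/rmem_max)
int bufsize = 65536;
setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize));
setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof(bufsize));

// TCP_NODELAY (disable Nagle's algorithm)
setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &opt, sizeof(opt));

// Timeout: a blocking recv() fails with EAGAIN/EWOULDBLOCK after 5 seconds
struct timeval tv = {5, 0};  // 5 seconds
setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));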
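
A value passed to `setsockopt()` is a request, not a guarantee, so it is worth reading it back with `getsockopt()`. The sketch below does this for `SO_RCVBUF`; per socket(7), Linux doubles the requested size to leave room for bookkeeping and caps it at `net.core.rmem_max`. The helper name `effective_rcvbuf` is hypothetical.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: request a receive buffer size on a fresh UDP socket
 * and return the size the kernel actually applied, or -1 on error. */
int effective_rcvbuf(int requested)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    int actual = -1;
    socklen_t len = sizeof(actual);
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested)) < 0 ||
        getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len) < 0)
        actual = -1;

    close(fd);
    return actual;
}
```

On a typical Linux system a request for 8192 bytes reads back as a larger value because of the doubling; the exact number depends on the kernel and its sysctl limits.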

Kernel Network Structures

sk_buff

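struct sk_buff (socket buffer) is the kernel's per-packet structure: it holds pointers to the packet data plus the metadata used by every layer of the stack. The listing below is a simplified, regrouped view; the field order and the exact set of fields differ between kernel versions.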

/**
 * struct sk_buff - socket buffer
 * 
 * The fundamental data structure for network packets in Linux kernel.
 * Contains packet data, metadata, and various flags for packet processing.
 */
struct sk_buff {
    /* ========== List Management ========== */
    struct sk_buff          *next;              /* Next buffer in list */
    struct sk_buff          *prev;              /* Previous buffer in list */

    /* ========== Device & Socket ========== */
    struct net_device       *dev;               /* Network device */
    struct sock             *sk;                /* Associated socket */

    /* ========== Timestamps ========== */
    ktime_t                 tstamp;             /* Packet timestamp */

    /* ========== Packet Data Pointers ========== */
    unsigned char           *head;              /* Start of allocated buffer */
    unsigned char           *data;              /* Start of actual data */
    sk_buff_data_t          tail;               /* End of data */
    sk_buff_data_t          end;                /* End of allocated buffer */

    /* ========== Data Length ========== */
    unsigned int            len;                /* Total data length */
    unsigned int            data_len;           /* Non-linear data length */
    unsigned int            truesize;           /* Total buffer size (including overhead) */

    /* ========== Protocol Headers ========== */
    __u16                   mac_header;         /* Link layer header offset */
    __u16                   network_header;     /* Network layer header offset (IP) */
    __u16                   transport_header;   /* Transport layer header offset (TCP/UDP) */
    __be16                  protocol;           /* Packet protocol (ETH_P_IP, ETH_P_IPV6, etc.) */

    /* ========== Checksum ========== */
    __u8                    ip_summed:2;        /* Checksum status (NONE, UNNECESSARY, COMPLETE, PARTIAL) */
    union {
        __wsum              csum;               /* Checksum value */
        struct {
            __u16           csum_start;         /* Checksum start offset */
            __u16           csum_offset;        /* Checksum field offset */
        };
    };

    /* ========== VLAN ========== */
    union {
        u32                 vlan_all;
        struct {
            __be16          vlan_proto;         /* VLAN protocol (ETH_P_8021Q, ETH_P_8021AD) */
            __u16           vlan_tci;           /* VLAN TCI (tag control information) */
        };
    };

    /* ========== QoS & Routing ========== */
    __u32                   priority;           /* Packet priority */
    __u32                   mark;               /* Packet mark (for routing/filtering) */
    __u16                   queue_mapping;      /* TX queue mapping */
    __u32                   hash;               /* Flow hash */
    int                     skb_iif;            /* Input interface index */

    /* ========== Clone & Reference ========== */
    __u8                    cloned:1;           /* Buffer is cloned */
    refcount_t              users;              /* Reference count */

    /* ========== Packet Type ========== */
    __u8                    pkt_type:3;         /* Packet class (PACKET_HOST, PACKET_BROADCAST, etc.) */

    /* ========== Control Buffer ========== */
    char                    cb[48] __aligned(8); /* Control buffer (layer-specific data) */

    /* ========== Extensions ========== */
    struct skb_ext          *extensions;        /* Optional extensions (SEC, TC, etc.) */
};

/**
 * Key Concepts:
 * 
 * Buffer Layout:
 *   head ----------> [headroom | data | tailroom] <---------- end
 *                              ^      ^
 *                            data   tail
 * 
 * Header Offsets:
 *   [Ethernet | IP | TCP/UDP | Payload]
 *    ^          ^    ^
 *    |          |    +-- transport_header
 *    |          +------- network_header
 *    +------------------ mac_header
 * 
 * Common Operations:
 *   - skb_put()     : Add data to tail
 *   - skb_push()    : Add data to head (prepend)
 *   - skb_pull()    : Remove data from head
 *   - skb_reserve() : Reserve headroom
 */

Key functions:

  • alloc_skb(): Allocate new sk_buff
  • skb_put(): Add data at tail
  • skb_push(): Add data at head
  • skb_pull(): Remove data from head
  • kfree_skb(): Release a reference; the sk_buff is freed when the last reference is dropped
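
The pointer arithmetic behind these helpers is easier to follow in a small model. The following is a toy user-space sketch, not kernel code: `struct toy_skb` and the `toy_*` functions are hypothetical names that mimic only how head, data, tail, end, and len move.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the sk_buff data pointers (NOT the kernel implementation). */
struct toy_skb {
    unsigned char buf[256];
    unsigned char *head, *data, *tail, *end;
    unsigned int len;
};

void toy_init(struct toy_skb *skb)
{
    skb->head = skb->data = skb->tail = skb->buf;
    skb->end = skb->buf + sizeof(skb->buf);
    skb->len = 0;
}

/* Like skb_reserve(): open up headroom in an empty buffer. */
void toy_reserve(struct toy_skb *skb, unsigned int n)
{
    assert(skb->len == 0 && n <= (size_t)(skb->end - skb->tail));
    skb->data += n;
    skb->tail += n;
}

/* Like skb_put(): extend the data area at the tail.
 * Returns the old tail, where the caller writes the new bytes. */
unsigned char *toy_put(struct toy_skb *skb, unsigned int n)
{
    unsigned char *old_tail = skb->tail;
    assert(n <= (size_t)(skb->end - skb->tail));
    skb->tail += n;
    skb->len += n;
    return old_tail;
}

/* Like skb_push(): prepend n bytes (e.g. a header) using headroom. */
unsigned char *toy_push(struct toy_skb *skb, unsigned int n)
{
    assert(n <= (size_t)(skb->data - skb->head));
    skb->data -= n;
    skb->len += n;
    return skb->data;
}

/* Like skb_pull(): strip n bytes from the front (e.g. a parsed header). */
unsigned char *toy_pull(struct toy_skb *skb, unsigned int n)
{
    assert(n <= skb->len);
    skb->data += n;
    skb->len -= n;
    return skb->data;
}
```

A transmit path in this model reserves 54 bytes of headroom, puts the payload, then pushes the TCP (20), IP (20), and Ethernet (14) headers in turn, ending with data back at head. A receive path does the reverse, pulling each header off as the packet moves up the stack.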

Socket Buffer

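Each socket has its own receive and send buffers. The net.core sysctls below give the default and maximum sizes in bytes; TCP takes its defaults from net.ipv4.tcp_rmem and net.ipv4.tcp_wmem instead.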

# View socket buffer sizes
cat /proc/sys/net/core/rmem_default
cat /proc/sys/net/core/wmem_default
cat /proc/sys/net/core/rmem_max
cat /proc/sys/net/core/wmem_max

# Per-socket view
ss -tm  # Memory info

Netfilter / iptables

Netfilter is the kernel's packet filtering and mangling framework. iptables and nftables are user-space front ends that attach rules to its hooks.

Chains and Tables

Tables:

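  • filter: Packet filtering (default table)
  • nat: Network address translation
  • mangle: Packet header modification (e.g. TOS, TTL, mark)
  • raw: Processed before connection tracking (e.g. to exempt packets with NOTRACK)
  • security: Mandatory Access Control rules (e.g. SELinux)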

Chains:

  • PREROUTING: Just arrived
  • INPUT: Destined for local
  • FORWARD: Routed through
  • OUTPUT: Locally generated
  • POSTROUTING: About to leave
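
Which chains a packet traverses depends only on where it comes from and where it is going. The sketch below models that decision for a single pass through the stack; the enum and the function `hook_path` are hypothetical names for illustration, not kernel identifiers.

```c
#include <stdbool.h>
#include <stddef.h>

enum hook {
    HOOK_PREROUTING,
    HOOK_INPUT,
    HOOK_FORWARD,
    HOOK_OUTPUT,
    HOOK_POSTROUTING
};

/* Hypothetical model: writes the hooks a packet traverses into `path`
 * and returns how many there are (at most 3).
 * `for_local_host` is ignored for locally generated packets. */
size_t hook_path(bool locally_generated, bool for_local_host, enum hook path[3])
{
    size_t n = 0;

    if (locally_generated) {
        /* Sent by a local process: OUTPUT, then POSTROUTING. */
        path[n++] = HOOK_OUTPUT;
        path[n++] = HOOK_POSTROUTING;
        return n;
    }

    path[n++] = HOOK_PREROUTING;          /* every received packet */
    if (for_local_host) {
        path[n++] = HOOK_INPUT;           /* delivered to a local socket */
    } else {
        path[n++] = HOOK_FORWARD;         /* routed through this host */
        path[n++] = HOOK_POSTROUTING;
    }
    return n;
}
```

The model covers one pass only. Traffic a host sends to itself leaves through OUTPUT and POSTROUTING, then re-enters over the loopback device as a received packet and passes PREROUTING and INPUT.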

Packet Flow

flowchart TB
    Start([📡 Packet Arrives<br/>Network Interface])
    PreRouting{{"🔍 PREROUTING<br/>netfilter hook<br/>(nat, mangle, raw)"}}
    RouteDec1{{"🧭 Routing Decision<br/>Is packet for<br/>local machine?"}}
    Input{{"📥 INPUT<br/>netfilter hook<br/>(filter, nat, mangle)"}}
    LocalProc[["💻 Local Process<br/>(Application Layer)<br/>Socket receive/send"]]
    Output{{"📤 OUTPUT<br/>netfilter hook<br/>(filter, nat, mangle, raw)"}}
    Forward{{"↔️ FORWARD<br/>netfilter hook<br/>(filter, mangle)"}}
    RouteDec2{{"🧭 Routing Decision<br/>Determine<br/>outgoing interface"}}
    PostRouting{{"📮 POSTROUTING<br/>netfilter hook<br/>(nat, mangle)"}}

    Out1([🌐 Send to Network<br/>Interface])

    Drop1[/❌ DROP/]
    Drop2[/❌ DROP/]
    Drop3[/❌ DROP/]
    Drop4[/❌ DROP/]
    Drop5[/❌ DROP/]

    Start --> PreRouting

    PreRouting -->|"ACCEPT"| RouteDec1
    PreRouting -.->|"DROP"| Drop1

    RouteDec1 -->|"Destination:<br/>Local IP"| Input
    RouteDec1 -->|"Destination:<br/>Other IP<br/>(Forwarding enabled)"| Forward

    Input -->|"ACCEPT"| LocalProc
    Input -.->|"DROP/REJECT"| Drop2

    LocalProc -->|"Application<br/>sends data"| Output

    Output -.->|"DROP/REJECT"| Drop3
    Output -->|"ACCEPT"| RouteDec2

    Forward -->|"ACCEPT"| RouteDec2
    Forward -.->|"DROP/REJECT"| Drop4

    RouteDec2 --> PostRouting

    PostRouting -->|"ACCEPT"| Out1
    PostRouting -.->|"DROP"| Drop5

Detailed Hook Information

graph LR
    subgraph Tables["iptables Tables (processed in order)"]
        direction TB
        Raw["1️⃣ raw<br/>Connection tracking bypass"]
        Mangle["2️⃣ mangle<br/>Packet alteration (TOS, TTL)"]
        Nat["3️⃣ nat<br/>Address translation"]
        Filter["4️⃣ filter<br/>Packet filtering (allow/deny)"]
        Security["5️⃣ security<br/>SELinux rules"]
    end

    subgraph Hooks["Netfilter Hooks & Available Tables"]
        direction TB

        subgraph H1["PREROUTING"]
            P1["✓ raw<br/>✓ mangle<br/>✓ nat"]
        end

        subgraph H2["INPUT"]
            P2["✓ mangle<br/>✓ filter<br/>✓ nat<br/>✓ security"]
        end

        subgraph H3["FORWARD"]
            P3["✓ mangle<br/>✓ filter<br/>✓ security"]
        end

        subgraph H4["OUTPUT"]
            P4["✓ raw<br/>✓ mangle<br/>✓ nat<br/>✓ filter<br/>✓ security"]
        end

        subgraph H5["POSTROUTING"]
            P5["✓ mangle<br/>✓ nat"]
        end
    end

    subgraph Actions["Common Actions"]
        direction TB
        Accept["✅ ACCEPT<br/>Allow packet"]
        Drop["❌ DROP<br/>Silently discard"]
        Reject["🚫 REJECT<br/>Discard + send error"]
        Log["📝 LOG<br/>Log and continue"]
        Masq["🎭 MASQUERADE<br/>Dynamic SNAT"]
        DNAT["🎯 DNAT<br/>Destination NAT"]
        SNAT["📤 SNAT<br/>Source NAT"]
    end

iptables Examples

# List rules
iptables -L -n -v

# Allow SSH
iptables -A INPUT -p tcp --dport 22 -j ACCEPT

# Block IP
iptables -A INPUT -s 192.168.1.100 -j DROP

# NAT/Masquerade
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE

# Port forward
iptables -t nat -A PREROUTING -p tcp --dport 80 -j DNAT --to 192.168.1.10:8080

# Save rules
iptables-save > /etc/iptables/rules.v4

# Restore rules
iptables-restore < /etc/iptables/rules.v4

Network Namespaces

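Network namespaces give each namespace its own isolated network stack: interfaces, routing tables, firewall rules, and sockets (covered in Fundamentals).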

Network Diagnostics

tcpdump

Capture packets for analysis.

# Capture on interface
tcpdump -i eth0

# Save to file
tcpdump -i eth0 -w capture.pcap

# Read from file
tcpdump -r capture.pcap

# Filter by host
tcpdump host 192.168.1.1

# Filter by port
tcpdump port 80

# TCP flags
tcpdump 'tcp[tcpflags] & tcp-syn != 0'

# Verbose output
tcpdump -i eth0 -nn -vv
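
The TCP flags filter above is a byte test: `tcp[tcpflags]` is byte 13 of the TCP header and `tcp-syn` is the bit 0x02. The same check written in C looks like the sketch below; the macro and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define TCP_FLAGS_OFFSET 13   /* the byte tcpdump calls tcp[tcpflags] */
#define TCP_FLAG_SYN 0x02     /* the bit tcpdump calls tcp-syn */
#define TCP_FLAG_ACK 0x10     /* the bit tcpdump calls tcp-ack */

/* Hypothetical helper: true if the SYN bit is set.
 * `tcp_hdr` must point at a TCP header of at least 14 bytes. */
bool tcp_has_syn(const uint8_t *tcp_hdr)
{
    return (tcp_hdr[TCP_FLAGS_OFFSET] & TCP_FLAG_SYN) != 0;
}

/* True only for the first packet of a handshake: SYN set, ACK clear. */
bool tcp_is_initial_syn(const uint8_t *tcp_hdr)
{
    uint8_t flags = tcp_hdr[TCP_FLAGS_OFFSET];
    return (flags & (TCP_FLAG_SYN | TCP_FLAG_ACK)) == TCP_FLAG_SYN;
}
```

`tcp_has_syn()` matches both SYN and SYN-ACK packets, just like the tcpdump expression. To capture only connection attempts, the stricter test corresponds to the filter `'tcp[tcpflags] & (tcp-syn|tcp-ack) == tcp-syn'`.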

ss (socket statistics)

Modern replacement for netstat.

# All TCP connections
ss -tan

# Listening sockets
ss -tln

# With process info
ss -tlnp

# Socket memory
ss -tm

# Filter by state
ss state established

# Filter by port
ss -tan sport :22

netstat (legacy)

# All connections
netstat -an

# Listening
netstat -tln

# Routing table
netstat -rn

# Interface statistics
netstat -i

Network Performance Tuning

TCP Tuning

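# TCP buffer sizes in bytes (sysctl -w changes are lost on reboot; persist them in /etc/sysctl.d/)
sysctl -w net.core.rmem_max=134217728
sysctl -w net.core.wmem_max=134217728
# tcp_rmem / tcp_wmem take three values: min, default, max
sysctl -w net.ipv4.tcp_rmem="4096 87380 67108864"
sysctl -w net.ipv4.tcp_wmem="4096 65536 67108864"

# TCP window scaling
sysctl -w net.ipv4.tcp_window_scaling=1

# Congestion control (bbr must be listed in net.ipv4.tcp_available_congestion_control)
sysctl -w net.ipv4.tcp_congestion_control=bbr

# SYN cookies (mitigate SYN floods)
sysctl -w net.ipv4.tcp_syncookies=1

# Connection tracking table size (requires the nf_conntrack module)
sysctl -w net.netfilter.nf_conntrack_max=1048576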
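
One common rule of thumb for sizing the maximum TCP buffer is the bandwidth-delay product (BDP): the number of bytes that can be in flight on a path at once. The sketch below computes it; the function name `bdp_bytes` is hypothetical, and the link speeds and round-trip times are example inputs rather than recommendations.

```c
#include <stdint.h>

/* Hypothetical helper: bandwidth-delay product in bytes.
 * bits_per_sec: path bandwidth; rtt_ms: round-trip time in milliseconds. */
uint64_t bdp_bytes(uint64_t bits_per_sec, uint64_t rtt_ms)
{
    /* bits/s -> bytes/s, then scale by the RTT in seconds. */
    return bits_per_sec / 8 * rtt_ms / 1000;
}
```

For a 10 Gbit/s path with a 50 ms round-trip time this gives 62,500,000 bytes, which is close to the 64 MiB (67108864) maximum used in the tcp_rmem and tcp_wmem settings above. A 1 Gbit/s path at 100 ms needs about 12.5 MB.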

Monitoring

# Network stats
netstat -s
nstat

# Per-interface stats
ip -s link

# Network throughput
iftop
nethogs
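
The per-interface counters are also exposed as text in /proc/net/dev, one row per interface, which is convenient for a custom monitor. The sketch below parses a single row; it assumes the usual layout of eight receive fields followed by eight transmit fields, and the struct and function names are hypothetical.

```c
#include <stdio.h>

struct if_counters {
    char name[32];
    unsigned long long rx_bytes, rx_packets, tx_bytes, tx_packets;
};

/* Hypothetical helper: parse one interface row of /proc/net/dev.
 * Returns 0 on success, -1 if the line is not an interface row
 * (for example, one of the two header lines). */
int parse_netdev_line(const char *line, struct if_counters *out)
{
    /* name: rx_bytes rx_packets <6 skipped rx fields> tx_bytes tx_packets */
    int n = sscanf(line,
                   " %31[^:]: %llu %llu %*s %*s %*s %*s %*s %*s %llu %llu",
                   out->name, &out->rx_bytes, &out->rx_packets,
                   &out->tx_bytes, &out->tx_packets);
    return n == 5 ? 0 : -1;
}
```

Reading the file twice, a fixed interval apart, and subtracting the byte counters gives a throughput figure for each interface.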

Practice Questions

  1. Explain the path of a packet from NIC to application.
  2. What is the difference between bind() and listen()?
  3. How does SO_REUSEADDR work?
  4. Explain the iptables packet flow through chains.
  5. What is an sk_buff?
  6. When would you use TCP_NODELAY?
  7. How do you capture packets on a specific port with tcpdump?

Further Reading

  • man 7 socket, man 2 socket
  • man 7 tcp, man 7 udp
  • man 8 iptables
  • Kernel source: net/ directory