System Calls & Kernel Internals

Critical Topic

This is a critical topic for Google SRE interviews. Expect deep-dive questions that test your understanding of the system call mechanism and kernel internals.

Overview

System calls are the fundamental interface between user space applications and the Linux kernel. They provide a controlled way for applications to request kernel services while maintaining system security and stability.

System Call Mechanism

User Space to Kernel Space Transition

When a user space application needs kernel services, it must transition from user space to kernel space. This transition is carefully controlled to maintain system security.

// User space code
int fd = open("/tmp/file.txt", O_RDONLY);

The transition process:

  1. Application calls a library function (e.g., open() from libc)
  2. Library function prepares the system call:
    • Places system call number in a register (e.g., %rax on x86-64)
    • Places arguments in specific registers
  3. Special CPU instruction is executed (e.g., syscall on x86-64, svc on ARM)
  4. CPU switches to kernel mode:
    • Changes privilege level (Ring 3 → Ring 0)
    • Switches to kernel stack
    • Saves user space context
  5. Kernel's system call handler executes:
    • Validates arguments
    • Performs requested operation
    • Prepares return value
  6. Return to user space:
    • Restores user space context
    • Switches back to user mode
    • Returns control to application
graph TD
    A[User Application] -->|1. Call libc function| B[C Library glibc]
    B -->|2. Setup registers| C[System Call Interface]
    C -->|3. syscall instruction| D[Kernel Mode Switch]
    D -->|4. Execute| E[System Call Handler]
    E -->|5. Kernel Service| F[File System / Memory / etc.]
    F -->|6. Return| E
    E -->|7. Return to user mode| A
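
Steps 2 and 3 above can be made visible by skipping the named wrapper and passing the system call number yourself. The following is a minimal sketch, assuming Linux with glibc; `raw_getpid` and `raw_write` are hypothetical helper names, and `syscall()` is itself a thin generic libc wrapper that loads the registers and executes the trap instruction.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_getpid, SYS_write */
#include <unistd.h>        /* syscall() */

/* Ask the kernel for our PID by system call number instead of getpid(). */
long raw_getpid(void) {
    return syscall(SYS_getpid);
}

/* Same idea for write(2): the number first, then its three arguments. */
long raw_write(int fd, const void *buf, size_t len) {
    return syscall(SYS_write, fd, buf, len);
}
```

Both helpers should return the same values as `getpid()` and `write()`, because the named wrappers end up issuing the same system calls.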

Why Direct Kernel Access is Prohibited

User space applications cannot directly access kernel space for several critical reasons:

  1. Security: Prevents malicious code from compromising the system
  2. Stability: Prevents buggy code from crashing the kernel
  3. Isolation: Each process has its own memory space
  4. Abstraction: Provides a stable API regardless of hardware changes

Privilege Separation

If user space could directly access kernel space, a single bug or malicious program could:

  • Crash the entire system
  • Access any process's memory
  • Bypass all security mechanisms
  • Corrupt kernel data structures
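
One way to observe this boundary from user space is to hand a system call a pointer the process is not allowed to read. The sketch below assumes Linux, where the kernel rejects such a buffer with EFAULT instead of dereferencing it; `errno_for_bad_pointer` is a hypothetical helper name.

```c
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Returns the errno produced when write(2) is given a buffer address
 * at the very top of the address space, which is never a valid user
 * buffer. Returns -1 if the pipe could not be created. */
int errno_for_bad_pointer(void) {
    int pipefd[2];
    if (pipe(pipefd) == -1)
        return -1;

    /* volatile keeps the compiler from reasoning about the bogus address */
    volatile uintptr_t bad_address = UINTPTR_MAX;

    errno = 0;
    ssize_t n = write(pipefd[1], (const void *)bad_address, 1);
    int saved = (n == -1) ? errno : 0;

    close(pipefd[0]);
    close(pipefd[1]);
    return saved;
}
```

The call fails cleanly and the process keeps running: the kernel validated the argument before touching it.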

Common System Calls

Process Control

fork()

Creates a new process by duplicating the calling process.

#include <unistd.h>
#include <stdio.h>

int main() {
    pid_t pid = fork();

    if (pid == 0) {
        // Child process
        printf("Child PID: %d\n", getpid());
    } else if (pid > 0) {
        // Parent process
        printf("Parent PID: %d, Child PID: %d\n", getpid(), pid);
    } else {
        // Error
        perror("fork failed");
    }

    return 0;
}

What happens during fork():

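  1. Kernel creates a new Process Control Block (PCB) for the child
  2. Assigns a new PID to the child
  3. Duplicates the parent's address space, sharing pages Copy-on-Write instead of copying them eagerly
  4. Copies the file descriptor table (both tables refer to the same open files, so file offsets are shared)
  5. Marks the child runnable and hands it to the scheduler
  6. Both processes return from fork(): the child receives 0, the parent receives the child's PID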
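
The descriptor-table copy in step 4 has a visible consequence: parent and child descriptors refer to the same open file, so they share one file offset. A minimal sketch, assuming `/tmp` is writable; the function name and the temporary file name pattern are made up for illustration.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* The child writes 5 bytes through an inherited descriptor; the parent
 * then reports its own offset on that descriptor. Returns -1 on error. */
off_t parent_offset_after_child_write(void) {
    char path[] = "/tmp/fork_fd_demo_XXXXXX";   /* hypothetical name */
    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);                /* file disappears once fd is closed */

    pid_t pid = fork();
    if (pid == -1) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        ssize_t n = write(fd, "hello", 5);
        _exit(n == 5 ? 0 : 1);
    }

    int status;
    off_t offset = -1;
    if (waitpid(pid, &status, 0) == pid &&
        WIFEXITED(status) && WEXITSTATUS(status) == 0)
        offset = lseek(fd, 0, SEEK_CUR);   /* moved by the child's write */

    close(fd);
    return offset;
}
```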

Copy-on-Write (COW)

Modern Linux uses COW to optimize fork(). The child initially shares the parent's memory pages. Pages are only copied when either process writes to them.
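
The observable effect of COW is that a write in one process never shows up in the other. The sketch below illustrates that isolation (it cannot show the page copy itself); `parent_value_after_child_write` is a hypothetical name.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the parent's copy of `value` after the child has overwritten
 * its own copy, or -1 on error. */
int parent_value_after_child_write(void) {
    int value = 1;

    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        value = 42;      /* lands in the child's private copy */
        _exit(value);    /* _exit: do not run the parent's atexit/stdio state */
    }

    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 42)
        return -1;

    return value;        /* the parent's copy is untouched */
}
```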

exec() Family

Replaces the current process image with a new program.

#include <unistd.h>

int main() {
    char *args[] = {"/bin/ls", "-l", NULL};

    // Replace current process with ls
    execv("/bin/ls", args);

    // This line only executes if execv fails
    perror("execv failed");
    return 1;
}

exec() variants:

int execl(const char *path, const char *arg0, ... /*, (char *) NULL */);
int execle(const char *path, const char *arg0, ... /*, (char *) NULL, char *const envp[] */);
int execlp(const char *file, const char *arg0, ... /*, (char *) NULL */);
int execv(const char *path, char *const argv[]);
int execve(const char *path, char *const argv[], char *const envp[]);
int execvp(const char *file, char *const argv[]);
int execvpe(const char *file, char *const argv[], char *const envp[]);
int fexecve(int fd, char *const argv[], char *const envp[]);

The base of each is exec, followed by one or more letters:

  • e – Environment variables are passed as an array of pointers to null-terminated strings of form name=value. The final element of the array must be a null pointer.
  • l – Command-line arguments are passed as individual pointers to null-terminated strings. The last argument must be a null pointer.
  • p – Uses the PATH environment variable to find the file named in the file argument to be executed.
  • v – Command-line arguments are passed as an array of pointers to null-terminated strings. The final element of the array must be a null pointer.
  • f (prefix) – A file descriptor is passed instead. The file descriptor must be opened with O_RDONLY or O_PATH and the caller must have permission to execute its file.

In the functions that take no environment argument (execl(), execlp(), execv() and execvp()), the new process image inherits the caller's current environment variables.
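
To illustrate the `e` suffix, the sketch below runs `/bin/sh` with an environment that contains exactly one variable. It assumes `/bin/sh` exists; `DEMO_CODE` and `exit_code_from_env` are made-up names for this example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs: sh -c 'exit "$DEMO_CODE"' with `assignment` (for example
 * "DEMO_CODE=5") as the child's entire environment.
 * Returns the child's exit status, or -1 on error. */
int exit_code_from_env(const char *assignment) {
    char *const argv[] = {"sh", "-c", "exit \"$DEMO_CODE\"", NULL};
    char *const envp[] = {(char *)assignment, NULL};

    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        execve("/bin/sh", argv, envp);   /* v: argv array, e: explicit envp */
        _exit(127);                      /* only reached if execve failed */
    }

    int status;
    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

With execv() in place of execve(), the shell would see the caller's environment instead.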

What exec() does:

  1. Loads new program binary into memory
  2. Replaces current process's code, data, and stack
  3. Keeps same PID
  4. Keeps open file descriptors (unless marked close-on-exec)
  5. Resets caught signals to their default actions (ignored signals stay ignored)
  6. On success, never returns to the original code
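
fork() and exec() are usually combined: the child replaces itself with the new program while the parent waits for it, keeping its own PID and code. A minimal sketch of that pattern; `spawn_and_wait` is a hypothetical helper, not a standard function.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs the program at argv[0] with the given argument vector.
 * Returns its exit status, or -1 if it could not be started or did
 * not exit normally. */
int spawn_and_wait(char *const argv[]) {
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        execv(argv[0], argv);
        _exit(127);      /* only reached if execv failed */
    }

    int status;
    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

This is essentially the core loop of a simple shell: read a command line, split it into an argument vector, then call something like `spawn_and_wait`.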

File Operations

open()

Opens a file and returns a file descriptor.

#include <fcntl.h>
#include <unistd.h>

int fd = open("/tmp/file.txt", O_RDWR | O_CREAT, 0644);
if (fd == -1) {
    perror("open failed");
    return 1;
}

Flags:

  • O_RDONLY - Read only
  • O_WRONLY - Write only
  • O_RDWR - Read and write
  • O_CREAT - Create if doesn't exist
  • O_APPEND - Append mode
  • O_TRUNC - Truncate to zero length
  • O_NONBLOCK - Non-blocking mode
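
The effect of O_TRUNC and O_APPEND is easiest to see by watching the file size after each write. A small sketch under the assumption that the caller can create files at the given path; `write_and_size` is a hypothetical helper.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Opens `path` with `flags`, writes `len` bytes from `data`, and
 * returns the resulting file size, or -1 on error. */
long write_and_size(const char *path, int flags,
                    const char *data, size_t len) {
    int fd = open(path, flags, 0600);   /* mode matters only with O_CREAT */
    if (fd == -1)
        return -1;

    struct stat st;
    long size = -1;
    if (write(fd, data, len) == (ssize_t)len && fstat(fd, &st) == 0)
        size = (long)st.st_size;

    close(fd);
    return size;
}
```

Writing "abc" with O_WRONLY | O_CREAT | O_TRUNC and then "de" with O_WRONLY | O_APPEND should leave a 5-byte file; the same second write without O_APPEND would overwrite the start of the file instead.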

read() and write()

Transfer data between user space and kernel buffers.

char buffer[1024];
ssize_t bytes_read = read(fd, buffer, sizeof(buffer));
ssize_t bytes_written = write(fd, "Hello", 5);

System call flow for read():

  1. Validate file descriptor
  2. Check that the descriptor was opened for reading (file permissions were already checked by open())
  3. Check if data is in page cache
  4. If not in cache, schedule disk I/O
  5. Block process until data available (unless O_NONBLOCK)
  6. Copy data from kernel buffer to user space
  7. Update file offset
  8. Return number of bytes read
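
Because step 8 returns the number of bytes actually read, callers have to cope with short reads and with EINTR. A common pattern is a retry loop such as the sketch below; `read_full` is a hypothetical helper, not a libc function.

```c
#include <errno.h>
#include <unistd.h>

/* Reads up to `count` bytes, retrying on short reads and EINTR.
 * Returns the number of bytes read (less than `count` only at end of
 * file), or -1 on error. */
ssize_t read_full(int fd, void *buf, size_t count) {
    size_t total = 0;

    while (total < count) {
        ssize_t n = read(fd, (char *)buf + total, count - total);
        if (n == 0)
            break;                 /* end of file */
        if (n == -1) {
            if (errno == EINTR)
                continue;          /* interrupted by a signal: retry */
            return -1;
        }
        total += (size_t)n;
    }
    return (ssize_t)total;
}
```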

close()

Closes a file descriptor.

close(fd);

What close() does:

  1. Removes entry from process's file descriptor table
  2. Decrements reference count in system-wide file table
  3. If reference count reaches zero:
    • Releases the open file (dirty data may still sit in the page cache; call fsync() first if it must be on disk)
    • Releases inode reference
    • Frees resources
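
The reference counting can be observed with dup(), which creates a second descriptor for the same open file. In the sketch below (hypothetical function name), closing the original leaves the duplicate usable, and the reader only sees end of file once the last writer is closed.

```c
#include <unistd.h>

/* Returns 1 if a write through a dup()ed descriptor still works after
 * the original descriptor was closed, and the pipe reports end of
 * file once the duplicate is closed too. Returns -1 otherwise. */
int write_after_closing_original(void) {
    int pipefd[2];
    if (pipe(pipefd) == -1)
        return -1;

    int dup_fd = dup(pipefd[1]);          /* two references to the write end */
    if (dup_fd == -1) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }

    close(pipefd[1]);                     /* one reference left: still open */
    ssize_t written = write(dup_fd, "x", 1);

    close(dup_fd);                        /* last reference: write end released */

    char c;
    ssize_t first = read(pipefd[0], &c, 1);    /* the byte written above */
    ssize_t second = read(pipefd[0], &c, 1);   /* 0: no writers remain */
    close(pipefd[0]);

    return (written == 1 && first == 1 && second == 0) ? 1 : -1;
}
```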

Memory Operations

sbrk()

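Moves the program break, the end of the process's data segment, which is how the heap grows and shrinks. On Linux, sbrk() is a library function implemented on top of the brk() system call.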

#include <unistd.h>

void *old_brk = sbrk(0);        // Get current break
void *new_ptr = sbrk(1024);      // Increase by 1024 bytes

Used by malloc() for small allocations:

  • Increases heap size linearly
  • Fast for small allocations
  • Cannot free memory in the middle
  • Heap can only shrink from the top
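
The "top only" behaviour follows from the break being a single address. The sketch below, assuming Linux with glibc, grows the break and then restores it; `grow_and_restore_break` is a hypothetical name, and real programs should normally leave the break to malloc().

```c
#define _DEFAULT_SOURCE
#include <stdint.h>
#include <unistd.h>

/* Grows the program break by `bytes`, then moves it back.
 * Returns how far the break moved while grown, or -1 on failure. */
long grow_and_restore_break(long bytes) {
    void *before = sbrk(0);                    /* current break */
    if (sbrk(bytes) == (void *)(intptr_t)-1)   /* sbrk reports failure as (void *)-1 */
        return -1;

    void *grown = sbrk(0);
    sbrk(-bytes);                              /* shrink from the top again */

    return (long)((uintptr_t)grown - (uintptr_t)before);
}
```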

mmap()

Maps files or devices into memory, or allocates anonymous memory.

#include <sys/mman.h>

// Anonymous memory allocation
void *ptr = mmap(NULL, 4096,
                 PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS,
                 -1, 0);

// Memory-mapped file
int fd = open("file.txt", O_RDONLY);
void *mapped = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);

Used by malloc() for large allocations:

  • Can allocate/free memory anywhere
  • Page-aligned allocations
  • glibc's malloc() switches to mmap() at 128 KiB by default (the M_MMAP_THRESHOLD tunable)
  • Can map files directly into memory
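
A detail that is easy to miss is that mmap() reports failure with MAP_FAILED, not NULL. The following sketch wraps an anonymous mapping with that check; `alloc_pages` and `free_pages` are hypothetical helper names.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Maps `len` bytes of zero-filled anonymous memory.
 * Returns NULL on failure. */
void *alloc_pages(size_t len) {
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;   /* failure is MAP_FAILED, not NULL */
}

/* Returns the mapping to the kernel; 0 on success, -1 on error. */
int free_pages(void *p, size_t len) {
    return munmap(p, len);
}
```

Unlike memory obtained by moving the break, each mapping can be released independently with munmap(), wherever it sits in the address space.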

Using strace

strace traces system calls made by a process. It's an invaluable debugging tool.

Basic Usage

# Trace a command
strace ls -l

# Trace an existing process
strace -p <PID>

# Trace only specific system calls (glibc's open() usually shows up as openat)
strace -e trace=openat,read,write ./program

# Show timing information
strace -T ./program

# Count system calls
strace -c ./program

# Follow child processes
strace -f ./program

Interpreting strace Output

$ strace cat /etc/hostname
execve("/usr/bin/cat", ["cat", "/etc/hostname"], ...) = 0
brk(NULL)                               = 0x55555556c000
openat(AT_FDCWD, "/etc/hostname", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=7, ...}) = 0
read(3, "myhost\n", 131072)             = 7
write(1, "myhost\n", 7)                 = 7
close(3)                                = 0
exit_group(0)                           = ?

Understanding the output:

  • System call name: e.g., openat, read, write
  • Arguments: shown in parentheses
  • Return value: shown after =
  • Error codes: -1 with errno (e.g., ENOENT)
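
To connect a trace like this back to source code, here is a rough sketch of the loop a cat-like program runs. It is not the actual coreutils implementation; `cat_file` is a hypothetical name. Under strace on a typical glibc system, the open() call would be expected to appear as openat.

```c
#include <fcntl.h>
#include <unistd.h>

/* Copies the file at `path` to `out_fd`.
 * Returns the number of bytes copied, or -1 on error. */
ssize_t cat_file(const char *path, int out_fd) {
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    char buf[4096];
    ssize_t total = 0;
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        ssize_t off = 0;
        while (off < n) {                  /* write() may be partial */
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w == -1) {
                close(fd);
                return -1;
            }
            off += w;
        }
        total += n;
    }

    close(fd);
    return (n == -1) ? -1 : total;
}
```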

Common strace Patterns

  • ENOENT: File not found (check paths)
  • EACCES: Permission denied
  • EAGAIN: Resource temporarily unavailable (retry)
  • EINTR: Interrupted by signal
  • Blocking calls: the line stops after the arguments, with no return value, until the call completes
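
Two of these errors are easy to reproduce on purpose, which helps when learning to read traces. A minimal sketch with hypothetical function names: one opens a path that is assumed not to exist, the other reads from an empty pipe in non-blocking mode.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns the errno from opening `path` read-only, or 0 if it opened. */
int errno_from_open(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return errno;
    close(fd);
    return 0;
}

/* Returns the errno from reading an empty non-blocking pipe,
 * or -1 if the pipe could not be set up. */
int errno_from_empty_nonblocking_read(void) {
    int p[2];
    if (pipe(p) == -1)
        return -1;

    int result = -1;
    int flags = fcntl(p[0], F_GETFL);
    if (flags != -1 && fcntl(p[0], F_SETFL, flags | O_NONBLOCK) != -1) {
        char c;
        errno = 0;
        ssize_t n = read(p[0], &c, 1);   /* nothing to read, would block */
        result = (n == -1) ? errno : 0;
    }

    close(p[0]);
    close(p[1]);
    return result;
}
```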

System Calls vs Library Functions

Understanding the difference is crucial:

System Calls

  • Direct kernel interface
  • Examples: open(), read(), write(), fork(), mmap()
  • Performance: Each call pays for a user-to-kernel mode switch
  • Behavior: Consistent across all programs
  • Errors: Typically return -1 and set errno

Library Functions

  • User space code (often in libc)
  • Examples: printf(), malloc(), fopen(), strlen()
  • May use system calls internally
  • Performance: No mode switch (unless they make system calls)
  • Behavior: Can be overridden or replaced

Example: printf() vs write()

// Library function - buffered, formatted
printf("Hello, World!\n");

// System call - unbuffered, raw
write(1, "Hello, World!\n", 14);

printf() internally:

  1. Formats the string
  2. Buffers output
  3. Eventually calls write() system call
  4. Provides convenience features

Process Control Block (PCB)

The kernel maintains a PCB for each process. In Linux this structure is struct task_struct.

PCB Contents:

  • Process identification:
    • PID (Process ID)
    • PPID (Parent PID)
    • UID/GID (User/Group IDs)
  • Process state:
    • Running, ready, blocked, zombie, etc.
  • CPU scheduling information:
    • Priority
    • Scheduling policy
    • CPU time used
  • Memory management:
    • Page tables
    • Memory limits
    • Heap and stack pointers
  • File system information:
    • Current working directory
    • Root directory
    • Open file descriptor table
  • Signal handling:
    • Signal handlers
    • Pending signals
    • Signal mask
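
User space cannot read task_struct directly, but on Linux much of this information is exported as text under /proc. The sketch below assumes /proc is mounted and parses single fields from /proc/self/status; `proc_status_field` is a hypothetical helper.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns the first integer value of field `name` (for example "Pid"
 * or "PPid") from /proc/self/status, or -1 if it is unavailable. */
long proc_status_field(const char *name) {
    FILE *f = fopen("/proc/self/status", "r");
    if (f == NULL)
        return -1;

    char line[256];
    long value = -1;
    size_t name_len = strlen(name);

    while (fgets(line, sizeof line, f) != NULL) {
        /* Lines look like "PPid:\t1234" */
        if (strncmp(line, name, name_len) == 0 && line[name_len] == ':') {
            value = strtol(line + name_len + 1, NULL, 10);
            break;
        }
    }

    fclose(f);
    return value;
}
```

`proc_status_field("Pid")` and `proc_status_field("PPid")` should agree with getpid() and getppid(); other fields in the same file cover state, UIDs, memory usage, and signal masks.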

Kernel vs User Space

Understanding the separation is fundamental:

Kernel Space

  • Privilege Level: Ring 0 (highest privilege)
  • Memory Access: Can access all memory
  • Functions:
    • Process scheduling
    • Memory management
    • Device drivers
    • File system operations
    • Network stack

User Space

  • Privilege Level: Ring 3 (restricted)
  • Memory Access: Only own process memory
  • Functions:
    • Application code
    • Libraries
    • User interfaces

Memory Layout (classic 32-bit x86 split; x86-64 keeps the same ordering within a far larger address space):

graph TD
    subgraph HighAddress["High Memory Address (0xFFFFFFFF)"]
        direction TB
        KernelSpace["Kernel Space<br/>~1GB<br/>Shared across all processes"]
    end

    subgraph UserSpace["User Space<br/>~3GB per process"]
        direction TB
        Stack["Stack<br/>Function calls, local variables<br/>Grows Downward"]
        Gap["Dynamic Gap<br/>Available memory"]
        Heap["Heap<br/>Dynamic allocations (malloc, new)<br/>Grows Upward"]
        Data["Data Segment<br/>Global & static variables<br/>(initialized & uninitialized)"]
        Text["Text Segment<br/>Program code (read-only)"]
    end

    subgraph LowAddress["Low Memory Address (0x00000000)"]
        Reserved["Reserved/Null"]
    end

    KernelSpace -.boundary.-> Stack
    Stack --> Gap
    Gap --> Heap
    Heap --> Data
    Data --> Text
    Text --> Reserved
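
The ordering in the diagram can be sampled from inside a running process by taking one address from each region. This is a sketch assuming a typical Linux toolchain; the names are hypothetical, and the absolute values change between runs when address space randomization is enabled.

```c
#include <stdint.h>
#include <stdlib.h>

struct layout {
    uintptr_t text;    /* address of a function */
    uintptr_t data;    /* address of an initialized global */
    uintptr_t heap;    /* address returned by malloc() */
    uintptr_t stack;   /* address of a local variable */
};

int initialized_global = 1;   /* lives in the data segment */

/* Collects one sample address from each region of the process. */
struct layout sample_layout(void) {
    int local = 0;
    void *heap_block = malloc(16);

    struct layout l;
    l.text  = (uintptr_t)&sample_layout;
    l.data  = (uintptr_t)&initialized_global;
    l.heap  = (uintptr_t)heap_block;
    l.stack = (uintptr_t)&local;

    free(heap_block);
    return l;
}
```

Printing the four fields with `%#lx` (after casting to `unsigned long`) usually shows them increasing from text to stack, matching the diagram from bottom to top.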

Further Reading

  • Man pages: man 2 syscalls, man 2 fork, man 2 open, etc.
  • /usr/include/asm/unistd_64.h - System call numbers
  • Kernel source: arch/x86/entry/entry_64.S - System call entry point

Practice Questions

  1. What is the difference between fork() and vfork()?
  2. Why does exec() not return on success?
  3. How would you implement a simple shell using fork() and exec()?
  4. What happens to file descriptors after fork()?
  5. What is the purpose of the O_CLOEXEC flag?
  6. How does strace work without modifying the target process?
  7. What is the difference between brk() and mmap()?