System Calls & Kernel Internals

Critical Topic

This is a critical topic for Google SRE interviews. Expect deep dive questions that test your understanding of the system call mechanism and kernel internals.

Overview

System calls are the fundamental interface between user space applications and the Linux kernel. They provide a controlled way for applications to request kernel services while maintaining system security and stability.

System Call Mechanism

User Space to Kernel Space Transition

When a user space application needs kernel services, it must transition from user space to kernel space. This transition is carefully controlled to maintain system security.

// User space code
int fd = open("/tmp/file.txt", O_RDONLY);

The transition process:

Application calls a library function (e.g., open() from libc)
Library function prepares the system call:
- Places system call number in a register (e.g., %rax on x86-64)
- Places arguments in specific registers
Special CPU instruction is executed (e.g., syscall on x86-64, svc on ARM)
CPU switches to kernel mode:
- Changes privilege level (Ring 3 → Ring 0)
- Switches to kernel stack
- Saves user space context
Kernel's system call handler executes:
- Validates arguments
- Performs requested operation
- Prepares return value
Return to user space:
- Restores user space context
- Switches back to user mode
- Returns control to application
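
The steps above can be sketched from user space with the generic `syscall()` entry point, which places the system call number and arguments in registers and executes the trap instruction for you. This is a minimal illustration assuming Linux with glibc; `raw_getpid` and `raw_write` are hypothetical helper names, not part of any library.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Invoke getpid by system call number, bypassing the dedicated libc
 * wrapper. The kernel's return value comes back as a long. */
pid_t raw_getpid(void) {
    return (pid_t)syscall(SYS_getpid);
}

/* Issue the write system call directly. On failure syscall() returns -1
 * and sets errno, just like the write() wrapper does. */
long raw_write(int fd, const void *buf, size_t len) {
    return syscall(SYS_write, fd, buf, len);
}
```

Calling `raw_getpid()` should give the same result as `getpid()`, which is one way to see that the libc function is only a thin wrapper around the kernel entry.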

graph TD
    A[User Application] -->|1. Call libc function| B[C Library glibc]
    B -->|2. Setup registers| C[System Call Interface]
    C -->|3. syscall instruction| D[Kernel Mode Switch]
    D -->|4. Execute| E[System Call Handler]
    E -->|5. Kernel Service| F[File System / Memory / etc.]
    F -->|6. Return| E
    E -->|7. Return to user mode| A

Why Direct Kernel Access is Prohibited

User space applications cannot directly access kernel space for several critical reasons:

Security: Prevents malicious code from compromising the system
Stability: Prevents buggy code from crashing the kernel
Isolation: Each process has its own memory space
Abstraction: Provides a stable API regardless of hardware changes

Privilege Separation

If user space could directly access kernel space, a single bug or malicious program could:

Crash the entire system
Access any process's memory
Bypass all security mechanisms
Corrupt kernel data structures
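
One way to observe this protection is to hand the kernel a pointer the process is not allowed to use. The sketch below assumes Linux; `write_from_bad_address` is a hypothetical helper, and the address chosen (the very top of the address space) is an assumption about what counts as inaccessible. Instead of crashing or leaking data, the kernel rejects the request with `EFAULT`.

```c
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Ask the kernel to write one byte from an address this process cannot
 * legitimately read. Returns the errno value reported by the kernel, or
 * 0 if the call unexpectedly succeeded. */
int write_from_bad_address(void) {
    int fds[2];
    if (pipe(fds) == -1)
        return errno;

    /* Assumed-invalid pointer: the top of the address space. The volatile
     * detour only keeps the compiler from reasoning about the constant. */
    volatile uintptr_t addr = UINTPTR_MAX;
    const void *bad = (const void *)addr;

    errno = 0;
    ssize_t n = write(fds[1], bad, 1);
    int err = (n == -1) ? errno : 0;

    close(fds[0]);
    close(fds[1]);
    return err;
}
```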

Common System Calls

Process Control

fork()

Creates a new process by duplicating the calling process.

#include <unistd.h>
#include <stdio.h>

int main() {
    pid_t pid = fork();

    if (pid == 0) {
        // Child process
        printf("Child PID: %d\n", getpid());
    } else if (pid > 0) {
        // Parent process
        printf("Parent PID: %d, Child PID: %d\n", getpid(), pid);
    } else {
        // Error
        perror("fork failed");
    }

    return 0;
}

What happens during fork():

Kernel creates a new Process Control Block (PCB)
Assigns new PID to child
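Duplicates the parent's address space mappings (pages are shared Copy-on-Write, not copied up front)
Copies file descriptor table (both tables refer to the same open files)
Child is marked runnable and placed on the scheduler's run queue
Both processes continue from the point where fork() returns (0 in the child, the child's PID in the parent)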

Copy-on-Write (COW)

Modern Linux uses COW to optimize fork(). The child initially shares the parent's memory pages. Pages are only copied when either process writes to them.
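
The page copy itself is invisible to a program, but its effect is easy to observe: a write in the child never shows up in the parent. The following is a small sketch of that isolation; `parent_value_after_child_write` is a hypothetical name chosen for this example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork, let the child overwrite a variable, and return the value the
 * parent still sees afterwards. Returns -1 if fork() or waitpid() fails
 * or the child does not exit cleanly. */
int parent_value_after_child_write(void) {
    volatile int value = 42;

    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        value = 99;                    /* child writes to its own copy */
        _exit(value == 99 ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    return value;                      /* parent's copy is untouched */
}
```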

exec() Family

Replaces the current process image with a new program.

#include <unistd.h>

int main() {
    char *args[] = {"/bin/ls", "-l", NULL};

    // Replace current process with ls
    execv("/bin/ls", args);

    // This line only executes if execv fails
    perror("execv failed");
    return 1;
}

exec() variants:

int execl(const char *path, const char *arg0, ... /*, (char *)NULL */);
int execle(const char *path, const char *arg0, ... /*, (char *)NULL, char *const envp[] */);
int execlp(const char *file, const char *arg0, ... /*, (char *)NULL */);
int execv(const char *path, char *const argv[]);
int execve(const char *path, char *const argv[], char *const envp[]);
int execvp(const char *file, char *const argv[]);
int execvpe(const char *file, char *const argv[], char *const envp[]);
int fexecve(int fd, char *const argv[], char *const envp[]);

The base of each is exec, followed by one or more letters:

e – Environment variables are passed as an array of pointers to null-terminated strings of form name=value. The final element of the array must be a null pointer.
l – Command-line arguments are passed as individual pointers to null-terminated strings. The last argument must be a null pointer.
p – Uses the PATH environment variable to find the file named in the file argument to be executed.
v – Command-line arguments are passed as an array of pointers to null-terminated strings. The final element of the array must be a null pointer.
f (prefix) – A file descriptor is passed instead. The file descriptor must be opened with O_RDONLY or O_PATH and the caller must have permission to execute its file.

In the functions that take no environment argument (execl(), execlp(), execv(), and execvp()), the new process image inherits the caller's current environment variables.

What exec() does:

Loads new program binary into memory
Replaces current process's code, data, and stack
Keeps same PID
Keeps open file descriptors (unless marked close-on-exec)
Resets caught signals to their default actions (ignored signals stay ignored)
On success, never returns to the original code
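
fork() and exec() are usually combined with waitpid() to run another program and collect its result, which is the core loop of a shell. The sketch below assumes `/bin/sh` exists; `run_command` is a hypothetical helper, and the exit code 127 for a failed exec is only a common convention.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run `sh -c <command>` in a child process and return its exit status,
 * or -1 if the child could not be created or did not exit normally. */
int run_command(const char *command) {
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* Child: replace this process image with the shell. */
        execl("/bin/sh", "sh", "-c", command, (char *)NULL);
        _exit(127);   /* reached only if execl() failed */
    }

    /* Parent: same PID as before, waits for the child to finish. */
    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, `run_command("exit 7")` returns 7, because the shell's exit status travels back through waitpid().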

File Operations

open()

Opens a file and returns a file descriptor.

#include <fcntl.h>
#include <unistd.h>

int fd = open("/tmp/file.txt", O_RDWR | O_CREAT, 0644);
if (fd == -1) {
    perror("open failed");
    return 1;
}

Flags:

O_RDONLY - Read only
O_WRONLY - Write only
O_RDWR - Read and write
O_CREAT - Create if doesn't exist
O_APPEND - Append mode
O_TRUNC - Truncate to zero length
O_NONBLOCK - Non-blocking mode

read() and write()

Transfer data between user space and kernel buffers.

char buffer[1024];
ssize_t bytes_read = read(fd, buffer, sizeof(buffer));
ssize_t bytes_written = write(fd, "Hello", 5);

System call flow for read():

Validate file descriptor
Check that the descriptor was opened for reading (file permissions were already checked at open() time)
Check if data is in page cache
If not in cache, schedule disk I/O
Block process until data available (unless O_NONBLOCK)
Copy data from kernel buffer to user space
Update file offset
Return number of bytes read
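
Because read() may return fewer bytes than requested or be interrupted by a signal, callers that need a full buffer typically loop. A minimal sketch of that pattern follows; `read_full` is a hypothetical helper name, not a libc function.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Read up to `count` bytes, retrying on EINTR and continuing after short
 * reads. Returns the number of bytes read (less than `count` only at end
 * of file), or -1 on error with errno set. */
ssize_t read_full(int fd, void *buf, size_t count) {
    char *p = buf;
    size_t total = 0;

    while (total < count) {
        ssize_t n = read(fd, p + total, count - total);
        if (n == 0)
            break;                 /* end of file */
        if (n == -1) {
            if (errno == EINTR)
                continue;          /* interrupted before any data: retry */
            return -1;
        }
        total += (size_t)n;        /* short read: keep going */
    }
    return (ssize_t)total;
}
```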

close()

Closes a file descriptor.

close(fd);

What close() does:

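Removes entry from process's file descriptor table
Decrements reference count in system-wide file table
If reference count reaches zero:
- Flushes buffers where the file system requires it (data is not guaranteed to be on disk; use fsync() for durability)
- Releases inode reference
- Frees resources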
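
The reference counting can be observed with dup(), which creates a second descriptor for the same open file. In this sketch (using a pipe for convenience; `duplicate_survives_close` is a hypothetical name), closing the original descriptor leaves the duplicate usable, and the reader sees end of file only after the last write-end reference is closed.

```c
#include <errno.h>
#include <unistd.h>

/* Returns 1 if a dup()'d descriptor keeps working after the original is
 * closed and the pipe reports EOF once the last copy is closed; returns
 * 0 if any expectation fails and -1 on setup failure. */
int duplicate_survives_close(void) {
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    int copy = dup(fds[1]);        /* second descriptor, same open file */
    if (copy == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    close(fds[1]);                 /* reference count drops from 2 to 1 */

    char byte = 'x';
    int still_open = (write(copy, &byte, 1) == 1);

    /* The closed descriptor number is no longer valid in this process. */
    int old_invalid = (write(fds[1], &byte, 1) == -1 && errno == EBADF);

    close(copy);                   /* last reference: write end released */

    char in = 0;
    int got_byte = (read(fds[0], &in, 1) == 1 && in == 'x');
    int got_eof = (read(fds[0], &in, 1) == 0);
    close(fds[0]);

    return still_open && old_invalid && got_byte && got_eof;
}
```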

Memory Operations

sbrk()

Changes the program's data segment size by moving the program break (used for heap management). On Linux, sbrk() is a C library wrapper around the brk() system call.

#include <unistd.h>

void *old_brk = sbrk(0);        // Get current break
void *new_ptr = sbrk(1024);      // Increase by 1024 bytes

Used by malloc() for small allocations:

Increases heap size linearly
Fast for small allocations
Cannot free memory in the middle
Heap can only shrink from the top
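
The linear growth can be checked by comparing the program break before and after a request. This is a sketch assuming Linux with glibc and a single-threaded caller; `grow_break` is a hypothetical helper. Mixing direct sbrk() calls with malloc() in real programs is generally best avoided.

```c
#define _DEFAULT_SOURCE
#include <stdint.h>
#include <unistd.h>

/* Grow the heap by `bytes` using sbrk() and report how far the program
 * break actually moved. Returns -1 if sbrk() fails. */
intptr_t grow_break(intptr_t bytes) {
    void *before = sbrk(0);              /* current break */
    if (sbrk(bytes) == (void *)-1)       /* sbrk signals failure with -1 */
        return -1;
    void *after = sbrk(0);               /* break after the request */
    return (intptr_t)((char *)after - (char *)before);
}
```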

mmap()

Maps files or devices into memory, or allocates anonymous memory.

#include <sys/mman.h>

// Anonymous memory allocation
void *ptr = mmap(NULL, 4096,
                 PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS,
                 -1, 0);

// Memory-mapped file
int fd = open("file.txt", O_RDONLY);
void *mapped = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);

Used by malloc() for large allocations:

Can allocate/free memory anywhere
Page-aligned allocations
Used by glibc's malloc() for allocations at or above a threshold, 128 KiB by default (M_MMAP_THRESHOLD)
Can map files directly into memory
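
A point that is easy to miss is that mmap() reports failure with MAP_FAILED rather than NULL, and that mappings must be released with munmap() using the same length. The sketch below shows a checked round trip on an anonymous mapping; `anonymous_mapping_roundtrip` is a hypothetical name, and `len` is assumed to be greater than zero.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Map `len` bytes of anonymous memory, use it, and unmap it.
 * Returns 0 on success and -1 on failure. */
int anonymous_mapping_roundtrip(size_t len) {
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)       /* failure is MAP_FAILED, not NULL */
        return -1;

    /* Anonymous mappings start out zero-filled. */
    int zeroed = (p[0] == 0 && p[len - 1] == 0);

    memset(p, 0xAB, len);      /* first touch faults the pages in */
    int written = ((unsigned char)p[len - 1] == 0xAB);

    if (munmap(p, len) == -1)  /* return the pages to the kernel */
        return -1;
    return (zeroed && written) ? 0 : -1;
}
```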

Using strace

strace traces system calls made by a process. It's an invaluable debugging tool.

Basic Usage

# Trace a command
strace ls -l

# Trace an existing process
strace -p <PID>

# Trace only specific system calls (modern glibc implements open() via openat)
strace -e trace=openat,read,write ./program

# Show timing information
strace -T ./program

# Count system calls
strace -c ./program

# Follow child processes
strace -f ./program

Interpreting strace Output

$ strace cat /etc/hostname
execve("/usr/bin/cat", ["cat", "/etc/hostname"], ...) = 0
brk(NULL)                               = 0x55555556c000
openat(AT_FDCWD, "/etc/hostname", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=7, ...}) = 0
read(3, "myhost\n", 131072)             = 7
write(1, "myhost\n", 7)                 = 7
close(3)                                = 0
exit_group(0)                           = ?

Understanding the output:

System call name: e.g., openat, read, write
Arguments: shown in parentheses
Return value: shown after =
Error codes: -1 with errno (e.g., ENOENT)

Common strace Patterns

ENOENT: File not found (check paths)
EACCES: Permission denied
EAGAIN: Resource temporarily unavailable (retry)
EINTR: Interrupted by signal
Blocking calls: The line stays incomplete, with no return value, until the call returns
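
The same error codes that strace prints are what a program sees in errno after a call returns -1. A small sketch of capturing it follows; `open_errno` is a hypothetical helper. Note that errno should be read immediately, since later library calls may overwrite it.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Try to open `path` read-only. Returns 0 on success, otherwise the
 * errno value the kernel reported (for example ENOENT or EACCES). */
int open_errno(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return errno;   /* capture before any other call can change it */
    close(fd);
    return 0;
}
```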

System Calls vs Library Functions

Understanding the difference is crucial:

System Calls

Direct kernel interface
Examples: open(), read(), write(), fork(), mmap()
Performance: User/kernel mode switch overhead on every call
Behavior: Consistent across all programs
Errors: Return -1 and set errno

Library Functions

User space code (often in libc)
Examples: printf(), malloc(), fopen(), strlen()
May use system calls internally
Performance: No mode switch (unless they call system calls)
Behavior: Can be overridden or replaced

Example: printf() vs write()

// Library function - buffered, formatted
printf("Hello, World!\n");

// System call - unbuffered, raw
write(1, "Hello, World!\n", 14);

printf() internally:

Formats the string
Buffers output
Eventually calls write() system call
Provides convenience features

Process Control Block (PCB)

The kernel maintains a PCB for each process; in Linux it is implemented as struct task_struct.

PCB Contents:

Process identification:
- PID (Process ID)
- PPID (Parent PID)
- UID/GID (User/Group IDs)
Process state:
- Running, ready, blocked, zombie, etc.
CPU scheduling information:
- Priority
- Scheduling policy
- CPU time used
Memory management:
- Page tables
- Memory limits
- Heap and stack pointers
File system information:
- Current working directory
- Root directory
- Open file descriptor table
Signal handling:
- Signal handlers
- Pending signals
- Signal mask
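
Much of this per-process information is exported by Linux through the /proc file system. As a rough sketch, assuming /proc is mounted, the helper below reads a numeric field such as `Pid` or `PPid` from /proc/self/status; `proc_status_field` is a hypothetical name.

```c
#include <stdio.h>
#include <string.h>

/* Look up a numeric field such as "Pid" or "PPid" in /proc/self/status.
 * Returns the first number on that line, or -1 if the file or field is
 * not available. */
long proc_status_field(const char *name) {
    FILE *f = fopen("/proc/self/status", "r");
    if (f == NULL)
        return -1;

    char line[256];
    long value = -1;
    size_t name_len = strlen(name);

    while (fgets(line, sizeof line, f) != NULL) {
        /* Lines have the form "Pid:\t1234". */
        if (strncmp(line, name, name_len) == 0 && line[name_len] == ':') {
            if (sscanf(line + name_len + 1, "%ld", &value) != 1)
                value = -1;
            break;
        }
    }
    fclose(f);
    return value;
}
```

`proc_status_field("Pid")` should match getpid(), and `proc_status_field("PPid")` should match getppid().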

Kernel vs User Space

Understanding the separation is fundamental:

Kernel Space

Privilege Level: Ring 0 (highest privilege)
Memory Access: Can access all memory
Functions:
- Process scheduling
- Memory management
- Device drivers
- File system operations
- Network stack

User Space

Privilege Level: Ring 3 (restricted)
Memory Access: Only own process memory
Functions:
- Application code
- Libraries
- User interfaces

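Memory Layout (classic 32-bit x86 layout with a 3 GB/1 GB user/kernel split; 64-bit systems follow the same ordering in a far larger address space):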

graph TD
    subgraph HighAddress["High Memory Address (0xFFFFFFFF)"]
        direction TB
        KernelSpace["Kernel Space<br/>~1GB<br/>Shared across all processes"]
    end

    subgraph UserSpace["User Space<br/>~3GB per process"]
        direction TB
        Stack["Stack<br/>Function calls, local variables<br/>Grows Downward"]
        Gap["Dynamic Gap<br/>Available memory"]
        Heap["Heap<br/>Dynamic allocations (malloc, new)<br/>Grows Upward"]
        Data["Data Segment<br/>Global & static variables<br/>(initialized & uninitialized)"]
        Text["Text Segment<br/>Program code (read-only)"]
    end

    subgraph LowAddress["Low Memory Address (0x00000000)"]
        Reserved["Reserved/Null"]
    end

    KernelSpace -.boundary.-> Stack
    Stack --> Gap
    Gap --> Heap
    Heap --> Data
    Data --> Text
    Text --> Reserved
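
The regions in the diagram can be sampled from a running program by taking the address of one object in each. This is only an illustrative sketch: `layout_sample` and `sample_layout` are hypothetical names, exact addresses vary between runs because of address space layout randomization, and the relative ordering is typical of Linux rather than guaranteed.

```c
#include <stdint.h>
#include <stdlib.h>

int initialized_global = 1;    /* lives in the data segment */

/* One sampled address from each region of the process image. */
struct layout_sample {
    uintptr_t text;    /* address of a function */
    uintptr_t data;    /* address of a global variable */
    uintptr_t heap;    /* address returned by malloc(), 0 on failure */
    uintptr_t stack;   /* address of a local variable */
};

struct layout_sample sample_layout(void) {
    int local = 0;
    void *block = malloc(16);

    struct layout_sample s;
    s.text = (uintptr_t)&sample_layout;
    s.data = (uintptr_t)&initialized_global;
    s.heap = (uintptr_t)block;
    s.stack = (uintptr_t)&local;

    free(block);   /* the address was already recorded as an integer */
    return s;
}
```

On a typical Linux system the stack address is the highest of the four and the text address is among the lowest, matching the diagram.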

Practice Questions

What is the difference between fork() and vfork()?
Why does exec() not return on success?
How would you implement a simple shell using fork() and exec()?
What happens to file descriptors after fork()?
What is the purpose of the O_CLOEXEC flag?
How does strace work without modifying the target process?
What is the difference between brk() and mmap()?