System Calls & Kernel Internals¶
Critical Topic
This is a critical topic for Google SRE interviews. Expect deep dive questions that test your understanding of the system call mechanism and kernel internals.
Overview¶
System calls are the fundamental interface between user space applications and the Linux kernel. They provide a controlled way for applications to request kernel services while maintaining system security and stability.
System Call Mechanism¶
User Space to Kernel Space Transition¶
When a user space application needs kernel services, it must transition from user space to kernel space. This transition is carefully controlled to maintain system security.
The transition process:
- Application calls a library function (e.g., open() from libc)
- Library function prepares the system call:
  - Places system call number in a register (e.g., %rax on x86-64)
  - Places arguments in specific registers
- Special CPU instruction is executed (e.g., syscall on x86-64, svc on ARM)
- CPU switches to kernel mode:
  - Changes privilege level (Ring 3 → Ring 0)
  - Switches to kernel stack
  - Saves user space context
- Kernel's system call handler executes:
  - Validates arguments
  - Performs requested operation
  - Prepares return value
- Return to user space:
  - Restores user space context
  - Switches back to user mode
  - Returns control to application
graph TD
A[User Application] -->|1. Call libc function| B[C Library glibc]
B -->|2. Setup registers| C[System Call Interface]
C -->|3. syscall instruction| D[Kernel Mode Switch]
D -->|4. Execute| E[System Call Handler]
E -->|5. Kernel Service| F[File System / Memory / etc.]
F -->|6. Return| E
E -->|7. Return to user mode| A
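To make the wrapper layer visible, the following minimal sketch bypasses the usual libc wrappers and invokes two system calls by number through glibc's generic syscall() function. It assumes Linux with glibc; the helper names raw_getpid and raw_write are hypothetical and exist only for this illustration.

```c
#include <sys/syscall.h>  /* SYS_getpid, SYS_write */
#include <sys/types.h>    /* pid_t */
#include <unistd.h>       /* syscall(), getpid(), pipe() */

/* Ask the kernel for our PID without going through the getpid() wrapper. */
pid_t raw_getpid(void)
{
    return (pid_t)syscall(SYS_getpid);
}

/*
 * Write a buffer to a descriptor using the write system call number.
 * Returns the number of bytes written, or -1 with errno set.
 */
long raw_write(int fd, const void *buf, size_t len)
{
    return syscall(SYS_write, fd, buf, len);
}
```

Calling raw_getpid() should give the same result as getpid(), since both end up in the same kernel handler; the only difference is who loads the system call number into the register.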
Why Direct Kernel Access is Prohibited¶
User space applications cannot directly access kernel space for several critical reasons:
- Security: Prevents malicious code from compromising the system
- Stability: Prevents buggy code from crashing the kernel
- Isolation: Each process has its own memory space
- Abstraction: Provides a stable API regardless of hardware changes
Privilege Separation
If user space could directly access kernel space, a single bug or malicious program could:
- Crash the entire system
- Access any process's memory
- Bypass all security mechanisms
- Corrupt kernel data structures
Common System Calls¶
Process Control¶
fork()¶
Creates a new process by duplicating the calling process.
#include <unistd.h>
#include <stdio.h>

int main() {
    pid_t pid = fork();
    if (pid == 0) {
        // Child process
        printf("Child PID: %d\n", getpid());
    } else if (pid > 0) {
        // Parent process
        printf("Parent PID: %d, Child PID: %d\n", getpid(), pid);
    } else {
        // Error
        perror("fork failed");
    }
    return 0;
}
What happens during fork():
- Kernel creates a new Process Control Block (PCB)
- Assigns new PID to child
- Copies parent's memory space (using Copy-on-Write)
- Copies file descriptor table
- Child is marked runnable and placed on a scheduler run queue
- Both processes continue from the point where fork() returns: 0 in the child, the child's PID in the parent
Copy-on-Write (COW)
Modern Linux uses COW to optimize fork(). The child initially shares the parent's memory pages. Pages are only copied when either process writes to them.
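The visible effect of Copy-on-Write is that a write in the child never shows up in the parent. The sketch below checks this; the function name parent_value_after_child_write is hypothetical, and the example assumes a POSIX system.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork, let the child overwrite a variable, and return the value the
 * parent still sees afterwards. Returns -1 if fork() or waitpid() fails
 * or the child exits abnormally.
 */
int parent_value_after_child_write(void)
{
    int value = 42;
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        value = 99;                    /* modifies only the child's copy */
        _exit(value == 99 ? 0 : 1);    /* _exit: skip the parent's stdio state */
    }

    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    return value;                      /* unchanged in the parent */
}
```

The parent also calls waitpid() so that the child does not linger as a zombie.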
exec() Family¶
Replaces the current process image with a new program.
#include <stdio.h>
#include <unistd.h>

int main() {
    char *args[] = {"/bin/ls", "-l", NULL};
    // Replace current process with ls
    execv("/bin/ls", args);
    // This line only executes if execv fails
    perror("execv failed");
    return 1;
}
exec() variants:
int execl(const char *path, const char *arg0, ... /*, (char *) NULL */);
int execle(const char *path, const char *arg0, ... /*, (char *) NULL, char *const envp[] */);
int execlp(const char *file, const char *arg0, ... /*, (char *) NULL */);
int execv(const char *path, char *const argv[]);
int execve(const char *path, char *const argv[], char *const envp[]);
int execvp(const char *file, char *const argv[]);
int execvpe(const char *file, char *const argv[], char *const envp[]);
int fexecve(int fd, char *const argv[], char *const envp[]);
The base of each is exec, followed by one or more letters:
- e – Environment variables are passed as an array of pointers to null-terminated strings of form name=value. The final element of the array must be a null pointer.
- l – Command-line arguments are passed as individual pointers to null-terminated strings. The last argument must be a null pointer.
- p – Uses the PATH environment variable to find the file named in the file argument to be executed.
- v – Command-line arguments are passed as an array of pointers to null-terminated strings. The final element of the array must be a null pointer.
- f (prefix) – A file descriptor is passed instead. The file descriptor must be opened with O_RDONLY or O_PATH and the caller must have permission to execute its file.
In the variants that take no environment argument (execl(), execlp(), execv(), and execvp()), the new process image inherits the caller's environment variables. On Linux, only execve() (and execveat()) are actual system calls; the other variants are libc wrappers built on top of them.
What exec() does:
- Loads new program binary into memory
- Replaces current process's code, data, and stack
- Keeps same PID
- Keeps open file descriptors (unless marked close-on-exec)
- Resets caught signals to their default disposition (ignored signals stay ignored)
- On success, never returns to the original code
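Because exec() replaces the calling process, programs that want to keep running usually combine it with fork() and waitpid(). The sketch below shows that pattern, which is also the core of a simple shell; run_command is a hypothetical helper name, and 127 is used as the "could not execute" status by analogy with common shell behavior.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run argv[0] (looked up via PATH) in a child process and return its exit
 * status, or -1 if the child could not be created or did not exit normally.
 */
int run_command(char *const argv[])
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);        /* only reached if execvp() failed */
    }

    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, passing the argument vector {"sh", "-c", "exit 3", NULL} should return 3, assuming sh is on PATH.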
File Operations¶
open()¶
Opens a file and returns a file descriptor.
#include <fcntl.h>
#include <unistd.h>

int fd = open("/tmp/file.txt", O_RDWR | O_CREAT, 0644);
if (fd == -1) {
    perror("open failed");
    return 1;
}
Flags:
- O_RDONLY - Read only
- O_WRONLY - Write only
- O_RDWR - Read and write
- O_CREAT - Create if doesn't exist
- O_APPEND - Append mode
- O_TRUNC - Truncate to zero length
- O_NONBLOCK - Non-blocking mode
read() and write()¶
Transfer data between user space and kernel buffers.
char buffer[1024];
ssize_t bytes_read = read(fd, buffer, sizeof(buffer));
ssize_t bytes_written = write(fd, "Hello", 5);
System call flow for read():
- Validate file descriptor
- Check that the descriptor was opened for reading (permission checks on the file itself happen at open() time)
- Check if data is in page cache
- If not in cache, schedule disk I/O
- Block process until data available (unless O_NONBLOCK)
- Copy data from kernel buffer to user space
- Update file offset
- Return number of bytes read
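Because read() may return fewer bytes than requested, or fail with EINTR when a signal arrives, callers often wrap it in a retry loop. The following is a minimal sketch of such a wrapper; read_full is a hypothetical name, not a standard function.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Read up to count bytes, retrying on EINTR and after short reads.
 * Returns the number of bytes read (less than count only at end of file),
 * or -1 on error with errno set by read().
 */
ssize_t read_full(int fd, void *buf, size_t count)
{
    size_t total = 0;

    while (total < count) {
        ssize_t n = read(fd, (char *)buf + total, count - total);
        if (n == -1) {
            if (errno == EINTR)
                continue;      /* interrupted before any data: retry */
            return -1;
        }
        if (n == 0)
            break;             /* end of file */
        total += (size_t)n;
    }
    return (ssize_t)total;
}
```

This loop is meant for blocking descriptors; with O_NONBLOCK set, an EAGAIN error would be returned to the caller as -1.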
close()¶
Closes a file descriptor.
What close() does:
- Removes entry from process's file descriptor table
- Decrements reference count in system-wide file table
- If reference count reaches zero:
  - Runs the file's release handler (dirty data is not guaranteed to be on disk yet; call fsync() when that matters)
  - Releases inode reference
  - Frees resources
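The reference counting can be observed with a pipe: a reader only sees end of file once every descriptor referring to the write end has been closed. The sketch below assumes POSIX pipe() and dup(); the function name eof_after_last_close is hypothetical.

```c
#include <unistd.h>

/*
 * Returns 1 if the write end of a pipe stays usable after one of two
 * descriptors for it is closed, and the reader sees EOF only after the
 * last one is closed. Returns 0 otherwise, -1 on setup failure.
 */
int eof_after_last_close(void)
{
    int fds[2];
    char c;

    if (pipe(fds) == -1)
        return -1;

    int copy = dup(fds[1]);      /* second descriptor, same open file */
    if (copy == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    close(fds[1]);               /* count 2 -> 1: write end still open */
    int still_writable = (write(copy, "x", 1) == 1);

    close(copy);                 /* count 1 -> 0: write end released */
    int got_data = (read(fds[0], &c, 1) == 1);
    int got_eof = (read(fds[0], &c, 1) == 0);

    close(fds[0]);
    return still_writable && got_data && got_eof;
}
```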
Memory Operations¶
sbrk()¶
Changes the program's data segment size by moving the program break (used for heap management). On Linux, sbrk() is a libc wrapper around the brk() system call.
#include <unistd.h>
void *old_brk = sbrk(0); // Get current break
void *new_ptr = sbrk(1024); // Increase by 1024 bytes
Used by malloc() for small allocations:
- Increases heap size linearly
- Fast for small allocations
- Cannot free memory in the middle
- Heap can only shrink from the top
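The linear behavior of the program break can be checked directly. This is a demonstration sketch only, assuming Linux with glibc: break_growth is a hypothetical helper, and real programs should not move the break themselves while also using malloc().

```c
#include <stdint.h>
#include <unistd.h>

/*
 * Grow the program break by `increment` bytes, measure how far it moved,
 * then shrink it back. Returns the observed growth, or -1 on failure.
 */
long break_growth(intptr_t increment)
{
    char *before = sbrk(0);            /* current break */
    if (sbrk(increment) == (void *)-1)
        return -1;                     /* kernel refused to grow the heap */

    char *after = sbrk(0);             /* break after growing */
    sbrk(-increment);                  /* give the memory back from the top */
    return (long)(after - before);
}
```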
mmap()¶
Maps files or devices into memory, or allocates anonymous memory.
#include <fcntl.h>
#include <sys/mman.h>

// Anonymous memory allocation
void *ptr = mmap(NULL, 4096,
                 PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS,
                 -1, 0);

// Memory-mapped file (mmap returns MAP_FAILED, not NULL, on error)
int fd = open("file.txt", O_RDONLY);
void *mapped = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
Used by malloc() for large allocations:
- Can allocate/free memory anywhere
- Page-aligned allocations
- glibc malloc() switches to mmap() for requests above M_MMAP_THRESHOLD (128 KiB by default)
- Can map files directly into memory
Using strace¶
strace traces system calls made by a process. It's an invaluable debugging tool.
Basic Usage¶
# Trace a command
strace ls -l
# Trace an existing process
strace -p <PID>
# Trace only specific system calls
strace -e trace=openat,read,write ./program
# Show timing information
strace -T ./program
# Count system calls
strace -c ./program
# Follow child processes
strace -f ./program
Interpreting strace Output¶
$ strace cat /etc/hostname
execve("/usr/bin/cat", ["cat", "/etc/hostname"], ...) = 0
brk(NULL) = 0x55555556c000
openat(AT_FDCWD, "/etc/hostname", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=7, ...}) = 0
read(3, "myhost\n", 131072) = 7
write(1, "myhost\n", 7) = 7
close(3) = 0
exit_group(0) = ?
Understanding the output:
- System call name: e.g., openat, read, write
- Arguments: shown in parentheses
- Return value: shown after =
- Error codes: -1 with errno (e.g., ENOENT)
Common strace Patterns
- ENOENT: File not found (check paths)
- EACCES: Permission denied
- EAGAIN: Resource temporarily unavailable (retry)
- EINTR: Interrupted by signal
- Blocking calls: No immediate return value shown
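The errno names that strace prints are the same values a program sees after a failed call. As a small sketch (open_error is a hypothetical helper), the function below reports which error a failed open() produced:

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Try to open `path` read-only.
 * Returns 0 on success, or the errno value (ENOENT, EACCES, ...) on failure.
 */
int open_error(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return errno;      /* read errno right after the failing call */

    close(fd);
    return 0;
}
```

Under strace, the failing case would appear as an openat(...) line ending in = -1 ENOENT followed by the error description.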
System Calls vs Library Functions¶
Understanding the difference is crucial:
System Calls¶
- Direct kernel interface
- Examples: open(), read(), write(), fork(), mmap()
- Performance: Every call pays for a user-to-kernel mode switch
- Behavior: Consistent across all programs
- Errors: Return -1 and set errno
Library Functions¶
- User space code (often in libc)
- Examples: printf(), malloc(), fopen(), strlen()
- May use system calls internally
- Performance: No mode switch (unless they call system calls)
- Behavior: Can be overridden or replaced
Example: printf() vs write()
// Library function - buffered, formatted
printf("Hello, World!\n");
// System call - unbuffered, raw
write(1, "Hello, World!\n", 14);
printf() internally:
- Formats the string
- Buffers output
- Eventually calls write() system call
- Provides convenience features
Process Control Block (PCB)¶
The kernel maintains a PCB (also called task_struct in Linux) for each process.
PCB Contents:
- Process identification:
  - PID (Process ID)
  - PPID (Parent PID)
  - UID/GID (User/Group IDs)
- Process state:
  - Running, ready, blocked, zombie, etc.
- CPU scheduling information:
  - Priority
  - Scheduling policy
  - CPU time used
- Memory management:
  - Page tables
  - Memory limits
  - Heap and stack pointers
- File system information:
  - Current working directory
  - Root directory
  - Open file descriptor table
- Signal handling:
  - Signal handlers
  - Pending signals
  - Signal mask
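On Linux, several of these fields are exposed to user space through /proc/<pid>/status. The sketch below reads integer fields such as Pid and PPid for the current process; it assumes procfs is mounted at /proc, and proc_status_field is a hypothetical helper name.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>   /* getpid(), getppid() for cross-checking */

/*
 * Return the first integer value of the field `name` (e.g. "Pid", "PPid")
 * from /proc/self/status, or -1 if the file or field is not available.
 */
long proc_status_field(const char *name)
{
    FILE *f = fopen("/proc/self/status", "r");
    if (f == NULL)
        return -1;

    char line[256];
    long value = -1;
    size_t name_len = strlen(name);

    while (fgets(line, sizeof line, f) != NULL) {
        /* Lines look like "Pid:\t1234"; match the name up to the colon. */
        if (strncmp(line, name, name_len) == 0 && line[name_len] == ':') {
            value = strtol(line + name_len + 1, NULL, 10);
            break;
        }
    }

    fclose(f);
    return value;
}
```

The values returned for "Pid" and "PPid" should match getpid() and getppid(), since all of them are derived from the same kernel data structure.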
Kernel vs User Space¶
Understanding the separation is fundamental:
Kernel Space¶
- Privilege Level: Ring 0 (highest privilege)
- Memory Access: Can access all memory
- Functions:
- Process scheduling
- Memory management
- Device drivers
- File system operations
- Network stack
User Space¶
- Privilege Level: Ring 3 (restricted)
- Memory Access: Only own process memory
- Functions:
- Application code
- Libraries
- User interfaces
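Memory Layout (classic 32-bit x86 with a 3 GB/1 GB split; on x86-64 with 4-level paging, user space and kernel space each get 128 TiB of virtual address space):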
graph TD
subgraph HighAddress["High Memory Address (0xFFFFFFFF)"]
direction TB
KernelSpace["Kernel Space<br/>~1GB<br/>Shared across all processes"]
end
subgraph UserSpace["User Space<br/>~3GB per process"]
direction TB
Stack["Stack<br/>Function calls, local variables<br/>Grows Downward"]
Gap["Dynamic Gap<br/>Available memory"]
Heap["Heap<br/>Dynamic allocations (malloc, new)<br/>Grows Upward"]
Data["Data Segment<br/>Global & static variables<br/>(initialized & uninitialized)"]
Text["Text Segment<br/>Program code (read-only)"]
end
subgraph LowAddress["Low Memory Address (0x00000000)"]
Reserved["Reserved/Null"]
end
KernelSpace -.boundary.-> Stack
Stack --> Gap
Gap --> Heap
Heap --> Data
Data --> Text
Text --> Reserved
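A rough way to see this layout in a live process is to sample one address from each region. The sketch below is illustrative only: struct layout and sample_layout are hypothetical names, and with address space randomization the exact values and gaps differ on every run, so only the coarse ordering (stack above heap, data, and text) should be expected.

```c
#include <stdint.h>
#include <stdlib.h>

/* One sample address from each region of the process address space. */
struct layout {
    uintptr_t text;    /* address of a function */
    uintptr_t data;    /* address of an initialized global */
    uintptr_t heap;    /* address returned by malloc() */
    uintptr_t stack;   /* address of a local variable */
};

static int initialized_global = 1;   /* lives in the data segment */

/* Fill `out` with sample addresses. Returns 0 on success, -1 on failure. */
int sample_layout(struct layout *out)
{
    int local = 0;                   /* lives on the stack */
    void *heap_block = malloc(16);   /* small request: served from the heap */
    if (heap_block == NULL)
        return -1;

    out->text = (uintptr_t)&sample_layout;
    out->data = (uintptr_t)&initialized_global;
    out->heap = (uintptr_t)heap_block;
    out->stack = (uintptr_t)&local;

    free(heap_block);
    return 0;
}
```

Printing the four fields with %p-style formatting and comparing them against /proc/self/maps is a useful exercise for connecting the diagram to a real process.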
Further Reading¶
- Man pages: man 2 syscalls, man 2 fork, man 2 open, etc.
- /usr/include/asm/unistd_64.h - System call numbers
- Kernel source: arch/x86/entry/entry_64.S - System call entry point
Practice Questions¶
- What is the difference between fork() and vfork()?
- Why does exec() not return on success?
- How would you implement a simple shell using fork() and exec()?
- What happens to file descriptors after fork()?
- What is the purpose of the O_CLOEXEC flag?
- How does strace work without modifying the target process?
- What is the difference between brk() and mmap()?