File System¶
Overview¶
The Linux file system is a hierarchical structure that organizes data on storage devices. Understanding file system internals is crucial for SRE interviews, particularly the relationship between inodes, dentries, and file descriptors.
Inode (Index Node)¶
The inode is the fundamental data structure in Unix-like file systems. It contains metadata about a file or directory.
Common Misconception
An inode does NOT store the filename! Filenames are stored in directory entries (dentries).
Inode Structure¶
An inode contains:
- File type: Regular file, directory, symbolic link, device file, etc.
- Permissions: Read, write, execute for owner, group, others
- Owner information: UID (User ID) and GID (Group ID)
- File size: Size in bytes
- Timestamps:
  - atime: Last access time
  - mtime: Last modification time (content)
  - ctime: Last status change time (metadata such as permissions, owner, or link count; also updated when content changes)
- Link count: Number of hard links pointing to this inode
- Data block pointers: Locations of actual file data
- File metadata: Block size, number of blocks
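The three timestamps are easy to confuse, so a small experiment can help. This is a minimal sketch, assuming GNU coreutils (`stat -c`, `touch -d`) and a scratch file from `mktemp`; the variable names are illustrative only. It backdates mtime, makes a metadata-only change, and compares the results.

```shell
# Scratch file for the experiment.
f=$(mktemp)

# Backdate mtime (and atime) to a fixed point in the past.
TZ=UTC touch -d '2020-01-01 00:00:00' "$f"
mtime_before=$(stat -c %Y "$f")   # %Y = mtime in seconds since the epoch

# Metadata-only change: updates ctime, leaves mtime alone.
chmod 640 "$f"
mtime_after=$(stat -c %Y "$f")
ctime_after=$(stat -c %Z "$f")    # %Z = ctime in seconds since the epoch

echo "mtime before chmod: $mtime_before"
echo "mtime after chmod:  $mtime_after"
echo "ctime after chmod:  $ctime_after"

rm -f "$f"
```

Note that ctime cannot be set from user space the way mtime can, which is why it ends up at the current time here.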
Inode Numbers¶
Each inode has a unique number within its file system.
# View inode numbers
ls -i
# Find files by inode number (-xdev stays on one file system, where the number is unique)
find / -xdev -inum <inode_number>
# Show detailed inode information
stat filename
Example output:
$ stat /etc/hostname
File: /etc/hostname
Size: 9 Blocks: 8 IO Block: 4096 regular file
Device: 803h/2051d Inode: 131075 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2025-01-15 10:30:45.123456789 -0500
Modify: 2025-01-10 08:15:20.987654321 -0500
Change: 2025-01-10 08:15:20.987654321 -0500
Birth: -
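When only a few inode fields matter, `stat` can print them in a custom format. The following is a sketch assuming GNU `stat` and GNU `find`; `inode_summary` is a hypothetical helper name, and the scratch file stands in for a real path.

```shell
# Print selected inode fields for a path (GNU stat format sequences).
inode_summary() {
    stat -c 'inode=%i links=%h size=%s mode=%a owner=%U:%G type=%F' "$1"
}

tmp=$(mktemp)
printf 'hello\n' > "$tmp"
inode_summary "$tmp"

# Map the inode number back to a name by searching the file's directory.
find "$(dirname "$tmp")" -maxdepth 1 -xdev -inum "$(stat -c %i "$tmp")"

rm -f "$tmp"
```

The `find` step has to scan directory entries because the inode itself does not record any filename.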
Data Block Pointers¶
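Classic Unix file systems such as ext2 and ext3 use a multi-level pointer scheme in the inode to support files of various sizes (ext4 still understands this layout but uses extent trees by default):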
Inode
├── Direct pointers (12) → Data blocks (48KB for 4KB blocks)
├── Single indirect pointer → Block of pointers → Data blocks
├── Double indirect pointer → Block → Block → Data blocks
└── Triple indirect pointer → Block → Block → Block → Data blocks
Advantages:
- Small files are accessed quickly (direct pointers)
- Large files are supported (indirect pointers)
- Efficient space usage
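The capacity each pointer level adds can be worked out with simple arithmetic. This sketch assumes 4 KiB blocks and 4-byte block pointers (as in classic ext2/ext3); neither the pointer size nor the variable names come from the text above.

```shell
# Assumptions: 4 KiB blocks, 4-byte block pointers.
block_size=4096
ptr_size=4
ptrs_per_block=$(( block_size / ptr_size ))   # 1024 pointers fit in one block

direct_bytes=$(( 12 * block_size ))
single_bytes=$(( ptrs_per_block * block_size ))
double_bytes=$(( ptrs_per_block * ptrs_per_block * block_size ))
triple_bytes=$(( ptrs_per_block * ptrs_per_block * ptrs_per_block * block_size ))

echo "direct:          $(( direct_bytes / 1024 )) KiB"                       # 48 KiB
echo "single indirect: $(( single_bytes / 1024 / 1024 )) MiB"                # 4 MiB
echo "double indirect: $(( double_bytes / 1024 / 1024 / 1024 )) GiB"         # 4 GiB
echo "triple indirect: $(( triple_bytes / 1024 / 1024 / 1024 / 1024 )) TiB"  # 4 TiB
```

These are the amounts addressable per level under the stated assumptions; real file size limits can be lower because of other on-disk field sizes.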
Relationship: Inodes, Dentries, and File Descriptors¶
Understanding how these three concepts relate is crucial:
graph LR
A[Filename in Directory] -->|dentry| B[Inode]
B -->|file table entry| C[File Descriptor]
C -->|used by| D[Process]
B -->|points to| E[Data Blocks]
The Complete Picture¶
- Directory Entry (dentry):
  - Maps filename → inode number
  - Stored in parent directory's data blocks
  - Cached by kernel for performance
- Inode:
  - Contains file metadata
  - Points to data blocks
  - One inode can have multiple dentries (hard links)
- File Table Entry:
  - System-wide table
  - Tracks file offset, access mode, reference count
  - Created when file is opened
- File Descriptor:
  - Per-process integer (0, 1, 2, ...)
  - Index into process's file descriptor table
  - Points to file table entry
Example Flow¶
What happens when a process opens /home/user/file.txt read-only:
- Kernel traverses directory path:
  - / → inode for root
  - home → dentry lookup → inode
  - user → dentry lookup → inode
  - file.txt → dentry lookup → inode
- Kernel creates file table entry:
  - References the inode
  - Sets file offset to 0
  - Sets access mode (read-only)
- Kernel allocates file descriptor:
  - Finds lowest available integer
  - Adds entry to process's FD table
  - Points to file table entry
- Returns file descriptor to process
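The split between file descriptors and file table entries becomes visible when two descriptors share one entry. This is a minimal sketch using shell redirections; the descriptor numbers 7, 8, and 9 are chosen arbitrarily for the example (the shell lets you pick the number, whereas `open()` returns the lowest free one).

```shell
f=$(mktemp)
printf 'line1\nline2\nline3\n' > "$f"

exec 7< "$f"            # open(): new file table entry, offset 0
read -r first <&7       # reading advances the offset stored in that entry
exec 8<&7               # dup: FD 8 refers to the SAME file table entry
read -r via_dup <&8     # continues where FD 7 left off
exec 9< "$f"            # second open(): separate entry with its own offset
read -r via_open <&9    # starts again at the beginning

echo "fd 7: $first, fd 8 (dup): $via_dup, fd 9 (new open): $via_open"
# prints: fd 7: line1, fd 8 (dup): line2, fd 9 (new open): line1

exec 7<&- 8<&- 9<&-     # close all three descriptors
rm -f "$f"
```

All three descriptors lead to the same inode, but only the duplicated pair shares a file offset.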
Superblock¶
The superblock contains metadata about the entire file system.
Superblock Contents¶
- File system type: ext4, xfs, btrfs, etc.
- File system size: Total blocks
- Block size: Typically 4096 bytes
- Free blocks: Available space
- Free inodes: Available inodes
- Mount information: Mount point, mount flags
- Last mount time: When file system was last mounted
- Last write time: Last modification
- Mount count: Number of times mounted
- Magic number: File system identifier
# View superblock information (ext2/3/4 only; -h prints just the superblock)
sudo dumpe2fs -h /dev/sda1
# Show file system information
df -T
stat -f /
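Several of the superblock fields listed above can be read without root through `stat -f`. A short sketch, assuming GNU `stat`; the `fs_summary` variable name is illustrative.

```shell
# %T = file system type, %S = fundamental block size,
# %b / %f = total / free blocks, %c / %d = total / free inodes.
fs_summary=$(stat -f -c 'type=%T block_size=%S blocks=%b free_blocks=%f inodes=%c free_inodes=%d' /)
echo "$fs_summary"
```

Some file systems report 0 for the inode counts because they allocate inodes dynamically.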
Superblock Corruption
If the primary superblock is corrupted, the file system usually cannot be mounted. ext2/3/4 file systems keep backup superblocks for recovery, which e2fsck can use via its -b option.
fsck (File System Check)¶
fsck checks and repairs file system inconsistencies.
When fsck Runs¶
- During boot: If file system wasn't cleanly unmounted
- After crashes: Power failure, kernel panic
- Scheduled: After X mounts or Y days (configurable)
- Manually: When administrator suspects problems
What fsck Checks¶
- Inode validity: Proper structure and ranges
- Block allocation: No double allocation, orphaned blocks
- Directory structure: Valid entries, no cycles
- Link counts: Correct number of hard links
- Free block/inode counts: Matches superblock
- Bad blocks: Marks unusable blocks
# Check file system (unmounted!)
sudo fsck /dev/sda1
# Force check even if clean
sudo fsck -f /dev/sda1
# Automatically repair
sudo fsck -y /dev/sda1
# Schedule a check at boot
sudo tune2fs -c 1 /dev/sda1 # Maximum mount count of 1: check after every mount
Running fsck
Never run fsck on a mounted file system! This can cause severe data corruption. Always unmount first or use read-only mode.
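One way to guard against this mistake is to check `/proc/mounts` before invoking fsck. The sketch below is an assumption-laden illustration: `is_mounted` and `safe_fsck` are hypothetical helpers, the wrapper only echoes the command it would run, and a device can appear in `/proc/mounts` under another name (for example a `/dev/mapper` path), which this simple comparison would miss.

```shell
# Return 0 if the given device appears as a mount source in /proc/mounts.
is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' /proc/mounts
}

# Hypothetical wrapper: refuse to check a mounted device.
safe_fsck() {
    if is_mounted "$1"; then
        echo "refusing: $1 is mounted" >&2
        return 1
    fi
    echo "would run: fsck -n $1"   # dry run only; no fsck is executed here
}

safe_fsck /dev/sda1 || echo "unmount it first"
```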
Journaling¶
Journaling improves file system reliability and recovery time after crashes.
How Journaling Works¶
Traditional file systems (non-journaled):
- Update inode
- Update directory entry
- Update data blocks
- Crash occurs → inconsistent state
Journaled file systems:
- Write operation to journal (log)
- Mark as committed in journal
- Perform actual file system update
- Mark journal entry as complete
- If crash occurs: Replay journal on next mount
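The log-then-apply idea can be sketched with a toy journal kept in a text file. This is only an illustration of the principle, not how ext4's journal works internally; the record format (`BEGIN`/`COMMIT`), file names, and function names are all invented for the example.

```shell
# Toy write-ahead journal in a scratch directory.
workdir=$(mktemp -d)
journal="$workdir/journal.log"
data="$workdir/data.txt"
: > "$journal"

# Record the intended update, then mark it committed.
journal_begin()  { printf 'BEGIN %s %s\n' "$1" "$2" >> "$journal"; }
journal_commit() { printf 'COMMIT %s\n' "$1" >> "$journal"; }

# Recovery: apply committed updates in order, ignore uncommitted ones.
journal_replay() {
    pending_id='' pending_value=''
    while read -r op id value; do
        if [ "$op" = BEGIN ]; then
            pending_id=$id
            pending_value=$value
        elif [ "$op" = COMMIT ] && [ "$id" = "$pending_id" ]; then
            printf '%s\n' "$pending_value" > "$data"
        fi
    done < "$journal"
}

journal_begin 1 "version-1"
journal_commit 1
journal_begin 2 "version-2"   # simulated crash: logged but never committed
journal_replay
cat "$data"                   # prints version-1
```

The uncommitted second update is discarded on replay, so the data file never ends up half-updated.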
Journal Modes (ext4)¶
1. Journal (Slowest, Safest)
- Both metadata AND data written to journal first
- Guarantees data consistency
- Performance impact
2. Ordered (Default)
- Only metadata written to journal
- Data written to disk before metadata committed
- Good balance of safety and performance
3. Writeback (Fastest, Least Safe)
- Only metadata journaled
- Data and metadata can be written in any order
- Risk of stale data after crash
Benefits of Journaling¶
- Fast recovery: Replay journal instead of full fsck
- Consistency: Metadata can be brought back to a valid state by replaying the journal
- No lost updates: Committed operations survive crashes
Virtual File System (VFS)¶
The VFS is an abstraction layer that provides a unified interface to different file system types.
VFS Architecture¶
graph TD
Apps["👥 User Space Applications<br/>text editors, databases, shells, etc."]
SysCalls["🔧 System Call Interface<br/>open(), read(), write(), close()<br/>stat(), mkdir(), unlink()"]
VFS["🗂️ Virtual File System (VFS)<br/>Unified interface layer<br/>File operations abstraction"]
EXT4["📁 ext4<br/>Traditional Linux FS<br/>Journaling"]
XFS["📁 XFS<br/>High-performance<br/>Large files"]
BTRFS["📁 btrfs<br/>Copy-on-write<br/>Snapshots"]
NFS["🌐 NFS<br/>Network File System<br/>Remote storage"]
NTFS["📁 NTFS<br/>Windows FS<br/>Cross-platform"]
TmpFS["⚡ tmpfs<br/>Memory-based<br/>Temporary storage"]
BlockLayer["💽 Block Device Layer<br/>I/O scheduling & buffering"]
Drivers["🔌 Device Drivers<br/>HDD, SSD, NVMe, Network"]
Hardware["⚙️ Physical Hardware<br/>Storage devices"]
Apps -->|"userland → kernel"| SysCalls
SysCalls -->|"standardized API"| VFS
VFS -->|"filesystem-specific operations"| EXT4
VFS --> XFS
VFS --> BTRFS
VFS -->|"network protocol"| NFS
VFS --> NTFS
VFS --> TmpFS
EXT4 --> BlockLayer
XFS --> BlockLayer
BTRFS --> BlockLayer
NTFS --> BlockLayer
TmpFS -.->|"no disk I/O"| BlockLayer
NFS -.->|"network I/O"| Drivers
BlockLayer --> Drivers
Drivers --> Hardware
VFS Components¶
1. Superblock Operations
- Mount/unmount file system
- Sync file system
- Get file system statistics
2. Inode Operations
- Create/delete files
- Lookup dentries
- Set permissions
- Create links
3. File Operations
- Open/close
- Read/write
- Seek
- Memory map
4. Dentry Operations
- Compare filenames
- Hash filenames
- Delete dentries
Benefits of VFS¶
- Unified interface: Same API for all file systems
- Easy addition: New file systems plug into VFS
- Abstraction: User space doesn't know file system type
- Caching: Common dentry and inode cache
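The set of file system types plugged into the VFS on a running system can be inspected from user space. A brief sketch, assuming a Linux system with `/proc` mounted and GNU `stat`:

```shell
# File system types currently registered with the kernel; "nodev" marks
# those that are not backed by a block device (proc, tmpfs, ...).
cat /proc/filesystems

# File system type behind a given path.
proc_type=$(stat -f -c %T /proc)
echo "/proc is served by: $proc_type"
```

The same `cat` and `stat` commands work on every path regardless of which driver sits underneath, which is the unified interface in action.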
/proc File System¶
/proc is a pseudo or virtual file system that provides an interface to kernel data structures.
Characteristics¶
- Not on disk: Files are generated on-the-fly
- Process information: Each PID has a directory
- Kernel parameters: Readable and writable
- System information: CPU, memory, interrupts
Important /proc Entries¶
# Process information
/proc/<PID>/cmdline # Command line arguments
/proc/<PID>/environ # Environment variables
/proc/<PID>/fd/ # Open file descriptors
/proc/<PID>/maps # Memory mappings
/proc/<PID>/status # Process status
/proc/<PID>/stat # Process statistics
# System information
/proc/cpuinfo # CPU information
/proc/meminfo # Memory information
/proc/version # Kernel version
/proc/uptime # System uptime
/proc/loadavg # Load averages
# Kernel parameters (sysctl)
/proc/sys/ # Tunable kernel parameters
Example: Examining a process
# What is the process running? (arguments are NUL-separated)
tr '\0' ' ' < /proc/1234/cmdline
# What files does it have open?
ls -l /proc/1234/fd/
# What is its memory usage? (VmSize = virtual size, VmRSS = resident set)
grep -E 'VmSize|VmRSS' /proc/1234/status
# What libraries is it using?
cat /proc/1234/maps
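Because each entry in `/proc/<PID>/fd/` is a symbolic link, a script can resolve them to list what a process has open. A minimal sketch; `list_open_files` is a hypothetical helper, and it is run against the current shell (`$$`) so that it works without a target PID.

```shell
# Print "fd -> target" for every open descriptor of the given PID.
list_open_files() {
    for fd in /proc/"$1"/fd/*; do
        printf '%s -> %s\n' "${fd##*/}" "$(readlink "$fd")"
    done
}

list_open_files "$$"
```

Targets such as `socket:[12345]` or `pipe:[67890]` are not paths; they identify kernel objects by inode number.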
File I/O at Kernel Level¶
Read Operation Flow¶
Kernel operations for a read() on an open file descriptor:
- Validate file descriptor: Check process FD table
- Get file table entry: Follow pointer from FD
- Get inode: From file table entry
- Check access mode: Was the file opened for reading? (permission bits were checked at open())
- Check page cache: Is data already in memory?
  - Cache hit: Copy from page cache to user buffer
  - Cache miss: Schedule disk I/O
- Read from disk (if needed):
  - Submit I/O request to block layer
  - Block process (O_NONBLOCK has no effect on regular files)
  - Wait for disk operation
  - Copy data to page cache
- Copy to user space: From kernel buffer to user buffer
- Update file offset: Advance position
- Return: Number of bytes read
Write Operation Flow¶
Kernel operations for a write() on an open file descriptor:
- Validate and check permissions
- Copy from user space: Buffer to kernel space
- Update page cache: Write data to cache
- Mark pages dirty: Need to be flushed to disk
- Return once the data is in the page cache: Unless O_SYNC or O_DSYNC is set
- Background writeback: Kernel flusher threads (pdflush on kernels before 2.6.32)
  - Periodically flush dirty pages
  - On memory pressure
  - On explicit sync()
Buffering and Caching¶
Page Cache Benefits:
- Reduces disk I/O
- Improves performance dramatically
- Shared between processes
- Uses available RAM
Flushing Mechanisms:
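# Explicit flush
sync # Flush all dirty buffers system-wide
fsync(fd) # Flush one file's data and metadata (in code)
fdatasync(fd) # Flush data plus only the metadata needed to read it back
# View dirty pages
grep Dirty /proc/meminfo
# Tune writeback behavior
/proc/sys/vm/dirty_ratio # % of available memory dirty before writers must flush themselves
/proc/sys/vm/dirty_background_ratio # % of available memory dirty before background writeback starts
/proc/sys/vm/dirty_writeback_centisecs # Interval between flusher wakeups, in 1/100 s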
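The following sketch ties these pieces together: it writes a file through the page cache, forces it to disk, and prints the current writeback state. It assumes GNU `dd` (for `conv=fsync` and the `1M` suffix) and a readable `/proc/sys/vm`; the 4 MiB size is arbitrary.

```shell
f=$(mktemp)

# Buffered write: data lands in the page cache first. conv=fsync makes dd
# call fsync() before it exits, so the data is on stable storage afterwards.
dd if=/dev/zero of="$f" bs=1M count=4 conv=fsync 2>/dev/null

# Dirty = waiting to be written back, Writeback = being written right now.
grep -E '^(Dirty|Writeback):' /proc/meminfo

# Current values of the writeback tunables.
for name in dirty_ratio dirty_background_ratio dirty_writeback_centisecs; do
    printf '%s = %s\n' "$name" "$(cat "/proc/sys/vm/$name")"
done

written=$(stat -c %s "$f")
echo "wrote $written bytes"
rm -f "$f"
```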
Hard Links vs Soft Links¶
Hard Links¶
A hard link is an additional directory entry for an existing inode.
Characteristics:
- Same inode: Both entries reference the same inode
- Link count: Increases inode's link count
- Deletion: File data persists until the last link is removed and no process still has the file open
- Same file system: Cannot cross file system boundaries
- No directories: Cannot hard link directories (prevents cycles)
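These properties can be checked in a scratch directory. A minimal sketch, assuming GNU `stat`; the file names are placeholders.

```shell
dir=$(mktemp -d)
echo "payload" > "$dir/file.txt"

# Create a second name for the same inode.
ln "$dir/file.txt" "$dir/hardlink.txt"

ino_a=$(stat -c %i "$dir/file.txt")
ino_b=$(stat -c %i "$dir/hardlink.txt")
links=$(stat -c %h "$dir/file.txt")
echo "inodes: $ino_a $ino_b, link count: $links"

# Removing one name only decrements the link count; the data survives.
rm "$dir/file.txt"
survivor=$(cat "$dir/hardlink.txt")
links_after=$(stat -c %h "$dir/hardlink.txt")
echo "after rm: '$survivor', link count: $links_after"

rm -r "$dir"
```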
Symbolic Links (Soft Links)¶
A symbolic link is a special file that contains a path to another file.
# Create symbolic link
ln -s /path/to/file.txt symlink.txt
# Has its own inode
ls -i file.txt symlink.txt
Characteristics:
- Different inode: Symlink has its own inode
- Contains path: Stores target path as data
- Can break: Target file can be deleted
- Cross file systems: Can link across different file systems
- Can link directories: Directories can be symlinked
- Resolution: Kernel resolves path when accessed
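A short experiment shows that a symlink stores only a path and can outlive its target. This sketch assumes GNU `stat` (which does not follow symlinks unless given `-L`); the file names are placeholders.

```shell
dir=$(mktemp -d)
echo "payload" > "$dir/file.txt"
ln -s "$dir/file.txt" "$dir/symlink.txt"

target=$(readlink "$dir/symlink.txt")        # the stored path
link_size=$(stat -c %s "$dir/symlink.txt")   # size of the link itself = length of that path

# Delete the target: the link remains but no longer resolves.
rm "$dir/file.txt"
if [ -L "$dir/symlink.txt" ] && [ ! -e "$dir/symlink.txt" ]; then
    state="dangling"
else
    state="ok"
fi

echo "target=$target size=$link_size state=$state"
rm -r "$dir"
```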
Comparison¶
| Feature | Hard Link | Symbolic Link |
|---|---|---|
| Inode | Same as target | Own inode |
| Size | Same as target | Size of path string |
| Cross file systems | No | Yes |
| Link to directory | No | Yes |
| Breaks if target deleted | No | Yes |
| Performance | Slightly faster | Requires path resolution |
File Permissions and Access Control¶
Permission Bits¶
- rwx r-x r-- 1 user group 1024 Jan 15 10:30 file.txt
File type: - (regular)
Owner: rwx (read, write, execute) = 4+2+1 = 7
Group: r-x (read, execute) = 4+1 = 5
Other: r-- (read) = 4
Octal mode: 754 (ls prints the string without the spaces, as -rwxr-xr--)
File types:
- -: Regular file
- d: Directory
- l: Symbolic link
- c: Character device
- b: Block device
- s: Socket
- p: Named pipe (FIFO)
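The octal and symbolic notations can be compared directly on a scratch file. A minimal sketch, assuming GNU `stat` for the `%a` and `%A` format sequences:

```shell
f=$(mktemp)

chmod 754 "$f"
octal=$(stat -c %a "$f")        # 754
symbolic=$(stat -c %A "$f")     # -rwxr-xr--
echo "$octal $symbolic"

# Symbolic mode changes: drop execute for owner and group.
chmod u-x,g-x "$f"
octal2=$(stat -c %a "$f")       # 644
symbolic2=$(stat -c %A "$f")    # -rw-r--r--
echo "$octal2 $symbolic2"

rm -f "$f"
```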
Special Permissions¶
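Setuid (4000)
- Execute with the file owner's privileges, not the caller's
- Example: /usr/bin/passwd
Setgid (2000)
- Execute with the file's group
- On directories: new files inherit the directory's group
Sticky Bit (1000)
- On directories: only a file's owner, the directory's owner, or root can delete or rename files
- Example: /tmp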
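The special bits show up in the execute positions of the symbolic mode string. This sketch sets each one on scratch entries and prints the result, assuming GNU `stat`; the names `shared`, `team`, and `tool` are placeholders, and the setuid bit on an empty file has no practical effect here.

```shell
dir=$(mktemp -d)
mkdir "$dir/shared" "$dir/team"
touch "$dir/tool"

chmod 1777 "$dir/shared"   # sticky bit, as on /tmp
chmod 2775 "$dir/team"     # setgid directory
chmod 4755 "$dir/tool"     # setuid file

sticky=$(stat -c %A "$dir/shared")   # "t" replaces the others' "x"
setgid=$(stat -c %A "$dir/team")     # "s" replaces the group's "x"
setuid=$(stat -c %A "$dir/tool")     # "s" replaces the owner's "x"
echo "$sticky  $setgid  $setuid"

rm -r "$dir"
```

An uppercase `S` or `T` in the same position would mean the special bit is set without the underlying execute bit.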
Access Control Lists (ACL)¶
ACLs provide more fine-grained permissions:
# Set ACL
setfacl -m u:john:rw file.txt
# View ACL
getfacl file.txt
# Remove ACL
setfacl -x u:john file.txt
Practice Questions¶
- What happens to a file's data when all hard links are removed?
- Why can't you create hard links to directories?
- What is the purpose of the sticky bit on directories?
- How does journaling prevent data corruption?
- What happens if you delete a file that's still open by a process?
- Explain the difference between atime, mtime, and ctime.
- How does the page cache improve file I/O performance?
- What is the maximum file size on ext4?
Further Reading¶
- man 7 inode
- man 2 stat
- man 8 fsck
- Kernel source: fs/ directory
- "The Linux Programming Interface", Chapters 14-18