17.3. Checking and Fixing Filesystem Errors
17.3.1. How do filesystems sustain damage?
When a machine loses power without being properly shut down,
data can be lost. On a multi-user machine, data loss is almost assured.
The reason for this is that, for efficiency's sake, not every change to a
file is written directly to the file itself. Instead, the changes are
recorded in memory buffers that temporarily hold part of the file's
contents. That way small changes can be collected in the buffer and written
to disk all at once, sidestepping the extreme slowness of disk I/O. The act
of writing a buffer's contents to disk is called flushing or
syncing. When a machine goes down without syncing, the changes
in memory are lost, and on-disk structures that were only partly
updated can be left inconsistent.
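
The buffering described above can be seen from a program's point of view. The following is a minimal sketch, not a description of how any particular kernel works: the helper name write_durably and the file name example.dat are made up for illustration. It writes data, pushes the program's own buffer to the kernel, and then asks the kernel to flush its buffers to disk.

```python
import os
import tempfile

def write_durably(path, data):
    """Write data to path and request that it reach the disk before returning.

    write_durably is a hypothetical helper name used only in this sketch.
    """
    with open(path, "wb") as f:
        f.write(data)          # the bytes land in a user-space buffer first
        f.flush()              # hand the user-space buffer to the kernel
        os.fsync(f.fileno())   # ask the kernel to write its buffers to disk

# example.dat is an illustrative file name in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "example.dat")
write_durably(path, b"small change\n")
```

Without the flush and sync steps, the write would still normally reach the disk eventually, but a power loss in the meantime could discard it.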
17.3.2. What is fsck?
Fortunately, it is at least possible to deduce which files were
open when the machine went down and restore the filesystem to a
consistent state. The command that does this work is called fsck
(short for "filesystem check").
Since the Unix operating systems each support multiple
filesystem types, fsck is usually a front-end
for different filesystem-specific programs. Because
filesystems vary in their features, those back-end
programs will vary in their options. In operating systems where
this is the case, fsck can take different filesystem-specific options
to pass to those back ends.
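
The front-end arrangement can be sketched as follows. This is an illustration under stated assumptions, not the code of any real fsck: it assumes the Linux-style convention in which the back end for a filesystem type is a program named fsck.<type>, and the function name backend_command and the device /dev/sda2 are hypothetical. The -f shown is e2fsck's option to force a check; other back ends may not accept it.

```python
def backend_command(fstype, device, backend_options=()):
    """Build the argument list a front end might run for one filesystem.

    Assumes back ends are named fsck.<fstype>; this is a convention on
    Linux systems and is not guaranteed on every Unix.
    """
    if not fstype or "/" in fstype:
        raise ValueError("invalid filesystem type: %r" % (fstype,))
    # Filesystem-specific options are passed through to the back end untouched.
    return ["fsck." + fstype, *backend_options, device]

# /dev/sda2 is an illustrative device name.
print(backend_command("ext4", "/dev/sda2", ["-f"]))
```

The sketch only builds the command; it does not run it, so it is safe to try on a live system.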
In addition, fsck syntax varies among Unix implementations.
However, the semantics are fairly consistent. The most useful
options to fsck allow you to
- check several filesystems in parallel, which is faster than
checking them one after another. This mode is generally used
for the initial check and repair of all the filesystems
on a machine after it loses power.
- repair a filesystem by hand, which is necessary when fsck finds
problems that it cannot correct automatically. Repairing involves saving
the contents of unlinked inodes in a top-level directory of the
filesystem called lost+found/. This action implies
a loss of data, which is the reason fsck will not do it automatically.
Normally this interactive mode and the parallel mode described above
cannot be used in conjunction.
- repair a filesystem automatically, which is equivalent to repairing
it by hand and answering "yes" to every question about whether to place
an inode's contents in lost+found/. This option
is usually mutually exclusive with the parallel-checking option.
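
The lost+found/ step in the list above can be illustrated with a toy model. This is only a sketch: a real filesystem checker works on on-disk structures, whereas here directories are plain dictionaries mapping names to inode numbers, and all names and numbers are invented. Naming each recovered file after its inode number is a common convention that is assumed here.

```python
def find_orphans(allocated_inodes, directories):
    """Return allocated inode numbers that no directory entry points to."""
    referenced = set()
    for entries in directories.values():
        referenced.update(entries.values())
    return sorted(set(allocated_inodes) - referenced)

def reconnect_orphans(allocated_inodes, directories):
    """Link every orphaned inode into lost+found under a name built from its number."""
    orphans = find_orphans(allocated_inodes, directories)
    lost_found = directories.setdefault("lost+found", {})
    for ino in orphans:
        lost_found["#%d" % ino] = ino   # the original file name is not recoverable
    return directories

# Toy data: inode 13 is in use but appears in no directory.
allocated = [11, 12, 13]
dirs = {"/": {"notes.txt": 11}, "/home": {"todo": 12}}

print(find_orphans(allocated, dirs))
reconnect_orphans(allocated, dirs)
print(dirs["lost+found"])
```

The model shows why data is lost even when the contents are saved: the recovered file keeps its contents but not its original name or location.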
17.3.3. Running fsck
fsck performs low-level operations on a disk, treating it as
a raw device rather than a filesystem
(i.e. a character device rather than a
block device).
For that reason, it is necessary to unmount a
filesystem before checking it. Otherwise ordinary
filesystem activity would cause the memory buffers to drift out of
sync with the disk, recreating the very problem that
required you to run fsck in the first place.
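
A script that wraps fsck might first confirm that the filesystem is not mounted. The sketch below assumes a Linux-style mount table, such as the contents of /proc/mounts, in which the first field of each line is the mounted device; other systems expose this information differently. The function name is_mounted and the sample table are hypothetical.

```python
def is_mounted(device, mounts_text):
    """Return True if device appears as the source of any mount in mounts_text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == device:
            return True
    return False

# Sample mount table in the assumed format: device, mount point, type, options.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw,relatime 0 0
/dev/sda3 /home ext4 rw,relatime 0 0
proc /proc proc rw 0 0
"""

print(is_mounted("/dev/sda3", SAMPLE_MOUNTS))   # True
print(is_mounted("/dev/sdb1", SAMPLE_MOUNTS))   # False
```

On a Linux machine the same function could be given the text read from /proc/mounts instead of the sample table.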
The root (/) filesystem cannot be unmounted, since it contains the
fsck executable itself. This is one reason to keep
root filesystems small: a small filesystem is less likely to sustain damage
and need to be repaired. If the root filesystem does sustain damage,
it must be repaired either by booting the system in single-user
mode to ensure that there will be no filesystem activity other than
the work of fsck itself, or by using
emergency boot media with a root image and another copy
of the fsck executable, and then unmounting the root filesystem.
The first method will fail if the fsck executable itself
is damaged, so the second method is preferable.