About
This book serves as a reference and guide for the fidx tool.
Introduction
fidx (file indexer) is a tool for indexing file archives.
The primary goal of this tool is to help maintain a list of hashes for files in backup archives. In addition, it can tag files and deduplicate files based on their hashes.
Known limitations
- Only supports Unicode-compliant pathnames (non-compliant file system entries are skipped).
- Deduplication is currently unimplemented, though the tool can report duplicates.
- It is not yet known whether deduplication will be supported on Windows.
Installation
To install fidx from crates.io, simply run:
$ cargo install fidx
To install from source, clone the repository, open a work directory, and run cargo install --path . in it:
$ fossil clone --workdir fidx https://repos.qrnch.tech/pub/fidx fidx.fossil
$ cd fidx
$ cargo install --path .
Initialization
The first step in using fidx on a directory tree is to initialize a state
database. This is done by running fidx init in the directory tree's base
directory.
If the tool will be used to maintain checksums on an external disk which is
mounted on /Volumes/backup_archive, then run:
$ cd /Volumes/backup_archive
$ fidx init
All the entries in the database will be relative to the directory where the database was initialized.
Hashing
To avoid masking bit rot caused by degrading storage media, the checksum database must only be updated when the file archive has been modified deliberately. fidx accomplishes this by never recalculating the hash of a file whose modification timestamp in the filesystem is unchanged [1].
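The rule above can be illustrated with a small shell sketch. It is not fidx's actual implementation; file_mtime and needs_rehash are hypothetical helper names, and the recorded timestamp is assumed to be stored as seconds since the epoch.

```shell
# Illustrative only: fidx's internal logic may differ.

# Print a file's modification time as seconds since the epoch.
# Tries the GNU stat syntax first, then falls back to the BSD/macOS form.
file_mtime() {
    stat -c %Y "$1" 2>/dev/null || stat -f %m "$1"
}

# Succeed (exit status 0) only if the file's current mtime differs from the
# recorded one, i.e. the file looks deliberately modified and should be
# rehashed. An unchanged mtime means the stored hash is kept, so silent
# corruption shows up later as a verification mismatch.
needs_rehash() {
    [ "$(file_mtime "$1")" != "$2" ]
}
```

A caller would pass the path and the timestamp recorded at the previous update, for example needs_rehash archive.tar 1700000000.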
Ignore lists
fidx supports ignore lists which can be used to ignore filesystem entries by
glob expressions. There's no way to manage ignore entries using the fidx
command line tool. Instead, users must manipulate the database directly using
their favorite SQLite database editor, such as the standard sqlite3 command
line tool.
The default ignore list for a newly initialized fidx database is:
$ sqlite3 .findex
sqlite> .mode column
sqlite> .header on
sqlite> SELECT * FROM ignore;
id  path
--  -------------
1   .findex
2   .findex-shm
3   .findex-wal
4   .findexer.log
5   **/.*.swp
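As a minimal sketch of adding an entry by hand, the helper below inserts a glob pattern into the ignore table shown above. The function name add_ignore is hypothetical. The sketch assumes that the table has only the id and path columns listed in the output and that SQLite assigns id automatically; inspect the schema with .schema ignore before relying on it.

```shell
# Hypothetical helper, not part of fidx: append a glob pattern to the ignore
# table of a fidx state database.
# Assumes the columns are (id, path) and that id is filled in automatically.
add_ignore() {
    ignore_db=$1
    ignore_pattern=$2
    # Double any single quotes so the pattern is a valid SQL string literal.
    ignore_escaped=$(printf '%s' "$ignore_pattern" | sed "s/'/''/g")
    sqlite3 "$ignore_db" "INSERT INTO ignore (path) VALUES ('$ignore_escaped');"
}
```

With the function defined, an illustrative pattern modeled on the default swap-file entry could be added like this (quote the pattern so the shell does not expand it):

$ add_ignore .findex '**/*.tmp'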
Updating
To update the hash database, run the update subcommand:
$ fidx update
This must be run from the base directory (i.e. the directory where fidx was initialized and where the state database resides).
Verify
To verify the integrity of the archive against the stored database, run the
verify subcommand:
$ fidx verify
The verify subcommand verifies all files from the current subdirectory within
the managed tree against the current state of the database. This means that
verify will not detect files newly added to the file system; it works on the
premise that the user has run an update since the latest changes in the tree.
To verify the entire archive, the verify subcommand must be run from the base
directory (the directory where the fidx database resides); otherwise it only
verifies the subtree rooted at the current directory.
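Since it is easy to run verify from the wrong directory, a small wrapper can guard against that. The following is only a sketch: verify_all is a hypothetical function name, and the check assumes the state database is the .findex file that appears in the default ignore list.

```shell
# Hypothetical wrapper, not part of fidx: verify a whole archive regardless
# of the caller's current directory.
verify_all() {
    verify_base=$1
    # Assumption: the state database is the file named .findex in the base.
    if [ ! -f "$verify_base/.findex" ]; then
        echo "no fidx database found in $verify_base" >&2
        return 1
    fi
    # Run in a subshell so the caller's working directory is left untouched.
    (cd "$verify_base" && fidx verify)
}
```

For the external disk used in the earlier example, this would be invoked as verify_all /Volumes/backup_archive.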
An optional argument can be passed to the verify subcommand to specify a
subset to verify. If a tree is managed under /backups, which contains a
subdirectory foo, which in turn contains a subdirectory bar (i.e.
/backups/foo/bar), the following would only verify files under
/backups/foo/bar:
$ cd /backups/foo
$ fidx verify bar
[1] In other words: fidx uses a file's modification time to detect intended updates. Don't do anything creative or abnormal with mtimes that breaks this important assumption.
Tagging
The tagging system in fidx is simple (and limited), but it has one particular quirk which can cause some confusion: tags are internally associated with hashes, not file entries. This is based on the assumption that tags describe the contents of a file, which has two benefits:
- If a file is renamed [1] (without changing its contents) it will not lose its tags.
- If several files share the same contents, only one will need to be tagged but all of the files will gain the tag(s).
Tags are stored in the fidx tree database, and thus are local to the tree.
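The effect of keying tags by hash can be shown with a toy model. This is not fidx's real schema: the two text tables, their layout, and the function name tags_for_path are assumptions made purely for illustration, and paths containing whitespace are not handled.

```shell
# Toy model, not fidx's real database layout:
#   files table: one "<hash> <path>" pair per line
#   tags table:  one "<hash> <tag>" pair per line

# Print the tags attached to the contents of the given path.
tags_for_path() {
    files_table=$1
    tags_table=$2
    wanted_path=$3
    # Step 1: resolve the path to its content hash.
    # Appending "" forces string comparison, so hex-looking hashes are never
    # compared as numbers.
    content_hash=$(awk -v p="$wanted_path" '($2 "") == (p "") { print $1; exit }' "$files_table")
    [ -n "$content_hash" ] || return 1
    # Step 2: list every tag recorded for that hash.
    awk -v h="$content_hash" '($1 "") == (h "") { print $2 }' "$tags_table"
}
```

In this model, renaming a file only changes a row in the files table, so the lookup still reaches the same tags, and two paths with the same hash report identical tags.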
Managing tags
In order to tag files, there need to be tags in the database. To add tags, use
the add-tag subcommand (more than one tag can be added at a time):
$ fidx add-tag some-tag another
(Each tag must be unique.)
Once there are tags in the database, use the list-tags subcommand to list
them:
$ fidx list-tags
2 another
1 some-tag
A tag can be renamed using the rename-tag subcommand. The following would
rename a tag called from-name to to-name:
$ fidx rename-tag from-name to-name
(Only one tag can be renamed at a time.)
Tags can be removed using the remove-tag subcommand:
$ fidx remove-tag some-tag another
Tagging/Untagging files
Note: It is important to only manage tagging/untagging in a tree which has not
been modified since the last database update, since tags are associated with
hashes. Always run an update before managing tag associations in a tree.
To attach one or more tags to a file, use the tag subcommand, which by default
takes the name of the file to tag as its first argument and tag names as the
remaining arguments:
$ fidx tag fossil-repos-2020.tar scm fossil source-archives
This behavior can be changed by adding a "--" argument, which causes all
arguments before the "--" to be interpreted as a list of files, while
arguments after the "--" are interpreted as a list of tags:
$ fidx tag fossil-repos-2020.tar fossil-repos-2021.tar -- scm fossil
A side-effect of this design is that fidx tagging does not work well with files
named "--" (but naming a file that is kind of asking for trouble, so wish
granted).
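The argument-splitting rule described above can be sketched in shell. This mirrors the documented behavior only; it is not taken from fidx's source, and split_tag_args is a hypothetical name. How fidx treats a second "--" is not documented, so the sketch simply passes it through as a tag.

```shell
# Illustrative only: classify the arguments of a tag invocation the way the
# text describes, printing one "file: NAME" or "tag: NAME" line per argument.
split_tag_args() {
    # First pass: is there a "--" separator anywhere?
    has_sep=no
    for arg in "$@"; do
        if [ "$arg" = "--" ]; then has_sep=yes; fi
    done

    kind=file
    for arg in "$@"; do
        if [ "$arg" = "--" ] && [ "$kind" = file ]; then
            # The first "--" switches from files to tags and is itself dropped,
            # which is why a file literally named "--" cannot be expressed.
            kind=tag
            continue
        fi
        printf '%s: %s\n' "$kind" "$arg"
        # Without a separator, only the first argument is a file.
        if [ "$has_sep" = no ]; then kind=tag; fi
    done
}
```

Calling split_tag_args a.tar b.tar -- scm fossil classifies a.tar and b.tar as files and scm and fossil as tags, while split_tag_args a.tar scm fossil classifies only a.tar as a file.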
If a file name or tag argument begins with an "@" character, and the string
immediately following the "@" is the name of an existing file, that file is
treated as a list of files or tags, depending on where the argument appears.
Assuming files.list exists and is a list of files to tag, and tags.list
exists and is a list of tags to apply to the files in files.list, use:
$ fidx tag @files.list @tags.list
The @-list feature can be combined with "--":
$ fidx tag some-file.tar @files.list another-file.tar -- some-tag @tags.list another-tag
Just as the special "--" argument implies some weird semantics for files
called simply "--", the @<file/tag> syntax introduces some weirdness
surrounding file and tag names with a leading "@", so such names are
discouraged.
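The "@" rule can be sketched the same way. The function name expand_at_arg is hypothetical, and the sketch assumes that a list file holds one entry per line, which the text above does not specify.

```shell
# Illustrative only: expand a single argument according to the "@" rule.
# Assumption: list files contain one file name or tag per line.
expand_at_arg() {
    case $1 in
        @?*)
            list_file=${1#@}
            if [ -f "$list_file" ]; then
                # The argument names an existing file: emit its entries.
                cat "$list_file"
                return 0
            fi
            ;;
    esac
    # Not an "@" argument, or no such file: the argument is taken literally.
    printf '%s\n' "$1"
}
```

The fall-through case is the source of the weirdness: an argument such as @notes is a literal name only as long as no file called notes exists in the current directory.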
To inspect which tags a file is associated with, use the tags subcommand:
$ fidx tags fossil-repos-2020.tar
list tags for fossil-repos-2020.tar
2 fossil
6 scm
7 source-archives
Removing tags from a file can be done using the untag subcommand, which has
the same format as tag: the first argument is the file name and the following
arguments are the tags to disassociate from that file:
$ fidx untag fossil-repos-2020.tar scm
Search
To search for files based on tags, use the search subcommand:
$ fidx search some-tag another
The search is very basic and only supports implicit boolean AND searches,
meaning that only files that carry all of the given tags are returned.
To run a command on each entry in the output of a search on Unix-like
platforms, one can use xargs. Note, however, that by default it treats spaces
as argument separators, which causes issues if files or paths contain spaces.
The workaround unfortunately does not look the same on macOS and Linux.
The following examples will remove any files from the filesystem that are tagged with the tag "delete-me".
macOS (use tr to convert newlines to NUL, and tell xargs, using -0, to
separate entries only by NUL):
$ fidx search delete-me | tr '\n' '\0' | xargs -0 rm
Linux (tell xargs to use only \n as a delimiter; by default the delimiters
include spaces):
$ fidx search delete-me | xargs -d '\n' rm
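A variant that behaves the same on macOS and Linux is to read the search output line by line in the shell instead of using xargs. The sketch below assumes that search prints one path per line (as the examples above imply); each_line is a hypothetical helper name. Like the xargs variants, it cannot handle file names that contain newlines.

```shell
# Hypothetical helper: run the given command once per input line, passing the
# line as the final argument. Spaces within a line are preserved.
each_line() {
    # The second test keeps a final line that lacks a trailing newline.
    while IFS= read -r entry || [ -n "$entry" ]; do
        # Redirect stdin so the command cannot swallow the remaining lines.
        "$@" "$entry" </dev/null
    done
}
```

With the function defined in the current shell, the removal example becomes:

$ fidx search delete-me | each_line rm --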
[1] fidx has no concept of a file/directory rename; it treats a rename as a deletion followed by a new file. The deletion would cause the tag association to disappear if tags were associated with file entries rather than with their contents.
Deduplication
Caution: The deduplication feature in fidx is built around its database of hashes, which means that it's important not to perform deduplication unless the database is known to be up to date with the current state of the file system tree.
Listing duplicates
To list all known duplicates, use the dups subcommand:
$ fidx dups
This list will not include entries which have already been deduplicated.
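Because the duplicate report is derived from the stored hashes, it can be convenient to refresh the database immediately before listing. The wrapper below is only a sketch: dups_after_update is a hypothetical name, and it assumes that fidx update exits with a non-zero status on failure, which is conventional but not confirmed here.

```shell
# Hypothetical wrapper, not part of fidx: refresh the hash database in the
# given base directory, then list duplicates only if the update succeeded.
# Assumption: "fidx update" returns a non-zero exit status on failure.
dups_after_update() {
    # Subshell keeps the caller's working directory unchanged.
    (cd "$1" && fidx update && fidx dups)
}
```

For example, dups_after_update /Volumes/backup_archive would update and then report duplicates for the archive initialized earlier.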
Deduplicating
To perform a deduplication, use the dedup subcommand:
$ fidx dedup