About

This book serves as a reference and guide for the fidx tool.

Introduction

fidx (file indexer) is a tool for indexing file archives.

The primary goal of this tool is to assist in maintaining a list of hashes for files in backup archives. In addition, it can tag files and deduplicate files based on their hashes.

Known limitations

  • Only supports Unicode-compliant pathnames (non-compliant file system entries are skipped).
  • Deduplication is currently unimplemented, though the tool can report duplicates.
  • It is not yet decided whether deduplication will be supported on Windows.

Installation

To install fidx from crates.io, simply run:

$ cargo install fidx

To install from source, clone the repository, open a work directory and run cargo install --path . in it:

$ fossil clone --workdir fidx https://repos.qrnch.tech/pub/fidx fidx.fossil
$ cd fidx
$ cargo install --path .

Initialization

The first step in using fidx on a directory tree is to initialize a state database. This is done by running fidx init in the directory tree's base directory.

If the tool will be used to maintain checksums on an external disk which is mounted on /Volumes/backup_archive, then run:

$ cd /Volumes/backup_archive
$ fidx init

All the entries in the database will be relative to the directory where the database was initialized.

Hashing

To avoid masking bit rot caused by degrading storage media, the checksum database must only be updated when the file archive has been deliberately modified. fidx accomplishes this by never recalculating the hash of a file whose modification timestamp [1] in the filesystem is unchanged.
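
The rule above can be sketched in a few lines of Python. This is only an illustration of the idea, not fidx's actual implementation: the function names, the in-memory index, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
import os


def file_hash(path):
    """Return the SHA-256 hex digest of a file (the hash choice is an assumption)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def update_index(root, index):
    """Rehash only files whose mtime differs from the recorded one.

    `index` (hypothetical) maps a path relative to `root` to a
    (mtime_ns, digest) pair. Returns the relative paths that were hashed.
    """
    rehashed = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            mtime = os.stat(full).st_mtime_ns
            known = index.get(rel)
            if known is not None and known[0] == mtime:
                continue  # unchanged mtime: leave the stored hash untouched
            index[rel] = (mtime, file_hash(full))
            rehashed.append(rel)
    return rehashed
```

Because a file with an unchanged mtime is never rehashed, silent corruption of its contents leaves the stored hash intact, so a later verification can still detect the damage.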

Ignore lists

fidx supports ignore lists, which can be used to ignore filesystem entries by glob expressions. There's no way to manage ignore entries using the fidx command line tool. Instead, users must manipulate the database directly using their favorite SQLite database editor, such as the standard sqlite3 command line tool.

The default ignore list for a newly initialized fidx database is:

$ sqlite3 .findex
sqlite> .mode column
sqlite> .header on
sqlite> SELECT * FROM ignore;
id  path
--  -------------
1   .findex
2   .findex-shm
3   .findex-wal
4   .findexer.log
5   **/.*.swp
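
As an alternative to an interactive sqlite3 session, an ignore entry could be added with a short script. The sketch below uses Python's standard sqlite3 module and assumes, based only on the output shown above, that the ignore table has an id column that is assigned automatically and a path column holding the glob. The function name is hypothetical, and it is wise to back up the database before editing it.

```python
import sqlite3


def add_ignore_pattern(db_path, pattern):
    """Append a glob pattern to the ignore table and return all entries.

    Assumes a table "ignore" with columns (id, path), where id is
    assigned automatically on insert.
    """
    con = sqlite3.connect(db_path)
    try:
        with con:  # commits on success, rolls back on error
            con.execute('INSERT INTO "ignore" (path) VALUES (?)', (pattern,))
        return con.execute('SELECT id, path FROM "ignore" ORDER BY id').fetchall()
    finally:
        con.close()
```

For example, add_ignore_pattern(".findex", "**/*.tmp") run from the base directory would add a pattern for temporary files, if the schema matches the assumption.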

Updating

To update the hash database run the subcommand update:

$ fidx update

This must be run from the base directory (i.e. where the fidx initialization was performed, and the state database resides).

Verify

To verify the integrity of the archive against the stored database run the subcommand verify:

$ fidx verify

The verify subcommand verifies all files under the current directory within the managed tree against the current state of the database. This means that verify will not detect files newly added to the file system. It works on the premise that the user has run an update since the latest changes to the tree.

To verify the entire archive, the verify subcommand must be run from the base directory (the directory where the fidx database resides). Otherwise it only verifies the subtree rooted at the current directory.

An optional argument can be passed to the verify subcommand to specify a subset to verify. If a tree is managed under /backups, which contains a subdirectory foo, which in turn contains a subdirectory bar (i.e. /backups/foo/bar) the following would only verify files under /backups/foo/bar:

$ cd /backups/foo
$ fidx verify bar
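
The verification semantics described above can be sketched as follows. This is an assumption-laden illustration rather than fidx's code: the index is a plain dictionary mapping /-separated relative paths to SHA-256 digests, and the function names are hypothetical.

```python
import hashlib
import os


def sha256_of(path):
    """Return the SHA-256 hex digest of a file (the hash choice is an assumption)."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def verify(base, index, subtree=""):
    """Check the recorded files under `subtree`; return {rel_path: problem}.

    Only paths present in `index` are examined, so files added to the
    filesystem since the last update are not noticed.
    """
    prefix = subtree.rstrip("/") + "/" if subtree else ""
    problems = {}
    for rel, digest in sorted(index.items()):
        if not rel.startswith(prefix):
            continue  # outside the requested subtree
        full = os.path.join(base, rel)
        if not os.path.isfile(full):
            problems[rel] = "missing"
        elif sha256_of(full) != digest:
            problems[rel] = "hash mismatch"
    return problems
```

Calling verify("/backups", index, "foo/bar") in this sketch corresponds to the subtree example above: entries outside foo/bar are skipped.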

[1] In other words: fidx uses the file's modification time to detect intended updates of files. Don't try to do anything creative and abnormal with mtimes which breaks this important assumption.

Tagging

The tagging system in fidx is simple (and limited), but it has one particular quirk which can cause some confusion: tags are internally associated with hashes, not file entries. This is based on the assumption that tags are used to describe the contents of a file, which has two benefits:

  • If a file is renamed [1] (without changing its contents) it will not lose its tags.
  • If several files share the same contents, only one will need to be tagged but all of the files will gain the tag(s).

Tags are stored in the fidx tree database, and thus are local to the tree.
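
A minimal model of this design, assuming nothing about fidx's real schema, keeps one mapping from paths to content hashes and another from hashes to tags. The class and attribute names below are hypothetical.

```python
from collections import defaultdict


class TagStore:
    """Toy model: tags hang off content hashes, not off file entries."""

    def __init__(self):
        self.hash_of = {}                # path -> content hash
        self.tags_of = defaultdict(set)  # content hash -> set of tags

    def tag(self, path, *tags):
        """Attach tags to the contents of `path`."""
        self.tags_of[self.hash_of[path]].update(tags)

    def tags(self, path):
        """Return the sorted tags associated with the contents of `path`."""
        return sorted(self.tags_of.get(self.hash_of[path], ()))

    def rename(self, old, new):
        """Model a rename: the path changes, the hash (and its tags) stay."""
        self.hash_of[new] = self.hash_of.pop(old)
```

In this model, tagging one of two files with identical contents makes the tag visible on both, and a rename leaves the tags in place because only the path-to-hash mapping changes.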

Managing tags

Before files can be tagged, the tags must exist in the database. To add tags, use the subcommand add-tag (more than one tag can be added at a time):

$ fidx add-tag some-tag another

(Each tag must be unique.)

Once there are tags in the database, use the subcommand list-tags to list them:

$ fidx list-tags
    2 another
    1 some-tag

A tag can be renamed using the rename-tag subcommand. The following would rename a tag called from-name to to-name:

$ fidx rename-tag from-name to-name

(Only one tag can be renamed at a time.)

Tags can be removed using the remove-tag subcommand:

$ fidx remove-tag some-tag another

Tagging/Untagging files

Note: Because tags are associated with hashes, it is important to only tag or untag files in a tree which has not been modified since the last database update. Always run an update before managing tag associations in a tree.

To attach one or more tags to a file use the subcommand tag, which by default takes as its first argument the name of the file to tag, and the remaining arguments are tag names:

$ fidx tag fossil-repos-2020.tar scm fossil source-archives

This behavior can be changed by adding an -- argument, which causes all arguments before the -- to be interpreted as a list of files, while arguments after the -- will be interpreted as a list of tags.

$ fidx tag fossil-repos-2020.tar fossil-repos-2021.tar -- scm fossil

A side-effect of this design is that fidx tagging does not work well with files named -- (but naming a file that is kind of asking for trouble, so wish granted).

If a file name or tag argument begins with an @ character, and the string immediately following the @ character is a file name (which exists), this file will be treated as a list of files/tags, depending on where it is specified.

Assuming files.list exists and is a list of files to tag, and tags.list exists and is a list of tags to apply to the files in files.list, use:

$ fidx tag @files.list @tags.list

The @-list feature can be combined with --:

$ fidx tag some-file.tar @files.list another-file.tar -- some-tag @tags.list another-tag

Just as the special -- argument implies some weird semantics regarding files called simply --, the @<file/tag> syntax introduces some weirdness surrounding file and tag names with a leading @, so such names are discouraged.
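
The argument rules described in this section can be summarized in a small sketch. It is not taken from fidx's source. It assumes that a list file holds one entry per line, and that an @ argument which does not name an existing file is kept as a literal name. Both function names are hypothetical.

```python
import os


def expand(items):
    """Replace each @name that refers to an existing file with its lines."""
    out = []
    for item in items:
        list_path = item[1:]
        if item.startswith("@") and os.path.isfile(list_path):
            with open(list_path, encoding="utf-8") as f:
                # One entry per line (assumption); blank lines are skipped.
                out.extend(line.rstrip("\n") for line in f if line.strip())
        else:
            out.append(item)  # ordinary file name or tag
    return out


def split_tag_args(args):
    """Split the arguments of `tag` into (files, tags)."""
    if "--" in args:
        i = args.index("--")
        files, tags = args[:i], args[i + 1:]
    else:
        # Default form: first argument is the file, the rest are tags.
        files, tags = args[:1], args[1:]
    return expand(files), expand(tags)
```

The sketch also makes the two quirks visible: a file literally named -- is always taken as the separator, and a name with a leading @ changes meaning as soon as a matching list file exists.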

To inspect what tags a file is associated with use the subcommand tags:

$ fidx tags fossil-repos-2020.tar
list tags for fossil-repos-2020.tar
   2 fossil
   6 scm
   7 source-archives

Removing tags from a file can be done using the untag subcommand, which has the same format as tag: the first argument is the file name and the following arguments are the tags to disassociate from that file:

$ fidx untag fossil-repos-2020.tar scm

To search for a file based on tags use the subcommand search:

$ fidx search some-tag another

The search is very basic, and only supports implicit AND boolean searches, meaning that only files with all matching tags will be returned.
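
The implicit AND can be expressed as a subset test. The following sketch assumes a hypothetical in-memory mapping from file names to their tags and is only meant to illustrate the matching rule, not how fidx queries its database.

```python
def search(tags_by_file, wanted):
    """Return the files whose tags include every tag in `wanted`."""
    wanted = set(wanted)
    return sorted(
        path for path, tags in tags_by_file.items() if wanted <= set(tags)
    )
```

A file tagged scm and fossil matches a search for scm alone or for both tags, but not a search for scm and some third tag it lacks.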

To run a command on each entry in the output of a search on unix-like platforms, one can use xargs. Note that by default xargs treats spaces as argument separators, which causes issues if file names or paths contain spaces. Unfortunately, the workaround does not look the same on macOS and Linux.

The following examples will remove any files from the filesystem that are tagged with the tag "delete-me".

macOS (use tr to convert newlines to nul, tell xargs (using -0) to only separate entries by nul):

$ fidx search delete-me | tr '\n' '\0' | xargs -0 rm

Linux (tell xargs to use only \n as a delimiter; by default the delimiters include spaces):

$ fidx search delete-me | xargs -d '\n' rm
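
Where a single approach for both platforms is preferred, the search output could instead be piped into a small script. The sketch below assumes, as the xargs examples do, that the output is one path per line. The function names are hypothetical, and the dry-run default is a precaution added for the example.

```python
import os


def split_paths(text):
    """Split newline-separated output into paths, leaving spaces intact."""
    return [line for line in text.split("\n") if line]


def remove_all(paths, dry_run=True):
    """Remove each path and return the list of paths that were selected.

    With dry_run=True (the default) nothing is deleted.
    """
    selected = []
    for path in paths:
        if not dry_run:
            os.remove(path)
        selected.append(path)
    return selected
```

A script built on this could call remove_all(split_paths(sys.stdin.read()), dry_run=False) after importing sys, and receive its input from fidx search delete-me through a pipe.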

[1] fidx has no concept of a file/directory rename; it treats a rename as a deletion followed by a new file. If tags were associated with file entries rather than their contents, the deletion would cause the tag association to disappear.

Deduplication

Caution: The deduplication feature in fidx is built around its database of hashes, which means that it's important not to perform deduplication unless the database is known to be up to date with the current state of the file system tree.

Listing duplicates

To list all known duplicates use the subcommand dups:

$ fidx dups

This list will not include entries which have already been deduplicated.
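
Conceptually, finding duplicates in a hash database amounts to grouping paths by hash and keeping the groups with more than one member. The sketch below illustrates this with a hypothetical in-memory mapping from paths to digests. It does not model the exclusion of already deduplicated entries, and it is not fidx's implementation.

```python
from collections import defaultdict


def find_duplicates(hash_by_path):
    """Group paths by content hash; return only groups with 2+ paths."""
    groups = defaultdict(list)
    for path, digest in hash_by_path.items():
        groups[digest].append(path)
    return {
        digest: sorted(paths)
        for digest, paths in groups.items()
        if len(paths) > 1
    }
```

Since the grouping relies entirely on the stored hashes, its result is only as accurate as the database, which is why the caution above asks for an up-to-date database first.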

Deduplicating

To perform a deduplication use the subcommand dedup:

$ fidx dedup