bupstash-repository
SYNOPSIS
Overview of the bupstash repository format.
DESCRIPTION
The most important part of bupstash is the repository. It is where all data is stored in a mostly
encrypted form. The bupstash client interacts via the repository over stdin/stdout of the bupstash-serve
process. This may be locally, or via a protocol such as ssh.
Because most data is encrypted, the repository structure is quite simple.
Files:
repo
├── data
│ ├── ...
│ └── 5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03
├── items
│ ├── 031d91b342fc76b8a4b32e2a8d12e4d0
│ └── ffaa0127fd9938aa0a3eaf6070aa947d
├── meta
│ ├── gc_generation
│ ├── gc_dirty
│ ├── schema_version
│ └── storage_engine
├── wal
│ ├── ...
│ └── 00000000N.wal
├── repo.oplog
├── repo.lock
├── tx.lock
├── tx.seq
└── tx.wal
repo.oplog
This file is an append only ledger where each entry is a bare encoded log op of the following format:
type Xid data<16>;
type Address data<32>;
type LogOp (AddItem | RemoveItems | RecoverRemoved);
type AddItem {
id: Xid
metadata: VersionedItemMetadata
}
type RemoveItems {
items: []Xid
}
type RecoverRemoved {}
type VersionedItemMetadata = (V1VersionedItemMetadata | V2VersionedItemMetadata | V2VersionedItemMetadata)
type V1VersionedItemMetadata {
// deprecated
}
type V2VersionedItemMetadata {
// deprecated
}
type V3VersionedItemMetadata {
primary_key_id: Xid,
unix_timestamp_millis: u64,
tree_height: usize,
address: Address,
encryped_metadata: data
}
struct V3SecretItemMetadata {
plain_text_hash: data<32>
send_key_id: Xid,
hash_key_part_2: data<32>,
tags: Map[String]String,
}
It is important to note, all metadata like search tags are stored encrypted and are not
readable without a master key or metadata key.
repo.lock
This lock is held exclusively during garbage collection and in a shared fashion
during operations that modify the repository.
tx.lock
Bupstash uses tx.lock
and tx.wal
to coordinate crash safe edits across multiple files.
tx.wal
This file is a bare encoded WAL (write ahead log) with the following schema:
type WalOp = Begin | End | CreateFile | WriteFileAt | Remove | Rename | Mkdir;
type Begin {
sequence_number: u64,
};
type End {};
type CreateFile {
path: String,
data_size: Uint,
};
type WriteFileAt {
path: String,
offset: Uint,
data_size: Uint,
};
type Remove {
path: String,
};
type Rename {
path: String,
to: String,
};
type Mkdir {
path: String,
};
The final 32 bytes of the write ahead log are the blake3 hash of the previous file contents.
When bupstash needs to modify repository metadata, it first writes a wal file and flushes it to disk, then performs the given operations in sequence. On crash the operations
will be replayed again preventing data loss.
Do not delete a write ahead log if you see one, it is critical for data integrity.
tx.seq
A file containing a sequence number used for numbering WAL files.
data/
This directory contains a set of encrypted and deduplicated data chunks.
The name of the file corresponds to the an HMAC hash of the unencrypted contents, as such
if two chunks are added to the repository with the same hmac, they only need to be stored once.
This directory is not used when the repository is configured for storage engines other than "Dir" storage.
items/
This directory contains one file for each item, where the contents of the file is an encoded
VersionedItemMetadata
as described in the repo.oplog section. When an item is removed and is
pending garbage collection it is given the .removed suffix.
Contains the JSON storage engine specification, which allows storage of data chunks
in external or alternative storage formats. This file is human editable to assist
manual data migrations between supported formats.
This file contains schema version of a repository.
Each time a garbage collection happens, this file is changed and is used to invalidate
client side caches.
This file marks if a garbage collection was interrupted prematurely and is used for crash
recovery. This file not always present.
wal/
When the BUPSTASH_KEEP_WAL=1 env var is set for the bupstash serve
process, this
directory contains the historic WAL files that can be used for point in time recovery.
The hash tree structure
Bupstash stores arbitrary streams of data in the repository by splitting the stream into chunks,
hmac addressing the chunks, then compressing and encrypting the chunks with the public key portion of a bupstash key.
Each chunk is then stored in the data directory in a file named after the hmac hash of the contents.
As we generate a sequence of chunks with a corresponding hmac addresses,
we can build a tree structure out of these addresses. Leaf nodes of the tree are simply the encrypted data.
Other nodes in the tree are simply unencrypted lists of hmac hashes, which may point to encrypted leaf nodes,
or other subtrees. The key idea behind the hash tree, is we can convert an arbitrary stream of data
into a single HMAC address with approximately equal sized chunks.
When multiple hash trees are added to the repository, they share structure and enable deduplication.
This addressing and encryption scheme has some important properties:
- The repository owner cannot guess chunk contents as the HMAC key is unknown to him.
- The repository owner cannot decrypt leaves of the hash tree, as they are encrypted.
- The repository owner can iterate the hash tree for garbage collection purposes.
- The repository owner can run garbage collection without retrieving the leaf nodes from cold storage.
- The repository owner can push stream a of hash tree nodes to a client with no network round trips.
- A client can send data streams to a repository without sharing the encryption key.
- A client can retrieve and verify a datastream by checking hmacs.
These properties are desirable for enabling high performance garbage collection and data streaming
with prefetch on the repository side.
Chunking and deduplication
Data is deduplicated by splitting a data stream into small chunks, and never storing the same chunk twice.
The performance of this deduplication is thus determined by how chunks split points are defined. For curious
readers - bupstash uses something known as 'content defined chunking' to find efficient chunk splits.
Chunks in the database are one of the following types, in general we know the type of a chunk
based on the item metadata and the hash tree height.
Encrypted data chunk
These chunks form the roots of our hash trees, they contain encrypted data. They contain
a key exchange packet, with enough information for the master key to derive the session key.
KEY_EXCHANGE_PACKET1_BYTES[PACKET1_SZ] || ENCRYPTED_BYTES[...]
After decryption, the chunk is optionally compressed, so is either compressed data, or data with a null footer byte.
COMPRESSED_DATA[...] || DECOMPRESSED_SIZE[4] || COMPRESSION_TYPE[1]
or
DATA[...] || 0x00
Valid compression types are:
- 1 == lz4 compression.
- 2 == zstd compression.
Hash tree node chunk
These chunks form non leaf nodes in our hash tree, they consist of an array of addresses prefixed
with the total number of data chunks that are beneath them in the tree.
NUM_DATA_CHUNKS[8] ADDRESS[ADDRESS_SZ]
NUM_DATA_CHUNKS[8] ADDRESS[ADDRESS_SZ]
NUM_DATA_CHUNKS[8] ADDRESS[ADDRESS_SZ]
NUM_DATA_CHUNKS[8] ADDRESS[ADDRESS_SZ]
...
These addresses must be recursively followed to read our data chunks, these addresses correspond
to data chunks when the tree height is 0. The chunk counts can be used to efficiently seek to address offsets
in the tree.
Coming soon...
SEE ALSO
bupstash