Full Mattermost server source with integrated Community Enterprise features. Includes vendor directory for offline/air-gapped builds. Structure: - enterprise-impl/: Enterprise feature implementations - enterprise-community/: Init files that register implementations - enterprise/: Bridge imports (community_imports.go) - vendor/: All dependencies for offline builds Build (online): go build ./cmd/mattermost Build (offline/air-gapped): go build -mod=vendor ./cmd/mattermost 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
642 lines
25 KiB
Markdown
642 lines
25 KiB
Markdown
# MINLZ FORMAT SPECIFICATION V1.0.0
|
|
|
|
All implementations are requested to state: "This implements the MinLZ specification v1.0"
|
|
|
|
Furthermore, if a subset of features is supported, it should state this clearly.
|
|
|
|
A reference decoder is provided. If there is any ambiguity in the specification,
|
|
the behavior of the reference decoder should be followed.
|
|
|
|
The spec versioning follows [semantic versioning](https://semver.org/).
|
|
|
|
* Major version numbers indicate breaking changes.
|
|
* Minor version numbers indicate added functionality that will not be readable by previous versions.
|
|
* Patch version numbers indicate non-breaking additions to the spec.
|
|
|
|
# BLOCK FORMAT
|
|
|
|
MinLZ is a LZ77-style compressor with a fixed size, byte-oriented encoding.
|
|
|
|
All values are encoded with full-byte offsets as an interleaved stream,
|
|
where operations and literals are intermixed, similar to LZ4 and Snappy.
|
|
|
|
This specification defines the decoding format.
|
|
The encoding of specific content is not defined and may change for any implementation.
|
|
|
|
The basic structure is similar to Snappy, but most encodings have been adjusted.
|
|
|
|
## 1.0 MinLZ Indicator
|
|
|
|
An initial byte of value 0 indicates that this is MinLZ encoded block.
|
|
|
|
A 0-byte block is allowed, and is encoded as a single byte with a 0 value.
|
|
|
|
If the first byte is not 0, this can be used to handle seamless
|
|
fallback to Snappy.
|
|
Decoders *may* implement fallback to Snappy and provide seamless decoding
|
|
of the block if this value is non-zero.
|
|
|
|
## 1.1 Block Size
|
|
|
|
A block starts with the uncompressed length up to a maximum of 2^24,
|
|
stored as an unsigned varint.
|
|
|
|
Maximum uncompressed block size is 8 MiB = 8,388,608 bytes.
|
|
|
|
If this value is 0, the rest of the block should be emitted as literals.
|
|
|
|
A compressed block may not be bigger than the decompressed block after reading the header.
|
|
|
|
## 2 MinLZ Encoding
|
|
|
|
Each block is encoded as a sequence of literals and copy/repeat commands.
|
|
|
|
Each element starts with a tag byte, and the lower two bits of this tag
|
|
byte signal what type of element will follow:
|
|
|
|
| Tag | Meaning |
|
|
|-----|--------------------------------------|
|
|
| 0 | Literal(s) or Repeat previous offset |
|
|
| 1 | Copy with 10-bit offset |
|
|
| 2 | Copy with 16-bit offset |
|
|
| 3 | Copy with up to 21-bit offset |
|
|
|
|
The interpretation of the upper six bits is tag-dependent.
|
|
|
|
Values spanning multiple bytes are stored in little-endian order.
|
|
|
|
### 2.1 Literals/Repeat (tag 00)
|
|
|
|
The literal length is stored differently depending on the length
|
|
of the literal block.
|
|
|
|
Optionally, instead of literals, a repeat copy can be specified,
|
|
copying from the previous offset, or 1 if at the beginning of block
|
|
|
|
The value of the tag byte should be interpreted as follows:
|
|
|
|
| Bits | Meaning | Description |
|
|
|------|---------|------------------------------------|
|
|
| 0-1 | Tag | Always 0 |
|
|
| 2 | Repeat | When set do repeat copy |
|
|
| 3-7 | Length | Length of literal or repeat block. |
|
|
|
|
Length follows the tag, and is encoded like this:
|
|
|
|
| Value | Literals |
|
|
|-------|-------------------|
|
|
| 0-28 | 1 + Value |
|
|
| 29 | 30 + Read 1 byte |
|
|
| 30 | 30 + Read 2 bytes |
|
|
| 31 | 30 + Read 3 bytes |
|
|
|
|
Repeats are handled as copies, but with no offset specified.
|
|
See below how offsets are handled.
|
|
|
|
Literals follow the tag or extended length field.
|
|
|
|
### 2.2 Copies
|
|
|
|
Copies are references back into previous decompressed data, telling
|
|
the decompressor to reuse data it has previously decoded.
|
|
They encode two values: The _offset_, saying how many bytes back
|
|
from the current position to read, and the _length_, how many bytes
|
|
to copy.
|
|
|
|
Backreferences that go past the end of the block
|
|
(offset > current decompressed position) are not allowed.
|
|
|
|
As in most LZ77-based compressors, the length can be larger than the offset,
|
|
yielding a form of run-length encoding (RLE). For instance,
|
|
"xababab" could be encoded as
|
|
|
|
`<literal: "xab"> <copy: offset=2 length=4>`
|
|
|
|
There are several different kinds of copy elements, depending on
|
|
the number of bytes to be copied (length), and how far back the
|
|
data to be copied is (offset).
|
|
|
|
Furthermore, a "repeat" can be emitted, which will use the last offset for a copy,
|
|
but with a new length. Initial repeat offset is 1. So an initial RLE can be encoded as
|
|
|
|
`<start of block> <literal: "x"> <repeat: length=4>`
|
|
|
|
This will emit "xxxxx"
|
|
|
|
Matches shorter than 4 bytes cannot be represented, except for repeats.
|
|
Longer offset representations have minimum offsets, mainly to help decompression speed.
|
|
|
|
#### 2.2.1 Fused Literals+Copy
|
|
|
|
Copies with 2 or 3 byte offset can contain up to 4 literals in their encoding.
|
|
The literals must be emitted before the copy.
|
|
|
|
The offset is the offset of the destination *after* the copy,
|
|
so the copy offset is the same as if the literals were emitted separately.
|
|
|
|
See section 2.5.1 and 2.5.2 for details on fused copy operations.
|
|
|
|
Encoding should prefer fused literal+copy when tied for size, since the fused operation
|
|
will typically decode faster than separate operations.
|
|
|
|
### 2.3 Copy1 with 1-byte offset (tag 01)
|
|
|
|
These elements can encode lengths between [4..273] bytes and offsets
|
|
between [1..1024] bytes.
|
|
|
|
| Bits | Meaning | Description |
|
|
|------|-----------|--------------------------------------------------------------|
|
|
| 0-1 | Tag | Always 1 |
|
|
| 2-5 | Length | Length of copy<br/>Values are 0-15. See decoding table below |
|
|
| 6-7 | Offset LB | Lower 2 bits of offset |
|
|
|
|
| Bits | Meaning | Description |
|
|
|------|-----------|-------------------|
|
|
| 0-7 | Offset UB | Offset Upper bits |
|
|
|
|
Offset is 1 more than stored value. The minimum offset is 1, and the maximum is 1024.
|
|
|
|
| Value | Output Length |
|
|
|-------|------------------|
|
|
| 0-14 | 4 + Value |
|
|
| 15 | 18 + Read 1 byte |
|
|
|
|
Maximum length is therefore 18 + 255 = 273
|
|
|
|
The extra length byte is stored *after* the offset byte if present.
|
|
|
|
Minimum encoded length 2 bytes, max 3 bytes with length 19 -> 273.
|
|
Longer matches should emit a repeat with the extra bytes, or use Copy2.
|
|
|
|
### 2.4 Copy2 with 2-byte offset (tag 10)
|
|
|
|
These elements can encode lengths between [4...] bytes and offsets
|
|
between [64...65599] bytes.
|
|
|
|
| Bits | Meaning | Description |
|
|
|------|---------|---------------------------------------------------------------|
|
|
| 0-1 | Tag | Always 2 |
|
|
| 2-7 | Length | Length of copy.<br/>Values are 0-63. See decoding table below |
|
|
|
|
Offsets are encoded as 2 little-endian bytes following the tag.
|
|
The minimum offset is 64 which should be added to the stored value.
|
|
The maximum backreference offset is therefore 65,599.
|
|
|
|
| Bits | Meaning | Description |
|
|
|-------|-------------|---------------------------|
|
|
| 0-15 | Offset | Offset + 64 `[64->65599]` |
|
|
|
|
Lengths are encoded as follows:
|
|
|
|
| Value | Output |
|
|
|-------|-------------------|
|
|
| 0-60 | 4 + Value |
|
|
| 61 | 64 + Read 1 byte |
|
|
| 62 | 64 + Read 2 bytes |
|
|
| 63 | 64 + Read 3 bytes |
|
|
|
|
Minimum encoded length 3 bytes, max 6 bytes.
|
|
|
|
When both Copy1 and Copy2 have similar encoded length (mainly length 19->64),
|
|
prefer copy2 as decoding will be faster.
|
|
|
|
### 2.5 Fused Copy2/3 (tag 11)
|
|
|
|
Tag can contain either a copy with 2 or 3 byte offsets and fused literals.
|
|
|
|
The third bit indicates if this is a copy2 or copy3.
|
|
|
|
| Bits | Meaning | Description |
|
|
|------|----------------|---------------------------------------------|
|
|
| 0-1 | Tag | Always 3 |
|
|
| 2 | Copy3 | 1 when Copy3, 0 when Fused Copy2. |
|
|
| 3-4 | Literal Length | Number of literals to emit before the copy. |
|
|
|
|
The literal length is shared as bit 3-4, but Fused Copy2 has a minimum of 1.
|
|
|
|
#### 2.5.1 Fused Copy2
|
|
|
|
Fused Copy2 offers a short 4->11 byte copy with 16 bit offset, preceded by 1-4 literals.
|
|
|
|
| Bits | Meaning |
|
|
|------|----------------------------------|
|
|
| 0-1 | Tag. Always 3 |
|
|
| 2 | Copy3. Always 0 |
|
|
| 3-4 | Literal length + 1 `[1->4]` |
|
|
| 5-7 | Copy Length + 4 `[4->11]` |
|
|
| 8-23 | 16 bit offset + 64 `[64->65599]` |
|
|
| 24-> | 1-4 Literals |
|
|
|
|
For literal + copy2, a 2-byte offset will follow the tag, then the immediate(s) will follow.
|
|
|
|
A repeat operation can be used to extend the copy.
|
|
|
|
Encoded fused copy2 is 3 bytes, with 1-4 additional literals.
|
|
|
|
### 2.5.2 Copy3 (Optionally Fused)
|
|
|
|
Offsets are encoded as 21 bits, with 0 -> 3 fused literals.
|
|
The minimum offset is 65,536 and must be added the stored offset.
|
|
The maximum backreference offset is therefore 2,162,687 (1<<21 + 65535).
|
|
|
|
| Bits | Meaning |
|
|
|-------|------------------------------------|
|
|
| 0-1 | Tag. Always 3 |
|
|
| 2 | Copy3. Always 1. |
|
|
| 3-4 | Literal length `[0->3]` |
|
|
| 5-10 | Copy Length. 6 bits. (See table) |
|
|
| 11-31 | 21 bit offset + 65536 `[64K->2MB]` |
|
|
| 32->x | 0-3 Extended Length Bytes |
|
|
| x-> | 0-3 Literals |
|
|
|
|
Length is encoded as follows:
|
|
|
|
| Value | Output |
|
|
|-------|-------------------|
|
|
| 0-60 | 4 + Value |
|
|
| 61 | 64 + Read 1 byte |
|
|
| 62 | 64 + Read 2 bytes |
|
|
| 63 | 64 + Read 3 bytes |
|
|
|
|
Minimum encoded Copy3 length is 4 bytes.
|
|
Max length is 7 bytes, plus up to 3 literals.
|
|
|
|
Note that the shortest length Copy3 does not gain anything,
|
|
except possibly avoid a literal tag, if there are fused literals,
|
|
or a repeat can be set up for copy.
|
|
|
|
## 3 DICTIONARY FORMAT
|
|
|
|
TBD.
|
|
|
|
# STREAM FORMAT
|
|
|
|
Follows [Snappy Framing Format](https://github.com/google/snappy/blob/main/framing_format.txt) -
|
|
but with modifications to allow to be easily backwards compatible with Snappy/S2.
|
|
|
|
### 1. General structure
|
|
|
|
The file consists solely of chunks, lying back-to-back with no padding
|
|
in between. Each chunk consists first a single byte of chunk identifier,
|
|
then a three-byte little-endian length of the chunk in bytes (from 0 to
|
|
16,777,215 inclusive), and then the data if any. The four bytes of chunk
|
|
header is not counted in the data length.
|
|
|
|
The different chunk types are listed below. The first chunk must always
|
|
be the stream identifier chunk (see section 4.1, below).
|
|
|
|
### 2. File type identification
|
|
|
|
The following identifiers for this format are recommended where appropriate.
|
|
However, note that none have been registered officially, so this is only to
|
|
be taken as a guideline.
|
|
|
|
File extension: .mz
|
|
MIME type: application/x-minlz-compressed
|
|
HTTP Content-Encoding: x-minlz-compressed
|
|
|
|
Individual blocks contain no corruption detection, so these should not be exchanged.
|
|
However, if software produces them they should use the `.mzb` extension.
|
|
|
|
### 3. Checksum format
|
|
|
|
Some chunks have data protected by a checksum (the ones that do will say so
|
|
explicitly). The checksums are always masked CRC-32Cs.
|
|
|
|
A description of CRC-32C can be found in [RFC 3720](https://datatracker.ietf.org/doc/html/rfc3720),
|
|
section 12.1, with examples in section B.4.
|
|
|
|
Checksums are not stored directly, but masked, as checksumming data and
|
|
then its own checksum can be problematic. The masking is the same as used
|
|
in Apache Hadoop: Rotate the checksum by 15 bits, then add the constant
|
|
0xa282ead8 (using wraparound as normal for unsigned integers). This is
|
|
equivalent to the following C code:
|
|
|
|
```
|
|
uint32_t mask_checksum(uint32_t x) {
|
|
return ((x >> 15) | (x << 17)) + 0xa282ead8;
|
|
}
|
|
```
|
|
|
|
Note that the masking is reversible.
|
|
|
|
The checksum is always stored as a four-byte long integer, in little-endian.
|
|
|
|
### 4. Chunk types
|
|
|
|
The currently supported chunk types are described below. The list may
|
|
be extended in the future.
|
|
|
|
| ID | Description | See Section |
|
|
|---------|-------------------------------|-------------|
|
|
| 0 | (legacy compressed Data) | 4.3 |
|
|
| 1 | Uncompressed Data | 4.3 |
|
|
| 2, 3 | MinLZ Compressed Block | 4.4, 4.5 |
|
|
| 32 | EOF | 4.6 |
|
|
| 4-63 | (reserved, non-skippable) | 4.8 |
|
|
| 64 | Stream Index | 4.12 |
|
|
| 65-127 | (reserved, skippable) | 4.9 |
|
|
| 128-191 | (user defined, skippable) | 4.10 |
|
|
| 192-253 | (user defined, non-skippable) | 4.11 |
|
|
| 254 | Padding | 4.7 |
|
|
| 255 | Stream identifier | 4.1 |
|
|
|
|
### 4.1. Stream identifier (chunk type 0xff)
|
|
|
|
The stream identifier is always the first element in the stream.
|
|
It is exactly six bytes long and starts with "MinLz" in ASCII.
|
|
This means that a valid MinLZ framed stream always starts with the bytes:
|
|
|
|
(type) (length) M i n L z
|
|
0xff 0x06 0x00 0x00 0x4d 0x69 0x6e 0x4c 0x7a <stream info>
|
|
|
|
The final byte of the identifier is a block size indicator.
|
|
|
|
| Bits | Description |
|
|
|------|---------------------------|
|
|
| 0-3 | Max block size indicator |
|
|
| 4-5 | Reserved, must be ignored |
|
|
| 6-7 | Reserved, must be 0 |
|
|
|
|
#### 4.1.1 Max Block Size
|
|
|
|
The value is the log2 of the maximum block size minus 10.
|
|
|
|
So if the value is `0x4`, the maximum block size is 2^(4+10) (16KiB).
|
|
The maximum block size is 2^23 (8MiB), so the maximum identifier is 13,
|
|
which can also be used if for some reason the maximum block size is not known.
|
|
This size only applies to content frames.
|
|
|
|
Decoders may choose not to decode streams based on the maximum block size.
|
|
Decoders *must* reject any value > 13.
|
|
|
|
To allow concatenation, a stream identifier can follow an EOF chunk.
|
|
|
|
### 4.2. Uncompressed data (chunk type 0x01)
|
|
|
|
Uncompressed data chunks allow a compressor to send uncompressed,
|
|
raw data; this is useful if, for instance, incompressible or
|
|
near-incompressible data is detected, and faster decompression is desired.
|
|
|
|
As in the compressed chunks, the data is preceded by its own masked
|
|
CRC-32C (see section 3).
|
|
|
|
An uncompressed data chunk minus the CRC should not exceed the maximum block size
|
|
as indicated by the Stream identifier.
|
|
|
|
### 4.3. Legacy Compressed data (chunk type 0x00) — BACKCOMPAT ONLY
|
|
|
|
Type 0x00 compressed chunks are not allowed in MinLZ streams.
|
|
|
|
Instead, use type 0x02, which indicates a MinLZ compressed block.
|
|
|
|
### 4.4. MinLZ Compressed data (chunk type 0x02)
|
|
|
|
A MinLZ block *without* the MinLZ identifier (initial 0 byte).
|
|
|
|
A CRC32C (see section 3) checksum of the *uncompressed*
|
|
data is stored at the beginning of the block.
|
|
|
|
Chunks with 0 decompressed bytes are not allowed,
|
|
as well as blocks with decompressed size less than compressed size.
|
|
|
|
### 4.5. MinLZ Compressed data - Compressed CRC (chunk type 0x03).
|
|
|
|
A MinLZ block *without* the MinLZ identifier (initial 0 byte).
|
|
|
|
A CRC32C (see section 3) checksum of the *compressed*
|
|
data is stored at the beginning of the block.
|
|
|
|
Chunks with 0 decompressed bytes are not allowed,
|
|
as well as blocks with decompressed size less than compressed size.
|
|
|
|
If possible prefer type `0x02` over this.
|
|
|
|
### 4.6 EOF (chunk type 0x20)
|
|
|
|
The end of the stream is indicated by a chunk with the ID `0x20`.
|
|
This allows detection of truncated streams.
|
|
|
|
The output size of the stream is encoded as an unsigned varint as the only content.
|
|
It is allowed to add an empty block to not validate size.
|
|
Maximum chunk size is 10 bytes (64 bit varint encoded).
|
|
|
|
Encoders should always emit this chunk.
|
|
Decoders can optionally reject streams that do not have this chunk.
|
|
|
|
If this is the first chunk in the stream, the stream is empty.
|
|
|
|
If there are multiple Stream identifier (chunk type 0xff) chunks in the stream,
|
|
each must have a corresponding EOF chunk and the count resets to 0 at each Stream identifier.
|
|
|
|
### 4.7. Padding (chunk type 0xfe)
|
|
|
|
Padding chunks allow a compressor to increase the size of the data stream
|
|
so that it complies with external demands, e.g. that the total number of
|
|
bytes is a multiple of some value.
|
|
|
|
All bytes of the padding chunk, except the chunk byte itself and the length,
|
|
should be zero, but decompressors must not try to interpret or verify the
|
|
padding data in any way.
|
|
|
|
### 4.8. Reserved unskippable chunks (chunk types 0x04-0x3f)
|
|
|
|
These are reserved for future expansion. A decoder that sees such a chunk
|
|
should immediately return an error, as it must assume it cannot decode the
|
|
stream correctly.
|
|
|
|
Future versions of this specification may define meanings for these chunks.
|
|
|
|
### 4.9. Reserved skippable chunks (chunk types 0x40-0x7f)
|
|
|
|
These are also reserved for future expansion, but unlike the chunks
|
|
described in 4.5, a decoder seeing these must skip them and continue
|
|
decoding.
|
|
|
|
Future versions of this specification may define meanings for these chunks.
|
|
|
|
### 4.10. User defined skippable chunks (chunk types 0x80-0xbf)
|
|
|
|
These are allowed for user-defined data. A decoder that does not recognize
|
|
the chunk type should skip it and continue decoding.
|
|
|
|
Users should use additional checks to ensure that the chunk is of the expected type.
|
|
|
|
Future versions of this specification will not define meanings for these chunks.
|
|
|
|
### 4.11. User defined non-skippable chunks (chunk types 0xc0-0xfd)
|
|
|
|
These are allowed for user-defined data. If users do not recognize these chunks,
|
|
decoding should stop.
|
|
|
|
Users should use additional checks to ensure that the chunk is of the expected type.
|
|
|
|
Future versions of this specification will not define meanings for these chunks.
|
|
|
|
|
|
### 4.12 Index (chunk type 0x40) — OPTIONAL
|
|
|
|
Each block is structured as a skippable block, with the chunk ID 0x40.
|
|
|
|
Decoders are free to skip this block.
|
|
|
|
The block can be read from the front, but contains information
|
|
so it can easily be read from the back as well.
|
|
|
|
Numbers are stored as fixed size little endian values or [zigzag encoded](https://developers.google.com/protocol-buffers/docs/encoding#signed_integers) [base 128 varints](https://developers.google.com/protocol-buffers/docs/encoding),
|
|
with un-encoded value length of 64 bits, unless other limits are specified.
|
|
|
|
| Content | Format |
|
|
|--------------------------------------|-------------------------------------------------------------------------------------------------------------------------------|
|
|
| ID, `[1]byte` | Always 0x40. |
|
|
| Data Length, `[3]byte` | 3 byte little-endian length of the chunk in bytes, following this. |
|
|
| Header `[6]byte` | Header, must be `[115, 50, 105, 100, 120, 0]` or in text: "s2idx\x00". |
|
|
| UncompressedSize, Varint | Total Uncompressed size. |
|
|
| CompressedSize, Varint | Total Compressed size if known. Should be -1 if unknown. |
|
|
| EstBlockSize, Varint | Block Size, used for guessing uncompressed offsets. Must be >= 0. |
|
|
| Entries, Varint | Number of Entries in index, must be < 65536 and >=0. |
|
|
| HasUncompressedOffsets `byte` | 0 if no uncompressed offsets are present, 1 if present. Other values are invalid. |
|
|
| UncompressedOffsets, [Entries]VarInt | Uncompressed offsets. See below how to decode. |
|
|
| CompressedOffsets, [Entries]VarInt | Compressed offsets. See below how to decode. |
|
|
| Block Size, `[4]byte` | Little Endian total encoded size (including header and trailer). Can be used for searching backwards to start of block. |
|
|
| Trailer `[6]byte` | Trailer, must be `[0, 120, 100, 105, 50, 115]` or in text: "\x00xdi2s". Can be used for identifying block from end of stream. |
|
|
|
|
For regular streams the uncompressed offsets are fully predictable,
|
|
so `HasUncompressedOffsets` allows to specify that compressed blocks all have
|
|
exactly `EstBlockSize` bytes of uncompressed content.
|
|
|
|
Entries *must* be in order, starting with the lowest offset,
|
|
and there *must* be no uncompressed offset duplicates.
|
|
Entries *may* point to the start of a skippable block,
|
|
but it is then not allowed to also have an entry for the next block since
|
|
that would give an uncompressed offset duplicate.
|
|
|
|
There is no requirement for all blocks to be represented in the index.
|
|
In fact there is a maximum of 65535 block entries in an index.
|
|
|
|
The writer can use any method to reduce the number of entries.
|
|
An implicit block start at 0,0 can be assumed.
|
|
|
|
### Decoding entries:
|
|
|
|
```
|
|
// Read Uncompressed entries.
|
|
// Each assumes EstBlockSize delta from previous.
|
|
for each entry {
|
|
uOff = 0
|
|
if HasUncompressedOffsets == 1 {
|
|
uOff = ReadVarInt // Read value from stream
|
|
}
|
|
|
|
// Except for the first entry, use previous values.
|
|
if entryNum == 0 {
|
|
entry[entryNum].UncompressedOffset = uOff
|
|
continue
|
|
}
|
|
|
|
// Uncompressed uses previous offset and adds EstBlockSize
|
|
entry[entryNum].UncompressedOffset = entry[entryNum-1].UncompressedOffset + EstBlockSize + uOff
|
|
entryNum++
|
|
}
|
|
|
|
|
|
// Guess that the first block will be 50% of uncompressed size.
|
|
// Integer truncating division must be used.
|
|
CompressGuess := EstBlockSize / 2
|
|
|
|
// Read Compressed entries.
|
|
// Each assumes CompressGuess delta from previous.
|
|
// CompressGuess is adjusted for each value.
|
|
for each entry {
|
|
cOff = ReadVarInt // Read value from stream
|
|
|
|
// Except for the first entry, use previous values.
|
|
if entryNum == 0 {
|
|
entry[entryNum].CompressedOffset = cOff
|
|
continue
|
|
}
|
|
|
|
// Compressed uses previous and our estimate.
|
|
entry[entryNum].CompressedOffset = entry[entryNum-1].CompressedOffset + CompressGuess + cOff
|
|
|
|
// Adjust compressed offset for next loop, integer truncating division must be used.
|
|
CompressGuess += cOff/2
|
|
|
|
entryNum++
|
|
}
|
|
```
|
|
|
|
To decode from any given uncompressed offset `(wantOffset)`:
|
|
|
|
* Iterate entries until `entry[n].UncompressedOffset > wantOffset`.
|
|
* Start decoding from `entry[n-1].CompressedOffset`.
|
|
* Discard `entry[n-1].UncompressedOffset - wantOffset` bytes from the decoded stream.
|
|
|
|
This is similar to S2, except the ID is 0x40 instead of 0x99.
|
|
|
|
# Implementation Notes
|
|
|
|
This section contains guidelines for implementation and use.
|
|
None of these are strict requirements.
|
|
|
|
MinLZ is designed for a certain speed/size tradeoff.
|
|
|
|
It is designed to be used in scenarios where encoding and decoding speed is critical.
|
|
|
|
Unless decompression speed is critical, it is not designed for long-term storage,
|
|
and formats like [zstandard](https://facebook.github.io/zstd/) should be considered for better compression.
|
|
Formats like xz/bzip2 also offer excellent compression, but these offer even more reduced decompression speed.
|
|
|
|
## No entropy nor dynamic encoding
|
|
|
|
MinLZ by design only offers static encoding types and no entropy coding of remainder literals,
|
|
making decoding possible with no tables.
|
|
|
|
While it was considered, all conditional decoding (decoding based on previous operation or output position)
|
|
was avoided to simplify the decoder.
|
|
|
|
## Independent block streams
|
|
|
|
A primary design choice is to make blocks on streams fully independent to facilitate
|
|
independent compression and decompression.
|
|
|
|
This will make streams seekable, and even without an index, streams can be
|
|
skipped forwards without decompression, and blocks can be decoded concurrently.
|
|
|
|
The maximum block size of 8MB is designed to minimize the size impact of this.
|
|
|
|
## Speed Optimizations
|
|
|
|
This section describes tricks that can help achieve maximum speed.
|
|
|
|
Decompression has been designed to make use of modern CPU branch prediction.
|
|
|
|
### Margin-specific code
|
|
|
|
For decompression it can be beneficial to have 2 parts of the decompressors:
|
|
A primary decoding loop, that runs while there is some input and output margin and another
|
|
that deals when at the end of the input/output, which has stricter checks.
|
|
|
|
### No-overlap Encodings
|
|
|
|
Minimum offsets to certain encodings mainly exist to avoid overlapping copies and make a 64 byte/loop
|
|
copy safe to do.
|
|
|
|
Copy2/Copy3 operations guarantee there are no copies with an offset less than 64 bytes.
|
|
This means that these can use a bigger copy loop without the need to worry about these.
|
|
|
|
This effectively moves this branch from the encoder to the decoder.
|
|
|
|
### Safe Fused Literals
|
|
|
|
Fused literals can always be safely copied with a 4 byte copy,
|
|
since there will always "follow" a match with at least 4 bytes that will "fix up" any extra literals.
|
|
|
|
Therefore, fused copies are typically faster than a separate literal + copy operation.
|
|
|
|
### Encoding Tips
|
|
|
|
* Prefer fused literals, even when encoded size is equal.
|
|
* Avoid 1 and 2 byte repeats. The encoding exists mostly for flexibility.
|
|
* Prefer Copy2 over Copy1 when encoded size is the same.
|