kusano 7d535a
LZ4 Block Format Description
kusano 7d535a
============================
kusano 7d535a
Last revised: 2015-05-07.
kusano 7d535a
Author : Yann Collet
kusano 7d535a
kusano 7d535a
kusano 7d535a
This specification is intended for developers
kusano 7d535a
willing to produce LZ4-compatible compressed data blocks
kusano 7d535a
using any programming language.
kusano 7d535a
kusano 7d535a
LZ4 is an LZ77-type compressor with a fixed, byte-oriented encoding.
kusano 7d535a
There is no entropy encoder back-end nor framing layer.
kusano 7d535a
The latter is assumed to be handled by other parts of the system (see [LZ4 Frame format]).
kusano 7d535a
This design is assumed to favor simplicity and speed.
kusano 7d535a
It helps later on for optimizations, compactness, and features.
kusano 7d535a
kusano 7d535a
This document describes only the block format,
kusano 7d535a
not how the compressor nor decompressor actually work.
kusano 7d535a
The correctness of the decompressor should not depend
kusano 7d535a
on implementation details of the compressor, and vice versa.
kusano 7d535a
kusano 7d535a
[LZ4 Frame format]: LZ4_Frame_format.md
kusano 7d535a
kusano 7d535a
kusano 7d535a
kusano 7d535a
Compressed block format
kusano 7d535a
-----------------------
kusano 7d535a
An LZ4 compressed block is composed of sequences.
kusano 7d535a
A sequence is a suite of literals (not-compressed bytes),
kusano 7d535a
followed by a match copy.
kusano 7d535a
kusano 7d535a
Each sequence starts with a token.
kusano 7d535a
The token is a one byte value, separated into two 4-bits fields.
kusano 7d535a
Therefore each field ranges from 0 to 15.
kusano 7d535a
kusano 7d535a
kusano 7d535a
The first field uses the 4 high-bits of the token.
kusano 7d535a
It provides the length of literals to follow.
kusano 7d535a
kusano 7d535a
If the field value is 0, then there is no literal.
kusano 7d535a
If it is 15, then we need to add some more bytes to indicate the full length.
kusano 7d535a
Each additional byte then represent a value from 0 to 255,
kusano 7d535a
which is added to the previous value to produce a total length.
kusano 7d535a
When the byte value is 255, another byte is output.
kusano 7d535a
There can be any number of bytes following the token. There is no "size limit".
kusano 7d535a
(Side note : this is why a not-compressible input block is expanded by 0.4%).
kusano 7d535a
kusano 7d535a
Example 1 : A length of 48 will be represented as :
kusano 7d535a
kusano 7d535a
  - 15 : value for the 4-bits High field
kusano 7d535a
  - 33 : (=48-15) remaining length to reach 48
kusano 7d535a
kusano 7d535a
Example 2 : A length of 280 will be represented as :
kusano 7d535a
kusano 7d535a
  - 15  : value for the 4-bits High field
kusano 7d535a
  - 255 : following byte is maxed, since 280-15 >= 255
kusano 7d535a
  - 10  : (=280 - 15 - 255) ) remaining length to reach 280
kusano 7d535a
kusano 7d535a
Example 3 : A length of 15 will be represented as :
kusano 7d535a
kusano 7d535a
  - 15 : value for the 4-bits High field
kusano 7d535a
  - 0  : (=15-15) yes, the zero must be output
kusano 7d535a
kusano 7d535a
Following the token and optional length bytes, are the literals themselves.
kusano 7d535a
They are exactly as numerous as previously decoded (length of literals).
kusano 7d535a
It's possible that there are zero literal.
kusano 7d535a
kusano 7d535a
kusano 7d535a
Following the literals is the match copy operation.
kusano 7d535a
kusano 7d535a
It starts by the offset.
kusano 7d535a
This is a 2 bytes value, in little endian format
kusano 7d535a
(the 1st byte is the "low" byte, the 2nd one is the "high" byte).
kusano 7d535a
kusano 7d535a
The offset represents the position of the match to be copied from.
kusano 7d535a
1 means "current position - 1 byte".
kusano 7d535a
The maximum offset value is 65535, 65536 cannot be coded.
kusano 7d535a
Note that 0 is an invalid value, not used. 
kusano 7d535a
kusano 7d535a
Then we need to extract the match length.
kusano 7d535a
For this, we use the second token field, the low 4-bits.
kusano 7d535a
Value, obviously, ranges from 0 to 15.
kusano 7d535a
However here, 0 means that the copy operation will be minimal.
kusano 7d535a
The minimum length of a match, called minmatch, is 4. 
kusano 7d535a
As a consequence, a 0 value means 4 bytes, and a value of 15 means 19+ bytes.
kusano 7d535a
Similar to literal length, on reaching the highest possible value (15), 
kusano 7d535a
we output additional bytes, one at a time, with values ranging from 0 to 255.
kusano 7d535a
They are added to total to provide the final match length.
kusano 7d535a
A 255 value means there is another byte to read and add.
kusano 7d535a
There is no limit to the number of optional bytes that can be output this way.
kusano 7d535a
(This points towards a maximum achievable compression ratio of about 250).
kusano 7d535a
kusano 7d535a
With the offset and the matchlength,
kusano 7d535a
the decoder can now proceed to copy the data from the already decoded buffer.
kusano 7d535a
On decoding the matchlength, we reach the end of the compressed sequence,
kusano 7d535a
and therefore start another one.
kusano 7d535a
kusano 7d535a
kusano 7d535a
Parsing restrictions
kusano 7d535a
-----------------------
kusano 7d535a
There are specific parsing rules to respect in order to remain compatible
kusano 7d535a
with assumptions made by the decoder :
kusano 7d535a
kusano 7d535a
1. The last 5 bytes are always literals
kusano 7d535a
2. The last match must start at least 12 bytes before end of block.   
kusano 7d535a
   Consequently, a block with less than 13 bytes cannot be compressed.
kusano 7d535a
kusano 7d535a
These rules are in place to ensure that the decoder
kusano 7d535a
will never read beyond the input buffer, nor write beyond the output buffer.
kusano 7d535a
kusano 7d535a
Note that the last sequence is also incomplete,
kusano 7d535a
and stops right after literals.
kusano 7d535a
kusano 7d535a
kusano 7d535a
Additional notes
kusano 7d535a
-----------------------
kusano 7d535a
There is no assumption nor limits to the way the compressor
kusano 7d535a
searches and selects matches within the source data block.
kusano 7d535a
It could be a fast scan, a multi-probe, a full search using BST,
kusano 7d535a
standard hash chains or MMC, well whatever.
kusano 7d535a
kusano 7d535a
Advanced parsing strategies can also be implemented, such as lazy match,
kusano 7d535a
or full optimal parsing.
kusano 7d535a
kusano 7d535a
All these trade-off offer distinctive speed/memory/compression advantages.
kusano 7d535a
Whatever the method used by the compressor, its result will be decodable
kusano 7d535a
by any LZ4 decoder if it follows the format specification described above.