# Compression
This section describes the compression algorithms supported by YTsaurus.
By default, user data in YTsaurus is stored and transmitted in compressed form.
The system performs compression as follows:
- When writing to a table or file, the data is split into parts called chunks.
- Each chunk is compressed using the algorithm specified for that file or table and written to disk.
- For network transmission, data is read from disk already in compressed form.
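For example, the codec can be set right away when an object is created. A minimal sketch (the path and the `zstd_5` codec are illustrative choices, not defaults):

```bash
# Create a table whose chunks will be compressed with zstd level 5 (illustrative choice)
yt create table //path/to/table --attributes '{compression_codec="zstd_5"}'
```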
## Viewing a compression algorithm
Chunks, files, and tables can be compressed in the YTsaurus system.
The compression algorithm is specified in the `compression_codec` attribute:
- For chunks, the `compression_codec` attribute specifies the compression algorithm for all chunk blocks. Different chunks within a file or table can be compressed using different algorithms.
- For tables and files, `compression_codec` defines the default compression algorithm. It is used if no algorithm is specified for individual chunks of the file or table. The default `compression_codec` value is `lz4` for tables and `none` (no compression) for files.
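To check which codec an existing table uses, the attribute can be read directly (a minimal sketch; the path is illustrative):

```bash
# Prints the current codec, e.g. "lz4" for a table left at the default
yt get //path/to/table/@compression_codec
```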
The degree of data compression is reflected in the `compressed_data_size` and `uncompressed_data_size` attributes of a chunk, table, or file:
- For chunks, the `uncompressed_data_size` attribute shows the size of the chunk before compression, and `compressed_data_size` shows its size after compression.
- For tables and files, `uncompressed_data_size` shows the total uncompressed size of all of the object's chunks, and `compressed_data_size` shows the total compressed size.
Statistics on the compression algorithms used in the table chunks are contained in the `compression_statistics` attribute of the table.
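These attributes can be read the same way; for example, the overall compression ratio of a table can be estimated by dividing `compressed_data_size` by `uncompressed_data_size` (a sketch; the path is illustrative):

```bash
yt get //path/to/table/@uncompressed_data_size   # total size of all chunks before compression
yt get //path/to/table/@compressed_data_size     # total size of all chunks after compression
yt get //path/to/table/@compression_statistics   # per-codec statistics for the table's chunks
```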
## Changing a compression algorithm
To change the compression algorithm of an existing static table, set a new value for the `compression_codec` attribute, then run the `merge` command to recompress the chunks.
CLI

```bash
yt set //path/to/table/@compression_codec zstd_3
yt merge --src //path/to/table --dst //path/to/table --spec '{force_transform = %true}'
```
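If the table is sorted and you want the result to stay sorted, the merge can be run in sorted mode; a sketch under that assumption (check the merge documentation for the exact behavior in your version):

```bash
yt merge --mode sorted --src //path/to/table --dst //path/to/table --spec '{force_transform = %true}'
```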
To change the compression algorithm of a dynamic table, set a new value for the `compression_codec` attribute, then run the `remount-table` command. The old chunks will eventually be recompressed during background compaction. How quickly this happens depends on the table size, the write rate, and the background compaction settings.
CLI

```bash
yt set //path/to/table/@compression_codec zstd_3
yt remount-table //path/to/table
```
If the table is constantly being written to, there is usually no need to force recompression. If you do need to force it, use forced compaction.
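As a rough sketch of that pattern (the `forced_compaction_revision` attribute and the value used here should be verified against the forced compaction documentation; the path is illustrative):

```bash
# Mark all chunks of the table for recompaction, then remount to apply the setting
yt set //path/to/table/@forced_compaction_revision 1
yt remount-table //path/to/table
```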
## Supported compression algorithms
| Compression algorithm | Description | Compression/decompression speed | Compression ratio |
|---|---|---|---|
| `none` | No compression. | - | - |
| `snappy` | Learn more about the snappy algorithm. | +++ | +- |
| `zlib_[1..9]` | YTsaurus supports all 9 levels. The higher the level, the stronger and slower the compression. Learn more about the zlib algorithm. | ++ | ++ |
| `lz4` | The default compression algorithm for tables. Learn more about the lz4 algorithm. | +++ | + |
| `lz4_high_compression` | The lz4 algorithm with the high_compression option enabled. Compresses better, but is significantly slower. Its compression ratio is still lower than that of `zlib`. | ++ | ++- |
| `zstd_[1..21]` | Learn more about the zstd algorithm. | ++ | ++ |
| `brotli_[1..11]` | Recommended for data that is not temporary. We recommend levels 3, 5, and 8, because they have the best size-to-speed ratio. Learn more about the brotli algorithm. | ++ | +++ |
| `lzma_[0..9]` | Learn more about the lzma algorithm. | + | +++ |
| `bzip2_[1..9]` | Learn more about the bzip2 algorithm. | ++ | ++ |
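The level is encoded directly in the codec name written to `compression_codec`. For example, to select a specific brotli level for a table (the path and level are illustrative):

```bash
yt set //path/to/table/@compression_codec brotli_8
```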
## Best practices
- Compression is usually not applied to files that are used as symlinks to chunks in an operation.
- `lz4` is often used for actual data: it provides high compression and decompression speeds at an acceptable compression ratio.
- When maximum compression is needed and a long runtime is acceptable, `brotli_8` is often used.
- For operations consisting of a small number of jobs, for example a final sort or sorted merge, we recommend adding a separate processing stage: recompressing the data in a merge operation with a large number of jobs (see the sketch after this list).
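A minimal sketch of such a recompression stage, assuming an illustrative output path and codec; `data_size_per_job` is shown only as one way to influence the number of jobs and should be tuned for your data:

```bash
# Recompress the operation's output with a stronger codec in a separate merge stage
yt set //path/to/output_table/@compression_codec brotli_8
yt merge --src //path/to/output_table --dst //path/to/output_table \
    --spec '{force_transform = %true; data_size_per_job = 2147483648}'
```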
Attention
While strong compression algorithms (`zlib`, `zstd`, `brotli`) save disk space, they compress data significantly slower than the default algorithm (`lz4`). Using strongly compressing algorithms can lead to a significant increase in operation execution time. We recommend using them only for tables that take up a lot of space but rarely change.
## Comparing compression algorithms
The following method applies only to static tables.
To determine which algorithm is best for a particular table, run `yt run-compression-benchmarks TABLE`. A sample (1 GB by default) is taken from the table and compressed with each algorithm.
After the operation is complete, you will see `codec/cpu/encode`, `codec/cpu/decode`, and `compression_ratio` for each algorithm. For more information, see Job stats.

For algorithms with multiple compression levels, the minimum, medium, and maximum levels are used by default. To obtain results for all levels, use the `--all-codecs` option.
CLI

```bash
yt run-compression-benchmarks //home/dev/tutorial/compression_benchmark_data
```

```
[
    {
        "codec": "brotli_1",
        "codec/cpu/decode": 2103,
        "codec/cpu/encode": 2123,
        "compression_ratio": 0.10099302059669339
    },
    ...
    {
        "codec": "brotli_11",
        "codec/cpu/decode": "Not launched",
        "codec/cpu/encode": "Timed out",  # compression did not finish within the default --time-limit-sec (200)
        "compression_ratio": "Timed out"
    },
    ...
    {
        "codec": "none",
        "codec/cpu/decode": 0,
        "codec/cpu/encode": 247,
        "compression_ratio": 1.0
    },
    ...
    {
        "codec": "zstd_11",
        "codec/cpu/decode": 713,
        "codec/cpu/encode": 15283,
        "compression_ratio": 0.07451278201257201
    },
    ...
]
```
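To benchmark every level of every codec and to give slow codecs more time than the default limit, the same command can be combined with the options mentioned above (the 600-second limit is an illustrative value):

```bash
yt run-compression-benchmarks //home/dev/tutorial/compression_benchmark_data --all-codecs --time-limit-sec 600
```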