Background compaction

A tablet of a sorted dynamic table is an LSM tree. When a row is written, it first enters the dynamic store. As the dynamic store fills with data, it is flushed to disk as chunks. A background compaction process periodically combines several chunks into one. Compaction is needed to:

  • Apply delete tombstones and physically delete rows.
  • Delete old versions.
  • Reduce the overlapping store count: the number of chunks in which a given key may need to be searched.

Compaction relies mainly on heuristics. However, depending on the table write scenario, the optimal approaches may differ. For example:

  • If a small max_data_ttl is set on the table, it makes sense to periodically compact the old chunks and free up space.
  • If rows are regularly deleted, compaction should consider chunks containing both the writes and the corresponding delete tombstones together so that the rows can be physically deleted.
  • If the table write stream is large, you can sacrifice read performance and loosen the overlapping store count settings in favor of reducing write amplification (the ratio of the amount of data processed by compaction to the amount of data written), and vice versa.

Glossary

  • Dynamic store: A structure for storing freshly written rows; it is located in RAM. An analog of MemTable.
  • Chunk: An immutable structure for storing rows flushed to disk. An analog of SSTable.
  • Store: A common name for chunks and dynamic stores. Technically speaking, a tablet holds chunk stores (and dynamic stores) rather than chunks, but the terms chunk and chunk store are usually interchangeable.
  • Partition: A part of a tablet. Similarly to tables that are split into tablets bounded by pivot keys, tablets are split into partitions. Chunks within a partition do not cross its boundaries.
  • Eden: A special partition containing chunks that cannot be placed in any other partition because they cross partition boundaries. Chunks typically do not live long in Eden, quickly undergoing partitioning.
  • Overlapping store count, OSC: The maximum overlap of chunks, i.e., the maximum number of chunks covering a particular key. It bounds the read fan-in from above: the number of chunks that actually have to be read to get the current value for a key.
  • Flush: The process of flushing data from a dynamic store to a chunk on disk.
  • Compaction: The process of merging chunks that involves discarding old versions of data, applying delete tombstones, and combining small chunks.
  • Partitioning: A process similar to compaction, except that its purpose is not to combine chunks, but to split chunks located in Eden across partitions.
  • Background processes: A collective term for flush, compaction, and partitioning.
  • Write amplification, WA: The ratio of the amount of data handled by background processes (compaction/partitioning) to the amount of data written to the table. It is an important indicator of how effective the selected parameters are.
  • Row version: In the MVCC model, a single key can be associated with multiple values, each with its own timestamp. An individual value is called a version. Strictly speaking, writes to each column can be versioned independently, so it would be more precise to speak of versions of values rather than of a row as a whole; for configuring compaction, this distinction is usually irrelevant.

List of attributes

All values of the Duration type are specified in milliseconds. All values indicating an amount of data are specified in bytes. Attributes without a description have complex semantics, and we do not recommend changing them.

Warning

Many attributes influence the behavior of background compaction processes. Careless setting of attributes can create an unexpected load on a bundle or cluster.

Please use only those attributes whose meaning you understand.

Note

Setting an attribute on a mounted table does not apply it. To apply the settings, use the remount-table command:

CLI

yt set //path/to/table/@auto_compaction_period 86400000
yt remount-table //path/to/table

You can find out the tablet's current settings at //sys/tablets/x-x-x-x/orchid/config.
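
For example, you can read these settings with the standard yt get command (substitute the actual tablet ID for x-x-x-x):

CLI

yt get //sys/tablets/x-x-x-x/orchid/config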

Flush

The following attributes regulate flush behavior.

| Name | Type | Default | Description |
|------|------|---------|-------------|
| dynamic_store_auto_flush_period | Duration* | 900,000 (15 min) | Frequency of forced flushes, when the dynamic store is flushed to disk right away, even if it has not yet reached its overflow threshold. |
| dynamic_store_flush_period_splay | Duration | 60,000 (1 min) | Random shift of the period to avoid synchronization of different tablets. The actual flush happens after period + random(0, splay). |
| merge_rows_on_flush | bool | false | Allows merging versions and deleting rows by TTL at flush. |
| merge_deletions_on_flush | bool | false | Allows merging consecutive deletions into one at flush. |
| max_dynamic_store_row_count | int | 1,000,000 | Maximum number of rows in the dynamic store. |
| max_dynamic_store_pool_size | int | 1,073,741,824 (1 GB) | Maximum dynamic store size. |
| dynamic_store_overflow_threshold | double | 0.7 | The fill fraction of the dynamic store, relative to max_dynamic_store_row_count and max_dynamic_store_pool_size, at which a flush starts. |

Compaction

Basic options

| Name | Type | Default | Description |
|------|------|---------|-------------|
| auto_compaction_period | int* | - | Frequency of periodic compaction in ms. Compaction will affect every table chunk at least once per auto_compaction_period. |
| auto_compaction_period_splay_ratio | double | 0.3 | Random shift of the period to avoid synchronization. Compaction will happen after period * (1 + random(0, splay_ratio)). |
| periodic_compaction_mode | store, partition | store | For more information, see Periodic compaction. |
| forced_compaction_revision | - | - | For more information, see Forced compaction. |
| max_overlapping_store_count | int | 30 | Maximum number of chunks that potentially contain a key. When the threshold is reached, writes to the table are blocked until compaction optimizes the structure. |
| critical_overlapping_store_count | int* | - | Sets a threshold beyond which compaction ignores the size limits for chunks within a single partition. By setting a small value (5–10), you can reduce the OSC, and thus speed up data access, at the cost of a noticeable increase in write amplification. |
| enable_compaction_and_partitioning | bool | true | Setting this to false completely disables compaction and partitioning. Writing to the table in this state will quickly exceed the overlapping store count limit. If no writes are intended, mount the table in a frozen state instead of using this attribute (see the example after this table). |
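
For example, a minimal sketch of keeping the table frozen instead of disabling compaction, assuming the yt CLI's freeze-table command (an unmounted table can also be mounted with the freeze option of mount-table):

CLI

yt freeze-table //path/to/table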

Sizes and constants

| Name | Type | Default | Description |
|------|------|---------|-------------|
| min_partition_data_size | int | 96 MB | Minimum, desired, and maximum partition size. |
| desired_partition_data_size | int | 256 MB | |
| max_partition_data_size | int | 320 MB | |
| min_partitioning_data_size | int | 64 MB | Minimum and maximum data size for a single partitioning. Increasing it reduces write amplification at the cost of more chunks in Eden and therefore a higher overlapping store count. |
| max_partitioning_data_size | int | 1 GB | |
| min_partitioning_store_count | int | 1 | Minimum and maximum number of chunks for a single partitioning. |
| max_partitioning_store_count | int | 5 | |
| min_compaction_store_count | int | 3 | Minimum and maximum number of chunks for a single compaction. Periodic and forced compactions ignore the lower bound but respect the upper one. |
| max_compaction_store_count | int | 5 | |
| compaction_data_size_base | int | 16 MB | |
| compaction_data_size_ratio | double | 2.0 | |

Deleting old data

For more information about these attributes, see Deleting old data.

| Name | Type | Default |
|------|------|---------|
| min_data_ttl | int | 1,800,000 (30 min) |
| max_data_ttl | int | 1,800,000 (30 min) |
| min_data_versions | int | 1 |
| max_data_versions | int | 1 |

Forced compaction

To trigger the forced compaction of all table chunks, set the forced_compaction_revision attribute value to 1.

If you execute remount-table, the setting is applied to all table tablets at once. For tables of a terabyte or more, this can create a load surge both on the table's bundle and on the entire cluster. We therefore recommend remounting different tablets at different times. To do this, run the following command: yt remount-table --first-tablet-index X --last-tablet-index Y. To estimate a reasonable total time for remounting the table, use the formula table_size / bundle_node_count / (100 MB/s).
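
For example, a possible staged remount for a table with 100 tablets, 10 tablets at a time (the index ranges are illustrative):

CLI

yt set //path/to/table/@forced_compaction_revision 1
yt remount-table //path/to/table --first-tablet-index 0 --last-tablet-index 9
# wait for these tablets to finish compacting, then proceed with the next range
yt remount-table //path/to/table --first-tablet-index 10 --last-tablet-index 19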

To abort forced compaction, delete the forced_compaction_revision attribute from the table and execute remount-table.
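
For example, using the standard attribute removal command:

CLI

yt remove //path/to/table/@forced_compaction_revision
yt remount-table //path/to/table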

Note

If the table size is more than 20 TB or 100,000 chunks, get administrator permission before starting forced compaction.

Periodic compaction

To ensure that all table chunks are periodically compacted, regardless of whether the table is being written to or not, use the auto_compaction_period attribute. It has two modes regulated by the periodic_compaction_mode attribute:

  • store (default value): The decision to compact each chunk created earlier than now - auto_compaction_period is made independently.
  • partition: If the partition contains at least one chunk created earlier than now - auto_compaction_period, all partition chunks are compacted at the same time.

To avoid synchronization and even out the load, a random shift is added to auto_compaction_period when calculating the compaction time of each specific chunk. This shift is defined by the auto_compaction_period_splay_ratio attribute.

Setting only the periodic_compaction_mode attribute is not enough to enable periodic compaction — you need to explicitly set the auto_compaction_period attribute.
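
For example, to enable daily periodic compaction in partition mode (the one-day period is illustrative):

CLI

yt set //path/to/table/@periodic_compaction_mode partition
yt set //path/to/table/@auto_compaction_period 86400000
yt remount-table //path/to/table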

Selecting a mode

  • You need to delete data by TTL: store mode (default).
  • You need to clear the rows deleted via delete-rows: partition mode.

Partition mode is better suited for clearing rows deleted via delete-rows because, in store mode, a write to a key and the corresponding delete tombstone can end up in different chunks that are compacted independently one after another, so the tombstone may never be removed.

For deleting data by TTL, store mode is sufficient: it considers chunks independently, and when the oldest chunk is compacted, the obsolete data is deleted.

Selecting a period

The longer the period, the lower the load on the bundle. Each bundle node can compact at most about 100–200 MB/s. The load from periodic compaction usually does not exceed a few MB/s. The period is often comparable to max_data_ttl or several times smaller.

For example, consider a 500 GB table, a compaction period of one day, and two nodes in the bundle. The load per node is then 500 GB / 86,400 s / 2 ≈ 3.1 MB/s, which is acceptable.

Warning

Setting auto_compaction_period for the first time on a table with many old chunks can cause all of them to be compacted at the same time. In this case, follow the same recommendations as for forced compaction.

Scenarios

Deleting data by TTL

Scenario: A TTL is set on the table, but its size keeps growing as if no TTL were applied.

Solution: Use periodic compaction.

Deleting short-lived keys

Scenario: Rows are written and then deleted after a while. Delete tombstones must not pile up, and the space must be reclaimed.

Solution: Use periodic compaction in partition mode.

Many writes by one key

Scenario: A single key may receive more than a few thousand writes.

Problem: The system limits the number of versions per key to about 30,000. If this limit is exceeded, background processes fail with the "Too many write timestamps in a versioned row" error.

Solution: Set the merge_rows_on_flush attribute to true and reduce the TTL via the min_data_ttl attribute so that the number of versions within the TTL is no more than a few thousand. If many deletions are made per key, also use merge_deletions_on_flush.
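
For example, a possible configuration (the one-minute min_data_ttl is illustrative; choose it so that no more than a few thousand versions accumulate within the TTL):

CLI

yt set //path/to/table/@merge_rows_on_flush %true
yt set //path/to/table/@min_data_ttl 60000
yt remount-table //path/to/table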

Tablets and background processes under the hood

This section details the structure of tablets. It is not required reading, but it may be helpful to those who want to tune some of the settings on their own.

Structure

A tablet is a part of the table that is responsible for the data between two pivot keys. Typically, tablet size ranges from 100 MB (for in-memory tables) to a few gigabytes, sometimes reaching tens of gigabytes for especially large tables. From the master server's point of view, a tablet is a set of chunks. All operations described below occur on the tablet node directly serving the tablet.

Similarly to tables that are split into tablets bounded by pivot keys, tablets are split into partitions. The partition size is about 200 MB (compressed size). In addition, there is a special partition, Eden. Its boundaries coincide with those of the tablet. Each chunk belongs either to Eden or to one of the partitions. If a chunk falls entirely between the boundary keys of a particular partition, it belongs to it. Otherwise it belongs to Eden.

Dynamic stores and flushing data to the disk

There is always at least one dynamic store in the tablet: the active one. When writing, data first goes into the active dynamic store. When the dynamic store becomes too large (hundreds of megabytes) or the node runs out of tablet dynamic memory, the store is rotated: it becomes passive and is replaced by a new active store. This triggers the flush process, and the passive store is eventually flushed to disk.

If flush does not work for some reason, data accumulates in memory, and after an overflow, writes start failing with the "Node is out of tablet memory, all writes disabled" error.

By default, all row versions are saved to the chunk during flush. To apply TTL cleanup already at flush, use the merge_rows_on_flush attribute. This makes sense when the TTL is less than the typical dynamic store lifetime (about 15 minutes) and there are many writes per key.

Partitioning

A chunk flushed to disk ends up in Eden. It usually contains keys from the entire tablet range and cannot be attributed to any particular partition. The partitioning process takes one or more chunks from Eden, merges them, splits the data along partition boundaries, and places one chunk into each partition. If there are many small chunks in Eden, compaction runs first to merge them.

Small chunks make the system less efficient. The first reason is overhead costs; the second is write amplification. Suppose a node serves several tablets of a large table. Since there are many tablets and the memory is shared, the dynamic store size in each tablet will be tens of megabytes, which is rather small. If there are 100 partitions in a tablet, the chunk size in each of them will be less than one megabyte after partitioning. If you increase the minimum Eden size at which partitioning starts, larger chunks will arrive in the partitions at the cost of an increased overlapping store count.
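
For example, to raise the minimum amount of Eden data at which partitioning starts from the default 64 MB to 128 MB (the value is illustrative; remember that this also increases the overlapping store count):

CLI

yt set //path/to/table/@min_partitioning_data_size 134217728
yt remount-table //path/to/table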

Compaction

The compaction process reads a batch of chunks in a single partition and merges them into one (or several, in rare cases), physically deleting rows and old versions. Compaction can be started for several reasons:

  • forced: The forced_compaction_revision attribute is set on the table. In this case, all tablet chunks will be compacted and the batch size will only be limited by the max_compaction_store_count value.
  • periodic: The auto_compaction_period attribute is set on the table. Chunks created earlier than now - auto_compaction_period will be compacted.
  • regular: The regular mode, which starts without any external trigger.

Picking chunks based on size

In regular mode, the system selects chunks with write amplification in mind. If a partition has a 100 MB chunk and a 1 MB chunk periodically appears, compacting them together every time would give roughly 100x amplification. The following rules are used:

  • There must be from min_compaction_store_count to max_compaction_store_count chunks in a batch, but the more the better.
  • Chunks must be sorted by size. Each successive chunk must be no more than compaction_data_size_ratio times larger than the total size of the preceding chunks.
  • The previous rule does not apply as long as the total size of the chunks is less than compaction_data_size_base.

For example, ratio = 2, base = 16 MB. Then:

  • A set of 1 KB, 1 MB, and 10 MB chunks is allowed: the total size does not exceed 16 MB.
  • A set of 10 MB, 20 MB, 50 MB, 150 MB chunks is allowed: 50 < 2 × (10 + 20), 150 < 2 × (10 + 20 + 50).
  • A set of 1 MB, 10 MB, 100 MB chunks is not allowed: 100 > 2 × (10 + 1).

Reducing the base and ratio values improves the WA at the cost of increasing the OSC. In the limiting case where base = 1 and no data is deleted, it can be shown that write amplification depends logarithmically on the amount of data: each time a row participates in compaction, the size of the chunk containing it grows by at least a factor of (1 + 1 / ratio), so the row participates in no more than log(tablet_size, 1 + 1 / ratio) compactions in total. In practice, this estimate is inaccurate, both because of deletions and because a partition splits in two when it reaches the threshold size.
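
For example, a possible way to trade a higher OSC for a lower WA by reducing both constants (the values are illustrative):

CLI

yt set //path/to/table/@compaction_data_size_base 8388608
yt set //path/to/table/@compaction_data_size_ratio 1.5
yt remount-table //path/to/table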

Mechanism for cleaning up old versions

When compaction merges another batch of chunks, it can remove old versions of some rows and apply deletions. However, this may not always be possible, because compaction considers only some of the chunks within a partition, not all of them.

Consider a table with min/max_data_ttl = 0 and min/max_data_versions = 1 (only the most recent version must be stored for each key). Suppose there were two writes to a key, {value = 1; timestamp = 10} and {value = 2; timestamp = 20}, and a deletion with timestamp = 30. Suppose these versions are stored in three different chunks. If compaction considers only the first and third chunks, it processes the following data:

delete tombstone {timestamp = 30}
{value = 1; timestamp = 10}

If you apply the deletion and remove the row completely without writing it to the new chunk, reads become incorrect: subsequent reads will return {value = 2; timestamp = 20}, although the row was deleted.

To avoid this problem, the major timestamp is calculated: the earliest timestamp across all data in those chunks of the given partition and Eden that are not included in the current compaction batch. Compaction then has the right to delete only versions with timestamp < major timestamp.

This logic can lead to obsolete versions not actually being deleted. First, if the tablet or partition receives no writes, the set of chunks within the partition stabilizes, and compaction no longer starts, so it has no way of noticing that some rows have expired. Second, if there is one large chunk in the partition and writes are not very intensive, the newly appearing small chunks will be compacted with one another without touching the large one. The large chunk caps the major timestamp, so even with ttl = 0, redundant versions in fresh chunks will not be deleted. You can deal with this using auto_compaction_period.