Chunks
This section contains information about chunks — parts of a table or file that store data.
General information
Data stored in tables, files, and logs is split into parts that are called chunks. A chunk stores a continuous range of data.
Specifically, each chunk of the table contains a range of its rows.
A chunk is an unmodifiable entity: once a chunk is created, the data in it cannot be changed.
The chunk data is physically stored on the cluster nodes. YTsaurus master servers track the location of chunks and keep them replicated.
YTsaurus master servers store only metadata. However, this metadata takes up a significant amount of memory, so to limit memory consumption on the master servers, the number of chunks available to an account is subject to a quota.
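For example, you can check how many chunks a table consists of and how much of an account's chunk quota is currently used. This is an illustrative sketch: the table path and the account name my_account are hypothetical, and the exact layout of account attributes may differ between cluster versions.
CLI
yt get //tmp/table/@chunk_count
yt get //sys/accounts/my_account/@resource_usage/chunk_count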
Chunk size
Chunks are not monolithic: they consist of blocks, so chunks can be read and written in parts.
By default, static table blocks are 16 megabytes and dynamic table blocks are 256 kilobytes.
You can change the block size via the block_size attribute in the table_writer settings.
A chunk usually occupies hundreds of megabytes or several gigabytes on a cluster node.
The size of the chunks to be created can be configured via the desired_chunk_size attribute in the table_writer section.
We do not recommend making chunks too small: this increases their number, which consumes master server memory. Small chunks also slow down reading, because more requests to the master servers and to the disks are needed to read all the data.
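As an illustrative sketch, both settings can be passed in an operation spec. The exact place of the table_writer section depends on the operation type, and the values below (1 GB chunks, 16 MB blocks) are arbitrary:
CLI
yt merge --mode ordered --src //tmp/table --dst //tmp/table --spec '{force_transform = %true; job_io = {table_writer = {desired_chunk_size = 1073741824; block_size = 16777216}}}'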
Reading data
When the client calls the procedure for reading data from a static table, the following steps are performed on the proxy server:
- Getting a list of chunks that make up the table from the master server.
- Getting a list of cluster nodes where chunk replicas are located from the master server.
- Downloading data from cluster nodes and transmitting it to the client.
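From the client's point of view, all of these steps are hidden behind a single read call. For example, assuming the table is located at //tmp/table:
CLI
yt read-table //tmp/table --format json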
Writing data
Writing is performed in a similar way:
- Splitting the input data stream into parts.
- Forming each part into a chunk.
- Transferring data to cluster nodes and binding it to a table.
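Writing is likewise a single call from the client's point of view. A minimal illustration, assuming the input rows are stored in a local rows.json file:
CLI
cat rows.json | yt write-table //tmp/table --format json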
Chunk format
There are two chunk formats in YTsaurus: row-by-row and column-by-column.
The chunk format is determined by the optimize_for parameter. For the row-by-row format, optimize_for has the lookup value; for the column-by-column format, the scan value.
Row-by-row format: optimize_for=lookup
In row-by-row format, each row is stored entirely in one block.
The row-by-row format is suitable for dynamic tables from which data is selected by key.
Column-by-column format: optimize_for=scan
The column-by-column format does not store information about the type next to each value and uses lightweight column compression techniques.
By default, the data of each column specified in the schema gets into a separate sequence of blocks within a chunk. This speeds up the reading of a small number of columns.
If you need to read one row or a small range of rows by key, you have to make as many disk accesses as there are columns in the table. To minimize disk accesses, you can use the group attribute in the description of columns in the schema, which allows you to store related data in one block. For more information, see the Data schema section.
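As an illustrative sketch (the table path, column names, and the group name default are arbitrary), a schematized table can be created in the column-by-column format with both columns placed in the same group:
CLI
yt create table //tmp/table --attributes '{optimize_for = scan; schema = [{name = key; type = string; sort_order = ascending; group = default}; {name = value; type = string; group = default}]}'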
Note
Using a column-by-column format without a schema makes no sense.
Example of specifying a format
CLI
yt set //tmp/table/@optimize_for scan
Setting the attribute affects only future chunks and does not convert tables (similar to setting the @erasure_codec and @compression_codec attributes).
To convert already written data into a new format, perform the Merge operation:
CLI
yt merge --mode ordered --src //tmp/table --dst //tmp/table --spec '{force_transform = %true}'
For dynamic tables, this is solved by forced compaction. Before using it, make sure to read the article on Background compaction.
Example of running the operation via the CLI:
yt set //tmp/table/@forced_compaction_revision 1
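As a rule, the new forced_compaction_revision value takes effect after the table is remounted, for example:
yt remount-table //tmp/table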
In addition, for columnar dynamic tables, note the group attribute in the column schema. If your use case involves frequent lookups of table rows without column filtering, it is a good idea to set the same group for all columns. For more information about groups, see the documentation page on table schemas.
Chunk owners
Cypress nodes consisting of chunks — tables, files, and logs — are called chunk owners.
A set of chunks belonging to the chunk owners is organized as a tree-like data structure. The leaves in this structure are chunks, and the intermediate nodes are chunk lists (chunk_list).
Besides the attributes inherent to all Cypress nodes, chunk owners have the following additional attributes:
Name | Type | Description | Mandatory |
---|---|---|---|
chunk_list_id | Guid | Root chunk list ID. | Yes |
chunk_ids | array&lt;Guid&gt; | List of IDs of all chunks. | |
chunk_count | int | Number of chunks. | |
compression_statistics | CompressionStatistics | Statistics on data sizes and the types of compression codecs used. | Yes |
erasure_statistics | ErasureStatistics | Statistics on data sizes and the types of erasure codecs used. | Yes |
optimize_for_statistics | OptimizeForStatistics | Statistics on the chunk formats used (optimize_for). | Yes |
multicell_statistics | MulticellStatistics | Statistics on the distribution of chunks across master servers. | Yes |
uncompressed_data_size | int | Total uncompressed data volume of all chunks (not including replication and erasure coding). | Yes |
compressed_data_size | int | Total compressed data volume of all chunks (not including replication and erasure coding). | Yes |
compression_ratio | double | Compression ratio: the ratio of compressed volume to uncompressed volume. | Yes |
update_mode | string | The way data is changed under a transaction: none, append, or overwrite. For nodes outside a transaction, it is none. | Yes |
replication_factor | integer | Replication factor (equals 1 for erasure coding). | Yes |
compression_codec | string | Compression codec name. | Yes |
erasure_codec | string | Erasure codec name. | Yes |
vital | bool | Whether the chunks of this node are vital. | No |
media | Media | Replication factors and other medium settings. | Yes |
primary_medium | string | Primary medium. | Yes |
Data vitality
The YTsaurus system divides chunks into two types: vital and non-vital. By default, data in the system is vital (the vital attribute is true), and its loss is a serious incident. Data loss can occur due to failure of cluster nodes.
Non-vital data is data whose loss is not critical. The YTsaurus cluster is configured so that inaccessibility of non-vital chunks does not trigger monitoring alerts or require intervention from the operations team. Typical examples of non-vital data are intermediate results of operations: the scheduler will notice the loss and recompute the data. For non-vital data, the vital attribute is false.
Note
The system does not consider chunks that are written without erasure coding and have a replication factor of 1 to be vital.
Data may be lost when cluster nodes fail. To protect against this, YTsaurus stores three copies of the data by default.
However, if multiple hosts fail, the system may lose access to all copies. If computations were interrupted because vital data was lost, restart them.
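Both durability-related attributes from the table above can be adjusted per node. A minimal illustration (the table paths are hypothetical): a higher replication factor for especially important data and vital = false for disposable intermediate results.
CLI
yt set //tmp/important_table/@replication_factor 5
yt set //tmp/tmp_results/@vital %false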