Multi-versioning and transactions of dynamic tables

This section describes transactions as they apply to dynamic tables. The transactional model of static tables is described in the Transactions section.

Transactions are divided into master transactions and tablet transactions. Master transactions enable you to perform operations on the master meta-information. Tablet transactions enable you only to write data to dynamic tables.

Data in sorted dynamic tables is read and modified using table transactions.

The MVCC model is used to isolate transactions and resolve conflicts. The default isolation level is snapshot (not serializable snapshot).

The system supports distributed transactions, i.e. it enables multiple rows of one or more dynamic tables to be changed within a single transaction.

Restrictions

When using transactions, keep in mind:

Transactions must be short (the default limit is 1 minute).
Transactions must write a limited amount of data (the default limit is 100,000 rows).

Calculating the moment of the last write is no simple task, because some of the data may have already been written to the disk. Attempting to read such data from the disk would cause a write by key to become a random read by key. The YTsaurus system design stipulates that all writes are made quickly. In terms of disk activity, access is sequential. Therefore, the following logic is implemented:

Dynamic stores (dynamic_store), which store freshly written data, are stored in the cluster node's memory for a guaranteed period of time even if they are already written to chunks on the disk.
These dynamic stores are checked quickly, because it does not require access to the disk.
The system prohibits commit of transactions whose lifetime exceeds a certain configurable limit.

Timestamps

Each value stored in the sorted dynamic table is annotated with a timestamp. Timestamps are 64-bit integers generated by the cluster if necessary.

When any transaction starts, a start timestamp is generated. All reads of dynamic tables within a transaction are made relative to this timestamp, thereby generating a snapshot of the data state in these tables.

When a transaction is committed, another timestamp is generated — a commit timestamp. All changes to the tables made by this transaction are annotated with a commit timestamp.

Timestamps within a cluster are generated centrally on the master servers. For better scalability, clients do not request timestamps directly, but through special proxy servers that buffer queries. In practice, it takes less than 10 ms to generate a new timestamp when the cluster and the client are within the same DC. The total capacity of their generation is practically unlimited.

For timestamps the following is guaranteed:

Uniqueness: All obtained timestamps will be different.
Monotony: If A means an event of obtaining a T(A) timestamp and B means an event of ordering a T(B) timestamp, with A occurring before B, then T(A) < T(B)).
Consistency: If an A transaction has a Tc(A) commit timestamp and a B transaction has a Ts(B) start timestamp, with Tc(A) < Ts(B), then the B transaction will see all changes made by the A transaction (and only those). It is also true that any transaction in progress does not see its own changes (because the changes have not yet obtained any timestamp).

In addition, the timestamps are arranged so that it is easy to calculate an approximate physical time of generation with an error of about a second.

Working with transactions requires the client to maintain a certain state in memory. For example, you need to remember the transaction start timestamp. Changes made within a transaction are also buffered on the client and only sent out at the time of commit. In the current implementation, this state is maintained by the YTsaurus client library, but in theory it is possible to transfer this work to a suitable stateful proxy.

Weakening guarantees

By default, when working with sorted dynamic tables, all user transactions are atomic and guarantee data integrity: written changes cannot be lost.

A two-phase commit is used to ensure atomicity and the server waits until the data is written to the disk to ensure data integrity. In total, these guarantees contribute significantly to the latency of write operations (hundreds of milliseconds). However, there are user scenarios when less strict guarantees are acceptable but a markedly lower latency of write operations is required. To support such scenarios, the system has a number of parameters that enable you to set reduced levels of atomicity and preservation.

Atomicity

Transaction atomicity means that if a user at time t observes an effect (written data) from the T transaction, then at any subsequent time t' they observe all effects from the T transaction. For example, if user 1 changes rows with k_1 and k_2 keys in a single or different dynamic tables, it is impossible that user 2 reads the changed k_1 row, but not the changed k_2 row.

Tables have the atomicity attribute which defines the guarantees for the transactions to be applied. This attribute can only be changed for unmounted tables. The attribute values can be the following:

full: Means that transactions are fully atomic (default mode). In full mode, changes become atomically visible at any time after commit timestamp (as stated earlier, the system provides a snapshot isolation level in this mode).
none: Means that the system does not give any guarantees about the atomicity of the changes. In none mode, there may be situations where commit fails (for example, because a tablet is not available), but some of the changes are committed and became visible to users.

Non-atomic transactions

For non-atomic transactions, the isolation level can be considered read committed. Reading within non-atomic transactions is not performed relative to the start time (there is no start time as such). Instead, the most recently written data is read.

Values written by non-atomic transactions receive timestamps, but the timestamps are generated by the clients based on system clock data. YTsaurus checks that the timestamp data does not differ from the server timestamps generated in a serialized way by more than the client_timestamp_threshold value (the default value is one minute) in the system configuration. If the difference between the client and server clocks exceeds the specified threshold, the client will get the Transaction timestamp is off limits, check the local clock readings error.
Even with non-atomic modifications within a row, the uniqueness and monotony of the timestamps are guaranteed. For this, the YTsaurus system corrects the timestamps received from the client upwards if necessary.

Non-atomic transactions do not take locks, so the transaction that is committed last (last write wins) is counted for concurrent writes. Successful completion of a non-atomic transaction means that all changes are applied on the server.

In case of non-atomic transactions, outside observers may see only partially applied changes during commit time. For atomic transactions, visibility of a part of the changes at any time by anyone means visibility of all the changes. Once commit is complete (in any mode), all changes are guaranteed to be visible to all.

To use non-atomic transactions, you need to:

Specify the atomicity attribute equal to none for the table.
Specify the atomicity=none mode in the settings when starting a transaction.

At the time of commit, the server checks the atomicity mode of the table and the atomicity mode specified by the client. If the modes do not match, an execution time error occurs. There are two reasons why it is mandatory to specify atomicity mode for a transaction:

Ideological: To ensure that places in the code where non-atomic commits occur are explicitly marked.
Technical: When an atomic transaction starts, there is an additional delay due to the calculation of the start timestamp. For non-atomic transactions, this timestamp does not make sense. This makes it impossible to start a non-atomic transaction for free and only find out that it is non-atomic at the time of commit.

Many versions of the same value can be written to tables with atomicity=none at one time. For example, on the writing side, the repeated query goes into an infinite cycle, or a replicated table with an accumulated queue writes it all at to the replica at once. This causes the table data not to be able to be written to the disk and writing to it is stopped.

Therefore, we recommend setting the following options for tables with atomicity=none:

min_data_ttl = 0
merge_rows_on_flush = %true

In non-atomic transaction mode, the system behaves almost identically to Apache HBase. By weakening guarantees, each write requires only one round-trip to the server and one disk write cycle (tens of milliseconds).

Durability

Transaction durability means that if the user at time t successfully commits a T transaction (receives confirmation from the server), then at any subsequent t' time, they observe all the effects of a T transaction.

For non-atomic transactions, there is also a way to drastically reduce their latency to units of milliseconds at the cost of preservation. When starting a transaction, you can specify the durability parameter that has the following values:

async: The transaction was accepted by the server, saved to memory, and, if there are no failures, will once be written to the disk (for non-atomic transactions only).
sync: The transaction was accepted by the server, validated by the server quorum, and saved to the logs on the disk. If replication settings are correct and there are no serious problems, it survives failures (default mode).

Supported operations

Creating a transaction

There are two types of transactions in YTsaurus:

Master transactions: They are created and maintained on the master and enable you to work with Cypress only.
Tablet transactions: They are created by clients without the master and enable you to work with dynamic tables, but not Cypress.

The transaction type is specified by the client when the transaction is created. When a transaction is started, the system generates a transaction start timestamp. In case of a master transaction, the timestamp is registered on the master and periodically polled by the client. In case of a tablet transaction, its state is handled by the client library. From YTsaurus's point of view, the transaction is not visible and does not cost anything.

You can write of one or more dynamic tables within a single tablet transaction to a random set of rows. You cannot write to dynamic tables from master transactions.

You can specify the atomicity and durability settings for tablet transactions.

Writing rows

The client can write data within the active transaction using the insert_rows method. To do this, the rows to be written must be reported. All key fields must be present in each such row. Some of the data fields from those specified in the schema may be missing.

Semantically, if a row with the specified key is not in the table, it appears. If there is already a row with this key, some of the columns are overwritten.

When part of the fields are specified, the following modes exist:

overwrite: All unspecified fields update their values to null (default mode).
update: Enabled by the update == true option. The previous value will then be retained. In this mode, all columns marked with the required attribute must be transmitted.

Note

When reading and executing an SQL query, the data is visible only when the transaction starts. Changes written within the same transaction are not available for reading.

Completing a transaction

This call sends the changes accumulated during operation to the cluster and a commit attempt is made. It can be successful, and if it is, the client will know the transaction end timestamp. A commit may fail (for reasons ranging from network problems and temporary unavailability of individual cluster nodes to conflicts between transactions). There is no built-in reliable means to find out if a commit has been completed successfully (the relevant message may have gotten lost on the way to the client).

Aborting a transaction

On the client, the memory occupied by the changes accumulated in the transaction is freed up. If this is a master transaction, it is aborted on the master server. If this is a tablet transaction, no further actions are required.

Checking transaction conflicts

To check and prevent conflicts between atomic transactions that work (lock read or modify operations) with data overlapping by keys, there are locks on the table rows:

In case of read locks, shared-locks are used.
Exclusive-locks are used to modify data.

In the simplest case without further configuration, there is one basic shared-exclusive lock associated with each key. This feature is implemented in the form of a shared-lock counter and an exclusive-lock flag.

By default, locks apply immediately to entire rows. You can increase the granularity of locks in the system by marking the columns in the schema with a lock attribute and specifying the name of an additional lock. In this way, you can divide the columns into groups, each controlled by a different lock. If no lock attribute is specified for a column, it is considered to be controlled by the main lock. If the transaction writes by key, it takes the locks controlling all affected columns.

You cannot take an additional lock if another transaction has already taken the main lock. When deleting a row, the transaction takes the main lock.

Modification lock

It is guaranteed that there cannot be two transactions modifying the same key at the same time.
When modifying data by key, whether reading or writing,, the transaction takes an exclusive-lock by that key at the moment of commit. If the specified lock has already been taken by another modify transaction or by some number of read transactions that have taken shared-locks, a conflict occurs and the transaction is aborted.

It is guaranteed that transactions with overlapping lifetimes cannot write data by the same key.
Locks on rows are only taken and retained during commit. Until then, the data is on the client. Commit in the system is two-phase and its process starts with the prepare timestamp and ends with the commit timestamp.
For example, A and B transactions were started at about the same time, then A performed a write by k key and was committed. A B transaction could now perform a write by k key and be committed, because there is no longer any lock on k. But this behavior is contrary to the snapshot isolation level provided by the system.

Therefore, when a B transaction modifies a key, in addition to checking a lock on that key, the system also checks:

Whether there were any writes with timestamps greater than Tstart(B) on that key.
Whether this key was locked by another transaction before the time greater than Tstart(B).

If at least one of the conditions is not met, a conflict occurs.

Read lock

In case of a read lock, in addition to checking whether the shared-lock can be taken, only the lack of modifications to the given key with timestamps greater than Tstart(B) is checked. The structure for transaction conflict checking is defined by a shared-lock counter, an exclusive-lock flag, and two timestamps — the time before which the row was locked for reading and the time of the last modification of the row.

When row modification is complete, the exclusive-lock is released and the timestamp of the last modification is written to the commit timestamp. When a transaction that takes read locks (shared-locks) is complete, the shared-lock is released at the end of the commit process, and in case of a strong-lock, the transaction commit timestamp is written. In case of a weak-lock, no timestamp is written.

Replicated tables

Sharding