Limits on the number of operations

YTsaurus has limits on the number of operations running in compute pools. These limits serve as a protective mechanism that controls the load on the scheduler. There are two types of limits on the number of operations:

  • total_operation_count limits the total number of operations. This reduces the load from maintaining the state of the pool tree and performing periodic recalculations of the fair share.
  • running_operation_count limits the number of running operations. This reduces the load from scheduling new allocations.

If a pool has reached its limit on the total number of operations, attempting to run the start_operation command will result in an error. If an operation starts successfully, but the pool has already reached its maximum number of running operations, the operation enters the pending state, gets queued, and waits for the currently running operations to complete. In the pool hierarchy, operations count toward the limits of all their ancestor pools. The described rules apply if at least one ancestor pool reaches its limit. Generally, limits from different pools are independent of each other. There may be cases where the combined limits of all child pools exceed the limit of the parent pool.

Lightweight operations

The bulk of the scheduler's load comes from processing exec node heartbeats and scheduling new allocations. This load directly depends on the number of operations currently involved in the scheduling process. These are called schedulable operations. For example, a running operation may be non-schedulable if it already has all the necessary allocations. Since the schedulable status can change while an operation is in progress, it's difficult to enforce a limit on the number of such operations. For this reason, the system manages the load by implementing a conservative restriction on the number of currently running operations.

Some operations may be non-schedulable for much of their duration. For example, an operation may consist of a single small job that requires minimal time for the scheduler to initiate. In this scenario, the operation continues to count toward the limit on running operations, which can be a valuable resource in large YTsaurus installations. To address cases like this, the service supports a special type of operations called lightweight operations. These operations don't count toward the running_operation_count, meaning the limit on the number of running operations doesn't apply to them. However, lightweight operations still count toward the total_operation_count and the special lightweight_running_operation_count counter, which doesn't have a limit.

For an operation to be considered lightweight, it must meet several conditions:

  • The operation must be of the Vanilla type.
  • The pool where the operation is running must be configured in FIFO mode.
  • The pool must allow lightweight operations. This is controlled by the enable_lightweight_operations setting, which can be set by the YTsaurus cluster administrator.

Recommendations for use

You may run non-lightweight operations in a pool that has lightweight operations enabled. These operations count toward the running_operation_count and are subject to the associated limit. However, mixing operations of different types is an anti-pattern and is strongly discouraged.

The scheduler algorithm implements special logic to count lightweight operations and expects their jobs to start successfully within a short timeframe. For this reason, we don't recommend abusing this feature by running heavy Vanilla operations that consist of more than one job in lightweight pools. While these operations are technically considered lightweight, their launch time may be longer than usual.