Resources and preemption

A CHYT clique is a Vanilla YTsaurus operation and instances are run in the jobs of this operation and consume resources. A typical CHYT instance runs on about 10 CPU cores and about 60 GB of RAM.

Before continuing with this article, consider reading about the Scheduler and pools first.

Preemption

The main mechanism for ensuring fair resource allocation in the YTsaurus scheduler is preemption. This means that if at some point an operation starts consuming more resources than it's entitled to, any of its jobs will be "preempted", i.e. forcibly stopped, and another operation that is actually entitled to the resources will be started in its place. This job is called an aborted job and the reason for abort is preemption.

This approach works well for jobs of regular YTsaurus operations, since a typical job in YTsaurus has a duration of a few minutes, and losing a job progress is not a problem — the scheduler will restart the job later when there are available resources for it. Map-Reduce has delays of about several minutes, so this situation is absolutely tolerable.

The situation with CHYT is a bit more complicated. The built-in ClickHouse protocol does not provide a mechanism to recover queries that failed due to a component failure at runtime. This means that if queries are now running on the instance and abort is sent to the job where the instance is running, then all running queries will fail with network layer errors (NetException, Attempt to read after eof, Connection refused, and others). By then, the client could have returned a random set of strings from the response to the query, which only complicates the task of restarting the query.

Interruption

MR jobs have long used a mechanism called Interruption: if YTsaurus realizes that it is about to interrupt a job, it simply stops feeding data to the job's input. If the job manages to complete within a certain interruption_timeout, it is considered completed, and the data it has not processed will be fed to a new job. If the job does not manage to complete within an interruption_timeout, it is simply aborted and its input is fully processed by another job.

ClickHouse instances do not process any input in the usual way for YTsaurus, independently executing queries coming from the outer world, so information about the upcoming interruption comes to such jobs in a different way. For CHYT, the interruption_signal option is supported in the operation specification. This is a signal that will be sent to the job to notify it of an imminent completion.

When receiving the signal, the instance goes into completion mode, sends this information to the others using gossip, and no longer accepts any new queries, but continues executing those that are already in progress. After all current queries are executed, the instance will end with the completed status.

Regular preemption and graceful preemption

The YTsaurus scheduler preempts jobs based on the following logic: if a node has jobs whose operations have a lower fair_share_ratio than their usage_ratio, this job is considered preemptable. If the scheduler has a job of the starving operation that can be run on a node if preemptable jobs are preempted, then it does it — it preempts the required number of preemptable jobs and plans new ones in their place. In this model, it is unacceptable to wait very long to preempt jobs, so the interruption_timeout is only 15 seconds.

A typical usage scenario for CHYT is small and medium queries that run from single-digit seconds to several minutes; interruption_timeout = 15 sec is often less than the standard query time, which means that many queries that were running on a given instance will end with an error.

To combat instance preemption, you can start a clique in the pool with guarantees, but not all users have guarantees. Besides that, some users don't want the clique to interfere with other operations in the pool: they need it to scale down when there is a demand for pool resources and expand when there are extra resources.

To solve this problem, a new operation scheduling mode — Graceful Preemption — was added. To avoid preemption problems in this mode, preemption is performed in advance: if there are preemptable jobs, a completion signal is sent to them, regardless of whether there is a candidate for these resources. Thus, the operation independently seeks that usage_ratio = fair_share_ratio, catching up with the fair amount of resources it is due at the moment. Due to this, you can increase the preemption timeout to 10 minutes, because no one is idle, waiting for the release of resources.

Below is an example of how usage_ratio (orange graph) and fair_share_ratio (green graph) change after a heavyweight operation is started in the same pool as the clique.

To use Graceful Preemption, set it with the --spec {preemption_mode = graceful} parameter.