Dynamic monitoring of CPU consumption

If a job consumes significantly less CPU than requested in the operation specification, the job's CPU guarantee may be reduced. The freed resources can then be transferred to other jobs running in operations of the given pool. The job-cpu-monitor process tracks job CPU consumption and adjusts the guarantee accordingly.

CPU consumption over each period (check_period) is read regularly from the job container, and exponential smoothing with the smoothing_factor parameter is applied to this value. The monitor then considers the last vote_window_size smoothed values together with the interval (relative_lower_bound*current_cpu_limit, relative_upper_bound*current_cpu_limit), where current_cpu_limit is the limit currently set on the container.
Each smoothed value is then converted to a vote according to the rule:

  • -1: Value < relative_lower_bound*current_cpu_limit.
  • 1: Value > relative_upper_bound*current_cpu_limit.
  • 0: In other cases.
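
The smoothing and voting steps above can be sketched as follows. This is a minimal illustration, not the monitor's actual code: the function names are hypothetical, and the smoothing formula (new = factor * sample + (1 - factor) * previous, seeded with the first sample) is the common textbook form and is assumed here, since the text does not spell it out.

```python
from collections import deque


def smooth(previous, sample, smoothing_factor):
    """Exponential smoothing (assumed form; the first sample seeds the series)."""
    if previous is None:
        return sample
    return smoothing_factor * sample + (1.0 - smoothing_factor) * previous


def vote(value, current_cpu_limit, relative_lower_bound, relative_upper_bound):
    """Map one smoothed value to -1, 1 or 0 as in the rule above."""
    if value < relative_lower_bound * current_cpu_limit:
        return -1
    if value > relative_upper_bound * current_cpu_limit:
        return 1
    return 0


def sum_votes(window, current_cpu_limit, relative_lower_bound, relative_upper_bound):
    """Sum the votes of all smoothed values currently in the window."""
    return sum(
        vote(v, current_cpu_limit, relative_lower_bound, relative_upper_bound)
        for v in window
    )


# Keep only the last vote_window_size smoothed values (5 is used here).
window = deque(maxlen=5)
smoothed = None
for sample in [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]:  # made-up CPU readings
    smoothed = smooth(smoothed, sample, smoothing_factor=0.1)
    window.append(smoothed)
```

A bounded deque is a convenient way to hold the window because old values fall off automatically once vote_window_size entries are stored.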

The resulting votes are summed into the votes_sum variable, and current_cpu_limit is recalculated:

  • votes_sum > vote_decision_threshold => current_cpu_limit *= increase_coefficient
  • votes_sum < -vote_decision_threshold => current_cpu_limit *= decrease_coefficient

The current_cpu_limit value is bounded below by the min_cpu_limit option from the job-cpu-monitor configuration and above by the cpu_limit option from the operation specification.
If current_cpu_limit has changed, the new value is set on the container and sent to the scheduler so that resource consumption in the pool is updated.
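
The recalculation and clamping described above might look like the sketch below. The function name and the spec_cpu_limit parameter (standing for cpu_limit from the operation specification) are hypothetical, and the code assumes the limit is left unchanged when neither threshold condition holds.

```python
def update_cpu_limit(
    current_cpu_limit,
    votes_sum,
    vote_decision_threshold,
    increase_coefficient,
    decrease_coefficient,
    min_cpu_limit,
    spec_cpu_limit,
):
    """Return (new_limit, changed) after applying the vote rule and the bounds."""
    new_limit = current_cpu_limit
    if votes_sum > vote_decision_threshold:
        new_limit *= increase_coefficient
    elif votes_sum < -vote_decision_threshold:
        new_limit *= decrease_coefficient
    # Clamp: never below min_cpu_limit, never above the limit from the spec.
    new_limit = max(min_cpu_limit, min(new_limit, spec_cpu_limit))
    # Only a changed value would need to be applied and reported.
    return new_limit, new_limit != current_cpu_limit
```

Returning the changed flag alongside the value mirrors the text: the container and the scheduler only need to be updated when the limit actually moves.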

In effect, job-cpu-monitor aims to keep current CPU consumption between relative_lower_bound and relative_upper_bound of the limit set on the container, and it shifts that limit (and with it the interval) up or down when consumption leaves the interval.

Default values (the values currently in effect may differ):

  • check_period = 1000 (ms);
  • smoothing_factor = 0.1;
  • relative_upper_bound = 0.9;
  • relative_lower_bound = 0.6;
  • increase_coefficient = 1.45;
  • decrease_coefficient = 0.97;
  • vote_window_size = 5;
  • vote_decision_threshold = 3;
  • min_cpu_limit = 1.
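
To make these defaults concrete, the sketch below collects them in a hypothetical dataclass (the class and helper names are illustrative, not part of any real API) and derives two quantities from the rules above: the interval that consumption is compared against, and the smallest number of agreeing votes that can trigger a change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobCpuMonitorDefaults:
    """Default values copied from the list above (hypothetical container)."""
    check_period_ms: int = 1000
    smoothing_factor: float = 0.1
    relative_upper_bound: float = 0.9
    relative_lower_bound: float = 0.6
    increase_coefficient: float = 1.45
    decrease_coefficient: float = 0.97
    vote_window_size: int = 5
    vote_decision_threshold: int = 3
    min_cpu_limit: float = 1.0


def target_band(cfg, current_cpu_limit):
    """Interval that smoothed consumption is compared against."""
    return (
        cfg.relative_lower_bound * current_cpu_limit,
        cfg.relative_upper_bound * current_cpu_limit,
    )


def min_agreeing_votes(cfg):
    """Fewest same-direction votes whose sum can exceed the threshold
    (all remaining votes in the window being 0)."""
    return cfg.vote_decision_threshold + 1


cfg = JobCpuMonitorDefaults()
low, high = target_band(cfg, 4.0)    # band for a container limit of 4 CPUs
needed = min_agreeing_votes(cfg)     # votes needed within the window of 5
```

With these defaults and a limit of 4, the band is roughly 2.4 to 3.6, and at least four of the last five votes must point the same way (with the fifth neutral) before the limit moves. Note also the asymmetry of the coefficients: a single increase multiplies the limit by 1.45, while a single decrease only multiplies it by 0.97.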

The listed settings can be specified in the job-cpu-monitor section of the operation specification. The same section also accepts the enable_cpu_reclaim option, which enables or disables CPU limit changes. You can view the option values actually in effect in the web interface on the operation page, on the Specification -> Resulting specification -> job-cpu-monitor tab.
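
As an illustration only, such a section could be assembled in Python before being merged into an operation specification. The helper below is hypothetical, and the section key is written exactly as it appears in the text above; the exact spelling expected by the system is an assumption and should be checked against the Resulting specification tab.

```python
# Option names taken from the list of settings above, plus enable_cpu_reclaim.
KNOWN_OPTIONS = {
    "check_period",
    "smoothing_factor",
    "relative_upper_bound",
    "relative_lower_bound",
    "increase_coefficient",
    "decrease_coefficient",
    "vote_window_size",
    "vote_decision_threshold",
    "min_cpu_limit",
    "enable_cpu_reclaim",
}


def make_monitor_section(**options):
    """Build the monitor section, rejecting option names not listed above."""
    unknown = set(options) - KNOWN_OPTIONS
    if unknown:
        raise ValueError(f"unknown job-cpu-monitor options: {sorted(unknown)}")
    return dict(options)


# Example: disable CPU limit changes for one operation.
# The section key spelling is assumed from the text, not verified.
spec_patch = {"job-cpu-monitor": make_monitor_section(enable_cpu_reclaim=False)}
```

Validating option names up front is a small safeguard against typos, which would otherwise only show up as a value that silently stays at its default.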