Job stats
This section lists and describes the statistics that the YTsaurus system collects from operations and individual jobs. It also includes examples of troubleshooting issues based on the collected statistics.
Overview
While operations are running, YTsaurus collects various system and user statistics from jobs.
Individual job statistics are written to the scheduler log, while aggregated statistics for an operation are available in the web interface on the Statistics tab of the operation page.
Note
For a better understanding of job statistics, you should be familiar with the notions of job proxy and sandbox, and have an idea of how the YTsaurus system delivers user files to jobs. You can read about these concepts in the Jobs section.
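Besides the web interface, aggregated statistics can also be fetched programmatically. Below is a rough, hedged sketch using the Python client; the cluster name and operation ID are placeholders, and the exact layout of the progress attribute (and the names of the aggregate fields) may vary between versions, so inspect the returned structure before relying on specific paths.

```python
import yt.wrapper as yt

client = yt.YtClient(proxy="my-cluster")   # placeholder cluster name

# Aggregated job statistics are expected under the operation's "progress"
# attribute; each leaf holds aggregates (sum, min, max, count) broken down
# by job type and job state.
op = client.get_operation("1234-abcd-5678-ef90", attributes=["progress"])  # placeholder ID
job_statistics = op["progress"]["job_statistics"]

# Paths mirror the statistic names described below, e.g. time/total:
print(job_statistics["time"]["total"])
```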
Statistic descriptions
System stats
Statistic | Description |
---|---|
time/total | Job running time from the moment it is created by the scheduler until the moment the scheduler is notified that the job has completed (successfully or otherwise). Milliseconds. |
time/prepare | Job setup time before the job proxy is started (files are downloaded into the chunk cache as required, a sandbox directory is created, files are created and copied to tmpfs, and the requisite cgroups are set up). Milliseconds. |
time/artifacts_download | Time to download files into the chunk cache (as required). Included in time/prepare. Milliseconds. |
time/prepare_root_fs | Time to prepare the root file system in the porto container (as required). Included in time/prepare. Milliseconds. |
time/gpu_check | Time to run the utility that pre-checks GPU functionality on the node (relevant for some GPU operations). Included in time/prepare. Milliseconds. |
time/exec | Job running time from the start to the exit of the job proxy process. Milliseconds. |
data/input/chunk_count | Total number of data slices read by the job. A data slice is a continuous segment of a single chunk (for static tables) or a continuous range of rows between two keys (for dynamic tables). Pieces. |
data/input/row_count | Total number of table rows read by the job. Pieces. |
data/input/compressed_data_size | Total compressed block data read by the job (the table data fed to the operation input). Bytes. |
data/input/uncompressed_data_size | Total uncompressed block data read by the job. Bytes. |
data/input/data_weight | "Logical" amount of uncompressed data read by the job. Depends only on the values in the table cells and the number of rows. Does not depend on whether the table has a schema or on the values of optimize_for, compression_codec, or erasure_codec. Calculated as row_count + sum(data_weight(value)) over all values in the table cells. Bytes. |
data/input/not_fully_consumed | 1 if the job has not read its entire input, 0 otherwise. |
data/output/K/chunk_count | Number of chunks written to the output table with index K. Pieces. |
data/output/K/row_count | Number of rows written to the output table with index K. Pieces. |
data/output/K/compressed_data_size | Compressed block data written to output table K. Bytes. |
data/output/K/uncompressed_data_size | Uncompressed block data written to output table K. Bytes. |
data/output/K/data_weight | "Logical" amount of uncompressed data written to output table K. Depends only on the values in the table cells and the number of rows. Does not depend on whether the table has a schema or on the values of optimize_for, compression_codec, or erasure_codec. Calculated as row_count + sum(data_weight(value)) over all values in the table cells. Bytes. |
data/output/K/regular_disk_space | Populated only for tables with erasure_codec == none; equal to compressed_data_size plus the size of the chunk metadata. Bytes. |
data/output/K/erasure_disk_space | Populated only for tables with erasure_codec != none. In addition to compressed_data_size, includes the amount of data in the parity blocks. For more information about erasure codecs, see the corresponding section. Bytes. |
job_proxy/cpu/user | User mode CPU time of the job proxy process. Computed from the cpu_usage value of the porto container. Milliseconds. |
job_proxy/cpu/system | Kernel mode CPU time of the job proxy process. Computed from the cpu_usage_system value of the porto container. Milliseconds. |
job_proxy/cpu/wait | Wait CPU time of the job proxy process. Computed from the cpu_wait value of the porto container. Milliseconds. |
job_proxy/cpu/throttled | Throttled CPU time of the job proxy process. Computed from the cpu_throttled value of the porto container. Milliseconds. |
job_proxy/block_io/bytes_written | Number of bytes written by the job proxy process to the local block device. The value is derived from the Write section of blkio.io_serviced_bytes for the relevant cgroup. Normally a small value, because the job proxy only writes logs to the local disk. Chunks are written via (local and/or remote) nodes and do not count towards this statistic. |
job_proxy/block_io/bytes_read | Number of bytes read by the job proxy process from the local block device. The value is derived from the Read section of blkio.io_serviced_bytes for the relevant cgroup. Typically close to zero, because the job proxy only reads its own configuration from disk. |
job_proxy/block_io/io_read | Number of reads from the local block device by the job proxy process. The value is derived from the Read section of blkio.io_serviced for the relevant cgroup. Pieces. |
job_proxy/block_io/io_write | Number of writes to the local block device by the job proxy process. The value is derived from the Write section of blkio.io_serviced for the relevant cgroup. Pieces. |
job_proxy/block_io/io_total | Total number of input/output operations on the local block device by the job proxy process. Pieces. |
job_proxy/max_memory | Maximum amount of RAM used by the job proxy process while the job is running. Bytes. |
job_proxy/memory_reserve | Amount of RAM guaranteed to the job proxy at launch. If actual usage exceeds memory_reserve, the job may be aborted with a resource_overdraft message. Bytes. |
job_proxy/traffic | Amount of data transmitted over the network to/from the job proxy. Normally matches the amount of job input/output data, provided that the job only reads and writes tables/files, as jobs usually do, and does not generate any other traffic on its own. Contains fields of the form A_to_B, where A and B are data center short names; the value of such a field equals the amount of data transmitted in that direction. The duration_ms field holds the time it took to transmit the data and normally matches the lifetime of the job proxy. In addition, incoming and outgoing traffic is tracked in the inbound and outbound fields, respectively, also itemized by data center. |
exec_agent/traffic | Same as job_proxy/traffic but for the exec agent. Normally shows the traffic generated while setting up job artifacts. |
user_job/cpu/* | Similar to job_proxy/cpu/* but applies to the user job processes. |
user_job/block_io/bytes_read, user_job/block_io/bytes_written, user_job/block_io/io_read, user_job/block_io/io_write, user_job/block_io/io_total | Block IO statistics for the user job. Similar to job_proxy/block_io/*. |
user_job/cumulative_memory_mb_sec | Integral of memory used over time. MB*sec. |
user_job/current_memory/major_page_faults | Number of major page faults in the user process. The value is derived from the pgmajfault section of memory.stat for the relevant cgroup. May help in investigating an elevated user_job/block_io/io_read statistic. Pieces. |
user_job/current_memory/rss | RSS size at job end. The value is derived from the rss section of memory.stat for the relevant cgroup. Bytes. |
user_job/current_memory/mapped_file | Size of memory mapped files at job end. The value is derived from the mapped_file section of memory.stat for the relevant cgroup. Bytes. |
user_job/tmpfs_size | Amount of tmpfs currently being used by the running job. Computed as the difference between the total and free values returned by a statfs call on the tmpfs mount point. Bytes. |
user_job/max_tmpfs_size | Maximum amount of tmpfs used throughout the job's running time. Bytes. |
user_job/tmpfs_volumes/K/size, user_job/tmpfs_volumes/K/max_size | Same as user_job/tmpfs_size and user_job/max_tmpfs_size for each requested tmpfs volume individually. |
user_job/disk/usage | Disk space taken up by the job sandbox; computed by recursively walking the sandbox and adding up all file sizes. Bytes. |
user_job/disk/limit | Requested limit on the job sandbox size. Bytes. |
user_job/max_memory | Maximum amount of RAM used by the user job while running, excluding tmpfs. Bytes. |
user_job/memory_limit | RAM limit set in the operation spec at operation start. Duplicated in the statistics for housekeeping reasons. Bytes. |
user_job/memory_reserve | Amount of RAM guaranteed to the user job at launch. If actual usage becomes greater (but stays below memory_limit), the job may be aborted with a resource_overdraft message when the cluster node is short on memory. You can control this value using the memory_reserve_factor option in the operation spec. Bytes. |
user_job/pipes/input/bytes | Number of bytes transmitted via the user process's stdin (that is, the amount of input data converted to the specified input format, with all the control modifiers such as table_index, row_index, and so on). |
user_job/pipes/input/idle_time | Amount of time during which the job proxy transmitted no data to the user job via stdin because it was itself reading data. For instance, a cluster node's disk was busy and the data were unavailable, or the compression_codec used was slow to uncompress. Milliseconds. |
user_job/pipes/input/busy_time | Amount of time during which the job proxy process was writing data to the user job's stdin. The value can be large if the user code implements computationally intensive processing and cannot keep up with reading the data. It can also happen when the user code hangs on a write and consequently stops reading new data. Milliseconds. |
user_job/pipes/output/K/bytes | Number of bytes written by the user process to the descriptor corresponding to the Kth output table. For information on descriptors and their numbering, see the dedicated section. |
user_job/pipes/output/K/idle_time | Time during which the job proxy process did not read the stream corresponding to the Kth output table because it was busy writing data already read from that stream. For example, the cluster node being written to was slow to respond, or a very slow compression algorithm is being used. Milliseconds. |
user_job/pipes/output/K/busy_time | Time during which the job proxy process was reading from the stream corresponding to the Kth output table. If this value is large, the user code has not written anything to the stream for a long time, for instance because it took a long time to process the input, or because the input was unavailable. Milliseconds. |
user_job/gpu/cumulative_utilization_gpu | Net amount of time during which GPU computations were running. Added up across all the GPUs used by the job. Milliseconds. |
user_job/gpu/cumulative_utilization_memory | Net time spent accessing GPU memory. Added up across all the GPUs used by the job. Milliseconds. |
user_job/gpu/cumulative_utilization_clocks_sm | Integral of the board frequency over time relative to the maximum frequency. Added up across all the GPUs used by the job. Milliseconds * share. |
user_job/gpu/cumulative_utilization_power | Integral of effective GPU board power over time relative to the maximum power. Added up across all the GPUs used by the job. Milliseconds * share. |
user_job/gpu/cumulative_load | Time during which the GPU was under non-zero load. Added up across all the GPUs used by the job. Milliseconds. |
user_job/gpu/cumulative_memory | Integral of GPU memory utilization over time. Added up across all the GPUs used by the job. Milliseconds * bytes. |
user_job/gpu/cumulative_power | Integral of utilized GPU power over time. Added up across all the GPUs used by the job. Milliseconds * power. |
user_job/gpu/cumulative_clocks_sm | Integral of utilized GPU frequency over time. Added up across all the GPUs used by the job. Milliseconds * frequency. |
user_job/gpu/max_memory_used | Maximum recorded GPU memory usage. Added up across all the GPUs used by the job. Bytes. |
user_job/gpu/memory_total | Total available GPU memory. Added up across all the GPUs used by the job. Bytes. |
codec/cpu/decode | Wall time spent uncompressing data. Milliseconds. |
codec/cpu/encode | Wall time spent compressing data. Milliseconds. |
job_proxy/memory_reserve_factor_x10000, user_job/memory_reserve_factor_x10000 | Housekeeping parameters used to analyze the operation of the memory_reserve computation algorithm. |
job_proxy/aggregated_max_cpu_usage_x100, job_proxy/aggregated_smoothed_cpu_usage_x100, job_proxy/aggregated_preemptible_cpu_x100, job_proxy/aggregated_preempted_cpu_x100, job_proxy/preemptible_cpu_x100, job_proxy/smoothed_cpu_usage_x100 | Housekeeping statistics of the job CPU monitor. |
chunk_reader_statistics/data_bytes_read_from_disk | Amount of data read from disk during chunk reads. Bytes. |
chunk_reader_statistics/data_bytes_transmitted | Amount of data transmitted over the network during chunk reads. Bytes. |
chunk_reader_statistics/data_bytes_read_from_cache | Amount of data read from cache during chunk reads. Bytes. |
chunk_reader_statistics/meta_bytes_read_from_disk | Amount of metadata read from disk during chunk reads. Bytes. |
chunk_reader_statistics/wait_time | Time spent waiting for data during chunk reads. Milliseconds. |
chunk_reader_statistics/read_time | Time spent actively processing read chunks, such as parsing rows from read blocks. Milliseconds. |
chunk_reader_statistics/idle_time | Time during which chunk reading was paused to process previously read data. Milliseconds. |
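As a worked example of the data/input/data_weight formula from the table above (a hedged back-of-the-envelope calculation: the per-value weights used here, 8 bytes for an int64 and the byte length for a string, follow the common convention and may differ for other types):

```python
# Two rows with columns (id: int64, name: string):
#   {"id": 1, "name": "alice"}  -> 8 + 5 = 13
#   {"id": 2, "name": "bob"}    -> 8 + 3 = 11
row_count = 2
value_weight = (8 + 5) + (8 + 3)           # sum(data_weight(value)) over all cells
data_weight = row_count + value_weight     # 2 + 24 = 26 bytes
print(data_weight)
```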
Python API
Statistic | Description |
---|---|
custom/python_job_preparation_time | Time between the entry into the program's main function and the start of input row processing. |
Troubleshooting examples
Slow jobs
The most common use of statistics is to troubleshoot slow jobs.
Note
Job statistics are to be used alongside other troubleshooting methods available in the YTsaurus system.
Slow user code
Simple case: if user_job/cpu/user ~= time/exec, the user code is what makes the job slow. You should connect to the job via the job shell and look for the bottleneck using perf and gdb.
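As a hedged sketch of this check (the flat dict of millisecond values below is a made-up shape for illustration, not the exact layout returned by the API):

```python
def user_code_dominates(stats: dict, threshold: float = 0.8) -> bool:
    # If user-mode CPU time accounts for most of the wall-clock exec time,
    # the bottleneck is almost certainly the user code itself.
    return stats["user_job/cpu/user"] >= threshold * stats["time/exec"]

# Example: 280 s of user CPU out of 300 s of exec time points at the user code.
print(user_code_dominates({"user_job/cpu/user": 280_000, "time/exec": 300_000}))
```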
Multithreading
A more complex case: a user has analyzed the stats and retrieved the total utilized CPU time:
user_job/cpu/user + user_job/cpu/system + job_proxy/cpu/user + job_proxy/cpu/system = 303162 milliseconds
time/exec = 309955 milliseconds
The user is expecting the numbers to match.
The equality above is not entirely proper, since time/exec is wall time while user_job/cpu/* and job_proxy/cpu/* are CPU times spent computing. At the same time, both the job proxy and (potentially) the user process may use more than a single core because of multithreading, and on top of that, both processes run concurrently.
If the user process is single-threaded, the following should be approximately true:
time/exec ~= user_job/cpu/* + user_job/pipes/input/idle_time + max(user_job/pipes/output/N/idle_time)
The above computation does not include the time to "close" output chunks: the output stream is already closed and all its data has been read, but the job proxy is still finalizing chunks.
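The same approximation written out as a hedged sketch (again over a made-up flat dict of millisecond values):

```python
def explained_exec_time(stats: dict) -> float:
    # For a single-threaded user process:
    #   time/exec ~= user CPU time + stdin idle time + max output pipe idle time.
    # Chunk finalization after the output streams close is not covered.
    output_idle = max(
        value
        for key, value in stats.items()
        if key.startswith("user_job/pipes/output/") and key.endswith("/idle_time")
    )
    return (
        stats["user_job/cpu/user"]
        + stats["user_job/cpu/system"]
        + stats["user_job/pipes/input/idle_time"]
        + output_idle
    )

example = {
    "user_job/cpu/user": 200_000,
    "user_job/cpu/system": 10_000,
    "user_job/pipes/input/idle_time": 60_000,
    "user_job/pipes/output/0/idle_time": 30_000,
    "user_job/pipes/output/1/idle_time": 5_000,
}
print(explained_exec_time(example))   # ~300 s, to be compared with time/exec
```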
Long compression
If an operation is taking much longer to run than the user code's running time, and execution time measurements show that the YTsaurus system spends most of its time processing a yield, the problem is that inside yield the Python wrapper writes data to the output stream. If the job proxy write pipeline is overloaded, the job proxy process stops reading data and the yield becomes blocked. The user_job/pipes/output/N/idle_time statistics showed the "lost" minutes. There are two possible reasons why the job proxy process failed to read data: the job proxy took a long time processing data on its side, or it was busy writing to the cluster. The job_proxy/cpu/user stats make it clear that the job proxy process kept the CPU busy for 70% of the total job time, which means that processing input data took a long time. The heaviest part of job proxy processing is compression, which is what the profiler indicated in the end.
Note
General rule for slow job troubleshooting: when analyzing running time, find the dominant metrics, then look for the cause of the problem in the corresponding component.
Exceeding memory guarantees
Operations exceed their memory limits. Steps to resolve this issue:
- When requesting tmpfs, make sure that you are not ordering too much. To check, look at the user_job/tmpfs_size statistic.
- When selecting memory_limit, proceed from the user_job/max_memory metric; if you are using tmpfs, you must add the required tmpfs amount to user_job/max_memory (see the sketch after this list). If you use automatic tmpfs ordering via the C++ and Python APIs, the APIs will make the addition on their own.
- Most jobs do not use a lot of memory, but certain jobs require significantly more. In that case you should set memory_limit to the maximum job memory requirement but set memory_reserve_factor lower (and be ready for jobs to be aborted with resource_overdraft). You can make the relevant decision by comparing the average and the maximum values of the user_job/max_memory statistic (the web interface includes appropriate switches): if average and maximum memory usage in a job differ by several dozen percent or by orders of magnitude, this is an indication that there are jobs whose usage is inordinately high.
- Reverse situation: if too many jobs are being aborted with resource_overdraft, you should increase memory_reserve_factor.
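As a hedged illustration of how these knobs map onto an operation spec, here is a minimal sketch using the Python API; the table paths, the mapper, and the numbers are made up, and the exact spec layout should be checked against the operation settings reference.

```python
import yt.wrapper as yt

def mapper(row):
    # Trivial passthrough mapper used only to make the sketch self-contained.
    yield row

# memory_limit covers the worst-case job (maximum observed user_job/max_memory,
# plus any tmpfs), while a lowered memory_reserve_factor packs jobs more densely
# at the cost of occasional resource_overdraft aborts.
yt.run_map(
    mapper,
    "//tmp/input_table",        # hypothetical input table
    "//tmp/output_table",       # hypothetical output table
    spec={
        "mapper": {
            "memory_limit": 4 * 1024**3,    # 4 GiB
            "memory_reserve_factor": 0.5,   # initial guarantee of ~2 GiB
        },
    },
)
```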
Operation excessive io message
You must review the user_job/block_io/io_read statistic. If a job makes tens of thousands of reads, it is harming both itself and other users.
You should also check the major page fault counter (user_job/current_memory/major_page_faults). A large value is an indication of possible memory mapping use.
Attention!
The use of memory mapping in jobs is strongly discouraged, because such jobs create heavy disk load with random reads and writes. Please remember that a conventional hard drive is capable of no more than about 100 random access operations per second, while there are several dozen jobs running concurrently on a cluster node.
Solutions for this issue: reduce the executable size (by removing debugging symbols, for instance), order tmpfs for the job, and load large files into memory at job launch (the copy_files option). For more information on these options, see Operation settings.
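For the tmpfs and copy_files approach, a similarly hedged sketch (option names follow the operation settings reference mentioned above; the paths and sizes are made up):

```python
import yt.wrapper as yt

def mapper(row):
    yield row

# Mount tmpfs over the sandbox and copy artifacts into it so that a large
# dictionary is read from memory instead of producing random disk reads.
yt.run_map(
    mapper,
    "//tmp/input_table",
    "//tmp/output_table",
    spec={
        "mapper": {
            "file_paths": ["//tmp/big_dictionary"],  # hypothetical large artifact
            "tmpfs_path": ".",              # tmpfs over the whole sandbox
            "tmpfs_size": 2 * 1024**3,      # 2 GiB; account for it in memory_limit
            "copy_files": True,             # copy artifacts instead of symlinking
            "memory_limit": 4 * 1024**3,
        },
    },
)
```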
Supplemental
YTsaurus is unable to measure phase time inside a user process. For example, if user code uploads a large dictionary into memory at launch, you will not be able to see the time it took to load this dictionary in the system metrics. If this happens, you should take advantage of user statistics.
You can use user statistics to compute simple aggregates even in a map operation. The number of user statistics is limited to 128 per job.
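As a hedged sketch of reporting a user statistic from inside a job: the snippet below assumes the convention that a job writes a YSON map of custom statistics to file descriptor 5 (the reported values then show up under the custom/ prefix); check the user statistics documentation for the exact mechanism and any helpers your client library provides.

```python
import os
import time

def report_statistics(stats: dict) -> None:
    # Serialize a flat map of integer statistics as YSON, e.g. {dict_load_time_ms=1234;},
    # and write it to file descriptor 5, which is assumed to accept user statistics.
    yson = "{" + "".join(f"{name}={int(value)};" for name, value in stats.items()) + "}"
    os.write(5, yson.encode())

# Example: measure a hypothetical dictionary-loading phase that the system
# statistics cannot see, and report it as custom/dict_load_time_ms.
start = time.monotonic()
# ... load the dictionary here ...
report_statistics({"dict_load_time_ms": int((time.monotonic() - start) * 1000)})
```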
You can build different plots from job metrics (in particular, visualize operation phases) to simplify troubleshooting.
Parity blocks: additional blocks used for redundancy and for recovering the original blocks in the event of losing any part. Computed using a logical XOR.