FAQ

This section presents a collection of answers to user questions that arise when getting to know or using the YTsaurus system.

If you do not find an answer to your question in this section, please review Reporting an issue.

General questions


Q: How do I add a column to a YTsaurus table?

A: You need to retrieve the current table schema:

yt get //home/maps-nmaps/production/pedestrian/address_tasks/@schema --format '<format=text>yson'

And replace it with a new one:

yt alter-table //home/maps-nmaps/testing/pedestrian/address_tasks --schema '<"unique_keys"=%false;"strict"=%true;>[{"name"="timestamp";"required"=%false;"type"="uint64";"sort_order"="ascending";};{"name"="lon";"required"=%false;"type"="double";};{"name"="lat";"required"=%false;"type"="double";};{"name"="buildingId";"required"=%false;"type"="string";};{"name"="taskId";"required"=%false;"type"="string";};{"name"="pedestrianType";"required"=%false;"type"="string";};{"name"="buildingPerimeter";"required"=%false;"type"="double";};{"name"="buildingShape";"required"=%false;"type"="string";};]'

Q: I get the following error when I attempt to change a table schema: "Changing "strict" from "false" to "true" is not allowed". What should I do?

A: You cannot change the schema of a non-empty table from weak to strict, because validating such a change would require reading the entire table to make sure the data matches the new schema. The easiest workaround is to create a new table with the desired schema and copy the data into it, either with a read+write or by running a Merge operation.
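A minimal sketch of this workaround; the paths and the two-column schema are placeholders to adapt to your table:

yt create table //path/to/new_table --attributes '{schema = [{name = "key"; type = "string"}; {name = "value"; type = "string"}]}'
yt merge --mode ordered --src //path/to/old_table --dst //path/to/new_table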


Q: How do I authenticate when working with YTsaurus from the console?

A: Save the required user's token to the ~/.yt/token file.
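For example (a sketch; replace <your-token> with the actual token value):

mkdir -p ~/.yt
echo "<your-token>" > ~/.yt/token
chmod 600 ~/.yt/token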


Q: Can I reduce the replication factor for temporary tables in the native C++ wrapper?

A: No, the wrapper does not include this capability.


Q: The Python wrapper produces the error "ImportError: Bad magic number in ./modules/yt/__init__.pyc". What should I do?

A: This error results from a mismatch between the Python version of the client running the script and the Python version on the cluster. To fix it, run the script with the same Python version that is installed on the cluster. You can find out the cluster's version as follows:

yt vanilla --tasks '{master = {job_count = 1; command = "python --version >&2"}}'
Python 2.7.3

Python versions may differ across cluster nodes. It is better to run jobs in your own porto layer.


Q: What is the overhead for reading small table ranges and small tables?

A: Small ranges do not create overhead, since only the relevant blocks are read from disk. However, every static table read requires a metadata request to the master server, and this communication with the master is the bottleneck. Therefore, we recommend reading static tables with fewer, larger requests to minimize master load, or restructuring your process to read from dynamic tables.
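For example, a single request that reads one row range of a static table (the path and row numbers are placeholders):

yt read-table '//path/to/table[#100000:#200000]' --format json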


Q: Can hex numbers be stored efficiently in keys instead of strings?

A: Yes, they can, using the YSON or JSON format.


Q: What are the logging levels in the console client and how can I select them?

A: You can select logging levels via the YT_LOG_LEVEL environment variable with INFO being the default. You can change the logging level using the export YT_LOG_LEVEL=WARNING command. The following logging levels are available:

  • INFO: to display transaction execution progress and other useful information.
  • WARNING: to display warnings. For instance, a query could not be run and is being resubmitted, or a transaction input table is empty. Errors of these types are not critical, and the client can continue running.
  • ERROR: all errors that cause the client to fail. Errors of this type result in an exception. The exception is handled, and the client exits returning a non-zero code.

Q: How does the clearing of temporary data work on clusters?

A: Most YTsaurus clusters regularly (twice or more a day) run the //tmp cleaning script that finds and deletes tmp data that have not been used in a long time or use up a certain portion of the account's quota. For a detailed description of the cleaning process, please see the System processes section. When writing data to //tmp, users have to keep this regular cleaning in mind.


Q: When reading a small table with the read command, the client hangs, repeatedly retrying the request. What could be the reason?

A: One common cause is that the table has too many (small) chunks. We recommend running the Merge operation with the --spec '{force_transform=true}' option. When such tables appear in operation output, the console client prints a warning that includes, among other things, the command you can run to enlarge the table's chunks. You can also specify the auto_merge_output={action=merge} option to have the merge happen automatically.
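A minimal sketch of such a merge, combining the force_transform and combine_chunks options (the path is a placeholder):

yt merge --src //path/to/table --dst //path/to/table --spec '{force_transform = %true; combine_chunks = %true}'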


Q: An operation returns error "Account "tmp" is over disk space (or chunk) limit". What is going on?

A: Your cluster has run out of storage space for temporary data (the tmp account), or the account has too many chunks. This account is shared by all cluster users, which may cause it to fill up. You have to keep this in mind, and if exhaustion of the tmp quota is critical to your processes, we recommend using a separate directory in your own account for temporary data. Some APIs use //tmp as the default path for storing temporary data. If that is the case, reconfigure them to use subdirectories within your project's directory tree.


Q: An operation fails with error "Account "intermediate" is over disk space (or chunk) limit". What is going on?

A: Your cluster has run out of storage space for intermediate data (intermediate account), or the account has too many chunks.
Unless you specified intermediate_data_account (see Operation settings, Sort, MapReduce), you are sharing this account with everybody else. To avoid this problem, set intermediate_data_account.
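For example, a sketch of overriding the account in the spec (the account name is a placeholder):

yt map-reduce ... --spec '{intermediate_data_account = "my-project-account"}'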


Q: Is reading a table (or file) a consistent operation in YTsaurus? What will happen if I am reading a table while deleting it at the same time?

A: On the one hand it is. That is to say that if a read successfully completes, the read will return exactly the data contained in the table or file at the start of the operation. On the other hand, the read may terminate if you delete the table at that time. To avoid this, you need to create a transaction and have it take out a snapshot lock on the table or file. When you are using the python API, including the python CLI, with retry activated for reads, this lock is taken out automatically.
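For illustration, a sketch of taking a snapshot lock from the CLI, assuming your CLI version accepts a transaction via the --tx option; the path and the 10-minute timeout (in milliseconds) are placeholders, and the transaction must either be pinged or be long enough to cover the read:

TX=$(yt start-tx --timeout 600000)
yt lock //path/to/table --mode snapshot --tx "$TX"
yt read-table //path/to/table --tx "$TX" --format json > table.json
yt abort-tx "$TX"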


Q: When I start a client, I get "Cannot determine backend type: either driver config or proxy url should be specified". What should I do?

A: Check to see whether the YT_PROXY=<cluster-name> environment variable is set.



Q: What do I do if I get error "Account "..." is over disk space limit (node count limit, etc)"?

A: This message is an indication that the account is out of one of its quotas. The system has quotas for all kinds of resources. For more information on the types of quotas and for forms to change quotas, see the Quotas section.


Q: How do I discover who is taking up space in an account or all the nodes with a specific account?

A: The short answer is yt find / --name "*" --account <account_name>

A more detailed answer:

  1. Look in the recycle bin (//tmp/trash/by-account/<account_name>). To do this, follow the specified path in the web interface's Navigation section.
  2. Use yt find to look for your account's tables in //tmp and your associated project directories. Please note that yt find does not look in directories to which you do not have access.
  3. Contact the system administrator.

Q: How do I change a table's account?

A: yt set //path/to/table/@account my-account


Q: How do I find what part of my resources is being used by a directory together with all the tables, files, and so on it contains?

A: yt get //path/to/dir/@recursive_resource_usage, or select Show all resources in the Navigation section of the web interface.


Q: While working with the system, I am getting "Transaction has expired or was aborted". What does it mean and how should I deal with it?

A: When you create a master transaction, you specify a timeout, and a user undertakes to ping the transaction at least once during the specified time interval. If the interval between the moment of creation or the most recent ping is greater than the timeout, the system will terminate the transaction. There may be several things causing this behavior:

  1. Network connection issues between client machine and cluster.
  2. Client-side issues, such as failure to ping: pings are either not being sent from the code, or the pinging code is not getting a chance to run. In Python, for instance, this can happen if the program holds the GIL for a long time while working with some native libraries.
  3. There is some maintenance underway on the cluster.
  4. There are known access issues.
  5. The transaction was explicitly aborted by the client which you can ascertain by reviewing server logs.

Otherwise, client debug logs are required to investigate the issue. The server side only knows that pings stopped arriving; therefore, make sure to enable debug logging and collect the logs, for example by setting YT_LOG_LEVEL=debug, which works for most supported APIs.


Q: The operation page displays the warning "owners" field in spec ignored as it was specified simultaneously with "acl". What does it mean?

A: This message means that the operation spec includes both the deprecated "owners" field and the "acl" field with preference given to the latter, which means that "owners" was ignored.


Q: How do I automatically rotate nodes deleting those that are older than a certain age?

A: You should use the expiration_time attribute. For more information, see Metainformation tree.
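For example (the path and timestamp are placeholders):

yt set //path/to/node/@expiration_time '"2030-01-01T00:00:00Z"'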


Q: How do I automatically delete nodes that have not been in use longer than a specified period of time?

A: You should use the expiration_timeout attribute. For more information, see Metainformation tree.
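For example, a sketch that sets a one-day timeout; the value is assumed to be in milliseconds (check the linked section for the exact semantics):

yt set //path/to/node/@expiration_timeout 86400000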


Q: A running operation displays the warning "Detected excessive disk IO in <job_type> jobs. IO throttling was activated". What does it mean?

A: The operation's jobs are performing a lot of I/O against the local disk. To minimize the negative impact of such behavior on the cluster, the jobs were throttled via the blkio cgroup mechanism. For possible causes of this behavior, see the examples in the Job statistics section.


Q: Which paths does the web interface correct automatically with the Enable path autocorrection setting enabled?

A: The web interface does not report which path errors were corrected.
For instance, a path like //home/user/tables/ is always invalid; when it is displayed in the web interface, the unescaped trailing slash is stripped.


Q: How do I find out whether a table downloaded from the web interface was downloaded successfully, and if not, see the error?

A: If there is an error, it is written into the file being downloaded. An error may occur at any point during the download, so check the end of the file. Example error:

==================================================================
{
    "code" = 1;
    "message" = "Missing key column \"key\" in record";
    "attributes" = {
        "fid" = 18446479488923730601u;
        "tid" = 9489286656218008974u;
        "datetime" = "2017-08-30T15:49:38.645508Z";
        "pid" = 864684;
        "host" = "<cluster_name>";
    };
}
==================================================================

Q: I am attempting to run a program locally that communicates with YTsaurus via RPC and get the "Domain name not found" error. What should I do?

A: In the log, you may also see W Dns DNS resolve failed (HostName: your-local-host-name). The error occurs when resolving the name of the local host, which is not listed in global DNS. The YTsaurus RPC client uses IPv6 by default and disables IPv4, which is why the 127.0.1.1 your-local-host-name line in the local /etc/hosts file does not help. Adding ::1 your-local-host-name to that file should solve the problem.
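The resulting /etc/hosts fragment might look like this (your-local-host-name is a placeholder):

127.0.1.1   your-local-host-name
::1         your-local-host-name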


Q: How do I copy a specified range of rows rather than the entire table?

A: In the current implementation, the Copy operation does not support copying ranges, but you can use the Merge command, which runs quickly. In simple cases (a single range, for instance), the ordered mode keeps the data sorted. Example command:

yt merge --src '//path/to/src/table[#100:#500]' --dst //path/to/dst/table --mode ordered

Q: How do I find out who is processing my tables?

A: To accomplish this, you can analyze the master server access log.


Q: How do I recover deleted data?

A: If the delete used the UI, and the Delete permanently option was not selected, you can look for the tables in the recycle bin under the relevant account folder.
If the delete used yt remove or similar API calls, recovery is not possible.


Q: Error "Operations of type "remote-copy" must have small enough specified resource limits in some of ancestor pools"

A: RemoteCopy operations create load on the cross-DC network.
To limit that load, an artificial restriction was introduced: RemoteCopy operations must run in a pool with a user_slots limit not exceeding 2000.
If the plan is to run only RemoteCopy operations in the pool, it is sufficient to set this limit on the pool:
yt set //sys/pools/..../your_pool/@resource_limits '{user_slots=2000}'

Data storage questions

Q: Can I put an arbitrary YSON structure in a table cell?

A: Yes. You can use columns of type any.
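For example, a sketch of creating a table with an any column (the path and column names are placeholders):

yt create table //path/to/table --attributes '{schema = [{name = "key"; type = "string"}; {name = "payload"; type = "any"}]}'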


Q: Can I modify a table's schema?

A: Yes. You can find examples and limits in the Table schema section.


Q: How large are the files the system supports?

A: The formal limit on file size is a 64-bit integer; in practice, the limit is the amount of free space on the cluster's nodes. Since a file is split into chunks, it does not have to fit on a single node and can therefore be larger than a single hard drive. A single job cannot process a file that large, so working with it comes down to using the read_file command and reading by ranges.
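A sketch of such a ranged read, assuming your CLI version exposes the read_file offset and length parameters as options (the path and the 100 MB length are placeholders):

yt read-file //path/to/file --offset 0 --length 104857600 > part_0.bin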


Q: How do I move a table from an HDD to an SSD?

A: You can change medium type for a table or a file by editing the table's primary medium via the primary_medium attribute. Before modifying this attribute, you need to unmount your table (and remount it when done to return things to the way they were). For example:

yt unmount-table --sync //home/path/to/table
yt set //home/path/to/table/@primary_medium ssd_blobs
yt mount-table --sync //home/path/to/table

Immediately after you set the attribute, new data will start writing to an SSD while old data will be moved in the background. For more information about controlling this process and tracking its progress, please see the section on static tables.


Q: What do I do if table reads are slow?

A: There is a dedicated page about this.


Q: How do I reduce the number of chunks I am using in my quota?

A: If these chunks are taken up by tables (the most common scenario), run a Merge with combine_chunks = %true.
This will rebuild the table from larger chunks, reducing the chunk usage within your quota. You can run the operation from the command line, substituting your own source and destination tables:

yt merge --src table --dst table --spec "{combine_chunks=true}"

There is also a way of monitoring chunk usage without running a separate operation. For more information, see Merging chunks automatically on operation exit.

Files may also end up with many chunks in certain situations, for instance when small fragments are continuously appended to existing files. At the moment, there is no ready-made method similar to Merge for combining file chunks. You can run yt read-file //path/to/file | yt write-file //path/to/file, which routes the entire data stream through the client.


Q: I am getting the Format "YamredDsv" is disabled error. What should I do?

A: The YAMRED_DSV format is no longer supported. Use a different format.

MapReduce questions

Q: Where do I get the simplest client and a step-by-step procedure for running MapReduce?

A: We recommend reviewing the Trial section as well as reading about working with YTsaurus from the console.


Q: Are table indexes/names always available in jobs?

A: Table indexes are available at all stages except the reduce phase of MapReduce. Table indexes are also available in the native C++ wrapper.


Q: If a cluster is running several tasks concurrently, how do they get assigned slots?

A: YTsaurus assigns slots based on the fair share algorithm with the number of slots recomputed dynamically while the task is running. For more information, see Scheduler and pools.


Q: What are the factors that go into the overhead of merging a table with a small delta of the same table?

A: Overhead depends on the number of affected chunks that rows from the delta will be written to.


Q: Can I start mappers from the python interactive shell?

A: No, this functionality is not supported.


Q: Reading a table or a file produces the following message: "Chunk ... is unavailable." What should I do?

Q: The operation running on the cluster has slowed or stopped. The following message is displayed: "Some input chunks are not available." What should I do?

Q: The operation running on the cluster has slowed or stopped. The following message is displayed: "Some intermediate outputs were lost and will be regenerated." What should I do?

A: Some of the data has become unavailable because cluster nodes have failed. The unavailability may mean that one replica of the data has disappeared (if erasure coding is used) or that all replicas have disappeared (if it is not). In either case, you need to wait for the data to be recovered or for the failed cluster nodes to be repaired. You can monitor cluster status via the web interface's System tab (the Lost Chunks, Lost Vital Chunks, Data Missing Chunks, and Parity Missing Chunks parameters). You can also terminate an operation that is waiting for missing data early and get partial output: use the complete-op command in the CLI or the Complete button on the operation page in the web interface.


Q: I am getting the following error: "Table row is too large: current weight ..., max weight ... or Row weight is too large." What is this and what do I do about it?

A: A table row weight is computed as the sum of the lengths of all the row's column values. The system has a limitation on row weight which helps control the size of the buffers used for table writes. The default limit is 16 MB. To increase this value, you need to set the max_row_weight option in the table_writer configuration.

Value lengths are a function of type:

  • int64, uint64, double: 8 bytes.
  • boolean: 1 byte.
  • string: String length.
  • any: Length of structure serialized as binary yson, bytes.
  • null: 0 bytes.

If you are getting this error when you start a MapReduce, you need to configure the specific table_writer servicing the relevant step in the operation: --spec '{JOB_IO={table_writer={max_row_weight=...}}}'.

The name of the JOB_IO section is selected as follows:

  1. For operations with a single job type (Map, Reduce, Merge, and so on), JOB_IO = job_io.
  2. For Sort operations, JOB_IO = partition_job_io | sort_job_io | merge_job_io, and we recommend doubling all limits until you find the right ones.
  3. For MapReduce operations, JOB_IO = map_job_io | sort_job_io | reduce_job_io. You can increase certain limits if you are sure where exactly large rows occur.

The maximum allowed value of max_row_weight is 128 MB.
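For illustration, a sketch for a plain Map operation raising the limit to 64 MB; the paths and the cat command are placeholders:

yt map cat --src //path/to/input --dst //path/to/output --format yson --spec '{job_io = {table_writer = {max_row_weight = 67108864}}}'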


Q: I am getting error "Key weight is too large." What is this and what do I do about it?

A: A row's key weight is computed as the sum of the lengths of all its key column values. There is a limit on key weight; the default is 16 KB. You can increase the limit via the max_key_weight option.

Value lengths are a function of type:

  • int64, uint64, double: 8 bytes.
  • boolean: 1 byte.
  • string: String length.
  • any: Length of structure serialized as binary yson, bytes.
  • null: 0 bytes.

If you are getting this error when you start a MapReduce, you need to configure the table_writer serving the relevant step in the operation:

--spec '{JOB_IO={table_writer={max_key_weight=...}}}'

The name of the JOB_IO section is selected as follows:

  1. For operations with a single job type (Map, Reduce, Merge, etc), JOB_IO = job_io.
  2. For sort operations, JOB_IO = partition_job_io | sort_job_io | merge_job_io, we recommend increasing all the limits right away.
  3. For MapReduce operations, JOB_IO = map_job_io | sort_job_io | reduce_job_io. You can raise individual limits if you are certain where exactly large rows occur but it is better to increase all limits as well.

The maximum value of max_key_weight is 256 KB.

Attention!

Chunk boundary keys are stored on the master servers; therefore, raising this limit is prohibited except when repairing production. Before increasing the limit, write to the system administrator, stating that the limit is about to be increased and providing a rationale.


Q: Why is reduce_combiner taking a long time? What should I do?

A: Possibly the job code is rather slow, and it would make sense to make jobs smaller. reduce_combiner is triggered if the partition size exceeds data_size_per_sort_job; the amount of data in a reduce_combiner job equals data_size_per_sort_job. The default value of data_size_per_sort_job is specified in the YTsaurus scheduler configuration, but it can be overridden via the operation specification (in bytes):
yt map_reduce ... --spec '{data_size_per_sort_job = N}'


Q: I feed several input tables to MapReduce without specifying a mapper. At the same time, I am unable to get input table indexes in the reducer. What seems to be the problem?

A: Input table indexes are available to mappers only. If you fail to specify a mapper, it will automatically be replaced with a trivial one. Since a table index is an input row attribute rather than part of the data, a trivial mapper will not retain table index information. To resolve the problem, you need to write a custom mapper where you would save the table index in some field.


Q: How do I terminate pending jobs without crashing the operation?

A: You can terminate jobs individually via the web interface.
You can complete the entire operation using the CLI: yt complete-op <id>.


Q: What do I do if am unable to start operations from IPython Notebook?

A: If your error message looks like this:

Traceback (most recent call last):
  File "_py_runner.py", line 113, in <module>
    main()
  File "_py_runner.py", line 40, in main
    ('', 'rb', imp.__dict__[__main_module_type]))
  File "_main_module_9_vpP.py", line 1
    PK
      ^
SyntaxError: invalid syntax

You need to run the following before starting any operations:

import hashlib

import yt.wrapper
# Assumed location of the chunk_iter_stream helper; it may differ between versions.
from yt.wrapper.common import chunk_iter_stream

def custom_md5sum(filename):
    # Salted md5 of the file contents (under Python 3 the salt must be bytes, e.g. b"salt").
    with open(filename, mode="rb") as fin:
        h = hashlib.md5()
        h.update("salt")
        for buf in chunk_iter_stream(fin, 1024):
            h.update(buf)
    return h.hexdigest()

yt.wrapper.file_commands.md5sum = custom_md5sum

Q: Which account will own stored intermediate data for MapReduce and Sort?

A: The default account used is intermediate but you can change this behavior by overriding the intermediate_data_account parameter in the operation spec. For more information, see Operation settings.


Q: Which account will own stored operation output?

A: The account the output tables happen to belong to. If the output tables did not exist prior to operation start, they will be created automatically with the account inherited from the parent directory. To override these settings, you can create the output tables in advance and configure their attributes any way you want.
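For example, a sketch of pre-creating an output table under a specific account (the path and account name are placeholders):

yt create table //path/to/output --attributes '{account = "my-account"}'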


Q: How do I increase the number of Reduce jobs? The job_count option is not working.

A: Most likely, the input table is too small, and the scheduler does not have enough key samples to create more jobs. To get more jobs out of a small table, you will have to forcibly rewrite the table so that it has more chunks. You can do this with Merge using the desired_chunk_size option. To create 5 MB chunks, for instance, run the command below:

yt merge --src //path/to/table --dst //path/to/table --spec '{job_io = {table_writer = {desired_chunk_size = 5000000}}; force_transform = %true}'

An alternative solution is to use the pivot_keys option to explicitly define the boundary keys between which jobs must be started.
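A sketch of the pivot_keys approach, assuming the input is sorted by a single string column named key; the paths, the trivial cat reducer, and the pivot values are placeholders:

yt reduce cat --src //path/to/input --dst //path/to/output --reduce-by key --format yson --spec '{pivot_keys = [["a"]; ["h"]; ["p"]]}'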


Q: I am attempting to get sorted MapReduce output, using the input keys for the output. The jobs are crashing with "Output table ... is not sorted: job outputs have overlapping key ranges" or "Sort order violation". What seems to be the problem?

A: Sorted operation output is only possible if jobs produce collections of rows in non-intersecting ranges. In MapReduce, input rows are grouped based on a hash of the key. Therefore, in the scenario described, job ranges will intersect. To work around the issue, you need to use Sort and Reduce in combination.


Q: When I start an operation, I get "Maximum allowed data weight ... exceeded". What do I do about it?

A: The error means that the system has spawned a job with an input that is too big: over 200 GB of input data. This job would take too long to process, so the YTsaurus system is preemptively protecting users from this type of error.


Q: When I launch a Reduce or a MapReduce, I can see that the amount of data coming in to the reduce jobs varies greatly. What is the reason, and how do I make the split more uniform?

A: The large variation is often the result of skewed input, meaning that some keys have noticeably more data than others. In that case, you might want to approach the problem differently, for example by using combiners.

If the problem arises in a Reduce whose input includes many tables (typically dozens or hundreds), the scheduler may not have enough samples to split the input data precisely and uniformly. In this case, it is a good idea to use MapReduce instead of Reduce.


Q: A running Sort or MapReduce generates "Intermediate data skew is too high (see "Partitions" tab). Operation is likely to have stragglers". What should I do?

A: This means that partitioning split the data very non-uniformly. For MapReduce, this most likely indicates input data skew (some keys have noticeably more data than others).

This is also possible for a Sort and is related to the nature of the data and the sampling method. There is no simple solution for this issue as far as Sort is concerned. We recommend contacting the system administrator.


Q: How do I change the limit on the number of failed jobs after which the entire operation fails?

A: The limit is controlled by the max_failed_job_count setting. For more information, see Operation settings.


Q: The operation page displays "Average job duration is smaller than 25 seconds, try increasing data_size_per_job in operation spec"?

A: The message is an indication that the operation's jobs are too short, so their startup overhead slows the operation down and wastes cluster resources. To correct the situation, increase the amount of data fed to each job by raising the relevant settings in the operation spec:

  • Map, Reduce, JoinReduce, Merge: data_size_per_job.
  • MapReduce:
    • For map/partition jobs: data_size_per_map_job.
    • For reduce jobs: partition_data_size.
  • Sort:
    • For partition jobs: data_size_per_partition_job.
    • For final_sort jobs: partition_data_size.

The default values are listed in the sections on specific operation types.
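For example, a sketch of requesting roughly 1 GB of input per Map job; the paths and the cat command are placeholders:

yt map cat --src //path/to/input --dst //path/to/output --format yson --spec '{data_size_per_job = 1073741824}'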


Q: The operation page is displaying "Aborted jobs time ratio ... is is too high. Scheduling is likely to be inefficient. Consider increasing job count to make individual jobs smaller"?

A: The message means that the jobs are too long. Because the pool's share of cluster resources changes constantly as other users start and finish operations, an operation's jobs are launched and later preempted, so the share of time wasted by aborted jobs becomes very large. The general recommendation is to keep jobs reasonably short; the best job duration is a few minutes. You can achieve this by reducing the amount of data fed to a single job via data_size_per_job, or by optimizing and speeding up the code.


Q: The operation page is displaying the following message: "Average CPU wait time of some of your job types is significantly high..."?

A: The message means that the jobs spent a significant share of their total runtime waiting for data from YTsaurus, or were stuck reading data from the local disk or over the network. In general, this means the CPU is not being used efficiently. If the jobs are waiting for data from YTsaurus, consider reducing their cpu_limit or moving the data to SSD for faster reads. If this is inherent to your process because it reads a lot from the job's local disk or makes network requests, either optimize the process or likewise reduce cpu_limit. Optimization here means restructuring the user code in the job so that disk reads or network requests stop being the bottleneck.


Q: What is the easiest method of sampling a table?

A: In the YTsaurus system, you can request input sampling for any operation.
In particular, you can start a trivial map and get what you need as follows:
yt map cat --src //path/to/input --dst //path/to/output --spec '{job_io = {table_reader = {sampling_rate = 0.001}}}' --format yson

Or simply read the data:
yt read '//path/to/input' --table-reader '{sampling_rate=0.001}' --format json


Q: A running operation generates the following warning: "Account limit exceeded" and stops. What does it mean?

A: The message indicates that the suspend_operation_if_account_limit_exceeded spec parameter is enabled and that the account hosting the operation's output tables has run out of one of its quotas, for example the disk space quota. You need to figure out why this happened and resume the operation. You can view the account's quotas on the Accounts page of the web interface.


Q: A running operation remains in pending mode a long time. When will it execute?

A: The YTsaurus system has a limitation on the number of concurrently executing operations in each pool (as opposed to operations launched, or accepted for execution). By default, this limit is not large (around 10). Whenever a pool's limit on the number of executing operations is reached, new operations are queued. Queued operations will proceed when previous operations in the same pool exit. The limit on the number of executing operations is applicable at all levels of the pool hierarchy, that is to say, that if an operation is launched in pool A, it may be classified as pending not only if the limit is reached in pool A itself but also in any of pool A's parents. For more information on pools and pool configuration, please see Scheduler and pools. If there is a reasonable need to run more operations concurrently, you need to send a request to the system administrator.


Q: A running operation generates the following warning: "Excessive job spec throttling is detected". What does it mean?

A: This message is an indication that an operation is computationally intensive from the standpoint of the resources used by the scheduler itself in the operation setup. This situation is normal behavior for a cluster at load. If you believe that the operation is taking unacceptably long to complete and continues in a starving state a long time, you need to advise your system administrator accordingly.


Q: A running operation generates the following message: "Average cpu usage... is lower than requested 'cpu_limit'". What does it mean?

A: The message means that the operation is using much less CPU than it requested (by default, one HyperThreading core). This blocks more CPU than the operation actually uses and makes the use of the pool's CPU quota inefficient. If this behavior is expected, reduce the operation's cpu_limit (fractional values are allowed); otherwise, review the jobs' runtime statistics and profile a job while it is running to understand what it is doing.


Q: A running operation displays the following warning: "Estimated duration of this operation is about ... days". What does it mean?

A: The message is an indication that the expected time to complete the operation is too long. The expected completion time is computed as an optimistic estimate over the running and pending jobs. Because the cluster is updated from time to time and operations may restart, a large amount of already spent resources may go to waste. We recommend breaking the operation down into smaller ones or looking for ways to significantly increase the quota under which it runs.


Q: A running operation generates the following warning: "Scheduling job in controller of operation <operation_id> timed out". What does it mean?

A: The warning means that an operation controller is not able to launch an operation job in the time allotted. This may occur if the operation is very heavy or if the scheduler is under heavy load. If you believe that an operation is taking very long to run and continues in a starving state for a long time, you should advise your system administrator accordingly.


Q: A running operation displays warning "Failed to assign slot index to operation". What does it mean?

A: If this happens, contact the administrator.


Q: A running operation generates warning "Operation has jobs that use less than F% of requested tmpfs size". What does it mean?

A: You requested tmpfs for the jobs in the specification (the warning attributes show which job types) but are using only part of it (beyond certain thresholds). tmpfs size is included in the memory limit, so a job requests a lot of memory and then does not use it. First, this reduces actual memory utilization on the cluster. Second, large tmpfs requests can slow down job scheduling, since it is much more likely that the cluster has a slot with 1 GB of memory than one with 15 GB. You should request only as much tmpfs as your jobs actually need. You can review the warning attributes or the user_job/tmpfs_size statistic to find out how much tmpfs the jobs actually use.
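For example, a sketch of requesting a smaller tmpfs for mapper jobs; the 1 GB size and the tmpfs path are placeholders:

yt map ... --spec '{mapper = {tmpfs_path = "tmpfs"; tmpfs_size = 1073741824}}'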


Q: When I launch an operation, I get the "No online node can satisfy the resource demand" error. What do I do?

A: The message is an indication that the cluster does not have a single suitable node to start the operation job. This normally happens in the following situations, for instance:

  • The operation's CPU or memory requirements are so large that no single cluster node has enough resources. For example, if a job requests 1 TB of memory and 1,000 CPUs, the operation will not run and will return an error to the client, since YTsaurus clusters have no nodes with such characteristics.
  • The operation specifies a scheduling_tag_filter that no cluster node matches.

Q: When I start a Merge or Reduce operation, I get the error "Maximum allowed data weight violated for a sorted job: xxx > yyy"

A: When building jobs, the scheduler estimated that one job would receive too much data (hundreds of gigabytes) and could not make a smaller job. The following options are available:

  • When you are using Reduce, and the input table has a monster key meaning that a single row in the first table corresponds to a large number of rows in another, as a result of the Reduce guarantee, all rows with this key must go into a single job, and the job will run indefinitely. You should use MapReduce with the trivial mapper and the reduce combiner to pre-process monster keys.
  • There are very many input tables being fed to an operation (100 or more) because chunks at the range boundary are not being counted precisely. The general observation is that the more input tables the less efficient the use of sorted input. You may want to use MapReduce.
  • When using Merge, this error may result from suboptimal scheduler behavior. In that case, write to community@ytsaurus.tech.

The above recommendations notwithstanding, if you are certain that you would like to launch the operation anyway and are ready for it to take a very long time, you can increase the value of the max_data_weight_per_job parameter, which will start the operation.


Q: A running operation produces the following warning: "Legacy live preview suppressed", and live preview is not available. What does it mean?

A: Live preview is a mechanism that creates heavy load for the master servers; therefore, by default, it is disabled for operations launched under robotic users.

If you wish to force activate live preview, use the enable_legacy_live_preview = %true option in the operation spec.

If you wish to disable this warning, use the enable_legacy_live_preview = %false option in the operation spec.

Dynamic table questions

Q: When working with dynamic tables from Python API or C++ API, I get the "Sticky transaction 1935-3cb03-4040001-e8723e5a is not found" error, what should I do?

A: The answer depends on how transactions are used. If a master transaction is used, it makes no sense here, and you must run the query outside the transaction. To do so, either create a separate client or explicitly specify that the query must run under the null transaction (with client.Transaction(transaction_id="0-0-0-0"): ...).
Full use of tablet transactions in the Python API is only possible via RPC proxy (yt.config['backend'] = 'rpc'). Using tablet transactions over HTTP is impossible in the current implementation. In this case, write to yt@ and describe your task.


Q: When writing to a dynamic table, the "Node is out of tablet memory; all writes disabled" or "Active store is overflown, all writes disabled" error occurs. What does it mean and how should I deal with it?

A: The error occurs when the cluster node runs out of memory for data that has not yet been written to disk: the incoming data stream is too large, and the node does not have time to compress the data and write it to disk. Requests that fail with such errors should be retried, possibly with increasing delays. If the error recurs regularly, it may mean (barring off-nominal situations) that individual tablets are overloaded with writes or that the cluster's capacity is insufficient for this load. Increasing the number of tablets can also help (the reshard-table command).
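A sketch of such a reshard, assuming your CLI version accepts a target tablet count via --tablet-count; the path and the count of 16 are placeholders, and a sorted table must be unmounted before resharding:

yt unmount-table --sync //path/to/table
yt reshard-table //path/to/table --tablet-count 16
yt mount-table --sync //path/to/table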


Q: What does "Too many overlapping stores" mean? What should I do?

A: This error is an indication that tablet structure is such that the dynamic store coverage of the key range being serviced by the tablet is too dense. Dense coverage with stores leads to degradation of read performance, so in this situation a protection mechanism is activated that prevents new data from being written. The background compaction and partitioning processes should gradually normalize the tablet structure. If this does not happen, the cluster may be failing to cope with the load.


Q: When querying a dynamic table, I get the "Maximum block size limit violated" error

A: The query involves a dynamic table that was converted from a static table without the block_size parameter being specified. If you receive an error like this, make sure you followed all the instructions from the section about converting a static table into a dynamic table. If the block size is large, increase max_unversioned_block_size to 32 MB and remount the table. This can happen if the table's cells store large binary values that are written to a single block in their entirety.


Q: When querying a dynamic table, I get the "Too many overlapping stores in tablet" error

A: Most likely, the tablet can't cope with the write flow and new chunks don't have time to compact. Check that the table was sharded for a sufficient number of tablets. When writing data to an empty table, disable auto-sharding, because small tablets will be combined into one.


Q: When querying a dynamic table, I get the "Active store is overflown, all writes disabled" error

A: The tablet can't cope with the write flow — it doesn't have time to dump the data to the disk or it can't do it for some reason. Check for errors in the @tablet_errors table attribute and if there are none, check sharding as above.


Q: When querying a dynamic table, I get the "Too many stores in tablet, all writes disabled" error

A: The tablet is too large. Reshard the table into more tablets. Note that auto-sharding limits the number of tablets to the number of cells multiplied by the value of the tablet_balancer_config/tablet_to_cell_ratio parameter.


Q: When querying a dynamic table, I get the error "Tablet ... is not known"

A: The client sent a query to a cluster node that no longer serves the tablet. This usually happens as a result of automatic tablet balancing or cluster node restarts. Resend the query; the error will disappear once the cache is updated. Alternatively, disable balancing.


Q: When querying a dynamic table, I get the "Service is not known" error

A: The client sent a query to a cluster node that no longer hosts the tablet cell. This usually happens when cells are rebalanced. Resend the query; the error will disappear once the cache is updated.


Q: When querying a dynamic table, I get the "Chunk data is not preloaded yet" error

A: The message is specific to tables whose in_memory_mode is set to something other than none. Such a table is kept fully in memory while mounted, and reading from it requires all of its data to be loaded into memory. If the table was recently mounted, the tablet was moved to a different cell, or the YTsaurus process restarted, the data is not in memory yet, which produces this error. You need to wait for the background process to load the data into memory.


Q: In tablet_errors, I see the "Too many write timestamps in a versioned row" or "Too many delete timestamps in a versioned row" error

A: Sorted dynamic tables store many versions of the same value at the same time. In the lookup format, each key can have at most 2^16 versions. A simple workaround is to use the columnar format (@optimize_for = scan). In practice, such a large number of versions is rarely needed; it usually results from misconfiguration or a programming error. For example, with atomicity=none you can update the same table key at a very high rate (in this mode there is no row locking, and transactions with overlapping time ranges can update the same key), which is not recommended. If writing a large number of versions is a genuine product need, such as frequent delta writes to aggregation columns, set the @merge_rows_on_flush=%true table attribute and configure TTL deletion correctly so that only the small number of actually needed versions is written to the chunk on flush.
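For example, a sketch of enabling the attribute (the path is a placeholder; the TTL-related attributes still need to be configured as described in the documentation):

yt set //path/to/table/@merge_rows_on_flush '%true'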


Q: When querying Select Rows, I get the "Maximum expression depth exceeded" error

A: This error occurs if the expression tree depth is too large. This usually happens when writing expressions like
FROM [...] WHERE (id1="a" AND id2="b") OR (id1="c" AND id2="d") OR ... <a few thousand conditions>
Instead, you need to write the condition in the form FROM [...] WHERE (id1, id2) IN (("a", "b"), ("c", "d"), ...)
The first option has several problems. Queries are compiled into machine code, and the compiled code is cached, with the query structure (excluding constants) serving as the cache key:

  1. In the first case, as the number of conditions increases, query structure will constantly change. There will be code generation for each query.
  2. The first query will generate a very large amount of code.
  3. Compiling the query will be very slow.
  4. The first form is evaluated by simply checking all the conditions one by one; no transformation into a more efficient algorithm is performed.
  5. Besides that, if the columns are key columns, read ranges are inferred from the condition so that only the relevant data is read. The read range inference algorithm works better with the IN variant.
  6. With IN, the condition check is a hash table lookup.

Q: When working with a dynamic table, I get the "Value is too long" error

A: There are fairly strict limits on value sizes in dynamic tables. A single value (table cell) must not exceed 16 MB, and an entire row must stay under 128 MB, or 512 MB taking all versions into account. A row can contain at most 1,024 values, counting all versions. There are also limits on the number of rows per request: by default, 100,000 rows per transaction when inserting, one million rows for a select, and five million for a lookup. Note that operation may become unstable near these thresholds. If you exceed the limits, you should change how the data is stored in the dynamic table and avoid such long rows.

YTsaurus CHYT questions


Q: Why does CHYT have cliques, while regular ClickHouse has nothing analogous? What is a clique?

A: There is a dedicated article about this.


Q: I am getting the "DB::NetException: Connection refused" or the "DB::Exception: Attempt to read after eof: while receiving packet" error. What does it mean?

A: This normally means that the CHYT process inside the Vanilla transaction crashed. You can view the aborted/failed job counters in the operation UI. If there are recent jobs aborted due to preemption, it means that the clique is short on resources. If there are recent failed jobs, please contact your system administrator.


Q: I am getting the "Subquery exceeds data weight limit: XXX > YYY" error. What does it mean?


Q: How do I save to a table?

A: There are INSERT INTO and CREATE TABLE statements. Learn more in the section on Differences from ClickHouse.


Q: How do I load geo-dicts in my own clique?

A: When starting any clique, you can specify the --cypress-geodata-path option that enables you to specify the path to geo-dicts in Cypress.


Q: Can CHYT handle dates in TzDatetime format?

A: CHYT can handle dates in TzDatetime format just as well as regular ClickHouse: you will have to store the data as strings or numbers and convert it when reading and writing. Example date extraction by @gri201:

toDate(reinterpretAsInt64(reverse(unhex(substring(hex(payment_dt), 1, 8)))))

Q: How do I move a table to an SSD?

A: First, make sure that your YTsaurus account has a quota for the ssd_blobs medium. To do this, go to the account page, switch your medium type to ssd_blobs, and enter your account name. If you have no quota for the ssd_blobs medium, you can request it via a special form.

After obtaining the quota for the ssd_blobs medium, you will need to change the value of the primary_medium attribute, and the data will be moved to the corresponding medium in the background. Learn more in the section on storage.

For static tables, you can force a move using the Merge operation:

yt set //home/dev/test_table/@primary_medium ssd_blobs
yt merge --mode auto --spec '{"force_transform"=true;}' --src //home/dev/test_table --dst //home/dev/test_table

If the table is dynamic, to change the medium, you must first unmount the table,
set the attribute, and then re-mount it:

yt unmount-table //home/dev/test_table --sync
yt set //home/dev/test_table/@primary_medium ssd_blobs
yt mount-table //home/dev/test_table --sync

You can speed up the move further with forced_compaction, but this method creates a heavy load on the cluster and is strongly discouraged.

To verify that the table has in fact changed its medium, use the command below:

yt get //home/dev/test_table/@resource_usage

{
    "tablet_count" = 0;
    "disk_space_per_medium" = {
        "ssd_blobs" = 930;
    };
    "tablet_static_memory" = 0;
    "disk_space" = 930;
    "node_count" = 1;
    "chunk_count" = 1;
}

Q: Is the SAMPLE construct of the ClickHouse language supported?

A: CHYT supports the SAMPLE clause. The difference is that CHYT ignores the OFFSET ... part, so you cannot get a sample from a different portion of the selected data.

Example:

SELECT count(*) FROM "//tmp/sample_table" SAMPLE 0.05;

SELECT count(*) FROM "//tmp/sample_table" SAMPLE 1/20;

SELECT count(*) FROM "//tmp/sample_table" SAMPLE 100500;

Q: How do I get the table name in a query?

A: You can use the $table_name and $table_path virtual columns. For more information about the virtual columns, see Working with YTsaurus tables.