Input/output settings
This section lists operation input/output configuration options as well as table read and write command options.
Note
This section considers only the important options that might help a user. In actuality, the options are far more numerous and provide fairly fine-grained control over the various data read/write parameters. More often than not, this kind of configuration is not required, and using non-default values might actually do harm.
You can control reading and writing via table data formats, TableReader/TableWriter, and ControlAttributes.
You control data reads via settings in two sections: table_reader and control_attributes.
You control data writes in the table_writer section.
TableReader section
These settings can be specified when reading tables using the table_reader option or in the table_reader section under job_io in the operation specification.
Sampling
Under table_reader, you can specify the sampling settings sampling_seed and sampling_rate:
...
spec = {"job_io": {"table_reader": {"sampling_seed": 42, "sampling_rate": 0.3}}}
yt.run_map(map_function, input, output, spec=spec)
...
yt.read_table(table_path, table_reader={"sampling_seed": 42, "sampling_rate": 0.3})
...
The value of "sampling_rate": 0.3
will result in the operation (map_function
) receiving 30% of all the input table rows as input.
The sampling_seed option controls the random number generator used to sample rows. It guarantees the same output for the same sampling_seed, a pure map_function, and the same collection of input chunks. If sampling_seed is not included in the specification, it will be chosen at random.
When sampling is used, all the data is still read from disk, regardless of the specified probability. Nonetheless, this can be less costly than sampling the data independently, since the sampling itself occurs on the system side immediately after the disk read.
To obtain maximum sampling read performance in an operation at the cost of reduced sampling quality, use the sampling option in the operation specification.
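A minimal sketch of what that might look like, assuming the operation-level sampling section takes a sampling_rate field (check the operation options documentation for the exact layout):
...
spec = {"sampling": {"sampling_rate": 0.3}}  # assumed operation-level section, not job_io/table_reader
yt.run_map(map_function, input, output, spec=spec)
...
The quality trade-off comes from the system being free to sample at a coarser granularity than individual rows, which is presumably what lets it avoid reading all the data.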
Buffer size for data fed to a job
You control the job data feed buffer via the window_size parameter, expressed in bytes. The default value is 20 MB.
Configuring the job data feed buffer:
...
spec = {"job_io": {"table_reader": {"window_size": 20971520}}}
yt.run_map(map_function, input, output, spec=spec)
...
yt.read_table(table_path, table_reader={"window_size": 20971520})
...
TableWriter section
These settings can be specified when writing table data using the table_writer option or in the table_writer section under job_io in the operation specification.
Limit on table row size
When writing table data, the YTsaurus system checks the size of each row. If a row exceeds a certain limit, the write stops and the relevant job or write command fails.
The default row size limit is 16 MB; it can be configured in the table_writer section.
Updating the table row size constraint:
...
spec = {"job_io": {"table_writer": {"max_row_weight": 32 * 1024 * 1024}}}
yt.run_map(map_function, input, output, spec=spec)
...
yt.write_table(table_path, table_writer={"max_row_weight": 32 * 1024 * 1024})
...
max_row_weight is specified in bytes and cannot exceed 128 MB.
table_writer also controls various storage settings at table creation time or when calling merge.
Chunk size
The desired_chunk_size specification setting defines the desired chunk size in bytes.
To modify a table to achieve a chunk size close to 1 KB, you can call the merge command.
Modifying chunk size:
yt merge --mode auto --spec '{"force_transform"=true; "job_io"={"table_writer"={"desired_chunk_size"=1024}}}' --src <> --dst <>
Data fed to a job cannot be smaller than the block size. Block size is defined by block_size, in bytes; the minimum value is 1 KB, and the default is 16 MB.
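Since the minimum unit of data fed to a job is a block, and the default block size (16 MB) is far larger than the 1 KB chunk target above, it may be necessary to lower block_size together with desired_chunk_size. An illustrative sketch (the parameter combination and values are assumptions, not a verified recipe):
yt merge --mode auto --spec '{"force_transform"=true; "job_io"={"table_writer"={"desired_chunk_size"=1024; "block_size"=1024}}}' --src <> --dst <>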
Replication factor
The upload_replication_factor specification setting defines the number of replicas of table or file chunks being written. The default is 2, and the maximum is 10.
To modify the number of replicas, simply call the merge command.
Modifying the replication factor:
yt merge --mode auto --spec '{"force_transform"=true; "job_io"={"table_writer"={"upload_replication_factor"=10}}}' --src <> --dst <>
For the modification to take place in the background at a later time, call set on the table's replication_factor attribute. The same applies to writing a table with this parameter predefined.
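For example, with the CLI this can be done using the set command (the table path is illustrative):
yt set '//path/to/table/@replication_factor' 5
The cluster then adjusts the number of replicas in the background.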
Sampling rate
The sample_rate specification setting defines the fraction of rows to be selected and placed in chunk metadata.
This row selection affects the construction of buckets for the sort's partition stage. Increasing the sample_rate value helps the system compute the key distribution more precisely, but increases system load by increasing the amount of metadata and reducing read and write performance.
To avoid skewed bucket sizes when a column with few distinct values would place many rows in a single partition, increase sample_rate prior to the sort or the reduce stage.
The default value is 1/10,000. If a table has, say, 100,000 rows, the sample will include 10 rows. Consequently, the sort will have no more than 10 partitions.
To obtain a sample of 0.1% of the table (maximum sample size), run a merge.
Modifying the sampling rate:
yt merge --mode auto --spec '{"force_transform"=true; "job_io"={"table_writer"={"sample_rate"=0.001}}}' --src <> --dst <>
ControlAttributes section
When reading several tables or several ranges of the same table, it is useful to have housekeeping information that indicates which table or read range the data is coming from.
To make this information available in jobs or when reading tables, the system provides special system management (control) records that carry it.
This concept is internal, and its representation in a job's input data stream depends on the requested format.
Let's look at control attributes using the YSON format as an example. In this format, the stream may include, in addition to regular data records of the YSON map type, special records of the YSON entity type whose attributes carry the housekeeping information.
For instance, if the record <table_index=2># comes in, it means that all subsequent records in the stream come from the input table with index 2, until a new control record arrives with a different table_index.
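For example, a job reading two input tables might see a stream like this (the data rows are illustrative):
<"table_index"=0;>#;
{"a"="1";};
<"table_index"=1;>#;
{"b"="2";};
Both data records are regular YSON maps; the entity records announce which input table the subsequent rows belong to.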
The control_attributes section supports the following options; default values are given in parentheses.
- enable_table_index (false): send control records with the input table index (the table_index field of type int64 in the control record attributes).
- enable_row_index (false): send control records with the input table row index (the row_index field of type int64 in the control record attributes).
- enable_range_index (false): send control records with the number of the range among the requested input table ranges; see ranges (the range_index field of type int64 in the control record attributes).
- enable_key_switch (false): send records on input data key switch (the key_switch field of type bool in the control record attributes). This setting is only valid for the reduce phase of the Map-Reduce and Reduce operations; see the sketch after this list.
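A minimal sketch of enabling key switches in a reduce operation (reduce_function and the reduce_by column name are illustrative):
...
spec = {"job_io": {"control_attributes": {"enable_key_switch": True}}}
yt.run_reduce(reduce_function, input, output, reduce_by=["key"], spec=spec)
...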
Note
The processing of control attributes in the Python API and the C++ API is transparent to the user: their parsing and conversion to a more convenient representation are handled by the API.
Options that control the transmission of table indexes in the input stream are available at different system layers.
The enable_input_table_index option is found in the operation spec section describing the user process. The value of this option overrides the value of enable_table_index under control_attributes.
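For instance, a sketch of setting this option in the section describing the user process (assumed here to be the mapper section of a Map operation):
...
spec = {"mapper": {"enable_input_table_index": True}}
yt.run_map(map_function, input, output, spec=spec)
...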
The enable_table_index option enables and disables table indexes at the format level. Many formats have these indexes disabled by default, so they must be enabled explicitly. Many formats also have options that control how table_index is represented in that format. For more information on table_index in various formats, see Table switching.
The row_index, range_index, and key_switch representations are only supported by the YSON and JSON formats.
Input data stream showing all the control attributes in a YSON format reduce job:
<"table_index"=0;>#;
<"range_index"=0;>#;
<"row_index"=2;>#;
{"a"="2";};
<"key_switch"=%true;>#;
{"a"="3";};
<"key_switch"=%true;>#;
<"row_index"=0;>#;
{"a"="1";};
Enabling control attributes when reading a table and in an operation:
yt read --control-attributes="{enable_row_index=true;}" "//path/to/table[#10:#20]"
{"$value": null, "$attributes": {"row_index": 10}}
...
...
spec = {"job_io": {"control_attributes": {"enable_row_index": True}}}
yt.run_map(map_function, input, output, spec=spec) # will place an "@row_index" key in the input records
...
yt.read_table(path, control_attributes={"enable_row_index": True}, format=yt.YsonFormat())
...