Files
This section contains information on YTsaurus system objects designed to store large amounts of binary data. You can also store binary data in tables.
General information
A file is a Cypress node of type file designed to store large binary data in the system.
You can store binary data both in files and in tables.
Sometimes, jobs need to save large binary data. The obvious solution is to write the file from each job to Cypress. However, this method has some drawbacks:
- It puts a heavy load on the proxy servers.
- It creates a large number of objects in Cypress, which makes working with them less efficient.
We propose an alternative: writing binary data to a single table.
Usage
Files support appends via the write_file command. Each write adds new chunks to the file, so we recommend against performing many small writes.
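One way to avoid many small writes is to accumulate data on the client and append it in larger portions. The following is a minimal sketch of that idea, not part of any YTsaurus client: the BufferedAppender class, the append_fn callable (a stand-in for whatever actually performs the append, for example a wrapper around write_file), and the default threshold are all assumptions made for illustration.

```python
import io
from typing import Callable


class BufferedAppender:
    """Collects small writes and hands them over as one larger append.

    `append_fn` is a hypothetical callable that performs the real append;
    it is an assumption of this sketch, not a YTsaurus API.
    """

    def __init__(self, append_fn: Callable[[bytes], None],
                 threshold: int = 64 * 1024 * 1024) -> None:
        self._append_fn = append_fn
        self._threshold = threshold  # illustrative default, in bytes
        self._buffer = io.BytesIO()

    def write(self, data: bytes) -> None:
        self._buffer.write(data)
        # Flush only once enough data has accumulated.
        if self._buffer.tell() >= self._threshold:
            self.flush()

    def flush(self) -> None:
        payload = self._buffer.getvalue()
        if payload:
            self._append_fn(payload)
        # Start a fresh buffer for subsequent writes.
        self._buffer = io.BytesIO()
```

Call flush() once at the end so that any data remaining below the threshold is also appended.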
To read files, use the read_file command. You can either read an entire file or specify a read range in the form of an offset in bytes. To do this, use the offset selector in the YPath language.
Each read requires a metadata request to the master server, so we recommend against issuing many small reads.
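If a client needs several byte ranges from the same file, it can reduce the number of reads by merging nearby ranges before issuing them. The sketch below is purely client-side and calls no YTsaurus API; the coalesce_ranges function and its max_gap parameter are hypothetical names introduced for illustration. Each resulting range would then be fetched with a single read.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (offset, length), both in bytes


def coalesce_ranges(ranges: List[Range], max_gap: int = 0) -> List[Range]:
    """Merge byte ranges that overlap or lie within `max_gap` bytes of each other."""
    merged: List[Range] = []
    for offset, length in sorted(ranges):
        if merged:
            prev_offset, prev_length = merged[-1]
            prev_end = prev_offset + prev_length
            if offset <= prev_end + max_gap:
                # Extend the previous range instead of starting a new one.
                new_end = max(prev_end, offset + length)
                merged[-1] = (prev_offset, new_end - prev_offset)
                continue
        merged.append((offset, length))
    return merged
```

A larger max_gap trades some extra bytes transferred for fewer reads.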
Using in jobs
One way to use files is to deliver the same data, such as lookups, to jobs. In this case, the file's chunks are downloaded into the local cache of every node that requests the file.
If a file consists of multiple chunks or has compression_codec set, the cache rebuilds the original file from the chunks. This file is referred to as an artifact. There are two ways to deliver an artifact into a file in a job sandbox, which is a special directory that every job creates:
- By default, a link to the artifact will be created in the job sandbox. This behavior is preferred since it works faster and creates less hard drive IO load.
- If you include the copy_file option in the specification, the file will be copied into the sandbox. It does not make sense to use this behavior unless you need to mount the entire sandbox in tmpfs.
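The difference between the two delivery modes can be illustrated with a local simulation. This is only a sketch and does not reflect how YTsaurus implements delivery: the deliver_artifact function and its parameters are hypothetical, and the use of a symbolic link is an assumption, since the text above only says that a link is created.

```python
import os
import shutil


def deliver_artifact(artifact_path: str, sandbox_dir: str, name: str,
                     copy_file: bool = False) -> str:
    """Place an artifact into a sandbox directory as a link (default) or a copy."""
    target = os.path.join(sandbox_dir, name)
    if copy_file:
        # Copy mode: the sandbox gets its own bytes, at the cost of extra disk IO.
        shutil.copyfile(artifact_path, target)
    else:
        # Link mode: no data is duplicated; the sandbox entry points at the cached artifact.
        os.symlink(artifact_path, target)
    return target
```

In link mode no file data is written to the sandbox, which is consistent with the lower IO load described above.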
If a file's executable attribute is set to true, it will be executable in the job sandbox. By default, the attribute is set to false.
You can use the file_name attribute to manage the name of a file or link created in the sandbox. If this attribute is absent, the Cypress filename is used instead.
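The naming rule above can be written as a small function. This is an illustrative sketch, not YTsaurus code: sandbox_file_name is a hypothetical helper, and it assumes that the Cypress filename is the last component of the node's path.

```python
from typing import Optional


def sandbox_file_name(cypress_path: str, file_name: Optional[str] = None) -> str:
    """Return the name a file would get in the sandbox under the rule above."""
    if file_name is not None:
        # An explicit file_name attribute takes precedence.
        return file_name
    # Otherwise fall back to the last component of the Cypress path (assumption).
    return cypress_path.rstrip("/").rsplit("/", 1)[-1]
```

For example, a file at the hypothetical path //home/user/lookup.bin with no file_name attribute would appear in the sandbox as lookup.bin.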
System attributes
Any file has the attributes shown in the table below.
The number of bytes in a file is stored in the uncompressed_data_size attribute, which every chunk owner has.
Attribute | Type | Description | Mandatory |
---|---|---|---|
executable | bool | Whether the file is executable. | No |
file_name | string | Filename when a file is moved into a job sandbox. | No |
All files are chunk owners, so they also have the corresponding chunk owner attributes.