Data schema
General description
The table schema describes the known columns and their types. The schema can be read from the `@schema` system attribute that every table has. To set the schema, set this attribute when creating a table. To change the schema of an existing table, use the `alter_table` command. For more information, see Examples of working with a schema.
The `TableSchema` type is a list of `ColumnSchema` elements. Each such element is a dict with the following structure:
| Name | Type | Description | Mandatory |
|---|---|---|---|
| `name` | `string` | Column name. | Yes |
| `type` | `string` | Type of column elements. | Yes, if `type_v3` is not set |
| `type_v3` | `any` | Alternative type representation. This field enables you to specify columns with composite types. | Yes, if `type` is not set |
| `sort_order` | `string` | The column sort order: either a missing value or `ascending` is allowed. If the value is set, the column is a key column. | No |
| `lock` | `string` | Lock name. For non-key columns only. | No |
| `expression` | `string` | An expression defining the value of the computed column. For key columns only. For more information, see Sharding. | No |
| `aggregate` | `string` | Aggregate function name. For non-key columns only. | No |
| `required` | `boolean` | If set to `%true`, the value in the column must not be `null`. The default value is `%false`. A column of the `any` type cannot have `required=%true`. | No |
| `group` | `string` | If two columns have the same group name, the data from those columns is stored in the same blocks. You can change the `group` attribute at any time using the `alter-table` command; the changes will be applied to new chunks. | No |
Schema properties:
- The column name must be a non-empty string. The maximum length is 256 characters.
- The column name cannot begin with the system prefix `@`.
- Column names cannot be repeated within a table.
- Key columns must form a prefix of the schema.
- All key columns must be specified in the schema.
- Up to 32,000 columns are supported, but we recommend limiting the number of columns to four digits.
- A column must always have an explicitly specified type, in either the `type` or the `type_v3` attribute.
- When reading a schema, both the `type` and `type_v3` attributes are always returned, regardless of how the schema was created.
- Computed columns can only be key columns, and they can depend only on non-computed key columns.
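The properties above can be checked mechanically. The sketch below builds a schema as plain Python dicts, assuming a hypothetical computed key column whose `expression` uses `farm_hash` (an illustrative function name; verify against your cluster's documentation), and asserts the key-prefix and naming rules:

```python
# A sketch of a schema that follows the rules above: key columns form a
# prefix, every column has an explicit type, and the computed column
# ("hash") is a key column that depends only on a non-computed key column.
# The expression "farm_hash(user_id)" is illustrative only.
schema = [
    {"name": "hash", "type": "uint64", "sort_order": "ascending",
     "expression": "farm_hash(user_id)"},
    {"name": "user_id", "type": "string", "sort_order": "ascending"},
    {"name": "payload", "type": "string"},
]

# Key columns must form a prefix of the schema.
key_flags = ["sort_order" in c for c in schema]
assert key_flags == sorted(key_flags, reverse=True)

# Column names are non-empty, unique, at most 256 chars, and not @-prefixed.
names = [c["name"] for c in schema]
assert all(0 < len(n) <= 256 and not n.startswith("@") for n in names)
assert len(set(names)) == len(names)
```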
Schema attributes:
- `strict` of the `boolean` type. The default value is `true`. If the table is strict, that is, the attribute has the `true` value, you cannot write data to a table column that is not in the schema. Otherwise, you are allowed to write to a missing column, which will appear automatically, but only in a new row.
- `unique_keys` of the `boolean` type. The default value is `false`. If the attribute has the `true` value and the table is sorted, all keys in the table must be unique.
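To make the attribute placement concrete, here is a small string-building sketch that renders a schema with `strict` and `unique_keys` in YSON text form, roughly as `yt get //path/@schema` would print it. The rendering function and its simplified formatting are assumptions for illustration, not a real YSON serializer:

```python
# Render a schema with the strict and unique_keys attributes attached to
# the column list, in simplified YSON text form. Illustrative only: a real
# client would serialize this via its YSON library.
def render_schema(columns, strict=True, unique_keys=False):
    def fmt(b):
        return "%true" if b else "%false"
    cols = "; ".join(
        "{" + "; ".join(f"{k} = {v}" for k, v in c.items()) + "}"
        for c in columns
    )
    return (f"<strict = {fmt(strict)}; unique_keys = {fmt(unique_keys)}> "
            f"[{cols}]")

print(render_schema([{"name": "a", "type": "int64"}], strict=True))
# <strict = %true; unique_keys = %false> [{name = a; type = int64}]
```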
Supported column types
YTsaurus supports the column types described in the Data types section.
Among them are both primitive and composite types.
Values in the table cells always have the type specified in the schema.
You can use one of the following methods to specify a type in the table schema:
- Using the `type` and (optional) `required` keys, you can specify only primitive or optional primitive types.
- Using the `type_v3` key.

The `type` key expects a string. The `type_v3` key expects either a string (for primitive types) or a YSON dict. A YSON dict always has the `type_name` key that stores the type name. Other keys depend on the specific type and are described on the Data types page.
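As a concrete illustration, the sketch below builds `type_v3` values as plain Python data: a string for a primitive type and dicts with `type_name` for an optional and a list type. The exact composite shapes should be verified against the Data types page:

```python
# Sketches of type_v3 values: a plain string for primitive types and a
# YSON-style dict (here a plain Python dict) carrying type_name for
# composite types. Treat the exact composite field names as assumptions
# to check against the Data types documentation.
primitive = "int64"                                   # type_v3 as a string
optional_str = {"type_name": "optional", "item": "string"}
list_of_ints = {"type_name": "list", "item": "int64"}

schema = [
    {"name": "id", "type_v3": primitive},
    {"name": "note", "type_v3": optional_str},
    {"name": "scores", "type_v3": list_of_ints},
]

# Every composite type_v3 dict carries the type_name key.
for column in schema:
    t = column["type_v3"]
    assert isinstance(t, str) or "type_name" in t
```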
Standard order relations apply to primitive types other than `any` and `yson`. Order relations also apply to `optional<T>` types where `T` is a primitive comparable type. It is guaranteed that missing values (values of the `null` type) are smaller than any value of any type. Comparison of elements of comparable types is supported: the comparison is made first by type, and only if the types match are the values compared. The order between different types is not specified. For example, it makes no sense to compare `100` and `3.14`, because the result is not specified.
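The comparison rule can be sketched as a small comparator: `null` sorts below everything, values of the same type compare by value, and differing types are treated as unspecified (modeled here by raising an error). This is an illustrative model, not the actual server-side comparator:

```python
# A model of the comparison rule described above. Returns -1, 0, or 1.
# None stands in for the null type and is smaller than any value; comparing
# values of different types is unspecified, modeled by raising an error.
def compare(a, b):
    if a is None or b is None:
        if a is None and b is None:
            return 0
        return -1 if a is None else 1
    if type(a) is not type(b):
        raise ValueError("order between different types is not specified")
    return (a > b) - (a < b)

assert compare(None, 100) == -1   # null is smaller than any value
assert compare(7, 7) == 0
assert compare("a", "b") == -1
```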
Schema examples
Schema of a static table after creation (unless explicitly stated otherwise):

```
<strict=%false>[]
```

Automatically inferred schema for a table sorted by `key1; key2`:

```
<strict=%false>[
    {
        name = "key1";
        type = "any";
        sort_order = "ascending";
        type_v3 = {
            type_name = optional;
            item = yson;
        };
    };
    {
        name = "key2";
        type = "any";
        sort_order = "ascending";
        type_v3 = {
            type_name = optional;
            item = yson;
        };
    };
]
```
Example of a schema for a sorted table:

```
[
    {
        name = "key";
        type = "string";
        sort_order = "ascending";
    };
    {
        name = "subkey";
        type = "string";
        sort_order = "ascending";
    };
    {
        name = "value";
        type = "string";
    };
]
```
Example of a schema with grouping for columns that are often used together:

```
[
    {
        name = "key";
        type = "string";
        sort_order = "ascending";
        group = "g1";
    };
    {
        name = "subkey";
        type = "string";
        sort_order = "ascending";
        group = "g1";
    };
    {
        name = "value";
        type = "string";
        sort_order = "ascending";
    };
]
```
Example of a schema for a table with logs:

```
<strict=%false>[
    {
        "name" = "id";
        "type" = "int64";
    };
    {
        "name" = "class";
        "type" = "int64";
    };
    {
        "name" = "uid";
        "type" = "string";
    };
    {
        "name" = "ip";
        "type" = "int64";
    };
    {
        "name" = "iso_eventtime";
        "type" = "string";
    };
    {
        "name" = "error";
        "type" = "string";
    };
    {
        "name" = "ip6";
        "type" = "string";
    };
    {
        "name" = "port";
        "type" = "int64";
    };
    {
        "name" = "comment";
        "type" = "string";
    };
]
```
Weak and strong schema
A table has the `@schema_mode` attribute that indicates the origin of the schema. If the schema was not set explicitly by the user, the `@schema_mode` attribute has the `weak` value. This means that the table schema was created automatically and is only needed to mark up the key columns. This schema changes automatically if the table loses its sorting property or changes its set of key columns. For example, a write with the `<append=%true>` flag to a sorted table without specifying the `sorted_by` attribute causes the table to lose its sorting property, which means that its schema changes: at least the `sort_order` attribute.

If the table schema is set explicitly by the user, then `@schema_mode == strong`. The schema of such a table is preserved and checked when writing, both in `append` and `overwrite` mode. Implicit transformations of a strong schema are impossible.
```bash
# Create a table without a schema
$ yt --proxy <cluster-name> create table //tmp/table_1
213c-a01da-3fc0191-54a9a802
$ yt --proxy <cluster-name> get //tmp/table_1/@schema_mode
"weak"
$ yt --proxy <cluster-name> get //tmp/table_1/@schema
<
    "unique_keys" = %false;
    "strict" = %false;
> []

# Create a table with explicit indication of a schema
$ yt --proxy <cluster-name> create table //tmp/table_2 --attributes '{schema = [{name = a; type = int64}; {name = b; type = string}]}'
213c-b75ef-3fc0191-1e85fe66
$ yt --proxy <cluster-name> get //tmp/table_2/@schema_mode
"strong"
$ yt --proxy <cluster-name> get //tmp/table_2/@schema
<
    "unique_keys" = %false;
    "strict" = %true;
> [
    {
        "name" = "a";
        "type" = "int64";
    };
    {
        "name" = "b";
        "type" = "string";
    };
]
```
Strict schematization advantages:
- Performance:
  - If you additionally enable the columnar chunk storage format (`optimize_for = scan`), data will be stored more compactly, and reading a subset of columns will be faster.
  - Some table readers can work faster for tables with a schema.
- Convenience and more options:
  - YQL is more convenient for working with schematized tables.
  - In CHYT, you cannot work with unschematized data.
Inheriting a schema in system operations
System operations such as Sort, Erase, Merge, and Remote copy can infer the output table schema on their own. Schema inference rules for operations are controlled by the `schema_inference_mode = (auto|from_input|from_output)` specification option. Examples of using the options can be found in the Examples section.
Ordered merge, Erase
| Option | Description |
|---|---|
| `from_input` | The schemas of all input tables must match up to the `sort_order` attribute and the column order. The output table schema then matches the input table schema. If there is one input table, sorting is preserved. If the input tables had a strong schema, the output table will have a strong schema as well. |
| `from_output` | The output table schema is used; data is validated against this schema. |
| `auto` | If the output table has a strong schema, its type compatibility with the schemas of all strongly schematized input tables is checked. If the output table has a weak schema, the schema is derived from the input table schemas, as with the `from_input` option. |
Unordered merge
| Option | Description |
|---|---|
| `from_input` | The number and types of columns must match; the output table schema then matches the input table schema, except for sorting. If the input tables had a strong schema, the output table will have a strong schema as well. |
| `from_output` | The output table schema is used; data is validated against this schema. |
| `auto` | If the output table has a strong schema, its type compatibility with the schemas of all strongly schematized input tables is checked. If the output table has a weak schema, the schema is derived from the input table schemas, as with the `from_input` option. |
Sorted merge
| Option | Description |
|---|---|
| `from_input` | The number and types of columns must match; the output table schema then matches the input table schema, and the key columns are set according to the `merge_by` parameter of the specification. If the input tables had a strong schema, the output table will have a strong schema as well. |
| `from_output` | It is checked that the key columns of the output table match the `merge_by` key columns from the specification. |
| `auto` | If the output table has a strong schema, its type compatibility with the schemas of all strongly schematized input tables is checked. It is checked that the key columns coincide with the `merge_by` parameter. If the output table has a weak schema, the schema is derived from the input table schemas, as with the `from_input` option. |
Sort
`Sort` is the only operation that can change a strong table schema: the column order is changed, key columns are moved to the beginning, and the `sort_order` attribute is set. This is due to the common scenario of replacing a table with its sorted copy.
| Option | Description |
|---|---|
| `from_input` | If all input tables have strong schemas and the schemas match up to the `sort_order` attribute and column order, the output table column types match the types in the input table schemas. The key columns are derived from the operation specification (`sort_by`). If all input tables have a weak schema, the schema is derived from the key columns. |
| `from_output` | The output table schema is preserved; the key columns are changed according to the `sort_by` specification parameter. All key columns must be present in the schema. |
| `auto` | If the output table has a strong schema, its type compatibility with the schemas of all strongly schematized input tables is checked. The key columns change according to the specification. If the output table has a weak schema, the schema is derived from the input table schemas, as with the `from_input` option. |
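The effect of `Sort` on a strong schema described above can be modeled with a small helper that moves the `sort_by` columns to the front and sets `sort_order` on them. This is a sketch of the described behavior, not the actual implementation:

```python
# Model of what Sort does to a strong schema: the sort_by columns move to
# the front (in sort_by order) and get sort_order = "ascending"; the
# remaining columns keep their relative order and lose any sort_order.
def sorted_table_schema(schema, sort_by):
    by_name = {c["name"]: dict(c) for c in schema}
    keys = []
    for name in sort_by:
        col = by_name.pop(name)
        col["sort_order"] = "ascending"
        keys.append(col)
    rest = [{k: v for k, v in c.items() if k != "sort_order"}
            for c in schema if c["name"] in by_name]
    return keys + rest

schema = [
    {"name": "value", "type": "string"},
    {"name": "key", "type": "string"},
]
out = sorted_table_schema(schema, sort_by=["key"])
assert [c["name"] for c in out] == ["key", "value"]
assert out[0]["sort_order"] == "ascending"
```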
Changing a schema
Changing a schema is a lightweight operation on metainformation that does not require reading data. To change a schema, use the `alter_table` command.

The schema of an empty table can be changed arbitrarily. For non-empty tables, there are several allowed schema change scenarios:
- Adding a non-key column to a strict schema (`strict = %true`);
- Adding a key column to the end of the key in a strict schema (`strict = %true`);
- Deleting a column from a non-strict schema (`strict = %false`);
- Converting a schema from strict (`strict = %true`) to non-strict (`strict = %false`).
Other schema changes are prohibited because they require data revalidation; such changes can be made through `Merge` or `Map` operations.

There are a number of restrictions on changing the schema:
- A schema change must not require validating or modifying the contents of the table. For example, changing the `strict` attribute of the table from `false` to `true` would require checking that each row in the table contains only the values of columns mentioned in the schema, and such a check is equivalent to reading the entire table.
- You cannot delete columns if `strict = %true`: there is no way to clear the corresponding value in each row. You cannot add columns if `strict = %false`: when adding a column of a certain type, you cannot be sure that the table contains no rows that already have a value for this column, possibly of a different data type.
- You cannot delete key columns from the middle of the key, because this could violate the sort order.
- You cannot introduce new computed columns or change the expression or aggregate function of existing columns, because this would require recomputing all rows in the table. However, you can, for example, delete the last key columns, because this only weakens the sorting condition for future insertions without changing it for the existing set of rows.
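The first allowed scenario, adding a non-key column to a strict schema, can be sketched on a plain Python representation of the schema. The `yt.alter_table` call is shown commented out because it needs a live cluster; check your client's exact signature:

```python
# Build a new schema that adds a non-key column to a strict schema, one of
# the allowed in-place changes listed above. The actual application of the
# change would go through the alter_table command.
old_schema = [
    {"name": "key", "type": "string", "sort_order": "ascending"},
    {"name": "value", "type": "string"},
]

# Allowed: append a non-key column (no sort_order) to a strict schema.
new_schema = old_schema + [{"name": "extra", "type": "int64"}]
assert len(new_schema) == len(old_schema) + 1
assert "sort_order" not in new_schema[-1]

# With the yt Python client (assumption; requires a cluster):
# yt.alter_table("//tmp/table", schema=new_schema)
```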
Examples of working with a schema
Reading a schema
- CLI

  ```bash
  yt --proxy <cluster-name> get //home/tutorial/links_sorted_schematized/@schema
  ```

- Python

  ```python
  yt.get("//home/tutorial/links_sorted_schematized/@schema")
  ```
Creating a table with a schema
- CLI

  ```bash
  yt --proxy <cluster-name> create table //tmp/table --attributes '{schema = [{name = a; type = int64}; {name = b; type = string}]}'
  ```

- Python

  ```python
  yt.create_table("//tmp/table", attributes={"schema": [{"name": "a", "type": "int64"}, {"name": "b", "type": "string"}]})
  ```

- C++

  ```cpp
  auto schemaNode = TNode::CreateList()
      .Add(TNode()("name", "a")("type", "int64"))
      .Add(TNode()("name", "b")("type", "string"));
  client->Create(
      "//tmp/table",
      NYT::NT_TABLE,
      TCreateOptions().Attributes(TNode()("schema", schemaNode)));
  ```
The schema attribute in RichYPath
You can use the `schema` attribute of the path when writing data using the `write_table` command or at the output of MapReduce operations.

- CLI

  ```bash
  cat data | yt --proxy <cluster-name> write '<schema=[{name = a; type = int64}; {name = b; type = string}]>//tmp/table'
  ```
Setting a schema and changing a format for typed and structured data

- CLI

  ```bash
  # Enable columnar storage format
  yt --proxy <cluster-name> set //tmp/table/@optimize_for scan
  # Set a new schema and transfer data to the new storage format
  yt --proxy <cluster-name> merge --mode ordered --src //tmp/table --dst '<schema=[{name = a; type = int64; sort_order = ascending}; {name = b; type = string}]>//tmp/table' --spec '{schema_inference_mode = from_output; force_transform = %true}'
  ```
There are also more automated options:
- Run the query `INSERT INTO "path/to/somewhere" SELECT * FROM "source/table"` via YQL. YQL will try to infer the schema by reading the first row or the first few rows.