Protobuf representation of tables

This section describes the use of the protobuf structured data transfer protocol for working with tables in the C++ API.

Introduction

The C++ API enables you to use protobuf classes (messages) to read and write tables both by the client and inside the job.

We recommend using the proto2 version. If you use proto3, errors may occur: for example, when trying to write 0 in the required=%true field.

How it works

The user describes the proto structure in the .proto file. The proto structure can be marked with different flags. Flags affect how YTsaurus will generate or interpret the fields specified within messages.

Elementary types

To work with the YTsaurus data type from the first column, you can use the corresponding protobuf type from the second column in the table below.

YTsaurus Protobuf
string, utf8 string, bytes
int{8,16,32,64} int{32,64}, sint{32,64}, sfixed{32,64}
uint{8,16,32,64} uint{32,64}, fixed{32,64}
double double, float
bool bool
date, datetime, timestamp uint{32,64}, fixed{32,64}
interval int{32,64}, sint{32,64}, sfixed{32,64}

If the range of the integer YTsaurus type does not correspond to the range of the protobuf type, a check is performed at the time of encoding or decoding.

Optional/required properties

The optional/required properties in protobuf do not have to correspond to the options of columns in YT. The system always checks when protobuf messages are encoded or decoded.

For example, if the foo column in YTsaurus has the int64 type (required=%true), the following field can be used for its representation:

  optional int64 foo = 42;

There will be no errors as long as all protobuf messages that are written to the table have the filled in foo field.

Embedded messages

Historically, all embedded protobuf structures are written in YTsaurus as a byte string storing the serialized representation of the embedded structure.

Embedded messages are quite efficient, but do not enable you to conveniently represent values in the web interface or work with them using other methods (without the help of protobuf).

This method can also be specified explicitly by setting a special flag in the field:

optional TEmbeddedMessage UrlRow_1 = 1 [(NYT.flags) = SERIALIZATION_PROTOBUF];

You can specify an alternative flag, then YTsaurus will refer the field to the struct type:

optional TEmbeddedMessage ColumnName = 1 [(NYT.flags) = SERIALIZATION_YT];

To use embedded messages:

  • The table must have a schema.
  • The corresponding column (in the example: ColumnName) must have the struct YTsaurus type.
  • The struct type fields must correspond to the fields of the embedded message (in the example: TEmbeddedMessage).

Warning

The flag in embedded messages is not inherited by default. If the SERIALIZATION_YT flag is set for the field with the T type, the default behavior for structures embedded in T will still correspond to the SERIALIZATION_PROTOBUF flag.

Repeated fields

To work with repeated fields, you must explicitly specify the SERIALIZATION_YT flag:

repeated TType ListColumn = 1 [(NYT.flags) = SERIALIZATION_YT];

In YT, such a field will have the list type. To use repeated fields:

  • The table must have a schema.
  • The corresponding column (in the example: ListColumn) must have the list YTsaurus type.
  • The element of the list YTsaurus type must correspond to the protobuf column type (in the example: TType). This can be an elementary type or an embedded message.

Warning

The SERIALIZATION_PROTOBUF flag for repeated fields is not supported.

Oneof fields

By default, fields within a oneof group correspond to the variant YTsaurus type. For example, the message below will correspond to the structure with the x field of the int64 type and the my_oneof field of the variant<y: string, z: bool> type:

message TMessage {
    optional int64 x = 1;
    oneof my_oneof {
        string y = 2;
        bool z = 3;
    }
}

If you infer the schema using CreateTableSchema<T>(), a similar type will be inferred.

To make oneof groups correspond to the fields of the structure where they are described, use the (NYT.oneof_flags) = SEPARATE_FIELDS flag:

message TMessage {
    optional int64 x = 1;
    oneof my_oneof {
        option (NYT.oneof_flags) = SEPARATE_FIELDS;
        string y = 2;
        bool z = 3;
    }
}

This message will correspond to the structure with optional fields x, y, and z.

Map fields

There are 4 options for displaying such a field in a column in a table. See the example below:

message TValue {
    optional int64 x = 1;
}

message TMessage {
    map<string, TValue> map_field = 1 [(NYT.flags) = SERIALIZATION_YT];
}

Depending on its flags, the map_field field can correspond to:

  • The list of structures with the key field of the string type and the value field of the string type, which will contain the serialized protobuf TValue (as if the SERIALIZATION_PROTOBUF flag is set for the value field). In this case, the default flag is MAP_AS_LIST_OF_STRUCTS_LEGACY.
  • The list of structures with the key field of the string type and the value field of the Struct<x: Int64> type (as if the SERIALIZATION_PROTOBUF flag is set for the value field). In this case, the default flag is MAP_AS_LIST_OF_STRUCTS.
  • The Dict<String, Struct<x: Int64>>> dict: the MAP_AS_DICT flag.
  • The optional Optional<Dict<String, Struct<x: Int64>>>> dict: the MAP_AS_OPTIONAL_DICT flag.

Flags

You can use flags to customize the protobuf behavior. To do this, connect the library .proto file.

import "mapreduce/yt/interface/protos/extension.proto";

Flags can correspond to messages, oneof groups, and message fields.

You can specify flags at the level of the .proto file, message, oneof group, and message field.

SERIALIZATION_YT, SERIALIZATION_PROTOBUF

Behavior of these flags is described above.

By default, SERIALIZATION_PROTOBUF is implied where relevant. You can change the flag for a single message:

message TMyMessage
{
    option (NYT.default_field_flags) = SERIALIZATION_YT;
    ...
}

OTHER_COLUMNS

You can mark the field of the bytes type with the OTHER_COLUMNS flag. This field can contain a YSON map that contains representations of all fields not described by other fields of this protobuf structure.

message TMyMessage
{
    ...
    optional bytes OtherColumns = 1 [(NYT.flags) = OTHER_COLUMNS];
    ...
}

ENUM_STRING / ENUM_INT

You can mark the fields of the enum type with the ENUM_STRING/ENUM_INT flags:

  • If the field is marked ENUM_INT, it will be saved to the column as an integer.
  • If the field is marked ENUM_STRING, it will be saved to the column as a string.

By default, ENUM_STRING is implied.

enum Color
{
    WHITE = 0;
    BLUE = 1;
    RED = -1;
}

...
optional Color ColorField = 1 [(NYT.flags) = ENUM_INT];
...

ANY

You can mark the fields of the bytes type with the ANY flag. Such fields contain a YSON representation of a column of any simple type.
For example, you can write the following code for the string type column:

// message.proto
message TMyMessage
{
    ...
    optional bytes AnyColumn = 1 [(NYT.flags) = ANY];
    ...
}
// main.cpp
TNode node = "Linnaeus";
TMyMessage m;
m.SetAnyColumn(NodeToYsonString(node));