Data type mapping: SPYT, Spark, and YTsaurus

This article describes how YTsaurus data types map to Apache Spark data types when reading from and writing to YTsaurus.

Notes

Type hints: Some mappings can be overridden by type hints in field metadata. This enables precise control of type conversion.
Arrow support: Date/time types, as well as uuid and json, do not support the Arrow format and use alternative serialization. The affected types are marked in the Notes column of the reading table.
Decimal precision: YTsaurus limits decimal precision to 35. Learn more about Decimal.

String types: The stringToUtf8 configuration option controls whether Spark strings are written as string or utf8 in YTsaurus. The example below enables it through the string_to_utf8 write option:

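import spyt  # makes the .yt() read/write methods available
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType, BinaryType

# Assumes an existing SparkSession named `spark`.
# One row with a StringType column and a BinaryType column.
spark.createDataFrame(
    spark.sparkContext.parallelize([Row("string", b"binary")]),
    StructType([
        StructField("string", StringType(), nullable=False),
        StructField("binary", BinaryType(), nullable=False),
    ])
).write \
.option("string_to_utf8", "true") \
.yt("//path/to/table")  # the StringType column is written as utf8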

Composite types: Complex types such as structs, arrays, and maps are mapped recursively using the same rules for their element types.

For special techniques related to data types, see Read options and Write options.

Type mapping when reading from YTsaurus

YTsaurus type	Spark type (Basic)	Spark type (Extended)	Notes
null	NullType	NullType
int64	LongType	LongType
uint64	DecimalType(20, 0)	UInt64Type
float	FloatType/DoubleType	FloatType/DoubleType	Top-level: FloatType. Nested elements: DoubleType.
double	DoubleType	DoubleType
boolean	BooleanType	BooleanType
string	StringType	StringType
binary	BinaryType	BinaryType
any/yson	BinaryType	YsonType
int8	ByteType	ByteType
uint8	ShortType	ShortType
int16	ShortType	ShortType
uint16	IntegerType	IntegerType
int32	IntegerType	IntegerType
uint32	LongType	LongType
utf8	StringType	StringType
date	DateType	DateType	Arrow is not supported
datetime	DatetimeType	DatetimeType	Arrow is not supported
timestamp	TimestampType	TimestampType	Arrow is not supported
interval	LongType	LongType	Arrow is not supported
date32	Date32Type	Date32Type	Arrow is not supported
datetime64	Datetime64Type	Datetime64Type	Arrow is not supported
timestamp64	Timestamp64Type	Timestamp64Type	Arrow is not supported
interval64	Interval64Type	Interval64Type	Arrow is not supported
uuid	StringType	StringType	Arrow is not supported
json	StringType	StringType	Arrow is not supported
void	NullType	NullType
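
The unsigned rows in the table above are consistent with a simple widening rule: each unsigned type maps to the smallest signed Spark type that can hold its full range, and uint64, which has no wider signed integer type, needs 20 decimal digits. The source does not state this rationale, so the following is only an illustrative check in plain Python, not SPYT code. The names SIGNED_RANGES, UNSIGNED_TO_SPARK, and fits are hypothetical.

```python
# Inclusive value ranges of the signed Spark integer types.
SIGNED_RANGES = {
    "ByteType": (-2**7, 2**7 - 1),
    "ShortType": (-2**15, 2**15 - 1),
    "IntegerType": (-2**31, 2**31 - 1),
    "LongType": (-2**63, 2**63 - 1),
}

# Unsigned YTsaurus types: bit width and the Spark type listed in the table.
UNSIGNED_TO_SPARK = {
    "uint8": (8, "ShortType"),
    "uint16": (16, "IntegerType"),
    "uint32": (32, "LongType"),
}


def fits(bits, spark_type):
    """Return True if every unsigned value of this bit width fits in spark_type."""
    low, high = SIGNED_RANGES[spark_type]
    return low <= 0 and 2**bits - 1 <= high


for yt_type, (bits, spark_type) in UNSIGNED_TO_SPARK.items():
    print(yt_type, "->", spark_type, "fits:", fits(bits, spark_type))

# uint64 does not fit in LongType; count the decimal digits of its maximum.
UINT64_MAX = 2**64 - 1
UINT64_DIGITS = len(str(UINT64_MAX))
print("uint64 max:", UINT64_MAX, "digits:", UINT64_DIGITS)
```

The digit count matches the precision in DecimalType(20, 0), the Basic mapping for uint64.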

Composite types

YTsaurus type	Mapping to Spark type
Optional(inner)	inner type with nullable = true
Dict(key, value)	MapType(key.innerLevel, value.innerLevel, value.nullable)
Array(inner)	ArrayType(inner.innerLevel, inner.nullable)
Struct(fields)	StructType with corresponding field mappings
Tuple(elements)	StructType with fields named `_1`, `_2`, etc.
VariantOverStruct	StructType with fields prefixed with `_v` and nullable = true
VariantOverTuple	StructType with fields named `_v_1`, `_v_2`, etc., and nullable = true
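
The recursive rules in the table above can be sketched in plain Python. This is a simplified model, not the SPYT implementation: YTsaurus types are written as hypothetical nested tuples, Spark types are returned as descriptive strings, only a few primitives are included (float is left out because nested float elements map to DoubleType), and variants are omitted.

```python
# A small subset of the primitive mappings from the reading table.
PRIMITIVES = {
    "int64": "LongType",
    "double": "DoubleType",
    "boolean": "BooleanType",
    "string": "StringType",
}


def to_spark(yt_type):
    """Return (spark_type_description, nullable) for a YTsaurus type.

    yt_type is either a primitive name or a tuple such as
    ("optional", inner), ("array", inner), ("dict", key, value),
    ("struct", [(name, type), ...]) or ("tuple", [type, ...]).
    """
    if isinstance(yt_type, str):
        return PRIMITIVES[yt_type], False

    kind = yt_type[0]
    if kind == "optional":
        # Optional(inner): the inner type with nullable = true.
        inner, _ = to_spark(yt_type[1])
        return inner, True
    if kind == "array":
        inner, nullable = to_spark(yt_type[1])
        return f"ArrayType({inner}, containsNull={nullable})", False
    if kind == "dict":
        key, _ = to_spark(yt_type[1])
        value, value_nullable = to_spark(yt_type[2])
        return f"MapType({key}, {value}, valueContainsNull={value_nullable})", False
    if kind == "struct":
        fields = [(name, *to_spark(t)) for name, t in yt_type[1]]
    elif kind == "tuple":
        # Tuple elements become struct fields named _1, _2, ...
        fields = [(f"_{i}", *to_spark(t)) for i, t in enumerate(yt_type[1], start=1)]
    else:
        raise ValueError(f"unsupported type kind: {kind}")

    body = ", ".join(f"{name}: {t} nullable={n}" for name, t, n in fields)
    return f"StructType({body})", False


print(to_spark(("array", ("optional", "string"))))
print(to_spark(("tuple", ["int64", "boolean"])))
```

Each composite case calls to_spark on its element types, so arbitrarily nested types are handled by the same rules.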

Type mapping when writing to YTsaurus

Spark type	YTsaurus type	Notes
NullType	null
ByteType	int8
ShortType	int16 (or uint8 when hinted)
IntegerType	int32 (or uint16 when hinted)
LongType	int64 (or uint32 when hinted)
StringType	string (or utf8/json/uuid when hinted, or utf8 when stringToUtf8 is enabled)
FloatType	float
DoubleType	double
BooleanType	boolean
DecimalType	Decimal(precision, scale)	Precision is limited to 35
ArrayType	Array(inner)
StructType (tuple)	Tuple(elements)
StructType (variant)	VariantOverStruct/VariantOverTuple
StructType	Struct(fields)
MapType	Dict(key, value)
BinaryType	binary
DateType	date
DatetimeType	datetime
TimestampType	timestamp
Date32Type	date32
Datetime64Type	datetime64
Timestamp64Type	timestamp64
Interval64Type	interval64
UInt64Type	uint64
YsonType	any
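
The hint and stringToUtf8 rules for the integer and string rows, together with the decimal precision limit, can be modelled in plain Python as follows. This is a sketch under stated assumptions, not SPYT code: the function names are hypothetical, types are represented by their names as strings, and giving an explicit hint priority over stringToUtf8 is an assumption that the table does not confirm.

```python
# Default YTsaurus type for each Spark type (subset of the writing table).
DEFAULT_YT_TYPE = {
    "ByteType": "int8",
    "ShortType": "int16",
    "IntegerType": "int32",
    "LongType": "int64",
    "StringType": "string",
}

# Hints that the writing table lists for each Spark type.
ALLOWED_HINTS = {
    "ShortType": {"uint8"},
    "IntegerType": {"uint16"},
    "LongType": {"uint32"},
    "StringType": {"utf8", "json", "uuid"},
}

MAX_DECIMAL_PRECISION = 35  # limit stated for YTsaurus decimals


def yt_type_for(spark_type, hint=None, string_to_utf8=False):
    """Pick the YTsaurus type name for a Spark type name."""
    if hint is not None:
        # Assumption: an explicit hint wins over string_to_utf8.
        if hint not in ALLOWED_HINTS.get(spark_type, set()):
            raise ValueError(f"hint {hint!r} is not listed for {spark_type}")
        return hint
    if spark_type == "StringType" and string_to_utf8:
        return "utf8"
    return DEFAULT_YT_TYPE[spark_type]


def yt_decimal(precision, scale):
    """Describe the YTsaurus decimal type, rejecting precision above the limit."""
    if precision > MAX_DECIMAL_PRECISION:
        raise ValueError(f"precision {precision} exceeds {MAX_DECIMAL_PRECISION}")
    return f"Decimal({precision}, {scale})"


print(yt_type_for("LongType"))
print(yt_type_for("LongType", hint="uint32"))
print(yt_type_for("StringType", string_to_utf8=True))
print(yt_decimal(35, 10))
```

A DecimalType(38, 18) column, which is the Spark maximum precision, would be rejected by this check and would need to be cast to a precision of 35 or less before writing.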
