Data type mapping: SPYT, Spark, and YTsaurus
This article describes the mapping between YTsaurus data types and Apache Spark data types for read and write.
Notes
-
Type hints: Some mappings can be overridden by type hints in field metadata. This enables precise control of type conversion.
-
Arrow support: Some date/time related types do not support the Arrow format and use alternative serialization.
-
Decimal precision: YTsaurus limits decimal precision to 35. Learn more about Decimal
-
String types: The configuration option stringToUtf8 controls whether strings map to string or utf8 in YTsaurus. Example:
import spyt from pyspark.sql import Row from pyspark.sql.types import StructType, StructField, StringType, BinaryType spark.createDataFrame( spark.sparkContext.parallelize([Row("string", b"binary")]), StructType([ StructField("string", StringType(), nullable=False), StructField("binary", BinaryType(), nullable=False), ]) ).write \ .option("string_to_utf8", "true") \ .yt("//path/to/table") -
Composite types: Complex types such as structs, arrays, and maps are mapped recursively using the same rules for their element types.
For special techniques related to data types, see Read options and Write options.
Type mapping when reading from YTsaurus
| YTsaurus type | Spark type (Basic) | Spark type (Extended) | Notes |
|---|---|---|---|
| null | NullType | NullType | |
| int64 | LongType | LongType | |
| uint64 | DecimalType(20, 0) | UInt64Type | |
| float | FloatType/DoubleType | FloatType/DoubleType | Top-level: FloatType. Nested elements: DoubleType. |
| double | DoubleType | DoubleType | |
| boolean | BooleanType | BooleanType | |
| string | StringType | StringType | |
| binary | BinaryType | BinaryType | |
| any/yson | BinaryType | YsonType | |
| int8 | ByteType | ByteType | |
| uint8 | ShortType | ShortType | |
| int16 | ShortType | ShortType | |
| uint16 | IntegerType | IntegerType | |
| int32 | IntegerType | IntegerType | |
| uint32 | LongType | LongType | |
| utf8 | StringType | StringType | |
| date | DateType | DateType | Arrow is not supported |
| datetime | DatetimeType | DatetimeType | Arrow is not supported |
| timestamp | TimestampType | TimestampType | Arrow is not supported |
| interval | LongType | LongType | Arrow is not supported |
| date32 | Date32Type | Date32Type | Arrow is not supported |
| datetime64 | Datetime64Type | Datetime64Type | Arrow is not supported |
| timestamp64 | Timestamp64Type | Timestamp64Type | Arrow is not supported |
| interval64 | Interval64Type | Interval64Type | Arrow is not supported |
| uuid | StringType | StringType | Arrow is not supported |
| json | StringType | StringType | Arrow is not supported |
| void | NullType | NullType |
Composite types
| YTsaurus type | Mapping to Spark type |
|---|---|
| Optional(inner) | inner type with nullable = true |
| Dict(key, value) | MapType(key.innerLevel, value.innerLevel, value.nullable) |
| Array(inner) | ArrayType(inner.innerLevel, inner.nullable) |
| Struct(fields) | StructType with corresponding field mappings |
| Tuple(elements) | StructType with fields named _1, _2, etc. |
| VariantOverStruct | StructType with fields prefixed with _v and nullable = true |
| VariantOverTuple | StructType with fields named _v_1, _v_2, etc., and nullable = true |
Type mapping when writing to YTsaurus
| Spark type | YTsaurus type | Notes |
|---|---|---|
| NullType | null | |
| ByteType | int8 | |
| ShortType | int16 (or uint8 when hinted) | |
| IntegerType | int32 (or uint16 when hinted) | |
| LongType | int64 (or uint32 when hinted) | |
| StringType | string (or utf8/json/uuid when hinted, or utf8 when stringToUtf8 is enabled) | |
| FloatType | float | |
| DoubleType | double | |
| BooleanType | boolean | |
| DecimalType | Decimal(precision, scale) | Precision is limited to 35 |
| ArrayType | Array(inner) | |
| StructType (tuple) | Tuple(elements) | |
| StructType (variant) | VariantOverStruct/VariantOverTuple | |
| StructType | Struct(fields) | |
| MapType | Dict(key, value) | |
| BinaryType | binary | |
| DateType | date | |
| DatetimeType | datetime | |
| TimestampType | timestamp | |
| Date32Type | date32 | |
| Datetime64Type | datetime64 | |
| Timestamp64Type | timestamp64 | |
| Interval64Type | interval64 | |
| UInt64Type | uint64 | |
| YsonType | any |