Read options
Multi-cluster reads
By default, the computation process accesses data on the same cluster that provides the computational resources (when running directly or when using a standalone cluster). SPYT runs its own RPC proxies to reduce the load on the cluster's shared proxies.
Version 2.2.0 introduced the option to read data from other YTsaurus clusters. To do this, explicitly specify the cluster address in the table path.
spark.read.yt('<cluster="localhost:8000">//home/table').show() # Table on cluster localhost:8000
spark.read.yt('//home/table').show() # Table on home cluster
Note
Data is read from other YTsaurus clusters through shared proxies, which may put a heavy strain on them when the volume of data is high.
schema_hint
A hard-coded column type. Useful when a column has type any (a composite data type serialized as yson). The value will be deserialized as the specified type.
Python example:
spark.read.schema_hint({"value": MapType(StringType(), LongType())}).yt("//sys/spark/examples/example_yson")
Scala example:
df.write
.schemaHint(Map("a" ->
YtLogicalType.VariantOverTuple(Seq(
(YtLogicalType.String, Metadata.empty), (YtLogicalType.Double, Metadata.empty)))))
.yt(tmpPath)
transaction
Reading from a transaction. For more details, see Reading and writing within a transaction.
Scala example:
val transaction = YtWrapper.createTransaction(None, 10 minutes)
df.write.transaction(transaction.getId.toString).yt(tmpPath)
transaction.commit().get(10, TimeUnit.SECONDS)
Schema v3
Reads tables with schemas in type_v3 format instead of type_v1. This can be enabled in the Spark configuration or as a read option.
Python example:
spark.read.option("parsing_type_v3", "true").yt("//sys/spark/examples/example_yson")
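Alternatively, type_v3 parsing can be turned on for the whole session through the Spark configuration instead of per read. A minimal sketch; the configuration key name below is an assumption and should be checked against the configuration reference for your SPYT version:

```python
# Enable type_v3 parsing session-wide via Spark configuration.
# NOTE: the key name is an assumption; verify it for your SPYT version.
spark.conf.set("spark.yt.read.parsingTypeV3.enabled", "true")

# Subsequent reads no longer need the per-read option:
spark.read.yt("//sys/spark/examples/example_yson").show()
```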