Read options

Multi-cluster reads

By default, the computation process accesses data on the cluster that provides the computational resources (both when running SPYT directly and when using a standalone cluster). SPYT runs its own RPC proxies to offload the cluster's shared proxies.

Version 2.2.0 introduced the option to read data from other YTsaurus clusters. To do this, specify the cluster address explicitly in the table path.

spark.read.yt('<cluster="localhost:8000">//home/table').show() # Table on cluster localhost:8000

spark.read.yt('//home/table').show() # Table on home cluster

Note

Data from other YTsaurus clusters is read through their shared proxies, which may put a heavy strain on them when the volume of data is large.

schema_hint

Specifies a hard-coded type for a column. Useful when a column is of type any (a composite data type serialized as YSON): the value will be deserialized as the specified type.

Python example:

spark.read.schema_hint({"value": MapType(StringType(), LongType())}).yt("//sys/spark/examples/example_yson")

Scala example:

df.write
    .schemaHint(Map("a" ->
        YtLogicalType.VariantOverTuple(Seq(
          (YtLogicalType.String, Metadata.empty), (YtLogicalType.Double, Metadata.empty)))))
    .yt(tmpPath)
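A schema hint can also be attached on the read side, mirroring the Python example above. A minimal Scala sketch; the availability of schemaHint on the reader and the Spark-DataType form of the hint map are assumptions here, so check the SPYT Scala API for the exact signature:

```scala
// Read-side schema hint: deserialize the "value" column (stored as type any,
// i.e. serialized YSON) into a map from string to long.
// NOTE: schemaHint on the reader and the DataType-based hint form are
// assumptions; the write example above uses YtLogicalType instead.
import org.apache.spark.sql.types.{LongType, MapType, StringType}

val df = spark.read
  .schemaHint(Map("value" -> MapType(StringType, LongType)))
  .yt("//sys/spark/examples/example_yson")
df.show()
```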

transaction

Reading from a transaction. For more details, see Reading and writing within a transaction.

Scala example:

import scala.concurrent.duration._
import java.util.concurrent.TimeUnit

val transaction = YtWrapper.createTransaction(None, 10.minutes)
df.write.transaction(transaction.getId.toString).yt(tmpPath)
transaction.commit().get(10, TimeUnit.SECONDS)

Schema v3

Reads tables with schemas in type_v3 format instead of type_v1. This can be enabled in the Spark configuration or with a read option.

Python example:

spark.read.option("parsing_type_v3", "true").yt("//sys/spark/examples/example_yson")
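The same behavior can be enabled globally through the Spark configuration rather than per read. A hedged sketch; the configuration key used below is an assumption, so consult the SPYT configuration reference for the canonical property name:

```python
# Enable type_v3 schema parsing for all subsequent reads in this session.
# NOTE: the exact property name is an assumption; check the SPYT
# configuration reference for the canonical key.
spark.conf.set("spark.yt.read.parsingTypeV3.enabled", "true")

# Reads now parse schemas as type_v3 without a per-read option.
spark.read.yt("//sys/spark/examples/example_yson").show()
```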