Configuration parameters for running Spark tasks

This section lists the configuration parameters that can be passed when launching Spark tasks. They are passed as additional parameters via the --conf option of the standard Spark commands, such as spark-submit and spark-shell, as well as their YTsaurus wrappers, such as spark-submit-yt and spark-shell-yt.
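
As a quick illustration, many of the same application properties can also be set programmatically when building a Spark session. The PySpark snippet below is a minimal sketch: the application name and option values are placeholders, and properties that must be known before the driver starts still have to be passed with --conf at submit time.

```python
from pyspark.sql import SparkSession

# Minimal sketch: setting SPYT options programmatically instead of passing
# them with --conf. The application name and values are placeholders; launch
# options from the tables below still need to go through --conf at submit time.
spark = (
    SparkSession.builder
    .appName("spyt-config-example")
    .config("spark.yt.write.batchSize", "500000")
    .config("spark.yt.read.arrow.enabled", "true")
    .getOrCreate()
)
```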

Basic options

Most of the options are available starting with version 1.23.0, unless otherwise specified.

| Name | Default value | Description |
| --- | --- | --- |
| spark.yt.write.batchSize | 500000 | Size of data sent in a single WriteTable operation. |
| spark.yt.write.miniBatchSize | 1000 | Size of a data block sent in WriteTable. |
| spark.yt.write.timeout | 60 seconds | Write timeout for a single data block. |
| spark.yt.write.typeV3.enabled (spark.yt.write.writingTypeV3.enabled before 1.75.2) | true | Write tables with a schema in type_v3 format instead of type_v1. |
| spark.yt.read.vectorized.capacity | 1000 | Maximum number of rows in a batch when reading via the wire protocol. |
| spark.yt.read.arrow.enabled | true | Use the Arrow format for reading data (when possible). |
| spark.hadoop.yt.timeout | 300 seconds | Timeout for reads from YTsaurus. |
| spark.yt.read.typeV3.enabled (spark.yt.read.parsingTypeV3.enabled before 1.75.2) | true | Read tables with a schema in type_v3 format instead of type_v1. |
| spark.yt.read.keyColumnsFilterPushdown.enabled | true | Push Spark query filters down to YTsaurus for selective reading. |
| spark.yt.read.keyColumnsFilterPushdown.union.enabled | false | Combine all filters into a single continuous range for selective reading. |
| spark.yt.read.keyColumnsFilterPushdown.ytPathCount.limit | 100 | Maximum number of table ranges for selective reading. |
| spark.yt.transaction.timeout | 5 minutes | Timeout of the write operation transaction. |
| spark.yt.transaction.pingInterval | 30 seconds | Ping interval of the write operation transaction. |
| spark.yt.globalTransaction.enabled | false | Use a global transaction. |
| spark.yt.globalTransaction.id | None | Global transaction ID. |
| spark.yt.globalTransaction.timeout | 5 minutes | Global transaction timeout. |
| spark.hadoop.yt.user | - | YTsaurus user name. |
| spark.hadoop.yt.token | - | YTsaurus user token. |
| spark.yt.read.ytPartitioning.enabled | true | Use table partitioning performed by YTsaurus. |
| spark.yt.read.planOptimization.enabled | false | Optimize aggregations and joins on sorted input data. |
| spark.yt.read.keyPartitioningSortedTables.enabled | true | Partition sorted tables by key; required for plan optimization. |
| spark.yt.read.keyPartitioningSortedTables.unionLimit | 1 | Maximum number of partition joins when switching from reading by index to reading by key. |
| spark.yt.read.transactional | true | Take a snapshot lock for reading when no transaction is specified. It is recommended to turn this option off when reading immutable data to improve read performance. |
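
For example, when reading an immutable static table it may be preferable to skip the snapshot lock, as recommended above. Below is a minimal sketch, assuming the SPYT PySpark bindings that provide the spark.read.yt reader are available; the table path is a placeholder.

```python
from pyspark.sql import SparkSession

# Sketch under assumptions: //home/examples/static_table is a placeholder for
# an immutable static table, and spark.read.yt is provided by the SPYT
# PySpark bindings.
spark = (
    SparkSession.builder
    .config("spark.yt.read.transactional", "false")  # skip the snapshot lock for immutable data
    .config("spark.yt.read.typeV3.enabled", "true")  # keep type_v3 schema parsing on
    .getOrCreate()
)

df = spark.read.yt("//home/examples/static_table")
df.show(10)
```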

Options for launching tasks directly

| Parameter | Default value | Description | Starting with version |
| --- | --- | --- | --- |
| spark.ytsaurus.config.global.path | //home/spark/conf/global | Path to the document with the global Spark and SPYT configuration on the cluster. | 1.76.0 |
| spark.ytsaurus.config.releases.path | //home/spark/conf/releases for release versions, //home/spark/conf/pre-releases for pre-release versions | Path to the SPYT release configuration. | 1.76.0 |
| spark.ytsaurus.distributives.path | //home/spark/distrib | Path to the directory with Spark distributions. Within this directory, the structure looks like a/b/c/spark-a.b.c-bin-hadoop3.tgz. | 2.0.0 |
| spark.ytsaurus.config.launch.file | spark-launch-conf | Name of the document with the release configuration, located in the directory given by spark.ytsaurus.config.releases.path. | 1.76.0 |
| spark.ytsaurus.spyt.version | Matches the SPYT version on the client | SPYT version to be used on the cluster when launching a Spark application. | 1.76.0 |
| spark.ytsaurus.driver.maxFailures | 5 | Maximum allowed number of driver failures before the operation is considered failed. | 1.76.0 |
| spark.ytsaurus.executor.maxFailures | 10 | Maximum allowed number of executor failures before the operation is considered failed. | 1.76.0 |
| spark.ytsaurus.executor.operation.shutdown.delay | 10000 | Maximum time in milliseconds to wait for executors to finish when stopping the application before aborting the executor operation. | 1.76.0 |
| spark.ytsaurus.pool | - | Scheduler pool in which driver and executor operations are run. | 1.78.0 |
| spark.ytsaurus.python.binary.entry.point | - | Function used as an entry point when running compiled Python tasks. | 2.4.0 |
| spark.ytsaurus.python.executable | - | Path to the Python interpreter used in the driver and executors. | 1.78.0 |
| spark.ytsaurus.tcp.proxy.enabled | false | Whether a TCP proxy is used to access the operation. | 2.1.0 |
| spark.ytsaurus.tcp.proxy.range.start | 30000 | Minimum port number for a TCP proxy. | 2.1.0 |
| spark.ytsaurus.tcp.proxy.range.size | 1000 | Size of the port range that can be allocated for a TCP proxy. | 2.1.0 |
| spark.ytsaurus.cuda.version | - | CUDA version used for Spark applications. Relevant when the computations use GPUs. | 2.1.0 |
| spark.ytsaurus.redirect.stdout.to.stderr | false | Redirect user script output from stdout to stderr. | 2.1.0 |
| spark.ytsaurus.remote.temp.files.directory | //tmp/yt_wrapper/file_storage | Path to the Cypress cache used for uploading local scripts. | 2.4.0 |
| spark.ytsaurus.annotations | - | Annotations for driver and executor operations. | 2.2.0 |
| spark.ytsaurus.driver.annotations | - | Annotations for the driver operation. | 2.2.0 |
| spark.ytsaurus.executors.annotations | - | Annotations for the executor operation. | 2.2.0 |
| spark.ytsaurus.driver.watch | true | Flag for monitoring a driver operation executed in cluster mode. | 2.4.2 |
| spark.ytsaurus.network.project | - | Name of the network project in which the Spark application is launched. | 2.4.3 |
| spark.hadoop.yt.mtn.enabled | false | Flag for enabling MTN support. | 2.4.3 |
| spark.ytsaurus.squashfs.enabled | false | Use squashFS layers instead of Porto layers in a YTsaurus job. | 2.6.0 |
| spark.ytsaurus.client.rpc.timeout | - | Timeout used by the RPC client that starts YTsaurus operations. | 2.6.0 |
| spark.ytsaurus.rpc.job.proxy.enabled | true | Flag for using the RPC proxy embedded in the job proxy. | 2.6.0 |
| spark.ytsaurus.java.home | /opt/jdk[11,17] | Path to the JDK home directory used in cluster containers. Depends on the JDK used on the client side. Allowed versions: JDK 11 and JDK 17. | 2.6.0 |
| spark.ytsaurus.shuffle.enabled | false | Use the YTsaurus Shuffle service. | 2.7.2 |
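
These options apply when an application is submitted directly with the standard spark-submit command and must be known at launch time, so they are passed with --conf. The sketch below wraps such an invocation in Python purely for illustration; the ytsaurus:// master URL scheme, the pool name, and the script path are assumptions/placeholders, not verified values.

```python
import subprocess

# Sketch: a direct-launch submission with a few spark.ytsaurus.* options.
# The master URL, pool name, and my_job.py are placeholders for this example.
subprocess.run([
    "spark-submit",
    "--master", "ytsaurus://<cluster-proxy>",
    "--deploy-mode", "cluster",
    "--conf", "spark.ytsaurus.pool=research",             # hypothetical scheduler pool
    "--conf", "spark.ytsaurus.driver.maxFailures=3",
    "--conf", "spark.ytsaurus.redirect.stdout.to.stderr=true",
    "my_job.py",
], check=True)
```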

Configuration options for the YTsaurus Shuffle service

| Parameter | Default value | Description | Starting with version |
| --- | --- | --- | --- |
| spark.ytsaurus.shuffle.transaction.timeout | 5m | Timeout of the transaction that handles shuffle chunk writes. In regular operation the transaction is periodically pinged by the driver; the timeout sets the time between the last ping and transaction rollback with chunk deletion. | 2.7.0 |
| spark.ytsaurus.shuffle.account | intermediate | Account used for writing shuffle chunks. | 2.7.0 |
| spark.ytsaurus.shuffle.medium | - | Medium used for writing shuffle chunks. Defaults to the system-wide setting. | 2.7.0 |
| spark.ytsaurus.shuffle.replication.factor | - | Replication factor for shuffle chunks. Defaults to the system-wide setting. | 2.7.0 |
| spark.ytsaurus.shuffle.partition.column | partition | Name of the chunk column used to store the target partition index. | 2.7.0 |
| spark.ytsaurus.shuffle.write.row.size | 8m | Maximum size of a single row in a chunk containing shuffle data. This value is not directly related to the size of shuffle data rows; it only controls how serialized shuffle data is split into chunk rows. Reducing it increases the number of rows in the chunk, while raising it may exceed the maximum allowed chunk row size. | 2.7.0 |
| spark.ytsaurus.shuffle.write.buffer.size | 10 | Size (in rows) of the buffer for writing shuffle data to YTsaurus. Set this parameter together with spark.ytsaurus.shuffle.write.row.size to avoid running out of RAM. | 2.7.0 |
| spark.ytsaurus.shuffle.write.config | - | Additional parameters, in YSON format, for writing shuffle data to YTsaurus. Matches the TableWriter configuration. | 2.7.0 |
| spark.ytsaurus.shuffle.read.config | - | Additional parameters, in YSON format, for reading shuffle data from YTsaurus. Matches the TableReader configuration. | 2.7.0 |
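
As an illustration, the sketch below enables the shuffle service and points shuffle chunks at a dedicated account. The account name is a placeholder, and these properties need to be in place before the application starts (via --conf or, as here, in the session builder).

```python
from pyspark.sql import SparkSession

# Sketch: turning on the YTsaurus Shuffle service and tuning shuffle chunk
# storage. "my-account" is a placeholder account; the timeout value is only
# an example of overriding the default from the table above.
spark = (
    SparkSession.builder
    .config("spark.ytsaurus.shuffle.enabled", "true")
    .config("spark.ytsaurus.shuffle.account", "my-account")
    .config("spark.ytsaurus.shuffle.transaction.timeout", "10m")
    .getOrCreate()
)
```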

Options for running tasks in an internal cluster

To run tasks in an internal cluster, use the spark-submit-yt wrapper. Its parameters match those of the spark-submit command from the Spark distribution, with the following exception:

  • Instead of --master, you should use the parameters --proxy and --discovery-path. They determine which YTsaurus cluster will be used to run computations and which internal Spark cluster on that YTsaurus cluster the task will be sent to, respectively.
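
For illustration only, here is a sketch of such a submission wrapped in Python; the proxy address, discovery path, and script name are placeholders.

```python
import subprocess

# Sketch: submitting to an internal Spark cluster with spark-submit-yt.
# --proxy and --discovery-path replace --master; the remaining arguments
# mirror spark-submit. All values below are placeholders.
subprocess.run([
    "spark-submit-yt",
    "--proxy", "<cluster-name>",
    "--discovery-path", "//home/spark/my-cluster/discovery",
    "--deploy-mode", "cluster",
    "my_job.py",
], check=True)
```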