Configuration parameters for running Spark tasks

This section lists the configuration parameters that can be passed when launching Spark tasks. They are passed as additional parameters via the --conf option of the standard Spark commands, such as spark-submit and spark-shell, as well as their YTsaurus wrappers, such as spark-submit-yt and spark-shell-yt.
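
As a quick illustration, many of the same application properties can also be set programmatically when building a Spark session. The PySpark snippet below is a minimal sketch: the application name and option values are placeholders, and properties that must be known before the driver starts still have to be passed with --conf at submit time.

```python
from pyspark.sql import SparkSession

# Minimal sketch: setting SPYT options programmatically instead of passing
# them with --conf. The application name and values are placeholders; launch
# options from the tables below still need to go through --conf at submit time.
spark = (
    SparkSession.builder
    .appName("spyt-config-example")
    .config("spark.yt.write.batchSize", "500000")
    .config("spark.yt.read.arrow.enabled", "true")
    .getOrCreate()
)
```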

Basic options

Most of the options are available starting with version 1.23.0, unless otherwise specified.

| Name | Default value | Description |
| --- | --- | --- |
| spark.yt.write.batchSize | 500000 | Size of data sent in a single WriteTable operation. |
| spark.yt.write.miniBatchSize | 1000 | Size of a data block sent in WriteTable. |
| spark.yt.write.timeout | 60 seconds | Write timeout for a single data block. |
| spark.yt.write.typeV3.enabled (spark.yt.write.writingTypeV3.enabled before 1.75.2) | true | Write tables with a schema in type_v3 format instead of type_v1. |
| spark.yt.read.vectorized.capacity | 1000 | Maximum number of rows in a batch when reading via the wire protocol. |
| spark.yt.read.arrow.enabled | true | Use the Arrow format for reading data (when possible). |
| spark.hadoop.yt.timeout | 300 seconds | Timeout for reads from YTsaurus. |
| spark.yt.read.typeV3.enabled (spark.yt.read.parsingTypeV3.enabled before 1.75.2) | true | Read tables with a schema in type_v3 format instead of type_v1. |
| spark.yt.read.keyColumnsFilterPushdown.enabled | true | Push Spark query filters down to YTsaurus for selective reading. |
| spark.yt.read.keyColumnsFilterPushdown.union.enabled | false | Combine all filters into a single continuous range for selective reading. |
| spark.yt.read.keyColumnsFilterPushdown.ytPathCount.limit | 100 | Maximum number of table ranges for selective reading. |
| spark.yt.transaction.timeout | 5 minutes | Timeout of the write operation transaction. |
| spark.yt.transaction.pingInterval | 30 seconds | Ping interval of the write operation transaction. |
| spark.yt.globalTransaction.enabled | false | Use a global transaction. |
| spark.yt.globalTransaction.id | None | Global transaction ID. |
| spark.yt.globalTransaction.timeout | 5 minutes | Global transaction timeout. |
| spark.hadoop.yt.user | - | YTsaurus user name. |
| spark.hadoop.yt.token | - | YTsaurus user token. |
| spark.yt.read.ytPartitioning.enabled | true | Use table partitioning performed by YTsaurus. |
| spark.yt.read.planOptimization.enabled | false | Optimize aggregations and joins on sorted input data. |
| spark.yt.read.keyPartitioningSortedTables.enabled | true | Partition sorted tables by key; required for plan optimization. |
| spark.yt.read.keyPartitioningSortedTables.unionLimit | 1 | Maximum number of partition joins when switching from reading by index to reading by key. |
| spark.yt.read.transactional | true | Take a snapshot lock for reading when no transaction is specified. It is recommended to turn this option off when reading immutable data to improve read performance. |
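
For example, when reading an immutable static table it may be preferable to skip the snapshot lock, as recommended above. Below is a minimal sketch, assuming the SPYT PySpark bindings that provide the spark.read.yt reader are available; the table path is a placeholder.

```python
from pyspark.sql import SparkSession

# Sketch under assumptions: //home/examples/static_table is a placeholder for
# an immutable static table, and spark.read.yt is provided by the SPYT
# PySpark bindings.
spark = (
    SparkSession.builder
    .config("spark.yt.read.transactional", "false")  # skip the snapshot lock for immutable data
    .config("spark.yt.read.typeV3.enabled", "true")  # keep type_v3 schema parsing on
    .getOrCreate()
)

df = spark.read.yt("//home/examples/static_table")
df.show(10)
```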

Options for launching tasks directly

| Parameter | Default value | Description | Starting with version |
| --- | --- | --- | --- |
| spark.ytsaurus.config.global.path | //home/spark/conf/global | Path to the document with the global Spark and SPYT configuration on the cluster. | 1.76.0 |
| spark.ytsaurus.config.releases.path | //home/spark/conf/releases for release versions, //home/spark/conf/pre-releases for pre-release versions | Path to the SPYT release configuration. | 1.76.0 |
| spark.ytsaurus.distributives.path | //home/spark/distrib | Path to the directory with Spark distributions. Within this directory, the structure looks like a/b/c/spark-a.b.c-bin-hadoop3.tgz. | 2.0.0 |
| spark.ytsaurus.config.launch.file | spark-launch-conf | Name of the document with the release configuration, located in the directory given by spark.ytsaurus.config.releases.path. | 1.76.0 |
| spark.ytsaurus.spyt.version | Matches the SPYT version on the client | SPYT version to be used on the cluster when launching a Spark application. | 1.76.0 |
| spark.ytsaurus.driver.maxFailures | 5 | Maximum allowed number of driver failures before the operation is considered failed. | 1.76.0 |
| spark.ytsaurus.executor.maxFailures | 10 | Maximum allowed number of executor failures before the operation is considered failed. | 1.76.0 |
| spark.ytsaurus.executor.operation.shutdown.delay | 10000 | Maximum time in milliseconds to wait for executors to finish when stopping the application before aborting the executor operation. | 1.76.0 |
| spark.ytsaurus.pool | - | Scheduler pool in which driver and executor operations are run. | 1.78.0 |
| spark.ytsaurus.python.binary.entry.point | - | Function used as an entry point when running compiled Python tasks. | 2.4.0 |
| spark.ytsaurus.python.executable | - | Path to the Python interpreter used in the driver and executors. | 1.78.0 |
| spark.ytsaurus.tcp.proxy.enabled | false | Whether a TCP proxy is used to access the operation. | 2.1.0 |
| spark.ytsaurus.tcp.proxy.range.start | 30000 | Minimum port number for a TCP proxy. | 2.1.0 |
| spark.ytsaurus.tcp.proxy.range.size | 1000 | Size of the port range that can be allocated for a TCP proxy. | 2.1.0 |
| spark.ytsaurus.cuda.version | - | CUDA version used for Spark applications. Relevant when the computations use GPUs. | 2.1.0 |
| spark.ytsaurus.redirect.stdout.to.stderr | false | Redirect user script output from stdout to stderr. | 2.1.0 |
| spark.ytsaurus.remote.temp.files.directory | //tmp/yt_wrapper/file_storage | Path to the Cypress cache used for uploading local scripts. | 2.4.0 |
| spark.ytsaurus.annotations | - | Annotations for driver and executor operations. | 2.2.0 |
| spark.ytsaurus.driver.annotations | - | Annotations for the driver operation. | 2.2.0 |
| spark.ytsaurus.executors.annotations | - | Annotations for the executor operation. | 2.2.0 |
| spark.ytsaurus.driver.watch | true | Flag for monitoring a driver operation executed in cluster mode. | 2.4.2 |
| spark.ytsaurus.network.project | - | Name of the network project in which the Spark application is launched. | 2.4.3 |
| spark.hadoop.yt.mtn.enabled | false | Flag for enabling MTN support. | 2.4.3 |
| spark.ytsaurus.squashfs.enabled | false | Use squashFS layers instead of Porto layers in a YTsaurus job. | 2.6.0 |
| spark.ytsaurus.client.rpc.timeout | - | Timeout used by the RPC client that starts YTsaurus operations. | 2.6.0 |
| spark.ytsaurus.rpc.job.proxy.enabled | true | Flag for using the RPC proxy embedded in the job proxy. | 2.6.0 |
| spark.ytsaurus.java.home | /opt/jdk[11,17] | Path to the JDK home directory used in cluster containers. Depends on the JDK used on the client side. Allowed versions: JDK 11 and JDK 17. | 2.6.0 |
| spark.ytsaurus.shuffle.enabled | false | Use the YTsaurus Shuffle service. | 2.7.2 |
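
These options apply when an application is submitted directly with the standard spark-submit command and must be known at launch time, so they are passed with --conf. The sketch below wraps such an invocation in Python purely for illustration; the ytsaurus:// master URL scheme, the pool name, and the script path are assumptions/placeholders, not verified values.

```python
import subprocess

# Sketch: a direct-launch submission with a few spark.ytsaurus.* options.
# The master URL, pool name, and my_job.py are placeholders for this example.
subprocess.run([
    "spark-submit",
    "--master", "ytsaurus://<cluster-proxy>",
    "--deploy-mode", "cluster",
    "--conf", "spark.ytsaurus.pool=research",             # hypothetical scheduler pool
    "--conf", "spark.ytsaurus.driver.maxFailures=3",
    "--conf", "spark.ytsaurus.redirect.stdout.to.stderr=true",
    "my_job.py",
], check=True)
```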

Configuration options for the YTsaurus Shuffle service

| Parameter | Default value | Description | Starting with version |
| --- | --- | --- | --- |
| spark.ytsaurus.shuffle.transaction.timeout | 5m | Timeout of the transaction that handles shuffle chunk writes. In regular operation the transaction is periodically pinged by the driver; the timeout sets the time between the last ping and transaction rollback with chunk deletion. | 2.7.0 |
| spark.ytsaurus.shuffle.account | intermediate | Account used for writing shuffle chunks. | 2.7.0 |
| spark.ytsaurus.shuffle.medium | - | Medium used for writing shuffle chunks. Defaults to the system-wide setting. | 2.7.0 |
| spark.ytsaurus.shuffle.replication.factor | - | Replication factor for shuffle chunks. Defaults to the system-wide setting. | 2.7.0 |
| spark.ytsaurus.shuffle.partition.column | partition | Name of the chunk column used to store the target partition index. | 2.7.0 |
| spark.ytsaurus.shuffle.write.row.size | 8m | Maximum size of a single row in a chunk containing shuffle data. This value is not directly related to the size of shuffle data rows; it only controls how serialized shuffle data is split into chunk rows. Reducing it increases the number of rows in the chunk, while raising it may exceed the maximum allowed chunk row size. | 2.7.0 |
| spark.ytsaurus.shuffle.write.buffer.size | 10 | Size (in rows) of the buffer for writing shuffle data to YTsaurus. Set this parameter together with spark.ytsaurus.shuffle.write.row.size to avoid running out of RAM. | 2.7.0 |
| spark.ytsaurus.shuffle.write.config | - | Additional parameters, in YSON format, for writing shuffle data to YTsaurus. Matches the TableWriter configuration. | 2.7.0 |
| spark.ytsaurus.shuffle.read.config | - | Additional parameters, in YSON format, for reading shuffle data from YTsaurus. Matches the TableReader configuration. | 2.7.0 |
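
As an illustration, the sketch below enables the shuffle service and points shuffle chunks at a dedicated account. The account name is a placeholder, and these properties need to be in place before the application starts (via --conf or, as here, in the session builder).

```python
from pyspark.sql import SparkSession

# Sketch: turning on the YTsaurus Shuffle service and tuning shuffle chunk
# storage. "my-account" is a placeholder account; the timeout value is only
# an example of overriding the default from the table above.
spark = (
    SparkSession.builder
    .config("spark.ytsaurus.shuffle.enabled", "true")
    .config("spark.ytsaurus.shuffle.account", "my-account")
    .config("spark.ytsaurus.shuffle.transaction.timeout", "10m")
    .getOrCreate()
)
```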

Options for running tasks in an internal cluster

To run tasks in an internal cluster, use the spark-submit-yt wrapper. Its parameters match those of the spark-submit command from the Spark distribution, with the following exception:

  • Instead of --master, you should use the parameters --proxy and --discovery-path. They determine which YTsaurus cluster will be used to run computations and which internal Spark cluster on that YTsaurus cluster the task will be sent to, respectively.
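
For illustration only, here is a sketch of such a submission wrapped in Python; the proxy address, discovery path, and script name are placeholders.

```python
import subprocess

# Sketch: submitting to an internal Spark cluster with spark-submit-yt.
# --proxy and --discovery-path replace --master; the remaining arguments
# mirror spark-submit. All values below are placeholders.
subprocess.run([
    "spark-submit-yt",
    "--proxy", "<cluster-name>",
    "--discovery-path", "//home/spark/my-cluster/discovery",
    "--deploy-mode", "cluster",
    "my_job.py",
], check=True)
```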