Definitions in SPYT
SPYT sits at the intersection of two systems:
- Apache Spark as a compute engine;
- YTsaurus as a data storage system, resource manager, and process launcher.
Because of this, the same terms used in documentation and discussions may refer to different entities depending on the context. For example, the terms “cluster”, “job”, “operation”, “partition”, “table”, etc. are used differently in Spark and YTsaurus.
The goal of this section is to provide a unified set of definitions and help distinguish overlapping concepts in Spark, SPYT, and YTsaurus.
SPYT overview and launch modes
SPYT overview
SPYT (Spark over YTsaurus) is the integration of Apache Spark with the YTsaurus infrastructure.
SPYT lets you run Spark applications on data stored in YTsaurus and use YTsaurus as an environment for resource allocation, data storage, logging, and storing service information.
SPYT is not a separate compute engine: Spark performs the computations, while YTsaurus provides the infrastructure layer.
SPYT cluster
SPYT cluster is a pre‑launched Spark Standalone cluster inside YTsaurus.
Such a cluster consists of a Spark Master, one or more Spark Workers, and usually a Spark History Server. In YTsaurus, it is launched as one or more Vanilla operations.
It’s important not to confuse an SPYT cluster with a YTsaurus cluster:
- YTsaurus cluster is the entire compute cluster;
- SPYT cluster is a Spark Standalone cluster deployed inside a YTsaurus cluster.
Direct Submit
Direct Submit is a mode for launching a Spark application without a pre‑launched SPYT cluster.
In this mode, separate Spark Master and Spark Worker instances are not used. The driver is launched directly in YTsaurus, and executors are allocated on demand. From an architectural perspective, the YTsaurus scheduler largely takes over resource management, the role that Spark Master and Spark Workers play in a standalone cluster.
Direct Submit is contrasted with the standalone cluster mode, where the application connects to an already running SPYT cluster.
Spark components in the context of YTsaurus
Spark Master
Spark Master is a component of a Spark Standalone cluster that manages workers and allocates resources among Spark applications.
In SPYT, Spark Master exists only in standalone cluster mode. It is launched inside YTsaurus as part of an SPYT cluster and publishes information about itself in discovery_path.
There is no Spark Master in Direct Submit mode.
Spark Worker
Spark Worker is a component of a Spark Standalone cluster that manages the resources of a specific node and launches executors when instructed by the Spark Master.
Spark Worker exists only in an SPYT cluster. Workers are absent in Direct Submit mode.
It’s important not to confuse Spark Worker and Executor:
- Worker is part of the Spark cluster that manages node resources;
- Executor is a process of a specific Spark application that performs computations.
Spark Driver
Spark Driver is the coordinating process of a Spark application.
The driver builds an execution plan, breaks down computations into stages and tasks, requests resources, sends tasks to executors, and collects results.
In SPYT, the driver can be launched:
- as a separate process inside YTsaurus;
- locally on a client machine if client mode is used.
It’s important not to confuse the driver and Spark Master:
- Spark Master manages the cluster;
- Spark Driver manages one specific Spark application.
Executor
Executor is a Spark application process that directly performs tasks on data.
Executors are launched upon request from the driver and operate within a single Spark application. They read data, perform computations, participate in shuffle, and can store cached data.
In SPYT, executors are launched inside YTsaurus and use resources allocated by YTsaurus.
It’s important not to confuse an executor with Spark Worker:
- Worker provides resources;
- Executor consumes those resources to perform application tasks.
Spark History Server
Spark History Server (SHS) is a service for viewing logs of completed Spark applications.
It lets you analyze completed applications, their stages, tasks, resource consumption, and errors after they finish running.
In SPYT, SHS is usually launched as part of an SPYT cluster. Its data source is the event logs of Spark applications.
Data — YTsaurus tables vs Spark abstractions
Spark Partition and YTsaurus Chunk/Partition
In the context of SPYT, the word “partition” can refer to different entities, so it’s important to distinguish at least three concepts: Spark Partition, YTsaurus Chunk, and YTsaurus Partition.
Spark Partition is a unit of parallelism in Spark.
One Spark task usually processes one Spark partition. The set of Spark partitions determines how data is distributed across tasks and defines the parallelism level for reads and computations.
YTsaurus Chunk is a unit of physical data storage in YTsaurus.
A chunk is a low‑level storage and replication object. A YTsaurus table physically consists of chunks, and reading data in batch scenarios ultimately relies on them.
YTsaurus Partition is a logical division of data in YTsaurus. What exactly it denotes depends on the table type and processing scenario.
When reading YTsaurus tables via SPYT, these entities do not directly correspond:
- chunk refers to the physical storage of data in YTsaurus;
- YTsaurus partition refers to the logical or internal data partitioning in YTsaurus;
- Spark partition refers to computation execution in Spark.
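The lack of a one-to-one mapping between chunks and Spark partitions can be illustrated with a toy model. The sketch below is plain Python, not SPYT code: the function name `plan_spark_partitions`, the row-count splitting rule, and the sample chunk sizes are all assumptions made for illustration, and the actual SPYT reader chooses split boundaries by its own rules.

```python
from typing import List, Tuple

# (chunk_index, first_row, last_row_exclusive): a slice of one stored chunk.
ChunkSlice = Tuple[int, int, int]


def plan_spark_partitions(
    chunk_rows: List[int], rows_per_partition: int
) -> List[List[ChunkSlice]]:
    """Group stored chunks into read ranges of at most rows_per_partition rows."""
    if rows_per_partition <= 0:
        raise ValueError("rows_per_partition must be positive")
    partitions: List[List[ChunkSlice]] = []
    current: List[ChunkSlice] = []
    room = rows_per_partition  # rows that still fit into the current partition
    for chunk_index, total in enumerate(chunk_rows):
        start = 0
        while start < total:
            take = min(room, total - start)
            current.append((chunk_index, start, start + take))
            start += take
            room -= take
            if room == 0:  # the partition is full, start a new one
                partitions.append(current)
                current, room = [], rows_per_partition
    if current:
        partitions.append(current)
    return partitions


# Three chunks of uneven size, read with a target of 100 rows per partition.
plan = plan_spark_partitions([250, 30, 120], rows_per_partition=100)
for index, slices in enumerate(plan):
    print(f"spark partition {index}: {slices}")
```

In this model three chunks become four Spark partitions: the large chunk is split across several partitions, while one partition combines slices of all three chunks.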
DataFrame
DataFrame is the main structured data abstraction in Spark SQL.
A DataFrame is a collection of rows with named columns and a schema. In SPYT, a DataFrame is usually created by reading YTsaurus tables.
It’s important not to confuse a DataFrame and a YTsaurus table:
- YTsaurus table is a data storage object in YTsaurus;
- DataFrame is a data representation inside a Spark application.
Infrastructure and integration
discovery_path
discovery_path is a path in YTsaurus Cypress that Spark applications use to find a running SPYT cluster.
In standalone cluster mode, the Spark Master publishes service information in this path: service addresses, connection parameters, and other cluster metadata.
discovery_path is a concept specific to SPYT. Standard Spark does not have such a mechanism.
Vanilla operation in YTsaurus
Vanilla operation in YTsaurus is a universal type of YTsaurus operation for launching arbitrary user processes.
In SPYT, Vanilla operations are used to launch Spark infrastructure components:
- Spark Master;
- Spark Worker;
- Spark Driver;
- Executor;
- Spark History Server.
Shuffle
Shuffle is a mechanism for redistributing data among executors in Spark.
Shuffle occurs when data needs to be regrouped by key or redistributed among nodes for further computations. Typical examples include join, groupBy, distinct, sorts, and some window operations.
Shuffle is one of the most expensive operations in Spark because it involves network communication, serialization, writing intermediate data, and additional load on memory and disk.
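The regrouping step can be sketched in a few lines of plain Python. This is a simplified model rather than Spark's implementation: it assumes hash partitioning as the routing rule, keeps everything in memory, and uses hypothetical names (`reducer_for`, `shuffle`, `sum_by_key`) and made-up sample records.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, int]


def reducer_for(key: str, num_reducers: int) -> int:
    """Stable hash partitioning: equal keys always map to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs: List[List[Record]], num_reducers: int) -> List[List[Record]]:
    """Move every record from its map-side partition to its reduce-side partition."""
    reduce_inputs: List[List[Record]] = [[] for _ in range(num_reducers)]
    for records in map_outputs:
        for key, value in records:
            reduce_inputs[reducer_for(key, num_reducers)].append((key, value))
    return reduce_inputs


def sum_by_key(records: List[Record]) -> Dict[str, int]:
    """Reduce-side aggregation over one partition."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in records:
        totals[key] += value
    return dict(totals)


# Records with the same key start out scattered across three map-side partitions.
map_outputs = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("a", 6)],
]
reduce_inputs = shuffle(map_outputs, num_reducers=2)

# After the shuffle each key lives in exactly one partition,
# so per-partition results can be merged without conflicts.
totals: Dict[str, int] = {}
for records in reduce_inputs:
    totals.update(sum_by_key(records))
print(totals)
```

In a real application every append in `shuffle` stands for serializing a record, writing it to intermediate storage, and sending it over the network, which is where the cost comes from.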
YTsaurus Shuffle Service
YTsaurus Shuffle Service is an implementation of the shuffle infrastructure for SPYT on top of YTsaurus.
It is used for storing and transferring intermediate shuffle data in the integration of Spark with YTsaurus. This is an infrastructure component that helps adapt Spark’s shuffle model to the YTsaurus environment.
It’s important to distinguish between the two:
- shuffle is a Spark mechanism for redistributing data;
- YTsaurus Shuffle Service is an infrastructure implementation for storing and transferring shuffle data in YTsaurus.
Work units — Spark vs YTsaurus
Spark Application
Spark Application is a unit of a user program in Spark.
A Spark application consists of:
- one driver;
- a set of executors;
- a logical computation plan and the related jobs, stages, and tasks.
From a practical perspective, one launch of spark-submit, spark-submit-yt, or one user Spark session corresponds to one Spark Application.
It’s important not to confuse Spark Application with a YTsaurus operation, Spark Job, or YTsaurus Job:
- Spark Application is a logical execution unit in Spark;
- YTsaurus operation is an infrastructure unit for launching processes in YTsaurus.
One Spark application in SPYT can use one or more YTsaurus operations to launch its components. Conversely, one YTsaurus operation can host several Spark applications, as happens on a standalone SPYT cluster.
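The many-to-many relationship between applications and operations can be written down as a small lookup model. The sketch below is illustrative only: the application and operation names are hypothetical, and `index_placements` is a helper invented for this example, not part of SPYT or YTsaurus.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# (Spark application, YTsaurus operation hosting one of its processes).
# All names are hypothetical.
placements: List[Tuple[str, str]] = [
    ("etl-app", "op-spyt-cluster"),     # two applications share one operation,
    ("report-app", "op-spyt-cluster"),  # as on a standalone SPYT cluster
    ("adhoc-app", "op-adhoc-1"),        # one application spread over
    ("adhoc-app", "op-adhoc-2"),        # two operations
]


def index_placements(
    pairs: List[Tuple[str, str]],
) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Build both lookup directions from (application, operation) pairs."""
    ops_by_app: Dict[str, Set[str]] = defaultdict(set)
    apps_by_op: Dict[str, Set[str]] = defaultdict(set)
    for app, op in pairs:
        ops_by_app[app].add(op)
        apps_by_op[op].add(app)
    return dict(ops_by_app), dict(apps_by_op)


ops_by_app, apps_by_op = index_placements(placements)
print(sorted(ops_by_app["adhoc-app"]))
print(sorted(apps_by_op["op-spyt-cluster"]))
```

Neither side can serve as an identifier for the other, so it helps to name both the application and the operation when discussing a failure.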
Spark Job
Spark Job is a unit of computation inside a Spark application, typically spawned by a single action.
For example, calls to count(), collect(), show(), or writing a result can launch separate Spark Jobs.
A Spark Job is divided into:
- stage — an execution stage;
- task — an individual task within a stage.
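The job, stage, and task hierarchy can be modeled with a few data classes. This is a toy model in plain Python, not Spark's scheduler: the class names, the `build_job` helper, and the partition counts are assumptions for illustration. It relies on two facts about Spark: stages are separated by shuffle boundaries, and a stage usually runs one task per partition.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Task:
    stage_id: int
    partition: int  # the Spark partition this task processes


@dataclass
class Stage:
    stage_id: int
    num_partitions: int

    def tasks(self) -> List[Task]:
        """One task per partition of the stage."""
        return [Task(self.stage_id, p) for p in range(self.num_partitions)]


@dataclass
class Job:
    action: str  # the action that triggered the job, e.g. "count"
    stages: List[Stage]


def build_job(action: str, partitions_per_stage: List[int]) -> Job:
    """Create a job whose stages are listed in execution order."""
    stages = [Stage(i, n) for i, n in enumerate(partitions_per_stage)]
    return Job(action, stages)


# A job with one shuffle boundary: 4 partitions before it, 2 after it.
job = build_job("count", [4, 2])
all_tasks = [task for stage in job.stages for task in stage.tasks()]
print(f"{job.action}: {len(job.stages)} stages, {len(all_tasks)} tasks")
```

Each further action in the same application would produce another `Job` with its own stages and tasks.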
YTsaurus Job
YTsaurus Job is a process launched by the YTsaurus scheduler within a YTsaurus operation.
This is an infrastructure execution unit in YTsaurus. For example, in the context of SPYT, a YTsaurus Job may correspond to a process where an executor or another Spark component runs.
The difference between Spark Job and YTsaurus Job is fundamental:
- Spark Job is a logical unit of the Spark execution plan;
- YTsaurus Job is an infrastructure process launched by YTsaurus.