Exactly-once guarantee
By default, streaming in SPYT provides an at-least-once guarantee — writing to the output table and committing the consumer offset are performed independently. Therefore, duplicate records may appear in the table if a micro‑batch is reprocessed after a failure.
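The failure window can be illustrated with a small simulation. This is a toy model in plain Python, not SPYT code: the `ToyStream` class, its in-memory `output` list, and the `committed_offset` field are hypothetical stand-ins for the output table and the consumer offset.

```python
class CrashBeforeCommit(Exception):
    """Simulated failure between the table write and the offset commit."""


class ToyStream:
    """Toy model of at-least-once delivery: write and commit are separate steps."""

    def __init__(self, queue):
        self.queue = list(queue)       # stand-in for the input queue
        self.output = []               # stand-in for the output table
        self.committed_offset = 0      # stand-in for the consumer offset

    def process_batch(self, batch_size, crash_before_commit=False):
        start = self.committed_offset
        batch = self.queue[start:start + batch_size]
        self.output.extend(batch)                    # step 1: write rows
        if crash_before_commit:
            raise CrashBeforeCommit()                # offset is never advanced
        self.committed_offset = start + len(batch)   # step 2: commit offset


stream = ToyStream(["a", "b", "c", "d"])
try:
    stream.process_batch(2, crash_before_commit=True)
except CrashBeforeCommit:
    pass                      # the job restarts from the last committed offset
stream.process_batch(2)       # the same micro-batch is processed again
stream.process_batch(2)

print(stream.output)          # rows "a" and "b" appear twice
```

Because the write has already landed when the failure occurs, the retry repeats it. No row is lost, but the first micro-batch is duplicated.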
If duplicates are not acceptable, SPYT offers two ways to ensure exactly‑once delivery:
| Approach | How it works | When to use |
| --- | --- | --- |
| Transactional mode (SPYT 2.10+) | | When data accuracy is critical: financial analytics, ML features, incremental data mart construction |
| Idempotent receiver | | When you want to avoid the additional RPC proxy load caused by the transactional mode, or when maintaining legacy code |
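The difference between the two approaches can be sketched with a toy model. Everything here is an assumption made for illustration: the function names are hypothetical, the "transaction" is modelled as staging the rows and the offset and publishing them in one step, and the idempotent receiver is modelled as an upsert keyed by queue position. The actual SPYT mechanisms are described on the pages listed under "See also".

```python
def run_transactional(queue, batch_size):
    """Rows and offset are staged together and published in one step, or not at all."""
    output, offset = [], 0
    crashed_once = False
    while offset < len(queue):
        batch = queue[offset:offset + batch_size]
        staged_output = output + batch        # staged, not yet visible
        staged_offset = offset + len(batch)
        if not crashed_once:
            crashed_once = True               # simulated failure: staged data is dropped
            continue
        output, offset = staged_output, staged_offset   # atomic publish
    return output


def run_idempotent(queue, batch_size):
    """Rows are upserted by a deterministic key, so a replay overwrites them."""
    table, offset = {}, 0                     # dict stands in for a keyed (sorted) table
    crashed_once = False
    while offset < len(queue):
        batch = queue[offset:offset + batch_size]
        for position, row in enumerate(batch, start=offset):
            table[position] = row             # key = queue position (assumption)
        if not crashed_once:
            crashed_once = True               # failure after the write, before the commit
            continue
        offset += len(batch)
    return [table[key] for key in sorted(table)]


rows = ["a", "b", "c", "d", "e"]
transactional_result = run_transactional(rows, batch_size=2)
idempotent_result = run_idempotent(rows, batch_size=2)
print(transactional_result)
print(idempotent_result)
```

In both runs the first micro-batch is retried after a simulated failure, yet each input row ends up in the output exactly once. The keyed variant only works if every output row maps to a stable key derived from its input row, which is consistent with the idempotent receiver being positioned as an option for stateless 1:1 transformations.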
Warning
Both approaches guarantee exactly‑once only within Spark jobs in YTsaurus: the guarantee covers writing to the output table and committing the consumer offset. It does not apply to external systems — such as writing to an external database, sending messages to other queues, or calling external APIs. For such scenarios, you need additional measures at the application level.
If duplicate output data is acceptable, no additional configuration is required — non‑transactional streaming is enabled by default.
Performance
Below are the throughput measurement results for the transactional mode compared to other approaches: non‑transactional streaming and the idempotent receiver.
Conditions: synthetic NEXMark‑style loads, a queue of 5 million rows (16 tablets, average row size ~70 bytes), 8 executors × 2 cores, shared RPC proxy in both modes for a fair comparison.
| Load type | Mode | Throughput |
| --- | --- | --- |
| Stream‑static | Non‑transactional streaming | ~22 700 rows/s |
| | Transactional streaming | ~26 200 rows/s (+13 %) |
| Stateful aggregation in 1‑minute event windows | Non‑transactional streaming | ~13 900 rows/s |
| | Transactional streaming | ~14 000 rows/s (within noise) |
| Comparison with idempotent receiver (passthrough, 1 million rows, 1 executor × 1 core) | Idempotent receiver (sorted table) | ~2 000 rows/s |
| | Transactional streaming | ~2 600 rows/s (+23 %) |
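The percentages in the table can be reproduced from the rounded throughput figures. The sketch below is plain arithmetic on those rounded values, with hypothetical helper names. It shows that the figures in parentheses match the difference taken as a share of the transactional throughput; taken relative to the slower mode, the same rounded values give roughly 15 % and 30 %.

```python
def gain_as_share_of_faster(baseline, faster):
    """Difference as a percentage of the faster (transactional) throughput."""
    return (faster - baseline) / faster * 100


def gain_relative_to_baseline(baseline, faster):
    """Difference as a percentage of the slower (baseline) throughput."""
    return (faster - baseline) / baseline * 100


# Rounded rows/s values from the table: (baseline, transactional)
stream_static = (22_700, 26_200)
idempotent_comparison = (2_000, 2_600)

for name, (baseline, faster) in [
    ("stream-static", stream_static),
    ("vs idempotent receiver", idempotent_comparison),
]:
    share = round(gain_as_share_of_faster(baseline, faster))
    relative = round(gain_relative_to_baseline(baseline, faster))
    print(f"{name}: {share} % of transactional, {relative} % over baseline")
```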
The transactional mode performs better when there are many small writes or shuffle operations (join, groupBy, aggregations): a single transaction instead of dozens reduces the overhead for commits. For stateful operations with a small output volume, there is little to amortize, so there is practically no difference.
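The amortization argument can be made concrete with a rough cost model. All constants below are hypothetical and chosen only to show the shape of the effect; they are not measurements, and `batch_time_us` is an illustrative helper, not part of SPYT.

```python
def batch_time_us(compute_us, rows, commits, per_row_us=10, per_commit_us=20_000):
    """Illustrative micro-batch cost: compute + row writes + commit overhead."""
    return compute_us + rows * per_row_us + commits * per_commit_us


# Shuffle-heavy batch with many small writes: dozens of commits vs. one.
shuffle_many = batch_time_us(compute_us=1_000_000, rows=10_000, commits=48)
shuffle_one = batch_time_us(compute_us=1_000_000, rows=10_000, commits=1)

# Stateful batch with a small output: compute dominates, few commits either way.
stateful_many = batch_time_us(compute_us=5_000_000, rows=100, commits=2)
stateful_one = batch_time_us(compute_us=5_000_000, rows=100, commits=1)

print(f"shuffle-heavy speedup: {shuffle_many / shuffle_one:.2f}x")
print(f"stateful speedup:      {stateful_many / stateful_one:.3f}x")
```

Under these assumed costs, collapsing many commits into one noticeably shortens the shuffle-heavy batch, while the stateful batch barely changes because commit overhead is a small share of its total time.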
Compared to the idempotent receiver (which requires creating a sorted table), the transactional mode is consistently faster: data is written to an ordered table without maintaining a sorted index.
See also
- Transactional mode — setup instructions
- Idempotent receiver — an alternative for stateless 1:1 transformations
- Structured Streaming — overview and key use cases
- Streaming options — options reference