Overview

What is Spark?

Apache Spark is an engine for computations on big data (joins, grouping, filtering, etc.).

Spark processes data in RAM. The key difference between in-memory computing and the "conventional" 2005 MapReduce model is that data touches the disk as little as possible, which minimizes I/O, the slowest part of computing. For a single Map operation, the gain from using Spark is negligible. However, even a single Map and Reduce sequence avoids writing intermediate results to disk, provided there is enough memory.

With each subsequent MapReduce sequence the savings accumulate, and the output can be cached for reuse. For large and complex analytics pipelines, efficiency can increase manyfold.
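
The saving on intermediate writes can be illustrated with a toy model in plain Python. This is only a sketch and does not use Spark: the helper names `run_disk_pipeline` and `run_memory_pipeline`, the JSON files, and the three sample stages are all made up for the illustration.

```python
import json
import os
import tempfile


def run_disk_pipeline(records, stages, workdir):
    """MapReduce-style model: every stage writes its output to disk,
    and the next stage reads it back."""
    data = records
    writes = 0
    for i, stage in enumerate(stages):
        path = os.path.join(workdir, f"stage_{i}.json")
        with open(path, "w") as f:
            json.dump(stage(data), f)
        writes += 1
        with open(path) as f:
            data = json.load(f)
    return data, writes


def run_memory_pipeline(records, stages):
    """In-memory model: stage outputs are passed along without touching disk."""
    data = records
    for stage in stages:
        data = stage(data)
    return data, 0


# Hypothetical three-stage pipeline: map, filter, reduce.
stages = [
    lambda xs: [x * x for x in xs],            # map: square every value
    lambda xs: [x for x in xs if x % 2 == 0],  # filter: keep even values
    lambda xs: [sum(xs)],                      # reduce: sum what is left
]

records = list(range(10))
with tempfile.TemporaryDirectory() as workdir:
    disk_result, disk_writes = run_disk_pipeline(records, stages, workdir)
memory_result, memory_writes = run_memory_pipeline(records, stages)

print("disk:  ", disk_result, "stage outputs written:", disk_writes)
print("memory:", memory_result, "stage outputs written:", memory_writes)
```

Both pipelines produce the same result, but the disk-based model writes one file per stage while the in-memory model writes none. The longer the chain of stages, the larger that difference becomes.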

Spark also includes a full-scale query optimizer, Catalyst, which plans execution and takes into consideration:

  • Input data location and size.
  • Predicate pushdown to the file system.
  • Necessity and ordering of steps in query execution.
  • Collection of final table attributes.
  • Use of local data for processing.
  • Potential for computation pipelining.
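
To give a feel for what predicate pushdown saves, here is a toy model in plain Python. It is not how Catalyst is implemented and does not use Spark: the chunk layout with `min_ts`/`max_ts` statistics and the function names are assumptions chosen for the illustration.

```python
def make_chunk(start, stop):
    """Build a hypothetical storage chunk with min/max statistics on "ts"."""
    rows = [{"ts": t, "value": t * 2} for t in range(start, stop)]
    return {"min_ts": start, "max_ts": stop - 1, "rows": rows}


chunks = [make_chunk(0, 100), make_chunk(100, 200), make_chunk(200, 300)]


def scan_then_filter(chunks, lo):
    """Read every row, then apply the predicate ts >= lo."""
    rows_read = 0
    result = []
    for chunk in chunks:
        for row in chunk["rows"]:
            rows_read += 1
            if row["ts"] >= lo:
                result.append(row)
    return result, rows_read


def scan_with_pushdown(chunks, lo):
    """Push the predicate down to the storage layer: skip chunks whose
    statistics show that no row can match."""
    rows_read = 0
    result = []
    for chunk in chunks:
        if chunk["max_ts"] < lo:
            continue  # the whole chunk is below the bound; do not read it
        for row in chunk["rows"]:
            rows_read += 1
            if row["ts"] >= lo:
                result.append(row)
    return result, rows_read


rows_a, read_a = scan_then_filter(chunks, 250)
rows_b, read_b = scan_with_pushdown(chunks, 250)
print("without pushdown: rows read =", read_a, "rows returned =", len(rows_a))
print("with pushdown:    rows read =", read_b, "rows returned =", len(rows_b))
```

Both scans return the same rows, but the pushdown variant reads only the chunk that can contain matches.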

What is SPYT?

Spark over YTsaurus (SPYT) lets you start a Spark cluster on YTsaurus computational capacity. The cluster is started in a YTsaurus Vanilla operation; it takes a certain amount of resources from the quota and occupies them continuously. Spark can read both static and dynamic YTsaurus tables, perform computations on them, and write the result to a static table.

The current underlying Spark version is 3.2.2.

When to use SPYT

SPYT is an optimal choice for:

  • Developing in Java and using MapReduce in YTsaurus.
  • Optimizing pipeline performance on YTsaurus with two or more joins or groupings.
  • Writing integration ETL pipelines from other storage systems.
  • Ad-hoc analytics in interactive mode using Jupyter, pyspark or spark-shell.

Do not use SPYT if:

  • You need to process over 10 TB of data in a single transaction.
  • Your processing boils down to individual Map or MapReduce operations.

Submitting Spark applications to YTsaurus

  • Submitting directly to YTsaurus using the spark-submit command.
  • Launching an inner standalone Spark cluster inside YTsaurus using a Vanilla operation.

Languages to code in

Spark supports the following programming languages and environments: