What is Spark?
Apache Spark is an engine for big data computations (joins, groupings, filters, etc.).
Spark processes data in RAM. The key difference from the "conventional" 2005 MapReduce model is that intermediate data touches disk as little as possible, minimizing I/O costs, which are the slowest part of the computation. For a single Map operation, the gain from using Spark is negligible. However, even a single Map and Reduce sequence avoids writing intermediate results to disk, provided there is enough memory.
With each subsequent MapReduce sequence the savings compound, and intermediate output can be cached. For large, complex analytics pipelines, efficiency can increase many-fold.
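The disk-versus-memory difference described above can be illustrated with a toy sketch in plain Python (this is a conceptual illustration, not how Spark or MapReduce is actually implemented): the classic model persists every intermediate result to disk, while an in-memory engine feeds one stage's output directly into the next.

```python
import json
import tempfile
from pathlib import Path

def mapreduce_on_disk(records, map_fn, reduce_fn, workdir):
    """Classic style: the map stage writes its output to disk,
    and the reduce stage reads it back before aggregating."""
    intermediate = Path(workdir) / "map_output.json"
    intermediate.write_text(json.dumps([map_fn(r) for r in records]))  # disk write
    mapped = json.loads(intermediate.read_text())                      # disk read
    return reduce_fn(mapped)

def mapreduce_in_memory(records, map_fn, reduce_fn):
    """In-memory style: the map output stays in RAM and
    streams straight into the reducer."""
    return reduce_fn(map_fn(r) for r in records)

records = [1, 2, 3, 4]
with tempfile.TemporaryDirectory() as d:
    on_disk = mapreduce_on_disk(records, lambda x: x * x, sum, d)
in_memory = mapreduce_in_memory(records, lambda x: x * x, sum)

# Both produce the same answer; only the in-memory version skips the
# intermediate disk round-trip.
assert on_disk == in_memory == 30
```

The results are identical; what changes is the number of disk round-trips, and that number grows with every extra stage in a disk-backed pipeline.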
Spark also ships with a full-featured Catalyst query optimizer that plans execution while taking into account:
- Input data location and size.
- Predicate pushdown to the file system.
- The necessity and ordering of query execution steps.
- The set of attributes in the resulting table.
- Data locality during processing.
- Potential for computation pipelining.
What is SPYT?
Spark over YTsaurus (SPYT) enables a Spark cluster to be launched on YTsaurus computational capacity. The cluster is started inside a YTsaurus Vanilla operation, takes a fixed amount of resources from the quota, and holds them permanently. Spark can read both static and dynamic YTsaurus tables, perform computations on them, and write the result to a static table.
The current underlying Spark version is 3.2.2.
Languages to code in
In Spark, you can write code in one of three languages: Python, Java, or Scala.
When to use SPYT
Pick SPYT if you are:
- Developing in Java and using MapReduce in YTsaurus.
- Optimizing pipeline performance on YTsaurus with two or more joins or groupings.
Do not use SPYT if:
- You need to process over 10 TB of data in a single transaction.
- Your processing boils down to individual Map or MapReduce operations.