SPYT 2.0

The latest updates to SPYT, a tool for running Apache Spark® tasks inside YTsaurus

Over the past few months, we’ve made a number of important changes to SPYT. They are significant enough that we decided to mark them as the next stage in the project’s development: SPYT 2.0. Here’s what’s new:

Direct submission of Spark tasks to the YTsaurus scheduler

In previous versions, running Spark tasks in YTsaurus required starting a Spark Standalone cluster inside a YTsaurus Vanilla operation. This approach had two serious drawbacks. First, a running Spark Standalone cluster holds resources at all times, even when it is not executing any tasks. Second, it complicates launching the application itself, because an additional operation is needed to start the internal cluster.

Starting from version 1.76.0, SPYT allows you to run Spark applications directly through the YTsaurus scheduler. With this approach, computing resources are freed as soon as the Spark application completes.

Currently, this method does not support dynamic resource allocation. In the future, we plan to develop a single external shuffle service in YTsaurus, which will make dynamic resource allocation possible.

How to run Spark tasks using direct submission: documentation
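
As a rough illustration of what direct submission can look like from the command line, the sketch below assembles a `spark-submit` invocation in Python. The `ytsaurus://` master URL scheme, the proxy address, and the `yt:/` application path are assumptions made for illustration and are not taken from this post, so check the documentation linked above for the exact form. `--master`, `--deploy-mode`, and `--conf` are standard `spark-submit` options, and `spark.executor.instances` and `spark.dynamicAllocation.enabled` are standard Spark properties.

```python
import shlex


def build_direct_submit_command(proxy, app_path, executors=2, deploy_mode="cluster"):
    """Assemble a spark-submit argv for direct submission (illustrative only)."""
    if executors < 1:
        raise ValueError("executors must be a positive integer")
    return [
        "spark-submit",
        # Assumed master URL scheme pointing at the YTsaurus scheduler.
        "--master", f"ytsaurus://{proxy}",
        "--deploy-mode", deploy_mode,
        # Dynamic allocation is not supported yet, so ask for a fixed executor count.
        "--conf", "spark.dynamicAllocation.enabled=false",
        "--conf", f"spark.executor.instances={executors}",
        app_path,
    ]


# Hypothetical proxy address and application path.
command = build_direct_submit_command(
    "my.ytsaurus.cluster", "yt:/home/user/spark-job.py", executors=5
)
print(shlex.join(command))
```

Building the argument list separately from running it makes the command easy to inspect or log before handing it to `subprocess.run`.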

It’s now possible to work with other versions of Spark apart from the 3.2.2 fork

Starting with SPYT 2.0.0, we use the original Apache Spark® distributions instead of our forked version. This allows us to support any compatible Spark distribution instead of being tightly bound to version 3.2.2.

At the moment, SPYT is compatible with Spark versions 3.2.2 through 3.2.4. In the near future, we plan to ensure compatibility with all 3.x.x releases, and later with the upcoming 4.0.0. The list of compatible versions will keep growing.
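
If you want to fail fast when a job is started against an unsupported Spark distribution, a small guard like the one below may help. It is a minimal sketch that hard-codes the 3.2.2 to 3.2.4 range stated in this post; the function names are hypothetical and are not part of SPYT.

```python
# Range stated in this post for SPYT 2.0.0; it is expected to widen over time.
MIN_SUPPORTED = (3, 2, 2)
MAX_SUPPORTED = (3, 2, 4)


def parse_version(version):
    """Turn '3.2.4' into (3, 2, 4); reject anything but three numeric parts."""
    parts = version.strip().split(".")
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        raise ValueError(f"unexpected Spark version string: {version!r}")
    return tuple(int(part) for part in parts)


def is_supported(version):
    """Return True if the version falls inside the supported range (inclusive)."""
    return MIN_SUPPORTED <= parse_version(version) <= MAX_SUPPORTED


for candidate in ("3.2.2", "3.2.4", "3.3.0"):
    print(candidate, is_supported(candidate))
```

In a real job, the version string could come from `pyspark.__version__`. Comparing tuples of integers rather than raw strings avoids mistakes such as treating "3.10.0" as older than "3.2.2".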

SPYT has moved to a separate repository

Unlike YTsaurus, which is written in C++, SPYT is written in Scala, and its release cycle is largely independent of YTsaurus. We therefore moved SPYT to a separate repository, which will let us set up automated testing on GitHub for all pull requests in the near future. You can find SPYT here.

Deeper integration with Query Tracker

The current version differs significantly from the original SPYT+QT integration we released last year, in both stability and functionality. To comply correctly with ACLs, we have implemented temporary tokens. For an instant start, new queries reuse the Spark session of previous ones. In addition, commands for creating, writing to, and deleting tables are now available. We are also actively working on visualizing queries and their execution progress.

Other improvements

  • Executing SQL queries with Spark SQL, both with and without an internal Spark cluster;

  • Python 3.12 support;

  • Support for Spark Structured Streaming using ordered dynamic tables as queues.

How to work with Structured Streaming: documentation

To update SPYT to the latest version, you can use the Kubernetes operator. Specify the latest version as the Docker image (the image parameter); at the time of writing, it is ghcr.io/ytsaurus/spyt:2.0.0. An example can be viewed here.
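
For orientation, the sketch below builds a manifest of the kind you might hand to the operator and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. Only the image value comes from this post. The `apiVersion`, the `Spyt` kind, the `spec.ytsaurus.name` reference, and the resource names `myspyt` and `minisaurus` are assumptions for illustration, so compare them with the linked example before use.

```python
import json


def spyt_manifest(name, ytsaurus_name, image):
    """Build a Spyt custom resource as a plain dict (field names are assumptions)."""
    return {
        "apiVersion": "cluster.ytsaurus.tech/v1",  # assumed API group and version
        "kind": "Spyt",  # assumed resource kind
        "metadata": {"name": name},
        "spec": {
            # Assumed reference to the YTsaurus cluster resource to install SPYT into.
            "ytsaurus": {"name": ytsaurus_name},
            # The image parameter mentioned above.
            "image": image,
        },
    }


# Hypothetical resource names; the image tag is the one given in this post.
manifest = spyt_manifest("myspyt", "minisaurus", "ghcr.io/ytsaurus/spyt:2.0.0")
print(json.dumps(manifest, indent=2))
```

Keeping the image tag as a function argument means an upgrade only changes one value.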

In the near future, we are going to support Spark 4.0, improve integration with external data warehouses built on Hadoop, S3, and so on, and teach SPYT to work with GPUs.

Come try out SPYT in action and bring your bug reports to our backlog. We would also be glad to receive any other contributions to our project.
