How it works

The YTsaurus architecture has three layers.

  • The lowest layer is a distributed file system and metadata storage (Cypress).
  • The middle layer is the Scheduler for distributed computations, which supports the MapReduce model.
  • The outer layer consists of high-level compute engines: YQL, CHYT, and SPYT.
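
For readers new to the MapReduce model mentioned above, here is a minimal single-process sketch in Python. It is not the YTsaurus SDK: `local_map_reduce`, `mapper`, and `reducer` are hypothetical names chosen for illustration, and the shuffle step that a cluster performs across nodes is simulated here with an in-memory dictionary.

```python
from collections import defaultdict
from itertools import chain


def mapper(row):
    # Map step: emit one {"word", "count"} row per word in the input row.
    for word in row["text"].split():
        yield {"word": word.lower(), "count": 1}


def reducer(key, rows):
    # Reduce step: collapse all rows that share a key into one total.
    yield {"word": key, "count": sum(r["count"] for r in rows)}


def local_map_reduce(rows, mapper, reducer, reduce_by):
    # Hypothetical local stand-in for a distributed MapReduce operation.
    mapped = chain.from_iterable(mapper(r) for r in rows)

    # Shuffle: group mapped rows by the reduce key.
    groups = defaultdict(list)
    for r in mapped:
        groups[r[reduce_by]].append(r)

    # Reduce each group, visiting keys in sorted order.
    out = []
    for key in sorted(groups):
        out.extend(reducer(key, groups[key]))
    return out


input_rows = [{"text": "to be or not to be"}, {"text": "To do"}]
word_counts = local_map_reduce(input_rows, mapper, reducer, reduce_by="word")
print(word_counts)
```

On a real cluster the map and reduce steps run as many parallel jobs over table chunks; the sketch only shows the data flow between the phases.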

The system supports SDKs for C++, Java, Golang, and Python.

YTsaurus architecture image

Data storage: Cypress

Cypress is a distributed file system and metainformation tree that stores files, documents, static and dynamic tables, and metadata. It supports transactions and manages data chunks. In addition to storage, Cypress can serve as a coordination service. Fault tolerance is provided by our own consensus algorithm (RSM), which is similar to Raft.
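
To make the idea of a transactional metainformation tree more concrete, here is a toy in-memory model in Python. It is an illustration only: `ToyTree` and all of its methods are hypothetical, and the snapshot-and-restore transaction is an assumption made for brevity, not how Cypress implements transactions or replication. The only detail borrowed from YTsaurus is the `//`-prefixed path style.

```python
import copy
from contextlib import contextmanager


class ToyTree:
    """Hypothetical in-memory stand-in for a metadata tree."""

    def __init__(self):
        self.root = self._new_node("map_node")

    @staticmethod
    def _new_node(node_type, attributes=None):
        return {"type": node_type, "attributes": dict(attributes or {}), "children": {}}

    @staticmethod
    def _parts(path):
        if not path.startswith("//"):
            raise ValueError("paths in this sketch start with '//'")
        return [p for p in path[2:].split("/") if p]

    def _node(self, parts):
        node = self.root
        for name in parts:
            node = node["children"][name]  # KeyError if the node is missing
        return node

    def create(self, node_type, path, attributes=None):
        *parent, name = self._parts(path)
        self._node(parent)["children"][name] = self._new_node(node_type, attributes)

    def exists(self, path):
        try:
            self._node(self._parts(path))
            return True
        except KeyError:
            return False

    def list_children(self, path):
        return sorted(self._node(self._parts(path))["children"])

    def get_attribute(self, path, name):
        return self._node(self._parts(path))["attributes"][name]

    @contextmanager
    def transaction(self):
        # Toy atomicity: snapshot the tree and restore it if the block fails.
        snapshot = copy.deepcopy(self.root)
        try:
            yield
        except Exception:
            self.root = snapshot
            raise


tree = ToyTree()
tree.create("map_node", "//home")
tree.create("table", "//home/events", attributes={"owner": "analytics"})

try:
    with tree.transaction():
        tree.create("document", "//home/config")
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

listing = tree.list_children("//home")
print(listing)  # ['events']: the failed transaction left no trace
```

The point of the sketch is the behavior: changes made inside a transaction either all become visible or none do.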

YTsaurus web interface image of navigation page

Dynamic tables

Dynamic tables are built on top of the file system as a key-value store. They support:

  • Distributed ACID transactions within or across tables.
  • Automatic data caching in memory for low read latency.
  • TTL data retention rules.
  • Isolation between projects on different cluster nodes.
  • Replication between clusters.
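
Two of the features above, cross-table transactions and TTL retention, can be sketched with a toy key-value model in Python. Everything here is hypothetical (`ToyTable`, `ToyTransaction`, the explicit `now` argument); the sketch assumes that writes are buffered until commit and that a row expires once its age reaches the TTL. It does not use the YTsaurus API and says nothing about how the real storage engine works.

```python
class ToyTable:
    """Hypothetical key-value table with an optional TTL, for illustration."""

    def __init__(self, ttl_seconds=None):
        self.ttl_seconds = ttl_seconds
        self._rows = {}  # key -> (write_time, value)

    def _put(self, key, value, now):
        self._rows[key] = (now, value)

    def lookup(self, key, now):
        entry = self._rows.get(key)
        if entry is None:
            return None
        written, value = entry
        # Rows whose age has reached the TTL are treated as gone.
        if self.ttl_seconds is not None and now - written >= self.ttl_seconds:
            return None
        return value


class ToyTransaction:
    """Buffers writes to several tables and applies them together on commit."""

    def __init__(self):
        self._writes = []

    def insert(self, table, key, value):
        self._writes.append((table, key, value))

    def commit(self, now):
        for table, key, value in self._writes:
            table._put(key, value, now)
        self._writes.clear()


orders = ToyTable()                   # no retention rule
sessions = ToyTable(ttl_seconds=60)   # rows live for 60 seconds

tx = ToyTransaction()
tx.insert(orders, "order-1", {"total": 40})
tx.insert(sessions, "user-7", {"cart": ["order-1"]})

before_commit = orders.lookup("order-1", now=0)  # nothing visible yet
tx.commit(now=0)

fresh = sessions.lookup("user-7", now=30)     # still within the TTL
expired = sessions.lookup("user-7", now=60)   # TTL reached
kept = orders.lookup("order-1", now=10_000)   # no TTL on this table
```

Passing `now` explicitly keeps the example deterministic; a real system would use its own transaction timestamps.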

Data processing: Scheduler

The YTsaurus scheduler manages the resources of the cluster: CPU cores, RAM, and GPUs. It allocates resources to jobs following the Dominant Resource Fairness approach. It supports a hierarchy of compute pools and different types of guarantees. Its point-in-time and integral guarantees achieve fairness on different time scales. Operations managed by the Scheduler can be written in MapReduce or in YQL, a dialect of SQL. Tables in YTsaurus are schematized, allowing simple and concise queries.
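
The Dominant Resource Fairness idea can be illustrated with a short Python sketch of the textbook algorithm: repeatedly give one more job to the pool whose dominant share (its largest fraction of any single resource) is currently lowest. This is not the YTsaurus scheduler, which also handles pool hierarchies and guarantees; `drf_allocate`, the pool names, and the capacities are hypothetical values chosen for the example.

```python
def drf_allocate(capacity, demands):
    """Textbook DRF by progressive filling.

    capacity: {resource: total amount in the cluster}
    demands:  {pool: {resource: amount needed by one job}}
    Returns the number of jobs granted to each pool.
    """
    used = {r: 0 for r in capacity}
    jobs = {pool: 0 for pool in demands}
    active = set(demands)

    def dominant_share(pool):
        # Largest fraction of any one resource currently held by the pool.
        return max(
            jobs[pool] * need / capacity[r] for r, need in demands[pool].items()
        )

    while active:
        # Pick the pool with the lowest dominant share (ties broken by name).
        pool = min(sorted(active), key=dominant_share)
        need = demands[pool]
        if all(used[r] + need[r] <= capacity[r] for r in need):
            for r in need:
                used[r] += need[r]
            jobs[pool] += 1
        else:
            # The next job of this pool no longer fits; stop considering it.
            active.discard(pool)
    return jobs


capacity = {"cpu": 9, "ram_gb": 18}
demands = {
    "analytics": {"cpu": 1, "ram_gb": 4},  # memory-heavy jobs
    "ml": {"cpu": 3, "ram_gb": 1},         # CPU-heavy jobs
}
allocation = drf_allocate(capacity, demands)
print(allocation)  # {'analytics': 3, 'ml': 2}
```

In this example both pools end up with a dominant share of two thirds: "analytics" holds 12 of 18 GB of RAM and "ml" holds 6 of 9 CPU cores.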

YTsaurus web interface image of scheduler page

Operations on the cluster can be started with YQL, a dialect of SQL with UDFs, window functions, and more. You can use it to build complex data processing pipelines that store subqueries in variables and create chains of dependent queries.

CHYT allows you to launch ClickHouse® clusters to work with data in YTsaurus. It can operate as a data source for visualization and BI tools, making it excellent for ad hoc queries.

SPYT clusters run Apache Spark inside YTsaurus Vanilla operations. They are great for building ETL pipelines.

What is YTsaurus?

An introduction to the platform.