How it works

The YTsaurus architecture has three layers.

  • Storage Layer: Stores data and metadata in queues and in static and dynamic tables.
  • Compute Layer: Handles data processing and distributed computing through engines such as MapReduce, YQL, CHYT, and SPYT.
  • Access Layer: Provides a user-friendly UI, a CLI, and SDKs for Java, Python, Go, and C++.

Cypress is a distributed file system and metainformation tree that stores files, documents, static and dynamic tables, and metadata. It supports transactions and manages data chunks.
In addition to storage, Cypress can serve as a coordination service. Fault tolerance is provided by our own consensus algorithm (RSM), which is similar to Raft.
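
To make the idea of a metainformation tree more concrete, here is a minimal in-memory sketch of path-addressed nodes that carry a type and attributes. It is only a toy model: `ToyNode`, `ToyTree`, and their methods are hypothetical names rather than the YTsaurus client API, the node types and the `dynamic` attribute are used purely for illustration, and transactions, chunk management, and replication are left out.

```python
class ToyNode:
    """One node of the tree: a type, free-form attributes, and named children."""

    def __init__(self, node_type):
        self.type = node_type
        self.attributes = {}
        self.children = {}


class ToyTree:
    """Hypothetical in-memory tree; not the YTsaurus client API."""

    def __init__(self):
        self.root = ToyNode("map_node")

    @staticmethod
    def _parts(path):
        # Toy convention: absolute paths start with '//' and use '/' as separator.
        if not path.startswith("//"):
            raise ValueError("toy paths must start with '//'")
        return [part for part in path[2:].split("/") if part]

    def get(self, path):
        node = self.root
        for part in self._parts(path):
            node = node.children[part]  # KeyError if the path does not exist
        return node

    def create(self, node_type, path, attributes=None):
        *parent_parts, name = self._parts(path)
        parent = self.root
        for part in parent_parts:
            parent = parent.children[part]  # parents must already exist
        node = ToyNode(node_type)
        node.attributes.update(attributes or {})
        parent.children[name] = node
        return node

    def list_children(self, path):
        return sorted(self.get(path).children)


tree = ToyTree()
tree.create("map_node", "//home")
tree.create("table", "//home/events", {"dynamic": True})
tree.create("file", "//home/readme.txt")
print(tree.list_children("//home"))  # ['events', 'readme.txt']
```

Files, tables, and directories are all nodes of one tree here, so listing a directory and reading a table's metadata go through the same path lookup.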


Dynamic tables

Dynamic tables are built on top of the file system as a key-value store. They support:

  • distributed ACID transactions within or across tables
  • automatic data caching in memory for low read latency
  • TTL data retention rules
  • isolation between projects on different cluster nodes
  • replication between clusters


Queues in YTsaurus

Queues are a distributed, replicated message log built on dynamic tables. They support sharding and cross-cluster replication and are compatible with the Apache Kafka protocol.

Flexible table-based storage

Export messages to a static table for long-term storage, and delete them via TTL after processing

An all-in-one transaction space

Queues and dynamic tables share a single transaction space, so you can process events exactly once without complex integrations or third-party tools
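
The reason a shared transaction space gives exactly-once processing can be sketched in a few lines: if the consumer's offset and the processing results are committed atomically, a crash before the commit leaves both untouched, and a retry neither skips nor double-counts messages. The sketch below is an in-memory toy under that assumption; `ToyStore`, `transaction`, and `process_batch` are hypothetical names and not part of any YTsaurus SDK.

```python
import copy
from contextlib import contextmanager


class ToyStore:
    """Hypothetical stand-in for a queue and a table in one transaction space."""

    def __init__(self, messages):
        self.queue = list(messages)                   # ordered message log
        self.state = {"offset": 0, "totals": {}}      # consumer offset + result table

    @contextmanager
    def transaction(self):
        draft = copy.deepcopy(self.state)
        yield draft
        # Reached only if the body raised no exception: commit atomically.
        self.state = draft


def process_batch(store, batch_size, fail=False):
    """Aggregate up to batch_size messages and advance the offset in one transaction."""
    with store.transaction() as tx:
        start = tx["offset"]
        end = min(start + batch_size, len(store.queue))
        for key, amount in store.queue[start:end]:
            tx["totals"][key] = tx["totals"].get(key, 0) + amount
        tx["offset"] = end
        if fail:
            raise RuntimeError("simulated crash before commit")


store = ToyStore([("a", 1), ("b", 2), ("a", 3)])
process_batch(store, 2)
try:
    process_batch(store, 2, fail=True)   # nothing is committed
except RuntimeError:
    pass
process_batch(store, 2)                  # retry sees the same offset
print(store.state)  # {'offset': 3, 'totals': {'a': 4, 'b': 2}}
```

If the offset lived in a separate system from the results, the failed attempt could commit one without the other, which is the situation that usually calls for extra deduplication logic.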

Scalability and fault tolerance

Sharding and distributed storage allow you to scale the system horizontally without a single point of failure

Data processing: Scheduler

The YTsaurus scheduler manages cluster resources: CPU cores, RAM, and GPUs. It allocates resources to jobs following the Dominant Resource Fairness approach. It supports a hierarchy of compute pools and different types of guarantees: point-in-time and integral guarantees achieve fairness on different time scales. Operations managed by the scheduler can be written as MapReduce programs or in the SQL dialect provided by YQL. Tables in YTsaurus are schematized, which allows simple and concise queries.
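
As an illustration of the Dominant Resource Fairness idea, the sketch below implements the textbook version for a flat set of pools: repeatedly give one more job to the pool whose dominant share (its largest fraction of any single resource) is currently lowest. This is not the YTsaurus scheduler itself, which also handles pool hierarchies and guarantees; `Pool`, `dominant_share`, and `drf_allocate` are hypothetical names, and the sketch assumes each pool runs identical jobs with a fixed per-job demand.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    name: str
    demand: dict                                   # resources per job, e.g. {"cpu": 1, "ram": 4}
    allocated: dict = field(default_factory=dict)
    jobs: int = 0


def dominant_share(pool, capacity):
    """Largest fraction of any single cluster resource held by the pool."""
    return max(pool.allocated.get(r, 0) / capacity[r] for r in capacity)


def drf_allocate(capacity, pools):
    """Hand out jobs one at a time to the pool with the lowest dominant share."""
    for pool in pools:
        if not any(pool.demand.get(r, 0) > 0 for r in capacity):
            raise ValueError(f"pool {pool.name!r} must demand some resource")
    used = {r: 0 for r in capacity}
    active = list(pools)
    while active:
        pool = min(active, key=lambda p: dominant_share(p, capacity))
        fits = all(used[r] + pool.demand.get(r, 0) <= capacity[r] for r in capacity)
        if not fits:
            active.remove(pool)                    # this pool cannot grow any further
            continue
        for r in capacity:
            amount = pool.demand.get(r, 0)
            used[r] += amount
            pool.allocated[r] = pool.allocated.get(r, 0) + amount
        pool.jobs += 1
    return {p.name: p.jobs for p in pools}


pools = [Pool("a", {"cpu": 1, "ram": 4}), Pool("b", {"cpu": 3, "ram": 1})]
print(drf_allocate({"cpu": 9, "ram": 18}, pools))  # {'a': 3, 'b': 2}
```

In this run pool `a` is RAM-heavy and pool `b` is CPU-heavy; both end up with a dominant share of 2/3 (12 of 18 GB for `a`, 6 of 9 cores for `b`), which is the equalization that DRF aims for.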

Figure: the scheduler page in the YTsaurus web interface.

Operations on the cluster can be started with YQL, a dialect of SQL with UDFs, window functions, and more. You can use it to build complex data processing pipelines that store subqueries in variables and create chains of dependent queries.

CHYT allows you to launch ClickHouse® clusters to work with data in YTsaurus. It can operate as a data source for visualization and BI tools, making it excellent for ad hoc queries.

SPYT clusters run Apache Spark inside YTsaurus Vanilla operations. They are great for building ETL pipelines.

What is YTsaurus?

An introduction to the platform.