How to try YTsaurus

This guide offers a look at YTsaurus in action and describes the process of installing and starting a cluster. You'll deploy a local YTsaurus cluster, create a table, and run a simple SELECT query. Then, you'll deal with a slightly more complex challenge and solve the classic Word Count problem with a MapReduce operation.

Note

The fastest way to familiarize yourself with the product's features is by using the Demo Stand. It provides temporary access to a demo cluster that includes all the required YTsaurus components. All you need is a web browser.

Before you start

  • We recommend using an x86_64 Linux operating environment for YTsaurus. If you're using macOS with an Apple Silicon processor, you'll need to run YTsaurus locally in x86 emulation mode. This can be achieved by using the Docker Desktop virtualization platform with Rosetta 2 enabled. However, keep in mind that YTsaurus isn't guaranteed to work in emulation mode.

  • In this guide, you'll start a YTsaurus cluster with a minimal configuration, which means it will offer no guarantees of fault tolerance. Don't use this configuration in a production environment or for performance testing. Examples of cluster configuration are available in the Administrator manual.

  • For the examples to work correctly, Python 3.8+ must be installed on your system.

Installing and starting a YTsaurus cluster

This guide offers three methods for installing a YTsaurus cluster: using Docker, Minikube, or Kind.

Regardless of the installation method, the required system components will be deployed in the process, including the master server, scheduler, YQL, Query Tracker, and others. All examples in this guide — table creation, data upload, and running MapReduce — are the same for every installation method.

  • Docker
  • Minikube
  • Kind

Installing the YTsaurus CLI

The most convenient way to interact with YTsaurus is through the command line. The YTsaurus CLI isn't installed as part of the cluster deployment process, so you need to install it on your system separately.

First, install the pip3 package manager if you don't already have it installed:

sudo apt update
sudo apt install python3-pip

Make sure that everything worked correctly:

$ pip3 --version
pip 22.0.2 from ...

Install the ytsaurus-client utility:

pip3 install --user ytsaurus-client

Add the path to $HOME/.local/bin to the PATH variable:

export PATH="$PATH:$HOME/.local/bin"
To keep this change after a system reboot, append it to your shell profile:
echo 'export PATH="$PATH:$HOME/.local/bin"' >> ~/.bashrc # This command appends the string to the end of the ~/.bashrc file
source ~/.bashrc  # Apply the changes now

Check if the YTsaurus CLI was successfully installed:

$ yt --version
Version: YT wrapper 0.13.20

For more information about working with the CLI, watch the introductory webinar (starting at 24:30).

Executing examples

In this guide, you'll create a table, write data to it, and run a simple SELECT query. The section concludes with a more complex example: running a MapReduce operation.

Set environment variables

You'll need this to access the cluster via the CLI for the examples that follow.

  • Docker
  • Minikube
  • Kind

Create a table

In YTsaurus, all data is stored in tables. Let's create one!

$ yt create table //home/input_table --attributes '{schema = [{name = id; type = int64}; {name = text; type = string}]}'
16-64ca-10191-47007b7d

The value 16-64ca-10191-47007b7d is the ID of the created Cypress node. Node IDs are useful when working with transactions in YTsaurus. You won't need these IDs in this example.

You can view the created table in the web interface. In your web browser, open the address that you received when starting the cluster, go to the Navigation tab, and click the created table.

Write data

Now write some data to the table by calling the write-table command:

echo '{ "id": 0, "text": "Hello" } { "id": 1, "text": "World!" }' | yt write-table //home/input_table --format json
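With --format json, the write-table command accepts a stream of JSON objects rather than a JSON array: the two records above are simply concatenated. As an illustration of this stream format, a minimal Python sketch (iter_json_objects is an illustrative helper, not part of the YTsaurus CLI or SDK) parsing such a stream with the standard library:

```python
import json

def iter_json_objects(stream: str):
    """Yield each top-level JSON object from a concatenated JSON stream."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(stream):
        # Skip whitespace between adjacent objects.
        while pos < len(stream) and stream[pos].isspace():
            pos += 1
        if pos >= len(stream):
            break
        obj, pos = decoder.raw_decode(stream, pos)
        yield obj

records = list(iter_json_objects('{ "id": 0, "text": "Hello" } { "id": 1, "text": "World!" }'))
print(records)  # [{'id': 0, 'text': 'Hello'}, {'id': 1, 'text': 'World!'}]
```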

Read the result

To verify that the data has indeed been written to the table, run the following command:

$ yt read-table //home/input_table --format json
{"id":0,"text":"Hello"}
{"id":1,"text":"World!"}

Another way to read a table is by running a SELECT query in the web interface. To do this, go to the Queries tab and enter the following query:

SELECT * FROM `//home/input_table`;

I get the error 'Attribute "cluster_name" is not found'

If you deployed your YTsaurus cluster via Docker, follow these steps:

  1. In the web interface, go to the Queries tab.

  2. Click the settings icon at the top right of the page. Delete your current Settings.

  3. Click Add setting and enter cluster as the setting name and primary as its value. Click the checkmark.

If you deployed your YTsaurus cluster via Minikube, please let us know about this error in the community chat.

Advanced example: running MapReduce

This section explains how to run a MapReduce operation, using a Word Count problem as an example.

  1. Prepare the data
  2. Create a table and write the data to it
  3. Run MapReduce
  4. Read the result
How MapReduce works

For a Word Count problem, the MapReduce operation is executed according to the following algorithm:

  1. The source text is split into lines, with each line written to the table as a separate record.
  2. A Map operation is performed for each record, emitting a pair of columns for each word: (<word>, 1).
  3. The output of the previous step is sorted by the first column.
  4. A Reduce operation is performed on the first column, summing the values from the second column. The resulting output is a set of pairs: (<word>, <number of mentions of the word>).
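The four steps above can be sketched in plain Python. This simulates the data flow locally and is not the word-count.py script itself, which runs the same logic as distributed jobs on the cluster:

```python
from itertools import groupby
from operator import itemgetter

def mapper(record):
    # Step 2: emit a (<word>, 1) pair for every word in the line.
    for word in record["text"].split():
        yield {"word": word, "count": 1}

def reducer(word, records):
    # Step 4: sum the counts accumulated for one word.
    yield {"word": word, "count": sum(r["count"] for r in records)}

lines = [{"lineno": 1, "text": "to be or not"},   # step 1: one record per line
         {"lineno": 2, "text": "to be"}]

mapped = [pair for rec in lines for pair in mapper(rec)]   # step 2: map
mapped.sort(key=itemgetter("word"))                        # step 3: sort by word
result = [out
          for word, group in groupby(mapped, key=itemgetter("word"))
          for out in reducer(word, list(group))]           # step 4: reduce
print(result)
# [{'word': 'be', 'count': 2}, {'word': 'not', 'count': 1},
#  {'word': 'or', 'count': 1}, {'word': 'to', 'count': 2}]
```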

1. Prepare the data

Download the source text and convert it into a tab-separated format:

curl -s https://raw.githubusercontent.com/ytsaurus/ytsaurus/refs/heads/main/yt/docs/code-examples/source/moem.txt > source.txt
awk '{gsub(/\t/, "\\t"); print "lineno="NR"\ttext="$0}' source.txt > source.tsv
About the tab-separated format
  • Table rows are separated by line breaks, \n.
  • Columns are separated by tabs, \t.
  • Column names and their corresponding contents are separated by an equals sign =.

For example, the string lineno=1\tsize=6\tvalue=foobar describes a row with columns lineno, size, and value, which contain the values 1, 6, and foobar, respectively. Tab characters are escaped.
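As an illustration of the format, here is a simplified Python sketch of encoding and decoding such a row. The helper names are hypothetical (not part of the YTsaurus SDK), and only tab characters are escaped; the real DSV escaping rules cover more characters:

```python
def encode_dsv_row(row: dict) -> str:
    # Columns joined by tabs; tab characters inside values are escaped as \t.
    return "\t".join(
        "{}={}".format(name, str(value).replace("\t", "\\t"))
        for name, value in row.items()
    )

def decode_dsv_row(line: str) -> dict:
    row = {}
    for field in line.split("\t"):
        name, _, value = field.partition("=")
        row[name] = value.replace("\\t", "\t")
    return row

print(decode_dsv_row("lineno=1\tsize=6\tvalue=foobar"))
# {'lineno': '1', 'size': '6', 'value': 'foobar'}
```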

Prepare the source code of the program that will run the MapReduce operation. Download the Python 3 script and save it locally:

curl -s https://raw.githubusercontent.com/ytsaurus/ytsaurus/refs/heads/main/yt/docs/code-examples/python/word-count.py > word-count.py

2. Create a table

Create two tables, one for the source data and another for the results of executing the MapReduce operation:

yt create table //home/mapreduce_input --attributes '{schema = [{name = lineno; type = string}; {name = text; type = string}]}'
yt create table //home/mapreduce_result --attributes '{schema = [{name = count; type = string}; {name = word; type = string}]}'

If you get 'Cannot determine backend type: either driver config or proxy url should be specified,' set the environment variable YT_PROXY.

Now write some data to the source table by calling the write-table command:

cat source.tsv | yt write-table //home/mapreduce_input --format dsv

To verify that the data has been written to the table, use the read-table command. The half-interval specified in the square brackets indicates that we want to get the first six rows of the table:

yt read-table '//home/mapreduce_input[:#6]' --format dsv
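Row selectors behave as half-open intervals over row indices, much like Python slices, with the upper bound excluded; with the lower bound omitted, [:#6] starts from row 0. In Python terms:

```python
rows = [f"row {i}" for i in range(10)]
# '//home/mapreduce_input[:#6]' corresponds to rows[0:6]: rows 0 through 5.
selected = rows[:6]
print(selected[0], selected[-1])  # row 0 row 5
```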

3. Run MapReduce

Run the MapReduce operation using the map-reduce command:

yt map-reduce --mapper "python3 word-count.py map" --reducer "python3 word-count.py reduce" --map-local-file word-count.py --reduce-local-file word-count.py --src //home/mapreduce_input --dst //home/mapreduce_result --reduce-by word --format dsv

You can track the status of a running operation in the Operations section of the web interface.


4. Read the result

Now you can read the resulting table by executing a simple SELECT query. In the web interface, go to the Queries tab and enter the following query:

SELECT * FROM `//home/mapreduce_result`
ORDER BY count
LIMIT 30;
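Note that the count column was declared with type string in the schema above, so ORDER BY count compares values as strings, which gives lexicographic rather than numeric order. A quick Python illustration of the difference:

```python
counts = ["9", "10", "2"]
print(sorted(counts))           # ['10', '2', '9'] — lexicographic (string) order
print(sorted(counts, key=int))  # ['2', '9', '10'] — numeric order
```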

Deleting a cluster

  • Docker
  • Minikube
  • Kind

Demo Stand

This is an online demonstration of the capabilities offered by YTsaurus. To get access to the demo cluster, fill out this form. After that, an email with the cluster access details will be sent to the address you specify.

The demo stand features several environments for interacting with YTsaurus:

Jupyter Notebook

The notebook provides numerous examples for working with YTsaurus, including operations for creating tables, uploading data, and using CHYT, SPYT, and YQL, as well as SDK examples. For an overview of all available examples, see About YTsaurus demo, the notebook home page.

A link to a deployed Jupyter Notebook will be included in the email.

Web interface

Here you can test out the features of the YTsaurus web interface: explore the file system, see the list of pools, and run queries in Query Tracker.

A link to the web interface of a deployed cluster will be included in the email.

For more information about using the web interface, watch the introductory webinar (starting at 15:30).

Troubleshooting

If you're having trouble getting something to work, don't hesitate to ask your questions in our community Telegram chat. We'll do our best to help you.

To share your suggestions or comments on the documentation, create an issue in the project's GitHub repository. Your feedback is always welcome; it helps us make the documentation more informative.

Cypress is a distributed file system and metadata storage. Cypress stores tables and files. To learn more, see the documentation.

Container images are collected from the Dockerfile available here.

Query Tracker is a tool for running SQL-like queries, available in the web interface under Queries. To learn more, see the documentation.

YQL is a declarative query language for data storage and processing systems, based on SQL syntax.

The master server is responsible for fault-tolerant storage of the cluster's metadata. This includes information about system users, stored objects, and the location of the data itself. To learn more, see the documentation.

The scheduler is responsible for allocating resources between operations as well as for their execution on the cluster. To learn more, see the documentation.