Configuring External Access to YTsaurus in Kubernetes

By default, a YTsaurus cluster deployed in Kubernetes is isolated from external networks. LoadBalancer or Ingress mechanisms are typically used to publish services. These work well for individual web services, providing a single entry point for clients, but they cannot efficiently handle large volumes of network traffic.

In this guide, you'll learn how to:

Solve the network isolation problem—configure the cluster so that external clients can directly connect to cluster nodes for efficient reading and writing of large data volumes.
Split load between clients—isolate resources of different projects and split traffic into "light" (metadata, UI) and "heavy" (table reading and writing).
Configure access for SPYT—set up TCP proxying for direct connections between an external Spark driver and workers inside the cluster.

Proxy overview

Users do not interact with the YTsaurus server directly; all communication goes through proxies. These are YTsaurus components that act as a unified entry point and abstract away the cluster's internal topology and inter-component communication—for example, master and data node addresses.

From a Kubernetes perspective, proxies are typically deployed as a StatefulSet consisting of several pods. Their number and allocated resources (CPU, RAM) are specified in the YTsaurus operator's specification. When the cluster starts, each proxy pod is automatically registered in Cypress (in the system directories //sys/http_proxies and //sys/rpc_proxies).

YTsaurus has two proxy types:

HTTP proxies—implement the YTsaurus HTTP API. They are used by SDKs, the web interface, and the CLI.
RPC proxies—implement a faster binary protocol (YT RPC). They're primarily needed where low request latency is required (for example, during intensive streaming operations with dynamic tables). HTTP proxies are recommended for all other scenarios.

The concept of a role

You can split proxies into functional groups using roles. Typically, two main groups are distinguished:

Control proxies—handle "light" requests (UI navigation, working with Cypress metadata). They're usually assigned the control role.
Heavy (data) proxies—handle "heavy" requests (streaming reads and writes of large tables). In Kubernetes installations, they most often operate under the default role.

This separation allows flexible cluster resource management: actively reading a huge table through data proxies won't slow down the web interface or interfere with other users browsing the Cypress tree.

Technically, a role is a string label (the @role attribute in Cypress) assigned to a proxy instance at startup. By default, all proxies in the cluster start with the default role. You can assign a role in the Ytsaurus specification.

Discovery mechanism

When initializing a client (SDK), the developer specifies the cluster's primary address (for example, yt.example.com). Usually, a balancer (Ingress or LoadBalancer) behind this address distributes requests among available proxy servers.

However, routing gigabytes of read/write traffic through a single central balancer isn't efficient. To handle large data volumes efficiently, SDKs automatically send heavy requests directly to data proxies, bypassing the central entry point. The developer doesn't need to specify dozens of addresses in the code—the SDK discovers them automatically through the built-in Discovery mechanism. It works as follows:

Before executing a heavy request, the SDK sends an HTTP GET request to /api/v4/discover_proxies at the primary balancer.
The server responds with a list of addresses (FQDNs) of active data proxies.
The SDK selects one address from the list and sends the heavy request directly to that pod.
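The three steps above can be sketched in Python. This is an illustration only, not the SDK's actual implementation: the helper names (`build_discover_url`, `parse_proxy_list`, `pick_proxy`, `discover`) are hypothetical, the plain `http://` scheme is assumed, and the response shape (a JSON list, or an object holding the list under a `proxies` key) is an assumption to verify against your cluster.

```python
import json
import random
from urllib.parse import urlencode
from urllib.request import urlopen


def build_discover_url(primary, role=None, proxy_type=None):
    """Step 1: URL of the discovery endpoint at the primary address."""
    params = {}
    if role is not None:
        params["role"] = role
    if proxy_type is not None:
        params["type"] = proxy_type
    query = "?" + urlencode(params) if params else ""
    # Assumption: the balancer is reachable over plain HTTP.
    return f"http://{primary}/api/v4/discover_proxies{query}"


def parse_proxy_list(body):
    """Step 2: extract proxy addresses from the response body.

    Assumption: the body is JSON, either a bare list of addresses or an
    object that holds the list under a "proxies" key.
    """
    data = json.loads(body)
    if isinstance(data, dict):
        data = data.get("proxies", [])
    return [str(address) for address in data]


def pick_proxy(addresses, rng=random):
    """Step 3: choose one data proxy for the heavy request."""
    if not addresses:
        raise RuntimeError("discovery returned no proxies")
    return rng.choice(addresses)


def discover(primary, role=None, timeout=5.0):
    """Run steps 1-3 against a live cluster (requires network access)."""
    with urlopen(build_discover_url(primary, role), timeout=timeout) as resp:
        return pick_proxy(parse_proxy_list(resp.read()))
```

Random selection is just one simple way to spread load; a real client may use a different selection or caching policy.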

Below is a diagram of the Discovery mechanism when calling write_table via HTTP proxies:

Diagram explanation

The client library (SDK) sends an HTTP GET request to the discover_proxies endpoint at the primary cluster address (balancer yt.example.com).
The balancer accepts the request and redirects it into the cluster to one of the available control proxy servers (Control Proxy).
The control proxy forms a list of FQDNs of active data proxies (for example, hp-0.svc.local, hp-1.svc.local) and returns it to the balancer.
The balancer returns this list to the client.
The SDK selects one specific address from the list (in our example—hp-0) and sends a data write request (write_table) directly to that pod, bypassing the central balancer.
A direct connection is established between the client and the data proxy, over which the data stream is transmitted.

Proxy roles in Discovery

When requesting discover_proxies, a client can optionally specify a role. The following logic applies:

If a role is explicitly specified (for example, role=heavy), only the addresses of proxies assigned that role are returned.
If no role is specified, proxies with the default role are requested.
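The selection rule can be modeled as a small filter. The sketch below is a simplified model and not the proxy's real code: `select_proxies`, the `registry` mapping, and the address `hp-control-0.svc.local` are hypothetical names introduced for illustration.

```python
DEFAULT_ROLE = "default"


def select_proxies(proxies, role=None):
    """Return addresses of proxies whose role matches the request.

    `proxies` maps a proxy address to its role (the @role attribute).
    A request without a role falls back to the "default" role.
    """
    wanted = role if role is not None else DEFAULT_ROLE
    return sorted(addr for addr, r in proxies.items() if r == wanted)


# Hypothetical registry: two data proxies and one control proxy.
registry = {
    "hp-0.svc.local": "default",
    "hp-1.svc.local": "default",
    "hp-control-0.svc.local": "control",
}
```

With this registry, a request without a role yields the two `default` proxies, while a request for a role that no proxy carries yields an empty list.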

Discovery in different protocols

HTTP uses a "lazy" approach. The request to discover_proxies is made just before starting a file read or write.
RPC uses a "greedy" approach. The client calls discover_proxies immediately on startup, receives a list of RPC proxy addresses, and establishes persistent TCP connections with them.
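The difference between the two approaches is mostly about when discovery runs. The sketch below contrasts them with two hypothetical classes; `discover` stands for any callable that returns a list of proxy addresses. It ignores caching and connection handling that a real client may perform, and the round-robin choice in the RPC-style class is an assumption made for the example.

```python
class LazyDiscovery:
    """HTTP-style: ask for proxies only when a heavy request is about to start."""

    def __init__(self, discover):
        self._discover = discover  # callable returning a list of addresses

    def proxy_for_heavy_request(self):
        # Discovery runs here, right before the read or write.
        return self._discover()[0]


class EagerDiscovery:
    """RPC-style: ask for proxies once, at client startup, and reuse the list."""

    def __init__(self, discover):
        # Discovery runs immediately; a real client would also open
        # persistent TCP connections to these addresses.
        self._proxies = list(discover())
        self._next = 0

    def proxy_for_request(self):
        # Assumption: simple round-robin over the discovered proxies.
        proxy = self._proxies[self._next % len(self._proxies)]
        self._next += 1
        return proxy
```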

Note

In older API versions (< v4), entry points to the Discovery service differed.

Differences in Discovery between API v3 and API v4

In API versions lower than v4:

To get a list of all HTTP proxies, clients accessed the /v3/entry endpoint.
To get RPC proxies, they accessed /v3/discover_proxies.

Starting with v4, both client types use a single universal endpoint /api/v4/discover_proxies (with the type=rpc parameter for RPC clients).
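For reference, the mapping described above can be written as a lookup. The function name `discovery_endpoint` is hypothetical, and the paths are copied from the text rather than verified against a particular release.

```python
def discovery_endpoint(api_version, client_type):
    """Return the discovery path for an API version (3 or 4) and client type."""
    if client_type not in ("http", "rpc"):
        raise ValueError(f"unknown client type: {client_type!r}")
    if api_version == 4:
        # One endpoint for both client types; RPC clients add type=rpc.
        suffix = "?type=rpc" if client_type == "rpc" else ""
        return "/api/v4/discover_proxies" + suffix
    if api_version == 3:
        # Separate entry points per client type.
        return "/v3/entry" if client_type == "http" else "/v3/discover_proxies"
    raise ValueError(f"unsupported API version: {api_version!r}")
```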

Why the access problem occurs

In a standard Kubernetes configuration, pod addresses are internal (for example, hp-0.http-proxies.default.svc.cluster.local).

When an external SDK calls discover_proxies, the cluster returns a list of internal FQDNs. The SDK, being outside the cluster perimeter, can't resolve these DNS names to IP addresses. As a result, light commands sent through the balancer succeed, but an attempt to write data fails with a network error, ranging from a DNS resolution failure to a connection failure (Temporary failure in name resolution, Connection refused, Connection timed out).
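To see which of these failures applies to a given address, you can probe it from the client machine. The helper below is a hypothetical diagnostic sketch built on the Python standard library; it is not part of the YTsaurus SDK, and the returned labels are chosen only for this example.

```python
import socket


def probe(host, port, timeout=3.0):
    """Classify what happens when a client tries to reach host:port."""
    try:
        # DNS step: fails for names that exist only inside the cluster.
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "name resolution failed"
    family, socktype, proto, _, sockaddr = infos[0]
    try:
        # TCP step: try the first resolved address only.
        with socket.socket(family, socktype, proto) as sock:
            sock.settimeout(timeout)
            sock.connect(sockaddr)
    except ConnectionRefusedError:
        return "connection refused"
    except socket.timeout:
        return "connection timed out"
    except OSError as exc:
        return f"connection failed: {exc}"
    return "reachable"
```

For example, calling `probe("hp-0.http-proxies.default.svc.cluster.local", 80)` from outside the cluster would be expected to report a name resolution failure, while the same call from a pod inside the cluster should not.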

Example: how to identify the problem

Consider a scenario: a YTsaurus cluster is deployed in Kubernetes, and you need to test access from a local machine.

For quick access to the control proxy API, a port is exposed via kubectl port-forward:

$ kubectl port-forward service/http-proxies-control-lb 8080:80
Forwarding from 127.0.0.1:8080 -> 80

Let's perform a light operation—create a table.

$ export YT_PROXY=127.0.0.1:8080
$ yt create table //home/my-table
30-56c4-10191-712a11b3

The command worked: the table was created. The Discovery mechanism wasn't involved; the request went directly to the address specified in the YT_PROXY variable.

Now let's try to write data to this table (write-table):

$ echo '{ "id": 0, "text": "Hello" }' | yt write-table //home/my-table --format json

WARNING HTTP PUT request http://hp-0.http-proxies.default.svc.cluster.local/api/v4/write_table failed with error NewConnectionError...
Failed to establish a new connection: [Errno -3] Temporary failure in name resolution

What happened:
When executing write-table, the SDK requested a list of data proxies. The cluster returned the internal pod address: hp-0.http-proxies.default.svc.cluster.local. The SDK tried to connect to this FQDN directly, but the name doesn't resolve from the local machine.

As a temporary debugging workaround, you can disable Discovery on the client side via environment variables; all traffic then goes through the port-forward:

# Via a config patch:
export YT_CONFIG_PATCHES='{proxy={enable_proxy_discovery=%false}}'
# Or via a shorter, more popular alias for the CLI:
export YT_USE_HOSTS=0

echo '{ "id": 0, "text": "Hello" }' | yt write-table //home/my-table --format json

If the write succeeds after this, the problem is indeed with Discovery routing.
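The same switch can presumably be set when using the Python SDK directly instead of the CLI. The sketch below assumes the `ytsaurus-client` package is installed and that `YtClient` accepts a partial `config` dictionary that is merged with its defaults; the helper names `debug_client_config` and `write_hello` are hypothetical, and the proxy address matches the port-forward example above.

```python
def debug_client_config():
    """Client config that mirrors the YT_CONFIG_PATCHES example above."""
    return {"proxy": {"enable_proxy_discovery": False}}


def write_hello(proxy="127.0.0.1:8080", table="//home/my-table"):
    """Write one row through the forwarded port (needs a live cluster)."""
    import yt.wrapper as yt  # provided by the ytsaurus-client package

    client = yt.YtClient(proxy=proxy, config=debug_client_config())
    client.write_table(table, [{"id": 0, "text": "Hello"}])
```

As with the environment variables, this is a debugging aid: with Discovery disabled, heavy traffic flows through the single forwarded port rather than directly to the data proxies.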