Cluster monitoring setup
YTsaurus allows you to export cluster object metrics to various monitoring systems.
Prometheus collects the metrics. You can view them in Grafana or in the built-in dashboards of the YTsaurus UI.
The dedicated Helm chart monitoring-chart sets up dashboards and alerts automatically.
Installing and configuring Prometheus
Prometheus Operator is used to collect metrics. YTsaurus components and Odin are automatically labeled for metric collection. The Odin Helm chart automatically creates a ServiceMonitor resource for its metrics during installation. To collect metrics from cluster components, we will create a separate ServiceMonitor manually.
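To illustrate how label-based selection works: Prometheus picks up only those ServiceMonitor resources whose labels satisfy `serviceMonitorSelector`, and a ServiceMonitor selects Services through its own `selector`. The following is a minimal Python sketch of `matchLabels` semantics (every listed key must be present with an equal value). The helper `matches_labels` and the sample label sets are hypothetical; only the `yt_metrics: "true"` label comes from the manifest in this guide.

```python
def matches_labels(match_labels: dict[str, str], labels: dict[str, str]) -> bool:
    """Return True if every key/value pair in match_labels is present in labels."""
    return all(labels.get(key) == value for key, value in match_labels.items())


# Selector taken from the manifest in this guide.
selector = {"yt_metrics": "true"}

# Hypothetical label sets, for illustration only.
labeled_service = {"yt_metrics": "true", "app": "yt-master"}
unlabeled_service = {"app": "some-other-service"}

print(matches_labels(selector, labeled_service))    # True
print(matches_labels(selector, unlabeled_service))  # False
```

An object that lacks the label is simply ignored, which is a common reason for a target not appearing in Prometheus.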
-
Install the Prometheus operator according to the instructions.
-
Make sure the operator pod is in the `Running` state:
kubectl get pods -l app.kubernetes.io/name=prometheus-operator
-
Create a `prometheus.yaml` file:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  serviceAccountName: prometheus
  resources:
    requests:
      memory: 400Mi
  enableAdminAPI: true
  storage:
    volumeClaimTemplate:
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
  serviceMonitorSelector:
    matchLabels:
      yt_metrics: "true"
  additionalArgs:
    - name: log.level
      value: debug
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups: [""]
    resources:
      - services
      - endpoints
      - pods
      - namespaces
    verbs: ["get", "list", "watch"]
  - apiGroups:
      - "discovery.k8s.io"
    resources:
      - endpointslices
    verbs:
      - "get"
      - "list"
      - "watch"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: default
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ytsaurus-metrics
  labels:
    yt_metrics: "true"
spec:
  namespaceSelector:
    any: true
  selector:
    matchLabels:
      yt_metrics: "true"
  endpoints:
    - port: ytsaurus-metrics
      path: /solomon/all
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_label_ytsaurus_tech_cluster_name]
          targetLabel: cluster
      metricRelabelings:
        - targetLabel: service
          sourceLabels:
            - service
          regex: (.*)-monitoring
          replacement: ${1}

If necessary, you can modify the ServiceMonitor based on your requirements.
For `ClusterRoleBinding`, in the `subjects[0].namespace` field, specify the namespace in which you plan to deploy Prometheus.
-
Apply the `prometheus.yaml` file:
kubectl -n <namespace> apply -f prometheus.yaml
-
Make sure the Prometheus pod is in the `Running` state:
kubectl -n <namespace> get pods -l app.kubernetes.io/name=prometheus
-
Make sure the Prometheus service is created:
kubectl -n <namespace> get svc -l managed-by=prometheus-operator
-
Execute a simple query and see which pods metrics are collected from:
Open access to the Prometheus service:
kubectl -n <namespace> port-forward service/prometheus-operated 9090:9090

Via Prometheus UI
If possible, open the Prometheus UI: http://localhost:9090. If not possible, use the `curl` approach.
In the `Query` section, execute a simple query:
yt_accounts_chunk_count{account="sys"}
We see the number of chunks for the "sys" account:
![](https://yastatic.net/s3/doc-binary/src/dev/yt/ytsaurus/en/images/prometheus_query.png)
Fig. 1. Result of querying the number of chunks for the "sys" account in the Prometheus UI.
It is important to make sure that `cluster` is set in the metrics.
In the `Status` -> `Target health` section, you can find a list of all monitored components.

Via `curl`
Execute a simple PromQL query:
curl 'http://localhost:9090/api/v1/query?query=yt_accounts_chunk_count\{account="sys"\}' | jq
We see the number of chunks for the "sys" account:

{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "yt_accounts_chunk_count",
          "account": "sys",
          "cluster": "ytsaurus",
          "container": "ytserver",
          "endpoint": "ytsaurus-metrics",
          "instance": "10.244.0.178:10010",
          "job": "yt-master-monitoring",
          "namespace": "ytsaurus-dev",
          "pod": "ms-0",
          "service": "yt-master"
        },
        "value": [
          1766656488.985,
          "605"
        ]
      }
    ]
  }
}

Also, execute a request to get a list of pods from which metrics are collected:

curl 'http://localhost:9090/api/v1/targets?state=active' | jq '
{
  target_count: (.data.activeTargets | length),
  targets: [
    .data.activeTargets[] | {
      pod: .labels.pod,
      namespace: .labels.namespace,
      job: .labels.job,
      health: .health,
      lastError: .lastError,
      scrapeUrl: .scrapeUrl,
      scrapePool: .scrapePool
    }
  ]
}'

Example of expected result:

{
  "target_count": 16,
  "targets": [
    {
      "pod": "end-0",
      "namespace": "ytsaurus-dev",
      "job": "yt-exec-node-monitoring",
      "health": "up",
      "lastError": "",
      "scrapeUrl": "http://10.244.0.200:10029/solomon/all",
      "scrapePool": "serviceMonitor/default/ytsaurus-metrics/0"
    },
    ...
  ]
}
-
If you have Odin installed, check if its metrics are being collected:
Collection of qualitative metrics from Odin is carried out through a separate `ServiceMonitor` created by the Odin chart itself.

Via Prometheus UI
In the `Target health` section, it will be displayed like this:
![](https://yastatic.net/s3/doc-binary/src/dev/yt/ytsaurus/en/images/odin_servicemonitor.png)
Fig. 2. Example of Odin service display in Prometheus.

Via `curl`
Execute a request to get a list of pods containing `odin` in the name, from which metrics are collected:

curl 'http://localhost:9090/api/v1/targets?state=active' | jq '
.data.activeTargets
| map(select(.labels.pod | contains("odin")))
| {
    targets: map({
      pod: .labels.pod,
      namespace: .labels.namespace,
      job: .labels.job,
      health: .health,
      lastError: .lastError,
      scrapeUrl: .scrapeUrl,
      scrapePool: .scrapePool
    })
  }'

Example of expected result:

{
  "targets": [
    {
      "pod": "odin-odin-chart-web-6f8f5cbb7f-n5slb",
      "namespace": "default",
      "job": "odin-odin-chart-web-monitoring",
      "health": "up",
      "lastError": "",
      "scrapeUrl": "http://10.244.0.33:9002/prometheus",
      "scrapePool": "serviceMonitor/default/odin-odin-chart-metrics/0"
    }
  ]
}

If it is not displayed, check for the presence of `ServiceMonitor` in the same namespace as Prometheus:
kubectl -n <namespace> get servicemonitor -l app.kubernetes.io/name=odin-chart
If it is missing, you need to enable `ServiceMonitor` creation in the chart settings.
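The target checks from the steps above can also be scripted. Below is a minimal Python sketch that works on an already fetched response of `/api/v1/targets?state=active` and lists the targets that are not healthy. The helper `summarize_targets` is hypothetical, and the sample response is trimmed to the fields used here, with values borrowed from the examples in this guide.

```python
import json


def summarize_targets(response: dict, pod_substring: str = "") -> dict:
    """Summarize active targets from a /api/v1/targets response.

    Keeps only targets whose pod label contains pod_substring
    (an empty string keeps all of them).
    """
    targets = [
        {
            "pod": target["labels"].get("pod", ""),
            "job": target["labels"].get("job", ""),
            "health": target["health"],
            "lastError": target.get("lastError", ""),
        }
        for target in response["data"]["activeTargets"]
        if pod_substring in target["labels"].get("pod", "")
    ]
    unhealthy = [t["pod"] for t in targets if t["health"] != "up"]
    return {"target_count": len(targets), "unhealthy": unhealthy, "targets": targets}


# Trimmed sample response; the values follow the examples in this guide.
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"pod": "end-0", "job": "yt-exec-node-monitoring"},
                "health": "up",
                "lastError": "",
            },
            {
                "labels": {
                    "pod": "odin-odin-chart-web-6f8f5cbb7f-n5slb",
                    "job": "odin-odin-chart-web-monitoring",
                },
                "health": "up",
                "lastError": "",
            },
        ]
    },
}

# All targets, then only the Odin ones.
print(json.dumps(summarize_targets(sample), indent=2))
print(json.dumps(summarize_targets(sample, "odin"), indent=2))
```

A non-empty `unhealthy` list is the signal to look at `lastError` for the affected pods.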
Done! Prometheus is installed and configured to collect qualitative and quantitative metrics from Odin and YTsaurus components.
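The `metricRelabelings` rule in the ServiceMonitor explains why the example output shows `"service": "yt-master"` next to `"job": "yt-master-monitoring"`: the rule strips the `-monitoring` suffix from the `service` label. The following Python sketch emulates that single rule, assuming the default `replace` action. The function `relabel_service` is hypothetical and covers only this one case, not Prometheus relabeling in general.

```python
import re


def relabel_service(value: str, pattern: str = r"(.*)-monitoring") -> str:
    """Emulate a 'replace' relabeling whose replacement is ${1}.

    Prometheus anchors relabeling regexes at both ends, hence re.fullmatch.
    If the regex does not match, the label value is left unchanged.
    """
    match = re.fullmatch(pattern, value)
    return match.group(1) if match else value


print(relabel_service("yt-master-monitoring"))  # yt-master
print(relabel_service("yt-master"))             # yt-master (no match, unchanged)
```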
Installing and configuring Grafana
-
Create a `grafana.yaml` file:

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-pvc
  labels:
    app: grafana
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Secret
metadata:
  name: grafana-secret
  labels:
    app: grafana
stringData:
  admin-user: admin
  admin-password: password
type: Opaque
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  labels:
    app: grafana
data:
  prometheus.yaml: |-
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        url: http://prometheus-operated.<namespace>.svc.cluster.local:9090
        access: proxy
        isDefault: true
        editable: true
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  labels:
    app: grafana
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      securityContext:
        fsGroup: 472
      containers:
        - name: grafana
          image: grafana/grafana:12.1.4
          ports:
            - containerPort: 3000
              name: http
          env:
            - name: GF_SECURITY_ADMIN_USER
              valueFrom:
                secretKeyRef:
                  name: grafana-secret
                  key: admin-user
            - name: GF_SECURITY_ADMIN_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: grafana-secret
                  key: admin-password
          volumeMounts:
            - mountPath: /var/lib/grafana
              name: grafana-storage
            - mountPath: /etc/grafana/provisioning/datasources
              name: grafana-datasources
              readOnly: true
          resources:
            requests:
              cpu: 250m
              memory: 750Mi
            limits:
              cpu: 250m
              memory: 750Mi
      volumes:
        - name: grafana-storage
          persistentVolumeClaim:
            claimName: grafana-pvc
        - name: grafana-datasources
          configMap:
            name: grafana-datasources
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  labels:
    app: grafana
spec:
  type: ClusterIP
  ports:
    - port: 3000
      targetPort: http
  selector:
    app: grafana

It is worth specifying a secure password in the Secret and/or creating it via `kubectl create secret` instead of `apply`.
In the `ConfigMap`, in the `url` field, replace `<namespace>` with the one you are using.
-
Apply the `grafana.yaml` file:
kubectl -n <namespace> apply -f grafana.yaml
-
Make sure the pod and service for Grafana are running:
kubectl -n <namespace> get all -l app=grafana
-
Go to the Grafana interface, execute a simple query and create a service account:
Open access to the UI:
kubectl -n <namespace> port-forward service/grafana 3000:3000
Go to the UI: http://localhost:3000.
In the collapsible menu on the left, go to the `Connections` -> `Data sources` section.
If the `Prometheus` datasource already exists, open it and click `Save & test` at the very bottom. If the response is "Successfully queried the Prometheus API.", then Grafana has successfully connected to Prometheus.
If an error occurs, check the specified "Prometheus server URL", then update the ConfigMap from the previous step so that the URL and other parameters in it are correct.
Also save the UID of this datasource.
How to get the datasource UID?
Go to the page with the UID:
http://localhost:3000/connections/datasources/edit/prometheus
The last part of the URL, in this case `prometheus`, is the UID we need.
The dashboard generator interacts with Grafana using a service account. You can get the token in `Administration` -> `Users and access` -> `Service accounts`. The service account role must be at least “Editor”. Save the service account token to an environment variable (you’ll need it in the Secret creation step):
export GRAFANA_TOKEN="<your-grafana-token>" # example: glsa_bk1LYYY
-
To let your cluster users access the Grafana UI, configure network access to it:
Warning
Grafana dashboards, unlike dashboards in the YTsaurus UI, don’t check access rights to cluster objects. Configure access rights and accounts on the Grafana side.
If you’ve restricted the Grafana user group, you can hide the Grafana button in the YTsaurus UI for everyone else to avoid cluttering the interface. How to configure button visibility by ACL is described in the Displaying the Grafana navigation button section.
-
Via Ingress / LoadBalancer:
Configure public access to the `grafana` service using your cluster tools (for example, assign the `https://grafana.ytsaurus.tech` domain).
-
Locally:
If you’re just testing the system locally, the public address will be the same as for port forwarding: http://localhost:3000/. Port forwarding must be active while you’re using the UI.
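As a small illustration of the datasource UID step above: the UID is the last path segment of the datasource edit page URL. A minimal Python sketch, assuming the URL has the form shown in this guide; the helper name `datasource_uid_from_url` is hypothetical.

```python
from urllib.parse import urlparse


def datasource_uid_from_url(edit_url: str) -> str:
    """Return the last path segment of a Grafana datasource edit URL."""
    path = urlparse(edit_url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


url = "http://localhost:3000/connections/datasources/edit/prometheus"
print(datasource_uid_from_url(url))  # prometheus
```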
-
Configuring the YTsaurus interface
For monitoring and Grafana integration to work properly, you need to pass the required addresses to the YTsaurus UI installed via the Helm chart. When installing the monitoring chart, these variables will be automatically checked for existence.
Specify the PROMETHEUS_BASE_URL and GRAFANA_BASE_URL variables in the ui.env section of your values.yaml file for the YTsaurus UI Helm chart:
ui:
  env:
    # Internal Prometheus address that the UI will use
    - name: PROMETHEUS_BASE_URL
      value: "http://prometheus-operated.<namespace>.svc.cluster.local:9090/"
    # Public Grafana address for navigation from the UI
    - name: GRAFANA_BASE_URL
      value: "https://grafana.ytsaurus.tech" # or http://localhost:3000
Update the YTsaurus UI Helm chart settings.
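The two addresses follow a simple pattern: the Prometheus URL is an in-cluster service address, while the Grafana URL is whatever public address you configured. The Python sketch below builds both entries for a given namespace. It assumes the default `cluster.local` cluster domain; the helper names and the `ytsaurus-dev` namespace are hypothetical.

```python
def internal_service_url(service: str, namespace: str, port: int) -> str:
    """Build http://<service>.<namespace>.svc.cluster.local:<port> (default cluster domain assumed)."""
    return f"http://{service}.{namespace}.svc.cluster.local:{port}"


def ui_env(namespace: str, grafana_public_url: str) -> list[dict[str, str]]:
    """Return the two entries expected in the ui.env section of values.yaml."""
    return [
        {
            "name": "PROMETHEUS_BASE_URL",
            "value": internal_service_url("prometheus-operated", namespace, 9090) + "/",
        },
        {"name": "GRAFANA_BASE_URL", "value": grafana_public_url},
    ]


# "ytsaurus-dev" is a hypothetical namespace used only for illustration.
for entry in ui_env("ytsaurus-dev", "https://grafana.ytsaurus.tech"):
    print(f'{entry["name"]}={entry["value"]}')
```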
Install dashboards and alerts (monitoring-chart)
Use the monitoring-chart Helm chart to automatically load pre‑built dashboards into Cypress and Grafana, and to create standard alerts.
Step 1: Prepare a user and grant permissions
Create a robot user robot-monitoring in YTsaurus:
yt create user --attr "{name=robot-monitoring}"
Create a directory in Cypress to store dashboards (by default, //sys/interface_monitoring). Then grant the robot-monitoring user the read, write, and remove permissions on this directory and the use permission on the account that owns it (by default, sys):
Note
If you are using UI version 3.12.2 or earlier, create interface-monitoring instead of interface_monitoring.
yt create map_node //sys/interface_monitoring --ignore-existing
yt set //sys/interface_monitoring/@acl/end '{action=allow; subjects=[robot-monitoring;]; permissions=[read;write;remove;];}'
yt set //sys/accounts/sys/@acl/end '{action=allow; subjects=[robot-monitoring]; permissions=[use]}'
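The choice of directory name described in the note above depends only on the UI version. A minimal Python sketch of that rule, assuming a plain `MAJOR.MINOR.PATCH` version string; the helper `monitoring_directory` is hypothetical.

```python
def monitoring_directory(ui_version: str) -> str:
    """Pick the Cypress directory for a given YTsaurus UI version.

    Versions up to and including 3.12.2 use a hyphen, later ones an underscore.
    """
    # Compare numerically, so that 3.9.0 sorts before 3.12.2.
    version = tuple(int(part) for part in ui_version.split("."))
    if version <= (3, 12, 2):
        return "//sys/interface-monitoring"
    return "//sys/interface_monitoring"


print(monitoring_directory("3.12.2"))  # //sys/interface-monitoring
print(monitoring_directory("3.13.0"))  # //sys/interface_monitoring
```

Comparing integer tuples rather than strings matters here: as strings, "3.9.0" would sort after "3.12.2".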
Step 2: Create a Kubernetes Secret with tokens
The chart uses a Kubernetes Job to load dashboards. It needs tokens to authenticate with Grafana and YTsaurus.
Create a Secret, inserting the YTsaurus token and the Grafana token (saved in the GRAFANA_TOKEN variable during the Grafana setup step):
kubectl create secret generic ytsaurus-monitoring \
  --from-literal=YT_TOKEN="$(yt issue-token robot-monitoring)" \
  --from-literal=GRAFANA_TOKEN="$GRAFANA_TOKEN" \
  -n <namespace>
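For reference, Kubernetes stores the values of such a Secret base64-encoded in its `data` field; this is an encoding, not encryption. The Python sketch below builds a roughly equivalent manifest as a dictionary. The helper `secret_manifest` and the token values are hypothetical placeholders.

```python
import base64


def secret_manifest(name: str, literals: dict[str, str]) -> dict:
    """Build an Opaque Secret manifest with base64-encoded values."""
    data = {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in literals.items()
    }
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": data,
    }


# Placeholder tokens, for illustration only.
manifest = secret_manifest(
    "ytsaurus-monitoring",
    {"YT_TOKEN": "example-yt-token", "GRAFANA_TOKEN": "glsa_example"},
)

# Decoding a value returns the original token.
print(base64.b64decode(manifest["data"]["GRAFANA_TOKEN"]).decode("utf-8"))  # glsa_example
```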
Step 3: Prepare values.yaml for monitoring‑chart
Create a values.yaml file with the following settings. Enter your data:
cluster:
  # Internal HTTP proxy address of the cluster for loading dashboards
  proxy: "http://http-proxies.<namespace>.svc.cluster.local"
  name: "ytsaurus"
dashboards:
  grafana:
    # Internal address for loading dashboards
    url: "http://grafana.<namespace>.svc.cluster.local:3000"
    datasource:
      # UID obtained during Grafana setup
      uid: <your-uid, for example, PBFA97CFB590B2093>
ui:
  chart:
    name: "ytsaurus-ui"
    namespace: "default"
  # These addresses are used to verify UI configuration correctness
  prometheusInternalUrl: "http://prometheus-operated.<namespace>.svc.cluster.local:9090/"
  grafanaPublicUrl: "https://grafana.ytsaurus.tech"
You can view the full list of parameters in values.yaml.
Note
If you use the //sys/interface-monitoring directory (for UI versions 3.12.2 and earlier), you must additionally specify its path in values.yaml:
dashboards:
  cypress:
    path: "//sys/interface-monitoring"
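Before installing the chart, it can be useful to confirm that no angle-bracket placeholders such as `<namespace>` remain in your values.yaml. A minimal Python sketch of such a check; the helper `find_placeholders` and the sample text are hypothetical.

```python
import re

# Matches angle-bracket placeholders such as <namespace>.
PLACEHOLDER = re.compile(r"<[^<>\n]+>")


def find_placeholders(text: str) -> list[str]:
    """Return the distinct placeholders still present in the text, sorted."""
    return sorted(set(PLACEHOLDER.findall(text)))


# Sample fragment, for illustration only.
sample_values = """\
cluster:
  proxy: "http://http-proxies.<namespace>.svc.cluster.local"
  name: "ytsaurus"
"""

print(find_placeholders(sample_values))  # ['<namespace>']
```

An empty list means every placeholder has been replaced.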
Step 4: Install the chart
Install the monitoring chart:
helm install ytsaurus-monitoring oci://ghcr.io/ytsaurus/monitoring-chart \
  --version 0.0.1 \
  -f values.yaml \
  -n <namespace>
During installation, the chart will automatically create PrometheusRule resources with alerts and run a Job that loads all dashboards into Grafana and the YTsaurus UI. After that, you can see monitoring tabs in the cluster interface.
Note
Some dashboards require access permissions to the objects being viewed. For example, the master-accounts dashboard requires the use permission on the requested account.
Configure alerts
The monitoring-chart comes with a ready‑made set of alerts (PrometheusRule).
You can use the default values or override them selectively in values.yaml. View the full list of alerts in values.yaml.
For example, to disable a specific rule, change how long a condition must hold before a rule fires (for), or disable an entire group of alerts:
alerts:
  groups:
    master:
      rules:
        # Disable a specific rule
        MasterAutomatonThreadOverload:
          enabled: false
        # Override the trigger time (for)
        MediumAlmostOutOfSpace:
          for: 30m
    # Disable the entire group of Controller Agent alerts
    controllerAgent:
      enabled: false
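Selective overrides work because Helm merges nested maps from your values.yaml into the chart defaults key by key, so anything you do not mention keeps its default. The Python sketch below imitates that merge. The `defaults` dictionary is a made-up stand-in, not the chart's real defaults, and `merge_values` is a hypothetical helper.

```python
import copy


def merge_values(defaults: dict, overrides: dict) -> dict:
    """Recursively merge overrides into defaults; override values win."""
    result = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge_values(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


# Hypothetical defaults, for illustration only.
defaults = {
    "alerts": {
        "groups": {
            "master": {
                "enabled": True,
                "rules": {"MediumAlmostOutOfSpace": {"enabled": True, "for": "10m"}},
            }
        }
    }
}
overrides = {
    "alerts": {"groups": {"master": {"rules": {"MediumAlmostOutOfSpace": {"for": "30m"}}}}}
}

merged = merge_values(defaults, overrides)
print(merged["alerts"]["groups"]["master"]["rules"]["MediumAlmostOutOfSpace"])
```

Only `for` changes in the merged result; the rule's `enabled` flag and the group settings are kept from the defaults.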
Display the Grafana navigation button
Since you already added the GRAFANA_BASE_URL variable during UI setup, a Grafana button appears in the YTsaurus interface (to the right of the time range selector).
Clicking it takes the user to the same dashboard with the same parameters for the same time period directly in the Grafana interface.

Fig. 3. Button demonstration in the cluster’s internal UI.

Fig. 4. Grafana interface with the same parameters as in the internal UI (Fig. 3).
By default, the button is available to all cluster users. If you create the //sys/interface_monitoring/allow_grafana_url document (or //sys/interface-monitoring/allow_grafana_url for UI versions 3.12.2 and earlier), the button will only be visible to users who have the use permission on this document.
Supported dashboards
Currently, the following dashboards are automatically loaded and supported:
- master-accounts
- scheduler-operation
- bundle-ui-user-load
- bundle-ui-resource
- bundle-ui-cpu
- bundle-ui-memory
- bundle-ui-disk
- bundle-ui-lsm
- bundle-ui-network
- bundle-ui-efficiency
- bundle-ui-rpc-proxy-overview
- bundle-ui-rpc-proxy
- scheduler-internal
- scheduler-pool
- cluster-resources
- master-global
- master-local
- queue-metrics
- queue-consumer-metrics
- http-proxies