# Cluster monitoring setup
YTsaurus allows you to export cluster object metrics to various monitoring systems. Prometheus is used to collect metrics; to view them, you can use Grafana as well as the built-in dashboards in the YTsaurus UI.
## Installing and configuring Prometheus
The Prometheus Operator is used to collect metrics. YTsaurus components and Odin are automatically labeled for metric collection. The Odin Helm chart automatically creates a `ServiceMonitor` resource for its metrics during installation. To collect metrics from cluster components, we will create a separate `ServiceMonitor` manually.
1. Install the Prometheus Operator according to the instructions.
2. Make sure the operator pod is in the `Running` state:

   ```bash
   kubectl get pods -l app.kubernetes.io/name=prometheus-operator
   ```
3. Create a `prometheus.yaml` file:

   ```yaml
   apiVersion: monitoring.coreos.com/v1
   kind: Prometheus
   metadata:
     name: prometheus
   spec:
     serviceAccountName: prometheus
     resources:
       requests:
         memory: 400Mi
     enableAdminAPI: true
     storage:
       volumeClaimTemplate:
         spec:
           accessModes: ["ReadWriteOnce"]
           resources:
             requests:
               storage: 10Gi
     serviceMonitorSelector:
       matchLabels:
         yt_metrics: "true"
     additionalArgs:
       - name: log.level
         value: debug
   ---
   apiVersion: v1
   kind: ServiceAccount
   metadata:
     name: prometheus
   ---
   apiVersion: rbac.authorization.k8s.io/v1
   kind: ClusterRole
   metadata:
     name: prometheus
   rules:
     - apiGroups: [""]
       resources:
         - services
         - endpoints
         - pods
         - namespaces
       verbs: ["get", "list", "watch"]
     - apiGroups:
         - "discovery.k8s.io"
       resources:
         - endpointslices
       verbs:
         - "get"
         - "list"
         - "watch"
   ---
   apiVersion: rbac.authorization.k8s.io/v1
   kind: ClusterRoleBinding
   metadata:
     name: prometheus
   roleRef:
     apiGroup: rbac.authorization.k8s.io
     kind: ClusterRole
     name: prometheus
   subjects:
     - kind: ServiceAccount
       name: prometheus
       namespace: default
   ---
   apiVersion: monitoring.coreos.com/v1
   kind: ServiceMonitor
   metadata:
     name: ytsaurus-metrics
     labels:
       yt_metrics: "true"
   spec:
     namespaceSelector:
       any: true
     selector:
       matchLabels:
         yt_metrics: "true"
     endpoints:
       - port: ytsaurus-metrics
         path: /solomon/all
         relabelings:
           - sourceLabels: [__meta_kubernetes_pod_label_ytsaurus_tech_cluster_name]
             targetLabel: cluster
         metricRelabelings:
           - targetLabel: service
             sourceLabels:
               - service
             regex: (.*)-monitoring
             replacement: ${1}
   ```

   If necessary, you can modify the `ServiceMonitor` based on your requirements.

   For the `ClusterRoleBinding`, in the `subjects[0].namespace` field, specify the namespace in which you plan to deploy Prometheus.
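The `metricRelabelings` rule above strips the `-monitoring` suffix from the `service` label, so that, for example, a value of `yt-master-monitoring` becomes `yt-master`. Prometheus matches relabeling regexes against the entire label value; the following Python sketch (illustrative only, not part of the setup) emulates that behavior:

```python
import re

def relabel_service(value: str) -> str:
    """Emulate the metricRelabelings rule: regex "(.*)-monitoring",
    replacement "${1}".

    Prometheus uses RE2 and anchors the regex to the whole label value,
    which re.fullmatch approximates here. If the regex does not match,
    the label is left unchanged.
    """
    match = re.fullmatch(r"(.*)-monitoring", value)
    return match.group(1) if match else value

print(relabel_service("yt-master-monitoring"))  # -> yt-master
print(relabel_service("yt-master"))             # -> yt-master (no match, unchanged)
```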
4. Apply the `prometheus.yaml` file:

   ```bash
   kubectl -n <namespace> apply -f prometheus.yaml
   ```
5. Make sure the Prometheus pod is in the `Running` state:

   ```bash
   kubectl -n <namespace> get pods -l app.kubernetes.io/name=prometheus
   ```
6. Make sure the Prometheus service is created:

   ```bash
   kubectl -n <namespace> get svc -l managed-by=prometheus-operator
   ```
7. Execute a simple query and see which pods metrics are collected from.

   Open access to the Prometheus service:

   ```bash
   kubectl -n <namespace> port-forward service/prometheus-operated 9090:9090
   ```

   **Via the Prometheus UI**

   If possible, open the Prometheus UI at http://localhost:9090. If not, use the `curl` approach below.

   In the **Query** section, execute a simple query:

   ```promql
   yt_accounts_chunk_count{account="sys"}
   ```

   We see the number of chunks for the "sys" account.

   *Fig. 1. Result of querying the number of chunks for the "sys" account in the Prometheus UI.*

   It is important to make sure that the `cluster` label is set in the metrics.

   In the **Status → Target health** section, you can find a list of all monitored components.

   **Via `curl`**

   Execute a simple PromQL query:

   ```bash
   curl 'http://localhost:9090/api/v1/query?query=yt_accounts_chunk_count\{account="sys"\}' | jq
   ```

   We see the number of chunks for the "sys" account:

   ```json
   {
     "status": "success",
     "data": {
       "resultType": "vector",
       "result": [
         {
           "metric": {
             "__name__": "yt_accounts_chunk_count",
             "account": "sys",
             "cluster": "ytsaurus",
             "container": "ytserver",
             "endpoint": "ytsaurus-metrics",
             "instance": "10.244.0.178:10010",
             "job": "yt-master-monitoring",
             "namespace": "ytsaurus-dev",
             "pod": "ms-0",
             "service": "yt-master"
           },
           "value": [1766656488.985, "605"]
         }
       ]
     }
   }
   ```

   Also, execute a request to get the list of pods from which metrics are collected:

   ```bash
   curl 'http://localhost:9090/api/v1/targets?state=active' | jq '
     {
       target_count: (.data.activeTargets | length),
       targets: [
         .data.activeTargets[] | {
           pod: .labels.pod,
           namespace: .labels.namespace,
           job: .labels.job,
           health: .health,
           lastError: .lastError,
           scrapeUrl: .scrapeUrl,
           scrapePool: .scrapePool
         }
       ]
     }'
   ```

   Example of the expected result:

   ```json
   {
     "target_count": 16,
     "targets": [
       {
         "pod": "end-0",
         "namespace": "ytsaurus-dev",
         "job": "yt-exec-node-monitoring",
         "health": "up",
         "lastError": "",
         "scrapeUrl": "http://10.244.0.200:10029/solomon/all",
         "scrapePool": "serviceMonitor/default/ytsaurus-metrics/0"
       },
       ...
     ]
   }
   ```
8. If you have Odin installed, check whether its metrics are being collected.

   Qualitative metrics from Odin are collected through a separate `ServiceMonitor` created by the Odin chart itself.

   **Via the Prometheus UI**

   In the **Target health** section, it will be displayed like this:

   *Fig. 2. Example of the Odin service displayed in Prometheus.*

   **Via `curl`**

   Execute a request to get the list of pods containing `odin` in the name, from which metrics are collected:

   ```bash
   curl 'http://localhost:9090/api/v1/targets?state=active' | jq '
     .data.activeTargets
     | map(select(.labels.pod | contains("odin")))
     | {
         targets: map({
           pod: .labels.pod,
           namespace: .labels.namespace,
           job: .labels.job,
           health: .health,
           lastError: .lastError,
           scrapeUrl: .scrapeUrl,
           scrapePool: .scrapePool
         })
       }'
   ```

   Example of the expected result:

   ```json
   {
     "targets": [
       {
         "pod": "odin-odin-chart-web-6f8f5cbb7f-n5slb",
         "namespace": "default",
         "job": "odin-odin-chart-web-monitoring",
         "health": "up",
         "lastError": "",
         "scrapeUrl": "http://10.244.0.33:9002/prometheus",
         "scrapePool": "serviceMonitor/default/odin-odin-chart-metrics/0"
       }
     ]
   }
   ```

   If it is not displayed, check for the presence of the `ServiceMonitor` in the same namespace as Prometheus:

   ```bash
   kubectl -n <namespace> get servicemonitor -l app.kubernetes.io/name=odin-chart
   ```

   If it is missing, enable `ServiceMonitor` creation in the chart settings.
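The same check can be scripted. This sketch mirrors the jq filter above: it takes an already-fetched `/api/v1/targets` payload and keeps only the targets whose pod name contains `odin` (the sample data is abridged from the examples in this section):

```python
def odin_targets(payload: dict) -> list:
    """Select active scrape targets whose pod name contains 'odin',
    mirroring the jq filter used in this section."""
    return [
        {
            "pod": t["labels"]["pod"],
            "health": t["health"],
            "scrapeUrl": t["scrapeUrl"],
        }
        for t in payload["data"]["activeTargets"]
        if "odin" in t["labels"].get("pod", "")
    ]

# Abridged /api/v1/targets payload: one Odin target, one master target.
sample = {
    "data": {
        "activeTargets": [
            {"labels": {"pod": "odin-odin-chart-web-6f8f5cbb7f-n5slb"},
             "health": "up",
             "scrapeUrl": "http://10.244.0.33:9002/prometheus"},
            {"labels": {"pod": "ms-0"},
             "health": "up",
             "scrapeUrl": "http://10.244.0.178:10010/solomon/all"},
        ]
    }
}

print(odin_targets(sample))  # only the odin-* pod remains
```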
Done! Prometheus is installed and configured to collect qualitative and quantitative metrics from Odin and YTsaurus components.
## Installing dashboards in the YTsaurus UI
YTsaurus provides ready-made dashboards for monitoring. They can be displayed directly in the YTsaurus web interface.
1. Pass the `PROMETHEUS_BASE_URL` environment variable with the internal Prometheus address to the UI.

   To configure the integration, you need a UI installed via the Helm chart.

   The `PROMETHEUS_BASE_URL` environment variable must contain the internal Prometheus URL, for example `http://prometheus-operated.<namespace>.svc.cluster.local:9090/`. Add the variable to the `ui.env` section of your `values.yaml` file:

   ```yaml
   ui:
     env:
       - name: PROMETHEUS_BASE_URL
         value: "http://prometheus-operated.<namespace>.svc.cluster.local:9090/"
   ```

   Update the chart settings:

   ```bash
   helm upgrade --install yt-ui ytsaurus-ui/packages/ui-helm-chart/ -f values.yaml
   ```
2. The dashboards displayed in the UI are stored in Cypress in `//sys/interface_monitoring`. The `generate_dashboards` utility is used to create and upload them. Clone the repository, go to the directory with the utility, and compile it:

   ```bash
   git clone https://github.com/ytsaurus/ytsaurus
   cd ytsaurus/yt/admin/dashboards/yt_dashboards/bin
   ../../../../../ya make
   ```

3. Specifying the cluster proxy and token, create the `//sys/interface_monitoring` node and upload the `master-accounts` dashboard to Cypress:

   ```bash
   export YT_PROXY=<proxy>
   export YT_TOKEN=<token>
   yt create map_node //sys/interface_monitoring
   ./generate_dashboards submit-cypress master-accounts --backend grafana
   ```

   After that, the dashboard will appear in the YTsaurus web interface.
To view some dashboards, access rights to the viewed objects are required. For example, for the `master-accounts` dashboard to work, the `use` permission on the requested account is required.
## Installing and configuring Grafana
1. Create a `grafana.yaml` file:

   ```yaml
   ---
   apiVersion: v1
   kind: PersistentVolumeClaim
   metadata:
     name: grafana-pvc
     labels:
       app: grafana
   spec:
     accessModes:
       - ReadWriteOnce
     resources:
       requests:
         storage: 1Gi
   ---
   apiVersion: v1
   kind: Secret
   metadata:
     name: grafana-secret
     labels:
       app: grafana
   stringData:
     admin-user: admin
     admin-password: password
   type: Opaque
   ---
   apiVersion: v1
   kind: ConfigMap
   metadata:
     name: grafana-datasources
     labels:
       app: grafana
   data:
     prometheus.yaml: |-
       apiVersion: 1
       datasources:
         - name: Prometheus
           type: prometheus
           url: http://prometheus-operated.<namespace>.svc.cluster.local:9090
           access: proxy
           isDefault: true
           editable: true
   ---
   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: grafana
     labels:
       app: grafana
   spec:
     replicas: 1
     selector:
       matchLabels:
         app: grafana
     template:
       metadata:
         labels:
           app: grafana
       spec:
         securityContext:
           fsGroup: 472
         containers:
           - name: grafana
             image: grafana/grafana:12.1.4
             ports:
               - containerPort: 3000
                 name: http
             env:
               - name: GF_SECURITY_ADMIN_USER
                 valueFrom:
                   secretKeyRef:
                     name: grafana-secret
                     key: admin-user
               - name: GF_SECURITY_ADMIN_PASSWORD
                 valueFrom:
                   secretKeyRef:
                     name: grafana-secret
                     key: admin-password
             volumeMounts:
               - mountPath: /var/lib/grafana
                 name: grafana-storage
               - mountPath: /etc/grafana/provisioning/datasources
                 name: grafana-datasources
                 readOnly: true
             resources:
               requests:
                 cpu: 250m
                 memory: 750Mi
               limits:
                 cpu: 250m
                 memory: 750Mi
         volumes:
           - name: grafana-storage
             persistentVolumeClaim:
               claimName: grafana-pvc
           - name: grafana-datasources
             configMap:
               name: grafana-datasources
   ---
   apiVersion: v1
   kind: Service
   metadata:
     name: grafana
     labels:
       app: grafana
   spec:
     type: ClusterIP
     ports:
       - port: 3000
         targetPort: http
     selector:
       app: grafana
   ```

   It is worth specifying a secure password in the Secret and/or creating it via `kubectl create secret` instead of `apply`.

   In the ConfigMap, replace `<namespace>` in the `url` field with the namespace you are using.
2. Apply the `grafana.yaml` file:

   ```bash
   kubectl -n <namespace> apply -f grafana.yaml
   ```
3. Make sure the pod and service for Grafana are running:

   ```bash
   kubectl -n <namespace> get all -l app=grafana
   ```
4. Go to the Grafana interface, execute a simple query, and create a service account.

   Open access to the UI (the `grafana` service listens on port 3000, so forward 3000:3000):

   ```bash
   kubectl -n <namespace> port-forward service/grafana 3000:3000
   ```

   Go to the UI: http://localhost:3000.

   In the left-hand menu, go to the **Connections → Data sources** section.

   If the `Prometheus` datasource already exists, open it and click **Save & test** at the very bottom. If the response is "Successfully queried the Prometheus API.", Grafana has successfully connected to Prometheus. If an error occurred, check the specified "Prometheus server URL", then update the ConfigMap from the previous step so that the URL and other parameters in it are correct.

   Also, save the uid of this datasource.

   **How to get the datasource UID?** Go to the page with the UID: `http://localhost:3000/connections/datasources/edit/prometheus`. The last part of the URL, `prometheus` in this case, is the UID we need.

   The dashboard generator interacts with Grafana using a service account. You can get a token in **Administration → Users and access → Service accounts**. The service account role must be at least "Editor". Save the service account token, for example `glsa_bk1LYYY`.
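Since the UID is simply the last path segment of the datasource edit URL, extracting it can be automated when scripting dashboard uploads. A small illustrative sketch:

```python
from urllib.parse import urlparse

def datasource_uid(edit_url: str) -> str:
    """Return the last path segment of a Grafana datasource edit URL,
    which is the datasource UID."""
    return urlparse(edit_url).path.rstrip("/").rsplit("/", 1)[-1]

print(datasource_uid(
    "http://localhost:3000/connections/datasources/edit/prometheus"
))  # -> prometheus
```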
5. Using the `generate_dashboards` utility assembled in the previous section, generate and upload the `master-accounts` dashboard to Grafana:

   ```bash
   ./generate_dashboards \
     --dashboard-id ytsaurus-master-accounts \
     --grafana-api-key glsa_bk1LYYY \
     --grafana-base-url http://localhost:3000/ \
     --grafana-datasource '{"type":"prometheus","uid":"prometheus"}' \
     submit master-accounts \
     --backend grafana
   ```

   It is recommended to pass a dashboard name with the `ytsaurus-` prefix in the `--dashboard-id` parameter. This will be required later for linking the YTsaurus UI with Grafana.

   The uploaded dashboard will be visible in the **Dashboards** section.
## Redirects from the YTsaurus UI to Grafana
1. Open public access to Grafana, for example at `https://grafana.ytsaurus.tech/`.

2. Pass the `GRAFANA_BASE_URL` environment variable with the external Grafana address to the YTsaurus UI, in the same way as in the previous section of the documentation.

3. To the right of the time range selector, a "Grafana" button will appear; clicking it takes the user to the same dashboard with the same parameters for the same time interval.

*Fig. 3. Demonstration of the button in the internal cluster UI.*

*Fig. 4. Grafana interface with the same parameters as in the internal UI from Fig. 3.*
By default, the button is available to all cluster users.
If you create a document `//sys/interface_monitoring/allow_grafana_url`, the button will be visible only to users who have the `use` permission on this document.
## Supported dashboards
Currently, the following dashboards are supported:
- `master-accounts`
- `scheduler-operation`
- `bundle-ui-user-load`
- `bundle-ui-resource`
- `bundle-ui-cpu`
- `bundle-ui-memory`
- `bundle-ui-disk`
- `bundle-ui-lsm`
- `bundle-ui-network`
- `bundle-ui-efficiency`
- `bundle-ui-rpc-proxy-overview`
- `bundle-ui-rpc-proxy`
- `scheduler-internal`
- `scheduler-pool`
- `cluster-resources`
- `master-global`
- `master-local`
- `queue-metrics`
- `queue-consumer-metrics`
- `http-proxies`