Query anatomy before and after the ClickHouse engine

This article describes the CHYT query architecture and where to find query execution information.

Query IDs: trace id and query id

Several IDs are associated with every query; they let you reconstruct all the necessary query execution information after the fact.

  • query id is the ID assigned to each query in ClickHouse. Unlike in the original ClickHouse, this ID in CHYT is not controlled by the user and always has the YTsaurus GUID form — four hexadecimal uint32 values separated by hyphens (a rough format check is sketched after this list).
  • trace id is the ID that allows you to chain calls in distributed systems, creating a query execution "trace". It's part of the OpenTracing standard, specifically its Jaeger Tracing implementation used in YTsaurus. trace id is also a YTsaurus GUID, which sometimes coincides with query id and sometimes doesn't, as detailed below.
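
Both IDs share the YTsaurus GUID form. Below is a minimal format check in Python; the regex is an assumption derived from the description above, not a normative definition:

import re

# A YTsaurus GUID is four hexadecimal uint32 values separated by hyphens,
# e.g. 8e9bcc43-5c2be9b4-56f18c4e-117ea314. Each part fits into 32 bits,
# so it is 1 to 8 hex digits long (an assumption based on the description
# above).
YT_GUID_RE = re.compile(r"^[0-9a-f]{1,8}(-[0-9a-f]{1,8}){3}$")

def is_yt_guid(s: str) -> bool:
    return bool(YT_GUID_RE.match(s))

assert is_yt_guid("8e9bcc43-5c2be9b4-56f18c4e-117ea314")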

Query execution journey from the client to heavy proxies

To better understand how these IDs arise and are assigned, let's trace which components a query passes through on its way to CHYT, and in what order. Currently, the only public API for accessing CHYT is the HTTP API, so the description below applies specifically to the HTTP protocol.

The first important point in the query execution journey is the so-called heavy YTsaurus proxies, to which all YTsaurus users have network access. A heavy proxy discovers where the instances of the clique with the given alias currently reside and forwards the query to a randomly selected instance.

The three most popular ways to access YTsaurus are shown below.

When you access CHYT directly from a script or from the command line using the cURL utility, an SLB balancer (for example, http://$YT_PROXY) acts as the endpoint. Behind it lies a complex hierarchy of balancers that directs the query to the so-called control proxies, which respond with an HTTP redirect to the heavy proxies that service all the heavy load in YTsaurus. With this access interface, the query id matches the trace id: you can see both in the X-Yt-Trace-Id and X-ClickHouse-Query-Id response headers.

Below is an example of interaction with CHYT using the cURL utility, with the most notable response headers indicated.

Note

In YTsaurus 24.1 and lower, a clique alias is specified with an asterisk at the beginning. For example, *ch_public. In the current version, *ch_public and ch_public are the same alias.

When CHYT is accessed via the deprecated /query path, the alias is passed in the database URL parameter. When CHYT is accessed via the new /chyt path, the alias is taken from the user field or from the chyt.clique_alias URL parameter.

$ curl -v --location-trusted 'http://$YT_PROXY/chyt?chyt.clique_alias=ch_public' -d 'select max(a) from "//sys/clickhouse/sample_table"' -H "Authorization: OAuth `cat ~/.yt/token`"
*   Trying ip_address:80...
* Connected to $YT_PROXY (ip_address) port 80 (#0)
> POST /chyt?chyt.clique_alias=ch_public HTTP/1.1
> Host: $YT_PROXY
> User-Agent: curl/7.69.1-DEV
> Accept: */*
> Authorization: OAuth ...<my_token>...
> Content-Length: 50
> Content-Type: application/x-www-form-urlencoded
>
* upload completely sent off: 50 out of 50 bytes
* Mark bundle as not supporting multiuse
  // Getting a redirect to a heavy proxy.
< HTTP/1.1 307 Temporary Redirect
< Content-Length: 0
< Location: http://sas4-9923-proxy-$YT_PROXY/chyt?chyt.clique_alias=ch_public
< X-Yt-Trace-Id: 8e9bcc43-5c2be9b4-56f18c4e-117ea314
<
* Connection #0 to host $YT_PROXY left intact
* Issue another request to this URL: 'http://sas4-9923-$YT_PROXY/chyt?chyt.clique_alias=ch_public'
*   Trying ip_address:80...
* Connected to sas4-9923-$YT_PROXY (ip_address) port 80 (#1)
> POST /chyt?chyt.clique_alias=ch_public HTTP/1.1
> Host: sas4-9923-$YT_PROXY
> User-Agent: curl/7.69.1-DEV
> Accept: */*
> Authorization: OAuth ...<my_token>...
> Content-Length: 50
> Content-Type: application/x-www-form-urlencoded
>
* upload completely sent off: 50 out of 50 bytes
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Transfer-Encoding: chunked
  // Note: query id = trace id.
< X-ClickHouse-Query-Id: 3fa9405e-15b29877-524e3e67-2be50e94
< X-Yt-Trace-Id: 3fa9405e-15b29877-524e3e67-2be50e94
  // For technical reasons, X-Yt-Trace-Id appears twice.
< X-Yt-Trace-Id: 3fa9405e-15b29877-524e3e67-2be50e94
< Keep-Alive: timeout=10
  // Address of the query coordinator instance.
< X-ClickHouse-Server-Display-Name: sas2-1374-node-$YT_PROXY
< X-Yt-Request-Id: 3fa9405d-26285349-db14531a-2a12b9f9
< Date: Sun, 05 Apr 2020 18:49:57 GMT
< Content-Type: text/tab-separated-values; charset=UTF-8
< X-ClickHouse-Summary: {"read_rows":"0","read_bytes":"0","written_rows":"0","written_bytes":"0","total_rows_to_read":"0"}
  // Address of the heavy proxy that served the query.
< X-Yt-Proxy: sas4-9923-$YT_PROXY
<
1100
* Connection #1 to host sas4-9923-$YT_PROXY left intact
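
The same exchange can be scripted. Here is a minimal sketch in Python using the requests library; the proxy address, alias, and token location are the placeholders from the example above. Note that requests drops the Authorization header on cross-host redirects, so the 307 redirect is followed manually, which is the analogue of curl's --location-trusted:

import os
import requests

def run_chyt_query(query: str, proxy: str = "$YT_PROXY",
                   alias: str = "ch_public") -> requests.Response:
    # Read the token, as in the cURL example above.
    with open(os.path.expanduser("~/.yt/token")) as f:
        token = f.read().strip()
    headers = {"Authorization": "OAuth " + token}
    # With the deprecated /query path, the alias would be passed in the
    # `database` URL parameter instead.
    url = "http://{}/chyt?chyt.clique_alias={}".format(proxy, alias)
    # The control proxy replies with 307 Temporary Redirect to a heavy proxy;
    # follow it manually, re-attaching the Authorization header.
    resp = requests.post(url, data=query, headers=headers,
                         allow_redirects=False)
    if resp.status_code == 307:
        resp = requests.post(resp.headers["Location"], data=query,
                             headers=headers)
    return resp

resp = run_chyt_query('select max(a) from "//sys/clickhouse/sample_table"')
print(resp.headers.get("X-ClickHouse-Query-Id"))             # query id
print(resp.headers.get("X-Yt-Trace-Id"))                     # trace id
print(resp.headers.get("X-ClickHouse-Server-Display-Name"))  # coordinator
print(resp.text)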

Executing a query within a clique

A heavy proxy selects one instance from the clique and delegates query execution to it. This instance is called the query "coordinator" (or "initiator"). If the query doesn't use YTsaurus tables (for example, when querying system tables starting with the system. prefix, or for a primitive query like SELECT 1), it's executed entirely on the coordinator, just as it would be in stock ClickHouse.

However, if the query affects YTsaurus tables, then it's always executed in a distributed manner. Let's examine the execution of a simple scan query that supports filtering, projection, and aggregation:

SELECT someColumns, someFunctions(someColumns) FROM "//path/to/table" WHERE somePredicate(someColumns) [GROUP BY someColumns]

This query splits //path/to/table into roughly equal sections, with each section processed independently on one of the clique's instances. Essentially, each instance receives a modified query in which //path/to/table in the FROM clause is replaced by an expression of the form ytSubquery(...), whose argument encodes the description of the next table section.
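
Schematically, the rewrite looks as follows. This is an illustration only: the real ytSubquery argument is an opaque serialized section description, elided here as "...":

# The initial query, as received from the client:
initial_query = (
    'SELECT someColumns FROM "//path/to/table" '
    'WHERE somePredicate(someColumns)'
)

# What each instance receives instead (schematically): the table in the
# FROM clause is replaced by ytSubquery(...) describing its section.
secondary_query = (
    'SELECT someColumns FROM ytSubquery("...") '
    'WHERE somePredicate(someColumns)'
)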

Thus, query execution boils down to executing a number of remote subqueries on the clique's instances. The image above illustrates this mechanism.

It's worth noting that there are two kinds of queries:

  • QueryKind: InitialQuery: Initial queries received from clients.
  • QueryKind: SecondaryQuery: Modified remote subqueries that reference specific YTsaurus table sections in the FROM clause.

Each secondary query inherits the trace id of its parent. And although each secondary query gets its own query id, it still remembers its initial query: in the system logs, the query id of the parent initial query is recorded for secondary queries as InitialQueryId.
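
This makes it possible to collect all the secondary queries spawned by a given initial query from a clique's logs. A rough sketch; the log file name is hypothetical, and the exact InitialQueryId line layout is an assumption that may vary between versions:

def secondary_query_lines(log_path: str, initial_query_id: str):
    # Log lines for secondary queries mention the parent query id next to
    # the InitialQueryId marker (exact formatting is an assumption).
    needle = "InitialQueryId: " + initial_query_id
    with open(log_path) as log:
        for line in log:
            if needle in line:
                yield line.rstrip("\n")

# Hypothetical log file name; the query id is taken from the example above.
for line in secondary_query_lines("clickhouse.debug.log",
                                  "3fa9405e-15b29877-524e3e67-2be50e94"):
    print(line)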

I/O and compression in CHYT

Below is some useful information and recommendations on reading data from YTsaurus. Take these into account when using CHYT.

Network

Instances of any clique are started on arbitrary machines of the cluster, so data is most likely read over the network. As a rule of thumb, expect each successive gigabyte of data to be read to reside on a different machine in the cluster.

Note that although the YTsaurus cluster network rarely becomes a bottleneck, it can happen.

Compression

In YTsaurus, data is typically stored in compressed form. Compression is controlled by the /@compression_codec table attribute, which is also shown in the table view web interface. This is further described in a dedicated documentation article.

Instances read data from compressed tables in compressed form. This data is then decompressed just before being processed by the ClickHouse engine. When choosing a compression codec for your data, take into account that the stronger the codec, the more costly it will be to decompress the data.
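
To estimate the decompression cost up front, you can inspect a table's codec and compression ratio. A sketch using the YTsaurus Python client (yt.wrapper); the table path is the sample one from above, and $YT_PROXY is a placeholder:

import yt.wrapper as yt

yt.config["proxy"]["url"] = "$YT_PROXY"  # cluster address, placeholder

table = "//sys/clickhouse/sample_table"
# The codec and both data sizes are attributes of the table node itself.
codec = yt.get(table + "/@compression_codec")
compressed = yt.get(table + "/@compressed_data_size")
uncompressed = yt.get(table + "/@uncompressed_data_size")

print("codec:", codec)
print("compression ratio: %.2f" % (uncompressed / compressed))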

Uncompressed block cache

CHYT has an uncompressed block cache, which is allocated 20 GiB by default. The cache size can be controlled with the --uncompressed-block-cache-size parameter in the clique start command.

This cache stores uncompressed blocks of recently read tables, which significantly speeds up queries over data recently read by another query, because some of it may already be available in instance memory. For this cache to work effectively, keep the following caveats in mind:

  • The cache works well as long as the number of instances in the clique stays the same. If an instance disappears (for example, due to preemption, a YTsaurus node failure, or a CHYT crash), future subqueries are arbitrarily redistributed across instances, and most of the data blocks cached in instance memory may become useless, increasing the block cache miss rate. However, in this case, the very next query will restore the necessary cache data.
  • If table data changes, the table sections are arbitrarily redistributed across subqueries, which similarly leads to block cache misses.
  • Finally, if the amount of processed data per instance is significantly larger than the block cache size (in terms of uncompressed data size), the cache won't be of much use for subsequent queries on the same data, because most of it will have to be re-read over the network anyway (see the back-of-the-envelope check after this list).
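
A back-of-the-envelope check for the last point; the figures are made up, and the 20 GiB default comes from above:

GiB = 1024 ** 3

uncompressed_data_size = 600 * GiB  # data a query reads, in uncompressed terms
instance_count = 8                  # clique size (made-up figure)
block_cache_size = 20 * GiB         # default --uncompressed-block-cache-size

per_instance = uncompressed_data_size / instance_count  # 75 GiB per instance
# False here: blocks are evicted faster than they can be reused, so repeat
# queries over the same data will mostly re-read it over the network.
print(per_instance <= block_cache_size)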

HDD vs SSD

Even though SSDs offer faster read speeds than HDDs, there is no need to rush to migrate your data to an SSD the moment your queries begin to slow down. It's possible that the bottleneck is actually the CPU, not I/O: for example, the data may already be in the block cache, while you still need to process a few dozen gigabytes of decompressed data to respond to the query. In this situation, moving your tables from HDD to SSD won't help.

If you believe that migrating your data to an SSD will speed up query execution, follow the action plan provided in the FAQ.

About columnar storage

In most cases when working with ClickHouse, you want to use the columnar format for table storage, which you can implement by setting the /@optimize_for attribute to scan.

Columnar storage lets you efficiently read individual table columns without loading the entire table into memory, and it helps conserve disk space, since YTsaurus applies special codecs for columnar storage that optimize the column representation.
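
A sketch of switching a table to columnar storage with the YTsaurus Python client; the path is a placeholder, and note (an assumption based on general YTsaurus behavior, not stated above) that the attribute applies to newly written chunks, so existing data needs to be rewritten:

import yt.wrapper as yt

yt.config["proxy"]["url"] = "$YT_PROXY"

# Switch the table to columnar storage.
yt.set("//path/to/table/@optimize_for", "scan")

# Assumption: the attribute only applies to newly written chunks; rewrite
# the existing data, e.g. with a forced merge, to convert it.
yt.run_merge("//path/to/table", "//path/to/table",
             spec={"force_transform": True})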

However, there may be situations requiring the row-oriented format (/@optimize_for = lookup). The main disadvantage of the columnar format is high memory consumption, especially for tables with a large number of columns, because each column requires its own buffer. In addition, it's worth noting that reading lookup tables is slightly less CPU-intensive.