Query Tracker

Introduction

Query Tracker is a YTsaurus component that allows working with the system using human-readable queries in SQL-like languages. The user sends queries to Query Tracker, they are executed by the engine, and the execution result is returned to the user.

With Query Tracker, you can:

Send ad hoc queries for execution.
Track query execution.
Save query results.
View query history.
Share queries with others.

See the example of working with Query Tracker.

Queries are defined by engine and text. The engine controls the execution of the query. The query text is engine-specific.

In this way, you can process data using different languages and runtime environments whilst honoring the guarantees and performance targets.

Engines

Currently supported execution engines include:

YT QL
- A query language built in YT. Only supports dynamic tables.
YQL
- Executes the query on YQL agents, which break the query into individual YT operations (map, reduce, ...), start them, then retrieve and return the result.
CHYT
- Executes the query on a clique.
SPYT
- Executes the query on the Spark cluster.

API

All operations accept the optional query_tracker_stage parameter that can be used to select the Query Tracker installation where you want your queries to be run. The default value is production.

start_query

Sends the query for execution. Returns the query ID.

Required parameters:

engine: Query execution engine. Supported values are ql, yql, chyt, and spyt.
query: Query text.

Optional parameters:

files: List of files for the query in YSON format.
settings: Additional query parameters in YSON format.
- In CHYT, you must set a clique alias using the clique parameter. Default value is the public clique.
- In SPYT, you must set the housekeeping directory of an existing Spark cluster using the discovery_path parameter.
- In YQL, you can run queries in different modes using the execution_mode parameter. Acceptable values are run, validate, and optimize. The default value is run.
draft: Used to mark draft queries. These queries are terminated automatically without execution.
annotations: Arbitrary annotations to the query. They can make it easier to search for queries. Specified in YSON format.
access_control_objects: List of objects at //sys/access_control_object_namespaces/queries/ that controls access to the query for other users.

Example: start_query(engine="yql", query="SELECT * FROM my_table", access_control_objects=["my_aco"])

abort_query

Aborts query execution. Doesn't return anything.

Required parameters:

query_id: Query ID.

Optional parameters:

message: Abort message.

Example: abort_query(query_id="my_query_id").

get_query_result

Returns meta information about query execution results.

Required parameters:

query_id: Query ID.
result_index: Result ID.

Example: get_query_result(query_id="my_query_id", result_index=0).

read_query_result

Returns the query results.

Required parameters:

query_id: Query ID.
result_index: Result ID.

Optional parameters:

columns: List of columns to read.
lower_row_index: First row from which to start reading the results.
upper_row_index: Last row at which to stop reading the results.

Example: read_query_result(query_id="my_query_id", result_index=0).

get_query

Returns information about the query.

Required parameters:

query_id: Query ID.

Optional parameters:

attributes: Filters the returned attributes.
timestamp: Returned query information will be consistent with the system as at the specified point in time.

Example: get_query(query_id="my_query_id").

list_queries

Gets a list of queries that match the set filters.

Optional parameters:

from_time: Lower threshold for the query start time.
to_time: Upper threshold for the query start time.
cursor_direction: Direction in which to sort queries by start time.
cursor_time: Cursor stop time. Only applicable if cursor_direction is specified.
user_filter: Filter by query creator.
state_filter: Filter by query state.
engine_filter: Filter by query engine.
substr_filter: Filter by query ID, annotations, and access control object.
limit: Number of items to return. The default value is 100.
attributes: Filters the returned attributes.

Example: list_queries().

alter_query

Modifies the query. Doesn't return anything.

Required parameters:

query_id: Query ID.

Optional parameters:

annotations: New annotations for the query.
access_control_objects: New access control object list for the query.

Example: alter_query(query_id="my_query_id", access_control_object=["my_new_aco"]).

Access control

To manage access to queries and their results, the query includes a list of access_control_objects strings that point to //sys/access_control_object_namespaces/queries/[access_control_object].

An Access Control Object (ACO) is an object with the @principal_acl attribute. It sets access rules in the same manner as @acl does for Cypress nodes. For more information, see Access control.

You can create an ACO through the user interface or by calling the create command:

yt create access_control_object --attr '{namespace=queries;name=my_aco}'.

All APIs use ACOs to verify access:

Use grants access to the query.
Read grants access to the query results.
Administer grants the permission to modify and stop the query.

Detailed information on each API method:

start_query checks whether the passed ACO exists.
alter_query checks whether the query ACO grants the Administer permission to the user and whether the passed ACO exists.
get_query checks whether the query ACO grants the Use permission to the user.
list_queries checks whether the query ACO grants the Use permission to the user.
get_query_result checks whether the query ACO grants the Read permission to the user.
read_query_result checks whether the query ACO grants the Read permission to the user.
abort_query checks whether the query ACO grants the Administer permission to the user.

A few things to keep in mind:

The query creator always has access to their queries.
If a query has multiple ACOs, you only need to have access to one of them to get access to the query.

For convenience, "ytsaurus-k8s-operator" creates the following ACOs:

nobody. Doesn't grant any permissions.
everyone. Grants Use and Read permissions to all users.
everyone-use. Grants Use permissions to all users.
everyone-share. The same as "everyone", but is not returned in list_queries.

Example

A typical usage scenario for Query Tracker is as follows:

Sending a query for execution.
- start_query(engine="yql", query="SELECT * from //home/me/my_table", access_control_objects=["my_team_aco"]).
Getting a list of our available queries.
- list_queries().
Getting information about our query.
- get_query(query_id="my_query_id").
Waiting for the query to complete.
- get_query(query_id="my_query_id").state == "completed".
Getting meta information about the results (result size, table schema).
- get_query_result(query_id="my_query_id", result_index=0).
Reading the result.
- read_query_result(query_id="my_query_id", result_index=0).
Changing the ACO of the query to make it accessible to everyone.
- alter_query(query_id="my_query_id", access_control_objects=["everyone"]).

Working with Microsoft Excel tables

Python User Defined Functions