Python API with examples
- Basic level
- Reading and writing tables
- Simple map
- Sorting a table and a simple reduce operation
- Reduce with multiple input tables
- Reduce with multiple input and output tables
- Table schemas
- MapReduce
- MapReduce with multiple intermediate tables
- Decorators for job classes
- Working with files on the client and in operations
- Grep
- Advanced level
- Different examples
Before using the examples, read the instructions for obtaining the token. We recommend running all the programs on Linux.
Basic level
The modern way of working with the Python API is typed: data stored in tables and processed by operations is represented in code by classes with typed fields (similar to dataclasses). The untyped method, in which table rows are represented by dicts, is not recommended, but is sometimes unavoidable, especially when working with old tables. It is much slower, more error-prone, and inconvenient for composite types. This section of the documentation therefore provides examples for the typed API; untyped examples can be found in the corresponding section.
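For illustration, here is a minimal sketch of the two row representations (the StaffRow class and its fields are invented for this sketch):

```python
import yt.wrapper as yt

# Typed: a table row is an instance of a class with typed fields.
@yt.yt_dataclass
class StaffRow:
    login: str
    name: str
    uid: int

typed_row = StaffRow(login="arkady", name="Arkady", uid=1)

# Untyped: the same row as a plain dict; slower and easier to get wrong.
untyped_row = {"login": "arkady", "name": "Arkady", "uid": 1}
```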
Reading and writing tables
YTsaurus enables you to write data to tables, as well as to append data to the end of existing tables. Several modes are available for reading tables: reading the entire table, reading individual ranges by row number or key.
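Below is a minimal sketch of these operations; the cluster name and table path are placeholders:

```python
import yt.wrapper as yt

@yt.yt_dataclass
class Row:
    login: str
    uid: int

client = yt.YtClient(proxy="cluster_name")  # placeholder cluster name

# Write rows to a table (overwrites existing data).
client.write_table_structured("//tmp/staff", Row, [
    Row(login="arkady", uid=1),
])

# Append rows to the end of an existing table.
client.write_table_structured(
    yt.TablePath("//tmp/staff", append=True), Row,
    [Row(login="boris", uid=2)],
)

# Read the whole table; individual ranges can be selected via yt.TablePath.
for row in client.read_table_structured("//tmp/staff", Row):
    print(row.login, row.uid)
```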
The example is located at yt/python/examples/table_read_write_typed.
Simple map
Suppose there is a table with the login, name, and uid columns at //home/dev/tutorial/staff_unsorted. You need to make a table with email addresses, computing them from the login: email = login + "@ytsaurus.tech".
A simple mapper is suitable for this job.
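A minimal sketch of such a mapper (the cluster name and output path are placeholders):

```python
import typing
import yt.wrapper as yt

@yt.yt_dataclass
class StaffRow:
    login: str
    name: str
    uid: int

@yt.yt_dataclass
class EmailRow:
    name: str
    email: str

class ComputeEmailsMapper(yt.TypedJob):
    # Input and output row types are derived from the type hints.
    def __call__(self, row: StaffRow) -> typing.Iterable[EmailRow]:
        yield EmailRow(name=row.name, email=row.login + "@ytsaurus.tech")

client = yt.YtClient(proxy="cluster_name")  # placeholder
client.run_map(
    ComputeEmailsMapper(),
    "//home/dev/tutorial/staff_unsorted",
    "//tmp/staff_with_emails",  # placeholder output path
)
```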
The example is located at yt/python/examples/simple_map_typed.
Sorting a table and a simple reduce operation
Using the same table as in the previous example, you can compute some statistics: how many times each name occurs, or which username is the longest among all people with the same name. A Reduce operation is well suited for this job. Since Reduce can only be run on sorted tables, you must first sort the source table.
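A simplified sketch of the sort followed by a reduce (the cluster name and table paths are placeholders):

```python
import typing
import yt.wrapper as yt

@yt.yt_dataclass
class StaffRow:
    login: str
    name: str
    uid: int

@yt.yt_dataclass
class NameStatRow:
    name: str
    count: int
    longest_login: str

class CountNamesReducer(yt.TypedJob):
    def __call__(self, rows: typing.Iterable[StaffRow]) -> typing.Iterable[NameStatRow]:
        count = 0
        longest = ""
        name = None
        for row in rows:  # all rows in one call share the same name
            name = row.name
            count += 1
            if len(row.login) > len(longest):
                longest = row.login
        yield NameStatRow(name=name, count=count, longest_login=longest)

client = yt.YtClient(proxy="cluster_name")  # placeholder
client.run_sort("//home/dev/tutorial/staff_unsorted", "//tmp/staff_sorted",
                sort_by=["name"])
client.run_reduce(CountNamesReducer(), "//tmp/staff_sorted", "//tmp/name_stat",
                  reduce_by=["name"])
```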
The example is located at yt/python/examples/simple_reduce_typed.
Reduce with multiple input tables
Suppose that, besides the table with users, there is another table that records which users are service ones: its is_robot field can be true or false.
The following program produces a table in which only the service users remain.
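A simplified sketch of such a reducer; the prepare_operation method used here to declare the types of the two inputs is covered in its own section below, and the table paths are placeholders:

```python
import yt.wrapper as yt

@yt.yt_dataclass
class StaffRow:
    login: str
    name: str
    uid: int

@yt.yt_dataclass
class IsRobotRow:
    uid: int
    is_robot: bool

class FilterRobotsReducer(yt.TypedJob):
    def prepare_operation(self, context, preparer):
        (preparer
            .input(0, type=StaffRow)
            .input(1, type=IsRobotRow)
            .output(0, type=StaffRow))

    def __call__(self, rows):
        staff_row = None
        is_robot = False
        # Rows of both tables sharing one uid arrive in a single call;
        # the table index tells which table a row came from.
        for row, context in rows.with_context():
            if context.get_table_index() == 0:
                staff_row = row
            else:
                is_robot = row.is_robot
        if is_robot and staff_row is not None:
            yield staff_row

client = yt.YtClient(proxy="cluster_name")  # placeholder
# Both input tables must be sorted by the reduce key (uid).
client.run_reduce(
    FilterRobotsReducer(),
    ["//tmp/staff_sorted_by_uid", "//tmp/is_robot_sorted"],  # placeholders
    "//tmp/robots",
    reduce_by=["uid"],
)
```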
The example is located at yt/python/examples/multiple_input_reduce_typed.
Reduce with multiple input and output tables
This example repeats the previous one, with the difference that two output tables are written at once: one with human users and one with service users.
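A sketch of the changed part: the reducer now declares two outputs and wraps each row in OutputRow to pick the destination table (the paths are placeholders):

```python
import yt.wrapper as yt

@yt.yt_dataclass
class StaffRow:
    login: str
    name: str
    uid: int

@yt.yt_dataclass
class IsRobotRow:
    uid: int
    is_robot: bool

class SplitRobotsReducer(yt.TypedJob):
    def prepare_operation(self, context, preparer):
        (preparer
            .input(0, type=StaffRow)
            .input(1, type=IsRobotRow)
            .output(0, type=StaffRow)   # humans
            .output(1, type=StaffRow))  # robots

    def __call__(self, rows):
        staff_row = None
        is_robot = False
        for row, context in rows.with_context():
            if context.get_table_index() == 0:
                staff_row = row
            else:
                is_robot = row.is_robot
        if staff_row is not None:
            # table_index selects the output table for this row.
            yield yt.OutputRow(staff_row, table_index=1 if is_robot else 0)

client = yt.YtClient(proxy="cluster_name")  # placeholder
client.run_reduce(
    SplitRobotsReducer(),
    ["//tmp/staff_sorted_by_uid", "//tmp/is_robot_sorted"],  # placeholders
    ["//tmp/humans", "//tmp/robots"],
    reduce_by=["uid"],
)
```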
The example is located at yt/python/examples/multiple_input_multiple_output_reduce_typed.
Table schemas
All tables in YTsaurus have a schema.
The example demonstrates how to work with table schemas.
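A minimal sketch of two ways to obtain a schema, assuming placeholder paths:

```python
import yt.wrapper as yt
import yt.type_info as ti
from yt.wrapper.schema import TableSchema

@yt.yt_dataclass
class StaffRow:
    login: str
    name: str
    uid: int

# A schema can be built from a data class...
schema = TableSchema.from_row_type(StaffRow)

# ...or constructed column by column.
explicit_schema = (
    TableSchema()
    .add_column("login", ti.String, sort_order="ascending")
    .add_column("name", ti.String)
    .add_column("uid", ti.Int64)
)

client = yt.YtClient(proxy="cluster_name")  # placeholder
client.create("table", "//tmp/staff", attributes={"schema": schema})
print(client.get("//tmp/staff/@schema"))
```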
The example is located at yt/python/examples/table_schema_typed.
MapReduce
YTsaurus implements a fused MapReduce operation that works faster than the Map + Sort + Reduce chain. Let's use the table with users once again to calculate how many times each name occurs. Before counting, we will normalize the names by converting them to lowercase, so that people named ARCHIE and Archie are merged in our statistics.
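A simplified sketch of such an operation (the cluster name and output path are placeholders):

```python
import typing
import yt.wrapper as yt

@yt.yt_dataclass
class StaffRow:
    login: str
    name: str
    uid: int

@yt.yt_dataclass
class CountRow:
    name: str
    count: int

class NormalizeNameMapper(yt.TypedJob):
    def __call__(self, row: StaffRow) -> typing.Iterable[StaffRow]:
        row.name = row.name.lower()
        yield row

class CountNamesReducer(yt.TypedJob):
    def __call__(self, rows: typing.Iterable[StaffRow]) -> typing.Iterable[CountRow]:
        name = None
        count = 0
        for row in rows:
            name = row.name
            count += 1
        yield CountRow(name=name, count=count)

client = yt.YtClient(proxy="cluster_name")  # placeholder
client.run_map_reduce(
    NormalizeNameMapper(),
    CountNamesReducer(),
    "//home/dev/tutorial/staff_unsorted",
    "//tmp/name_counts",  # placeholder
    reduce_by=["name"],
)
```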
The example is located at yt/python/examples/map_reduce_typed.
MapReduce with multiple intermediate tables
The intermediate data (between the map and reduce stages) of a MapReduce operation can "flow" in several streams of different types. In this example, the operation has two input tables: the first maps a uid to a name, and the second contains the events associated with the user with that uid. The mapper sends "click" events to one output stream and all users to another. The reducer counts the clicks of each user.
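A heavily simplified skeleton of what such an operation might look like; the data classes, field names, and paths are invented for this sketch:

```python
import yt.wrapper as yt

@yt.yt_dataclass
class UserRow:
    uid: int
    name: str

@yt.yt_dataclass
class EventRow:
    uid: int
    event_type: str

@yt.yt_dataclass
class ClickCountRow:
    name: str
    click_count: int

class SplitMapper(yt.TypedJob):
    def prepare_operation(self, context, preparer):
        (preparer
            .input(0, type=UserRow).input(1, type=EventRow)
            .output(0, type=UserRow).output(1, type=EventRow))

    def __call__(self, row):
        # Users go to one intermediate stream, clicks to another.
        if isinstance(row, UserRow):
            yield yt.OutputRow(row, table_index=0)
        elif row.event_type == "click":
            yield yt.OutputRow(row, table_index=1)

class CountClicksReducer(yt.TypedJob):
    def prepare_operation(self, context, preparer):
        (preparer
            .input(0, type=UserRow).input(1, type=EventRow)
            .output(0, type=ClickCountRow))

    def __call__(self, rows):
        name = None
        clicks = 0
        for row, context in rows.with_context():
            if context.get_table_index() == 0:
                name = row.name
            else:
                clicks += 1
        yield ClickCountRow(name=name, click_count=clicks)

client = yt.YtClient(proxy="cluster_name")  # placeholder
client.run_map_reduce(
    SplitMapper(), CountClicksReducer(),
    ["//tmp/users", "//tmp/events"],  # placeholders
    "//tmp/click_counts",
    reduce_by=["uid"],
)
```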
The example is located at yt/python/examples/map_reduce_multiple_intermediate_streams_typed.
Decorators for job classes
You can mark job functions or classes with special decorators that change the expected interface of interaction with jobs. Examples of such decorators: with_context, aggregator, reduce_aggregator, raw, and raw_io. You can find a full description in the documentation.
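For instance, here is a sketch of what the aggregator and with_context decorators change (the data classes are invented):

```python
import typing
import yt.wrapper as yt

@yt.yt_dataclass
class InRow:
    value: int

@yt.yt_dataclass
class SumRow:
    total: int

# An aggregator mapper receives an iterator over all of its rows
# instead of being called once per row.
@yt.wrapper.aggregator
class SumMapper(yt.TypedJob):
    def __call__(self, rows: typing.Iterable[InRow]) -> typing.Iterable[SumRow]:
        yield SumRow(total=sum(row.value for row in rows))

# with_context passes a context object with the table/row/range index
# as an extra argument.
@yt.wrapper.with_context
class FirstTableMapper(yt.TypedJob):
    def __call__(self, row: InRow, context) -> typing.Iterable[InRow]:
        if context.get_table_index() == 0:
            yield row
```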
The example is located at yt/python/examples/job_decorators_typed.
Working with files on the client and in operations
For more information about working with files, see the documentation. For more information about files in Cypress, see the corresponding section.
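A minimal sketch of working with files, assuming the yt_files parameter of run_map and the start() hook of a typed job; the paths are placeholders:

```python
import typing
import yt.wrapper as yt

@yt.yt_dataclass
class Row:
    key: str

class FilterByConfigMapper(yt.TypedJob):
    def start(self):
        # An attached file appears in the job's current directory
        # under its basename.
        with open("allowed_keys.txt") as f:
            self._allowed = set(f.read().split())

    def __call__(self, row: Row) -> typing.Iterable[Row]:
        if row.key in self._allowed:
            yield row

client = yt.YtClient(proxy="cluster_name")  # placeholder
client.write_file("//tmp/allowed_keys.txt", b"foo bar\n")  # a file in Cypress
client.run_map(
    FilterByConfigMapper(),
    "//tmp/input",   # placeholder paths
    "//tmp/output",
    yt_files=["//tmp/allowed_keys.txt"],  # attach the Cypress file to the jobs
)
```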
The example is located at yt/python/examples/files_typed.
Grep
The typed API enables you to work with fairly arbitrary data using a single data class and a single operation class. As an example, let's consider the task of filtering a table by matching a given row field against a regular expression.
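A simplified sketch of such a grep (the row class, pattern, and paths are invented):

```python
import re
import typing
import yt.wrapper as yt

@yt.yt_dataclass
class LineRow:
    line: str

class GrepMapper(yt.TypedJob):
    def __init__(self, pattern: str):
        self._regexp = re.compile(pattern)

    def __call__(self, row: LineRow) -> typing.Iterable[LineRow]:
        if self._regexp.search(row.line):
            yield row

client = yt.YtClient(proxy="cluster_name")  # placeholder
client.run_map(GrepMapper(r"error|fail"), "//tmp/log", "//tmp/log_filtered")
```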
The example is located at yt/python/examples/grep_typed.
Advanced level
Batch queries
You can execute "light" queries (creating or deleting a table, checking its existence, and others) in groups. This is worthwhile when you need to perform a large number of queries of the same kind: batching them can significantly reduce the execution time.
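A minimal sketch of a batch of light queries (the paths are placeholders):

```python
import yt.wrapper as yt

client = yt.YtClient(proxy="cluster_name")  # placeholder
batch_client = client.create_batch_client()

# Requests are accumulated and only sent when the batch is committed.
exists_rsp = batch_client.exists("//tmp/some_table")
create_rsp = batch_client.create("table", "//tmp/another_table")
batch_client.commit_batch()

if exists_rsp.is_ok():
    print("exists:", exists_rsp.get_result())
if not create_rsp.is_ok():
    print("create failed:", create_rsp.get_error())
```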
The example is located at yt/python/examples/batch_client.
Using the RPC
Using the RPC backend from the CLI:

```bash
$ yt list / --proxy cluster_name --config '{backend=rpc}'
cooked_logs
home
...
```
A similar example (full code at yt/python/examples/simple_rpc) can be built in Python (with the ya-make build). Pay attention to the additional PEERDIR(yt/python/client_with_rpc) in ya.make.
You can work with dynamic tables using the RPC.
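A minimal sketch, assuming a placeholder cluster name and dynamic table path:

```python
import yt.wrapper as yt

# The same client API, but requests go through the RPC proxy.
client = yt.YtClient(proxy="cluster_name", config={"backend": "rpc"})

# Dynamic table operations work as usual.
client.insert_rows("//tmp/dyn_table", [{"key": 1, "value": "a"}])
for row in client.select_rows("key, value FROM [//tmp/dyn_table] LIMIT 10"):
    print(row)
```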
The example is located at yt/python/examples/dynamic_tables_rpc.
Specifying row types using prepare_operation
In addition to using type hints, you can specify Python table row types by defining the prepare_operation method, in which all the types are set using special methods. If the job class defines prepare_operation, the library uses the types specified inside it and makes no attempt to derive row types from type hints.
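A minimal sketch of a job that declares its row types via prepare_operation (the data classes are invented):

```python
import yt.wrapper as yt

@yt.yt_dataclass
class InRow:
    login: str

@yt.yt_dataclass
class OutRow:
    email: str

class EmailMapper(yt.TypedJob):
    # When prepare_operation is defined, row types are taken from here
    # and type hints on __call__ are not inspected.
    def prepare_operation(self, context, preparer):
        preparer.input(0, type=InRow).output(0, type=OutRow)

    def __call__(self, row):
        yield OutRow(email=row.login + "@ytsaurus.tech")
```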
The example is located at yt/python/examples/prepare_operation_typed.
Different examples
Data classes
This example demonstrates the features and peculiarities of working with data classes in more detail.
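A small sketch of typical data class features, such as optional and composite fields (the classes are invented):

```python
import typing
import yt.wrapper as yt

@yt.yt_dataclass
class Address:
    city: str
    street: typing.Optional[str] = None  # optional fields map to optional columns

@yt.yt_dataclass
class UserRow:
    name: str
    uid: int
    addresses: typing.List[Address]  # composite types: lists, nested structs, etc.

row = UserRow(name="Arkady", uid=1, addresses=[Address(city="Amsterdam")])
```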
The example is located at yt/python/examples/dataclass_typed.
Context and managing writes to output tables
To select which output table a row is written to, use the OutputRow wrapper class, namely the table_index argument of its constructor. Given any iterator over data classes (returned from read_table_structured() or passed to the __call__() method of a job), you can call its .with_context() method to obtain an iterator over (row, context) pairs. The context object has the .get_table_index(), .get_row_index(), and .get_range_index() methods.
When writing a job class whose __call__() method receives a single row rather than an iterator (for example, a mapper), you can add the @yt.wrapper.with_context decorator to the class. In this case, the __call__() method must take a third argument, context (see the documentation).
In reducers, in aggregator mappers, and when reading tables, that is, wherever there is an iterator over rows, use the iterator's with_context method.
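A compact sketch putting these pieces together (the row class and paths are invented):

```python
import yt.wrapper as yt

@yt.yt_dataclass
class Row:
    key: str

@yt.wrapper.with_context
class SwitchMapper(yt.TypedJob):
    def prepare_operation(self, context, preparer):
        (preparer
            .input(0, type=Row).input(1, type=Row)
            .output(0, type=Row).output(1, type=Row))

    def __call__(self, row, context):
        # Send rows from input table i to output table i.
        yield yt.OutputRow(row, table_index=context.get_table_index())

client = yt.YtClient(proxy="cluster_name")  # placeholder
client.run_map(SwitchMapper(),
               ["//tmp/in_a", "//tmp/in_b"],
               ["//tmp/out_a", "//tmp/out_b"])

# The same context is available when reading a table on the client.
for row, ctx in client.read_table_structured("//tmp/out_a", Row).with_context():
    print(ctx.get_row_index(), row)
```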
The example is located at yt/python/examples/table_switches_typed.
Spec builders
Use spec builders to describe operation specifications and avoid errors.
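A minimal sketch of a spec builder for a Map operation running a trivial command (the paths are placeholders):

```python
import yt.wrapper as yt
from yt.wrapper.spec_builders import MapSpecBuilder

client = yt.YtClient(proxy="cluster_name")  # placeholder

spec_builder = (
    MapSpecBuilder()
    .input_table_paths("//tmp/input")
    .output_table_paths("//tmp/output")
    .begin_mapper()
        .command("cat")
        .format("json")
    .end_mapper()
)
client.run_operation(spec_builder)
```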
The example is located at yt/python/examples/spec_builder_typed.
Using gevent
The documentation is available in a separate section.
The example is located at yt/python/examples/gevent.