
Flink Connector for YTsaurus Goes Open Source
We have open-sourced the connector for integrating Apache Flink with YTsaurus. It enables working with YTsaurus sorted dynamic tables directly from Flink. The connector can be used both for streaming data directly to YTsaurus and for enriching existing streams through lookup operations in dynamic tables. The connector source code is available on GitHub.
Key features
-
Writing to sorted dynamic tables. This is the primary use case for the connector. It enables implementing pipelines that transfer data from queues to sorted dynamic tables with delays of just a few seconds. Since insertion into such tables occurs by key, this simultaneously solves the problems of data deduplication and maintaining actual state.
-
Partitioning support. Data is often not stored in a single table but is partitioned. The connector enables this using various partitioning intervals from hour to year. The connector handles situations where multiple partitions are involved during writing without issues. It can write data to all partitions at the same time and create new ones as needed.
-
Proactive table resharding. When writing data streams of several megabytes per second, situations may arise where the table becomes locked for resharding. For certain tasks, such delays can be critical. To avoid unexpected resharding, the connector can perform it immediately when creating the table. In this case, the table will be split into the required number of tablets, allowing it to easily accept the entire data volume.
-
Synchronous and asynchronous lookup operations. Sorted dynamic tables can serve for storing data that may be useful for enriching event streams. Lookup operations by key can be performed on such tables. The connector supports both synchronous mode, which processes hundreds of requests per second, and asynchronous mode, capable of processing thousands of requests per second.
-
Full and Partial caching strategies for lookup operations. Apache Flink allows integrating various strategies for caching lookup operations into connectors. These strategies can significantly reduce the load on dynamic tables and increase pipeline throughput. The Apache Flink connector for YTsaurus supports both strategies.
-
Multi-cluster YTsaurus lookup. In case of connection loss with one YTsaurus installation, the connector can redirect lookup requests to another available installation. This improves job reliability and reduces its dependency on the availability of a single cluster.
To learn more about what the connector can do, you can follow the Quick Start Guide.