YTsaurus + Excel
Two integration microservices for YTsaurus and MS Excel are now open-source. These microservices allow users to:
- Upload an xlsx-spreadsheet to a static YTsaurus table through the YTsaurus UI
- Download a small static YTsaurus table as an xlsx-document
The microservices are deployed alongside the YTsaurus Kubernetes (k8s) cluster using a separate Helm chart. The services' API and admin documentation are available in the repository, and user documentation can be found here.
Features
Limitations
- Only small amounts of data are handled (up to 100 MB)
- YTsaurus tables must have a strict schema
- Only a single Excel spreadsheet is processed at a time
Type Systems
YTsaurus has a rich type system, including integers of various bit widths, floats and doubles, byte and UTF8 strings, dates, timestamps, and intervals. In contrast, Excel has only three data types: Number (64-bit double), Text, and Logical (bool).
YTsaurus tables can be larger, their rows longer, and their numbers more precise than Excel can hold. Therefore, when a YTsaurus table is downloaded, some data may be unrepresentable in Excel. The service maps YTsaurus types onto Excel types and lets users choose how to handle large integer types: convert to string, lose precision, or fail with an error.
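The three choices for large integers can be sketched as follows. This is a minimal illustration, not the service's actual code: the names `BigIntPolicy` and `to_excel_value` are hypothetical, and the threshold of 2**53 is an assumption based on the largest range in which a 64-bit double represents every integer exactly.

```python
from enum import Enum

# Every integer with magnitude up to 2**53 is exactly representable as a
# 64-bit double. Using this as the cutoff is an assumption of the sketch.
MAX_EXACT_INT = 2**53


class BigIntPolicy(Enum):
    """Hypothetical names for the three choices described in the text."""
    TO_STRING = "to_string"
    LOSE_PRECISION = "lose_precision"
    FAIL = "fail"


def to_excel_value(value: int, policy: BigIntPolicy):
    """Map an integer to a value that an Excel cell can hold."""
    if abs(value) <= MAX_EXACT_INT:
        return float(value)  # fits in a Number without rounding
    if policy is BigIntPolicy.TO_STRING:
        return str(value)  # exact, but stored as Text
    if policy is BigIntPolicy.LOSE_PRECISION:
        return float(value)  # rounded to the nearest double
    raise ValueError(f"{value} cannot be stored exactly as an Excel Number")


big = 2**63 - 1  # largest signed 64-bit integer
print(to_excel_value(big, BigIntPolicy.TO_STRING))
print(to_excel_value(big, BigIntPolicy.LOSE_PRECISION))
```

Converting to string keeps every digit but makes the column Text, so Excel formulas no longer treat it as a number. Losing precision keeps the column numeric at the cost of rounding.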
Dates are another example: Excel does not have a built-in date type. A date is represented as a Number (double) formatted as a date. Dates in Excel are similar to Unix timestamps but with two key differences:
- The epoch in Unix starts at 1970-01-01, while in Excel, it starts at 1900-01-01.
- Unix measures time in seconds, whereas Excel measures it in days.
For instance, the time 2011-08-09 22:39:07.776 is stored in Excel as 40764.94384: day number 40764 in Excel's day count, which starts at 1900-01-01, plus approximately 94% of a day. Because the date and the time of day share the significant digits of a single double, the Number type cannot reliably store very precise times, such as microseconds.
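The arithmetic can be reproduced with a short sketch. The function names are illustrative and not part of the service. The sketch assumes Excel's default 1900 date system, in which 1900-01-01 is day 1 and a nonexistent 1900-02-29 is also counted, so for dates from 1900-03-01 onward the effective day zero is 1899-12-30. `math.ulp` requires Python 3.9 or later.

```python
import math
from datetime import datetime, timedelta

# Effective day zero of Excel's 1900 date system for dates from 1900-03-01 on
# (day 1 is 1900-01-01, and Excel also counts a fictitious 1900-02-29).
EXCEL_DAY_ZERO = datetime(1899, 12, 30)
SECONDS_PER_DAY = 86_400


def datetime_to_excel_serial(dt: datetime) -> float:
    """Convert a naive datetime to an Excel serial number (days as a double)."""
    delta = dt - EXCEL_DAY_ZERO
    seconds = delta.seconds + delta.microseconds / 1e6
    return delta.days + seconds / SECONDS_PER_DAY


def excel_serial_to_datetime(serial: float) -> datetime:
    """Convert an Excel serial number back to a naive datetime."""
    return EXCEL_DAY_ZERO + timedelta(days=serial)


def resolution_microseconds(serial: float) -> float:
    """Gap between adjacent doubles near `serial`, expressed in microseconds."""
    return math.ulp(serial) * SECONDS_PER_DAY * 1e6


stamp = datetime(2011, 8, 9, 22, 39, 7, 776000)
serial = datetime_to_excel_serial(stamp)
print(round(serial, 5))  # 40764.94384
print(resolution_microseconds(serial))  # roughly 0.6 microseconds
```

For present-day dates, adjacent doubles are a bit more than half a microsecond apart, so a microsecond timestamp does not always survive a round trip through an Excel Number.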
Table Sizes
- The maximum number of rows in an Excel spreadsheet is about 1 million (1,048,576), while in YTsaurus, it is significantly more
- The maximum number of columns in Excel is 16k (16,384), whereas in YTsaurus, it is 32k
- The maximum row length in Excel is 32k, whereas in YTsaurus, it can be up to 128 × 10^6
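A pre-download check against these limits might look like the sketch below. The function name and its parameters are hypothetical. The constants are Excel's documented worksheet limits, and two points are assumptions of the sketch rather than statements from the text: that the 32k figure corresponds to Excel's limit of 32,767 characters per cell, and that one header row is written above the data.

```python
EXCEL_MAX_ROWS = 1_048_576      # 2**20 rows per worksheet
EXCEL_MAX_COLUMNS = 16_384      # 2**14 columns per worksheet
EXCEL_MAX_CELL_CHARS = 32_767   # longest text a single cell can hold


def excel_limit_violations(
    row_count: int,
    column_count: int,
    longest_value_chars: int,
    header_rows: int = 1,  # assumption: column names occupy the first row
) -> list[str]:
    """Return a human-readable list of Excel limits the table would exceed."""
    problems = []
    if row_count + header_rows > EXCEL_MAX_ROWS:
        problems.append(f"too many rows: {row_count} (+{header_rows} header)")
    if column_count > EXCEL_MAX_COLUMNS:
        problems.append(f"too many columns: {column_count}")
    if longest_value_chars > EXCEL_MAX_CELL_CHARS:
        problems.append(f"value too long: {longest_value_chars} characters")
    return problems


print(excel_limit_violations(row_count=2_000_000, column_count=12, longest_value_chars=80))
```

Returning every violation at once, rather than failing on the first, lets a caller show the user everything that has to change before the table fits.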
Excel and Streaming
The .xlsx file format is essentially a zip archive of XML files:
    $ unzip fruits.xlsx
    Archive:  fruits.xlsx
      inflating: _rels/.rels
      inflating: xl/workbook.xml
      inflating: xl/styles.xml
      inflating: xl/worksheets/sheet1.xml
      inflating: xl/_rels/workbook.xml.rels
      inflating: xl/sharedStrings.xml
      inflating: docProps/core.xml
      inflating: docProps/app.xml
      inflating: [Content_Types].xml
Metadata, cell data, styles, and more are stored separately within the archive. This makes streaming .xlsx files impossible and requires materializing the entire file in memory before sending it as a response. This limitation is a key reason for the file size constraint in the services.
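The in-memory materialization can be illustrated with Python's standard `zipfile` module. This is a sketch only: the part contents are placeholders rather than a valid workbook, and `build_archive_in_memory` is a hypothetical helper, not the services' code.

```python
import io
import zipfile


def build_archive_in_memory(parts: dict[str, str]) -> bytes:
    """Assemble all parts into one zip archive held entirely in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, xml in parts.items():
            archive.writestr(name, xml)
    # The zip central directory is written when the archive is closed, so
    # the payload is complete only after the `with` block has finished.
    return buffer.getvalue()


# Placeholder parts that mirror the layout shown above (not real workbook XML).
parts = {
    "[Content_Types].xml": "<Types/>",
    "xl/workbook.xml": "<workbook/>",
    "xl/sharedStrings.xml": "<sst/>",
    "xl/worksheets/sheet1.xml": "<worksheet/>",
}
payload = build_archive_in_memory(parts)

# The finished payload can be reopened like any other zip file.
with zipfile.ZipFile(io.BytesIO(payload)) as archive:
    print(archive.namelist())
```

In this approach, every part has to be written before the first byte of the response can be sent, so memory use grows with the size of the spreadsheet.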
Feedback
Bring your questions to our community chat, and submit issues and feature requests on GitHub.