YTsaurus + Excel

Learn more about using Excel spreadsheets with YTsaurus

Two microservices that integrate YTsaurus with MS Excel are now open source. They allow users to:

  • Upload an .xlsx spreadsheet to a static YTsaurus table through the YTsaurus UI

  • Download a small static YTsaurus table as an .xlsx document

The microservices are deployed alongside the YTsaurus k8s cluster using a separate helm chart. The API services and admin documentation are available in the repository, and user documentation can be found here.

Features

Limitations

  • Only small amounts of data are supported (up to 100 MB)

  • YTsaurus tables must have a strict schema

  • Only a single Excel spreadsheet can be processed at a time

Type Systems

YTsaurus has a rich type system, including integers of various bit widths, floats and doubles, byte and UTF8 strings, dates, timestamps, and intervals. In contrast, Excel has only three data types: Number (64-bit double), Text, and Logical (bool).

YTsaurus tables can be larger, their rows longer, and their numbers more precise than Excel allows. Therefore, when downloading YTsaurus tables, some data may be unrepresentable in Excel. The service uses a sophisticated type mapping and lets users choose how to handle large integer types: convert them to strings, lose precision, or fail with an error.
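The three options for large integers can be sketched as follows. This is a minimal illustration, not the service's actual code: the function name and the policy names ("string", "lossy", "error") are hypothetical. The only fact it relies on is that a 64-bit double represents every integer exactly up to 2^53 in magnitude.

```python
# Every integer with magnitude up to 2**53 is exactly representable
# in a 64-bit double, which is the only numeric type Excel offers.
MAX_EXACT_INT = 2**53


def to_excel_value(value: int, policy: str = "error"):
    """Map an integer to an Excel cell value.

    The policy names are hypothetical and only mirror the three
    choices described in the text.
    """
    if abs(value) <= MAX_EXACT_INT:
        return float(value)  # fits into a Number without loss
    if policy == "string":
        return str(value)  # keep every digit, store as Text
    if policy == "lossy":
        return float(value)  # nearest double, low digits may change
    if policy == "error":
        raise ValueError(f"{value} cannot be stored exactly as an Excel Number")
    raise ValueError(f"unknown policy: {policy!r}")


print(to_excel_value(2**63 - 1, "string"))
```

With the "string" policy the largest signed 64-bit integer keeps all of its digits, at the cost of no longer being a number in the spreadsheet.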

For example, Excel does not have a built-in date type. A date is represented as a Number (double) formatted as a date. Dates in Excel are similar to Unix timestamps but with two key differences:

  • The Unix epoch is 1970-01-01, while Excel's default date system counts from 1900-01-01, which is stored as day 1.

  • Unix measures time in seconds, whereas Excel measures it in days.

For example, the time 2011-08-09 22:39:07.776 is stored in Excel as 40764.94384, which is day 40764 in Excel's 1900-based count plus approximately 94% of a day. A 64-bit double carries only about 15 to 16 significant decimal digits, and five of them are already taken by the day count, so the Number type cannot reliably store very precise times, such as microseconds.
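The arithmetic behind this example can be reproduced with a short sketch. It assumes Excel's default 1900 date system, in which serial 1 is 1900-01-01 and a non-existent 1900-02-29 is counted, so for dates from March 1900 onward the effective day zero is 1899-12-30. The function names are illustrative, and `math.ulp` requires Python 3.9 or newer.

```python
import math
from datetime import datetime, timedelta

# Effective day zero of the 1900 date system for dates from 1900-03-01 onward.
EXCEL_DAY_ZERO = datetime(1899, 12, 30)
SECONDS_PER_DAY = 86_400


def to_excel_serial(moment: datetime) -> float:
    """Convert a naive datetime to an Excel serial number (days as a double)."""
    delta = moment - EXCEL_DAY_ZERO
    seconds = delta.seconds + delta.microseconds / 1_000_000
    return delta.days + seconds / SECONDS_PER_DAY


def from_excel_serial(serial: float) -> datetime:
    """Convert an Excel serial number back to a naive datetime."""
    return EXCEL_DAY_ZERO + timedelta(days=serial)


def resolution_in_microseconds(serial: float) -> float:
    """Gap between this double and the next one, expressed in microseconds."""
    return math.ulp(serial) * SECONDS_PER_DAY * 1_000_000


serial = to_excel_serial(datetime(2011, 8, 9, 22, 39, 7, 776000))
print(round(serial, 5))  # 40764.94384
print(resolution_in_microseconds(serial))
```

For serial numbers of this size, adjacent doubles are a bit more than half a microsecond apart, so individual microseconds cannot all be represented exactly.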

Table Sizes

  • The maximum number of rows in an Excel spreadsheet is about 1 million (1,048,576), while in YTsaurus it is significantly higher

  • The maximum number of columns in Excel is 16k (16,384), whereas in YTsaurus it is 32k

  • The maximum row length in Excel is 32k, whereas in YTsaurus, it can be up to 128 × 10⁶
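A pre-download check against the row and column limits might look like the sketch below. The constants are Excel's worksheet limits (1,048,576 rows and 16,384 columns). The function name and the `has_header` parameter are hypothetical, and whether the services write a header row is an assumption made only for illustration.

```python
# Worksheet limits of the .xlsx format as implemented by Excel.
EXCEL_MAX_ROWS = 1_048_576
EXCEL_MAX_COLUMNS = 16_384


def excel_size_problems(row_count: int, column_count: int,
                        has_header: bool = True) -> list[str]:
    """Return human-readable reasons why a table would not fit on one sheet.

    has_header is a hypothetical option: a header row, if written,
    consumes one of the available rows.
    """
    problems = []
    needed_rows = row_count + (1 if has_header else 0)
    if needed_rows > EXCEL_MAX_ROWS:
        problems.append(f"{needed_rows} rows exceed the limit of {EXCEL_MAX_ROWS}")
    if column_count > EXCEL_MAX_COLUMNS:
        problems.append(f"{column_count} columns exceed the limit of {EXCEL_MAX_COLUMNS}")
    return problems


print(excel_size_problems(10, 3))  # []
```

An empty list means the table fits by row and column count. Cell and row length would need a separate check.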

Excel and Streaming

The .xlsx file format is essentially a zip archive of XML files:

unzip fruits.xlsx 
Archive:  fruits.xlsx
  inflating: _rels/.rels             
  inflating: xl/workbook.xml         
  inflating: xl/styles.xml           
  inflating: xl/worksheets/sheet1.xml  
  inflating: xl/_rels/workbook.xml.rels  
  inflating: xl/sharedStrings.xml    
  inflating: docProps/core.xml       
  inflating: docProps/app.xml        
  inflating: [Content_Types].xml 

Metadata, cell data, styles, and more are stored separately within the archive. This makes streaming .xlsx files impossible and requires materializing the entire file in memory before sending it as a response. This limitation is a key reason for the file size constraint in the services.
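The same structure can be observed from Python with the standard `zipfile` module. The sketch below is not the services' code and does not produce a valid workbook: the part contents are placeholders, and the helper names are made up for illustration. It only shows that the parts are separate archive entries and that the archive is assembled in an in-memory buffer, with the zip index written when the archive is closed, before any bytes are returned.

```python
import io
import zipfile


def build_archive(parts: dict[str, bytes]) -> bytes:
    """Assemble a zip archive fully in memory and return its bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, payload in parts.items():
            archive.writestr(name, payload)
    # The central directory (the zip index) is written on close,
    # so the bytes are complete only at this point.
    return buffer.getvalue()


def list_parts(data: bytes) -> list[str]:
    """List entry names of an .xlsx-like archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.namelist()


# Placeholder parts that mimic the layout of the listing above.
demo = build_archive({
    "xl/workbook.xml": b"<workbook/>",
    "xl/sharedStrings.xml": b"<sst/>",
    "xl/worksheets/sheet1.xml": b"<worksheet/>",
})
print(list_parts(demo))
```

Calling `list_parts` on the bytes of a real .xlsx file should return the same kind of entry names that `unzip` prints.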

Feedback

Join our community chat with questions and submit issues and feature requests on GitHub.
