SPYT in Jupyter

Setup

Before you can use Spark in Jupyter, you need to create a cluster. Currently, Jupyter notebooks can work with Spark only through an inner standalone cluster.

If a cluster already exists, find out its proxy and discovery_path values so that you can connect to it.

Configuring Jupyter

Get network access from your Jupyter machine to the SPYT cluster, ports 27000-27200.
Get network access from the SPYT cluster to the Jupyter machine, ports 27000-27200.
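
The two access requirements above can be spot-checked from either side with a short script. This is a minimal sketch using only the Python standard library; the host name in the usage comment is hypothetical. A TCP connect succeeds only when something is listening on the port, so a reported port may be either blocked by the network or simply have no listener yet.

```python
import socket


def unreachable_ports(host, ports, timeout=1.0):
    """Return the ports on `host` where a TCP connect fails or times out."""
    failed = []
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout, or DNS failure
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            failed.append(port)
    return failed


# Hypothetical host name; replace it with a node of your SPYT cluster.
# print(unreachable_ports("spyt-node.example.com", range(27000, 27201)))
```

Run it on the Jupyter machine against a cluster node, then on a cluster node against the Jupyter machine, to cover both directions.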

Install a deb package containing Java:

sudo apt-get update
sudo apt-get install openjdk-11-jdk

Install the pip package:
```
pip install ytsaurus-spyt
```

Place your YTsaurus token in ~/.yt/token:

mkdir -p ~/.yt
cat <<EOT > ~/.yt/token
$YOUR_YT_TOKEN
EOT
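
The same step can be done from Python, for example in a notebook cell. This is a sketch and not part of SPYT: the helper name `write_token` is hypothetical, and restricting the file to mode 0600 is an assumption added here as a precaution, not a requirement stated in this guide.

```python
import os
from pathlib import Path


def write_token(token, path=Path.home() / ".yt" / "token"):
    """Write `token` to `path`, creating the parent directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # A newly created file gets mode 0600 (assumption: keep the token private)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write(token.strip() + "\n")
    # Also tighten permissions if the file already existed with a wider mode
    os.chmod(path, 0o600)
    return path


# Usage, assuming the token is in the YOUR_YT_TOKEN environment variable:
# write_token(os.environ["YOUR_YT_TOKEN"])
```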

Create a file named spyt.yaml in your home directory that contains the location of the Spark cluster:

cat <<EOT > ~/spyt.yaml
yt_proxy: "cluster_name"
discovery_path: "$YOUR_DISCOVERY_DIR"
EOT
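
A quick sanity check of the resulting file can catch an empty or missing value before you start a session. The sketch below is not a YAML parser: it only understands the flat `key: "value"` layout shown above. The function name `read_spyt_config` is hypothetical, and treating both keys as mandatory is an assumption based on the example file.

```python
from pathlib import Path

# Keys taken from the example file above; treating both as required is an assumption.
REQUIRED_KEYS = ("yt_proxy", "discovery_path")


def read_spyt_config(path=Path.home() / "spyt.yaml"):
    """Parse flat `key: "value"` lines and check that the required keys are set."""
    config = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so values may contain colons
        key, sep, value = line.partition(":")
        if sep:
            config[key.strip()] = value.strip().strip("\"'")
    missing = [k for k in REQUIRED_KEYS if not config.get(k)]
    if missing:
        raise ValueError(f"{path}: missing or empty keys: {', '.join(missing)}")
    return config


# print(read_spyt_config())
```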

Updating the client in Jupyter

Update ytsaurus-spyt in Jupyter:

pip install --upgrade ytsaurus-spyt

If the second component of the ytsaurus-spyt client version (for example, the 2 in 1.2.3) is greater than the second component of your cluster version, new functionality may not work. In that case, update your cluster per the instructions.
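
The version rule above can be expressed as a small comparison. This is a sketch with hypothetical function names and made-up version numbers. It assumes both versions are dot-separated with a numeric second component, and it compares only that component. How you obtain the cluster version is not covered here.

```python
def second_component(version):
    """Return the second dot-separated component of a version string as an int."""
    return int(version.split(".")[1])


def client_is_ahead(client_version, cluster_version):
    """True when the client's second version component exceeds the cluster's."""
    return second_component(client_version) > second_component(cluster_version)


# The installed client version can be read with the standard library:
# from importlib.metadata import version
# client = version("ytsaurus-spyt")
# Illustrative version numbers only:
# client_is_ahead("1.5.0", "1.4.2")  # True
```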

SPYT in Python

SPYT in Scala