SPYT in Jupyter
Setup
Before you can use Spark in Jupyter, you need to create a cluster. Currently, Spark can be used from Jupyter notebooks only with an inner standalone cluster.
If a cluster already exists, you need to find out its proxy and discovery_path values to be able to use it.
Configuring Jupyter
- Get network access from your Jupyter machine to the SPYT cluster, ports 27000-27200.
- Get network access from the SPYT cluster to the Jupyter machine, ports 27000-27200.
- Install a deb package containing Java:
    sudo apt-get update
    sudo apt-get install openjdk-11-jdk
- Install the pip package:
    pip install ytsaurus-spyt
- Place your YTsaurus token in ~/.yt/token:
    mkdir -p ~/.yt
    cat <<EOT > ~/.yt/token
    $YOUR_YT_TOKEN
    EOT
- Place a file called ~/spyt.yaml with the Spark cluster location in your home directory:
    cat <<EOT > ~/spyt.yaml
    yt_proxy: "cluster_name"
    discovery_path: "$YOUR_DISCOVERY_DIR"
    EOT
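Once the steps above are done, a quick check of the two client-side files can catch a missed step before you open a notebook. The following is a minimal sketch, not part of SPYT: the function name check_spyt_files is hypothetical, and it only tests that ~/.yt/token and ~/spyt.yaml exist and are non-empty. It does not validate the token or the cluster location.

```shell
# Hypothetical helper (not part of SPYT): report whether the files created
# in the steps above exist and are non-empty.
# Optional argument: the home directory to inspect (defaults to $HOME).
check_spyt_files() {
    home_dir="${1:-$HOME}"
    missing=0
    for f in "$home_dir/.yt/token" "$home_dir/spyt.yaml"; do
        if [ -s "$f" ]; then
            echo "ok: $f"
        else
            echo "missing or empty: $f"
            missing=1
        fi
    done
    # Exit status 0 when both files are present, 1 otherwise.
    return "$missing"
}
```

Calling check_spyt_files with no arguments inspects your own home directory. The exit status is 0 only when both files are present.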
Updating the client in Jupyter
Update ytsaurus-spyt in Jupyter:
    pip install --upgrade ytsaurus-spyt
If the second component of the ytsaurus-spyt client version is greater than the second component of your cluster version, new functionality may not work. In that case, update your cluster per the instructions.
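The comparison of second version components can be sketched in shell. This is an illustration only: the function names second_component and client_not_newer are hypothetical, the version strings in the comments are made up, and the sketch assumes dot-separated versions with a numeric second component. How to read the cluster version is not covered here, so it is passed in as an argument.

```shell
# Hypothetical helpers (not part of SPYT).

# Print the second dot-separated component of a version string,
# e.g. "76" for the made-up version "1.76.1".
second_component() {
    printf '%s\n' "$1" | cut -d. -f2
}

# Succeed (exit status 0) when the client's second version component
# does not exceed the cluster's.
# Arguments: client version, cluster version.
client_not_newer() {
    [ "$(second_component "$1")" -le "$(second_component "$2")" ]
}
```

The installed client version can be read from the "Version:" line printed by pip show ytsaurus-spyt and passed as the first argument.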