SPYT in Jupyter
Setup
Before you can use Spark in Jupyter, you need to create a cluster. Currently, working with Spark from Jupyter notebooks is possible only with an inner standalone cluster.
If a cluster already exists, find out its proxy and discovery_path values so that you can use it.
Configuring Jupyter
- Get network access from your Jupyter machine to the SPYT cluster, ports 27000-27200.
- Get network access from the SPYT cluster to the Jupyter machine, ports 27000-27200.
- Install a deb package containing Java:
    sudo apt-get update
    sudo apt-get install openjdk-11-jdk
- Install the pip package:
    pip install ytsaurus-spyt
- Place your YTsaurus token in ~/.yt/token:
    mkdir ~/.yt
    cat <<EOT > ~/.yt/token
    $YOUR_YT_TOKEN
    EOT
- Place a file called ~/spyt.yaml with the Spark cluster location in your home directory:
    cat <<EOT > ~/spyt.yaml
    yt_proxy: "cluster_name"
    discovery_path: "$YOUR_DISCOVERY_DIR"
    EOT
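Once these steps are done, it can help to confirm the result from a notebook cell before starting a session. The following is a minimal sketch in plain Python, not part of SPYT: the function names check_spyt_setup, read_flat_yaml and port_is_reachable are hypothetical helpers introduced here, and the parser assumes spyt.yaml contains only flat `key: "value"` lines like those shown above.

```python
import socket
from pathlib import Path


def read_flat_yaml(path: Path) -> dict:
    """Parse flat `key: "value"` lines. This is not a general YAML parser."""
    result = {}
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        # Split on the first colon only, so values such as "host:80" survive.
        key, _, value = line.partition(":")
        result[key.strip()] = value.strip().strip("\"'")
    return result


def check_spyt_setup(home: Path) -> list:
    """Return a list of problems found; an empty list means the files look fine."""
    problems = []
    token_file = home / ".yt" / "token"
    if not token_file.is_file() or not token_file.read_text().strip():
        problems.append("missing or empty ~/.yt/token")
    config_file = home / "spyt.yaml"
    if not config_file.is_file():
        problems.append("missing ~/spyt.yaml")
    else:
        config = read_flat_yaml(config_file)
        for key in ("yt_proxy", "discovery_path"):
            if not config.get(key):
                problems.append(f"~/spyt.yaml has no value for {key}")
    return problems


def port_is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Try a TCP connection; True if something accepts it within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for problem in check_spyt_setup(Path.home()):
        print("setup problem:", problem)
```

The port check takes the host as an argument because the addresses of the cluster machines depend on your installation; a refused connection may also simply mean that nothing is listening on that port yet, so treat a False result as a hint rather than proof of a firewall problem.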
Updating the client in Jupyter
Update ytsaurus-spyt in Jupyter:
pip install --upgrade ytsaurus-spyt
If the second component of the ytsaurus-spyt client version is greater than that of your cluster version, new functionality may not work. In that case, update your cluster per the instructions.
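The version rule above can be checked with a few lines of Python. This is only a sketch: the helper names are hypothetical, it assumes dot-separated numeric versions, and the cluster version has to be supplied by you, since this page does not say how to query it. The installed client version is read with the standard library's importlib.metadata.

```python
from importlib import metadata


def second_component(version: str) -> int:
    """Return the second dot-separated component of a version string."""
    parts = version.split(".")
    if len(parts) < 2 or not parts[1].isdigit():
        raise ValueError(f"unexpected version format: {version!r}")
    return int(parts[1])


def client_is_ahead_of_cluster(client_version: str, cluster_version: str) -> bool:
    """True if the client's second version component exceeds the cluster's."""
    return second_component(client_version) > second_component(cluster_version)


def installed_client_version():
    """Version of the installed ytsaurus-spyt package, or None if not installed."""
    try:
        return metadata.version("ytsaurus-spyt")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    client = installed_client_version()
    cluster = "1.0.0"  # placeholder: replace with your cluster's actual version
    if client is not None and client_is_ahead_of_cluster(client, cluster):
        print("Client is newer than the cluster; new functionality may not work.")
```

Only the second component is compared, matching the rule stated above; differences in the other components are ignored by this sketch.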