Notebook
DataPulse supports running Jupyter notebooks with PySpark integration on Kubernetes.
The notebook is deployed as a Kubernetes Deployment and exposed as a Kubernetes Service named notebook.
The notebook image is based on wenyixu101/all-spark-notebook.
To create a notebook, access the following service:
Within the notebook, run the following code to automatically create a Spark session:

```python
start()
```
The Spark UI link is also printed once the Spark session is created.
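As a hedged illustration, once start() has run, the session can be used like any PySpark session. The variable name spark is an assumption here; the actual name is whatever startup.py binds:

```python
# Assumes start() has bound a SparkSession to `spark` in the notebook
# namespace; the variable name is an assumption, not confirmed by the source.
df = spark.range(100).selectExpr("sum(id) AS total")
df.show()  # runs a small job that also shows up in the Spark UI
```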

The notebook files are persisted in GCS.
The event logs of notebook PySpark applications are also persisted in GCS.
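For reference, Spark event logging to GCS is typically enabled through session configuration. A minimal sketch, assuming a hypothetical bucket; DataPulse's actual property values may differ:

```python
from pyspark.sql import SparkSession

# A minimal sketch of event logging to GCS; gs://datapulse-spark-events/
# is a hypothetical bucket, not the project's actual location.
spark = (
    SparkSession.builder
    .appName("notebook")
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "gs://datapulse-spark-events/")
    .getOrCreate()
)
```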
The automatic creation of the Spark session is handled by the startup script startup.py, which does the following (a sketch follows the list):
- Sync the notebook files from GCS to the local directory
- Create a Spark session
- Determine the Spark UI link
- Print the Spark session information and the Spark UI link
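A minimal sketch of what a startup script along these lines might look like; the bucket name, local directory, and use of gsutil are assumptions, not the project's actual implementation:

```python
import subprocess
from pyspark.sql import SparkSession

NOTEBOOK_BUCKET = "gs://datapulse-notebooks"  # hypothetical bucket
NOTEBOOK_DIR = "/home/jovyan/work"            # default all-spark-notebook home

def start():
    # 1. Sync the notebook files from GCS to the local directory.
    subprocess.run(
        ["gsutil", "-m", "rsync", "-r", NOTEBOOK_BUCKET, NOTEBOOK_DIR],
        check=True,
    )
    # 2. Create (or reuse) a Spark session.
    spark = SparkSession.builder.appName("notebook").getOrCreate()
    # 3. Find the Spark UI link from the running driver.
    ui_url = spark.sparkContext.uiWebUrl
    # 4. Output the Spark session information and the Spark UI link.
    print(f"Spark version: {spark.version}")
    print(f"Spark UI: {ui_url}")
    return spark
```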
The post-save hook is implemented in gcs_save_hook.py. It automatically syncs the notebook files to GCS after each save.
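A hedged sketch of such a hook, assuming Jupyter's standard post_save_hook interface and the same hypothetical bucket as above; the actual gcs_save_hook.py may sync differently:

```python
import subprocess

NOTEBOOK_BUCKET = "gs://datapulse-notebooks"  # hypothetical bucket

def gcs_post_save_hook(model, os_path, contents_manager, **kwargs):
    """Copy a saved notebook to GCS; called by Jupyter after each save."""
    # Skip non-notebook files (e.g. plain text or checkpoints).
    if model["type"] != "notebook":
        return
    subprocess.run(["gsutil", "cp", os_path, NOTEBOOK_BUCKET], check=True)

# Registered in the Jupyter config, for example:
# c.FileContentsManager.post_save_hook = gcs_post_save_hook
```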