
Notebook

xuwenyihust edited this page Jan 8, 2024 · 8 revisions

Summary

DataPulse supports running Jupyter notebooks with PySpark integration on Kubernetes.

QuickStart

Create Notebook

The notebook is deployed as a Kubernetes Deployment and exposed through a Kubernetes Service named notebook.

The notebook runs the wenyixu101/all-spark-notebook image.

To create a notebook, access the following service:

(Screenshot, 2024-01-08: creating a notebook from the notebook service)
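One way to reach the notebook Service from a workstation is kubectl port-forwarding. This is a usage sketch, not DataPulse's documented access path; the namespace and port 8888 (Jupyter's usual default) are assumptions, only the Service name notebook comes from this page:

```shell
# Forward the notebook Service to localhost.
# 8888 is an assumption; use whatever port the Service actually exposes.
kubectl port-forward svc/notebook 8888:8888

# Then open http://localhost:8888 in a browser to create a notebook.
```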

Create Spark Session

Within the notebook, run the following code to automatically create a Spark session:

start()


Access Spark UI

The Spark UI link is also printed once the Spark session has been created.

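In PySpark, the driver reports its UI address via spark.sparkContext.uiWebUrl, but on Kubernetes that address typically points at the driver pod and is only reachable inside the cluster. A minimal sketch of rewriting it to an externally reachable host (the rewriting scheme and the example hostname are assumptions, not DataPulse's actual implementation):

```python
from urllib.parse import urlparse

def rewrite_ui_url(ui_web_url, external_host):
    """Replace the driver-pod host in a Spark UI URL with an externally
    reachable host, keeping the scheme and UI port.

    ui_web_url is what spark.sparkContext.uiWebUrl returns,
    e.g. "http://spark-driver:4040"; external_host is hypothetical.
    """
    parsed = urlparse(ui_web_url)
    port = parsed.port or 4040  # 4040 is Spark's default UI port
    return f"{parsed.scheme}://{external_host}:{port}"

# Example (no cluster needed):
print(rewrite_ui_url("http://spark-driver:4040", "notebook.example.com"))
# → http://notebook.example.com:4040
```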

Notebook Persistence

The notebook files are persisted in GCS (Google Cloud Storage).

Event Log Persistence

The event logs of the notebook's PySpark applications are persisted in GCS.
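Spark writes event logs when spark.eventLog.enabled is set and spark.eventLog.dir points at the target location (both are standard Spark configuration keys). A hedged sketch of the configuration involved; the bucket path is a placeholder, not DataPulse's actual bucket:

```python
def event_log_conf(gcs_dir):
    """Return the Spark configuration entries that persist event logs
    to the given directory. The keys are standard Spark settings."""
    return {
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": gcs_dir,
    }

# These entries would be passed to SparkSession.builder.config(...),
# e.g. event_log_conf("gs://my-bucket/event-logs")  # bucket is hypothetical
```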

Startup Script

The automatic creation of the Spark session is handled by the startup script startup.py.

It does the following:

  • Sync the notebook files from GCS to the local directory
  • Create a Spark session
  • Determine the Spark UI link
  • Print the Spark session information and the Spark UI link
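The steps above could be sketched as follows. This is an illustrative outline, not startup.py's actual code: the bucket path, helper names, and the gsutil-based sync are assumptions.

```python
import subprocess

def sync_notebooks(gcs_dir, local_dir):
    # Pull notebook files from GCS to the local working directory.
    # gsutil rsync is one way to do this; the real script may differ.
    subprocess.run(
        ["gsutil", "-m", "rsync", "-r", gcs_dir, local_dir], check=True
    )

def create_session(app_name):
    # Imported inside the function so the sketch stays importable
    # even where PySpark is not installed.
    from pyspark.sql import SparkSession
    return SparkSession.builder.appName(app_name).getOrCreate()

def session_summary(app_name, ui_url):
    # Pure helper: format the information printed at the end.
    return f"Spark session '{app_name}' created. Spark UI: {ui_url}"

def start(app_name="notebook", gcs_dir="gs://my-bucket/notebooks",
          local_dir="."):
    # gs://my-bucket/... is a placeholder, not DataPulse's actual bucket.
    sync_notebooks(gcs_dir, local_dir)
    spark = create_session(app_name)
    print(session_summary(app_name, spark.sparkContext.uiWebUrl))
    return spark
```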

Post-Save Hook

The post-save hook is implemented in gcs_save_hook.py.

It automatically syncs the notebook files to GCS after each save.
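Jupyter's FileContentsManager supports a post_save_hook that is called with (model, os_path, contents_manager) after every save. A hedged sketch of such a hook; the factory shape and injected uploader are assumptions for illustration, and gcs_save_hook.py's real code may differ:

```python
def make_post_save_hook(bucket, upload):
    """Build a Jupyter post-save hook that copies saved notebooks to GCS.

    `upload(local_path, gcs_path)` is an injected callable (e.g. a gsutil
    or google-cloud-storage wrapper); `bucket` such as "gs://my-bucket"
    is a placeholder.
    """
    def post_save(model, os_path, contents_manager):
        # Only sync notebook files, not plain text files or directories.
        if model["type"] != "notebook":
            return
        upload(os_path, f"{bucket}/{os_path.lstrip('/')}")
    return post_save

# Wiring it up in a Jupyter config file (illustrative):
# c.FileContentsManager.post_save_hook = make_post_save_hook(
#     "gs://my-bucket", my_upload_function)
```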
