From ed34ea3989a9ff26e2366d2ca29e237a5479f1a0 Mon Sep 17 00:00:00 2001 From: tfwang Date: Fri, 24 Apr 2026 13:53:21 +0800 Subject: [PATCH 1/2] docs: improve llama stack pgvector docs --- docs/en/llama_stack/install.mdx | 40 +++++++ docs/en/llama_stack/overview/features.mdx | 1 + docs/en/llama_stack/quickstart.mdx | 11 ++ .../llama-stack/llama-stack_quickstart.ipynb | 105 +++++++++++++++++- 4 files changed, 154 insertions(+), 3 deletions(-) diff --git a/docs/en/llama_stack/install.mdx b/docs/en/llama_stack/install.mdx index 60a709d7..6aa5f579 100644 --- a/docs/en/llama_stack/install.mdx +++ b/docs/en/llama_stack/install.mdx @@ -34,6 +34,8 @@ After the operator is installed, deploy Llama Stack Server by creating a `LlamaS > - **Inference URL**: `VLLM_URL` must point at a **vLLM OpenAI-compatible** HTTP base URL (for example an in-cluster vLLM or KServe InferenceService) that serves the target model. > - **Secret (optional)**: `VLLM_API_TOKEN` is only needed when the vLLM endpoint requires authentication. If vLLM has no auth, do not set it. When required, create a Secret in the same namespace and reference it from `containerSpec.env` (see the commented example in the manifest below). > - **Storage Class**: Ensure the `default` Storage Class exists in the cluster; otherwise the PVC cannot be bound and the resource will not become ready. +> - **PGVector (optional)**: To use `vector_stores` with `provider_id="pgvector"`, provide `PGVECTOR_*` environment variables to the server pod. ACP-provided PostgreSQL can be used directly because it already includes the `pgvector` extension. +> - **Embedding model download**: Llama Stack includes a default embedding model configuration for vector-store usage, but the model artifacts are downloaded from Hugging Face on first use. If direct access is restricted, configure `HF_ENDPOINT` for a mirror/proxy, or pre-download the model files into the server PVC before running the first vector-store request. ```yaml apiVersion: llamastack.io/v1alpha1 @@ -66,6 +68,31 @@ spec: # key: token # name: vllm-api-token + # Optional: enable PGVector-backed vector stores. + # Omit the entire block below if you do not need vector store APIs. + # ACP-provided PostgreSQL already includes the pgvector extension. + # - name: ENABLE_PGVECTOR + # value: "true" + # - name: PGVECTOR_HOST + # value: "" + # - name: PGVECTOR_PORT + # value: "5432" + # - name: PGVECTOR_DB + # value: "" + # - name: PGVECTOR_USER + # value: "" + # - name: PGVECTOR_PASSWORD + # valueFrom: + # secretKeyRef: + # name: + # key: password + + # Optional: configure Hugging Face access for the default embedding model. + # Set HF_ENDPOINT when a mirror/proxy is required, or pre-populate the model + # cache under the server PVC before the first vector-store request. + # - name: HF_ENDPOINT + # value: "" + distribution: name: starter # Distribution name (options: starter, postgres-demo, meta-reference-gpu) storage: @@ -93,3 +120,16 @@ args: ``` Choose `--tool-call-parser` (and any related flags) according to the **served model** and the vLLM documentation for that model family. + +## Enable PGVector Vector Store + +When `ENABLE_PGVECTOR=true` is set on the server, Llama Stack can create vector stores by using `provider_id="pgvector"` from the client API. + +Recommended preparation: + +1. Prepare an ACP PostgreSQL instance and record its service name, database name, username, and password. +2. 
Expose the database connection to the `LlamaStackDistribution` with `PGVECTOR_HOST`, `PGVECTOR_PORT`, `PGVECTOR_DB`, `PGVECTOR_USER`, and `PGVECTOR_PASSWORD`. +3. Use the default embedding model provided by Llama Stack, and make sure its model files can be fetched on first use. +4. If the cluster cannot reach Hugging Face directly, either set `HF_ENDPOINT` to a reachable mirror/proxy or pre-download the embedding model files into the server PVC. + +After the distribution is ready, you can validate the setup with the PGVector section in the [Quickstart](./quickstart) notebook. diff --git a/docs/en/llama_stack/overview/features.mdx b/docs/en/llama_stack/overview/features.mdx index 63526883..59de4841 100644 --- a/docs/en/llama_stack/overview/features.mdx +++ b/docs/en/llama_stack/overview/features.mdx @@ -26,4 +26,5 @@ weight: 20 ## Integration - **Python Client**: `llama-stack-client` for Python 3.12+ with full agent and model APIs +- **Vector Store APIs**: Create and query vector stores from the client, including PGVector-backed stores when the server is configured with `ENABLE_PGVECTOR=true` - **REST-Friendly**: Server exposes APIs for inference, agents, and tool runtime; can be wrapped in FastAPI or other web frameworks for production use diff --git a/docs/en/llama_stack/quickstart.mdx b/docs/en/llama_stack/quickstart.mdx index 3ecd4c41..3cf19b32 100644 --- a/docs/en/llama_stack/quickstart.mdx +++ b/docs/en/llama_stack/quickstart.mdx @@ -25,9 +25,20 @@ The notebook demonstrates: - **Two tool options:** client-side tools (`@client_tool`) and MCP tools (FastMCP + `toolgroups.register`) - **Shared agent flow:** connect to Llama Stack Server, select a model, create an `Agent` with `tools=AGENT_TOOLS`, then run sessions and streaming turns +- **Optional PGVector flow:** upload a file, create a `pgvector`-backed vector store, and run a hybrid search query - Streaming responses and event logging - Optional FastAPI deployment of the `agent` +## PGVector Usage + +The downloadable notebook includes an optional PGVector section. To run it, start the server with `ENABLE_PGVECTOR=true` and valid `PGVECTOR_*` connection settings, then execute the PGVector cells in the notebook. ACP-provided PostgreSQL can be used directly because it already includes the `pgvector` extension. 
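+
+For orientation, the core calls are condensed into the sketch below. The `base_url` value and the model-metadata keys (`model_type`, `embedding_dimension`) are assumptions to verify against your deployment; the notebook cells contain the full version with metadata fallbacks and error handling:
+
+```python
+from llama_stack_client import LlamaStackClient
+
+# Placeholder URL: point this at your Llama Stack Server.
+client = LlamaStackClient(base_url="http://localhost:8321")
+
+# Pick an embedding model advertised by the server (assumes the server
+# reports model_type and embedding_dimension in the model metadata).
+embedding = next(
+    (
+        m
+        for m in client.models.list()
+        if (getattr(m, "metadata", None) or {}).get("model_type") == "embedding"
+    ),
+    None,
+)
+assert embedding is not None, "no embedding model advertised by the server"
+
+# Upload a document, index it into a PGVector-backed store, then search it.
+file_object = client.files.create(
+    file=("demo.txt", b"pgvector quickstart demo", "text/plain"),
+    purpose="assistants",
+)
+store = client.vector_stores.create(
+    name="pgvector-quickstart",
+    file_ids=[file_object.id],
+    extra_body={
+        "provider_id": "pgvector",
+        "embedding_model": embedding.id,
+        "embedding_dimension": int(embedding.metadata["embedding_dimension"]),
+    },
+)
+result = client.vector_stores.search(
+    vector_store_id=store.id,
+    query="demo",
+    max_num_results=3,
+    extra_body={"search_mode": "hybrid"},
+)
+print(result)
+```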
+ +The notebook example covers: + +- Uploading a file through `client.files.create(...)` +- Creating a vector store with `provider_id="pgvector"` +- Running a hybrid search with `client.vector_stores.search(...)` and `search_mode="hybrid"` + ## FAQ ### How to prepare Python 3.12 in Notebook \{#how-to-prepare-python-312-in-notebook} diff --git a/docs/public/llama-stack/llama-stack_quickstart.ipynb b/docs/public/llama-stack/llama-stack_quickstart.ipynb index e7ef89aa..954c7ed4 100644 --- a/docs/public/llama-stack/llama-stack_quickstart.ipynb +++ b/docs/public/llama-stack/llama-stack_quickstart.ipynb @@ -7,11 +7,12 @@ "source": [ "# Llama Stack Quick Start Demo\n", "\n", - "This notebook demonstrates how to use Llama Stack to run an agent with tools in two ways:\n", + "This notebook demonstrates how to use Llama Stack for agent workflows and PGVector-backed vector store access:\n", "\n", "- **Option A (section 2):** define a **client-side** weather tool with `@client_tool`; the cell sets **`AGENT_TOOLS`**.\n", "- **Option B (section 2):** run an **MCP** weather tool with **FastMCP** and register it with the server; the register cell sets **`AGENT_TOOLS`**.\n", "- **Section 3** uses the **same** connect / model selection / `Agent` construction / run flow for both options. The only difference is the value of **`AGENT_TOOLS`** passed into `Agent`.\n", + "- **Section 4** shows how to upload a file and query a **PGVector**-backed vector store.\n", "\n", "### Inference backend (`LlamaStackDistribution`)\n", "\n", @@ -456,12 +457,110 @@ "print('\\n')" ] }, + { + "cell_type": "markdown", + "id": "pgvector-title-md", + "metadata": {}, + "source": [ + "## 4. PGVector Vector Store Example\n", + "\n", + "This section shows how to upload a file and query a PGVector-backed vector store.\n", + "\n", + "Prerequisites:\n", + "- The server distribution is configured with `ENABLE_PGVECTOR=true` and valid `PGVECTOR_*` connection settings.\n", + "- ACP-provided PostgreSQL can be used directly; it already includes the `pgvector` extension.\n", + "- Llama Stack includes a default embedding model configuration, but the model files are downloaded from Hugging Face on first use.\n", + "- If the cluster cannot reach Hugging Face directly, configure `HF_ENDPOINT` or pre-download the embedding model files into the server PVC.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "pgvector-demo-code", + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "import time\n", + "\n", + "\n", + "def get_model_metadata(model):\n", + " metadata = getattr(model, \"metadata\", None)\n", + " if isinstance(metadata, dict):\n", + " return metadata\n", + "\n", + " custom_metadata = getattr(model, \"custom_metadata\", None)\n", + " if isinstance(custom_metadata, dict):\n", + " return custom_metadata\n", + "\n", + " return {}\n", + "\n", + "\n", + "models = client.models.list()\n", + "embedding_model = next(\n", + " (\n", + " model\n", + " for model in models\n", + " if get_model_metadata(model).get(\"model_type\") == \"embedding\"\n", + " ),\n", + " None,\n", + ")\n", + "if embedding_model is None:\n", + " raise RuntimeError(\"No embedding model found from client.models.list()\")\n", + "\n", + "embedding_metadata = get_model_metadata(embedding_model)\n", + "embedding_dimension = int(\n", + " embedding_metadata.get(\"embedding_dimension\")\n", + " or embedding_metadata.get(\"dimensions\")\n", + " or getattr(embedding_model, \"embedding_dimension\", None)\n", + " or getattr(embedding_model, 
\"dimensions\", None)\n", + " or 768\n", + ")\n", + "\n", + "document = \"\"\"ACP PostgreSQL with pgvector can be used as the vector backend.\n", + "Unique token: pgvector-demo-token\n", + "This document is used to verify vector store indexing and retrieval.\n", + "\"\"\"\n", + "\n", + "file_object = client.files.create(\n", + " file=(\"pgvector-demo.txt\", document.encode(\"utf-8\"), \"text/plain\"),\n", + " purpose=\"assistants\",\n", + ")\n", + "\n", + "vector_store = client.vector_stores.create(\n", + " name=f\"pgvector-demo-{int(time.time())}\",\n", + " file_ids=[file_object.id],\n", + " extra_body={\n", + " \"provider_id\": \"pgvector\",\n", + " \"embedding_model\": embedding_model.id,\n", + " \"embedding_dimension\": embedding_dimension,\n", + " },\n", + ")\n", + "\n", + "search_result = client.vector_stores.search(\n", + " vector_store_id=vector_store.id,\n", + " query=\"pgvector-demo-token\",\n", + " max_num_results=3,\n", + " extra_body={\"search_mode\": \"hybrid\"},\n", + ")\n", + "\n", + "if hasattr(vector_store, \"model_dump\"):\n", + " vector_store = vector_store.model_dump(mode=\"json\")\n", + "if hasattr(search_result, \"model_dump\"):\n", + " search_result = search_result.model_dump(mode=\"json\")\n", + "\n", + "print(\"Vector store:\")\n", + "print(json.dumps(vector_store, ensure_ascii=False, indent=2))\n", + "print(\"\\nSearch result:\")\n", + "print(json.dumps(search_result, ensure_ascii=False, indent=2))\n" + ] + }, { "cell_type": "markdown", "id": "6f8d31d0", "metadata": {}, "source": [ - "## 4. FastAPI Service Example\n", + "## 5. FastAPI Service Example\n", "\n", "Expose the `llama-stack-client`-based `agent` as a FastAPI web service, so it can be called via HTTP.\n" ] @@ -668,7 +767,7 @@ "id": "a3ebed1f", "metadata": {}, "source": [ - "## 5. More Resources\n", + "## 6. More Resources\n", "\n", "For more resources on developing AI Agents with Llama Stack, see:\n", "\n", From c0d9a1b27ce40827a73a70ec1185806a9217a10a Mon Sep 17 00:00:00 2001 From: tfwang Date: Fri, 24 Apr 2026 15:42:29 +0800 Subject: [PATCH 2/2] update --- docs/en/llama_stack/install.mdx | 57 +++++++++++++++++-- docs/en/llama_stack/quickstart.mdx | 2 +- .../llama-stack/llama-stack_quickstart.ipynb | 14 +++-- 3 files changed, 63 insertions(+), 10 deletions(-) diff --git a/docs/en/llama_stack/install.mdx b/docs/en/llama_stack/install.mdx index 6aa5f579..3ca1ab1d 100644 --- a/docs/en/llama_stack/install.mdx +++ b/docs/en/llama_stack/install.mdx @@ -35,7 +35,7 @@ After the operator is installed, deploy Llama Stack Server by creating a `LlamaS > - **Secret (optional)**: `VLLM_API_TOKEN` is only needed when the vLLM endpoint requires authentication. If vLLM has no auth, do not set it. When required, create a Secret in the same namespace and reference it from `containerSpec.env` (see the commented example in the manifest below). > - **Storage Class**: Ensure the `default` Storage Class exists in the cluster; otherwise the PVC cannot be bound and the resource will not become ready. > - **PGVector (optional)**: To use `vector_stores` with `provider_id="pgvector"`, provide `PGVECTOR_*` environment variables to the server pod. ACP-provided PostgreSQL can be used directly because it already includes the `pgvector` extension. -> - **Embedding model download**: Llama Stack includes a default embedding model configuration for vector-store usage, but the model artifacts are downloaded from Hugging Face on first use. 
If direct access is restricted, configure `HF_ENDPOINT` for a mirror/proxy, or pre-download the model files into the server PVC before running the first vector-store request. +> - **Embedding model download**: Llama Stack includes a default embedding model configuration for vector-store usage, but the model artifacts are downloaded from Hugging Face on first use. If a mirror or proxy is needed, configure `HF_ENDPOINT`. For fully offline environments, pre-download the model files into the server PVC before running the first vector-store request. ```yaml apiVersion: llamastack.io/v1alpha1 @@ -87,11 +87,21 @@ spec: # name: # key: password - # Optional: configure Hugging Face access for the default embedding model. - # Set HF_ENDPOINT when a mirror/proxy is required, or pre-populate the model - # cache under the server PVC before the first vector-store request. + # Optional: configure a Hugging Face mirror or proxy for the default + # embedding model download path. # - name: HF_ENDPOINT # value: "" + # + # Optional: configure fully offline model loading. Pre-populate the + # Hugging Face cache under /home/lls/.lls/huggingface/hub, then set: + # - name: HF_HUB_CACHE + # value: "/home/lls/.lls/huggingface/hub" + # - name: HF_HUB_OFFLINE + # value: "1" + # - name: TRANSFORMERS_OFFLINE + # value: "1" + # - name: HF_HUB_DISABLE_XET + # value: "1" distribution: name: starter # Distribution name (options: starter, postgres-demo, meta-reference-gpu) @@ -130,6 +140,43 @@ Recommended preparation: 1. Prepare an ACP PostgreSQL instance and record its service name, database name, username, and password. 2. Expose the database connection to the `LlamaStackDistribution` with `PGVECTOR_HOST`, `PGVECTOR_PORT`, `PGVECTOR_DB`, `PGVECTOR_USER`, and `PGVECTOR_PASSWORD`. 3. Use the default embedding model provided by Llama Stack, and make sure its model files can be fetched on first use. -4. If the cluster cannot reach Hugging Face directly, either set `HF_ENDPOINT` to a reachable mirror/proxy or pre-download the embedding model files into the server PVC. +4. If the cluster uses a Hugging Face mirror or proxy, set `HF_ENDPOINT` accordingly. +5. If the cluster is fully offline, pre-download the embedding model files into the server PVC and enable offline cache-related environment variables. After the distribution is ready, you can validate the setup with the PGVector section in the [Quickstart](./quickstart) notebook. + +## Hugging Face Access For Embedding Models + +Llama Stack uses a default embedding model for vector-store operations. On first use, the server downloads the model files from Hugging Face into its local cache. + +Recommended cache path: + +- `/home/lls/.lls/huggingface/hub` + +Common deployment modes: + +1. Mirror or proxy access: + + ```yaml + - name: HF_ENDPOINT + value: "" + - name: HF_HUB_CACHE + value: "/home/lls/.lls/huggingface/hub" + ``` + +2. Fully offline access: + + Pre-download the required model files into the PVC-backed cache directory `/home/lls/.lls/huggingface/hub`, then set: + + ```yaml + - name: HF_HUB_CACHE + value: "/home/lls/.lls/huggingface/hub" + - name: HF_HUB_OFFLINE + value: "1" + - name: TRANSFORMERS_OFFLINE + value: "1" + - name: HF_HUB_DISABLE_XET + value: "1" + ``` + +If the cache path is pre-populated correctly, the server can create PGVector-backed vector stores without downloading model artifacts at runtime. 
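+
+For the fully offline mode, the cache can be populated on any machine that does have Hugging Face access and then copied into the server PVC. Below is a minimal sketch using the `huggingface_hub` Python API; the model ID is an assumption, so first confirm which embedding model your server actually reports (for example via `client.models.list()`):
+
+```python
+from huggingface_hub import snapshot_download
+
+# Downloads into the standard hub cache layout (models--<org>--<name>/...),
+# which is the directory structure HF_HUB_CACHE expects.
+snapshot_download(
+    repo_id="sentence-transformers/all-MiniLM-L6-v2",  # assumed embedding model; verify first
+    cache_dir="./hub",  # copy this directory to /home/lls/.lls/huggingface/hub on the PVC
+)
+```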
diff --git a/docs/en/llama_stack/quickstart.mdx b/docs/en/llama_stack/quickstart.mdx index 3cf19b32..7c424178 100644 --- a/docs/en/llama_stack/quickstart.mdx +++ b/docs/en/llama_stack/quickstart.mdx @@ -11,7 +11,7 @@ This section provides a quickstart example for creating an AI Agent with Llama S - Python 3.12 or higher (if not satisfied, refer to [FAQ: How to prepare Python 3.12 in Notebook](#how-to-prepare-python-312-in-notebook)) - Llama Stack Server installed and running via Operator (see [Install Llama Stack](./install)), with **`VLLM_URL` pointing at a vLLM-served model endpoint** (see install notes) - Access to a Notebook environment (e.g., Jupyter Notebook, JupyterLab) -- Python environment with `llama-stack-client`, `fastmcp` (for the MCP section), and other notebook dependencies installed +- Python environment with `llama-stack-client==0.6.0`, `fastmcp` (for the MCP section), and other notebook dependencies installed ## Quickstart Example diff --git a/docs/public/llama-stack/llama-stack_quickstart.ipynb b/docs/public/llama-stack/llama-stack_quickstart.ipynb index 954c7ed4..71ff1022 100644 --- a/docs/public/llama-stack/llama-stack_quickstart.ipynb +++ b/docs/public/llama-stack/llama-stack_quickstart.ipynb @@ -49,7 +49,7 @@ "# Use current kernel's Python so PATH does not point to another env\n", "# If download is slow, add: -i https://pypi.tuna.tsinghua.edu.cn/simple\n", "import sys\n", - "!{sys.executable} -m pip install \"llama-stack-client>=0.4\" \"requests\" \"fastapi\" \"uvicorn\" \"fastmcp\"" + "!{sys.executable} -m pip install \"llama-stack-client==0.6.0\" \"requests\" \"fastapi\" \"uvicorn\" \"fastmcp\"" ] }, { @@ -470,7 +470,8 @@ "- The server distribution is configured with `ENABLE_PGVECTOR=true` and valid `PGVECTOR_*` connection settings.\n", "- ACP-provided PostgreSQL can be used directly; it already includes the `pgvector` extension.\n", "- Llama Stack includes a default embedding model configuration, but the model files are downloaded from Hugging Face on first use.\n", - "- If the cluster cannot reach Hugging Face directly, configure `HF_ENDPOINT` or pre-download the embedding model files into the server PVC.\n" + "- If the cluster uses a Hugging Face mirror or proxy, configure `HF_ENDPOINT`.\n", + "- If the cluster is fully offline, pre-download the model files into `/home/lls/.lls/huggingface/hub` and set offline cache-related environment variables.\n" ] }, { @@ -509,13 +510,18 @@ " raise RuntimeError(\"No embedding model found from client.models.list()\")\n", "\n", "embedding_metadata = get_model_metadata(embedding_model)\n", - "embedding_dimension = int(\n", + "resolved_dimension = (\n", " embedding_metadata.get(\"embedding_dimension\")\n", " or embedding_metadata.get(\"dimensions\")\n", " or getattr(embedding_model, \"embedding_dimension\", None)\n", " or getattr(embedding_model, \"dimensions\", None)\n", - " or 768\n", ")\n", + "if resolved_dimension is None:\n", + " raise RuntimeError(\n", + " f\"Could not determine embedding dimension for model {embedding_model.id!r}. \"\n", + " \"Set it explicitly to match the embedding model used by the server.\"\n", + " )\n", + "embedding_dimension = int(resolved_dimension)\n", "\n", "document = \"\"\"ACP PostgreSQL with pgvector can be used as the vector backend.\n", "Unique token: pgvector-demo-token\n",