From ed34ea3989a9ff26e2366d2ca29e237a5479f1a0 Mon Sep 17 00:00:00 2001 From: tfwang Date: Fri, 24 Apr 2026 13:53:21 +0800 Subject: [PATCH 1/2] docs: improve llama stack pgvector docs --- docs/en/llama_stack/install.mdx | 40 +++++++ docs/en/llama_stack/overview/features.mdx | 1 + docs/en/llama_stack/quickstart.mdx | 11 ++ .../llama-stack/llama-stack_quickstart.ipynb | 105 +++++++++++++++++- 4 files changed, 154 insertions(+), 3 deletions(-) diff --git a/docs/en/llama_stack/install.mdx b/docs/en/llama_stack/install.mdx index 60a709d7..6aa5f579 100644 --- a/docs/en/llama_stack/install.mdx +++ b/docs/en/llama_stack/install.mdx @@ -34,6 +34,8 @@ After the operator is installed, deploy Llama Stack Server by creating a `LlamaS > - **Inference URL**: `VLLM_URL` must point at a **vLLM OpenAI-compatible** HTTP base URL (for example an in-cluster vLLM or KServe InferenceService) that serves the target model. > - **Secret (optional)**: `VLLM_API_TOKEN` is only needed when the vLLM endpoint requires authentication. If vLLM has no auth, do not set it. When required, create a Secret in the same namespace and reference it from `containerSpec.env` (see the commented example in the manifest below). > - **Storage Class**: Ensure the `default` Storage Class exists in the cluster; otherwise the PVC cannot be bound and the resource will not become ready. +> - **PGVector (optional)**: To use `vector_stores` with `provider_id="pgvector"`, provide `PGVECTOR_*` environment variables to the server pod. ACP-provided PostgreSQL can be used directly because it already includes the `pgvector` extension. +> - **Embedding model download**: Llama Stack includes a default embedding model configuration for vector-store usage, but the model artifacts are downloaded from Hugging Face on first use. If direct access is restricted, configure `HF_ENDPOINT` for a mirror/proxy, or pre-download the model files into the server PVC before running the first vector-store request. ```yaml apiVersion: llamastack.io/v1alpha1 @@ -66,6 +68,31 @@ spec: # key: token # name: vllm-api-token + # Optional: enable PGVector-backed vector stores. + # Omit the entire block below if you do not need vector store APIs. + # ACP-provided PostgreSQL already includes the pgvector extension. + # - name: ENABLE_PGVECTOR + # value: "true" + # - name: PGVECTOR_HOST + # value: "" + # - name: PGVECTOR_PORT + # value: "5432" + # - name: PGVECTOR_DB + # value: "" + # - name: PGVECTOR_USER + # value: "" + # - name: PGVECTOR_PASSWORD + # valueFrom: + # secretKeyRef: + # name: + # key: password + + # Optional: configure Hugging Face access for the default embedding model. + # Set HF_ENDPOINT when a mirror/proxy is required, or pre-populate the model + # cache under the server PVC before the first vector-store request. + # - name: HF_ENDPOINT + # value: "" + distribution: name: starter # Distribution name (options: starter, postgres-demo, meta-reference-gpu) storage: @@ -93,3 +120,16 @@ args: ``` Choose `--tool-call-parser` (and any related flags) according to the **served model** and the vLLM documentation for that model family. + +## Enable PGVector Vector Store + +When `ENABLE_PGVECTOR=true` is set on the server, Llama Stack can create vector stores by using `provider_id="pgvector"` from the client API. + +Recommended preparation: + +1. Prepare an ACP PostgreSQL instance and record its service name, database name, username, and password. +2. 
Expose the database connection to the `LlamaStackDistribution` with `PGVECTOR_HOST`, `PGVECTOR_PORT`, `PGVECTOR_DB`, `PGVECTOR_USER`, and `PGVECTOR_PASSWORD`. +3. Use the default embedding model provided by Llama Stack, and make sure its model files can be fetched on first use. +4. If the cluster cannot reach Hugging Face directly, either set `HF_ENDPOINT` to a reachable mirror/proxy or pre-download the embedding model files into the server PVC. + +After the distribution is ready, you can validate the setup with the PGVector section in the [Quickstart](./quickstart) notebook. diff --git a/docs/en/llama_stack/overview/features.mdx b/docs/en/llama_stack/overview/features.mdx index 63526883..59de4841 100644 --- a/docs/en/llama_stack/overview/features.mdx +++ b/docs/en/llama_stack/overview/features.mdx @@ -26,4 +26,5 @@ weight: 20 ## Integration - **Python Client**: `llama-stack-client` for Python 3.12+ with full agent and model APIs +- **Vector Store APIs**: Create and query vector stores from the client, including PGVector-backed stores when the server is configured with `ENABLE_PGVECTOR=true` - **REST-Friendly**: Server exposes APIs for inference, agents, and tool runtime; can be wrapped in FastAPI or other web frameworks for production use diff --git a/docs/en/llama_stack/quickstart.mdx b/docs/en/llama_stack/quickstart.mdx index 3ecd4c41..3cf19b32 100644 --- a/docs/en/llama_stack/quickstart.mdx +++ b/docs/en/llama_stack/quickstart.mdx @@ -25,9 +25,20 @@ The notebook demonstrates: - **Two tool options:** client-side tools (`@client_tool`) and MCP tools (FastMCP + `toolgroups.register`) - **Shared agent flow:** connect to Llama Stack Server, select a model, create an `Agent` with `tools=AGENT_TOOLS`, then run sessions and streaming turns +- **Optional PGVector flow:** upload a file, create a `pgvector`-backed vector store, and run a hybrid search query - Streaming responses and event logging - Optional FastAPI deployment of the `agent` +## PGVector Usage + +The downloadable notebook includes an optional PGVector section. To run it, start the server with `ENABLE_PGVECTOR=true` and valid `PGVECTOR_*` connection settings, then execute the PGVector cells in the notebook. ACP-provided PostgreSQL can be used directly because it already includes the `pgvector` extension. 
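+
+For orientation, the core calls are condensed into the sketch below. The `base_url` value and the model-metadata keys (`model_type`, `embedding_dimension`) are assumptions to verify against your deployment; the notebook cells contain the full version with metadata fallbacks and error handling:
+
+```python
+from llama_stack_client import LlamaStackClient
+
+# Placeholder URL: point this at your Llama Stack Server.
+client = LlamaStackClient(base_url="http://localhost:8321")
+
+# Pick an embedding model advertised by the server (assumes the server
+# reports model_type and embedding_dimension in the model metadata).
+embedding = next(
+    (
+        m
+        for m in client.models.list()
+        if (getattr(m, "metadata", None) or {}).get("model_type") == "embedding"
+    ),
+    None,
+)
+assert embedding is not None, "no embedding model advertised by the server"
+
+# Upload a document, index it into a PGVector-backed store, then search it.
+file_object = client.files.create(
+    file=("demo.txt", b"pgvector quickstart demo", "text/plain"),
+    purpose="assistants",
+)
+store = client.vector_stores.create(
+    name="pgvector-quickstart",
+    file_ids=[file_object.id],
+    extra_body={
+        "provider_id": "pgvector",
+        "embedding_model": embedding.id,
+        "embedding_dimension": int(embedding.metadata["embedding_dimension"]),
+    },
+)
+result = client.vector_stores.search(
+    vector_store_id=store.id,
+    query="demo",
+    max_num_results=3,
+    extra_body={"search_mode": "hybrid"},
+)
+print(result)
+```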
+ +The notebook example covers: + +- Uploading a file through `client.files.create(...)` +- Creating a vector store with `provider_id="pgvector"` +- Running a hybrid search with `client.vector_stores.search(...)` and `search_mode="hybrid"` + ## FAQ ### How to prepare Python 3.12 in Notebook \{#how-to-prepare-python-312-in-notebook} diff --git a/docs/public/llama-stack/llama-stack_quickstart.ipynb b/docs/public/llama-stack/llama-stack_quickstart.ipynb index e7ef89aa..954c7ed4 100644 --- a/docs/public/llama-stack/llama-stack_quickstart.ipynb +++ b/docs/public/llama-stack/llama-stack_quickstart.ipynb @@ -7,11 +7,12 @@ "source": [ "# Llama Stack Quick Start Demo\n", "\n", - "This notebook demonstrates how to use Llama Stack to run an agent with tools in two ways:\n", + "This notebook demonstrates how to use Llama Stack for agent workflows and PGVector-backed vector store access:\n", "\n", "- **Option A (section 2):** define a **client-side** weather tool with `@client_tool`; the cell sets **`AGENT_TOOLS`**.\n", "- **Option B (section 2):** run an **MCP** weather tool with **FastMCP** and register it with the server; the register cell sets **`AGENT_TOOLS`**.\n", "- **Section 3** uses the **same** connect / model selection / `Agent` construction / run flow for both options. The only difference is the value of **`AGENT_TOOLS`** passed into `Agent`.\n", + "- **Section 4** shows how to upload a file and query a **PGVector**-backed vector store.\n", "\n", "### Inference backend (`LlamaStackDistribution`)\n", "\n", @@ -456,12 +457,110 @@ "print('\\n')" ] }, + { + "cell_type": "markdown", + "id": "pgvector-title-md", + "metadata": {}, + "source": [ + "## 4. PGVector Vector Store Example\n", + "\n", + "This section shows how to upload a file and query a PGVector-backed vector store.\n", + "\n", + "Prerequisites:\n", + "- The server distribution is configured with `ENABLE_PGVECTOR=true` and valid `PGVECTOR_*` connection settings.\n", + "- ACP-provided PostgreSQL can be used directly; it already includes the `pgvector` extension.\n", + "- Llama Stack includes a default embedding model configuration, but the model files are downloaded from Hugging Face on first use.\n", + "- If the cluster cannot reach Hugging Face directly, configure `HF_ENDPOINT` or pre-download the embedding model files into the server PVC.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "pgvector-demo-code", + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "import time\n", + "\n", + "\n", + "def get_model_metadata(model):\n", + " metadata = getattr(model, \"metadata\", None)\n", + " if isinstance(metadata, dict):\n", + " return metadata\n", + "\n", + " custom_metadata = getattr(model, \"custom_metadata\", None)\n", + " if isinstance(custom_metadata, dict):\n", + " return custom_metadata\n", + "\n", + " return {}\n", + "\n", + "\n", + "models = client.models.list()\n", + "embedding_model = next(\n", + " (\n", + " model\n", + " for model in models\n", + " if get_model_metadata(model).get(\"model_type\") == \"embedding\"\n", + " ),\n", + " None,\n", + ")\n", + "if embedding_model is None:\n", + " raise RuntimeError(\"No embedding model found from client.models.list()\")\n", + "\n", + "embedding_metadata = get_model_metadata(embedding_model)\n", + "embedding_dimension = int(\n", + " embedding_metadata.get(\"embedding_dimension\")\n", + " or embedding_metadata.get(\"dimensions\")\n", + " or getattr(embedding_model, \"embedding_dimension\", None)\n", + " or getattr(embedding_model, 
\"dimensions\", None)\n", + " or 768\n", + ")\n", + "\n", + "document = \"\"\"ACP PostgreSQL with pgvector can be used as the vector backend.\n", + "Unique token: pgvector-demo-token\n", + "This document is used to verify vector store indexing and retrieval.\n", + "\"\"\"\n", + "\n", + "file_object = client.files.create(\n", + " file=(\"pgvector-demo.txt\", document.encode(\"utf-8\"), \"text/plain\"),\n", + " purpose=\"assistants\",\n", + ")\n", + "\n", + "vector_store = client.vector_stores.create(\n", + " name=f\"pgvector-demo-{int(time.time())}\",\n", + " file_ids=[file_object.id],\n", + " extra_body={\n", + " \"provider_id\": \"pgvector\",\n", + " \"embedding_model\": embedding_model.id,\n", + " \"embedding_dimension\": embedding_dimension,\n", + " },\n", + ")\n", + "\n", + "search_result = client.vector_stores.search(\n", + " vector_store_id=vector_store.id,\n", + " query=\"pgvector-demo-token\",\n", + " max_num_results=3,\n", + " extra_body={\"search_mode\": \"hybrid\"},\n", + ")\n", + "\n", + "if hasattr(vector_store, \"model_dump\"):\n", + " vector_store = vector_store.model_dump(mode=\"json\")\n", + "if hasattr(search_result, \"model_dump\"):\n", + " search_result = search_result.model_dump(mode=\"json\")\n", + "\n", + "print(\"Vector store:\")\n", + "print(json.dumps(vector_store, ensure_ascii=False, indent=2))\n", + "print(\"\\nSearch result:\")\n", + "print(json.dumps(search_result, ensure_ascii=False, indent=2))\n" + ] + }, { "cell_type": "markdown", "id": "6f8d31d0", "metadata": {}, "source": [ - "## 4. FastAPI Service Example\n", + "## 5. FastAPI Service Example\n", "\n", "Expose the `llama-stack-client`-based `agent` as a FastAPI web service, so it can be called via HTTP.\n" ] @@ -668,7 +767,7 @@ "id": "a3ebed1f", "metadata": {}, "source": [ - "## 5. More Resources\n", + "## 6. More Resources\n", "\n", "For more resources on developing AI Agents with Llama Stack, see:\n", "\n", From c0d9a1b27ce40827a73a70ec1185806a9217a10a Mon Sep 17 00:00:00 2001 From: tfwang Date: Fri, 24 Apr 2026 15:42:29 +0800 Subject: [PATCH 2/2] update --- docs/en/llama_stack/install.mdx | 57 +++++++++++++++++-- docs/en/llama_stack/quickstart.mdx | 2 +- .../llama-stack/llama-stack_quickstart.ipynb | 14 +++-- 3 files changed, 63 insertions(+), 10 deletions(-) diff --git a/docs/en/llama_stack/install.mdx b/docs/en/llama_stack/install.mdx index 6aa5f579..3ca1ab1d 100644 --- a/docs/en/llama_stack/install.mdx +++ b/docs/en/llama_stack/install.mdx @@ -35,7 +35,7 @@ After the operator is installed, deploy Llama Stack Server by creating a `LlamaS > - **Secret (optional)**: `VLLM_API_TOKEN` is only needed when the vLLM endpoint requires authentication. If vLLM has no auth, do not set it. When required, create a Secret in the same namespace and reference it from `containerSpec.env` (see the commented example in the manifest below). > - **Storage Class**: Ensure the `default` Storage Class exists in the cluster; otherwise the PVC cannot be bound and the resource will not become ready. > - **PGVector (optional)**: To use `vector_stores` with `provider_id="pgvector"`, provide `PGVECTOR_*` environment variables to the server pod. ACP-provided PostgreSQL can be used directly because it already includes the `pgvector` extension. -> - **Embedding model download**: Llama Stack includes a default embedding model configuration for vector-store usage, but the model artifacts are downloaded from Hugging Face on first use. 
If direct access is restricted, configure `HF_ENDPOINT` for a mirror/proxy, or pre-download the model files into the server PVC before running the first vector-store request. +> - **Embedding model download**: Llama Stack includes a default embedding model configuration for vector-store usage, but the model artifacts are downloaded from Hugging Face on first use. If a mirror or proxy is needed, configure `HF_ENDPOINT`. For fully offline environments, pre-download the model files into the server PVC before running the first vector-store request. ```yaml apiVersion: llamastack.io/v1alpha1 @@ -87,11 +87,21 @@ spec: # name: # key: password - # Optional: configure Hugging Face access for the default embedding model. - # Set HF_ENDPOINT when a mirror/proxy is required, or pre-populate the model - # cache under the server PVC before the first vector-store request. + # Optional: configure a Hugging Face mirror or proxy for the default + # embedding model download path. # - name: HF_ENDPOINT # value: "" + # + # Optional: configure fully offline model loading. Pre-populate the + # Hugging Face cache under /home/lls/.lls/huggingface/hub, then set: + # - name: HF_HUB_CACHE + # value: "/home/lls/.lls/huggingface/hub" + # - name: HF_HUB_OFFLINE + # value: "1" + # - name: TRANSFORMERS_OFFLINE + # value: "1" + # - name: HF_HUB_DISABLE_XET + # value: "1" distribution: name: starter # Distribution name (options: starter, postgres-demo, meta-reference-gpu) @@ -130,6 +140,43 @@ Recommended preparation: 1. Prepare an ACP PostgreSQL instance and record its service name, database name, username, and password. 2. Expose the database connection to the `LlamaStackDistribution` with `PGVECTOR_HOST`, `PGVECTOR_PORT`, `PGVECTOR_DB`, `PGVECTOR_USER`, and `PGVECTOR_PASSWORD`. 3. Use the default embedding model provided by Llama Stack, and make sure its model files can be fetched on first use. -4. If the cluster cannot reach Hugging Face directly, either set `HF_ENDPOINT` to a reachable mirror/proxy or pre-download the embedding model files into the server PVC. +4. If the cluster uses a Hugging Face mirror or proxy, set `HF_ENDPOINT` accordingly. +5. If the cluster is fully offline, pre-download the embedding model files into the server PVC and enable offline cache-related environment variables. After the distribution is ready, you can validate the setup with the PGVector section in the [Quickstart](./quickstart) notebook. + +## Hugging Face Access For Embedding Models + +Llama Stack uses a default embedding model for vector-store operations. On first use, the server downloads the model files from Hugging Face into its local cache. + +Recommended cache path: + +- `/home/lls/.lls/huggingface/hub` + +Common deployment modes: + +1. Mirror or proxy access: + + ```yaml + - name: HF_ENDPOINT + value: "" + - name: HF_HUB_CACHE + value: "/home/lls/.lls/huggingface/hub" + ``` + +2. Fully offline access: + + Pre-download the required model files into the PVC-backed cache directory `/home/lls/.lls/huggingface/hub`, then set: + + ```yaml + - name: HF_HUB_CACHE + value: "/home/lls/.lls/huggingface/hub" + - name: HF_HUB_OFFLINE + value: "1" + - name: TRANSFORMERS_OFFLINE + value: "1" + - name: HF_HUB_DISABLE_XET + value: "1" + ``` + +If the cache path is pre-populated correctly, the server can create PGVector-backed vector stores without downloading model artifacts at runtime. 
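+
+For the fully offline mode, the cache can be populated on any machine that does have Hugging Face access and then copied into the server PVC. Below is a minimal sketch using the `huggingface_hub` Python API; the model ID is an assumption, so first confirm which embedding model your server actually reports (for example via `client.models.list()`):
+
+```python
+from huggingface_hub import snapshot_download
+
+# Downloads into the standard hub cache layout (models--<org>--<name>/...),
+# which is the directory structure HF_HUB_CACHE expects.
+snapshot_download(
+    repo_id="sentence-transformers/all-MiniLM-L6-v2",  # assumed embedding model; verify first
+    cache_dir="./hub",  # copy this directory to /home/lls/.lls/huggingface/hub on the PVC
+)
+```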
diff --git a/docs/en/llama_stack/quickstart.mdx b/docs/en/llama_stack/quickstart.mdx index 3cf19b32..7c424178 100644 --- a/docs/en/llama_stack/quickstart.mdx +++ b/docs/en/llama_stack/quickstart.mdx @@ -11,7 +11,7 @@ This section provides a quickstart example for creating an AI Agent with Llama S - Python 3.12 or higher (if not satisfied, refer to [FAQ: How to prepare Python 3.12 in Notebook](#how-to-prepare-python-312-in-notebook)) - Llama Stack Server installed and running via Operator (see [Install Llama Stack](./install)), with **`VLLM_URL` pointing at a vLLM-served model endpoint** (see install notes) - Access to a Notebook environment (e.g., Jupyter Notebook, JupyterLab) -- Python environment with `llama-stack-client`, `fastmcp` (for the MCP section), and other notebook dependencies installed +- Python environment with `llama-stack-client==0.6.0`, `fastmcp` (for the MCP section), and other notebook dependencies installed ## Quickstart Example diff --git a/docs/public/llama-stack/llama-stack_quickstart.ipynb b/docs/public/llama-stack/llama-stack_quickstart.ipynb index 954c7ed4..71ff1022 100644 --- a/docs/public/llama-stack/llama-stack_quickstart.ipynb +++ b/docs/public/llama-stack/llama-stack_quickstart.ipynb @@ -49,7 +49,7 @@ "# Use current kernel's Python so PATH does not point to another env\n", "# If download is slow, add: -i https://pypi.tuna.tsinghua.edu.cn/simple\n", "import sys\n", - "!{sys.executable} -m pip install \"llama-stack-client>=0.4\" \"requests\" \"fastapi\" \"uvicorn\" \"fastmcp\"" + "!{sys.executable} -m pip install \"llama-stack-client==0.6.0\" \"requests\" \"fastapi\" \"uvicorn\" \"fastmcp\"" ] }, { @@ -470,7 +470,8 @@ "- The server distribution is configured with `ENABLE_PGVECTOR=true` and valid `PGVECTOR_*` connection settings.\n", "- ACP-provided PostgreSQL can be used directly; it already includes the `pgvector` extension.\n", "- Llama Stack includes a default embedding model configuration, but the model files are downloaded from Hugging Face on first use.\n", - "- If the cluster cannot reach Hugging Face directly, configure `HF_ENDPOINT` or pre-download the embedding model files into the server PVC.\n" + "- If the cluster uses a Hugging Face mirror or proxy, configure `HF_ENDPOINT`.\n", + "- If the cluster is fully offline, pre-download the model files into `/home/lls/.lls/huggingface/hub` and set offline cache-related environment variables.\n" ] }, { @@ -509,13 +510,18 @@ " raise RuntimeError(\"No embedding model found from client.models.list()\")\n", "\n", "embedding_metadata = get_model_metadata(embedding_model)\n", - "embedding_dimension = int(\n", + "resolved_dimension = (\n", " embedding_metadata.get(\"embedding_dimension\")\n", " or embedding_metadata.get(\"dimensions\")\n", " or getattr(embedding_model, \"embedding_dimension\", None)\n", " or getattr(embedding_model, \"dimensions\", None)\n", - " or 768\n", ")\n", + "if resolved_dimension is None:\n", + " raise RuntimeError(\n", + " f\"Could not determine embedding dimension for model {embedding_model.id!r}. \"\n", + " \"Set it explicitly to match the embedding model used by the server.\"\n", + " )\n", + "embedding_dimension = int(resolved_dimension)\n", "\n", "document = \"\"\"ACP PostgreSQL with pgvector can be used as the vector backend.\n", "Unique token: pgvector-demo-token\n",