87 changes: 87 additions & 0 deletions docs/en/llama_stack/install.mdx
@@ -34,6 +34,8 @@ After the operator is installed, deploy Llama Stack Server by creating a `LlamaS
> - **Inference URL**: `VLLM_URL` must point at a **vLLM OpenAI-compatible** HTTP base URL (for example an in-cluster vLLM or KServe InferenceService) that serves the target model.
> - **Secret (optional)**: `VLLM_API_TOKEN` is only needed when the vLLM endpoint requires authentication. If vLLM has no auth, do not set it. When required, create a Secret in the same namespace and reference it from `containerSpec.env` (see the commented example in the manifest below).
> - **Storage Class**: Ensure the `default` Storage Class exists in the cluster; otherwise the PVC cannot be bound and the resource will not become ready.
> - **PGVector (optional)**: To use `vector_stores` with `provider_id="pgvector"`, provide `PGVECTOR_*` environment variables to the server pod. ACP-provided PostgreSQL can be used directly because it already includes the `pgvector` extension.
> - **Embedding model download**: Llama Stack includes a default embedding model configuration for vector-store usage, but the model artifacts are downloaded from Hugging Face on first use. If a mirror or proxy is needed, configure `HF_ENDPOINT`. For fully offline environments, pre-download the model files into the server PVC before running the first vector-store request.

```yaml
apiVersion: llamastack.io/v1alpha1
@@ -66,6 +68,41 @@ spec:
# key: token
# name: vllm-api-token

# Optional: enable PGVector-backed vector stores.
# Omit the entire block below if you do not need vector store APIs.
# ACP-provided PostgreSQL already includes the pgvector extension.
# - name: ENABLE_PGVECTOR
# value: "true"
# - name: PGVECTOR_HOST
# value: "<acp-postgresql-service>"
# - name: PGVECTOR_PORT
# value: "5432"
# - name: PGVECTOR_DB
# value: "<database-name>"
# - name: PGVECTOR_USER
# value: "<database-username>"
# - name: PGVECTOR_PASSWORD
# valueFrom:
# secretKeyRef:
# name: <pgvector-credentials-secret>
# key: password

# Optional: configure a Hugging Face mirror or proxy for the default
# embedding model download path.
# - name: HF_ENDPOINT
# value: "<huggingface-mirror-or-proxy>"
#
# Optional: configure fully offline model loading. Pre-populate the
# Hugging Face cache under /home/lls/.lls/huggingface/hub, then set:
# - name: HF_HUB_CACHE
# value: "/home/lls/.lls/huggingface/hub"
# - name: HF_HUB_OFFLINE
# value: "1"
# - name: TRANSFORMERS_OFFLINE
# value: "1"
# - name: HF_HUB_DISABLE_XET
# value: "1"

distribution:
name: starter # Distribution name (options: starter, postgres-demo, meta-reference-gpu)
storage:
@@ -93,3 +130,53 @@ args:
```

Choose `--tool-call-parser` (and any related flags) according to the **served model** and the vLLM documentation for that model family.

## Enable PGVector Vector Store

When `ENABLE_PGVECTOR=true` is set on the server, clients can create PGVector-backed vector stores by passing `provider_id="pgvector"` through the client API.

Recommended preparation:

1. Prepare an ACP PostgreSQL instance and record its service name, database name, username, and password.
2. Expose the database connection to the `LlamaStackDistribution` with `PGVECTOR_HOST`, `PGVECTOR_PORT`, `PGVECTOR_DB`, `PGVECTOR_USER`, and `PGVECTOR_PASSWORD`.
3. Use the default embedding model provided by Llama Stack, and make sure its model files can be fetched on first use.
4. If the cluster uses a Hugging Face mirror or proxy, set `HF_ENDPOINT` accordingly.
5. If the cluster is fully offline, pre-download the embedding model files into the server PVC and enable offline cache-related environment variables.

After the distribution is ready, you can validate the setup with the PGVector section in the [Quickstart](./quickstart) notebook.
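
As a lighter-weight check before opening the notebook, you can confirm that the server is reachable and exposes an embedding model. This is a minimal sketch, assuming the default server port `8321` and a hypothetical in-cluster Service name `llamastack-server`; adjust `base_url` to your deployment.

```python
from llama_stack_client import LlamaStackClient

# Hypothetical in-cluster URL; replace with your LlamaStackDistribution Service.
client = LlamaStackClient(base_url="http://llamastack-server:8321")


def model_type(model):
    # Mirrors the quickstart notebook: metadata may live under either attribute.
    md = getattr(model, "metadata", None) or getattr(model, "custom_metadata", None) or {}
    return md.get("model_type") if isinstance(md, dict) else None


# PGVector-backed stores need an embedding model; verify one is registered.
embedding_models = [m for m in client.models.list() if model_type(m) == "embedding"]
print("Embedding models:", [m.id for m in embedding_models])
```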

## Hugging Face Access For Embedding Models

Llama Stack uses a default embedding model for vector-store operations. On first use, the server downloads the model files from Hugging Face into its local cache.

Recommended cache path:

- `/home/lls/.lls/huggingface/hub`

Common deployment modes:

1. Mirror or proxy access:

```yaml
- name: HF_ENDPOINT
value: "<huggingface-mirror-or-proxy>"
- name: HF_HUB_CACHE
value: "/home/lls/.lls/huggingface/hub"
```

2. Fully offline access:

Pre-download the required model files into the PVC-backed cache directory `/home/lls/.lls/huggingface/hub`, then set:

```yaml
- name: HF_HUB_CACHE
value: "/home/lls/.lls/huggingface/hub"
- name: HF_HUB_OFFLINE
value: "1"
- name: TRANSFORMERS_OFFLINE
value: "1"
- name: HF_HUB_DISABLE_XET
value: "1"
```

If the cache path is pre-populated correctly, the server can create PGVector-backed vector stores without downloading model artifacts at runtime.
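
One way to pre-populate the cache is to download the model on any machine with Hugging Face access and copy the result into the PVC. A minimal sketch, assuming the default embedding model is `sentence-transformers/all-MiniLM-L6-v2`; confirm the actual model id reported by your distribution before downloading.

```python
from huggingface_hub import snapshot_download

# Assumed model id; verify against the embedding model your server reports.
# cache_dir matches the recommended server cache path so the copied layout
# is usable as-is by the server pod.
snapshot_download(
    repo_id="sentence-transformers/all-MiniLM-L6-v2",
    cache_dir="/home/lls/.lls/huggingface/hub",
)
```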
1 change: 1 addition & 0 deletions docs/en/llama_stack/overview/features.mdx
@@ -26,4 +26,5 @@ weight: 20
## Integration

- **Python Client**: `llama-stack-client` for Python 3.12+ with full agent and model APIs
- **Vector Store APIs**: Create and query vector stores from the client, including PGVector-backed stores when the server is configured with `ENABLE_PGVECTOR=true`
- **REST-Friendly**: Server exposes APIs for inference, agents, and tool runtime; can be wrapped in FastAPI or other web frameworks for production use
13 changes: 12 additions & 1 deletion docs/en/llama_stack/quickstart.mdx
@@ -11,7 +11,7 @@ This section provides a quickstart example for creating an AI Agent with Llama S
- Python 3.12 or higher (if not satisfied, refer to [FAQ: How to prepare Python 3.12 in Notebook](#how-to-prepare-python-312-in-notebook))
- Llama Stack Server installed and running via Operator (see [Install Llama Stack](./install)), with **`VLLM_URL` pointing at a vLLM-served model endpoint** (see install notes)
- Access to a Notebook environment (e.g., Jupyter Notebook, JupyterLab)
- Python environment with `llama-stack-client`, `fastmcp` (for the MCP section), and other notebook dependencies installed
- Python environment with `llama-stack-client==0.6.0`, `fastmcp` (for the MCP section), and other notebook dependencies installed

## Quickstart Example

@@ -25,9 +25,20 @@ The notebook demonstrates:

- **Two tool options:** client-side tools (`@client_tool`) and MCP tools (FastMCP + `toolgroups.register`)
- **Shared agent flow:** connect to Llama Stack Server, select a model, create an `Agent` with `tools=AGENT_TOOLS`, then run sessions and streaming turns
- **Optional PGVector flow:** upload a file, create a `pgvector`-backed vector store, and run a hybrid search query
- Streaming responses and event logging
- Optional FastAPI deployment of the `agent`

## PGVector Usage

The downloadable notebook includes an optional PGVector section. To run it, start the server with `ENABLE_PGVECTOR=true` and valid `PGVECTOR_*` connection settings, then execute the PGVector cells in the notebook. ACP-provided PostgreSQL can be used directly because it already includes the `pgvector` extension.

The notebook example covers:

- Uploading a file through `client.files.create(...)`
- Creating a vector store with `provider_id="pgvector"`
- Running a hybrid search with `client.vector_stores.search(...)` and `search_mode="hybrid"`
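
Condensed from the notebook cells, the core calls look like the sketch below. The `base_url` is a placeholder, and the `embedding_model` / `embedding_dimension` values are placeholders too; the full notebook resolves them from `client.models.list()` instead of hardcoding them.

```python
import time

from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://llamastack-server:8321")  # placeholder URL

# Upload a small document for indexing.
file_object = client.files.create(
    file=("pgvector-demo.txt", b"Unique token: pgvector-demo-token", "text/plain"),
    purpose="assistants",
)

# Create a PGVector-backed vector store from the uploaded file.
vector_store = client.vector_stores.create(
    name=f"pgvector-demo-{int(time.time())}",
    file_ids=[file_object.id],
    extra_body={
        "provider_id": "pgvector",
        # Placeholders: the notebook resolves these from the server's model list.
        "embedding_model": "<embedding-model-id>",
        "embedding_dimension": 384,
    },
)

# Hybrid search combines keyword and vector retrieval.
search_result = client.vector_stores.search(
    vector_store_id=vector_store.id,
    query="pgvector-demo-token",
    max_num_results=3,
    extra_body={"search_mode": "hybrid"},
)
print(search_result)
```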

## FAQ

### How to prepare Python 3.12 in Notebook \{#how-to-prepare-python-312-in-notebook}
113 changes: 109 additions & 4 deletions docs/public/llama-stack/llama-stack_quickstart.ipynb
@@ -7,11 +7,12 @@
"source": [
"# Llama Stack Quick Start Demo\n",
"\n",
"This notebook demonstrates how to use Llama Stack to run an agent with tools in two ways:\n",
"This notebook demonstrates how to use Llama Stack for agent workflows and PGVector-backed vector store access:\n",
"\n",
"- **Option A (section 2):** define a **client-side** weather tool with `@client_tool`; the cell sets **`AGENT_TOOLS`**.\n",
"- **Option B (section 2):** run an **MCP** weather tool with **FastMCP** and register it with the server; the register cell sets **`AGENT_TOOLS`**.\n",
"- **Section 3** uses the **same** connect / model selection / `Agent` construction / run flow for both options. The only difference is the value of **`AGENT_TOOLS`** passed into `Agent`.\n",
"- **Section 4** shows how to upload a file and query a **PGVector**-backed vector store.\n",
"\n",
"### Inference backend (`LlamaStackDistribution`)\n",
"\n",
@@ -48,7 +49,7 @@
"# Use current kernel's Python so PATH does not point to another env\n",
"# If download is slow, add: -i https://pypi.tuna.tsinghua.edu.cn/simple\n",
"import sys\n",
"!{sys.executable} -m pip install \"llama-stack-client>=0.4\" \"requests\" \"fastapi\" \"uvicorn\" \"fastmcp\""
"!{sys.executable} -m pip install \"llama-stack-client==0.6.0\" \"requests\" \"fastapi\" \"uvicorn\" \"fastmcp\""
]
},
{
@@ -456,12 +457,116 @@
"print('\\n')"
]
},
{
"cell_type": "markdown",
"id": "pgvector-title-md",
"metadata": {},
"source": [
"## 4. PGVector Vector Store Example\n",
"\n",
"This section shows how to upload a file and query a PGVector-backed vector store.\n",
"\n",
"Prerequisites:\n",
"- The server distribution is configured with `ENABLE_PGVECTOR=true` and valid `PGVECTOR_*` connection settings.\n",
"- ACP-provided PostgreSQL can be used directly; it already includes the `pgvector` extension.\n",
"- Llama Stack includes a default embedding model configuration, but the model files are downloaded from Hugging Face on first use.\n",
"- If the cluster uses a Hugging Face mirror or proxy, configure `HF_ENDPOINT`.\n",
"- If the cluster is fully offline, pre-download the model files into `/home/lls/.lls/huggingface/hub` and set offline cache-related environment variables.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "pgvector-demo-code",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import time\n",
"\n",
"\n",
"def get_model_metadata(model):\n",
" metadata = getattr(model, \"metadata\", None)\n",
" if isinstance(metadata, dict):\n",
" return metadata\n",
"\n",
" custom_metadata = getattr(model, \"custom_metadata\", None)\n",
" if isinstance(custom_metadata, dict):\n",
" return custom_metadata\n",
"\n",
" return {}\n",
"\n",
"\n",
"models = client.models.list()\n",
"embedding_model = next(\n",
" (\n",
" model\n",
" for model in models\n",
" if get_model_metadata(model).get(\"model_type\") == \"embedding\"\n",
" ),\n",
" None,\n",
")\n",
"if embedding_model is None:\n",
" raise RuntimeError(\"No embedding model found from client.models.list()\")\n",
"\n",
"embedding_metadata = get_model_metadata(embedding_model)\n",
"resolved_dimension = (\n",
" embedding_metadata.get(\"embedding_dimension\")\n",
" or embedding_metadata.get(\"dimensions\")\n",
" or getattr(embedding_model, \"embedding_dimension\", None)\n",
" or getattr(embedding_model, \"dimensions\", None)\n",
")\n",
"if resolved_dimension is None:\n",
" raise RuntimeError(\n",
" f\"Could not determine embedding dimension for model {embedding_model.id!r}. \"\n",
" \"Set it explicitly to match the embedding model used by the server.\"\n",
" )\n",
"embedding_dimension = int(resolved_dimension)\n",
"\n",
"document = \"\"\"ACP PostgreSQL with pgvector can be used as the vector backend.\n",
"Unique token: pgvector-demo-token\n",
"This document is used to verify vector store indexing and retrieval.\n",
"\"\"\"\n",
"\n",
"file_object = client.files.create(\n",
" file=(\"pgvector-demo.txt\", document.encode(\"utf-8\"), \"text/plain\"),\n",
" purpose=\"assistants\",\n",
")\n",
"\n",
"vector_store = client.vector_stores.create(\n",
" name=f\"pgvector-demo-{int(time.time())}\",\n",
" file_ids=[file_object.id],\n",
" extra_body={\n",
" \"provider_id\": \"pgvector\",\n",
" \"embedding_model\": embedding_model.id,\n",
" \"embedding_dimension\": embedding_dimension,\n",
" },\n",
")\n",
"\n",
"search_result = client.vector_stores.search(\n",
" vector_store_id=vector_store.id,\n",
" query=\"pgvector-demo-token\",\n",
" max_num_results=3,\n",
" extra_body={\"search_mode\": \"hybrid\"},\n",
")\n",
"\n",
"if hasattr(vector_store, \"model_dump\"):\n",
" vector_store = vector_store.model_dump(mode=\"json\")\n",
"if hasattr(search_result, \"model_dump\"):\n",
" search_result = search_result.model_dump(mode=\"json\")\n",
"\n",
"print(\"Vector store:\")\n",
"print(json.dumps(vector_store, ensure_ascii=False, indent=2))\n",
"print(\"\\nSearch result:\")\n",
"print(json.dumps(search_result, ensure_ascii=False, indent=2))\n"
]
},
{
"cell_type": "markdown",
"id": "6f8d31d0",
"metadata": {},
"source": [
"## 4. FastAPI Service Example\n",
"## 5. FastAPI Service Example\n",
"\n",
"Expose the `llama-stack-client`-based `agent` as a FastAPI web service, so it can be called via HTTP.\n"
]
@@ -668,7 +773,7 @@
"id": "a3ebed1f",
"metadata": {},
"source": [
"## 5. More Resources\n",
"## 6. More Resources\n",
"\n",
"For more resources on developing AI Agents with Llama Stack, see:\n",
"\n",