3 changes: 2 additions & 1 deletion .gitignore
@@ -158,4 +158,5 @@ Thumbs.db
*.swp
*.swo
*~
.vscode/
.onnx-tests/
1 change: 1 addition & 0 deletions docs.json
@@ -92,6 +92,7 @@
"docs/inference/vllm",
"docs/inference/mlx",
"docs/inference/ollama",
"docs/inference/onnx",
{
"group": "Other Frameworks",
"icon": "server",
4 changes: 2 additions & 2 deletions docs/help/faqs.mdx
@@ -20,7 +20,7 @@ LFM models are compatible with:
- [vLLM](/docs/inference/vllm) - For high-throughput production serving
- [MLX](/docs/inference/mlx) - For Apple Silicon optimization
- [Ollama](/docs/inference/ollama) - For easy local deployment
- [LEAP](/leap/index) - For edge and mobile deployment
- [LEAP](/leap/edge-sdk/overview) - For edge and mobile deployment
</Accordion>

## Model Selection
@@ -49,7 +49,7 @@ LFM2.5 models are updated versions with improved training that deliver higher performance
## Deployment

<Accordion title="Can I run LFM models on mobile devices?">
Yes! Use the [LEAP SDK](/leap/index) to deploy models on iOS and Android devices. LEAP provides optimized inference for edge deployment with support for quantized models.
Yes! Use the [LEAP SDK](/leap/edge-sdk/overview) to deploy models on iOS and Android devices. LEAP provides optimized inference for edge deployment with support for quantized models.
</Accordion>

<Accordion title="What quantization formats are available?">
8 changes: 0 additions & 8 deletions docs/index.mdx

This file was deleted.

244 changes: 244 additions & 0 deletions docs/inference/onnx.mdx
@@ -0,0 +1,244 @@
---
title: "ONNX"
description: "ONNX provides a platform-agnostic inference specification for running LFM models on device-specific runtimes, including CPU, GPU, NPU, and WebGPU."

---

<Tip>
Use ONNX for cross-platform deployment, edge devices, and browser-based inference with WebGPU and Transformers.js.
</Tip>

ONNX (Open Neural Network Exchange) is a portable format that enables LFM inference across diverse hardware and runtimes. ONNX models run on CPUs, GPUs, NPUs, and in browsers via WebGPU—making them ideal for edge deployment and web applications.

## LiquidONNX

[LiquidONNX](https://github.com/Liquid4All/onnx-export) is the official tool for exporting LFM models to ONNX and running inference.

### Installation

```bash
git clone https://github.com/Liquid4All/onnx-export.git
cd onnx-export
uv sync

# For GPU inference
uv sync --extra gpu
```

### Supported Models

| Family | Quantization Formats |
|--------|---------------------|
| LFM2.5, LFM2 (text) | fp32, fp16, q4, q8 |
| LFM2.5-VL, LFM2-VL (vision) | fp32, fp16, q4, q8 |
| LFM2-MoE | fp32, fp16, q4, q4f16 |
| LFM2.5-Audio | fp32, fp16, q4, q8 |

### Export

```bash
# Text models - export with all precisions (fp16, q4, q8)
uv run lfm2-export LiquidAI/LFM2.5-1.2B-Instruct --precision

# Vision-language models
uv run lfm2-vl-export LiquidAI/LFM2.5-VL-1.6B --precision

# MoE models
uv run lfm2-moe-export LiquidAI/LFM2-8B-A1B --precision

# Audio models
uv run lfm2-audio-export LiquidAI/LFM2.5-Audio-1.5B --precision
```

### Inference

```bash
# Text model chat
uv run lfm2-infer --model ./exports/LFM2.5-1.2B-Instruct-ONNX/onnx/model_q4.onnx

# Vision-language with images
uv run lfm2-vl-infer --model ./exports/LFM2.5-VL-1.6B-ONNX \
  --images photo.jpg --prompt "Describe this image"

# Audio transcription (ASR)
uv run lfm2-audio-infer LFM2.5-Audio-1.5B-ONNX --mode asr \
  --audio input.wav --precision q4

# Text-to-speech (TTS)
uv run lfm2-audio-infer LFM2.5-Audio-1.5B-ONNX --mode tts \
  --prompt "Hello, how are you?" --output speech.wav --precision q4
```

For complete documentation and advanced options, see the [LiquidONNX GitHub repository](https://github.com/Liquid4All/onnx-export).

## Pre-exported Models

Many LFM models are available as pre-exported ONNX packages from [LiquidAI](https://huggingface.co/LiquidAI/models?search=onnx) and the [onnx-community](https://huggingface.co/onnx-community). Check the [Model Library](/docs/models/complete-library) for a complete list of available formats.
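
As a minimal sketch of pulling one of these packages locally, assuming the `huggingface_hub` client and the repository layout used in the examples below (weights under `onnx/`):

```python
from huggingface_hub import snapshot_download

# Download a pre-exported package (repository name taken from the examples below).
# allow_patterns keeps the download small: config/tokenizer files plus the Q4
# weights and their external data file.
local_dir = snapshot_download(
    "LiquidAI/LFM2.5-1.2B-Instruct-ONNX",
    allow_patterns=["*.json", "*.txt", "onnx/model_q4.onnx*"],
)
print(f"Downloaded to {local_dir}")
```

The Python inference example later on this page can then point its `InferenceSession` at the downloaded `onnx/model_q4.onnx` file.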

### Quantization Options

Each ONNX export includes multiple precision levels:

- **Q4**: recommended for most deployments; runs on WebGPU, CPU, and GPU.
- **FP16**: higher quality; runs on WebGPU and GPU.
- **Q8**: a quality/size balance; server-only (CPU/GPU).
- **FP32**: the full-precision baseline.
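
As a rough sketch of how that choice maps onto an exported package, assuming the `onnx/model_<precision>.onnx` naming that the Q4 file follows elsewhere on this page (the non-Q4 filenames are an assumption, so check the repository's `onnx/` folder for the exact names):

```python
from huggingface_hub import hf_hub_download

# Pick a precision to match your target runtime (see the trade-offs above).
# Filenames other than model_q4.onnx are assumed to follow the same pattern.
precision = "q4"  # or "fp16", "q8", "fp32"

model_path = hf_hub_download(
    "LiquidAI/LFM2.5-1.2B-Instruct-ONNX",
    f"onnx/model_{precision}.onnx",
)
print(model_path)
```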

## Hugging Face Spaces

These are fully deployed examples of WebGPU and ONNX inference with LFM models.

<CardGroup cols={3}>

<Card title="LFM2 WebGPU Chat" icon="message" href="https://huggingface.co/spaces/LiquidAI/LFM2-WebGPU">
Run LFM2 text models directly in your browser with WebGPU acceleration.
</Card>

<Card title="LFM2.5 Audio" icon="microphone" href="https://huggingface.co/spaces/LiquidAI/LFM2.5-Audio-1.5B-transformers-js">
Speech-to-text and text-to-speech with LFM2.5 Audio in the browser.
</Card>

<Card title="LFM2.5 Vision" icon="eye" href="https://huggingface.co/spaces/LiquidAI/LFM2.5-VL-1.6B-WebGPU">
Vision-language inference with LFM2.5-VL in the browser.
</Card>

</CardGroup>

## WebGPU Inference

ONNX models run in browsers via [Transformers.js](https://huggingface.co/docs/transformers.js) with WebGPU acceleration. This enables fully client-side inference without server infrastructure.

### Setup

1. Install Transformers.js:
```bash
npm install @huggingface/transformers
```

2. Enable WebGPU in your browser:
- **Chrome/Edge**: Navigate to `chrome://flags/#enable-unsafe-webgpu`, enable, and restart
- **Verify**: Check `chrome://gpu` for WebGPU status

### Usage

```javascript
import { AutoModelForCausalLM, AutoTokenizer, TextStreamer } from "@huggingface/transformers";

const modelId = "LiquidAI/LFM2.5-1.2B-Instruct-ONNX";

// Load model with WebGPU
const tokenizer = await AutoTokenizer.from_pretrained(modelId);
const model = await AutoModelForCausalLM.from_pretrained(modelId, {
  device: "webgpu",
  dtype: "q4", // or "fp16"
});

// Generate with streaming
const messages = [{ role: "user", content: "What is the capital of France?" }];
const input = tokenizer.apply_chat_template(messages, {
  add_generation_prompt: true,
  return_dict: true,
});

const streamer = new TextStreamer(tokenizer, { skip_prompt: true });
const output = await model.generate({
  ...input,
  max_new_tokens: 256,
  do_sample: false,
  streamer,
});

console.log(tokenizer.decode(output[0], { skip_special_tokens: true }));
```

<Note>
WebGPU supports Q4 and FP16 precision. Q8 quantization is not available in browser environments.
</Note>

## Python Inference

Install with pip:

```bash
pip install onnxruntime transformers numpy huggingface_hub jinja2

# For GPU support
pip install onnxruntime-gpu transformers numpy huggingface_hub jinja2
```

<Accordion title="Full Python example with KV cache">

```python
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download, list_repo_files
from transformers import AutoTokenizer

# Download Q4 model (recommended)
model_id = "LiquidAI/LFM2.5-1.2B-Instruct-ONNX"
model_path = hf_hub_download(model_id, "onnx/model_q4.onnx")

# Download external data files
for f in list_repo_files(model_id):
    if f.startswith("onnx/model_q4.onnx_data"):
        hf_hub_download(model_id, f)

# Load model and tokenizer
session = ort.InferenceSession(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Prepare input
messages = [{"role": "user", "content": "What is the capital of France?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer.encode(prompt, add_special_tokens=False)
input_ids = np.array([inputs], dtype=np.int64)

# Initialize KV cache
DTYPE_MAP = {
    "tensor(float)": np.float32,
    "tensor(float16)": np.float16,
    "tensor(int64)": np.int64
}
cache = {}
for inp in session.get_inputs():
    if inp.name in {"input_ids", "attention_mask", "position_ids"}:
        continue
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    for i, d in enumerate(inp.shape):
        if isinstance(d, str) and "sequence" in d.lower():
            shape[i] = 0
    dtype = DTYPE_MAP.get(inp.type, np.float32)
    cache[inp.name] = np.zeros(shape, dtype=dtype)

# Generate tokens
seq_len = input_ids.shape[1]
generated = []
input_names = {inp.name for inp in session.get_inputs()}

for step in range(100):
    if step == 0:
        ids = input_ids
        pos = np.arange(seq_len, dtype=np.int64).reshape(1, -1)
    else:
        ids = np.array([[generated[-1]]], dtype=np.int64)
        pos = np.array([[seq_len + len(generated) - 1]], dtype=np.int64)

    attn_mask = np.ones((1, seq_len + len(generated)), dtype=np.int64)
    feed = {"input_ids": ids, "attention_mask": attn_mask, **cache}
    if "position_ids" in input_names:
        feed["position_ids"] = pos

    outputs = session.run(None, feed)
    next_token = int(np.argmax(outputs[0][0, -1]))
    generated.append(next_token)

    # Update cache
    for i, out in enumerate(session.get_outputs()[1:], 1):
        name = out.name.replace("present_conv", "past_conv")
        name = name.replace("present.", "past_key_values.")
        if name in cache:
            cache[name] = outputs[i]

    if next_token == tokenizer.eos_token_id:
        break

print(tokenizer.decode(generated, skip_special_tokens=True))
```

</Accordion>