3 changes: 2 additions & 1 deletion .gitignore
@@ -158,4 +158,5 @@ Thumbs.db
*.swp
*.swo
*~
.vscode/
.onnx-tests/
1 change: 1 addition & 0 deletions docs.json
@@ -92,6 +92,7 @@
"docs/inference/vllm",
"docs/inference/mlx",
"docs/inference/ollama",
"docs/inference/onnx",
{
"group": "Other Frameworks",
"icon": "server",
4 changes: 2 additions & 2 deletions docs/help/faqs.mdx
@@ -20,7 +20,7 @@ LFM models are compatible with:
- [vLLM](/docs/inference/vllm) - For high-throughput production serving
- [MLX](/docs/inference/mlx) - For Apple Silicon optimization
- [Ollama](/docs/inference/ollama) - For easy local deployment
- [LEAP](/leap/index) - For edge and mobile deployment
- [LEAP](/leap/edge-sdk/overview) - For edge and mobile deployment
</Accordion>

## Model Selection
@@ -49,7 +49,7 @@ LFM2.5 models are updated versions with improved training that deliver higher performance
## Deployment

<Accordion title="Can I run LFM models on mobile devices?">
Yes! Use the [LEAP SDK](/leap/index) to deploy models on iOS and Android devices. LEAP provides optimized inference for edge deployment with support for quantized models.
Yes! Use the [LEAP SDK](/leap/edge-sdk/overview) to deploy models on iOS and Android devices. LEAP provides optimized inference for edge deployment with support for quantized models.
</Accordion>

<Accordion title="What quantization formats are available?">
8 changes: 0 additions & 8 deletions docs/index.mdx

This file was deleted.

244 changes: 244 additions & 0 deletions docs/inference/onnx.mdx
@@ -0,0 +1,244 @@
---
title: "ONNX"
description: "ONNX provides a platform-agnostic inference specification for running LFM models on device-specific runtimes, including CPU, GPU, NPU, and WebGPU."

---

<Tip>
Use ONNX for cross-platform deployment, edge devices, and browser-based inference with WebGPU and Transformers.js.
</Tip>

ONNX (Open Neural Network Exchange) is a portable format that enables LFM inference across diverse hardware and runtimes. ONNX models run on CPUs, GPUs, NPUs, and in browsers via WebGPU—making them ideal for edge deployment and web applications.

## LiquidONNX

[LiquidONNX](https://github.com/Liquid4All/onnx-export) is the official tool for exporting LFM models to ONNX and running inference.

### Installation

```bash
git clone https://github.com/Liquid4All/onnx-export.git
cd onnx-export
uv sync

# For GPU inference
uv sync --extra gpu
```

### Supported Models

| Family | Quantization Formats |
|--------|---------------------|
| LFM2.5, LFM2 (text) | fp32, fp16, q4, q8 |
| LFM2.5-VL, LFM2-VL (vision) | fp32, fp16, q4, q8 |
| LFM2-MoE | fp32, fp16, q4, q4f16 |
| LFM2.5-Audio | fp32, fp16, q4, q8 |

### Export

```bash
# Text models - export with all precisions (fp16, q4, q8)
uv run lfm2-export LiquidAI/LFM2.5-1.2B-Instruct --precision

# Vision-language models
uv run lfm2-vl-export LiquidAI/LFM2.5-VL-1.6B --precision

# MoE models
uv run lfm2-moe-export LiquidAI/LFM2-8B-A1B --precision

# Audio models
uv run lfm2-audio-export LiquidAI/LFM2.5-Audio-1.5B --precision
```

### Inference

```bash
# Text model chat
uv run lfm2-infer --model ./exports/LFM2.5-1.2B-Instruct-ONNX/onnx/model_q4.onnx

# Vision-language with images
uv run lfm2-vl-infer --model ./exports/LFM2.5-VL-1.6B-ONNX \
  --images photo.jpg --prompt "Describe this image"

# Audio transcription (ASR)
uv run lfm2-audio-infer LFM2.5-Audio-1.5B-ONNX --mode asr \
  --audio input.wav --precision q4

# Text-to-speech (TTS)
uv run lfm2-audio-infer LFM2.5-Audio-1.5B-ONNX --mode tts \
  --prompt "Hello, how are you?" --output speech.wav --precision q4
```

For complete documentation and advanced options, see the [LiquidONNX GitHub repository](https://github.com/Liquid4All/onnx-export).

## Pre-exported Models

Many LFM models are available as pre-exported ONNX packages from [LiquidAI](https://huggingface.co/LiquidAI/models?search=onnx) and the [onnx-community](https://huggingface.co/onnx-community). Check the [Model Library](/docs/models/complete-library) for a complete list of available formats.
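
As a minimal sketch of pulling one of these packages locally, assuming the `huggingface_hub` client and the repository layout used in the examples below (weights under `onnx/`):

```python
from huggingface_hub import snapshot_download

# Download a pre-exported package (repository name taken from the examples below).
# allow_patterns keeps the download small: config/tokenizer files plus the Q4
# weights and their external data file.
local_dir = snapshot_download(
    "LiquidAI/LFM2.5-1.2B-Instruct-ONNX",
    allow_patterns=["*.json", "*.txt", "onnx/model_q4.onnx*"],
)
print(f"Downloaded to {local_dir}")
```

The Python inference example later on this page can then point its `InferenceSession` at the downloaded `onnx/model_q4.onnx` file.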

### Quantization Options

Each ONNX export includes multiple precision levels:

- **Q4**: recommended for most deployments; runs on WebGPU, CPU, and GPU.
- **FP16**: higher quality; runs on WebGPU and GPU.
- **Q8**: a quality/size balance; server-only (CPU/GPU).
- **FP32**: the full-precision baseline.
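
As a rough sketch of how that choice maps onto an exported package, assuming the `onnx/model_<precision>.onnx` naming that the Q4 file follows elsewhere on this page (the non-Q4 filenames are an assumption, so check the repository's `onnx/` folder for the exact names):

```python
from huggingface_hub import hf_hub_download

# Pick a precision to match your target runtime (see the trade-offs above).
# Filenames other than model_q4.onnx are assumed to follow the same pattern.
precision = "q4"  # or "fp16", "q8", "fp32"

model_path = hf_hub_download(
    "LiquidAI/LFM2.5-1.2B-Instruct-ONNX",
    f"onnx/model_{precision}.onnx",
)
print(model_path)
```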

## Hugging Face Spaces

These are fully deployed examples of WebGPU and ONNX inference with LFM models.

<CardGroup cols={3}>

<Card title="LFM2 WebGPU Chat" icon="message" href="https://huggingface.co/spaces/LiquidAI/LFM2-WebGPU">
Run LFM2 text models directly in your browser with WebGPU acceleration.
</Card>

<Card title="LFM2.5 Audio" icon="microphone" href="https://huggingface.co/spaces/LiquidAI/LFM2.5-Audio-1.5B-transformers-js">
Speech-to-text and text-to-speech with LFM2.5 Audio in the browser.
</Card>

<Card title="LFM2.5 Vision" icon="eye" href="https://huggingface.co/spaces/LiquidAI/LFM2.5-VL-1.6B-WebGPU">
Vision-language inference with LFM2.5-VL in the browser.
</Card>

</CardGroup>

## WebGPU Inference

ONNX models run in browsers via [Transformers.js](https://huggingface.co/docs/transformers.js) with WebGPU acceleration. This enables fully client-side inference without server infrastructure.

### Setup

1. Install Transformers.js:
```bash
npm install @huggingface/transformers
```

2. Enable WebGPU in your browser:
- **Chrome/Edge**: Navigate to `chrome://flags/#enable-unsafe-webgpu`, enable, and restart
- **Verify**: Check `chrome://gpu` for WebGPU status

### Usage

```javascript
import { AutoModelForCausalLM, AutoTokenizer, TextStreamer } from "@huggingface/transformers";

const modelId = "LiquidAI/LFM2.5-1.2B-Instruct-ONNX";

// Load model with WebGPU
const tokenizer = await AutoTokenizer.from_pretrained(modelId);
const model = await AutoModelForCausalLM.from_pretrained(modelId, {
  device: "webgpu",
  dtype: "q4", // or "fp16"
});

// Generate with streaming
const messages = [{ role: "user", content: "What is the capital of France?" }];
const input = tokenizer.apply_chat_template(messages, {
  add_generation_prompt: true,
  return_dict: true,
});

const streamer = new TextStreamer(tokenizer, { skip_prompt: true });
const output = await model.generate({
  ...input,
  max_new_tokens: 256,
  do_sample: false,
  streamer,
});

console.log(tokenizer.decode(output[0], { skip_special_tokens: true }));
```

<Note>
WebGPU supports Q4 and FP16 precision. Q8 quantization is not available in browser environments.
</Note>

## Python Inference

Install with pip:

```bash
pip install onnxruntime transformers numpy huggingface_hub jinja2

# For GPU support
pip install onnxruntime-gpu transformers numpy huggingface_hub jinja2
```

<Accordion title="Full Python example with KV cache">

```python
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download, list_repo_files
from transformers import AutoTokenizer

# Download Q4 model (recommended)
model_id = "LiquidAI/LFM2.5-1.2B-Instruct-ONNX"
model_path = hf_hub_download(model_id, "onnx/model_q4.onnx")

# Download external data files
for f in list_repo_files(model_id):
    if f.startswith("onnx/model_q4.onnx_data"):
        hf_hub_download(model_id, f)

# Load model and tokenizer
session = ort.InferenceSession(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Prepare input
messages = [{"role": "user", "content": "What is the capital of France?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer.encode(prompt, add_special_tokens=False)
input_ids = np.array([inputs], dtype=np.int64)

# Initialize KV cache
DTYPE_MAP = {
    "tensor(float)": np.float32,
    "tensor(float16)": np.float16,
    "tensor(int64)": np.int64
}
cache = {}
for inp in session.get_inputs():
    if inp.name in {"input_ids", "attention_mask", "position_ids"}:
        continue
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    for i, d in enumerate(inp.shape):
        if isinstance(d, str) and "sequence" in d.lower():
            shape[i] = 0
    dtype = DTYPE_MAP.get(inp.type, np.float32)
    cache[inp.name] = np.zeros(shape, dtype=dtype)

# Generate tokens
seq_len = input_ids.shape[1]
generated = []
input_names = {inp.name for inp in session.get_inputs()}

for step in range(100):
    if step == 0:
        ids = input_ids
        pos = np.arange(seq_len, dtype=np.int64).reshape(1, -1)
    else:
        ids = np.array([[generated[-1]]], dtype=np.int64)
        pos = np.array([[seq_len + len(generated) - 1]], dtype=np.int64)

    attn_mask = np.ones((1, seq_len + len(generated)), dtype=np.int64)
    feed = {"input_ids": ids, "attention_mask": attn_mask, **cache}
    if "position_ids" in input_names:
        feed["position_ids"] = pos

    outputs = session.run(None, feed)
    next_token = int(np.argmax(outputs[0][0, -1]))
    generated.append(next_token)

    # Update cache
    for i, out in enumerate(session.get_outputs()[1:], 1):
        name = out.name.replace("present_conv", "past_conv")
        name = name.replace("present.", "past_key_values.")
        if name in cache:
            cache[name] = outputs[i]

    if next_token == tokenizer.eos_token_id:
        break

print(tokenizer.decode(generated, skip_special_tokens=True))
```

</Accordion>