Full technical reference for acestep.cpp. For a quick start guide, see README.md.
Portable C++17 implementation of ACE-Step 1.5 music generation using GGML. Text + lyrics in, stereo 48kHz MP3 or WAV out. Runs on CPU, CUDA, ROCm, Metal, Vulkan.
git submodule update --init
mkdir build && cd build
# macOS (Metal + Accelerate BLAS auto-enabled)
cmake ..
# Linux with NVIDIA GPU
cmake .. -DGGML_CUDA=ON
# Linux with AMD GPU (ROCm)
cmake .. -DGGML_HIP=ON
# Linux with Vulkan
cmake .. -DGGML_VULKAN=ON
cmake --build . --config Release -j$(nproc)
Install Visual C++ Build Tools (select "Desktop development with C++" workload) and optionally the CUDA Toolkit and/or the Vulkan SDK.
git submodule update --init
call "C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\Build\vcvars64.bat"
mkdir build
cd build
rem NVIDIA GPU
cmake .. -DGGML_CUDA=ON
rem AMD/Intel GPU (Vulkan)
cmake .. -DGGML_VULKAN=ON
rem all backends (CUDA + Vulkan + CPU, runtime loading)
cmake .. -DGGML_CPU_ALL_VARIANTS=ON -DGGML_CUDA=ON -DGGML_VULKAN=ON -DGGML_BACKEND_DL=ON
cmake --build . --config Release -j %NUMBER_OF_PROCESSORS%
Builds seven binaries: ace-lm (LLM), ace-synth (DiT + VAE), ace-server (HTTP server), ace-understand (reverse: audio -> metadata), neural-codec (VAE encode/decode), mp3-codec (MP3 encoder/decoder), and quantize (GGUF requantizer).
Pre-quantized GGUFs on Hugging Face.
pip install hf
./models.sh # Q8_0 turbo essentials (~7.7 GB)
./models.sh --all # every model, every quant (~97 GB)
./models.sh --quant Q6_K # pick a specific quant (Q4_K_M, Q5_K_M, Q6_K, Q8_0, BF16)
./models.sh --sft # add SFT DiT variant
./models.sh --shifts # add shift1/shift3/continuous variants
The default download fetches 4 files into models/:
| GGUF | Arch | Size |
|---|---|---|
| Qwen3-Embedding-0.6B-Q8_0.gguf | text encoder (28L, H=1024) | 748 MB |
| acestep-5Hz-lm-4B-Q8_0.gguf | Qwen3 causal LM | 4.2 GB |
| acestep-v15-turbo-Q8_0.gguf | DiT 2B + CondEncoder (24L, H=2048) | 2.4 GB |
| vae-BF16.gguf | AutoencoderOobleck | 322 MB |
Three LM sizes: 0.6B (fast), 1.7B, 4B (best quality). Six DiT variants: turbo, sft, base, turbo-shift1, turbo-shift3, turbo-continuous. XL (4B DiT) variants: xl-turbo, xl-sft, xl-base (32L, H=2560, higher quality, ~9.5 GB BF16). VAE is always BF16 (small, bandwidth-bound, quality-critical).
Building GGUFs from source (checkpoints + convert)
If you want to convert from the original safetensors yourself:
pip install gguf hf
./checkpoints.sh # download raw HF checkpoints (turbo + 4B LM)
./checkpoints.sh --all # all variants (SFT, shift1/3, 0.6B/1.7B LM)
python3 convert.py # convert all checkpoints to GGUF (models/)
./quantize.sh # quantize BF16 -> Q4_K_M/Q5_K_M/Q6_K/Q8_0
checkpoints.sh downloads safetensors, config.json, and tokenizer files
into checkpoints/. convert.py packs everything into self-contained
GGUF files in models/, bundling BPE tokenizer, silence_latent, and
config metadata so no external file is needed at runtime.
Two modes, one flag, no in-between.
Default mode optimises VRAM. At most one GPU module is resident at a time. The store evicts whatever was loaded before bringing in the next module, so VAE tile activations never sit next to DiT weights or LM weights. This is the mode that lets the full ACE-Step stack run on consumer cards: the DiT loads for denoising, evicts when done, the VAE loads for decode, evicts, and so on.
--keep-loaded mode optimises latency. Everything stays resident
across requests. No reload, no eviction. This is the mode for workstations
with generous VRAM where startup overhead would dominate the request
latency. No "smart" rules kick in: if the user asks for keep-loaded, they
get exactly that.
Invariant held under both modes. Exactly one LM instance lives in the
process, shared between ace-lm (generate) and ace-understand.
Duplicating the Qwen3 LM would cost gigabytes and buy nothing. The
ModelStore enforces this by keying the LM on (path, max_seq, n_kv_sets)
with identical values across both pipelines.
A single ModelStore instance owns every GPU module the server or the
CLIs touch: Qwen3 LM, DiT (with optional adapter), VAE encoder, VAE
decoder, Qwen3 text encoder, condition encoder, FSQ tokenizer, FSQ
detokenizer. Pipelines never load or free modules themselves. They ask
the store for a module with store_require_*, use it, and release it
when the scope ends. A thin RAII handle (ModelHandle) pairs the require
with the release so no early return, error path or exception can leak a
resident module.
The store keys each module by the fields that actually change what gets
loaded. For an LM that means (path, max_seq, n_kv_sets). For a DiT that
means (path, adapter_path, adapter_scale). For every other module it is
just (path). Two requires with the same key return the same pointer, so
pipelines that share a module naturally share the resident weights.
CPU-only helpers (BPE merges, FSM decoding template, DiT metadata like silence_latent and null_condition_emb) have their own accessors and stay resident for the whole process lifetime. They total a few megabytes and cost nothing to keep.
The chosen eviction policy (STRICT or NEVER, see above) decides what the store does on release. STRICT unloads the module immediately once no pipeline holds a handle on it, so VAE tiles never compete for VRAM with an LM or a DiT. NEVER keeps everything loaded for maximum throughput on machines that have the budget.
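The require/release discipline can be sketched in a few lines. This is a hypothetical Python analogue of the C++ ModelStore, not the real implementation: a keyed cache with RAII-style handles, where STRICT evicts as soon as the last handle is released.

```python
from contextlib import contextmanager

class ModelStore:
    """Toy sketch of a keyed module store with STRICT/NEVER eviction."""
    def __init__(self, policy="STRICT"):
        self.policy = policy
        self.loaded = {}    # key -> resident module
        self.refcount = {}  # key -> number of live handles

    def _load(self, key):
        # Stand-in for the real GGUF load + GPU upload.
        return {"key": key}

    @contextmanager
    def require(self, *key):
        # Two requires with the same key share the same resident module.
        if key not in self.loaded:
            self.loaded[key] = self._load(key)
        self.refcount[key] = self.refcount.get(key, 0) + 1
        try:
            yield self.loaded[key]
        finally:
            # RAII-style release: runs on normal exit, early return, or exception.
            self.refcount[key] -= 1
            if self.policy == "STRICT" and self.refcount[key] == 0:
                del self.loaded[key]  # evict immediately, free the VRAM budget

store = ModelStore("STRICT")
with store.require("lm.gguf", 8192, 2) as lm:
    with store.require("lm.gguf", 8192, 2) as lm2:
        shared = lm is lm2                      # one LM instance per key
evicted = ("lm.gguf", 8192, 2) not in store.loaded  # True under STRICT
```

Under NEVER, the `del` never fires and every module stays resident after first load, which is the --keep-loaded behaviour.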
ace-lm generates lyrics and audio codes, ace-synth synthesizes audio.
The input JSON is never modified. Output is always numbered: request0.json.
cat > /tmp/request.json << 'EOF'
{
"caption": "Upbeat pop rock with driving guitars and catchy hooks",
"inference_steps": 8,
"shift": 3.0,
"vocal_language": "fr"
}
EOF
# LLM: request.json -> request0.json (enriched with metadata + lyrics + codes)
./ace-lm \
--models models \
--request /tmp/request.json
# DiT+VAE: request0.json -> request00.mp3
./ace-synth \
--models models \
--request /tmp/request0.json
With an adapter (LoRA today, PEFT directory or ComfyUI single file), set
adapter in the JSON and point --adapters at a directory that contains it:
# JSON carries the adapter name, CLI passes the directory
cat > /tmp/request.json << 'EOF'
{
"caption": "Upbeat pop rock with driving guitars and catchy hooks",
"adapter": "best_sft_v2_2338_comfyui.safetensors",
"adapter_scale": 1.0,
"vocal_language": "fr"
}
EOF
./ace-synth \
--models models \
--adapters adapters \
--request /tmp/request0.json
Generate multiple songs at once with lm_batch_size in the JSON:
# 2 different songs from one prompt (different lyrics, codes, metadata)
cat > /tmp/request.json << 'EOF'
{
"caption": "Upbeat pop rock anthem with driving guitars and catchy hooks",
"vocal_language": "fr",
"lm_batch_size": 2
}
EOF
# LM: request.json (lm_batch_size=2) -> request0.json, request1.json
./ace-lm \
--models models \
--request /tmp/request.json
# DiT+VAE: both requests in one GPU batch -> request00.mp3, request10.mp3
./ace-synth \
--models models \
--request /tmp/request0.json /tmp/request1.json
lm_batch_size controls how many songs the LM generates. User-provided
fields are preserved in all outputs. Empty fields are filled independently
per batch item, producing genuinely different songs.
ace-synth takes all request files as CLI arguments and runs them in a
single GPU batch.
Transform an existing song with --src-audio (no LLM needed):
cat > /tmp/cover.json << 'EOF'
{
"task_type": "cover",
"caption": "Jazz piano cover with brushed drums and walking bass",
"lyrics": "[Instrumental]"
}
EOF
./ace-synth \
--models models \
--src-audio song.wav \
--request /tmp/cover.json
Ready-made examples in examples/:
cd examples
./simple.sh # caption only, LLM fills everything
./simple-batch.sh # 2 songs from one prompt (lm_batch_size=2)
./partial.sh # caption + lyrics + duration
./full.sh # all metadata provided
./dit-only.sh # skip LLM, DiT from noise
./ace-understand.sh <audio> # audio -> understand -> SFT DiT -> MP3 roundtrip
./server-turbo.sh # start HTTP server (turbo model)
./server-sft.sh # start HTTP server (SFT model)
./client.sh # test server (single song)
./client-batch.py # test server batch (2 songs)
./client-understand.sh <audio> # test /understand + /synth roundtrip
Each example has a -sft variant (SFT model, 50 steps, CFG 1.0)
alongside the turbo default (8 steps, no CFG).
The LLM fills what's missing in the JSON and generates audio codes.
Empty field = "fill it". Filled = "don't touch".
All modes always output numbered files (request0.json .. requestN-1.json).
The input JSON is never modified.
Caption only (lyrics=""): two LLM passes. Phase 1 uses the "Expand"
prompt to generate an enriched caption, lyrics, and metadata (bpm, keyscale,
timesignature, duration, vocal_language) via CoT. Phase 2 reinjects the CoT
and generates audio codes using the "Generate tokens" prompt. CFG is forced
to 1.0 in phase 1 (free sampling); lm_cfg_scale only applies in phase 2.
With lm_batch_size > 1, each element runs its own phase 1,
producing N completely different songs. See examples/simple-batch.json.
Caption + lyrics (+ optional metadata): single LLM pass. The "Generate
tokens" prompt is used directly. Missing metadata is filled via CoT, the
caption is enriched, and audio codes are generated. User-provided metadata
fields are never overwritten. lm_cfg_scale applies to both CoT and code
generation. See examples/partial.json.
Everything provided (caption, lyrics, bpm, duration, keyscale,
timesignature): the LLM skips CoT and generates audio codes directly.
With lm_batch_size > 1, all elements share the same prompt (single prefill,
KV cache copied), producing N different audio code sets. See examples/full.json.
Instrumental (lyrics="[Instrumental]"): treated as "lyrics provided",
so the single-pass "Generate tokens" path is used. No lyrics generation.
The DiT was trained with this exact string as the no-vocal condition.
Passthrough (audio_codes present): LLM is skipped entirely.
Run ace-synth to decode existing codes. See examples/dit-only.json.
Cover ("task_type": "cover" + --src-audio): no LLM needed. The source audio
(WAV or MP3, any sample rate) is resampled to 48kHz, VAE-encoded to latent
space, then passed through an FSQ roundtrip (tokenize 25Hz to 5Hz, detokenize
back to 25Hz). The lossy 5:1 temporal compression destroys micro-timings,
ornaments and transients, so the DiT diverges from the source and produces
a free reinterpretation rather than a close remix.
audio_cover_strength in the JSON controls how many DiT steps see the source
(0.5 = half the steps use source context, half use silence). The caption
steers the style while the source provides loose structure.
Duration is determined by the source audio.
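Why the 5:1 compression forces divergence can be seen with a toy stand-in. The real FSQ tokenizer is learned; this sketch just decimates each group of 5 frames to its mean and repeats it back out, which is enough to show that per-frame timing inside a group cannot survive the roundtrip.

```python
def fsq_roundtrip_toy(latents_25hz, ratio=5):
    """Toy stand-in for tokenize(25Hz -> 5Hz) + detokenize(5Hz -> 25Hz):
    collapse each group of `ratio` frames to its mean, then repeat it."""
    out = []
    for i in range(0, len(latents_25hz), ratio):
        group = latents_25hz[i:i + ratio]
        mean = sum(group) / len(group)
        out.extend([mean] * len(group))
    return out

# A sharp one-frame transient is smeared evenly across its 5-frame group:
# the DiT sees *that* something happened in the group, but not *when*.
src = [0.0, 0.0, 1.0, 0.0, 0.0,  0.0, 0.0, 0.0, 0.0, 0.0]
rt = fsq_roundtrip_toy(src)
# rt == [0.2]*5 + [0.0]*5 -- the transient's position is gone
```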
Cover-nofsq ("task_type": "cover-nofsq" + --src-audio): cover variant
that skips the FSQ roundtrip. The DiT receives clean VAE latents at 25Hz,
preserving the full detail of the source. Produces remixes that stay close
to the original structure, melody, and timbre. Pass --ref-audio pointing to
the same file as --src-audio for best results.
audio_cover_strength works well at higher values (0.2 to 0.5) compared to
regular cover. Same JSON fields as cover, just change the task_type.
Repaint ("task_type": "repaint" + --src-audio):
regenerates a time region of the source audio while preserving the rest.
repainting_start and repainting_end define the region in seconds.
Default start is 0. Default end (-1) resolves to source start when
outpainting (start < 0) or source duration otherwise.
Negative start outpaints before the source, end beyond source duration
outpaints after. The source audio is padded with silence before VAE
encoding. audio_cover_strength is ignored (the mask handles everything).
# Inpaint: regenerate seconds 10-25
cat > /tmp/repaint.json << 'EOF'
{
"task_type": "repaint",
"caption": "Smooth jazz guitar solo with reverb",
"lyrics": "[Instrumental]",
"repainting_start": 10.0,
"repainting_end": 25.0,
"inference_steps": 50,
"guidance_scale": 1.0,
"shift": 1.0
}
EOF
# Outpaint: generate 5s before the song (end defaults to 0)
cat > /tmp/outpaint.json << 'EOF'
{
"task_type": "repaint",
"caption": "Smooth jazz intro building into the main theme",
"lyrics": "[Instrumental]",
"repainting_start": -5.0,
"inference_steps": 50,
"guidance_scale": 1.0,
"shift": 1.0
}
EOF
./ace-synth \
--models models \
--src-audio song.wav \
--request /tmp/repaint.json
Lego ("task_type": "lego" + --src-audio):
generates a new instrument track layered over an existing backing track.
Requires the acestep-v15-base DiT (turbo and SFT do not support lego).
cat > /tmp/lego.json << 'EOF'
{
"synth_model": "acestep-v15-base-Q8_0.gguf",
"caption": "electric guitar riff, funk guitar, house music, instrumental",
"lyrics": "[Instrumental]",
"task_type": "lego",
"track": "guitar",
"output_format": "wav16",
"inference_steps": 50,
"guidance_scale": 1.0,
"shift": 1.0
}
EOF
./ace-synth \
--models models \
--src-audio backing-track.wav \
--request /tmp/lego.json
Available track names for lego, extract, and complete: vocals, backing_vocals,
drums, bass, guitar, keyboard, percussion, strings, synth, fx,
brass, woodwinds.
| Task | Turbo | Base/SFT | LM used |
|---|---|---|---|
| text2music | yes | yes | yes |
| cover | yes | yes | no (skipped) |
| cover-nofsq | yes | yes | no (skipped) |
| repaint | yes | yes | no (skipped) |
| lego | no | yes | yes |
| extract | no | yes | no (skipped) |
| complete | no | yes | yes |
For skipped tasks, caption and lyrics are passed verbatim to the DiT.
What the DiT actually receives in its 128-channel context [src(64) | mask(64)]:
| Mode | src channels | mask value | instruction |
|---|---|---|---|
| text2music | silence | 1.0 | "Fill the audio semantic mask..." |
| cover | FSQ(src) roundtrip | 1.0 | "Generate audio semantic tokens..." |
| cover-nofsq | raw VAE src (no FSQ) | 1.0 | "Generate audio semantic tokens..." |
| repaint | silence in zone / src outside | 0.0 outside / 1.0 in zone | "Repaint the mask area..." |
| lego (no region) | raw VAE src everywhere | 1.0 | "Generate the TRACK track..." |
| lego (with region) | raw VAE src everywhere | 0.0 outside / 1.0 in zone | "Generate the TRACK track..." |
| extract | raw VAE src | 1.0 | "Extract the TRACK track..." |
| complete | raw VAE src | 1.0 | "Complete the input track..." |
cover uses an FSQ roundtrip (tokenize 25Hz->5Hz then detokenize 5Hz->25Hz). The lossy compression destroys source detail and the DiT diverges freely. cover-nofsq skips this roundtrip: same instruction, clean 25Hz latents. The DiT stays close to the source and produces faithful remixes. Pass ref_audio = src_audio for best results. All other tasks with source audio use raw VAE latents (no FSQ).
Region coordinates are resolved in a unified block after mode routing:
s.rs += left_pad_sec; s.re += left_pad_sec. When outpainting is active,
source audio has been padded with silence before VAE encoding, so T_cover
and all downstream latent operations naturally reflect the extended canvas.
Two mechanisms do the work when a repaint region is active:
- DiT context conditioning: the context_latents tensor fed to the DiT via cross-attention carries the full source latents with silence pasted inside [t0, t1), paired with a binary mask (1.0 inside, 0.0 outside). The DiT was trained on this exact shape and natively regenerates the silenced zone. Flow matching runs as a generic denoising loop, unaware of repaint.
- Latent splice (pre-VAE decode): keeps the DiT-generated frames inside [t0, t1) and copies the source latents elsewhere (hard cut at frame boundary). A single VAE decode produces the final audio. No crossfade: the tiled VAE decoder smooths the seam in the waveform on its own.
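The splice is a plain frame-level copy. A minimal sketch, with per-frame values standing in for latent frames (illustration only, not the codebase's tensor code):

```python
def splice_latents(generated, source, t0, t1):
    """Keep DiT-generated frames inside [t0, t1), source frames elsewhere.
    Hard cut at the frame boundary; no crossfade."""
    assert len(generated) == len(source)
    return [generated[i] if t0 <= i < t1 else source[i]
            for i in range(len(source))]

src = ["s"] * 8
gen = ["g"] * 8
spliced = splice_latents(gen, src, 2, 5)
# -> ['s', 's', 'g', 'g', 'g', 's', 's', 's']
```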
Key difference repaint vs lego: repaint silences the zone in the DiT context src (so the DiT generates fresh content there). Lego keeps the full backing track in context even inside the zone (DiT generates a new layer that harmonizes with it).
ace-synth and ace-lm expose cover and repaint via --src-audio with all
model types. Lego, extract, and complete are accessible via JSON request
(task_type field) and the HTTP server, but have no dedicated CLI flag:
pass --src-audio and set task_type in the JSON directly. These three modes
require a base or SFT model (not turbo).
Every field below has a default. Omitting a field from the JSON is strictly
equivalent to sending that field with its default value. Only caption is
effectively required: the other defaults produce a valid text2music job on
their own, but without caption the LLM has nothing to work from.
{
"caption": "",
"lyrics": "",
"bpm": 0,
"duration": 0,
"keyscale": "",
"timesignature": "",
"vocal_language": "",
"seed": -1,
"lm_batch_size": 1,
"synth_batch_size": 1,
"lm_temperature": 0.85,
"lm_cfg_scale": 2.0,
"lm_top_p": 0.9,
"lm_top_k": 0,
"lm_negative_prompt": "",
"use_cot_caption": true,
"audio_codes": "",
"inference_steps": 0,
"guidance_scale": 0.0,
"shift": 0.0,
"dcw_scaler": 0.0,
"dcw_high_scaler": 0.0,
"dcw_mode": "low",
"audio_cover_strength": 1.0,
"cover_noise_strength": 0.0,
"repainting_start": 0,
"repainting_end": -1,
"latent_shift": 0.0,
"latent_rescale": 1.0,
"custom_timesteps": "",
"task_type": "text2music",
"track": "",
"solver": "euler",
"lm_mode": "generate",
"output_format": "mp3",
"peak_clip": 10,
"mp3_bitrate": 128,
"synth_model": "",
"lm_model": "",
"adapter": "",
"adapter_scale": 1.0
}
synth_model, lm_model and adapter are resolved through the model
registry, scanned from --models <dir> (and --adapters <dir> for
adapters), by both the HTTP server and the CLI binaries (ace-lm,
ace-synth, ace-understand). Values are GGUF filenames without the
.gguf suffix; an empty string falls back to the first matching entry of the
registry. There is no CLI flag to bypass the JSON: model selection is a
property of the request, not of the command line.
lm_mode picks the LM instruction: "generate" (full: metadata + lyrics + codes),
"inspire" (short query to metadata + lyrics, no codes), "format" (caption +
lyrics to metadata + lyrics, no codes). output_format picks the audio
encoder: "mp3", "wav16", "wav24", "wav32".
caption (string, required)
Natural language description of the music style, mood, instruments, etc.
Fed to both the LLM and the DiT text encoder.
lyrics (string, default "")
Controls vocal generation. Three valid states:
"": LLM generates lyrics from the caption (phase 1 "Expand" prompt)."[Instrumental]": no vocals. Passed directly to the DiT, LLM skips lyrics generation.- Any other string: user-provided lyrics used as-is, LLM only fills missing metadata.
There is no instrumental flag. This field is the single source of truth for
vocal content.
bpm (int, default 0 = unset)
Beats per minute. LLM generates one if 0.
duration (float seconds, default 0 = unset)
Target audio duration. 0 means the LLM picks it. FSM constrains LLM output
to [10, 600]s; values <= 0 after generation fall back to 120s.
keyscale (string, default "" = unset)
Musical key and scale, e.g. "C major", "F# minor". LLM fills if empty.
timesignature (string, default "" = unset)
Time signature numerator as a string, e.g. "4" for 4/4, "3" for 3/4.
LLM fills if empty.
vocal_language (string, default "" = unset)
BCP-47 language code for lyrics, e.g. "en", "fr", "ja". Three states:
"": LLM detects the language via CoT and fills this field."unknown": explicit "no specific language" signal to the DiT.- Any language code: used as-is. When lyrics are being generated, the FSM constrains the LLM output to that language.
seed (int64, default -1 = random)
RNG seed for the DiT pipeline (Philox noise). The LM always uses a
random seed internally.
lm_batch_size (int, default 1)
Number of LM variations. Has no effect on ace-synth.
synth_batch_size (int, default 1)
Number of DiT variations per request. Works in all modes: text2music,
cover, repaint, lego, extract, complete. Combined with lm_batch_size, you get
lm_batch_size * synth_batch_size total outputs.
Three rules govern all batching, in both CLIs and the server:
- Each input JSON is executed independently, as if it were the only one.
- seed=-1 is resolved to a random value once per input JSON. An explicit seed is used as-is.
- lm_batch_size=N duplicates with consecutive LM-internal seeds.
- synth_batch_size=N duplicates with consecutive seed values.
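The three rules can be enumerated in a short sketch. This is a hypothetical helper for reasoning about the output count and seed layout, not the real seed derivation (the actual noise is Philox-based; here the seeds are just the consecutive integers the rules describe):

```python
import random

def resolve_outputs(requests):
    """One tuple per final output: (request_index, lm_variant, synth_seed)."""
    outputs = []
    for ri, req in enumerate(requests):
        base = req.get("seed", -1)
        if base == -1:
            base = random.randrange(2**31)  # resolved once per input JSON
        for lm in range(req.get("lm_batch_size", 1)):
            for sb in range(req.get("synth_batch_size", 1)):
                outputs.append((ri, lm, base + sb))  # consecutive synth seeds
    return outputs

outs = resolve_outputs([{"seed": 42, "lm_batch_size": 2, "synth_batch_size": 3}])
# 2 * 3 = 6 outputs; synth seeds 42..44, repeated for each LM variant
```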
audio_codes (string, default "")
Comma-separated FSQ token IDs produced by ace-lm. When non-empty, the
entire LLM pass is skipped and ace-synth decodes these codes directly
(passthrough mode).
audio_cover_strength (float, default 1.0)
Only used in cover mode. Fraction of DiT steps that see the source audio
as context. At 1.0 all steps use the source. At 0.0 no
steps use the source (pure text2music, source is ignored). Values below 1.0
switch DiT context to silence and encoder hidden states to text2music
instruction at the corresponding step. Lower values give more creative
freedom, higher values preserve more of the original structure.
Defaults to 1.0 for lego, extract, complete (context-switch inactive for these modes).
Ignored in repaint mode (the mask handles everything).
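The per-step context switch reduces to a threshold on the step index. A sketch, under the assumption (consistent with "0.5 = half the steps use source context") that the leading fraction of steps sees the source and the rest see silence:

```python
def step_uses_source(step, total_steps, strength):
    """True for the first round(strength * total_steps) denoising steps;
    remaining steps fall back to silence context + text2music instruction."""
    return step < round(strength * total_steps)

# 8-step turbo schedule with audio_cover_strength = 0.5:
flags = [step_uses_source(s, 8, 0.5) for s in range(8)]
# -> [True, True, True, True, False, False, False, False]
```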
cover_noise_strength (float, default 0.0)
Only used in cover mode. Blends initial noise with source latents before
diffusion starts. 0.0 = pure noise (default). 1.0 = start nearly identical
to the source. The schedule is truncated to the nearest timestep matching the
noise level. cover_steps is recalculated against the remaining steps.
repainting_start (float seconds, default 0)
repainting_end (float seconds, default -1)
Region boundaries for repaint and lego modes. Default end (-1) resolves
to source start when outpainting (start < 0), source duration otherwise.
Negative start pads silence before, end beyond source duration pads after.
Error if end <= start after adjustment.
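The boundary resolution rules above can be made concrete in a sketch (hypothetical helper; the real code also shifts coordinates by left_pad_sec before the latent operations):

```python
def resolve_repaint_region(start, end, src_duration):
    """Resolve repainting_start/repainting_end against the source duration.
    Negative start outpaints before the source; end beyond the duration
    outpaints after. Default end (-1) resolves to the source start when
    outpainting, else to the source duration."""
    if end == -1:
        end = 0.0 if start < 0 else src_duration
    if end <= start:
        raise ValueError("repainting_end must be greater than repainting_start")
    left_pad = max(0.0, -start)                # silence padded before the source
    right_pad = max(0.0, end - src_duration)   # silence padded after
    return start, end, left_pad, right_pad

# Inpaint seconds 10-25 of a 60s song: no padding.
inpaint = resolve_repaint_region(10.0, 25.0, 60.0)
# Outpaint 5s before the song: end defaults to 0, 5s of left padding.
outpaint = resolve_repaint_region(-5.0, -1, 60.0)
```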
task_type (string, default "text2music")
Controls the generation mode. This field is the single source of truth for
what the pipeline does and is always serialized in request round trips.
Values: text2music, cover, cover-nofsq, repaint, lego, extract, complete.
- text2music: standard text-to-music synthesis from silence.
- cover: re-synthesize source audio with a new style. FSQ roundtrip degrades source latents, so the DiT diverges freely. Requires --src-audio. audio_cover_strength controls how many DiT steps see the source.
- cover-nofsq: remix source audio without FSQ roundtrip. The DiT works on clean 25Hz VAE latents and stays close to the original. Requires --src-audio. Pass --ref-audio = --src-audio for best results.
- repaint: regenerate a time region of the source audio. Requires --src-audio. Negative start outpaints before, end beyond duration outpaints after.
- lego: generate a new instrument track in context of a backing track. Requires --src-audio and track. Base model only. Output is the generated track (behavior analogous to stem generation; the output mix vs isolated stem is model-dependent and unverified in this codebase). Supports an optional region constraint via repainting_start/end.
- extract: isolate a specific stem from a mixed source. Requires --src-audio and track. Base model only. LM is skipped (same as cover/repaint).
- complete: generate a full mix from a single isolated stem. Requires --src-audio (the isolated stem, e.g. a cappella vocals) and track (what to add, e.g. drums). Base model only. Output duration = source duration. The DiT regenerates all frames conditioned on the stem; it does NOT splice or extend temporally. track can be a pre-formatted string like "VOCALS | DRUMS" for multi-stem.
lego, extract, and complete always use the full source context
(audio_cover_strength defaults to 1.0 and the context-switch mechanism is inactive).
track (string, default "")
Track name for lego, extract, and complete modes. Standard names: vocals, backing_vocals, drums,
bass, guitar, keyboard, percussion, strings, synth, fx, brass,
woodwinds. Non-standard names produce a warning but are passed through.
These fields are parsed by ace-server but are not part of the C++ AceRequest
struct. They select which model to load from the --models directory.
synth_model (string, default "")
DiT model filename to use for /synth (e.g. "acestep-v15-turbo-Q8_0.gguf").
Empty string keeps the currently loaded DiT, or loads the first available one.
lm_model (string, default "")
LM model filename to use for /lm and /understand (e.g. "acestep-5Hz-lm-4B-Q8_0.gguf").
Empty string keeps the currently loaded LM, or loads the first available one.
adapter (string, default "")
Adapter name from the --adapters directory (e.g. "singer-v2.safetensors"
or "my-peft-adapter"). Empty string means no adapter. Changing the adapter
reloads the DiT (deltas are merged into weights at load time). Supported
algorithm today: LoRA.
adapter_scale (float, default 1.0)
Adapter scaling factor. Only used when adapter is set.
lm_temperature (float, default 0.85)
Sampling temperature for both phase 1 (lyrics/metadata) and phase 2 (audio
codes). Lower = more deterministic.
lm_cfg_scale (float, default 2.0)
Classifier-Free Guidance scale for the LM. Always active in phase 2 (audio
code generation). In phase 1, CFG is disabled whenever textual expansion is
happening (lyrics generation or CoT caption enrichment). In practice CFG
only applies to phase 1 when lyrics are provided AND use_cot_caption=false,
i.e. the LM is filling metadata fields without any free-text generation.
1.0 disables CFG.
lm_top_p (float, default 0.9)
Nucleus sampling cutoff. 1.0 disables.
lm_top_k (int, default 0 = disabled)
Top-K sampling. 0 disables hard top-K (top_p still applies).
lm_negative_prompt (string, default "")
Negative caption for CFG in phase 2. Empty string falls back to a
caption-less unconditional prompt.
use_cot_caption (bool, default true)
When true, the LLM enriches the user caption via CoT and the enriched
version is written to the output JSON (and fed to the DiT). When false,
the user caption is preserved verbatim. Only matters when the LLM runs
phase 1 (i.e. some metadata is missing). When all metadata is provided
phase 1 is skipped and the caption is never touched regardless of this flag.
inference_steps (int, default 0 = auto)
Number of diffusion denoising steps. 0 resolves from the loaded model:
turbo = 8, base/SFT = 50.
guidance_scale (float, default 0.0 = auto)
CFG scale for the DiT. 0.0 resolves to 1.0 (CFG disabled).
Any value > 1.0 on a turbo model is overridden to 1.0 with a warning.
shift (float, default 0.0 = auto)
Flow-matching schedule shift. Controls the timestep distribution.
Each timestep t is mapped to t' = s*t / (1 + (s-1)*t). 0.0 resolves from
the loaded model:
turbo = 3.0, base/SFT = 1.0.
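The mapping is easy to check numerically. A quick sketch showing that s=1 is the identity while s=3 pushes intermediate timesteps toward 1.0 (i.e. the schedule spends more of its steps at high noise levels):

```python
def shift_timestep(t, s):
    """Flow-matching schedule shift: t' = s*t / (1 + (s-1)*t).
    Fixes t=0 and t=1; s>1 bends intermediate timesteps upward."""
    return s * t / (1 + (s - 1) * t)

# s=1 is the identity; s=3 (the turbo default) maps the midpoint to 0.75.
mid_s1 = shift_timestep(0.5, 1.0)  # 0.5
mid_s3 = shift_timestep(0.5, 3.0)  # 0.75
```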
solver (string, default "euler")
Diffusion solver, resolved by solver_lookup() in src/solvers/.
Accepted values:
"euler": ODE Euler, first order, 1 NFE per step. Deterministic, same seed gives same result."sde": SDE Ancestral, 1 NFE plus Philox renoise per step. Predicts x0 from the velocity, then re-noises to the next timestep with a fresh Philox sample seeded byseed + step + 1. Reproducible bit for bit per seed."dpm3m": DPM++ 3M, third order Adams Bashforth multistep, 1 NFE per step. Stateful, bootstraps from Euler then AB2 then AB3."stork4": STORK 4 (4th order ROCK4 Chebyshev sub stepping), 1 NFE per step. Stateful, sub step count tunable viastork_substeps(default 10).
Turbo preset: inference_steps=8, guidance_scale=1.0, shift=3.0.
Base/SFT preset: inference_steps=50, guidance_scale=1.0, shift=1.0.
Usage: ./ace-lm --models <dir> --request <json> [options]
Required:
--models <dir> Directory of GGUF model files
--request <json> Input request JSON (carries lm_model)
Debug:
--max-seq <N> KV cache size (default: 8192)
--no-fsm Disable FSM constrained decoding
--no-fa Disable flash attention
--no-batch-cfg Split CFG into two separate forwards
--clamp-fp16 Clamp hidden states to FP16 range
--dump-logits <path> Dump prefill logits (binary f32)
--dump-tokens <path> Dump prompt token IDs (CSV)
The LM is picked from the request JSON via lm_model; empty value falls
to the first LM in the registry. lm_mode in the JSON selects the
instruction (generate, inspire, format). Three LM sizes: 0.6B (fast),
1.7B, 4B (best quality).
Batching is controlled by lm_batch_size in the request JSON (default 1).
Model weights are read once per decode step for all N sequences.
Usage: ./ace-synth --models <dir> --request <json...> [options]
Required:
--models <dir> Directory of GGUF model files
--request <json...> One or more request JSONs (from ace-lm --request)
Optional:
--adapters <dir> Directory of adapter files (enables JSON adapter field)
--src-audio <path> Source audio (WAV or MP3)
--ref-audio <path> Timbre reference audio (WAV or MP3)
Memory control:
--vae-chunk <N> Latent frames per tile (default: 1024)
--vae-overlap <N> Overlap frames per side (default: 64)
Debug:
--no-fa Disable flash attention
--no-batch-cfg Split DiT CFG into two separate forwards
--clamp-fp16 Clamp hidden states to FP16 range
--dump <dir> Dump intermediate tensors
Model selection comes from the first request JSON. synth_model picks
the DiT, adapter picks an adapter from --adapters, output_format
picks the output encoder (mp3, wav16, wav24, wav32). Models are loaded
once and reused across all requests.
When adapter is set, deltas are merged into the DiT projection weights
at load time (before QKV fusion and GPU upload). For LoRA, the safetensors
file is parsed directly, each lora_A/lora_B pair is multiplied
(alpha/rank * scale * B @ A), and the result is added to the base weight
in F32 before requantizing back to the original GGUF type. This is a
static merge: inference runs at full speed with no adapter overhead.
The registry accepts either a safetensors file or a directory containing
adapter_model.safetensors and adapter_config.json (PEFT format).
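The static merge amounts to W' = W + (alpha/rank) * scale * (B @ A). A numpy sketch of the arithmetic (illustration only; the real code operates on dequantized GGUF tensors and requantizes afterward):

```python
import numpy as np

def merge_lora(W, A, B, alpha, scale=1.0):
    """Static LoRA merge: add (alpha/rank * scale) * B @ A to the base
    weight in F32. rank is A's first dimension. After the merge,
    inference runs on the fused weight with no adapter overhead."""
    rank = A.shape[0]
    delta = (alpha / rank) * scale * (B.astype(np.float32) @ A.astype(np.float32))
    return W.astype(np.float32) + delta

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4)).astype(np.float32)
A = rng.standard_normal((2, 4)).astype(np.float32)  # rank = 2
B = rng.standard_normal((4, 2)).astype(np.float32)
W_scaled = merge_lora(W, A, B, alpha=2.0, scale=1.0)
W_zero = merge_lora(W, A, B, alpha=2.0, scale=0.0)  # scale=0: base unchanged
```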
--src-audio provides source content for cover, repaint, lego, extract and
complete tasks. The audio (WAV or MP3, any sample rate) is resampled to 48kHz
and VAE-encoded once. audio_cover_strength in the JSON controls how many
DiT steps use the source context (default 1.0). cover_noise_strength
blends the initial noise with source latents to start diffusion closer to
the source (default 0.0).
--ref-audio provides a timbre reference, independent of the task. The audio
is VAE-encoded and fed to the 4-layer timbre encoder, which pools to a single
embedding via frame[0]. This conditions the DiT to match the tonal quality of
the reference. When omitted, the timbre encoder receives a single silence
frame (no timbre conditioning).
Batching comes from two sources: multiple --request files on the CLI
(or JSON array on the server), and synth_batch_size inside each request.
Both are combined: 2 request files with synth_batch_size=3 yields 6 tracks
in one GPU pass.
HTTP server exposing the same pipelines as ace-lm, ace-synth, and
ace-understand. One binary, one port.
POST /lm, POST /synth, POST /understand and POST /vae are all asynchronous: they return a job ID immediately, push the request to a FIFO queue, and the single worker thread processes jobs in order. Clients poll GET /job?id=N for status and fetch results with GET /job?id=N&result=1. Cancel: POST /job?id=N&cancel=1 stops a specific job.
--models scans a directory for GGUF files and classifies each by its
general.architecture metadata into LM, Text-Enc, DiT, and VAE buckets.
All module lifetime decisions go through the ModelStore:
default mode keeps one GPU module resident at a time, --keep-loaded
keeps the whole working set resident. Requests are serialized by a single
worker thread, no GPU mutex.
The VRAM numbers below assume Q8 quantisation of a 1.7B LM and a 2B DiT.
"Peak" is what you need under the default (STRICT) mode, where only one
module is live at once. "Working set" is what you need under
--keep-loaded, where everything stays resident plus the VAE tile
activations during decode. Larger LMs (4B), XL DiT (4B) or heavier
quantisation raise both columns.
| Pipeline | Modules reached | Peak (default) | Working set (--keep-loaded) |
|---|---|---|---|
| LM | Qwen3 LM + KV cache | ~2-3 GB | ~2-3 GB |
| Synth | Qwen3 text-enc, cond-enc, DiT, VAE enc, VAE dec, FSQ tok/detok | ~2-3 GB (DiT or VAE tiles) | ~3-4 GB + tiles |
| Understand | Qwen3 LM, VAE enc, FSQ tok | ~2-3 GB (LM or VAE tiles) | ~2-3 GB + tiles |
VAE tile activations scale with --vae-chunk and --vae-overlap. Bigger
tiles process audio faster with fewer seams but cost more transient VRAM
during encode or decode; the project default targets 8 GB consumer cards.
Under STRICT those tiles get the full GPU budget to themselves; under
--keep-loaded they sit on top of everything else, so only larger cards can
afford it, earning back the latency STRICT spends reloading modules.
Endpoints whose pipeline has no models in the registry return 501.
Usage: ./ace-server --models <dir> [options]
Required:
--models <dir> Directory of GGUF model files
Adapter:
--adapters <dir> Directory of adapters
Memory control:
--keep-loaded Keep models in VRAM between requests
--vae-chunk <N> Latent frames per tile (default: 1024)
--vae-overlap <N> Overlap frames per side (default: 64)
Server:
--host <addr> Listen address (default: 127.0.0.1)
--port <N> Listen port (default: 8080)
--max-batch <N> LM batch limit (default: 1)
--max-seq <N> KV cache size (default: 8192)
Debug:
--no-fsm Disable FSM constrained decoding
--no-fa Disable flash attention
--no-batch-cfg Split CFG into two separate forwards (LM + DiT)
--clamp-fp16 Clamp hidden states to FP16 range
Examples:
# all models in one directory
./ace-server --models /path/to/models
# with adapters
./ace-server --models /path/to/models --adapters /path/to/adapters
# custom port and batch limit
./ace-server --models /path/to/models --host 0.0.0.0 --port 8085 --max-batch 2
POST /lm Submit LM generation, returns job ID
body: application/json AceRequest
response: {"id":"1"}
POST /synth Submit synth generation, returns job ID
body: application/json AceRequest or [AceRequest, ...]
body: multipart/form-data (request + audio|src_latents + ref_audio|ref_latents)
latents win over audio when both are sent on the same side
response: {"id":"2"}
POST /understand Submit understand, returns job ID
body: multipart/form-data (audio or src_latents required, optional request JSON)
response: {"id":"3"}
POST /vae Submit VAE encode or decode, returns job ID
body: multipart/form-data (exactly one of 'audio' or 'src_latents')
'audio' -> encode path (latents out)
'src_latents' -> decode path (audio out)
response: {"id":"4"}
GET /job?id=N Poll job status
response: {"status":"running|done|failed|cancelled"}
GET /job?id=N&result=1 Fetch job result
lm: application/json [AceRequest, ...]
synth: multipart/mixed (one audio part + one latent part per track, paired)
understand: multipart/mixed (one json part + one latent part for the source)
vae encode: application/octet-stream (raw .vae bytes, no audio echo: client already has it)
vae decode: audio/mpeg or audio/wav (raw, no latent echo: client already has it)
POST /job?id=N&cancel=1 Cancel a specific job
response: {"status":"cancelled"}
GET /health Server health check
response: {"status":"ok"}
GET /props Server config, models, presets, defaults
response: application/json
GET /logs SSE stream of server stderr
response: text/event-stream
GET / Embedded WebUI (gzipped HTML)
Latent payload format (src_latents, ref_latents, synth/understand response latent parts, /vae encode response body):
raw f32 little-endian, flat [T, 64], no header. T = size / 256. Same byte
layout neural-codec writes as .vae files. Hard cap T <= 15000 frames
(matches the silence_latent buffer baked into the DiT GGUF); payloads over
the cap get a 413.
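A minimal sketch of reading and writing that payload in Python (pure stdlib; the helper names are illustrative):

```python
# The latent payload: raw f32 little-endian, flat [T, 64], no header,
# so T = byte_size / 256.
import struct

FRAME_DIM = 64
FRAME_BYTES = FRAME_DIM * 4  # 64 f32 values = 256 bytes per frame

def latents_to_bytes(frames: list[list[float]]) -> bytes:
    return b"".join(struct.pack(f"<{FRAME_DIM}f", *f) for f in frames)

def bytes_to_latents(blob: bytes) -> list[list[float]]:
    assert len(blob) % FRAME_BYTES == 0, "payload must be whole frames"
    t = len(blob) // FRAME_BYTES  # T = size / 256
    return [list(struct.unpack(f"<{FRAME_DIM}f",
                               blob[i * FRAME_BYTES:(i + 1) * FRAME_BYTES]))
            for i in range(t)]

frames = [[0.0] * FRAME_DIM, [1.0] * FRAME_DIM]
blob = latents_to_bytes(frames)
print(len(blob))  # 512 bytes -> T = 2
```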
lm_model, synth_model, adapter, adapter_scale fields in the JSON body
select which model and adapter to load. lm_mode picks the LM instruction
("generate", "inspire", "format"); output_format picks the audio
encoder for /synth ("mp3", "wav16", "wav24", "wav32").
synth_batch_size duplicates a request for multiple DiT variations
(clamped to 9). Error responses are JSON: {"error":"message"} with 400,
500, 501, or 503 status.
GET /props returns available models, server configuration, and the default AceRequest (source of truth for webui dropdowns and placeholders):
{
"models": {
"lm": ["acestep-5Hz-lm-0.6B-Q8_0.gguf", "acestep-5Hz-lm-4B-Q8_0.gguf"],
"embedding": ["Qwen3-Embedding-0.6B-Q8_0.gguf"],
"dit": ["acestep-v15-turbo-Q8_0.gguf", "acestep-v15-xl-turbo-Q8_0.gguf"],
"vae": ["vae-BF16.gguf"]
},
"adapters": [],
"cli": { "max_batch": 1 },
"default": { "caption": "", "duration": 0, ... }
}
One worker thread consumes jobs from a FIFO queue. LM, synth and
understand requests all land in that same queue and run serially: the
worker calls the pipeline, which goes through the ModelStore for its
GPU modules. No GPU mutex, no try_lock, no 503 on busy: the HTTP handler
enqueues and returns a job id immediately, the client polls and the
worker runs in its own time.
Completed jobs sit in memory and are evicted FIFO once the pool exceeds
32 entries, so a disconnected client can poll and fetch the result after
reconnecting. Each job has its own cancel flag, set through POST /job?id=N&cancel=1 and polled by the worker between DiT or LM steps.
Model loading and eviction are not this section's concern: see
ModelStore for the full story. The short version: --keep-loaded keeps
everything resident across requests, while the default mode keeps one
module at a time.
Request bodies are limited to 256 MB (source + reference audio, up to 10 minutes WAV each).
GGML-native neural audio codec based on the Oobleck VAE encoder and decoder. Serves two purposes: validating the precision of the full VAE chain (encode + decode roundtrip), and compressing music at 6.8 kbit/s with no perceptible difference from the original.
Usage: ./neural-codec --vae <gguf> --encode|--decode -i <input> [-o <output>] [--q8|--q4]
Required:
--vae <path> VAE GGUF file
--encode | --decode Encode audio to latent, or decode latent to WAV
-i <path> Input (WAV/MP3 for encode, latent for decode)
Output:
-o <path> Output file (auto-named if omitted)
--q8 Quantize latent to int8 (~13 kbit/s)
--q4 Quantize latent to int4 (~6.8 kbit/s)
--format <fmt> WAV format: wav16, wav24, wav32 (default: wav16)
Output naming: song.wav -> song.vae (f32) or song.nac8 (Q8) or song.nac4 (Q4)
song.vae -> song.wav
Memory control:
--vae-chunk <N> Latent frames per tile (default: 1024)
--vae-overlap <N> Overlap frames per side (default: 64)
Latent formats (decode auto-detects):
.vae: flat [T, 64] f32, no header. ~51 kbit/s.
.nac8: header + per-frame Q8. ~13 kbit/s.
.nac4: header + per-frame Q4. ~6.8 kbit/s.
The encoder is the symmetric mirror of the decoder: same snake activations, same residual units, strided conv1d for downsampling instead of transposed conv1d for upsampling. No new GGML ops. Downsample 2x4x4x6x10 = 1920x.
48kHz stereo audio is compressed to 64-dimensional latent frames at 25 Hz. Three output formats, decode auto-detects from file content:
| Format | Frame size | Bitrate | 3 min song | vs .vae (cossim) |
|---|---|---|---|---|
| .vae | 256B | 51 kbit/s | 1.1 MB | baseline |
| .nac8 | 66B | 13 kbit/s | 290 KB | 0.9999 |
| .nac4 | 34B | 6.8 kbit/s | 150 KB | 0.989 |
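The bitrate column follows directly from frame size times the 25 Hz latent rate; a quick arithmetic check:

```python
# Bitrate = bytes per frame x 8 bits x 25 latent frames per second.
def kbps(frame_bytes: int, rate_hz: int = 25) -> float:
    return frame_bytes * 8 * rate_hz / 1000

print(kbps(256))  # .vae : 51.2 kbit/s
print(kbps(66))   # .nac8: 13.2 kbit/s
print(kbps(34))   # .nac4:  6.8 kbit/s
```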
NAC = Neural Audio Codec. The .nac8 and .nac4 headers are minimal: a
4-byte magic (NAC8 or NAC4) and a uint32 frame count. The .vae file is
the raw VAE encoder output (flat f32, no header), the same byte payload
the HTTP API exchanges as latent multipart parts.
Q8 quantization error is 39 dB below the VAE reconstruction error
(effectively free). Q4 quantization error is 16 dB below the VAE
reconstruction error (inaudible on most material).
# encode (Q4: 6.8 kbit/s, ~150 KB for 3 minutes)
./neural-codec --vae models/vae-BF16.gguf --encode --q4 -i song.wav -o song.nac4
# encode (Q8: 13 kbit/s, ~290 KB for 3 minutes)
./neural-codec --vae models/vae-BF16.gguf --encode --q8 -i song.wav -o song.nac8
# decode (auto-detects format)
./neural-codec --vae models/vae-BF16.gguf --decode -i song.nac4 -o song_decoded.wav
# roundtrip validation: compare song.wav and song_decoded.wav with your ears
Standalone MIT-licensed MPEG1 Layer III encoder and decoder. No external
dependencies. The encoder is used by ace-synth for MP3 output. The decoder
uses minimp3 (CC0). Reads WAV or MP3, writes WAV or MP3 (auto-detected
from output extension).
Usage: ./mp3-codec -i <input> -o <output> [options]
-i <path> Input file (WAV or MP3)
-o <path> Output file (WAV or MP3)
-b <kbps> Bitrate for MP3 encoding (default: 128)
--format <fmt> WAV format: wav16, wav24, wav32 (default: wav16)
Mode is auto-detected from output extension.
Examples:
./mp3-codec -i song.wav -o song.mp3
./mp3-codec -i song.wav -o song.mp3 -b 192
./mp3-codec -i song.mp3 -o song.wav
./mp3-codec -i song.mp3 -o song.wav --format wav32
Reverse pipeline: audio -> VAE encode -> FSQ tokenize -> LM understand -> metadata + lyrics. The output JSON is reusable as ace-lm or ace-synth input.
Usage: ./ace-understand --models <dir> --src-audio <path> [--request <json>] [options]
Required:
--models <dir> Directory of GGUF model files
--src-audio <path> Source audio (WAV or MP3, any sample rate)
Optional:
--request <json> Request JSON carrying model selection and
sampling params (lm_model, synth_model,
lm_temperature, lm_top_p, lm_top_k)
When no --request is given, understand defaults apply
(temperature 0.3, top_p disabled).
Output:
-o <json> Output JSON (default: stdout summary)
Memory control:
--vae-chunk <N> Latent frames per tile (default: 1024)
--vae-overlap <N> Overlap frames per side (default: 64)
Debug:
--max-seq <N> KV cache size (default: 8192)
--no-fsm Disable FSM constrained decoding
--no-fa Disable flash attention
--dump <dir> Dump tok_latents + tok_codes (skip LM)
Model selection comes from the request JSON (lm_model, synth_model,
vae).
ace-lm (Qwen3 causal LM, 0.6B/1.7B/4B)
Phase 1 (if needed): CoT generates bpm, keyscale, timesignature, lyrics
Phase 2: audio codes (5Hz tokens, FSQ vocabulary)
Both phases batched: N sequences per forward, weights read once
CFG with dual KV cache per batch element (cond + uncond)
Output: request0.json .. requestN-1.json
ace-synth
BPE tokenize
Qwen3-Embedding (28L text encoder)
CondEncoder (lyric 8L + timbre 4L + text_proj)
FSQ detokenizer (audio codes -> flow matching source latents)
Adapter merge (optional: LoRA safetensors delta -> dequant/merge/requant at load)
DiT (2B: 24L H=2048, XL: 32L H=2560, flow matching)
VAE (AutoencoderOobleck, tiled decode)
WAV stereo 48kHz
ace-understand (reverse pipeline)
Audio read (WAV/MP3, any rate -> 48kHz stereo)
VAE encode (tiled, AutoencoderOobleck encoder)
FSQ tokenize (latent -> 5Hz codes via 2L attention pooler)
Qwen3 LM (understand prompt: codes -> CoT metadata + lyrics)
FSM constrains CoT fields, audio codes blocked after </think>
No CFG, no batch. Single sequence, greedy-ish (temperature=0.3)
Output: JSON with caption, lyrics, bpm, key, duration, language
ace-lm is not a general-purpose chat engine. It is a two-phase autoregressive pipeline specialized for ACE-Step music generation.
Phase 1 (CoT) generates structured metadata (bpm, keyscale, timesignature, caption, duration, language) and optionally lyrics via chain-of-thought reasoning. An FSM (finite state machine) built from a prefix tree enforces valid field names and values at every decode step, hard-masking invalid tokens before sampling.
Phase 2 (audio codes) generates 5Hz FSQ tokens. The FSQ codec uses levels
[8,8,8,5,5,5] producing 64000 distinct codes (8x8x8x5x5x5). The tokenizer
reserves 65535 slots (audio_code_0 to audio_code_65534) appended to the base
Qwen3 vocabulary; the 1535 extra slots are unused by the codec. A partial LM
head projects only the audio code subrange of the embedding matrix, cutting
the output GEMM by 70% compared to full-vocab projection.
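The ~70% figure checks out arithmetically if the base Qwen3 vocabulary is the usual 151,936 tokens (an assumption; this document only states the 65,535 appended slots):

```python
# Rows of the output GEMM skipped by the partial LM head.
base_vocab = 151_936   # assumed Qwen3 base vocabulary size
audio_slots = 65_535   # audio_code_0 .. audio_code_65534 (from this doc)
full = base_vocab + audio_slots
saving = 1 - audio_slots / full
print(f"{saving:.0%}")  # ~70% of the full-vocab projection avoided
```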
Classifier-free guidance (CFG) is fused into the batch dimension: N
conditional and N unconditional sequences are packed into a single forward pass
(2N tokens, one weight read), then combined as
logits = uncond + scale * (cond - uncond). The KV cache is a single 4D tensor
[D, max_seq, Nkv, n_sets] shared across all batch elements and CFG paths. Shared
prompts are prefilled once and cloned to other KV sets via copy, avoiding redundant
prefills.
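The combine step can be written as a small reference function; the cond-first batch layout here is an assumption for illustration:

```python
# CFG combine over a fused batch of 2N sequences:
# logits = uncond + scale * (cond - uncond).
import numpy as np

def cfg_combine(logits_2n: np.ndarray, scale: float) -> np.ndarray:
    n = logits_2n.shape[0] // 2
    cond, uncond = logits_2n[:n], logits_2n[n:]  # assumed cond-first layout
    return uncond + scale * (cond - uncond)

logits = np.array([[2.0, 0.0], [1.0, 1.0]])  # N=1: cond row, uncond row
print(cfg_combine(logits, 2.0))  # [[3. -1.]]
```

With scale=1 the result reduces to the conditional logits, as expected.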
Test logs (turbo + SFT, seed 42, Philox noise, multiple quantizations):
tests/
Each script compares GGML C++ output against the Python reference
(cosine similarity per intermediate tensor). Requires the original
ACE-Step-1.5 repo cloned alongside acestep.cpp (../ACE-Step-1.5).
cd tests
python3 debug-lm-logits.py # Qwen3 LM: first-token logits GGML vs PyTorch (0.6B/1.7B/4B)
python3 debug-detok-cossim.py # FSQ detokenizer: step-by-step cossim C++ vs Python
python3 debug-dit-cossim.py # DiT: per-layer cossim GGML vs Python (turbo/SFT, BF16/quantized)
Uses a patched GGML fork (submodule) with two new ops, a Metal im2col optimization, and a CUDA bugfix for the Oobleck VAE decoder. All backends: CPU, CUDA, ROCm, Metal, Vulkan. F32/F16/BF16 data types. The DiT uses only standard GGML ops and needs no patches.
The VAE reconstructs audio from latent space through 5 upsampling blocks (total 1920x), each running a transposed convolution followed by 3 WaveNet-style residual units with dilated convolutions and Snake activations. A single tile builds a graph of 36 snake activations, 5 transposed convolutions, and 32 regular convolutions. At the final blocks, sequence lengths reach 491520 timesteps, which stresses GGML ops designed for short NLP sequences.
Computes y = x + sin^2(a * x) * inv_b in a single kernel. The Oobleck VAE calls this 36 times per tile. Without a fused op, each activation requires 5 separate GGML kernels (mul, sin, sqr, mul, add), causing 5x the memory traffic. The fused kernel reads x once and writes y once. BF16 cast nodes before/after each snake call halve memory bandwidth at the cost of negligible precision loss (cossim > 0.999 vs F32 baseline).
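A numpy reference of the fused op against its five-kernel equivalent, to show what GGML_OP_SNAKE computes:

```python
# Snake activation: fused form vs the 5 separate kernels it replaces.
import numpy as np

def snake_fused(x, a, inv_b):
    return x + np.sin(a * x) ** 2 * inv_b  # y = x + sin^2(a*x) * inv_b

def snake_unfused(x, a, inv_b):
    t = a * x       # kernel 1: mul
    t = np.sin(t)   # kernel 2: sin
    t = t * t       # kernel 3: sqr
    t = t * inv_b   # kernel 4: mul
    return x + t    # kernel 5: add

x = np.linspace(-3, 3, 7)
assert np.allclose(snake_fused(x, 0.5, 2.0), snake_unfused(x, 0.5, 2.0))
```

Both forms compute the same values; the fused kernel just reads x once and writes y once instead of materializing four intermediates.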
Gather-based reconstruction of a 1D signal from GEMM columns [K*OC, T_in] to
[T_out, OC], with fused padding crop via the p0 parameter.
Upstream ggml_conv_transpose_1d uses a naive kernel (one scalar FMA loop per output
element, no shared memory, no tensor cores). The VAE spends 40% of its FLOP budget on
transposed convolutions. We decompose each as mul_mat + col2im_1d, routing the heavy
GEMM through cuBLAS/BLAS/MPS tensor cores. The col2im_1d gather has a 2-iteration inner
loop and is pure bandwidth. BF16 cast nodes around col2im_1d halve the scatter bandwidth.
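The decomposition can be checked against a naive scalar loop in numpy; shapes and stride conventions here are illustrative rather than GGML's exact memory layout:

```python
# conv_transpose_1d decomposed as one GEMM plus a col2im_1d gather,
# verified against a naive scalar-loop implementation.
import numpy as np

def naive_convt1d(x, w, stride):
    # x: [IC, T_in], w: [IC, OC, K] -> out: [OC, (T_in-1)*stride + K]
    ic, t_in = x.shape
    _, oc, k = w.shape
    out = np.zeros((oc, (t_in - 1) * stride + k))
    for t in range(t_in):
        for kk in range(k):
            out[:, t * stride + kk] += w[:, :, kk].T @ x[:, t]
    return out

def gemm_col2im(x, w, stride):
    ic, t_in = x.shape
    _, oc, k = w.shape
    # GEMM: [OC*K, IC] @ [IC, T_in] -> columns [OC*K, T_in]
    cols = w.transpose(1, 2, 0).reshape(oc * k, ic) @ x
    # col2im_1d: accumulate each column back into the 1D output signal
    out = np.zeros((oc, (t_in - 1) * stride + k))
    for t in range(t_in):
        out[:, t * stride:t * stride + k] += cols[:, t].reshape(oc, k)
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal((3, 5))      # IC=3, T_in=5
w = rng.standard_normal((3, 2, 4))   # IC=3, OC=2, K=4
assert np.allclose(naive_convt1d(x, w, 2), gemm_col2im(x, w, 2))
```

The heavy work lands in the single matmul (routable to tensor cores); the col2im pass is the pure-bandwidth gather described above.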
The generic Metal kernel_im2col dispatches (IC, 1, OW) threadgroups with K threads
each. For the VAE's 1D convolutions with small kernels (k=1 or k=7), this wastes 78-97%
of SIMD lanes (7 or 1 active threads per 32-wide SIMD group). The dedicated
kernel_im2col_1d uses a flat dispatch identical to snake and col2im_1d:
(total/256, 1, 1) threadgroups with 256 threads, achieving full SIMD utilization.
The dispatch branches on is_2D at runtime; the 2D path and kernel are unchanged.
CUDA and Vulkan already use flat dispatch and are not affected.
VAE decode (M2 Pro 16GB, 86.8s audio @ 48kHz stereo):
| chunk | overlap | im2col | tiles | time |
|---|---|---|---|---|
| 256 | 64 | generic | 17 | 71.2s |
| 1024 | 16 | generic | 3 | 38.9s |
| 256 | 64 | im2col_1d | 17 | 31.8s |
| 1024 | 16 | im2col_1d | 3 | 18.3s |
Upstream im2col_kernel uses OW directly as grid dimension Y, which exceeds the CUDA
65535 gridDim limit on long sequences. The VAE calls ggml_conv_1d (im2col path) 32
times per tile at output widths up to 491520. Fixed with a grid-stride loop on OW and
MIN(OW, MAX_GRIDDIM_Z) clamping.
The GGML submodule diverges from upstream only by the addition of
GGML_OP_SNAKE and GGML_OP_COL2IM_1D. No existing upstream kernel is
modified. These ops are required; the VAE does not work without them.
An earlier approach patched the upstream naive ops instead of adding custom ones. Those patches were dropped. They are documented here in case someone wants to study the naive path:
conv_transpose_1d: bounded loop replacing O(T_in) brute-force, CUDA and Metal
im2col: grid-stride loop on OW to fix gridDim.y overflow for large tensors