Add /props introspection endpoint and model meta block #81
Open
easel wants to merge 3 commits into
- Drop loopback gate on GET /props; the endpoint is now reachable like /v1/models, matching llama-server convention. Operators are responsible for binding to a trusted interface.
- Remove ctx_size / default_tokens fallback ladders in append_props_json by relying on the runtime config being unconditionally populated in main().
- Switch MTP introspection to ds4_engine_has_mtp() and ds4_engine_mtp_draft_tokens() so the flag reflects engine state rather than coupling to the cfg.engine.mtp_path argv shape.
- Pull sampling defaults (temperature, top_p, top_k, min_p, tool-call structural temperature) into named macros and use them in request_init, the thinking override path, and the /props payload. %g formatting keeps the current JSON byte-identical while letting future non-integer defaults serialize cleanly.
- Note that the kv.len read in append_props_json is intentionally lockless and consistent with other kv.* reads in the function.
- Comment the api.endpoints array as a sync point with client_main() routing.
- Add a short header comment on append_props_json describing its payload sections.
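The %g point in the commit notes above can be illustrated with a minimal sketch. The macro names, their values, and the helper `demo_append_number` are hypothetical stand-ins, not the identifiers or defaults used in ds4_server.c; the only thing shown is standard `printf` behavior, where %g prints an integer-valued double without a decimal point and still handles fractional values.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical values for illustration only; the PR text does not
 * state the real defaults. */
#define DEMO_DEFAULT_TEMPERATURE 1.0
#define DEMO_DEFAULT_TOP_P       0.95

/* Write one JSON member of the form "key":value into buf using %g.
 * Returns what snprintf returns: the length the text needs, not
 * counting the terminating NUL. */
static int demo_append_number(char *buf, size_t cap,
                              const char *key, double value) {
    return snprintf(buf, cap, "\"%s\":%g", key, value);
}
```

With these assumed values, a temperature of 1.0 serializes as `"temperature":1`, the same bytes a hard-coded integer literal would have produced, while 0.95 serializes as `"top_p":0.95`.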
The previous polish commit landed a 121-char single-line block comment, which is the longest single-line block comment in the file and past the ~90-char convention used elsewhere in ds4_server.c. Wrap it with the leading-asterisk style used by other multi-line block comments in this file.
Motivation
I'm working on benchmarking a bunch of different providers and model settings, and have found it helpful to be able to query the server directly to find out what its settings are. I had Claude and Codex gin up a llama-server style /props endpoint with a bunch of useful information. Happy to adjust as needed; the intent was for this to be a low-impact change of obvious value.
-- Claude's opinion follows --
Summary
Adds a single read-only HTTP endpoint, GET /props, that returns the server's full runtime configuration as JSON, and extends the existing /v1/models[/…] cards with a parallel meta block. Everything is sourced from a new server_runtime_config struct populated once in main() from the live engine / session / cfg, so the reported numbers are the ones actually in effect.
Prior art
- llama-server exposes GET /props for runtime/model/sampler introspection. This change borrows the endpoint name and the "single JSON document describing the live server" convention, but tailors the body to what ds4 actually has (DSML tool replay, KV disk cache, MTP, think-mode aliasing, routed-expert quant bits, context-memory estimator).
- The /v1/models schema is what append_model_json_values was already emitting before this branch: context_length, max_completion_tokens, supported_parameters. The new meta object slots in alongside those without breaking the existing shape.
- /props is intentionally not loopback-gated. This is the same posture as /v1/models and matches the llama-server convention. Binding to a trusted interface is the operator's responsibility.
Shape of the response
GET /props returns one JSON object with these sections:
- server: { name }
- model: id, name, path (nullable), routed_expert_quant_bits (nullable), mtp, mtp_draft_tokens
- runtime: backend, ctx_size, default_max_tokens, effective_max_completion_tokens (the min(ctx, default_tokens) clamp), threads, quality, warm_weights
- reasoning: supported_efforts, alias map (low/medium/xhigh→high), default, effective_default (resolved against the current ctx), think_max_min_context
- sampling: defaults, thinking_override, tool_protocol_sampling.structural_temperature, all driven from new DS4_DEFAULT_{TEMPERATURE,TOP_P,TOP_K,MIN_P} and DS4_TOOL_STRUCTURAL_TEMPERATURE macros that also feed request_init, the think-mode override path, and the tool-call structural-sampling branch in generate_job, so the introspection figures cannot drift from the runtime
- context_memory: total_bytes / raw_bytes / compressed_bytes / scratch_bytes plus prefill_cap / raw_cap / comp_cap, from ds4_context_memory_estimate
- kv_disk_cache: enabled, dir, budget_bytes, reject_different_quant, full policy block, entries
- tool_replay: exact_dsml_replay_enabled, max_ids, current_ids, current_bytes (tool memory, taken under tool_mu)
- api: endpoint list (kept in sync with the client_main() router, called out in a comment) and supported_request_parameters (reused from the /v1/models helper so the two surfaces cannot disagree)
/v1/models and /v1/models/deepseek-v4-flash now include a meta sub-object with backend, routed_expert_quant_bits, mtp / mtp_draft_tokens, and the reasoning fields. The pre-existing fields and their byte layout are unchanged (%g formatting on the new sampling defaults keeps the current integer-valued payload identical).
Notable implementation choices
- MTP introspection uses ds4_engine_has_mtp() / ds4_engine_mtp_draft_tokens() rather than sniffing cfg.engine.mtp_path, so the flag reflects engine state.
- The kv.len read in append_props_json is intentionally lockless, matching the rest of the kv.* reads in that function (worker-thread-only mutation); this is called out in a comment.
- Tests exercise the meta block and the /props payload in both enabled-disk-KV and disabled-disk-KV configurations.
Test plan
- make test (covers the three new test_model_metadata_contains_meta_fields / test_props_json_* cases)
- curl http://<host>:<port>/props | jq . against a running server and sanity-check runtime, sampling, context_memory, kv_disk_cache
- curl .../v1/models/deepseek-v4-flash | jq .meta to confirm the new sub-object
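When sanity-checking the runtime section, the one derived figure is effective_max_completion_tokens, described above as the min(ctx, default_tokens) clamp. A minimal sketch of that clamp follows; the function name `demo_effective_max_tokens` and the `int` parameter types are assumptions for illustration, not the code in ds4_server.c.

```c
/* Hypothetical helper mirroring the min(ctx, default_tokens) clamp:
 * a completion can never be allowed more tokens than the context holds. */
static int demo_effective_max_tokens(int ctx_size, int default_max_tokens) {
    return default_max_tokens < ctx_size ? default_max_tokens : ctx_size;
}
```

Under this sketch, a server with a context of 2048 and a default of 4096 tokens would report 2048, while a context of 8192 would leave the default of 4096 untouched.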