
Add Codex CLI backend and reasoning-effort support#3

Open
idrassi wants to merge 2 commits into Kripner:master from idrassi:codex_support_and_reasoning_effort

Conversation

@idrassi

@idrassi idrassi commented Mar 24, 2026

Closes #2.

Summary

This PR adds OpenAI Codex CLI as a new OpenProver backend and separates backend selection from model selection, so Codex can use explicit model ids
such as gpt-5.4 and gpt-5.2.

It also adds reasoning-effort support for Claude and Codex backends.

This branch has also been merged with the current master, and the Codex changes were reconciled with the latest CLI/prover updates.

What changed

  • added a new CodexClient based on codex exec --json
  • added provider selection via --provider, --planner-provider, and --worker-provider
  • added Codex model selection via --provider codex --model <model> and --model codex:<model>
  • added --reasoning-effort, --planner-reasoning-effort, and --worker-reasoning-effort
  • preserved compatibility with Claude’s current CLI effort handling
  • enabled Codex web search for literature_search
  • enabled Codex MCP worker tools for Lean
  • refactored shared process cleanup for Claude/Codex
  • updated docs and added focused tests
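To illustrate the backend/model separation described above, here is a minimal sketch of how a `--model` value such as `codex:gpt-5.4` could be split into a provider and a model id. The function name, the provider set, and the fallback-to-Claude default are illustrative assumptions, not the PR's actual code.

```python
from typing import Optional, Tuple

# Assumption: the two backends present after this PR.
KNOWN_PROVIDERS = {"claude", "codex"}

def parse_model_spec(spec: str, default_provider: str = "claude") -> Tuple[str, Optional[str]]:
    """Split a 'provider:model' spec into (provider, model).

    A bare provider name (e.g. 'codex') selects that backend with its
    CLI-configured default model; anything else is treated as a model id
    for the default provider.
    """
    if ":" in spec:
        provider, model = spec.split(":", 1)
        return provider, model or None
    if spec in KNOWN_PROVIDERS:
        return spec, None  # e.g. --model codex -> Codex CLI's configured default
    return default_provider, spec
```

With this shape, `--model codex:gpt-5.4` and `--provider codex --model gpt-5.4` can resolve to the same (provider, model) pair.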

Notes

  • --model codex selects the Codex backend and uses the Codex CLI default configured model
  • codex exec --json does not stream partial assistant text, so the Codex soft interrupt is advisory and preserves the current response instead of truncating it
  • cost reporting for Codex is best-effort for known explicit GPT-5/Codex model ids
  • after merging latest master, this PR now coexists with the current backend set and CLI behavior
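The "best-effort" cost note above can be sketched as a lookup that simply declines to guess for unknown model ids. The per-token prices and the table itself are placeholders, not the PR's actual pricing data.

```python
from typing import Optional

# Placeholder (input, output) USD per million tokens -- illustrative values only.
_COST_PER_MTOK = {
    "gpt-5.4": (1.0, 8.0),
    "gpt-5.2": (0.5, 4.0),
}

def estimate_cost(model: str, in_tok: int, out_tok: int) -> Optional[float]:
    """Best-effort cost estimate: return None for unknown model ids
    rather than reporting a misleading number."""
    rates = _COST_PER_MTOK.get(model)
    if rates is None:
        return None
    return (in_tok * rates[0] + out_tok * rates[1]) / 1_000_000
```

Returning `None` lets the caller print "cost unknown" instead of a fabricated figure when the Codex default model is unrecognized.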

Validation

Tested with targeted pytest coverage and Ubuntu Linux 24.04 / WSL smoke runs, including:

python -m openprover --theorem examples/cauchy_schwarz.md --model codex:gpt-5.4 --headless --reasoning-effort high --max-time 720s

and Lean formalization / verification flow:

python -m openprover --theorem examples/cauchy_schwarz.md \
  --lean-project ~/mathlib4 \
  --lean-theorem examples/cauchy_schwarz.lean \
  --proof runs/for-any-real-numbers-a-1-ldots-a-n-20260324-113109/PROOF.md \
  --model codex:gpt-5.4 \
  --headless \
  --reasoning-effort xhigh

Also re-validated after merging latest master with:

python3 -m pytest tests/test_cli_models.py tests/test_claude_client.py tests/test_codex_client.py

@jjoshua2

jjoshua2 commented Mar 24, 2026

Can you have Opus be the planner and 5.4 be the worker?
EDIT: it appears that it does

@rnbguy

rnbguy commented Mar 24, 2026

hey, thanks for this!

Just a question: is there any reason you're using codex exec over codex app-server? I made a refactor of my own for exactly this, but I chose Codex's app-server.

I feel it is built for exactly this kind of use case. Also, codex exec doesn't support token streaming or include reasoning, unlike Claude.

@jjoshua2

Yes, I'm using the Opus planner and missing the streaming on my Codex worker. Submit your PR

@jjoshua2

jjoshua2 commented Mar 24, 2026

Do you think it makes sense to have a separate reasoning level for the verifier? I'm thinking a gpt-5.4 high worker with an xhigh verifier might be useful and cost-effective. Xhigh will often run into timeouts and context-length limits in my experience on some hard prompts... Verification is easier, though, and we want to be more sure it is correct.
Edit: I modified it to use an xhigh verifier and it's working OK so far. I've only had the verifier return true both before and after the change, though.

@idrassi
Author

idrassi commented Mar 25, 2026

@rnbguy Thanks. The main reason was scope and integration cost, not that I think exec is fundamentally better.

openprover already had a fairly simple "one call in/one result out" CLI wrapper shape for Claude, with per-call archiving, subprocess isolation and no long-lived backend process. codex exec --json fit that model directly, so it was the lowest-risk way to add Codex support without a larger architectural refactor.

I agree with your point about the tradeoff though: app-server looks like the better fit if we want stronger parity with Claude-style UX. The downside is that it pushes us into managing server lifecycle, transport/state, reconnection/error handling, etc., which I was trying to avoid for this first pass.

I have not reviewed your refactor in detail yet but that direction seems very plausible. If maintainers prefer the app-server approach, I would not object to moving the Codex backend that way.
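The "one call in / one result out" wrapper shape described above can be sketched as a single short-lived subprocess per request whose JSON-lines output is parsed for the final assistant message. The argv shape, the event schema (`"type"`/`"text"` keys), and the helper name are assumptions for illustration; a real Codex backend would pass something like `["codex", "exec", "--json", prompt]`.

```python
import json
import subprocess
from typing import List, Optional

def run_once(argv: List[str], timeout: float = 600.0) -> Optional[str]:
    """Spawn one subprocess per request, read its JSON-lines stdout, and
    return the last assistant message; no long-lived backend process."""
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    last: Optional[str] = None
    for line in proc.stdout.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except ValueError:
            continue  # tolerate non-JSON noise on stdout
        if event.get("type") == "assistant" and "text" in event:
            last = event["text"]  # keep only the final assistant event
    return last
```

Because each call owns its own process, per-call archiving and cleanup stay trivial; the cost is exactly the streaming limitation rnbguy points out, since nothing is surfaced until the process exits.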

@idrassi
Author

idrassi commented Mar 25, 2026

@jjoshua2 I think that makes sense.

Conceptually, verifier_reasoning_effort seems reasonable for exactly the reason you mentioned: worker search can be broad and expensive, verifier prompts are usually narrower, and spending more reasoning budget on verification than on generation can be a good tradeoff.

So I could definitely see value in something like --verifier-reasoning-effort, maybe later even --verifier-model.

That being said, if the verifier is mostly returning true both before and after the change, then the main bottleneck may not be effort alone. It may also mean we need a stricter verifier prompt/protocol, because extra reasoning only helps if the verifier is actually incentivized to search for flaws rather than mostly confirm.

Still, your result is useful signal. I would be in favor of treating verifier effort as a separate setting rather than coupling it to the worker long term.

@jjoshua2

jjoshua2 commented Mar 25, 2026 via email

@jjoshua2

I've got 48 folders of good data if anyone wants some to help progress the Euler problem; I think anyone can resume. Ideally we will have ways to go through and verify the existing work: redo the ones that were verified with high using xhigh, check that the ones done with 5.4 also pass Opus, or try to do specific sub-lemmas in Lean. We could distribute all of these to different people who have different subscriptions or local compute levels. Distributing future work is harder, but you could have a central planner and several people or OpenProver instances working...

# Conflicts:
#	README.md
#	openprover/cli.py
#	openprover/llm/__init__.py
#	openprover/llm/claude.py
#	openprover/prover.py
@idrassi
Author

idrassi commented Mar 30, 2026

I have pushed changes to resolve the conflicts that were blocking the merge.

@jjoshua2

jjoshua2 commented Apr 5, 2026

@idrassi do you want to collaborate on the proof constant limit improvements I found for 838?

@idrassi
Author

idrassi commented Apr 7, 2026

@jjoshua2 Sorry for the late answer. I'm happy to collaborate; it is just that I cannot guarantee bandwidth because of other activities.

How do you want to manage the collaboration? Maybe a private repo to exchange the dataset and track progress?



Development

Successfully merging this pull request may close these issues.

Add OpenAI Codex CLI support as an OpenProver backend