
Add Codex CLI backend and reasoning-effort support#3

Open
idrassi wants to merge 2 commits into Kripner:master from idrassi:codex_support_and_reasoning_effort

Conversation

@idrassi

@idrassi idrassi commented Mar 24, 2026

Closes #2.

Summary

This PR adds OpenAI Codex CLI as a new OpenProver backend and separates backend selection from model selection, so Codex can use explicit model ids
such as gpt-5.4 and gpt-5.2.

It also adds reasoning-effort support for Claude and Codex backends.

This branch has also been merged with the current master, and the Codex changes were reconciled with the latest CLI/prover updates.

What changed

  • added a new CodexClient based on codex exec --json
  • added provider selection via --provider, --planner-provider, and --worker-provider
  • added Codex model selection via --provider codex --model <model> and --model codex:<model>
  • added --reasoning-effort, --planner-reasoning-effort, and --worker-reasoning-effort
  • preserved compatibility with Claude’s current CLI effort handling
  • enabled Codex web search for literature_search
  • enabled Codex MCP worker tools for Lean
  • refactored shared process cleanup for Claude/Codex
  • updated docs and added focused tests
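To illustrate the backend/model separation described above, here is a minimal sketch of how a `--model` value such as `codex:gpt-5.4` could be split into a provider and a model id. The function name, the provider set, and the fallback-to-Claude default are illustrative assumptions, not the PR's actual code.

```python
from typing import Optional, Tuple

# Assumption: the two backends present after this PR.
KNOWN_PROVIDERS = {"claude", "codex"}

def parse_model_spec(spec: str, default_provider: str = "claude") -> Tuple[str, Optional[str]]:
    """Split a 'provider:model' spec into (provider, model).

    A bare provider name (e.g. 'codex') selects that backend with its
    CLI-configured default model; anything else is treated as a model id
    for the default provider.
    """
    if ":" in spec:
        provider, model = spec.split(":", 1)
        return provider, model or None
    if spec in KNOWN_PROVIDERS:
        return spec, None  # e.g. --model codex -> Codex CLI's configured default
    return default_provider, spec
```

With this shape, `--model codex:gpt-5.4` and `--provider codex --model gpt-5.4` can resolve to the same (provider, model) pair.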

Notes

  • --model codex selects the Codex backend and uses the Codex CLI default configured model
  • codex exec --json does not stream partial assistant text, so the Codex soft interrupt is advisory and preserves the current response instead of truncating it
  • cost reporting for Codex is best-effort for known explicit GPT-5/Codex model ids
  • after merging latest master, this PR now coexists with the current backend set and CLI behavior
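The "best-effort" cost note above can be sketched as a lookup that simply declines to guess for unknown model ids. The per-token prices and the table itself are placeholders, not the PR's actual pricing data.

```python
from typing import Optional

# Placeholder (input, output) USD per million tokens -- illustrative values only.
_COST_PER_MTOK = {
    "gpt-5.4": (1.0, 8.0),
    "gpt-5.2": (0.5, 4.0),
}

def estimate_cost(model: str, in_tok: int, out_tok: int) -> Optional[float]:
    """Best-effort cost estimate: return None for unknown model ids
    rather than reporting a misleading number."""
    rates = _COST_PER_MTOK.get(model)
    if rates is None:
        return None
    return (in_tok * rates[0] + out_tok * rates[1]) / 1_000_000
```

Returning `None` lets the caller print "cost unknown" instead of a fabricated figure when the Codex default model is unrecognized.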

Validation

Tested with targeted pytest coverage and Ubuntu Linux 24.04 / WSL smoke runs, including:

python -m openprover --theorem examples/cauchy_schwarz.md --model codex:gpt-5.4 --headless --reasoning-effort high --max-time 720s

and Lean formalization / verification flow:

python -m openprover --theorem examples/cauchy_schwarz.md \
  --lean-project ~/mathlib4 \
  --lean-theorem examples/cauchy_schwarz.lean \
  --proof runs/for-any-real-numbers-a-1-ldots-a-n-20260324-113109/PROOF.md \
  --model codex:gpt-5.4 \
  --headless \
  --reasoning-effort xhigh

Also re-validated after merging latest master with:

python3 -m pytest tests/test_cli_models.py tests/test_claude_client.py tests/test_codex_client.py

@jjoshua2

jjoshua2 commented Mar 24, 2026

Can you have Opus be the planner and 5.4 be the worker?
EDIT: it appears that it does

@rnbguy

rnbguy commented Mar 24, 2026

hey, thanks for this!

Just a question: is there any reason you're using codex exec over codex app-server? I made a refactor of my own for exactly this, but I chose Codex's app-server.

I feel it is built for exactly this kind of use case. Also, codex exec doesn't support token streaming or include reasoning, unlike Claude.

@jjoshua2

Yes, I'm using the Opus planner and missing the streaming on my Codex worker. Submit your PR

@jjoshua2

jjoshua2 commented Mar 24, 2026

Do you think it makes sense to have a separate reasoning level for the verifier? I'm thinking a gpt-5.4 high worker with an xhigh verifier might be useful and cost-effective. Xhigh will often run into timeouts and context-length limits in my experience on some hard prompts... Verification is easier, though, and we want to be more sure it is correct.
Edit: I modified it to use an xhigh verifier and it's working OK so far. I've only had the verifier return true both before and after the change, though.

@idrassi
Author

idrassi commented Mar 25, 2026

@rnbguy Thanks. The main reason was scope and integration cost, not that I think exec is fundamentally better.

openprover already had a fairly simple "one call in/one result out" CLI wrapper shape for Claude, with per-call archiving, subprocess isolation and no long-lived backend process. codex exec --json fit that model directly, so it was the lowest-risk way to add Codex support without a larger architectural refactor.

I agree with your point about the tradeoff though: app-server looks like the better fit if we want stronger parity with Claude-style UX. The downside is that it pushes us into managing server lifecycle, transport/state, reconnection/error handling, etc., which I was trying to avoid for this first pass.

I have not reviewed your refactor in detail yet but that direction seems very plausible. If maintainers prefer the app-server approach, I would not object to moving the Codex backend that way.
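The "one call in / one result out" wrapper shape described above can be sketched as a single short-lived subprocess per request whose JSON-lines output is parsed for the final assistant message. The argv shape, the event schema (`"type"`/`"text"` keys), and the helper name are assumptions for illustration; a real Codex backend would pass something like `["codex", "exec", "--json", prompt]`.

```python
import json
import subprocess
from typing import List, Optional

def run_once(argv: List[str], timeout: float = 600.0) -> Optional[str]:
    """Spawn one subprocess per request, read its JSON-lines stdout, and
    return the last assistant message; no long-lived backend process."""
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    last: Optional[str] = None
    for line in proc.stdout.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except ValueError:
            continue  # tolerate non-JSON noise on stdout
        if event.get("type") == "assistant" and "text" in event:
            last = event["text"]  # keep only the final assistant event
    return last
```

Because each call owns its own process, per-call archiving and cleanup stay trivial; the cost is exactly the streaming limitation rnbguy points out, since nothing is surfaced until the process exits.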

@idrassi
Author

idrassi commented Mar 25, 2026

@jjoshua2 I think that makes sense.

Conceptually, verifier_reasoning_effort seems reasonable for exactly the reason you mentioned: worker search can be broad and expensive, verifier prompts are usually narrower, and spending more reasoning budget on verification than on generation can be a good tradeoff.

So I could definitely see value in something like --verifier-reasoning-effort, maybe later even --verifier-model.

That being said, if the verifier is mostly returning true both before and after the change, then the main bottleneck may not be effort alone. It may also mean we need a stricter verifier prompt/protocol, because extra reasoning only helps if the verifier is actually incentivized to search for flaws rather than mostly confirm.

Still, your result is useful signal. I would be in favor of treating verifier effort as a separate setting rather than coupling it to the worker long term.

@jjoshua2

jjoshua2 commented Mar 25, 2026 via email

@jjoshua2

I've got 48 folders of good data if anyone wants some to help progress the Euler problem; I think anyone can resume. Ideally we will have ways to go through and verify the existing work: redo the ones that were verified with high using xhigh, check that the ones done with 5.4 also pass Opus, or try to do specific sub-lemmas in Lean. We could distribute all of these to different people who have different subscriptions or local compute levels. Distributing future work is harder, but you could have a central planner and several people or OpenProver instances working...

# Conflicts:
#	README.md
#	openprover/cli.py
#	openprover/llm/__init__.py
#	openprover/llm/claude.py
#	openprover/prover.py
@idrassi
Author

idrassi commented Mar 30, 2026

I have pushed changes to resolve the conflicts that were blocking the merge.

@jjoshua2

jjoshua2 commented Apr 5, 2026

@idrassi do you want to collaborate on the proof constant limit improvements I found for 838?

@idrassi
Author

idrassi commented Apr 7, 2026

@jjoshua2 Sorry for the late answer. I'm happy to collaborate; it is just that I cannot guarantee bandwidth because of other activities.

How do you want to manage the collaboration? Maybe a private repo to exchange the dataset and track progress?



Development

Successfully merging this pull request may close these issues.

Add OpenAI Codex CLI support as an OpenProver backend