@noel2004 noel2004 commented Nov 25, 2025

We faced a dilemma in setting collection_time (i.e. the time limit after which an assigned prove task is considered timed out): too small a value means a time-consuming task may never complete, since every submission is rejected as timed out; on the other hand, too large a timeout makes it take too long to reassign a task if the connection to the assigned prover is lost.

This PR proposes accepting a proof submission even after the task has timed out: there is no good reason to reject a result that can be verified. With this fix we can shorten the reassignment interval without worrying about the permanent failure of an occasional long-running task. Timeout failures are still counted.

Summary by CodeRabbit

Bug Fixes

  • Enhanced timeout proof handling in the validator. Proof submissions with timeout failures are now processed through the complete validation flow instead of being rejected immediately. Timeout events are now tracked and logged as warnings for improved system monitoring and visibility.

@noel2004 noel2004 requested a review from Thegaram November 25, 2025 13:01
coderabbitai bot commented Nov 25, 2025

Walkthrough

A timeout proof validation path was modified to remove early rejection. Instead of immediately returning an error upon detecting a timeout, the validator now increments a counter, logs a warning, and allows validation to proceed.
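The difference between the two paths can be sketched in a few lines of Go. This is only an illustration under stated assumptions: `timeoutValidator`, `checkTimeoutOld`, `checkTimeoutNew`, and the plain integer counter are hypothetical stand-ins, not the coordinator's actual types, and the old path follows the sequence diagram in this walkthrough rather than the real source.

```go
package main

import (
	"errors"
	"fmt"
)

// failureType loosely mirrors the idea of a prover task failure type.
// The names and values here are illustrative, not the real enum.
type failureType int

const (
	failureUndefined failureType = iota
	failureTimeout
)

// errProofTimeout stands in for the validator's timeout error.
var errProofTimeout = errors.New("validator failure: proof timeout")

// timeoutValidator keeps a plain integer in place of a real metrics counter.
type timeoutValidator struct {
	timeoutCount int
}

// checkTimeoutOld models the previous behavior: a submission for a
// timed-out task is rejected immediately.
func (v *timeoutValidator) checkTimeoutOld(ft failureType) error {
	if ft == failureTimeout {
		fmt.Println("error: proof submitted after task timeout, rejecting")
		return errProofTimeout
	}
	return nil
}

// checkTimeoutNew models the new behavior: count the event, log a warning,
// and return nil so the caller continues with the rest of validation.
func (v *timeoutValidator) checkTimeoutNew(ft failureType) error {
	if ft == failureTimeout {
		v.timeoutCount++
		fmt.Println("warn: proof submitted after task timeout, continuing validation")
	}
	return nil
}

func main() {
	v := &timeoutValidator{}
	fmt.Println("old path result:", v.checkTimeoutOld(failureTimeout))
	fmt.Println("new path result:", v.checkTimeoutNew(failureTimeout))
	fmt.Println("timeouts counted:", v.timeoutCount)
}
```

In this sketch the new path never returns an error for a timeout, so whether a late proof is finally accepted depends entirely on the checks that run before and after it.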

Changes

Cohort / File(s) | Summary
Proof timeout validation logic: coordinator/internal/logic/submitproof/proof_receiver.go | Removed early return on timeout detection; now increments validateFailureProverTaskTimeout counter and logs warning while continuing validation flow

Sequence Diagram(s)

sequenceDiagram
    participant V as Validator
    participant C as Counter
    participant L as Logger
    
    rect rgb(230, 240, 255)
    Note over V: Old behavior
    V->>V: Detect timeout?
    alt Timeout found
        V->>L: Log error
        V-->>V: Return ErrValidatorFailureProofTimeout
    end
    end
    
    rect rgb(240, 250, 230)
    Note over V: New behavior
    V->>V: Detect timeout?
    alt Timeout found
        V->>C: Increment validateFailureProverTaskTimeout
        V->>L: Log warning
        V->>V: Continue validation
    end
    end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Verify that the timeout counter increment is wired correctly and accessible for monitoring
  • Confirm downstream validation logic handles timeout cases appropriately without the early guard
  • Check that the warning log provides sufficient context for debugging timeout-related issues

Poem

🐰 A timeout need not end the race,
We hop around with measured grace,
Count and warn, then carry on,
Till validation's finally done!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name | Status | Explanation | Resolution
Description check | ⚠️ Warning | Description lacks required PR title section with conventional commits checkbox and breaks down, missing the structured template with checkboxes for deployment tags and breaking changes. | Add the required PR title, deployment tag versioning, and breaking change label sections with checkboxes as specified in the description template.
✅ Passed checks (2 passed)
Check name | Status | Explanation
Title check | ✅ Passed | Title indicates a bug fix and clearly describes accepting timed-out proof submissions, aligning with the main change.
Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

@noel2004 noel2004 requested a review from georgehao November 25, 2025 13:01
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between edb5123 and e1e5278.

📒 Files selected for processing (1)
  • coordinator/internal/logic/submitproof/proof_receiver.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
coordinator/internal/logic/submitproof/proof_receiver.go (1)
common/types/db.go (2)
  • ProverTaskFailureType (99-99)
  • ProverTaskFailureTypeTimeout (105-105)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: check
  • GitHub Check: tests
  • GitHub Check: tests

Comment on lines +364 to +369
    // if prover task FailureType is SessionInfoFailureTimeout, the submit proof is timeout, but we still accept it
    if types.ProverTaskFailureType(proverTask.FailureType) == types.ProverTaskFailureTypeTimeout {
        m.validateFailureProverTaskTimeout.Inc()
        log.Warn("proof submit proof have timeout", "hash", proofParameter.TaskID, "taskType", proverTask.TaskType,
            "proverName", proverTask.ProverName, "proverPublicKey", pk, "proofTime", proofTimeSec)
    }
⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Timeout now only logs/metrics and proceeds — double‑check interaction with ProvingStatus guard and clean up comment/log text

The new behavior (incrementing validateFailureProverTaskTimeout and logging, but not returning) matches the PR goal of accepting late proofs after a timeout. A couple of points to verify/tidy up:

  1. Interaction with “submit twice” guard (lines 315‑329)
    If timeouts elsewhere in the system are recorded by setting the task’s ProvingStatus to types.ProverProofInvalid (or Valid), then the early guard:

    if types.ProverProveStatus(proverTask.ProvingStatus) == types.ProverProofValid ||
        types.ProverProveStatus(proverTask.ProvingStatus) == types.ProverProofInvalid {
        ...
        return ErrValidatorFailureProverTaskCannotSubmitTwice
    }

    will fire before this timeout block, and late proofs for such tasks will still be rejected. For the new behavior to be effective, timed‑out tasks you still want to accept later proofs for must not be marked ProverProofValid/ProverProofInvalid before the submission arrives. Please confirm that your timeout handling only sets FailureType to ProverTaskFailureTypeTimeout (see common/types/db.go:98-104) and leaves ProvingStatus in a state that passes this guard.

  2. Metrics semantics
    validateFailureProverTaskTimeout is documented as “validate failure timeout”, but it now increments even for proofs you accept and verify successfully. That’s probably fine if you interpret it as “number of submissions whose task had previously timed out”, but it will no longer be a strict subset of validateFailureTotal (which only increments on non‑nil err). Just keep this in mind for dashboards/alerts.

  3. Minor comment & log clarity (optional polish)

    • The comment mentions SessionInfoFailureTimeout, but the actual enum is ProverTaskFailureTypeTimeout.
    • The log message "proof submit proof have timeout" is hard to read.

    You could make these clearer with something like:

    - // if prover task FailureType is SessionInfoFailureTimeout, the submit proof is timeout, but we still accept it
    + // If prover task FailureType is ProverTaskFailureTypeTimeout, the submission has timed out, but we still accept it.
      if types.ProverTaskFailureType(proverTask.FailureType) == types.ProverTaskFailureTypeTimeout {
          m.validateFailureProverTaskTimeout.Inc()
    -     log.Warn("proof submit proof have timeout", "hash", proofParameter.TaskID, "taskType", proverTask.TaskType,
    -         "proverName", proverTask.ProverName, "proverPublicKey", pk, "proofTime", proofTimeSec)
    +     log.Warn("proof submission timed out but will be verified", "hash", proofParameter.TaskID, "taskType", proverTask.TaskType,
    +         "proverName", proverTask.ProverName, "proverPublicKey", pk, "proofTime", proofTimeSec)
      }
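
On the metrics point (item 2 above), a small Go sketch shows why the timeout counter stops being a subset of the failure total once late proofs are accepted. The type `submissionMetrics`, its fields, and the `record` method are hypothetical; plain integers stand in for the real counters, and the real validator performs many more checks than this.

```go
package main

import (
	"errors"
	"fmt"
)

// errInvalidProof stands in for any validation error.
var errInvalidProof = errors.New("proof verification failed")

// submissionMetrics uses plain integers in place of real metrics counters.
type submissionMetrics struct {
	failureTotal   int // incremented only when validation returns an error
	timeoutFlagged int // incremented whenever the task was already marked as timed out
}

// record models one submission. timedOut says whether the task had been
// marked as timed out before the proof arrived; verifyOK says whether the
// proof passes verification.
func (m *submissionMetrics) record(timedOut, verifyOK bool) error {
	if timedOut {
		// Counted even if the proof is later accepted.
		m.timeoutFlagged++
	}
	if !verifyOK {
		m.failureTotal++
		return errInvalidProof
	}
	return nil
}

func main() {
	m := &submissionMetrics{}
	_ = m.record(true, true)   // late but valid proof: accepted
	_ = m.record(true, true)   // another late but valid proof: accepted
	_ = m.record(false, false) // on-time but invalid proof: rejected
	fmt.Printf("failureTotal=%d timeoutFlagged=%d\n", m.failureTotal, m.timeoutFlagged)
}
```

With two accepted late proofs and one ordinary rejection, the timeout count ends up larger than the failure total, so any dashboard that assumes the former is a subset of the latter would need adjusting.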
    
    
    
    
    

To help verify how timeouts are recorded across the codebase, you can run:


🏁 Script executed:

#!/bin/bash
# Inspect usages of ProverTaskFailureTypeTimeout and related status updates
rg -n "ProverTaskFailureTypeTimeout" -C3
rg -n "UpdateProverTaskProvingStatusAndFailureType" -C3

Length of output: 6120


Critical: Timeout-marked tasks are rejected before reaching the new acceptance logic

The script output reveals a blocking issue with the intended behavior. When a timeout is detected, collect_proof.go:237 calls:

UpdateProverTaskProvingStatusAndFailureType(c.ctx, assignedProverTask.UUID, types.ProverProofInvalid, types.ProverTaskFailureTypeTimeout, tx)

This sets ProvingStatus to ProverProofInvalid. However, the guard at lines 315–329 rejects submissions before the new timeout handler at lines 364–369 can execute:

if types.ProverProveStatus(proverTask.ProvingStatus) == types.ProverProofValid ||
    types.ProverProveStatus(proverTask.ProvingStatus) == types.ProverProofInvalid {
    ...
    return ErrValidatorFailureProverTaskCannotSubmitTwice
}

Result: Late proofs for timed-out tasks will always be rejected by the "cannot submit twice" guard, never reaching the new acceptance logic. The PR's goal cannot be achieved with this implementation.

Fix: Either (1) skip the "submit twice" guard when FailureType == ProverTaskFailureTypeTimeout, or (2) leave timed-out tasks with ProvingStatus unset (or set to a non-rejected state) so they pass the guard. The latter likely requires changes to collect_proof.go:237 as well.
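
Option (1) could look roughly like the following Go sketch. Everything here is an assumption made for illustration: `proverTaskSketch`, `checkSubmitTwice`, and the status and failure constants are hypothetical simplifications, not the coordinator's real types or the change that was actually merged.

```go
package main

import (
	"errors"
	"fmt"
)

// provingStatus and failureKind are simplified stand-ins for the real enums.
type provingStatus int
type failureKind int

const (
	statusUndefined provingStatus = iota
	statusAssigned
	statusValid
	statusInvalid
)

const (
	failureNone failureKind = iota
	failureTimeout
)

// errCannotSubmitTwice stands in for the "cannot submit twice" error.
var errCannotSubmitTwice = errors.New("prover task cannot submit twice")

// proverTaskSketch carries only the two fields the guard inspects.
type proverTaskSketch struct {
	Status  provingStatus
	Failure failureKind
}

// checkSubmitTwice rejects tasks that already have a final status, except
// tasks that were marked invalid only because they timed out.
func checkSubmitTwice(t proverTaskSketch) error {
	finalized := t.Status == statusValid || t.Status == statusInvalid
	timedOut := t.Status == statusInvalid && t.Failure == failureTimeout
	if finalized && !timedOut {
		return errCannotSubmitTwice
	}
	return nil
}

func main() {
	cases := []proverTaskSketch{
		{Status: statusAssigned, Failure: failureNone},   // first submission
		{Status: statusInvalid, Failure: failureTimeout}, // late proof after timeout
		{Status: statusInvalid, Failure: failureNone},    // already rejected proof
		{Status: statusValid, Failure: failureNone},      // already accepted proof
	}
	for _, c := range cases {
		fmt.Printf("status=%d failure=%d -> %v\n", c.Status, c.Failure, checkSubmitTwice(c))
	}
}
```

A guard like this only stays safe if accepting or rejecting the late proof also updates the stored status and failure type; otherwise a timed-out task could be submitted repeatedly.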

🤖 Prompt for AI Agents
In coordinator/internal/logic/submitproof/proof_receiver.go around lines
315–369, the current "cannot submit twice" guard rejects tasks already marked
ProverProofInvalid before the timeout-acceptance logic at 364–369 runs; to fix,
either update the guard (lines ~315–329) to allow submissions when
proverTask.FailureType == types.ProverTaskFailureTypeTimeout by skipping the
ProverProofInvalid/ProverProofValid check for timeout-marked tasks, or instead
change the code that marks timeouts (collect_proof.go:237) so it does not set
ProvingStatus to ProverProofInvalid for timeout cases (leave it unset or set to
a non-rejecting state) so the existing guard won’t block late proofs—pick one
approach and implement the corresponding change consistently across both files.

@codecov-commenter
Codecov Report

❌ Patch coverage is 0% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 36.49%. Comparing base (9100a0b) to head (e1e5278).
⚠️ Report is 2 commits behind head on develop.

Files with missing lines | Patch % | Lines
...nator/internal/logic/submitproof/proof_receiver.go | 0.00% | 4 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1764      +/-   ##
===========================================
- Coverage    36.54%   36.49%   -0.05%     
===========================================
  Files          247      247              
  Lines        21186    21185       -1     
===========================================
- Hits          7742     7732      -10     
- Misses       12614    12631      +17     
+ Partials       830      822       -8     
Flag Coverage Δ
coordinator 32.34% <0.00%> (-0.25%) ⬇️
