RATIS-2507. Fix java.lang.IllegalStateException: gap between entries#1439
ChenSammi wants to merge 3 commits into apache:master
// If the end index is smaller than lastIndexInSnapshot, it means the state machine state is inconsistent
// with raft log state, fail the RaftServerImpl.start() to keep the state untouched.
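The check described in this comment boils down to one comparison made before the server starts. Below is a minimal sketch of it, assuming hypothetical names (`StartupGuard`, `verifyLogCoversSnapshot`) and using -1 to mean "no entry" or "no snapshot". It is not the code from this patch, which ties the check to `RaftServerImpl.start()`.

```java
// Hypothetical sketch: StartupGuard and verifyLogCoversSnapshot are illustrative
// names, not Ratis APIs. -1 is assumed to mean "empty log" / "no snapshot".
final class StartupGuard {
  private StartupGuard() {}

  /**
   * Fails fast when the local raft log ends before the last index covered by
   * the latest snapshot, so that start() aborts without modifying any state.
   *
   * @param logEndIndex         last index present in the local raft log
   * @param lastIndexInSnapshot last log index included in the latest snapshot
   */
  static void verifyLogCoversSnapshot(long logEndIndex, long lastIndexInSnapshot) {
    if (logEndIndex < lastIndexInSnapshot) {
      throw new IllegalStateException("Raft log end index " + logEndIndex
          + " is smaller than lastIndexInSnapshot " + lastIndexInSnapshot
          + "; refusing to start so that the on-disk state stays untouched");
    }
  }

  public static void main(String[] args) {
    verifyLogCoversSnapshot(1000, 1000);  // log covers the snapshot: start proceeds
    try {
      verifyLogCoversSnapshot(500, 1000); // snapshot 1000 but log 500
    } catch (IllegalStateException e) {
      System.out.println("start() failed: " + e.getMessage());
    }
  }
}
```

The reply below argues that this condition can arise legitimately on a follower, so failing here is a policy decision, not a proof of corruption.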
... state machine state is inconsistent with raft log state ...
They don't have to be consistent:
- leader's log index is 1000, but a (slow) follower only has log index 500
- the follower dies
- leader purges the log before 700
- the follower restarts and reads log up to index 500
- the follower installs a snapshot at index 1000
- the follower dies again <--- snapshot 1000 but log 500
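The steps above can be replayed as plain index arithmetic. This is a sketch under stated assumptions: the follower had no earlier snapshot, the leader's snapshot is at index 1000, and the follower crashes before appending anything after installing it. `SnapshotGapScenario` and its methods are hypothetical, not Ratis APIs.

```java
// Hypothetical model of the scenario; names are illustrative only.
final class SnapshotGapScenario {
  /** The leader must send a snapshot once the follower's next entry has been purged. */
  static boolean needsInstallSnapshot(long followerNextIndex, long leaderLogStartIndex) {
    return followerNextIndex < leaderLogStartIndex;
  }

  /**
   * Returns {snapshotIndex, logEndIndex} of the follower after its second crash.
   * Assumes no prior snapshot (-1) and a leader snapshot at leaderLastIndex.
   */
  static long[] followerStateAfterSecondCrash(
      long leaderLastIndex, long followerLogEnd, long leaderLogStartIndex) {
    long snapshotIndex = -1;
    long nextIndex = followerLogEnd + 1; // what the follower asks for after restart
    if (needsInstallSnapshot(nextIndex, leaderLogStartIndex)) {
      snapshotIndex = leaderLastIndex;   // follower installs the leader's snapshot
    }
    // The follower dies before appending any new entry, so its log still ends here.
    return new long[] {snapshotIndex, followerLogEnd};
  }

  public static void main(String[] args) {
    long[] s = followerStateAfterSecondCrash(1000, 500, 700);
    System.out.println("snapshot " + s[0] + " but log " + s[1]); // prints "snapshot 1000 but log 500"
  }
}
```

Because entries 501 to 699 were purged on the leader, the follower can only catch up through the snapshot, which is why the snapshot index ends up ahead of the local log.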
@szetszwo, the above is exactly the case I saw. In this "snapshot 1000 but log 500" case, the current Ratis still starts the RaftServer, which I think is not a good choice.
For Ozone, it leads to an OM shutdown due to "java.lang.IllegalStateException: gap between entry term" (https://issues.apache.org/jira/browse/HDDS-15103). Do you have a better suggestion for the fix?
@ChenSammi, for HDDS-15103, let's try to create a unit test to reproduce it. Then it will be easier to see how to fix it.
Sorry, not HDDS-15103, but HDDS-15068.
HDDS-15103 is fixed.
For HDDS-15068, Ratis should not write to the RaftLog when there is a gap. Would you like to fix it in this PR (or file a new JIRA)? If not, I am happy to do it.
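One way to read "should not write to the RaftLog when there is a gap" is a guard that checks contiguity before persisting, instead of failing later when the gap is discovered. The sketch below is a hypothetical in-memory log (`GapCheckingLog`, `tryAppend` are illustrative names, not the Ratis `RaftLog` API). It deliberately ignores conflict truncation, which a real Raft follower must also handle.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory log illustrating a pre-write gap check.
final class GapCheckingLog {
  /** Minimal stand-in for a log entry: (term, index). */
  static final class Entry {
    final long term;
    final long index;
    Entry(long term, long index) { this.term = term; this.index = index; }
  }

  private final List<Entry> entries = new ArrayList<>();
  private final long startIndex; // first index this log expects to receive

  GapCheckingLog(long startIndex) { this.startIndex = startIndex; }

  /** The only index that can be appended without creating a gap. */
  long nextIndex() {
    return entries.isEmpty() ? startIndex : entries.get(entries.size() - 1).index + 1;
  }

  /** Writes the entry only if it is contiguous and its term does not decrease. */
  boolean tryAppend(Entry e) {
    if (e.index != nextIndex()) {
      return false; // gap (or overlap): leave the log untouched
    }
    if (!entries.isEmpty() && e.term < entries.get(entries.size() - 1).term) {
      return false; // terms along a Raft log never decrease
    }
    entries.add(e);
    return true;
  }

  int size() { return entries.size(); }

  public static void main(String[] args) {
    GapCheckingLog log = new GapCheckingLog(501);          // local log ends at 500
    System.out.println(log.tryAppend(new Entry(3, 1001))); // false: 501..1000 missing
    System.out.println(log.tryAppend(new Entry(3, 501)));  // true
    System.out.println(log.size());                        // 1
  }
}
```

Returning `false` keeps the decision (reset the log start, request a snapshot, or abort) with the caller, which is where the choice discussed in this thread would be made.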
Sure. We can reuse this JIRA.
szetszwo left a comment
+1, the change looks good.
What changes were proposed in this pull request?
Fail the RaftServer start when the RaftLog end index is smaller than the last snapshot index during startup.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/RATIS-2507
How was this patch tested?
Existing unit tests.