WIP: feat(storage): add transaction support with journal, undo, and crash recovery#431
Open
…recovery Implement a full transaction system for VikingFS storage operations including write-ahead journal, path locking, undo/rollback, context manager API, and crash recovery. Includes comprehensive tests and documentation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tions Add end-to-end tests covering rollback scenarios that were missing:
- mv rollback: file moved back to original location on failure
- mv commit: file persists at new location
- Multi-step rollback: mkdir + write + mkdir all reversed in order
- Partial step rollback: only completed entries are reversed
- Nested directory rollback: child removed before parent
- Best-effort rollback: single step failure does not block others
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Description
Implement a full transaction mechanism for VikingFS storage operations, including write-ahead journal (WAL), undo/rollback, path locking, context manager API, and crash recovery. Core write operations (rm, mv, add_resource, session.commit) now have atomicity guarantees: automatic rollback on failure and automatic recovery on restart after a process crash.

cc @r266-tech
Related Issue
Closes #390
RFC Discussion: #115
Type of Change
Scope & Limitations
Important
This design is for single-node deployments only and does not support multi-node/distributed scenarios.
- … .path.ovlock), path_lock.py:21-29
- /local/_system/transactions/
- queue_messages table, RecoverStale resets crashed processing messages on startup
- FOR UPDATE SKIP LOCKED for concurrent dequeue, one table per queue, MySQL protocol driver
- … deleted=1), Ack is a no-op, RecoverStale not implemented
- RecoverStale is a no-op

Architecture
Transaction Core
Transaction State Machine
- INIT: TransactionContext.__aenter__ creates record
- ACQUIRE: acquire_lock_point/subtree/mv begins
- EXEC
- COMMIT: tx.commit()
- FAIL: __aexit__ without commit
- RELEASING
- RELEASED

Key design: The journal is written before lock acquisition (context_manager.py:66-68), and init_info records lock_paths. This way, even if a crash occurs after successful locking but before the journal's locks list is updated, the recovery process can still find and clean up orphan lock files via init_info.lock_paths.

Lock Design
Path locks are based on lock files (.path.ovlock) whose content is a fencing token in the format {tx_id}:{time_ns}:{lock_type}.

Two lock types & conflict detection:

Three locking modes (TransactionContext):
- point: acquire_lock_point, used by add_resource, session.commit
- subtree: acquire_lock_subtree, used by rm (directories)
- mv: acquire_lock_mv, used by mv

Livelock prevention: When two transactions simultaneously write lock files and detect a conflict, the one with the larger (timestamp, tx_id) backs off and retries.

Stale lock cleanup: Lock files contain nanosecond timestamps. Locks held longer than lock_expire (default 300s) are considered remnants of crashed processes and can be force-released.

Per-Command Consistency Design
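Going back to the lock-file mechanics described under Lock Design, the token format, the stale-lock check, and the livelock back-off rule can be sketched roughly as follows. This is a minimal illustration, not the code in path_lock.py: the names `FencingToken`, `parse_token`, `is_stale`, and `should_back_off` are hypothetical, and the "POINT" spelling of the lock type is an assumption. Only the `{tx_id}:{time_ns}:{lock_type}` layout, the 300s default, and the "larger (timestamp, tx_id) backs off" rule come from the description.

```python
import time
from dataclasses import dataclass
from typing import Optional

LOCK_EXPIRE_SECONDS = 300  # default lock_expire stated in the PR description


@dataclass(frozen=True)
class FencingToken:
    """Hypothetical in-memory form of a lock file's content."""
    tx_id: str
    time_ns: int
    lock_type: str  # assumed spelling, e.g. "POINT" or "SUBTREE"

    def encode(self) -> str:
        # Layout from the description: {tx_id}:{time_ns}:{lock_type}
        return f"{self.tx_id}:{self.time_ns}:{self.lock_type}"


def parse_token(content: str) -> FencingToken:
    # Split from the right so a tx_id that itself contains ':' still parses.
    tx_id, time_ns, lock_type = content.rsplit(":", 2)
    return FencingToken(tx_id, int(time_ns), lock_type)


def is_stale(token: FencingToken, now_ns: Optional[int] = None,
             expire_s: int = LOCK_EXPIRE_SECONDS) -> bool:
    # A lock held strictly longer than expire_s is treated as a crash remnant.
    now_ns = time.time_ns() if now_ns is None else now_ns
    return now_ns - token.time_ns > expire_s * 1_000_000_000


def should_back_off(mine: FencingToken, other: FencingToken) -> bool:
    # On a mutual conflict, the larger (timestamp, tx_id) pair yields.
    return (mine.time_ns, mine.tx_id) > (other.time_ns, other.tx_id)
```

Comparing the `(time_ns, tx_id)` tuple gives a total order even when two timestamps collide, so exactly one of two conflicting transactions backs off.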
- rm()
  - Lock: subtree(target); File: point(parent)
  - Steps: 2. Delete VectorDB; 3. Delete FS
  - (FS delete marked non-reversible, skip)
- mv()
  - Lock: mv: SUBTREE(src) + POINT(dst parent)
  - Steps: 2. VectorDB URI batch update
- add_resource (finalize)
  - Lock: point(parent dir)
  - Steps: 2. post_action(enqueue_semantic)
- session.commit
  - Lock: point(session_path) × 2 independent transactions
  - Phase 1: Archive messages → clear → checkpoint(archived)
  - LLM call (no transaction, safe to retry)
  - Phase 2: Write memories → checkpoint(completed) → post_action(enqueue_semantic)
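The session.commit flow above (two independent transactions with an unprotected LLM call between them) can be illustrated with a small toy. This is a sketch under assumptions, not the PR's implementation: `ToyTransaction` is a hypothetical stand-in for TransactionContext, `summarize` stands in for the LLM call, `commit()` is assumed to be synchronous, and the log strings merely mirror the step names listed above.

```python
import asyncio


class ToyTransaction:
    """Hypothetical stand-in for TransactionContext; records events in a list."""

    def __init__(self, log, name):
        self.log, self.name = log, name
        self.committed = False
        self.post_actions = []

    async def __aenter__(self):
        self.log.append(f"{self.name}: begin")
        return self

    def commit(self):
        self.committed = True

    async def __aexit__(self, exc_type, exc, tb):
        if self.committed:
            # Post-actions only run once the transaction has committed.
            for action in self.post_actions:
                action()
            self.log.append(f"{self.name}: committed")
        else:
            # Leaving the block without commit() is the failure path.
            self.log.append(f"{self.name}: rolled back")
        return False  # never swallow the caller's exception


async def session_commit(log, summarize):
    async with ToyTransaction(log, "phase1") as tx:
        log.append("archive messages")
        log.append("clear")
        log.append("checkpoint(archived)")
        tx.commit()

    # Outside any transaction: a failure here leaves phase 1 durable,
    # so the call can simply be retried.
    summary = await summarize()

    async with ToyTransaction(log, "phase2") as tx:
        log.append(f"write memories: {summary}")
        log.append("checkpoint(completed)")
        tx.post_actions.append(lambda: log.append("enqueue_semantic"))
        tx.commit()
```

Keeping the slow call outside both transactions means no path lock is held while it runs, at the cost of needing the archived checkpoint to know where to resume.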
Crash Recovery
On TransactionManager.start(), the manager scans /local/_system/transactions/ for residual journals and chooses a recovery strategy by status:

- COMMITTED + has post_actions
- COMMITTED / RELEASED (no post_actions)
- EXEC / FAIL / RELEASING: … recover_all=True, includes incomplete ops) → release locks → delete journal
- INIT / ACQUIRE: … init_info.lock_paths → release locks → delete journal

Best-effort rollback: Each undo step is wrapped in its own try/except; a single step failure does not prevent subsequent steps from executing.
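A minimal sketch of best-effort rollback, assuming undo entries are replayed in reverse order and each carries a completion flag. `UndoEntry` and `rollback` are hypothetical names chosen for illustration and are not taken from the PR's UndoLog; the `recover_all` flag mirrors the option named above for including incomplete operations during crash recovery.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class UndoEntry:
    op_type: str                 # e.g. "fs_mkdir", "fs_mv"
    undo: Callable[[], None]     # reverses the recorded step
    completed: bool = True       # False if the step never finished


def rollback(entries: List[UndoEntry],
             recover_all: bool = False) -> List[Tuple[str, Exception]]:
    """Undo entries newest-first; collect failures instead of stopping."""
    failures: List[Tuple[str, Exception]] = []
    for entry in reversed(entries):
        # Normal rollback reverses only completed steps; recover_all
        # also attempts the ones that may have been cut short.
        if not entry.completed and not recover_all:
            continue
        try:
            entry.undo()
        except Exception as exc:
            # One failed step must not block the remaining ones.
            failures.append((entry.op_type, exc))
    return failures
```

Reverse order is what makes nested cases work: a child directory created after its parent is removed before the parent is.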
Undo Operation Types
- fs_mv: … dst → src)
- fs_rm
- fs_mkdir
- fs_write_new
- vectordb_upsert
- vectordb_delete
- vectordb_update_uri

Changes Made
- TransactionContext, TransactionManager (state machine + timeout cleanup), TransactionJournal (AGFS persistence), UndoLog + rollback executor
- PathLock with POINT / SUBTREE / MV locking modes, fencing tokens to prevent ABA, livelock backoff
- … init_info
- … rm() / mv() wrapped in transactions
- … commit() internally uses two-phase transactions (bridged via run_async), external sync signature unchanged
- … RecoverStale at-least-once), ack mechanism, SemanticDAG optimizations

Testing
Checklist
Additional Notes