Skip to content

refactor: Replace progressStore with file-based IndexStatus#40

Open
doITmagic wants to merge 13 commits intodevfrom
refactor/indexing-progress
Open

refactor: Replace progressStore with file-based IndexStatus#40
doITmagic wants to merge 13 commits intodevfrom
refactor/indexing-progress

Conversation

@doITmagic
Copy link
Owner

Description

Replaces the old, complex progressStore indexing progress mechanism with a simple file-based IndexStatus system.

Problem

The previous system had two separate file walks (one for counting, one for indexing), an in-memory progressStore with preRegister/update/carry-over/flusher logic, and multiple intermediate structs (IndexingProgressSummary, LangProgressItem). This made progress reporting confusing, inaccurate, and hard to maintain.

Solution

  • Single source of truth: {workspaceRoot}/.ragcode/index_status.json — a simple JSON file written by the indexer during processing
  • Direct write from indexer: The Progress callback in pkg/indexer/service.go writes OnDisk, Changed, Processed counts per language directly to the status file
  • Direct read by tools: MCP tools read the status file via indexer.LoadIndexStatus() — no intermediate transformations
  • Lifecycle: startingrunning (first file processed) → completed/failed

What was removed

  • progressStore (preRegister, update, carry-over, flusher goroutine)
  • IndexingProgressSummary, LangProgressItem, BuildIndexingProgress, formatAge, buildIndexingMessage
  • Auto-resume logic from SearchCode/HybridSearchCode (redundant with DetectContext which already triggers re-indexing on first tool call)
  • resumeAttempts field from Engine struct

What was added

  • pkg/indexer/index_status.goIndexStatus, LangStatus, SaveIndexStatus, LoadIndexStatus
  • pkg/indexer/index_status_test.go — round-trip and missing file tests
  • Progress callback wiring in engine.IndexWorkspace

Files changed (22 files, +208 / -1182 lines)

  • Moved: engine/index_progress.gopkg/indexer/index_status.go
  • Modified: engine.go, response.go, smart_search.go, smart_search_pipeline.go, index_workspace.go, list_package_exports.go, find_usages.go, call_hierarchy.go, evaluate_ragcode.go, skills.go, read_file_context.go
  • Tests updated: engine_searchcode_test.go, engine_*_test.go, health_metrics_test.go, detector_test.go

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

Checklist:

  • I have performed a self-review of my own code
  • I have formatted my code with go fmt ./...
  • I have run tests go test ./... and they pass
  • I have verified integration with Ollama/Qdrant (if applicable)
  • I have updated the documentation accordingly

- Remove progressStore (preRegister, update, carry-over, flusher)
- Remove IndexingProgressSummary, BuildIndexingProgress, formatAge, buildIndexingMessage
- Remove auto-resume from SearchCode/HybridSearchCode (redundant with DetectContext)
- Remove resumeAttempts field from Engine
- Add IndexStatus/LangStatus/SaveIndexStatus/LoadIndexStatus in pkg/indexer/
- Indexer writes OnDisk/Changed/Processed via Progress callback
- Tools read status directly from .ragcode/index_status.json
- Fix TestDetectNoMarkers with AllowedRoots isolation
Copilot AI review requested due to automatic review settings March 8, 2026 23:04
@doITmagic doITmagic self-assigned this Mar 8, 2026
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Replaces the in-memory progressStore indexing progress mechanism (with preRegister, flusher goroutine, carry-over logic) with a simpler file-based IndexStatus system in pkg/indexer/index_status.go. MCP tools now read status directly from {workspaceRoot}/.ragcode/index_status.json via indexer.LoadIndexStatus(). This is a significant simplification that removes ~1000 lines of complex concurrency code.

Changes:

  • New pkg/indexer/index_status.go with IndexStatus/LangStatus structs and SaveIndexStatus/LoadIndexStatus functions, replacing the old engine/index_progress.go with its progressStore, flusher goroutine, and deep-copy logic
  • All MCP tools updated to use the new IndexingStatus field (backed by indexer.IndexStatus) instead of IndexingProgress (backed by IndexingProgressSummary), with the JSON tag preserved as "indexing_progress" for backward compatibility
  • Removed auto-resume logic from SearchCode/HybridSearchCode and the resumeAttempts throttle, as workspace re-indexing is now triggered via DetectContext

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
pkg/indexer/index_status.go New file: IndexStatus/LangStatus types, SaveIndexStatus, LoadIndexStatus
pkg/indexer/index_status_test.go New file: round-trip and missing-file tests
internal/service/engine/engine.go Remove progressStore/resumeAttempts, add GetIndexStatus, wire Progress callback with file-based status
internal/service/engine/index_progress.go Deleted: old progressStore and all related types/functions
internal/service/engine/index_progress_test.go Deleted: tests for removed progressStore
internal/service/tools/response.go Replace IndexingProgressSummary with IndexingStatus, rename helper to ContextFromWorkspaceWithStatus
internal/service/tools/smart_search_pipeline.go Use LoadIndexStatus, replace dynamic fallback note with static string
internal/service/tools/smart_search.go Simplify indexing error messages, remove progress attachment
internal/service/tools/find_usages.go Switch to GetIndexStatus/ContextFromWorkspaceWithStatus
internal/service/tools/call_hierarchy.go Switch to GetIndexStatus/ContextFromWorkspaceWithStatus
internal/service/tools/list_package_exports.go Switch to ContextFromWorkspaceWithStatus, remove idx-based status override
internal/service/tools/index_workspace.go Use LoadIndexStatus directly
internal/service/tools/skills.go Set IndexingStatus: nil
internal/service/tools/evaluate_ragcode.go Remove unused wctxID/wctxRoot, set IndexingStatus: nil
internal/service/tools/read_file_context.go Set IndexingStatus: nil
internal/service/tools/tests/health_metrics_test.go Remove tests for deleted types/functions
internal/service/engine/engine_searchcode_test.go Remove auto-resume test
internal/service/engine/engine_nonblocking_search_test.go Remove progress.stop() cleanup
internal/service/engine/engine_fallback_search_test.go Remove progress.stop() cleanup
internal/service/engine/engine_sticky_test.go Remove progress.stop() cleanup
pkg/workspace/detector/detector_test.go Isolate test to avoid picking up .ragcode markers from parent dirs
cmd/rag-code-mcp/main.go Version bump to 2.1.63

The Progress callback received totalFiles = len(changedFiles), which only
counts modified files needing re-indexing. This was incorrectly assigned
to OnDisk, causing on_disk: 1 when only 1 file changed — despite 232 Go
files and 655 docs on disk.
Fix:
- Call CountAllFiles() once before the language loop for real disk totals
- Pre-populate index_status.json with on_disk counts at indexing start
- Use diskTotal (pre-counted) for OnDisk, totalFiles for Changed
- Languages with 0 changed files now correctly show their disk totals
@doITmagic doITmagic requested a review from Copilot March 8, 2026 23:20
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

razvan added 2 commits March 9, 2026 08:56
- Fix nil panic in ContextFromWorkspaceWithStatus when wctx is nil (#7)
- Fix indentation in smart_search_pipeline.go (#1)
- Use loaded idx instead of nil in call_hierarchy.go and find_usages.go (#3, #9)
- Add backward-compat comment on JSON tag mismatch (#6)
- Create fresh IndexStatus when LoadIndexStatus returns nil (#8)
- Populate Elapsed field at completed/failed transitions (#2)
- Throttle progress I/O writes to every 10 files (#4)
- Fix test cleanup for .ragcode dir in TempDir
- Removed the 'State' field ('starting', 'running', 'completed', 'failed') from IndexStatus entirely.
- This state was misleading for AI consumers, especially during incremental re-indexing (which reset state to 'starting' even if the index was 99% complete), causing AI agents to prematurely abandon tools.
- Simplified engine.go progress callbacks and terminal states to only log timestamps and errors, rather than a potentially confusing overall state keyword.
- Updated related tests to match the simplified struct.
Copilot AI review requested due to automatic review settings March 9, 2026 07:17
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 23 out of 23 changed files in this pull request and generated 9 comments.

Comments suppressed due to low confidence (2)

internal/service/tools/call_hierarchy.go:136

  • The check if idx != nil at line 133 is now semantically incorrect. With the old in-memory progressStore, GetIndexProgress returned nil when no indexing was active. With the new file-based system, GetIndexStatus reads from disk and returns non-nil even for a completed previous run (the file persists). This means the response will always be indexing_in_progress when no collections exist but a previous run left an index_status.json file — even if indexing completed hours ago.

This needs to check the actual state (e.g., EndedAt is empty and Error is empty) to determine if indexing is truly in progress. Without a State field on IndexStatus, you'd need something like: if idx != nil && idx.EndedAt == "" && idx.Error == "".

			if idx != nil {
				resp.Status = "indexing_in_progress"
				resp.Data = map[string]any{"indexing": idx}
			}

internal/service/tools/find_usages.go:105

  • Same issue as in call_hierarchy.go: if idx != nil at line 103 will now be true even for a completed previous indexing run (the file persists on disk), incorrectly changing the status from indexing_required to indexing_in_progress. This check needs to verify that indexing is actually ongoing (e.g., idx.EndedAt == "").
			if idx != nil {
				resp.Status = "indexing_in_progress"
			}

razvan added 2 commits March 9, 2026 11:52
BUG-001 (list_package_exports): normalize full import path to short
package name before querying Qdrant. The index stores 'indexer', not
'github.com/doITmagic/rag-code-mcp/pkg/indexer'.

BUG-003 (Go parser): go/doc automatically moves constructor/loader
functions (NewX, LoadX) that return *T from docPkg.Funcs into
docPkg.Types[T].Funcs. The parser only iterated typ.Methods, so these
functions were silently dropped and never written to the vector index.

Fix: add a typ.Funcs loop in AnalyzePackage() after the methods loop.

Affected symbols confirmed missing from Qdrant before fix:
  LoadIndexStatus, NewService, NewState, LoadState (pkg/indexer)

Tests: expanded analyzer_test.go to use real pkg/indexer code as
fixture with expectations anchored to the Qdrant DB snapshot
(25 points, 2026-03-09). Added regression tests for BUG-003,
IsPublic correctness, signature accuracy, and line coverage.
Engine (BUG-004):
- StartIndexingAsync now queues recreate=true as pendingOverflow when a job
  is already running, instead of silently dropping the request
- Fix all flaky engine test cleanups: properly wait for background goroutines
  from BOTH engine instances with time.Sleep before TempDir removal
- Add tests: TestStartIndexingAsyncRecreateQueues/StartsImmediately

Python parser (treesitter.go):
- Add patchExceptAs workaround for gotreesitter v0.6.0 broken AST on except-as
- Extract module-level variables/constants (extractAssignment/extractAssignmentDirect)
- Extract class variables from class body blocks
- Extract function/method calls for Code Graph relations (rag_find_usages)
- Detect generators via nodeContainsType(yield)
- Parse metaclass= keyword arguments in class bases
- Refactor docstring extraction with stripDocstringQuotes helper
- Handle gotreesitter putting string nodes directly in blocks (no wrapper)

Python parser (extract.go):
- Refactor getIndentation to use tagged switch
Copilot AI review requested due to automatic review settings March 9, 2026 13:31
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 8 comments.

Comments suppressed due to low confidence (2)

internal/service/tools/call_hierarchy.go:136

  • Same bug as in find_usages.go: idx is loaded from the persisted index_status.json file. Once any indexing run has completed, the file exists and idx != nil is always true — causing the response status to be incorrectly set to "indexing_in_progress" even when no indexing job is active.

The condition on line 133 should check whether indexing is actually in progress (e.g., via ActiveIndexingJobs() or checking idx.EndedAt == "") rather than just checking file existence.

	idx := t.engine.GetIndexStatus(wctx.Root)

	visited := make(map[string]bool)

	rootNode := &CallNode{Name: symbolName}

	// Try to find root symbol info
	rootRes := t.findSymbolInfo(ctx, wctx.ID, symbolName)
	if rootRes != nil {
		rootNode.Type, _ = rootRes.Point.Payload["type"].(string)
		rootNode.FilePath, _ = rootRes.Point.Payload["file_path"].(string)
		rootNode.Package, _ = rootRes.Point.Payload["package"].(string)
	} else {
		// If nothing is indexed yet, ExactSearchPolyglot will return ErrNoCollectionsFound.
		// Signal indexing status instead of returning an empty hierarchy.
		_, sErr := t.engine.ExactSearchPolyglot(ctx, wctx.ID, map[string]interface{}{"name": symbolName}, 1)
		var noCollections *engine.ErrNoCollectionsFound
		if errors.As(sErr, &noCollections) {
			resp := ToolResponse{
				Status:  "indexing_required",
				Message: fmt.Sprintf("⏳ Workspace '%s' is not indexed yet. Indexing is required for complete call hierarchy results.", wctx.Root),
				Context: ContextFromWorkspaceWithStatus(wctx, t.engine),
			}
			if idx != nil {
				resp.Status = "indexing_in_progress"
				resp.Data = map[string]any{"indexing": idx}
			}

internal/service/tools/find_usages.go:105

  • Bug: idx is now loaded from the persisted index_status.json file via GetIndexStatus(). Unlike the old GetIndexProgress() which returned non-nil only when an active in-memory job was running, the file persists after indexing completes. This means idx != nil will always be true once any indexing run has occurred, causing the status to incorrectly change from "indexing_required" to "indexing_in_progress" even when indexing finished long ago.

To fix: either check if an indexing job is currently active (via ActiveIndexingJobs() or indexingJobs.Load(wctx.ID)), or check a specific field on the status (e.g., s.EndedAt == "") before treating it as "in progress".

	idx := t.engine.GetIndexStatus(wctx.Root)
	allResults, err := t.engine.ExactSearchPolyglot(ctx, wctx.ID, filter, 100)
	if err != nil {
		var noCollections *engine.ErrNoCollectionsFound
		if errors.As(err, &noCollections) {
			resp := ToolResponse{
				Status:  "indexing_required",
				Message: fmt.Sprintf("⏳ Workspace '%s' is not indexed yet. Indexing is required for complete results.", wctx.Root),
				Context: ContextFromWorkspaceWithStatus(wctx, t.engine),
			}
			if idx != nil {
				resp.Status = "indexing_in_progress"
			}

doITmagic and others added 3 commits March 10, 2026 11:51
This addresses issues where indexing large files (e.g., barou.sql) caused the host system to freeze due to host CPU/GPU starvation and excessive GC pressure.

- Fix Ollama throttling bug in indexer service by correctly using a 150ms delay instead of 10ms.

- Prevent GC thrashing in treesitter parser by evaluating byte sizes instead of allocating strings for every AST node.

- Truncate massive leaf nodes (>8KB) to prevent crashing the Ollama embedding API.
This addresses issues where indexing large files (e.g., barou.sql) caused the host system to freeze due to host CPU/GPU starvation and excessive GC pressure.

- Fix Ollama throttling bug in indexer service by correctly using a 150ms delay instead of 10ms.
- Prevent GC thrashing in treesitter parser by evaluating byte sizes instead of allocating strings for every AST node.
- Truncate massive leaf nodes (>8KB) to prevent crashing the Ollama embedding API.
Export IsInvalidRoot from the watch package and apply it as a
safety check at the very start of StartIndexingAsync, before any
job registration or SaveIndexStatus call.

This prevents accidental indexing of dangerous paths such as the
user home directory (~), filesystem root (/), or /tmp — which would
cause .ragcode/index_status.json to be written outside any real
workspace.

- pkg/workspace/watch: isInvalidRoot → IsInvalidRoot (exported + docstring)
- internal/service/engine: guard added as first check in StartIndexingAsync
Copilot AI review requested due to automatic review settings March 10, 2026 19:21
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 34 out of 34 changed files in this pull request and generated 9 comments.

razvan added 2 commits March 10, 2026 22:15
Critical fixes:
- Populate IndexingStatus in tool responses (was nil) for ListSkillsTool,
  InstallSkillTool, EvaluateRagCodeTool, ReadFileContextTool, SmartSearchTool,
  ListPackageExportsTool — use ContextFromWorkspaceWithStatus consistently
- fix(engine): preserve Languages map during incremental indexing in
  StartIndexingAsync (was overwriting with empty object)
- fix(engine): extract finalizeIndexStatus helper to eliminate duplicated
  EndedAt/Elapsed/Error finalization logic in success and error branches
- fix(engine): Progress callback — eliminate LoadIndexStatus (disk read +
  JSON unmarshal) on every tick; keep single *IndexStatus in-memory and
  only call SaveIndexStatus (atomic write) for disk flush every 10 files
- fix(indexer): SaveIndexStatus uses atomic write-to-temp-then-rename
  to prevent concurrent readers seeing partial JSON

Hidden from AI consumers:
- LangStatus.Changed field now json:"-" — AI sees only on_disk and processed

Cleanup:
- smart_search_pipeline.go: fix extra blank lines and restore missing
  return statement after buildResponseMeta refactor
- treesitter.go: replace invalid issues/TBD link with descriptive comment
- watcher.go: clarify IsInvalidRoot doc comment (~ is not expanded by
  filepath.Clean; rejection is via os.UserHomeDir())
- BUGS.md: mark BUG-003 as Fixed (PR #40)
- SUGGESTIONS.md: translate to English, update with current State-field status
- analyzer_test.go: remove stale Qdrant DB snapshot references from comments
- extract.go: fix getIndentation break → return to exit for-loop

Tests:
- analyzer_test.go: relax exact line number assertions to > 0
- treesitter_test.go: add 7 new tests for patchExceptAs, call extraction
  (Code Graph), module-level vars/constants, class vars, IsGenerator
- treesitter.go: fix extractClassVarsFromBlock to handle assignment nodes
  placed directly in block without expression_statement wrapper
Copilot AI review requested due to automatic review settings March 10, 2026 20:18
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 35 out of 36 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (2)

internal/service/tools/find_usages.go:105

  • In the ErrNoCollectionsFound branch, resp.Status is switched to "indexing_in_progress" whenever idx != nil. With the new file-based IndexStatus, idx will be non-nil for any workspace that has ever written index_status.json (even if indexing already completed), so this can misreport "indexing_in_progress". Consider keying this on a real in-progress signal (e.g., idx.EndedAt == "" / idx.Error == "" or checking Engine.ActiveIndexingJobs for wctx.ID) instead of mere file existence.
	idx := t.engine.GetIndexStatus(wctx.Root)
	allResults, err := t.engine.ExactSearchPolyglot(ctx, wctx.ID, filter, 100)
	if err != nil {
		var noCollections *engine.ErrNoCollectionsFound
		if errors.As(err, &noCollections) {
			resp := ToolResponse{
				Status:  "indexing_required",
				Message: fmt.Sprintf("⏳ Workspace '%s' is not indexed yet. Indexing is required for complete results.", wctx.Root),
				Context: ContextFromWorkspaceWithStatus(wctx, t.engine),
			}
			if idx != nil {
				resp.Status = "indexing_in_progress"
			}

internal/service/tools/call_hierarchy.go:136

  • The "indexing_in_progress" status is currently set whenever idx != nil, but IndexStatus will be non-nil for any workspace that has an index_status.json from a previous run. This can incorrectly report indexing as in progress when indexing is actually completed (or stale). Consider switching this condition to something that reflects an active run (e.g., idx.EndedAt == "" and idx.Error == "" / checking Engine.ActiveIndexingJobs for wctx.ID).
	idx := t.engine.GetIndexStatus(wctx.Root)

	visited := make(map[string]bool)

	rootNode := &CallNode{Name: symbolName}

	// Try to find root symbol info
	rootRes := t.findSymbolInfo(ctx, wctx.ID, symbolName)
	if rootRes != nil {
		rootNode.Type, _ = rootRes.Point.Payload["type"].(string)
		rootNode.FilePath, _ = rootRes.Point.Payload["file_path"].(string)
		rootNode.Package, _ = rootRes.Point.Payload["package"].(string)
	} else {
		// If nothing is indexed yet, ExactSearchPolyglot will return ErrNoCollectionsFound.
		// Signal indexing status instead of returning an empty hierarchy.
		_, sErr := t.engine.ExactSearchPolyglot(ctx, wctx.ID, map[string]interface{}{"name": symbolName}, 1)
		var noCollections *engine.ErrNoCollectionsFound
		if errors.As(sErr, &noCollections) {
			resp := ToolResponse{
				Status:  "indexing_required",
				Message: fmt.Sprintf("⏳ Workspace '%s' is not indexed yet. Indexing is required for complete call hierarchy results.", wctx.Root),
				Context: ContextFromWorkspaceWithStatus(wctx, t.engine),
			}
			if idx != nil {
				resp.Status = "indexing_in_progress"
				resp.Data = map[string]any{"indexing": idx}
			}

- index_status.Processed was being overwritten with session-only doneFiles,
  causing incremental re-indexes to reset the counter (e.g. 232 → 1)
- Now captures baseProcessed before each language run and accumulates:
  Processed = baseProcessed + doneFiles
- Full re-index (recreate=true or totalFiles >= diskTotal) resets base to 0
- Register JavaScript parser in daemon imports
- fix(engine): propagate recreate=true through pendingOverflow queue
  When a recreate request is queued while indexing, tryStartPendingIndex
  now correctly passes recreate=true to the follow-up job.

- fix(index_status): Windows-safe atomic rename fallback
  os.Rename fails on Windows when dest exists; now falls back to
  remove-then-rename.

- fix(docs/treesitter): avoid materializing huge strings for massive
  leaf nodes — slice source bytes directly when nodeLen > 8192.

- cleanup(tests): remove empty t.Cleanup block in fallback search test.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants