perf(graph): batch DB operations for entity insertion#624

Open
gkorland wants to merge 3 commits into staging from perf/batch-db-operations

Conversation


@gkorland gkorland commented Mar 20, 2026

Summary

Replace one-at-a-time graph database queries with UNWIND-based batch operations during code analysis, dramatically reducing network round-trips.

Before: For a 100-file project with ~500 entities and ~1000 relationships, the analysis pipeline made ~1,600 individual DB round-trips (one per file, entity, and relationship).

After: The same project requires ~10 batched queries (one per entity label + one per relationship type).
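The arithmetic behind those numbers can be sketched with a stand-in client that simply counts round-trips; the entity labels and relationship types below are illustrative, not an exhaustive list from the codebase:

```python
# Hypothetical in-memory stand-in for the graph client, used only to
# illustrate the round-trip arithmetic from the summary above.
class CountingClient:
    def __init__(self):
        self.round_trips = 0

    def query(self, cypher, params=None):
        self.round_trips += 1  # every query() call is one network round-trip

# Before: one query per file, entity, and relationship.
before = CountingClient()
for _ in range(100):   # files
    before.query("MERGE (:File {path: $p})")
for _ in range(500):   # entities
    before.query("MERGE (:Function {name: $n})")
for _ in range(1000):  # relationships
    before.query("MERGE (a)-[:CALLS]->(b)")

# After: one UNWIND query for files, one per entity label, one per
# relationship type (label/type sets below are assumed for illustration).
after = CountingClient()
after.query("UNWIND $files AS f MERGE (:File {path: f.path})")
for label in ("Class", "Function", "Method", "Struct"):
    after.query(f"UNWIND $entities AS e MERGE (c:{label} {{name: e.name}})")
for rel in ("DEFINES", "CALLS", "EXTENDS", "IMPLEMENTS", "RETURNS"):
    after.query(f"UNWIND $rels AS r MERGE (src)-[e:{rel}]->(dest)")

print(before.round_trips, after.round_trips)  # 1600 10
```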

Changes

  • api/graph.py — Add three batch methods:
    • add_files_batch() — single UNWIND query for all files
    • add_entities_batch() — one UNWIND query per entity label (Class, Function, etc.)
    • connect_entities_batch() — one UNWIND query per relationship type (DEFINES, CALLS, etc.)
  • api/analyzers/source_analyzer.py — Restructure analysis pipeline:
    • first_pass: parse files and build in-memory hierarchy, then batch-flush files → entities → relationships
    • second_pass: collect all resolved symbol relationships, then batch-insert at the end
    • create_hierarchy / create_entity_hierarchy: refactored to collect entities and relationships in memory instead of making immediate DB calls
  • tests/test_graph_ops.py — Add tests for all three batch methods (including edge cases like empty lists and relationships with properties)
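The accumulate-then-flush shape of the restructured pipeline can be sketched as follows; the class and function bodies are simplified stand-ins, not the PR's exact signatures:

```python
# Hypothetical sketch of the first_pass restructuring described above.
class FlushOrderGraph:
    """Records batch calls instead of talking to a real DB."""
    def __init__(self):
        self.calls = []

    def add_files_batch(self, files):
        self.calls.append(("files", len(files)))

    def add_entities_batch(self, entities):
        self.calls.append(("entities", len(entities)))

    def connect_entities_batch(self, rels):
        self.calls.append(("rels", len(rels)))

def first_pass(graph, parsed_files):
    pending_files, pending_entities, pending_rels = [], [], []
    for path, entities in parsed_files:
        pending_files.append(path)
        for ent in entities:
            pending_entities.append(ent)
            pending_rels.append((path, "DEFINES", ent))
    # Flush in dependency order: files first (entities reference file IDs),
    # then entities, then the relationships that connect them.
    graph.add_files_batch(pending_files)
    graph.add_entities_batch(pending_entities)
    graph.connect_entities_batch(pending_rels)

g = FlushOrderGraph()
first_pass(g, [("a.py", ["f", "g"]), ("b.py", ["h"])])
print(g.calls)  # [('files', 2), ('entities', 3), ('rels', 3)]
```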

Testing

  • All existing test_graph_ops.py tests pass (4/4)
  • New batch tests pass (5/5): test_add_files_batch, test_add_files_batch_empty, test_add_entities_batch, test_connect_entities_batch, test_connect_entities_batch_empty
  • Auth tests pass (2/2)
  • Frontend lint passes
  • Pre-existing test failures in analyzer tests are unchanged (they fail on baseline too due to path mismatch in get_file assertions)

Memory / Performance Impact

  • DB round-trips: ~1,600 → ~10 for a typical small project (160x reduction)
  • Memory: Slightly higher peak memory during first pass (entities held in list before flush), but no persistent increase — the same objects were already retained in self.files for the second pass
  • Backlog compatibility: Batch queries are recorded in the backlog the same way individual queries were, preserving replay of git-history transitions

Related Issues

N/A — Performance improvement identified during code review

Summary by CodeRabbit

  • Performance Improvements

    • Implemented batched writes for files, entities, and relationships to improve throughput and reduce overhead.
  • Bug Fixes

    • Corrected file ignore/filter behavior and ensured out-of-scope files and relationships are skipped.
    • Added validation to prevent creation of entities with invalid labels.
  • Tests

    • Added tests for batch operations, chunking behavior, and empty-input edge cases.

Replace one-at-a-time graph database queries with UNWIND-based batch
operations, dramatically reducing network round-trips during analysis.

Changes:
- Add add_files_batch() to graph.py: single UNWIND query for all files
- Add add_entities_batch() to graph.py: one query per entity label
- Add connect_entities_batch() to graph.py: one query per relation type
- Refactor first_pass in source_analyzer.py: collect entities/rels in
  memory, then batch-flush to DB
- Refactor second_pass in source_analyzer.py: collect all resolved
  symbol relationships, then batch-insert
- Add tests for all batch methods in test_graph_ops.py

For a typical 100-file project (~500 entities, ~1000 relationships),
this reduces ~1600 individual DB round-trips to ~10 batched queries.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

coderabbitai bot commented Mar 20, 2026

📝 Walkthrough

Analyzers now defer creating graph nodes/edges during AST traversal: they accumulate pending files, entities, and relationships, then batch-insert files/entities and create relationships via new Graph batch APIs instead of performing immediate per-item graph writes. Ignore-filter and second-pass iteration behavior were adjusted.

Changes

  • Batch Operations Infrastructure (api/graph.py): Added BATCH_SIZE and _VALID_LABEL_RE; added add_files_batch(files), add_entities_batch(entities_data), and connect_entities_batch(relationships), which validate, chunk (UNWIND), upsert nodes/edges, and assign back created node IDs. add_entity now enforces label validation.
  • Source Analyzer Refactoring (api/analyzers/source_analyzer.py): Refactored signatures: create_entity_hierarchy(..., pending_entities, pending_rels) and create_hierarchy(..., pending_entities, pending_rels). first_pass() accumulates pending_files, pending_entities, pending_rels and uses graph.add_files_batch, graph.add_entities_batch, graph.connect_entities_batch. second_pass() skips files not in self.files, defers relationship creation into a local pending_rels, and fixes the ignore-filter iteration variable.
  • Batch Operations Testing (tests/test_graph_ops.py): Added tests: test_add_files_batch, test_add_files_batch_empty, test_add_entities_batch, test_connect_entities_batch, test_connect_entities_batch_empty, and test_batch_chunking to verify IDs, properties, relationship creation, empty inputs, and internal chunking behavior.

Sequence Diagram(s)

sequenceDiagram
    participant Analyzer as SourceAnalyzer
    participant AST as AST Walker
    participant Resolver as SymbolResolver
    participant Graph as Graph (client)
    participant DB as Neo4j

    Analyzer->>AST: first_pass walk(file)
    AST->>Resolver: resolve symbols
    Resolver-->>Analyzer: discovered files/entities/DEFINES (append to pending lists)
    Note over Analyzer,Graph: After first pass completes
    Analyzer->>Graph: add_files_batch(pending_files)
    Graph->>DB: UNWIND/MERGE File nodes (chunked)
    DB-->>Graph: created File IDs
    Graph-->>Analyzer: assign File.id
    Analyzer->>Graph: add_entities_batch(pending_entities)
    Graph->>DB: UNWIND/MERGE Entity nodes grouped by label (chunked)
    DB-->>Graph: created Entity IDs
    Graph-->>Analyzer: assign Entity.id
    Analyzer->>AST: second_pass walk(file)
    AST->>Resolver: resolve relationships/calls
    Resolver-->>Analyzer: append relationships to pending_rels (incl. CALLS props)
    Note over Analyzer,Graph: After second pass completes
    Analyzer->>Graph: connect_entities_batch(pending_rels)
    Graph->>DB: UNWIND/MATCH by ID and MERGE relationships (chunked)
    DB-->>Graph: relationships created/updated

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hopped through AST leaves in rows,

Collected nodes where quiet code grows.
Batched my carrots, pushed them through,
Cypher UNWIND — a crunchy crew.
— a rabbit counting IDs anew

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 54.55%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check: ✅ Passed. The PR title accurately describes the main change: batching database operations for entity insertion to improve performance.



Copilot AI left a comment
Pull request overview

This PR refactors the analysis write-path to reduce FalkorDB round-trips by batching file/entity/relationship creation via UNWIND, moving from per-node/per-edge queries to per-label/per-relationship-type queries.

Changes:

  • Added Graph.add_files_batch(), Graph.add_entities_batch(), and Graph.connect_entities_batch() for batched inserts/merges.
  • Updated SourceAnalyzer first/second pass to collect entities/relationships in memory and flush them in batches.
  • Expanded tests/test_graph_ops.py with coverage for the new batch operations and empty-input edge cases.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
api/graph.py Introduces batch UNWIND write operations for files, entities, and relationships.
api/analyzers/source_analyzer.py Refactors analysis pipeline to accumulate and batch-write graph mutations.
tests/test_graph_ops.py Adds tests validating batch insertion and batch relationship creation behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +295 to +301
q = f"""UNWIND $entities AS e
MERGE (c:{label}:Searchable {{name: e['name'], path: e['path'],
src_start: e['src_start'],
src_end: e['src_end']}})
SET c.doc = e['doc']
SET c += e['props']
RETURN c"""
Copilot AI Mar 20, 2026

add_entities_batch interpolates label directly into the Cypher string. Since labels can’t be parameterized, this should defensively validate label (e.g., allowlist known entity labels or enforce a strict [A-Za-z0-9_]+ regex) to avoid Cypher injection if this method is ever called with untrusted input.
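A minimal sketch of the suggested validation; the helper name validated_label is illustrative (the PR later adds an equivalent _VALID_LABEL_RE check directly in graph.py):

```python
import re

# Strict allowlist regex, as the comment suggests; Cypher labels cannot be
# parameterized, so any label interpolated into a query must be validated.
_VALID_LABEL_RE = re.compile(r'^[A-Za-z_][A-Za-z0-9_]*$')

def validated_label(label: str) -> str:
    """Return the label unchanged, or raise if it could alter the query."""
    if not _VALID_LABEL_RE.match(label):
        raise ValueError(f"invalid entity label: {label!r}")
    return label

# Safe interpolation: the label is checked before it reaches the f-string.
label = validated_label("Function")
q = f"UNWIND $entities AS e MERGE (c:{label}:Searchable {{name: e['name']}})"

# An injection attempt never reaches the query builder:
try:
    validated_label("Function]->() DETACH DELETE n //")
except ValueError as err:
    print(err)
```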

api/graph.py (Outdated)
Comment on lines +569 to +570
MATCH (src), (dest)
WHERE ID(src) = r['src_id'] AND ID(dest) = r['dest_id']
Copilot AI Mar 20, 2026

connect_entities_batch uses MATCH (src), (dest) WHERE ID(src)=... AND ID(dest)=..., which expresses a Cartesian product before filtering. Rewriting as two separate MATCH clauses (or MATCH (src) WHERE ID(src)=... then MATCH (dest) WHERE ID(dest)=...) avoids planner-dependent cross-join behavior and better matches the intent of ID lookups.

Suggested change
- MATCH (src), (dest)
- WHERE ID(src) = r['src_id'] AND ID(dest) = r['dest_id']
+ MATCH (src)
+ WHERE ID(src) = r['src_id']
+ MATCH (dest)
+ WHERE ID(dest) = r['dest_id']

q = f"""UNWIND $rels AS r
MATCH (src), (dest)
WHERE ID(src) = r['src_id'] AND ID(dest) = r['dest_id']
MERGE (src)-[e:{relation}]->(dest)
Copilot AI Mar 20, 2026

connect_entities_batch always uses MERGE (src)-[e:REL]->(dest) and then SET e += properties. For relationship types where multiple edges between the same pair are meaningful (e.g., multiple CALL sites between the same functions), this will collapse them into a single edge and later rows will overwrite earlier properties. Consider using CREATE for those relation types, or MERGE on a uniqueness key that includes the distinguishing properties (like line/pos).

Suggested change
- MERGE (src)-[e:{relation}]->(dest)
+ CREATE (src)-[e:{relation}]->(dest)
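The semantic difference at stake can be seen with a toy in-memory edge store; this is an illustration of upsert-versus-append behavior, not FalkorDB's implementation:

```python
# Toy edge store: MERGE behaves like an upsert keyed on (src, type, dest),
# while CREATE always appends a new edge.
edges = []

def merge_edge(src, rel, dest, props):
    for e in edges:
        if (e["src"], e["rel"], e["dest"]) == (src, rel, dest):
            e["props"].update(props)  # later rows overwrite earlier props
            return
    edges.append({"src": src, "rel": rel, "dest": dest, "props": dict(props)})

def create_edge(src, rel, dest, props):
    edges.append({"src": src, "rel": rel, "dest": dest, "props": dict(props)})

# Two call sites between the same function pair collapse under MERGE,
# and the second row's properties win:
merge_edge("f", "CALLS", "g", {"line": 10})
merge_edge("f", "CALLS", "g", {"line": 42})
assert len(edges) == 1 and edges[0]["props"]["line"] == 42

# Under CREATE the two call sites stay distinct:
edges.clear()
create_edge("f", "CALLS", "g", {"line": 10})
create_edge("f", "CALLS", "g", {"line": 42})
assert len(edges) == 2
```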


  # Skip ignored files
- if any([i in str(file_path) for i in ignore]):
+ if any([ig in str(file_path) for ig in ignore]):
Copilot AI Mar 20, 2026

This any([ ... for ... ]) builds an intermediate list on every file; any(ig in str(file_path) for ig in ignore) avoids the allocation and is equivalent.

Suggested change
- if any([ig in str(file_path) for ig in ignore]):
+ if any(ig in str(file_path) for ig in ignore):
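The reviewer's point is easy to observe: any() short-circuits over a generator expression, but a list comprehension is fully built first. The probe function below is illustrative:

```python
# Count how many ignore patterns are actually evaluated in each form.
checked = []

def contains(ig, path):
    checked.append(ig)  # record every evaluation
    return ig in path

ignore = [".git", "node_modules", "build", "dist"]
path = "/repo/.git/config"

checked.clear()
any([contains(ig, path) for ig in ignore])  # list comp: evaluates all 4
list_evals = len(checked)

checked.clear()
any(contains(ig, path) for ig in ignore)    # generator: stops at first hit
gen_evals = len(checked)

print(list_evals, gen_evals)  # 4 1
```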

self.graph.add_files_batch([])

def test_add_entities_batch(self):
from api.entities.entity import Entity
Copilot AI Mar 20, 2026

Entity is imported here but never used in the test. Dropping the unused import keeps the test focused and avoids lint noise.

Suggested change
- from api.entities.entity import Entity

Add BATCH_SIZE=500 constant to cap UNWIND query payloads. Each batch
method now processes items in chunks to avoid overwhelming FalkorDB/Redis
memory and protocol buffers on large repositories.

Add test_batch_chunking to verify correct behavior across chunk boundaries.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
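The chunking described in this commit boils down to fixed-size slicing of each pending list before it is sent as an UNWIND payload; a minimal sketch, with only the BATCH_SIZE value taken from the commit and the helper name assumed:

```python
# Cap each UNWIND payload at BATCH_SIZE items (value from the commit message).
BATCH_SIZE = 500

def chunked(items, size=None):
    """Yield consecutive slices of at most `size` items."""
    size = size or BATCH_SIZE
    for i in range(0, len(items), size):
        yield items[i:i + size]

# 1,234 pending entities become three UNWIND payloads: 500, 500, 234.
sizes = [len(c) for c in chunked(list(range(1234)))]
print(sizes)  # [500, 500, 234]
```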
coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (4)
api/graph.py (2)

571-576: Optional: Remove unused RETURN e to reduce payload.

The query returns e but the result is never used. Removing it would slightly reduce response payload size, though this is a minor optimization.

♻️ Suggested change
             q = f"""UNWIND $rels AS r
                     MATCH (src), (dest)
                     WHERE ID(src) = r['src_id'] AND ID(dest) = r['dest_id']
                     MERGE (src)-[e:{relation}]->(dest)
-                    SET e += r['properties']
-                    RETURN e"""
+                    SET e += r['properties']"""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/graph.py` around lines 571 - 576, The Cypher query string built in
variable q includes a trailing "RETURN e" but those results are never used; edit
the query in api/graph.py (the q string inside the code that UNWINDs $rels,
MATCHes src/dest and MERGEs the relationship) to remove "RETURN e" so the
database does not return unused payload — leave the UNWIND/MATCH/MERGE/SET lines
intact and just drop the final RETURN clause.

288-310: Consider using a named tuple or dataclass for the entity tuple structure.

The tuple indexing (item[0], item[1], etc.) is fragile and hard to read. A named tuple would make the code more maintainable and self-documenting.

♻️ Suggested improvement using NamedTuple
from typing import NamedTuple, Any

class EntityData(NamedTuple):
    entity_obj: Any
    label: str
    name: str
    doc: str
    path: str
    src_start: int
    src_end: int
    props: dict

Then access as item.label, item.name, etc.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/graph.py` around lines 288 - 310, The code uses positional tuple indexing
on entities_data (e.g., item[0], item[1], item[2], ...) which is fragile and
hard to read; replace the tuple with a small typed container (NamedTuple or
dataclass) such as EntityData with fields like entity_obj, label, name, doc,
path, src_start, src_end, props, update all usages in this function (the
grouping into by_label, creation of data dict for _query, and assignment
item[0].id = res.result_set[j][0].id) to use the named attributes (e.g.,
item.entity_obj, item.label, item.name, etc.), and adjust any callers that
populate entities_data so they construct EntityData instances instead of plain
tuples.
tests/test_graph_ops.py (2)

85-101: Consider verifying database persistence in addition to ID assignment.

The test verifies that id is assigned to mock entities, but doesn't confirm the entities were actually persisted to the database. This could mask issues where IDs are assigned but the DB write fails.

💡 Optional enhancement
# After the batch call, verify at least one entity is in the DB
result = self.graph.get_function_by_name('func_0')
self.assertIsNotNone(result)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_graph_ops.py` around lines 85 - 101, The test
test_add_entities_batch only asserts that mock_entity.id was set after calling
Graph.add_entities_batch but doesn't confirm DB persistence; update the test to
also query the DB (use Graph.get_function_by_name or an appropriate retrieval
method) for at least one of the inserted entities (e.g., 'func_0') after
self.graph.add_entities_batch(entities_data) and assert the returned result is
not None to ensure the entity was actually persisted; keep the existing ID
assertions.

142-157: Good test for chunking behavior with proper cleanup.

The try/finally ensures BATCH_SIZE is restored even if assertions fail. The test effectively validates the chunking logic across batch boundaries.

Note: If tests ever run in parallel, modifying the module-level BATCH_SIZE could cause race conditions. Consider using unittest.mock.patch for safer isolation:

♻️ Alternative using mock.patch
def test_batch_chunking(self):
    """Verify batches are correctly chunked when exceeding BATCH_SIZE."""
    from unittest.mock import patch
    
    with patch('api.graph.BATCH_SIZE', 3):
        files = [File(Path(f'/chunked/f{i}.py'), None) for i in range(7)]
        self.graph.add_files_batch(files)
        for f in files:
            self.assertIsNotNone(f.id)
        for i in range(7):
            result = self.graph.get_file(f'/chunked/f{i}.py', f'f{i}.py', '.py')
            self.assertIsNotNone(result)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_graph_ops.py` around lines 142 - 157, Replace the manual
try/finally mutation of api.graph.BATCH_SIZE in test_batch_chunking with a
scoped patch to avoid global state races: use unittest.mock.patch to temporarily
set 'api.graph.BATCH_SIZE' to 3 while running the assertions in
test_batch_chunking (so the rest of the suite sees the original value); keep the
same usage of File, self.graph.add_files_batch and self.graph.get_file, but wrap
them under a with patch('api.graph.BATCH_SIZE', 3) context (or the `@patch`
decorator) instead of changing BATCH_SIZE and restoring it in finally.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/graph.py`:
- Around line 554-582: connect_entities_batch currently silently drops
relationship entries when r['src_id'] or r['dest_id'] is None; validate each
tuple in connect_entities_batch (function) before building by_relation: filter
out or raise for entries where src_id or dest_id is None, and emit a
debug/warning log via the existing logger (or process logger) including the
relation name and offending tuple(s) so missing IDs are visible; ensure this
validation happens before grouping and before calling self._query and keep
chunking with BATCH_SIZE unchanged.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4a821171-2bce-404a-aaa5-2c778e8fe10a

📥 Commits

Reviewing files that changed from the base of the PR and between 00c6fbd and 0eeb1a3.

📒 Files selected for processing (2)
  • api/graph.py
  • tests/test_graph_ops.py

- Validate labels/relations with regex to prevent Cypher injection
- Split Cartesian MATCH into separate MATCH clauses in connect_entities_batch
- Skip relationships with None IDs with warning log
- Use generator expression in any() to avoid list allocation
- Remove unused Entity import in test

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
gkorland left a comment

Addressed all review comments:

  1. Label validation (graph.py): Added a _VALID_LABEL_RE regex (^[A-Za-z_][A-Za-z0-9_]*$) that validates labels in add_entity, add_entities_batch, and connect_entities_batch, preventing Cypher injection.

  2. Cartesian product (graph.py): Rewrote connect_entities_batch MATCH to use separate clauses: MATCH (src) WHERE ID(src)=... then MATCH (dest) WHERE ID(dest)=... instead of the cross-join pattern.

  3. MERGE vs CREATE: Kept MERGE — in our codebase, DEFINES, EXTENDS, IMPLEMENTS, CALLS, RETURNS, and PARAMETERS are all idempotent relationships where collapsing duplicates is the correct behavior. Multiple call sites between the same function pair are not tracked with line-level granularity yet.

  4. any() optimization (source_analyzer.py): Changed from any([...]) to any(...) generator expression to avoid intermediate list allocation.

  5. Unused import (test_graph_ops.py): Removed unused Entity import.

  6. None ID validation (graph.py): Added filtering with warning log in connect_entities_batch — relationships with None src/dest IDs are now skipped with a logged warning.
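Item 6 can be sketched as a pre-filter over relationship tuples before grouping and querying; the helper name and the (src_id, relation, dest_id, properties) tuple shape are illustrative:

```python
import logging

logger = logging.getLogger(__name__)

def filter_valid_rels(relationships):
    """Drop relationships whose endpoints never received a DB ID,
    logging a warning instead of failing silently."""
    valid = []
    for src_id, relation, dest_id, props in relationships:
        if src_id is None or dest_id is None:
            logger.warning("Skipping %s relationship with missing ID: %r",
                           relation, (src_id, dest_id))
            continue
        valid.append((src_id, relation, dest_id, props))
    return valid

rels = [(1, "CALLS", 2, {}), (None, "CALLS", 3, {}), (4, "DEFINES", None, {})]
print(filter_valid_rels(rels))  # [(1, 'CALLS', 2, {})]
```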

coderabbitai bot left a comment

🧹 Nitpick comments (1)
api/graph.py (1)

281-321: Consider robustness for partial failures and tuple clarity.

The implementation is correct, but two optional improvements could enhance maintainability:

  1. Partial failure handling: If label validation fails for a later group, entities from earlier groups are already committed. Consider validating all labels upfront before any insertions.

  2. Tuple index fragility: The 8-element tuple with positional indexing (item[1], item[2], etc.) is error-prone. A NamedTuple or dataclass would improve clarity.

♻️ Optional: Validate all labels upfront
     def add_entities_batch(self, entities_data: list) -> None:
         ...
         if not entities_data:
             return

         by_label = defaultdict(list)
         for item in entities_data:
             by_label[item[1]].append(item)

+        # Validate all labels before any DB operations
+        for label in by_label.keys():
+            if not _VALID_LABEL_RE.match(label):
+                raise ValueError(f"Invalid entity label: {label!r}")
+
         for label, group in by_label.items():
-            if not _VALID_LABEL_RE.match(label):
-                raise ValueError(f"Invalid entity label: {label!r}")
-
             q = f"""UNWIND $entities AS e
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/graph.py` around lines 281 - 321, The add_entities_batch method currently
validates labels per-group and uses fragile positional tuple indexing; first,
validate all labels before any DB work by iterating entities_data and checking
each label against _VALID_LABEL_RE and raising before entering the
batching/insert loop (so partial commits cannot occur); second, replace the
8-element positional tuples with a clearer structure (a NamedTuple or dataclass
such as EntityData with fields like entity_obj, label, name, doc, path,
src_start, src_end, props) and update add_entities_batch to unpack those named
fields instead of using item[1], item[2], etc., and keep BATCH_SIZE and the
UNWIND logic unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f9f6538e-aa36-4ed3-8388-06089b724ecc

📥 Commits

Reviewing files that changed from the base of the PR and between 0eeb1a3 and 620631e.

📒 Files selected for processing (3)
  • api/analyzers/source_analyzer.py
  • api/graph.py
  • tests/test_graph_ops.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/test_graph_ops.py

gkorland left a comment

Addressing remaining unresolved threads:

graph.py:309 (copilot — label injection): Already addressed — _VALID_LABEL_RE = re.compile(r'^[A-Za-z_][A-Za-z0-9_]*$') validates all labels before Cypher interpolation. Both add_entities_batch and connect_entities_batch call this validation.

graph.py:592 (copilot — MERGE vs CREATE): Good observation. MERGE is intentional here — for code graph relationships, we want idempotent edge creation (analyzing the same code twice shouldn't create duplicate edges). For CALLS with per-call-site properties, this would need revisiting when call-site metadata is added, but that's tracked as a separate architectural concern.
