fix: harden post-commit hook against non-UTF8 encoding#241
fix: harden post-commit hook against non-UTF8 encoding#241mikemolinet wants to merge 2 commits intoEtanHey:mainfrom
Conversation
The post-commit hook's subprocess output .decode() calls raise UnicodeDecodeError if git returns non-UTF8 bytes — e.g., when i18n.commitEncoding is set to a non-UTF8 charset, when the repo has legacy SVN/CVS-imported commit metadata, or when file paths contain non-UTF8 bytes. The outer try/except only catches CalledProcessError and FileNotFoundError, so UnicodeDecodeError escapes and the hook prints a traceback, silently skipping BrainLayer indexing for that commit. Pass errors="replace" to the four .decode() calls so invalid bytes become U+FFFD and the hook completes normally. Matches the existing errors="replace" pattern in src/brainlayer/pipeline/extract_markdown.py. Adds tests/test_post_commit_hook.py with three tests: - non-UTF8 commit message (discriminating; fails without the fix) - non-UTF8 file path - valid UTF-8 (sanity; ensures no behavior change on the happy path)
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
📝 WalkthroughWalkthroughThe PR updates the post-commit hook to explicitly decode subprocess git command outputs as UTF-8 with error replacement, and adds comprehensive test coverage validating correct behavior with both valid and invalid UTF-8 sequences in commit messages and file paths. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/test_post_commit_hook.py`:
- Around line 70-87: Add a regression test that covers non-UTF8 bytes returned
by git --show-toplevel: extend the existing helper _make_fake_check_output (and
its test usages around tests/test_post_commit_hook.py lines ~107-141) to return
a toplevel_bytes value containing a non-UTF8 byte sequence (e.g. include \xFF)
and add a new test case that uses that helper to assert the hook still correctly
handles/decodes the repo root; locate the factory function
_make_fake_check_output and the inner fake(cmd, **kwargs) and add the
failing-case scenario for "--show-toplevel" to ensure the code path for repo
root decoding is exercised.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: e30cefb7-e3e8-4613-8324-c2d2a848ec23
📒 Files selected for processing (2)
hooks/post-commit.pytests/test_post_commit_hook.py
📜 Review details
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests
**/*.py: Usepaths.py:get_db_path()for all database path resolution; all scripts and CLI must use this function rather than hardcoding paths
When performing bulk database operations: stop enrichment workers first, checkpoint WAL before and after, drop FTS triggers before bulk deletes, batch deletes in 5-10K chunks, and checkpoint every 3 batches
Files:
hooks/post-commit.pytests/test_post_commit_hook.py
🧠 Learnings (1)
📚 Learning: 2026-03-14T02:20:54.656Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-14T02:20:54.656Z
Learning: Applies to **/*.py : Run pytest before claiming behavior changed safely; current test suite has 929 tests
Applied to files:
tests/test_post_commit_hook.py
🔇 Additional comments (3)
hooks/post-commit.py (1)
17-25: Decode hardening is correctly and consistently applied.All modified git-output decode paths now use explicit UTF-8 with replacement, which prevents hook aborts on invalid byte sequences while preserving forward progress.
Also applies to: 32-45
tests/test_post_commit_hook.py (2)
89-141: Great targeted regression tests for non-UTF8 commit metadata.These tests accurately capture the original failure mode and validate safe behavior after the decode change.
1-141: [Your rewritten review comment text here]
[Exactly ONE classification tag]
Adds test_post_commit_handles_non_utf8_repo_toplevel to exercise the fourth decode call in hooks/post-commit.py — the `git rev-parse --show-toplevel` path that feeds os.path.basename() for the project kwarg. Completes coverage of all four fixed decode sites. Addresses CodeRabbit review on EtanHey#241.
Summary
.decode()calls inhooks/post-commit.pyagainst non-UTF8 git output by passingerrors="replace"src/brainlayer/pipeline/extract_markdown.py:50tests/test_post_commit_hook.pycovering the non-UTF8 commit-message and file-path paths plus a valid-UTF-8 sanity testRoot cause
The outer
try/except (CalledProcessError, FileNotFoundError)at line 30 doesn't catchUnicodeDecodeError. On a commit whose message or file paths contain non-UTF8 bytes (e.g.,i18n.commitEncoding=latin-1, legacy SVN-imported metadata, or non-UTF8 filesystem paths), the hook crashes with a traceback andstore_memory()never runs — silently dropping that commit from BrainLayer indexing.Test plan
pytest tests/test_post_commit_hook.py -vpytest tests/ -m "not integration"ruff check src/ tests/ruff format --check src/ tests/Summary by CodeRabbit
Release Notes
Bug Fixes
Tests
Note
Harden post-commit hook against non-UTF-8 git output
Replaces plain
.decode()calls in post-commit.py with.decode('utf-8', errors='replace')for all git subprocess outputs (commit hash, message, file paths, repo toplevel). This preventsUnicodeDecodeErrorwhen git returns non-UTF-8 bytes, substitutingU+FFFDfor undecodable bytes instead. Adds a test module covering non-UTF-8 commit messages, file paths, and repo paths, as well as a regression test confirming valid UTF-8 (including non-ASCII) passes through unchanged.Macroscope summarized b9fcc42.