Skip to content

fix: harden post-commit hook against non-UTF8 encoding#241

Open
mikemolinet wants to merge 2 commits intoEtanHey:mainfrom
mikemolinet:fix/post-commit-utf8-hardening
Open

fix: harden post-commit hook against non-UTF8 encoding#241
mikemolinet wants to merge 2 commits intoEtanHey:mainfrom
mikemolinet:fix/post-commit-utf8-hardening

Conversation

@mikemolinet
Copy link
Copy Markdown

@mikemolinet mikemolinet commented Apr 13, 2026

Summary

  • Harden four .decode() calls in hooks/post-commit.py against non-UTF8 git output by passing errors="replace"
  • Matches existing pattern in src/brainlayer/pipeline/extract_markdown.py:50
  • Adds tests/test_post_commit_hook.py covering the non-UTF8 commit-message and file-path paths plus a valid-UTF-8 sanity test

Root cause

The outer try/except (CalledProcessError, FileNotFoundError) at line 30 doesn't catch UnicodeDecodeError. On a commit whose message or file paths contain non-UTF8 bytes (e.g., i18n.commitEncoding=latin-1, legacy SVN-imported metadata, or non-UTF8 filesystem paths), the hook crashes with a traceback and store_memory() never runs — silently dropping that commit from BrainLayer indexing.

Test plan

  • pytest tests/test_post_commit_hook.py -v
  • pytest tests/ -m "not integration"
  • ruff check src/ tests/
  • ruff format --check src/ tests/

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Enhanced robustness when processing git commits and file paths containing non-UTF-8 characters, ensuring graceful error handling and recovery.
  • Tests

    • Added comprehensive test coverage for character encoding scenarios, validating proper handling of non-UTF-8 inputs, Latin-1 encoded text, and multi-byte UTF-8 characters.

Note

Harden post-commit hook against non-UTF-8 git output

Replaces plain .decode() calls in post-commit.py with .decode('utf-8', errors='replace') for all git subprocess outputs (commit hash, message, file paths, repo toplevel). This prevents UnicodeDecodeError when git returns non-UTF-8 bytes, substituting U+FFFD for undecodable bytes instead. Adds a test module covering non-UTF-8 commit messages, file paths, and repo paths, as well as a regression test confirming valid UTF-8 (including non-ASCII) passes through unchanged.

Macroscope summarized b9fcc42.

The post-commit hook's subprocess output .decode() calls raise
UnicodeDecodeError if git returns non-UTF8 bytes — e.g., when
i18n.commitEncoding is set to a non-UTF8 charset, when the repo has
legacy SVN/CVS-imported commit metadata, or when file paths contain
non-UTF8 bytes. The outer try/except only catches CalledProcessError
and FileNotFoundError, so UnicodeDecodeError escapes and the hook
prints a traceback, silently skipping BrainLayer indexing for that
commit.

Pass errors="replace" to the four .decode() calls so invalid bytes
become U+FFFD and the hook completes normally. Matches the existing
errors="replace" pattern in src/brainlayer/pipeline/extract_markdown.py.

Adds tests/test_post_commit_hook.py with three tests:
- non-UTF8 commit message (discriminating; fails without the fix)
- non-UTF8 file path
- valid UTF-8 (sanity; ensures no behavior change on the happy path)
@mikemolinet mikemolinet requested a review from EtanHey as a code owner April 13, 2026 22:43
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 13, 2026

📝 Walkthrough

Walkthrough

The PR updates the post-commit hook to explicitly decode subprocess git command outputs as UTF-8 with error replacement, and adds comprehensive test coverage validating correct behavior with both valid and invalid UTF-8 sequences in commit messages and file paths.

Changes

Cohort / File(s) Summary
Post-commit Hook UTF-8 Decoding
hooks/post-commit.py
Updated git-output decoding for commit_hash, commit_msg, files_changed, and repo_root to explicitly use UTF-8 with errors="replace", replacing prior default .decode() behavior while preserving control flow and error handling.
Test Coverage for Hook
tests/test_post_commit_hook.py
Added pytest test suite with three tests: validation of non-UTF-8 commit messages, non-UTF-8 file paths, and UTF-8 sanity checks. Tests dynamically load hook module, mock subprocess git commands, stub brainlayer modules, and assert correct decoding behavior including replacement character insertion on invalid sequences.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A UTF-8 fix hops into place,
No garbled text to slow the race!
With errors="replace" standing strong,
Invalid bytes won't go wrong,
Tests ensure our commits decode with grace! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix: harden post-commit hook against non-UTF8 encoding' directly and accurately describes the main change: hardening the post-commit hook by adding UTF-8 error handling.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_post_commit_hook.py`:
- Around line 70-87: Add a regression test that covers non-UTF8 bytes returned
by git --show-toplevel: extend the existing helper _make_fake_check_output (and
its test usages around tests/test_post_commit_hook.py lines ~107-141) to return
a toplevel_bytes value containing a non-UTF8 byte sequence (e.g. include \xFF)
and add a new test case that uses that helper to assert the hook still correctly
handles/decodes the repo root; locate the factory function
_make_fake_check_output and the inner fake(cmd, **kwargs) and add the
failing-case scenario for "--show-toplevel" to ensure the code path for repo
root decoding is exercised.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: e30cefb7-e3e8-4613-8324-c2d2a848ec23

📥 Commits

Reviewing files that changed from the base of the PR and between 15ab4e9 and 92f1401.

📒 Files selected for processing (2)
  • hooks/post-commit.py
  • tests/test_post_commit_hook.py
📜 Review details
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests

**/*.py: Use paths.py:get_db_path() for all database path resolution; all scripts and CLI must use this function rather than hardcoding paths
When performing bulk database operations: stop enrichment workers first, checkpoint WAL before and after, drop FTS triggers before bulk deletes, batch deletes in 5-10K chunks, and checkpoint every 3 batches

Files:

  • hooks/post-commit.py
  • tests/test_post_commit_hook.py
🧠 Learnings (1)
📚 Learning: 2026-03-14T02:20:54.656Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-14T02:20:54.656Z
Learning: Applies to **/*.py : Run pytest before claiming behavior changed safely; current test suite has 929 tests

Applied to files:

  • tests/test_post_commit_hook.py
🔇 Additional comments (3)
hooks/post-commit.py (1)

17-25: Decode hardening is correctly and consistently applied.

All modified git-output decode paths now use explicit UTF-8 with replacement, which prevents hook aborts on invalid byte sequences while preserving forward progress.

Also applies to: 32-45

tests/test_post_commit_hook.py (2)

89-141: Great targeted regression tests for non-UTF8 commit metadata.

These tests accurately capture the original failure mode and validate safe behavior after the decode change.


1-141: [Your rewritten review comment text here]
[Exactly ONE classification tag]

Comment thread tests/test_post_commit_hook.py
Adds test_post_commit_handles_non_utf8_repo_toplevel to exercise the
fourth decode call in hooks/post-commit.py — the `git rev-parse
--show-toplevel` path that feeds os.path.basename() for the project
kwarg. Completes coverage of all four fixed decode sites.

Addresses CodeRabbit review on EtanHey#241.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant