[SE-3-02] Add cross-page duplicate content detector#813
Draft
Add tools/scripts/validators/content/check-duplicate-content.js with:
- Exact-duplicate page detection via SHA-256 body content hash
- Near-duplicate page detection via Jaccard shingle similarity (≥80%)
- Duplicate block detection via paragraph-level hashing (≥3 pages)
- CLI flags: --path, --file, --files, --strict, --min-block-pages, --json
- Unit test: tests/unit/check-duplicate-content.test.js (30 cases)
- Script index auto-updated via script-docs.test.js --write --rebuild-indexes

Co-authored-by: DeveloperAlly <12529822+DeveloperAlly@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add duplicate content detector for audit stream" to "[SE-3-02] Add cross-page duplicate content detector" on Mar 10, 2026
Description
Implements a cross-page duplicate content detector for v2 MDX docs as part of Sprint 3 audit tooling. Detects exact-duplicate pages, near-duplicate page pairs, and verbatim paragraph blocks shared across multiple pages.
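The hash-based parts of this (exact-duplicate pages and shared paragraph blocks) can be sketched roughly as below. This is a minimal illustration, not the code in this PR: `hashString` and `normalizeBody` borrow names from the function list in this description, but their bodies, the whitespace-collapsing normalization, the blank-line paragraph split, and the `{ path, body }` page shape are all assumptions.

```javascript
// Sketch only: names echo this PR's function list, bodies are assumptions.
const crypto = require('node:crypto');

// Hash a string with SHA-256 and return the hex digest.
function hashString(text) {
  return crypto.createHash('sha256').update(text, 'utf8').digest('hex');
}

// Assumed normalization: collapse whitespace runs and trim the ends.
function normalizeBody(body) {
  return body.replace(/\s+/g, ' ').trim();
}

// Group pages by body hash; a group with more than one path is an
// exact-duplicate cluster. `pages` is a hypothetical [{ path, body }] array.
function findExactDuplicates(pages) {
  const byHash = new Map();
  for (const { path, body } of pages) {
    const key = hashString(normalizeBody(body));
    if (!byHash.has(key)) byHash.set(key, []);
    byHash.get(key).push(path);
  }
  return [...byHash.values()].filter((paths) => paths.length > 1);
}

// Report paragraphs that appear on at least `minPages` distinct pages.
// Paragraphs are assumed to be separated by blank lines.
function findDuplicateBlocks(pages, minPages = 3) {
  const pagesByBlock = new Map();
  for (const { path, body } of pages) {
    const paragraphs = body.split(/\n\s*\n/).map(normalizeBody).filter(Boolean);
    // De-duplicate within a page so a repeated paragraph counts once per page.
    for (const key of new Set(paragraphs.map(hashString))) {
      if (!pagesByBlock.has(key)) pagesByBlock.set(key, new Set());
      pagesByBlock.get(key).add(path);
    }
  }
  return [...pagesByBlock.entries()]
    .filter(([, paths]) => paths.size >= minPages)
    .map(([hash, paths]) => ({ hash, pages: [...paths] }));
}
```

Hashing normalized text keeps the comparison cheap: each page or paragraph is hashed once and duplicates are found by map lookup, so only the near-duplicate mode needs pairwise comparison.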
Scope
In scope:
- tools/scripts/validators/content/check-duplicate-content.js: new validator
- tests/unit/check-duplicate-content.test.js: unit tests
- tools/script-index.md, tests/script-index.md, docs-guide/indexes/scripts-index.mdx: auto-updated indexes

Out of scope: No content changes; no modifications to existing validators.
Validation
Follow-up Tasks
- v2/community/livepeer-community/roadmap.mdx / v2/home/about-livepeer/roadmap.mdx)
- --exclude-paths flag to suppress known-good auto-generated API stub clusters

Type of Change
New diagnostic validator script — no doc content modified.
Related Issues
Related to #
Changes Made
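- check-duplicate-content.js: validator with three detection modes:
  - Exact-duplicate pages via SHA-256 body content hash
  - Near-duplicate page pairs via Jaccard shingle similarity (≥80%)
  - Duplicate blocks via paragraph-level hashing (≥3 pages; see --min-block-pages)
- CLI flags: --path, --file, --files, --strict, --min-block-pages <n>, --json, --help
- check-duplicate-content.test.js: 30 unit tests covering all core functions (extractBody, normalizeBody, extractParagraphs, buildShingles, jaccardSimilarity, hashString, and all three detectors)
- Script indexes auto-updated via script-docs.test.js --write --rebuild-indexes

Testing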
Screenshots (if applicable)
N/A — CLI tool only.
Checklist
Additional Notes
Script follows the established validator pattern (@script, @category, @purpose, @needs SE-3-02, etc.) and is manual: run on-demand only, not wired into the commit pipeline. The Jaccard comparison is O(n²) over page pairs, which is acceptable for ~300-page corpora at manual run cadence. Uses only Node.js built-ins (crypto) plus the already-required gray-matter, so no new dependencies.
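For reviewers unfamiliar with the pairwise step, here is a rough sketch of Jaccard shingle comparison over all page pairs. It is illustrative only: `buildShingles` and `jaccardSimilarity` reuse names from this PR's function list, but the bodies, the 3-word shingle size, the lowercase whitespace tokenization, and the `findNearDuplicates` helper with its `{ path, body }` input are assumptions rather than the actual implementation.

```javascript
// Sketch only: shingle size, tokenization, and function bodies are assumptions.

// Build the set of k-word shingles (overlapping word windows) for a text.
function buildShingles(text, k = 3) {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const shingles = new Set();
  if (words.length === 0) return shingles;
  if (words.length < k) {
    // Text shorter than one window: treat the whole text as a single shingle.
    shingles.add(words.join(' '));
    return shingles;
  }
  for (let i = 0; i + k <= words.length; i += 1) {
    shingles.add(words.slice(i, i + k).join(' '));
  }
  return shingles;
}

// Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets score 0 here.
function jaccardSimilarity(a, b) {
  if (a.size === 0 && b.size === 0) return 0;
  let intersection = 0;
  for (const item of a) {
    if (b.has(item)) intersection += 1;
  }
  return intersection / (a.size + b.size - intersection);
}

// Compare every unordered page pair once: n * (n - 1) / 2 comparisons.
function findNearDuplicates(pages, threshold = 0.8) {
  const shingled = pages.map(({ path, body }) => ({ path, shingles: buildShingles(body) }));
  const pairs = [];
  for (let i = 0; i < shingled.length; i += 1) {
    for (let j = i + 1; j < shingled.length; j += 1) {
      const similarity = jaccardSimilarity(shingled[i].shingles, shingled[j].shingles);
      if (similarity >= threshold) {
        pairs.push({ a: shingled[i].path, b: shingled[j].path, similarity });
      }
    }
  }
  return pairs;
}
```

For 300 pages the double loop performs 300 × 299 / 2 = 44,850 set comparisons, which is why the quadratic cost is tolerable for a manually run check.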