[SE-3-02] Add cross-page duplicate content detector #813

Draft
Copilot wants to merge 2 commits into main from copilot/se-3-02-implement-duplicate-detector

Copilot AI commented Mar 10, 2026

Description

Implements a cross-page duplicate content detector for v2 MDX docs as part of Sprint 3 audit tooling. Detects exact-duplicate pages, near-duplicate page pairs, and verbatim paragraph blocks shared across multiple pages.

Scope

In scope:

  • tools/scripts/validators/content/check-duplicate-content.js — new validator
  • tests/unit/check-duplicate-content.test.js — unit tests
  • tools/script-index.md, tests/script-index.md, docs-guide/indexes/scripts-index.mdx — auto-updated indexes

Out of scope: No content changes; no modifications to existing validators.

Validation

# Unit tests (30 cases)
node tests/unit/check-duplicate-content.test.js
# ✅ check-duplicate-content unit tests passed (30 cases)

# Script header compliance
node tests/unit/script-docs.test.js
# ✅ Script documentation checks passed

# Live scan of all 327 v2 pages
node tools/scripts/validators/content/check-duplicate-content.js
# ❌ Scanned 327 file(s); 0 error(s); 1 exact-duplicate cluster(s); 1 near-duplicate pair(s); 0 duplicate block(s).
# → 86-page exact-duplicate cluster: auto-generated API stub pages (expected)
# → Near-duplicate pair: v2/community/.../roadmap.mdx ↔ v2/home/.../roadmap.mdx (96% similarity)

# JSON output mode
node tools/scripts/validators/content/check-duplicate-content.js --json

Follow-up Tasks

  • Investigate and resolve the near-duplicate roadmap pages (v2/community/livepeer-community/roadmap.mdx and v2/home/about-livepeer/roadmap.mdx)
  • Consider adding an --exclude-paths flag to suppress known-good clusters of auto-generated API stub pages

Type of Change

  • Other (please describe)

New diagnostic validator script — no doc content modified.

Related Issues

Related to #

Changes Made

  • check-duplicate-content.js — validator with three detection modes:
    • Exact duplicates: SHA-256 hash of normalised body (strips frontmatter, JSX tags, import/export lines, lowercases)
    • Near-duplicates: Jaccard similarity on 5-word shingles; default threshold 80%
    • Duplicate blocks: paragraph-level hash index; flags paragraphs shared across ≥ N pages (default 3, configurable via --min-block-pages)
  • CLI flags: --path, --file, --files, --strict, --min-block-pages <n>, --json, --help
  • check-duplicate-content.test.js — 30 unit tests covering all core functions (extractBody, normalizeBody, extractParagraphs, buildShingles, jaccardSimilarity, hashString, and all three detectors)
  • Script indexes regenerated via script-docs.test.js --write --rebuild-indexes
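The exact-duplicate and duplicate-block modes listed above can be sketched roughly as follows. This is a minimal illustration, not the code in check-duplicate-content.js: the function names (normalizeBodySketch, sha256, findExactDuplicates, findDuplicateBlocks), the regular expressions, and the in-memory `samplePages` input are all assumptions made for the example, and the real validator parses frontmatter with gray-matter rather than a regex.

```javascript
const crypto = require('node:crypto');

// Hypothetical normaliser: drops leading YAML frontmatter, import/export lines,
// and JSX/HTML tags, then lowercases. Deliberately naive: a prose line that
// starts with "import " or "export " would also be dropped.
function normalizeBodySketch(source) {
  return source
    .replace(/^---\r?\n[\s\S]*?\r?\n---\r?\n?/, '')
    .split(/\r?\n/)
    .filter((line) => !/^\s*(import|export)\s/.test(line))
    .join('\n')
    .replace(/<[^>]+>/g, ' ')
    .toLowerCase()
    .replace(/[ \t]+/g, ' ')
    .trim();
}

function sha256(text) {
  return crypto.createHash('sha256').update(text, 'utf8').digest('hex');
}

// Exact duplicates: group files whose normalised bodies hash to the same value.
function findExactDuplicates(pages) {
  const byHash = new Map();
  for (const [file, source] of Object.entries(pages)) {
    const key = sha256(normalizeBodySketch(source));
    if (!byHash.has(key)) byHash.set(key, []);
    byHash.get(key).push(file);
  }
  return [...byHash.values()].filter((files) => files.length > 1);
}

// Duplicate blocks: index each paragraph by hash and keep those that appear
// on at least minPages distinct pages.
function findDuplicateBlocks(pages, minPages = 3) {
  const index = new Map(); // paragraph hash -> Set of files containing it
  for (const [file, source] of Object.entries(pages)) {
    const paragraphs = normalizeBodySketch(source)
      .split(/\n\s*\n/)
      .map((p) => p.replace(/\s+/g, ' ').trim())
      .filter((p) => p.length > 0);
    for (const paragraph of paragraphs) {
      const key = sha256(paragraph);
      if (!index.has(key)) index.set(key, new Set());
      index.get(key).add(file);
    }
  }
  return [...index.values()]
    .filter((files) => files.size >= minPages)
    .map((files) => [...files].sort());
}

// Made-up sample input: file name -> raw MDX source.
const samplePages = {
  'a.mdx': "---\ntitle: A\n---\nimport X from 'x'\n\nShared intro paragraph.\n\nUnique to A.\n",
  'b.mdx': '---\ntitle: B\n---\n\n<Note>Shared intro paragraph.</Note>\n\nUnique to B.\n',
  'c.mdx': 'Shared Intro Paragraph.\n\nUnique to C.\n',
  'd.mdx': '---\ntitle: D\n---\nShared intro paragraph.\n\nUnique to C.\n',
};

console.log(findExactDuplicates(samplePages)); // [ [ 'c.mdx', 'd.mdx' ] ]
console.log(findDuplicateBlocks(samplePages)); // [ [ 'a.mdx', 'b.mdx', 'c.mdx', 'd.mdx' ] ]
```

In this sketch both modes make a single pass over the pages and rely on hash lookups, so their cost grows linearly with the amount of text rather than with the number of page pairs.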

Testing

  • Verified all links work
  • Checked formatting and style
  • Reviewed against style guides

Screenshots (if applicable)

N/A — CLI tool only.

Checklist

  • My changes follow the style guides
  • I've reviewed the Component Library for available components
  • I've updated related pages if needed
  • I've checked for broken links
  • My changes are clear and easy to understand
  • I've tested locally

Additional Notes

The script follows the established validator header pattern (@script, @category, @purpose, @needs SE-3-02, etc.) and is manual: it runs on demand only and is not wired into the commit pipeline. The Jaccard comparison is O(n²) over page pairs (327 pages give 327 × 326 / 2 = 53,301 comparisons), which is acceptable for ~300-page corpora at manual run cadence. It uses only the Node.js built-in crypto module plus the already-required gray-matter, so it adds no new dependencies.
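To make the quadratic cost concrete, here is a minimal sketch of a pairwise Jaccard comparison over 5-word shingles. The names (buildShinglesSketch, jaccardSketch, findNearDuplicates), the `comparisons` counter, and the choice to score two empty pages as 0 are assumptions for illustration and may differ from the buildShingles and jaccardSimilarity functions in the PR.

```javascript
// Build the set of k-word shingles (overlapping word windows) for a text.
function buildShinglesSketch(text, k = 5) {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const shingles = new Set();
  for (let i = 0; i + k <= words.length; i += 1) {
    shingles.add(words.slice(i, i + k).join(' '));
  }
  return shingles;
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|.
function jaccardSketch(a, b) {
  if (a.size === 0 && b.size === 0) return 0; // assumption: empty pages are not "similar"
  let intersection = 0;
  for (const shingle of a) {
    if (b.has(shingle)) intersection += 1;
  }
  return intersection / (a.size + b.size - intersection);
}

// Compare every unordered pair of pages once: n * (n - 1) / 2 comparisons.
function findNearDuplicates(pages, threshold = 0.8) {
  const entries = Object.entries(pages).map(([file, text]) => [
    file,
    buildShinglesSketch(text),
  ]);
  const pairs = [];
  let comparisons = 0;
  for (let i = 0; i < entries.length; i += 1) {
    for (let j = i + 1; j < entries.length; j += 1) {
      comparisons += 1;
      const similarity = jaccardSketch(entries[i][1], entries[j][1]);
      if (similarity >= threshold) {
        pairs.push({ a: entries[i][0], b: entries[j][0], similarity });
      }
    }
  }
  return { pairs, comparisons };
}

// Made-up sample input: file name -> already-normalised body text.
const sampleBodies = {
  'one.mdx': 'one two three four five six seven eight nine ten',
  'two.mdx': 'one two three four five six seven eight nine ten eleven',
  'three.mdx': 'alpha beta gamma delta epsilon zeta',
};

const result = findNearDuplicates(sampleBodies);
console.log(result.comparisons); // 3
console.log(result.pairs.map((p) => `${p.a} ~ ${p.b}: ${p.similarity.toFixed(3)}`));
// [ 'one.mdx ~ two.mdx: 0.857' ]
```

The shingle sets are built once per page and reused across pairs, so the quadratic term covers only the set intersections. If the corpus grew well beyond a few hundred pages, a technique such as MinHash with locality-sensitive hashing would be a common way to avoid comparing every pair, but nothing in this PR indicates that is needed.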

Original prompt

This section details the original issue you should resolve.

<issue_title>[SE-3-02] Duplicate content detector</issue_title>
<issue_description>Tracking issue for SE-3-02 in the Sprint 3 + Post-Migration Audit stream.

Scope:

  • Implement Duplicate content detector
  • Keep changes isolated to the task branch/worktree
  • Validate against docs-v2 governance rules before PR</issue_description>

Comments on the Issue (you are @copilot in this section)

Add tools/scripts/validators/content/check-duplicate-content.js with:
- Exact-duplicate page detection via SHA-256 body content hash
- Near-duplicate page detection via Jaccard shingle similarity (≥80%)
- Duplicate block detection via paragraph-level hashing (≥3 pages)
- CLI flags: --path, --file, --files, --strict, --min-block-pages, --json
- Unit test: tests/unit/check-duplicate-content.test.js (30 cases)
- Script index auto-updated via script-docs.test.js --write --rebuild-indexes

Co-authored-by: DeveloperAlly <12529822+DeveloperAlly@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add duplicate content detector for audit stream" to "[SE-3-02] Add cross-page duplicate content detector" on Mar 10, 2026

Development

Successfully merging this pull request may close these issues.

[SE-3-02] Duplicate content detector