Skip to content

labrodev/document-sampler

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DocumentSampler

Pure PHP library that extracts a structured, representative sample from a document of any length. No framework dependency, no HTTP calls, no AI — just text processing.

Designed as the input layer for downstream AI-powered packages such as relevance checkers, prompt injection detectors, and depersonalisation services.


Requirements

  • PHP ^8.5

Installation

composer require labrodev/document-sampler

Basic usage

use Labrodev\DocumentSampler\DocumentSampler;

$result = (new DocumentSampler())->sample($rawText);

$result->intro             // opening chars — title and introduction
$result->outline           // extracted section headings from anywhere in the document
$result->middle            // fixed window centred on the document midpoint
$result->tail              // closing chars — conclusion and sign-off
$result->text              // all samples joined with separators
$result->charCount         // character count of the combined sample
$result->originalCharCount // character count of the original document

Custom window sizes

By default each zone uses the window defined on the DocumentPart enum. Pass any subset to the constructor to override:

// Override specific zones — unset zones use the enum defaults
$sampler = new DocumentSampler(
    intro:   2000,
    middle:  300,
);

$result = $sampler->sample($rawText);

How it works

The sampler partitions every document into four fixed-size windows regardless of document length:

Zone Default window What it captures
intro 1000 chars Title, abstract, opening paragraphs
outline 500 chars Section headings (# Markdown, 1.1 Numbered, ALL-CAPS lines)
middle 500 chars Window centred on the document midpoint
tail 500 chars Closing paragraphs, conclusion, signature

Windows are fixed — a 400-page PDF gets the same sized sample as a one-page memo. The goal is a compact, representative fingerprint of the document, not a summary.


Exporting results

JSON

$result->toJson();
{
    "meta": {
        "originalCharCount": 50000,
        "sampledCharCount": 2300
    },
    "samples": {
        "intro": "...",
        "outline": "...",
        "middle": "...",
        "tail": "..."
    }
}

Markdown

$result->toMd();
## Document Sample

**Original size:** 50,000 chars
**Sampled size:** 2,300 chars

### Intro
...

### Outline
...

### Middle
...

### Tail
...

Empty zones are omitted from both outputs.


Default window sizes

Window sizes are defined on the DocumentPart enum and can be read at runtime:

use Labrodev\DocumentSampler\Enums\DocumentPart;

DocumentPart::Intro->chars();   // 1000
DocumentPart::Outline->chars(); // 500
DocumentPart::Middle->chars();  // 500
DocumentPart::Tail->chars();    // 500

When to use this

  • Before calling an AI API — reduce a large document to a structured excerpt that fits in a context window without losing structural information.
  • Relevance checking — feed $result->text to a classifier to decide whether a document is relevant before processing it in full.
  • Prompt injection detection — scan a compact sample for malicious instructions before passing untrusted documents to an LLM.
  • Depersonalisation — run PII detection over a representative sample before deciding whether to redact the full document.
  • Document classification — use the outline and intro zones to classify document type without reading the entire file.

Testing

composer test

Static analysis

composer analyse

Author

Petro Lashyncontact@labrodev.com


License

MIT

About

Pure PHP library that extracts a structured, representative sample from a document of any length. No framework dependency, no HTTP calls, no AI — just text processing.

Topics

Resources

License

MIT, MIT licenses found

Licenses found

MIT
LICENSE
MIT
LICENSE.md

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages