DocumentSampler

Pure PHP library that extracts a structured, representative sample from a document of any length. No framework dependency, no HTTP calls, no AI — just text processing.

Designed as the input layer for downstream AI-powered packages such as relevance checkers, prompt injection detectors, and depersonalisation services.

Requirements

PHP ^8.5

Installation

composer require labrodev/document-sampler

Basic usage

use Labrodev\DocumentSampler\DocumentSampler;

$result = (new DocumentSampler())->sample($rawText);

$result->intro             // opening chars — title and introduction
$result->outline           // extracted section headings from anywhere in the document
$result->middle            // fixed window centred on the document midpoint
$result->tail              // closing chars — conclusion and sign-off
$result->text              // all samples joined with separators
$result->charCount         // character count of the combined sample
$result->originalCharCount // character count of the original document

Custom window sizes

By default each zone uses the window defined on the DocumentPart enum. Pass any subset to the constructor to override:

// Override specific zones — unset zones use the enum defaults
$sampler = new DocumentSampler(
    intro:   2000,
    middle:  300,
);

$result = $sampler->sample($rawText);

How it works

The sampler partitions every document into four fixed-size windows regardless of document length:

Zone	Default window	What it captures
`intro`	1000 chars	Title, abstract, opening paragraphs
`outline`	500 chars	Section headings (`# Markdown`, `1.1 Numbered`, `ALL-CAPS` lines)
`middle`	500 chars	Window centred on the document midpoint
`tail`	500 chars	Closing paragraphs, conclusion, signature

Windows are fixed — a 400-page PDF gets the same sized sample as a one-page memo. The goal is a compact, representative fingerprint of the document, not a summary.

Exporting results

JSON

$result->toJson();

{
    "meta": {
        "originalCharCount": 50000,
        "sampledCharCount": 2300
    },
    "samples": {
        "intro": "...",
        "outline": "...",
        "middle": "...",
        "tail": "..."
    }
}

Markdown

$result->toMd();

## Document Sample

**Original size:** 50,000 chars
**Sampled size:** 2,300 chars

### Intro
...

### Outline
...

### Middle
...

### Tail
...

Empty zones are omitted from both outputs.

Default window sizes

Window sizes are defined on the DocumentPart enum and can be read at runtime:

use Labrodev\DocumentSampler\Enums\DocumentPart;

DocumentPart::Intro->chars();   // 1000
DocumentPart::Outline->chars(); // 500
DocumentPart::Middle->chars();  // 500
DocumentPart::Tail->chars();    // 500

When to use this

Before calling an AI API — reduce a large document to a structured excerpt that fits in a context window without losing structural information.
Relevance checking — feed $result->text to a classifier to decide whether a document is relevant before processing it in full.
Prompt injection detection — scan a compact sample for malicious instructions before passing untrusted documents to an LLM.
Depersonalisation — run PII detection over a representative sample before deciding whether to redact the full document.
Document classification — use the outline and intro zones to classify document type without reading the entire file.

Testing

composer test

Static analysis

composer analyse

Author

Petro Lashyn — contact@labrodev.com

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
src		src
tests		tests
.gitattributes		.gitattributes
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
LICENSE.md		LICENSE.md
README.md		README.md
composer.json		composer.json
phpstan.neon		phpstan.neon

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DocumentSampler

Requirements

Installation

Basic usage

Custom window sizes

How it works

Exporting results

JSON

Markdown

Default window sizes

When to use this

Testing

Static analysis

Author

License

About

Licenses found

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

DocumentSampler

Requirements

Installation

Basic usage

Custom window sizes

How it works

Exporting results

JSON

Markdown

Default window sizes

When to use this

Testing

Static analysis

Author

License

About

Topics

Resources

License

Licenses found

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages