Add front matter support (CMARK_OPT_FRONT_MATTER) #603

Open
samuel-williams-shopify wants to merge 1 commit into commonmark:master from socketry:front-matter-support

@samuel-williams-shopify

This PR adds opt-in front matter parsing to cmark. When
CMARK_OPT_FRONT_MATTER is set (or --front-matter is passed to the
executable), a --- delimited block at the very start of the document is
captured as a CMARK_NODE_FRONT_MATTER node and excluded from HTML output.
The feature is entirely opt-in; existing behaviour is unchanged when the flag
is not set.

Motivation

Many Markdown-based tools (static site generators, documentation systems,
notebook formats) attach structured metadata to documents using a front matter
block. Without native support, every such tool must pre-process the input
before handing it to cmark, losing source-position information and making it
impossible to round-trip the document.

Design

Node type

CMARK_NODE_FRONT_MATTER is a first-class block node type, added to the enum
after CMARK_NODE_THEMATIC_BREAK. It stores data identically to
CMARK_NODE_CODE_BLOCK:

  • cmark_node_get_literal() — the raw content between the delimiters.
  • cmark_node_get_fence_info() — an optional format hint from the opening
    delimiter line (e.g. --- yaml, --- toml).

The implementation is format-agnostic. How the content is interpreted (YAML,
TOML, JSON, …) is left entirely to the caller.

Parser integration

The feature is implemented as a small state machine in src/front_matter.c.
Three fields are added to cmark_parser:

  • front_matter_scanning — true from the opening --- until the closing
    --- is found or the document ends.
  • front_matter_buf / front_matter_info — cmark_strbuf accumulators for
    the content and info string respectively.

cmark_front_matter_process_line() is called from S_process_line() in
blocks.c immediately after parser->line_number is incremented, so line 1
is the trigger. The state lives on the parser struct, so the feature works
correctly regardless of how many times cmark_parser_feed() is called.
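
To illustrate why keeping the state on the parser struct makes the result independent of feed boundaries, here is a minimal sketch of such a line-driven state machine. It is not the PR's code: fm_state and fm_process_line are hypothetical stand-ins, the input is assumed to be already split into lines, info strings and trailing whitespace are ignored, and a fixed-size array replaces the cmark_strbuf accumulators.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define FM_CAPACITY 1024 /* fixed buffer, for this sketch only */

/* Hypothetical stand-in for the fields the PR adds to cmark_parser. */
typedef struct {
  int line_number;            /* incremented once per processed line */
  bool scanning;              /* true between opening and closing "---" */
  char content[FM_CAPACITY];  /* accumulated front matter text */
  size_t content_len;
} fm_state;

/* Returns true when the line belongs to the front matter (delimiters
 * included) and should not reach the normal block parser. */
static bool fm_process_line(fm_state *st, const char *line) {
  size_t len = strlen(line);
  /* Simplification: a delimiter is exactly "---" with or without "\n". */
  bool is_delim = strcmp(line, "---\n") == 0 || strcmp(line, "---") == 0;

  st->line_number++;
  if (st->line_number == 1 && is_delim) {
    st->scanning = true; /* only line 1 can open front matter */
    return true;
  }
  if (!st->scanning) {
    return false; /* ordinary document content */
  }
  if (is_delim) {
    st->scanning = false; /* closing delimiter ends the block */
    return true;
  }
  /* Append the line; lines that would overflow the buffer are dropped. */
  if (st->content_len + len < FM_CAPACITY) {
    memcpy(st->content + st->content_len, line, len);
    st->content_len += len;
    st->content[st->content_len] = '\0';
  }
  return true;
}
```

Because every decision reads only the fields of fm_state, it makes no difference whether the lines arrive from one large feed or from many small ones. If the input ends while scanning is still true, whatever has been accumulated is the front matter, matching the "no closing delimiter" rule.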

Delimiter rules

  • Opening: --- on the very first line, optionally followed by an info
    string (e.g. --- yaml). A fourth consecutive dash (----) is not
    treated as a front matter opener — it remains a thematic break.
  • Closing: exactly --- with optional trailing whitespace. Note that
    ... (the YAML document-end marker) is intentionally not supported as a
    closing delimiter; this implementation is format-agnostic and ... has no
    meaning outside of YAML.
  • No closing delimiter: if the document ends without a closing ---, the
    entire document body (after the opening delimiter) is treated as front
    matter.
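
The opening and closing rules can be sketched as two small predicates. This is an illustration under stated assumptions rather than the code in src/front_matter.c: the function names are hypothetical, and the sketch assumes that an info string must be separated from the dashes by a space or tab, which the description above does not state explicitly.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length of the line once trailing line-ending
 * characters, spaces, and tabs are removed. */
static size_t trimmed_length(const char *line, size_t len) {
  while (len > 0 && (line[len - 1] == '\n' || line[len - 1] == '\r' ||
                     line[len - 1] == ' ' || line[len - 1] == '\t')) {
    len--;
  }
  return len;
}

/* Opening rule: three dashes, then end of line or whitespace followed by
 * an optional info string. "----" is rejected. On success, *info and
 * *info_len describe the (possibly empty) info string inside `line`. */
static bool is_opening_delimiter(const char *line, size_t len,
                                 const char **info, size_t *info_len) {
  size_t i = 3;
  len = trimmed_length(line, len);
  if (len < 3 || strncmp(line, "---", 3) != 0) {
    return false;
  }
  /* Assumption: anything directly after the dashes other than whitespace
   * (a fourth dash, for instance) disqualifies the line. */
  if (i < len && line[i] != ' ' && line[i] != '\t') {
    return false;
  }
  while (i < len && (line[i] == ' ' || line[i] == '\t')) {
    i++;
  }
  *info = line + i;
  *info_len = len - i;
  return true;
}

/* Closing rule: exactly "---" once trailing whitespace is removed. */
static bool is_closing_delimiter(const char *line, size_t len) {
  len = trimmed_length(line, len);
  return len == 3 && strncmp(line, "---", 3) == 0;
}
```

Under these assumptions, "--- yaml" opens a block with the info string "yaml", "----" does not open one, and "..." is not accepted as a closing line.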

Renderers

All renderers handle CMARK_NODE_FRONT_MATTER:

Renderer    Behaviour
HTML        Silent (front matter is metadata, not content)
Plaintext   Silent
CommonMark  Round-trips with delimiters and info string
LaTeX       Silent
Man         Silent
XML         Emitted as a front_matter element with xml:space="preserve" and an optional info attribute
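
As a rough illustration of what round-tripping in the CommonMark renderer amounts to, the sketch below rebuilds the source form from an info string and a literal. render_front_matter is a hypothetical helper, not the code in src/commonmark.c, and it assumes that the info string is separated from the dashes by a single space and that the literal ends with a newline.

```c
#include <stdio.h>

/* Hypothetical helper, not part of cmark: writes the front matter back out
 * in its source form. `info` may be NULL or empty; `literal` is assumed to
 * end with a newline when non-empty. Returns what snprintf returns, i.e.
 * the length the full output needs, excluding the terminating NUL. */
static int render_front_matter(char *out, size_t cap, const char *info,
                               const char *literal) {
  int has_info = info != NULL && info[0] != '\0';
  return snprintf(out, cap, "---%s%s\n%s---\n",
                  has_info ? " " : "", /* separator only when info exists */
                  has_info ? info : "", literal);
}
```

A caller would pass the values obtained from cmark_node_get_fence_info() and cmark_node_get_literal() and check the return value against the buffer size to detect truncation.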

Files changed

  • src/front_matter.c — new: state machine implementation
  • src/front_matter.h — new: public declaration
  • src/cmark.h — CMARK_NODE_FRONT_MATTER enum entry, CMARK_OPT_FRONT_MATTER flag
  • src/parser.h — front_matter_scanning, front_matter_buf, front_matter_info fields
  • src/blocks.c — strbuf lifecycle, S_process_line hook, cmark_parser_finish hook
  • src/node.c — S_free_nodes, get_type_string, get_literal, set_literal, get_fence_info, set_fence_info
  • src/main.c — --front-matter flag
  • src/html.c, src/commonmark.c, src/latex.c, src/man.c, src/xml.c — renderer cases
  • src/CMakeLists.txt — build
  • test/front_matter.txt — new: spec-format test fixture (10 examples)
  • test/CMakeLists.txt — test wiring
  • api_test/main.c — test_front_matter(): 12 assertions covering node type,
    literal content, info string, source position, no-flag behaviour, no closing
    delimiter, and multi-feed correctness
  • changelog.txt

Compatibility

CMARK_OPT_FRONT_MATTER uses bit 11 (1 << 11). Note that the cmark-gfm
fork uses this bit for CMARK_OPT_GITHUB_PRE_LANG; these are separate
codebases with independent option namespaces.

CMARK_NODE_LAST_BLOCK is updated from CMARK_NODE_THEMATIC_BREAK to
CMARK_NODE_FRONT_MATTER. Code that iterates over block node types using
this sentinel will automatically include the new type.

Testing

cmake -S . -B build -DBUILD_SHARED_LIBS=ON
cmake --build build
ctest --test-dir build
# 10/10 tests pass, including api_test and front_matter_executable

@jgm
Member

jgm commented Apr 13, 2026

Is this better handled with an external tool, e.g. a shell script? After all, in general you won't just want to ignore the front matter; you'll want to process it in some way. It's easy to use sed or a similar tool to remove front matter and pipe the result to cmark, if you simply want to ignore it.

@nwellnhof
Contributor

> It's easy to use sed or a similar tool to remove front matter and pipe the result to cmark, if you simply want to ignore it.

I'd guess that most people use libcmark, not the cmark executable, for serious Markdown processing.

Also see #342 which was closed, but this really seems like a useful and widely used feature. It might even make sense to add this extension to the spec.

Regarding the PR, the major problem is that adding a new block type in the middle of the cmark_node_type enum shifts the values of all the entries that follow it, which is an ABI break. This isn't a showstopper, but it's likely to cause problems for at least some downstream projects. Even if it's documented in the release notes, we haven't had a breaking change like this for years.

If we were to follow through with that, we should leave a gap between block and inline types, so similar changes can be made without breaking the ABI. (With cmark-gfm inline types start at 0x4000.)
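
The value shift described above can be shown with reduced enums. These are hypothetical, trimmed-down layouts whose member names only mirror cmark's, with illustrative starting values; the third layout assumes the gap approach mentioned for cmark-gfm, where inline types start at 0x4000.

```c
/* Layout an existing client was compiled against. */
enum layout_before {
  B_THEMATIC_BREAK = 10, /* last block type */
  B_TEXT                 /* first inline type: 11 */
};

/* Layout after inserting the new block type in the middle. */
enum layout_after {
  A_THEMATIC_BREAK = 10,
  A_FRONT_MATTER, /* 11: takes over the old value of TEXT */
  A_TEXT          /* 12: every later entry moves by one */
};

/* Layout with a gap: new block types no longer move inline types. */
enum layout_gapped {
  G_THEMATIC_BREAK = 10,
  G_FRONT_MATTER, /* 11 */
  G_TEXT = 0x4000 /* fixed start for inline types */
};

/* How a library built with the new layout interprets a raw value. */
static const char *new_library_type_name(int type) {
  switch (type) {
  case A_THEMATIC_BREAK:
    return "thematic_break";
  case A_FRONT_MATTER:
    return "front_matter";
  case A_TEXT:
    return "text";
  default:
    return "unknown";
  }
}
```

A client compiled against the old header that passes its stale B_TEXT value to the new library would see it treated as front matter, which is the kind of silent mismatch an ABI break produces. With the gapped layout, the inline values stay put when another block type is added.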

@jgm
Member

jgm commented Apr 13, 2026

> I'd guess that most people use libcmark, not the cmark executable, for serious Markdown processing.

But then isn't it equally easy to extract the front matter in the calling program before calling cmark? No complex parsing is needed for this.
