Add front matter support (CMARK_OPT_FRONT_MATTER) #603
samuel-williams-shopify wants to merge 1 commit into commonmark:master
Is this better handled with an external tool, e.g. a shell script? After all, in general you won't just want to ignore the front matter; you'll want to process it in some way. It's easy to use |
I'd guess that most people use libcmark, not the cmark executable, for serious Markdown processing. Also see #342, which was closed, but this really seems like a useful and widely used feature. It might even make sense to add this extension to the spec. Regarding the PR, the major problem is that adding a new block type in the middle of the cmark_node_type enum changes the values of all following enumerators, which is an ABI break. This isn't a showstopper, but it's likely to cause problems for at least some downstream projects. Even if it's documented in the release notes, we haven't had breaking changes like this for years. If we were to follow through with that, we should leave a gap between block and inline types, so similar changes can be made without breaking the ABI. (With cmark-gfm, inline types start at 0x4000.)
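To make the ABI concern concrete, here is a minimal sketch with hypothetical, shortened enums (these are not the real `cmark_node_type` values). It shows how inserting an enumerator in the middle renumbers everything after it, and how starting inline types at a fixed base avoids that. The `0x4000` base is simply the figure quoted in the comment above.

```c
#include <assert.h>

/* Hypothetical "before" layout: inline types directly follow block types. */
typedef enum {
  OLD_NODE_PARAGRAPH = 1,
  OLD_NODE_THEMATIC_BREAK, /* 2 */
  OLD_NODE_TEXT            /* 3: first inline type */
} old_node_type;

/* Same layout after inserting a block type in the middle:
   every later enumerator is renumbered. */
typedef enum {
  NEW_NODE_PARAGRAPH = 1,
  NEW_NODE_THEMATIC_BREAK, /* 2 */
  NEW_NODE_FRONT_MATTER,   /* 3: takes the value TEXT used to have */
  NEW_NODE_TEXT            /* 4: shifted, so binaries built against 3 misread it */
} new_node_type;

/* Gapped layout: inline types start at a fixed base, leaving room
   for further block types without renumbering anything. */
typedef enum {
  GAP_NODE_PARAGRAPH = 1,
  GAP_NODE_THEMATIC_BREAK,
  GAP_NODE_FRONT_MATTER,
  GAP_NODE_TEXT = 0x4000
} gap_node_type;

/* Checks the claims made in the comments above. */
static void check_layouts(void) {
  assert((int)OLD_NODE_TEXT == 3);
  assert((int)NEW_NODE_TEXT == 4);
  assert((int)NEW_NODE_FRONT_MATTER == (int)OLD_NODE_TEXT);
  assert((int)GAP_NODE_TEXT == 0x4000);
}
```

A program compiled against the old header and linked against a library built with the second layout would treat front matter nodes as text nodes, which is the kind of silent breakage the gap is meant to prevent.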
But then isn't it equally easy to extract the front matter in the calling program before calling cmark? No complex parsing is needed for this.
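For comparison, caller-side extraction along the lines suggested here could look roughly like the sketch below. The `split_doc` type and `split_front_matter()` function are hypothetical names, not part of cmark. The sketch is deliberately simpler than the rules in the PR: it accepts only a bare `---` line as opener and closer, and it treats a document without a closing delimiter as having no front matter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Result of splitting a document; both spans point into the input buffer. */
typedef struct {
  const char *front; /* front matter content, or NULL if none was found */
  size_t front_len;
  const char *body;  /* remainder to hand to the Markdown parser */
  size_t body_len;
} split_doc;

/* Splits `doc` at a leading "---" ... "---" block (hypothetical helper). */
static split_doc split_front_matter(const char *doc, size_t len) {
  split_doc out = {NULL, 0, doc, len};
  const char *p, *end;

  assert(doc != NULL);
  /* In this sketch the opener must be exactly "---\n" at offset 0. */
  if (len < 4 || memcmp(doc, "---\n", 4) != 0)
    return out;

  p = doc + 4;
  end = doc + len;
  while (p < end) {
    const char *nl = memchr(p, '\n', (size_t)(end - p));
    size_t line_len = nl ? (size_t)(nl - p) : (size_t)(end - p);
    if (line_len == 3 && memcmp(p, "---", 3) == 0) {
      /* Closing delimiter found: everything before it is front matter. */
      out.front = doc + 4;
      out.front_len = (size_t)(p - (doc + 4));
      out.body = nl ? nl + 1 : end;
      out.body_len = (size_t)(end - out.body);
      return out;
    }
    if (!nl)
      break;
    p = nl + 1;
  }
  return out; /* no closing delimiter: the whole input stays body */
}
```

The body span can then be passed to the parser as usual. Line numbers reported for the body are relative to the stripped text, so the caller has to add the number of removed lines back if it needs positions in the original file.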
This PR adds opt-in front matter parsing to cmark. When `CMARK_OPT_FRONT_MATTER` is set (or `--front-matter` is passed to the executable), a `---`-delimited block at the very start of the document is captured as a `CMARK_NODE_FRONT_MATTER` node and excluded from HTML output.
The feature is entirely opt-in; existing behaviour is unchanged when the flag is not set.
Motivation
Many Markdown-based tools (static site generators, documentation systems,
notebook formats) attach structured metadata to documents using a front matter
block. Without native support, every such tool must pre-process the input
before handing it to cmark, losing source-position information and making it
impossible to round-trip the document.
Design
Node type
`CMARK_NODE_FRONT_MATTER` is a first-class block node type, added to the enum after `CMARK_NODE_THEMATIC_BREAK`. It stores data identically to `CMARK_NODE_CODE_BLOCK`:
- `cmark_node_get_literal()`: the raw content between the delimiters.
- `cmark_node_get_fence_info()`: an optional format hint from the opening delimiter line (e.g. `--- yaml`, `--- toml`).
The implementation is format-agnostic. How the content is interpreted (YAML, TOML, JSON, …) is left entirely to the caller.
Parser integration
The feature is implemented as a small state machine in `src/front_matter.c`. Three fields are added to `cmark_parser`:
- `front_matter_scanning`: true from the opening `---` until the closing `---` is found or the document ends.
- `front_matter_buf` / `front_matter_info`: `cmark_strbuf` accumulators for the content and info string respectively.
`cmark_front_matter_process_line()` is called from `S_process_line()` in `blocks.c` immediately after `parser->line_number` is incremented, so line 1 is the trigger. The state lives on the parser struct, so the feature works correctly regardless of how many times `cmark_parser_feed()` is called.
Delimiter rules
- Opening delimiter: `---` on the very first line, optionally followed by an info string (e.g. `--- yaml`). A fourth consecutive dash (`----`) is not treated as a front matter opener; it remains a thematic break.
- Closing delimiter: `---` with optional trailing whitespace. Note that `...` (the YAML document-end marker) is intentionally not supported as a closing delimiter; this implementation is format-agnostic and `...` has no meaning outside of YAML.
- No closing delimiter: if the document ends without a closing `---`, the entire document body (after the opening delimiter) is treated as front matter.
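The delimiter rules and the line-driven state machine described above can be sketched as follows. This is not the code from `src/front_matter.c`; the `fm_scanner` struct and the `fm_*` functions are hypothetical names, content is only counted rather than accumulated, and the sketch assumes that whitespace must separate `---` from the info string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical scanner state; it lives outside the per-call stack so it
   survives across multiple feed calls, as the PR describes. */
typedef struct {
  int line_number;      /* lines seen so far */
  int scanning;         /* nonzero between the opener and the closer */
  size_t content_lines; /* front matter content lines captured */
} fm_scanner;

/* Drops trailing spaces, tabs, CR and LF; returns the new length. */
static size_t fm_rtrim(const char *s, size_t len) {
  while (len > 0 && (s[len - 1] == ' ' || s[len - 1] == '\t' ||
                     s[len - 1] == '\r' || s[len - 1] == '\n'))
    len--;
  return len;
}

/* Opener: "---" alone, or "---" + whitespace + info string.
   "----" is rejected so it can remain a thematic break. */
static int fm_is_opener(const char *line, size_t len, const char **info,
                        size_t *info_len) {
  size_t i = 3;
  len = fm_rtrim(line, len);
  if (len < 3 || memcmp(line, "---", 3) != 0)
    return 0;
  if (len > 3 && line[3] != ' ' && line[3] != '\t')
    return 0;
  while (i < len && (line[i] == ' ' || line[i] == '\t'))
    i++;
  *info = line + i;
  *info_len = len - i;
  return 1;
}

/* Closer: exactly "---" with optional trailing whitespace. */
static int fm_is_closer(const char *line, size_t len) {
  len = fm_rtrim(line, len);
  return len == 3 && memcmp(line, "---", 3) == 0;
}

/* Returns 1 if the line belongs to the front matter (delimiter or
   content), 0 if it should go to the normal block parser. */
static int fm_process_line(fm_scanner *s, const char *line, size_t len) {
  const char *info;
  size_t info_len;
  assert(s != NULL && line != NULL);
  s->line_number++;
  if (s->line_number == 1 && fm_is_opener(line, len, &info, &info_len)) {
    s->scanning = 1; /* only line 1 can open front matter */
    return 1;
  }
  if (!s->scanning)
    return 0;
  if (fm_is_closer(line, len)) {
    s->scanning = 0;
    return 1;
  }
  s->content_lines++;
  return 1;
}
```

If the input ends while `scanning` is still set, every line after the opener has been consumed as content, which corresponds to the "no closing delimiter" rule.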
Renderers
All renderers handle `CMARK_NODE_FRONT_MATTER`:
- `front_matter` element with `xml:space="preserve"` and an optional `info` attribute
Files changed
- `src/front_matter.c`: new, state machine implementation
- `src/front_matter.h`: new, public declaration
- `src/cmark.h`: `CMARK_NODE_FRONT_MATTER` enum entry, `CMARK_OPT_FRONT_MATTER` flag
- `src/parser.h`: `front_matter_scanning`, `front_matter_buf`, `front_matter_info` fields
- `src/blocks.c`: strbuf lifecycle, `S_process_line` hook, `cmark_parser_finish` hook
- `src/node.c`: `S_free_nodes`, `get_type_string`, `get_literal`, `set_literal`, `get_fence_info`, `set_fence_info`
- `src/main.c`: `--front-matter` flag
- `src/html.c`, `src/commonmark.c`, `src/latex.c`, `src/man.c`, `src/xml.c`: renderer cases
- `src/CMakeLists.txt`: build
- `test/front_matter.txt`: new, spec-format test fixture (10 examples)
- `test/CMakeLists.txt`: test wiring
- `api_test/main.c`: `test_front_matter()`, 12 assertions covering node type, literal content, info string, source position, no-flag behaviour, no closing delimiter, and multi-feed correctness
- `changelog.txt`
Compatibility
- `CMARK_OPT_FRONT_MATTER` uses bit 11 (`1 << 11`). Note that the cmark-gfm fork uses this bit for `CMARK_OPT_GITHUB_PRE_LANG`; these are separate codebases with independent option namespaces.
- `CMARK_NODE_LAST_BLOCK` is updated from `CMARK_NODE_THEMATIC_BREAK` to `CMARK_NODE_FRONT_MATTER`. Code that iterates over block node types using this sentinel will automatically include the new type.
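As a rough illustration of both compatibility points, the sketch below uses hypothetical stand-ins (`OPT_FRONT_MATTER`, a shortened `node_type` enum) rather than the real definitions in `cmark.h`. It shows an option test against bit 11 and a range check that picks up the new type once the last-block sentinel is moved.

```c
#include <assert.h>

/* Hypothetical stand-in for the option flag; bit 11 as stated in the PR. */
#define OPT_FRONT_MATTER (1 << 11)

/* Hypothetical, shortened node type enum with first/last block sentinels. */
typedef enum {
  NODE_DOCUMENT = 1,
  NODE_PARAGRAPH,
  NODE_THEMATIC_BREAK,
  NODE_FRONT_MATTER,
  NODE_TEXT, /* first inline type */
  NODE_FIRST_BLOCK = NODE_DOCUMENT,
  NODE_LAST_BLOCK = NODE_FRONT_MATTER /* previously NODE_THEMATIC_BREAK */
} node_type;

/* Nonzero if the front matter bit is set in an options bitmask. */
static int front_matter_enabled(int options) {
  return (options & OPT_FRONT_MATTER) != 0;
}

/* Range check based on the sentinels; needs no change when a block
   type is appended and the sentinel is updated. */
static int is_block(node_type t) {
  assert((int)t >= 0);
  return t >= NODE_FIRST_BLOCK && t <= NODE_LAST_BLOCK;
}
```

Code that compares against the sentinels, as `is_block()` does here, includes the new type after a rebuild; code that hard-codes the old last block type does not.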
Testing