Description
The following code:
<?php
$svg = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg xmlns="http://www.w3.org/2000/svg" width="400" height="300">
<rect width="400" height="300" fill="white"/>
</svg>
XML;
$reader = new XMLReader();
$reader->XML($svg, null, LIBXML_NOERROR | LIBXML_NOWARNING | LIBXML_NONET);
$reader->setParserProperty(XMLReader::SUBST_ENTITIES, true);
while (@$reader->read()) {
    if ($reader->nodeType == XMLReader::DOC_TYPE) {
        $outer = $reader->readOuterXml();
        $inner = $reader->readInnerXml();
        echo " readOuterXml(): " . ($outer !== '' ? '"' . substr($outer, 0, 120) . '"' : '[EMPTY STRING]') . "\n";
        echo " readOuterXml() length: " . strlen($outer) . "\n";
        echo " readInnerXml(): " . ($inner !== '' ? '"' . substr($inner, 0, 120) . '"' : '[EMPTY STRING]') . "\n";
        echo " readInnerXml() length: " . strlen($inner) . "\n";
    }
}
$reader->close();
Results in this output with old libxml2 (<= 2.12.9):
readOuterXml(): "<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">"
readOuterXml() length: 98
readInnerXml(): [EMPTY STRING]
readInnerXml() length: 0
But the behavior is different with libxml2 >= 2.13.0:
readOuterXml(): [EMPTY STRING]
readOuterXml() length: 0
readInnerXml(): [EMPTY STRING]
readInnerXml() length: 0
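To tell which of the two outputs above to expect on a given installation, it can help to print the libxml2 version PHP reports. The following is a minimal sketch; `isAffectedLibxmlVersion()` is a hypothetical helper name, and the 2.13.0 threshold is taken from this report. Note that `LIBXML_DOTTED_VERSION` reflects the libxml2 headers PHP was compiled against, which may differ from the shared library actually loaded at runtime.

```php
<?php
// Hypothetical helper (not part of PHP): returns true when the given
// libxml2 version is in the range where this report observes an empty
// readOuterXml() result on DOC_TYPE nodes (>= 2.13.0).
function isAffectedLibxmlVersion(string $version): bool
{
    return version_compare($version, '2.13.0', '>=');
}

// LIBXML_DOTTED_VERSION is the compile-time libxml2 version string.
echo 'libxml2 (compile-time): ', LIBXML_DOTTED_VERSION, "\n";
echo 'Expected to be affected: ',
    isAffectedLibxmlVersion(LIBXML_DOTTED_VERSION) ? 'yes' : 'no', "\n";
```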
Root cause
The libxml2 commit "reader: Rework xmlTextReaderRead{Inner,Outer}Xml" (2024-04-22, released in v2.13.0) introduced xmlTextReaderDumpCopy(), which serializes reader nodes into a buffer. This function skips XML_DTD_NODE entries, whereas the old code handled DTD nodes explicitly via xmlCopyDtd() and serialized them correctly.
Is this an upstream libxml2 regression or intended behavior? Does PHP need a workaround for XMLReader?
This issue affects at least MediaWiki (Uploading SVG file generated by matplotlib fails with libxml2 >= 2.13.0): https://phabricator.wikimedia.org/T399990
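Until that question is settled, one possible userland stopgap is to rebuild the declaration from the DOM extension's `DOMDocumentType` properties instead of relying on `readOuterXml()`. This is only a sketch under stated assumptions: `doctypeDeclaration()` is a hypothetical helper, it parses the whole document with `DOMDocument` (so it gives up XMLReader's streaming behavior), it ignores any internal subset, and it assumes the identifiers contain no double quotes. It is not what MediaWiki or PHP itself does.

```php
<?php
// Hypothetical helper: returns the DOCTYPE declaration of $xml as a string,
// or null if the document cannot be parsed or has no DOCTYPE.
function doctypeDeclaration(string $xml): ?string
{
    $doc = new DOMDocument();
    // Same libxml flags as in the report; LIBXML_NONET forbids network
    // access, so the external DTD is never fetched.
    if (!$doc->loadXML($xml, LIBXML_NOERROR | LIBXML_NOWARNING | LIBXML_NONET)) {
        return null;
    }

    $doctype = $doc->doctype;
    if ($doctype === null) {
        return null;
    }

    // Reassemble the declaration from name, public ID and system ID.
    $decl = '<!DOCTYPE ' . $doctype->name;
    if ($doctype->publicId !== '') {
        $decl .= ' PUBLIC "' . $doctype->publicId . '" "' . $doctype->systemId . '"';
    } elseif ($doctype->systemId !== '') {
        $decl .= ' SYSTEM "' . $doctype->systemId . '"';
    }

    return $decl . '>';
}
```

A caller could invoke `doctypeDeclaration($svg)` as a fallback whenever `readOuterXml()` returns an empty string on a `XMLReader::DOC_TYPE` node.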
Reproducer
Two self-contained scripts that build libxml2 from source at specific versions inside a Fedora 41 Docker container:
host_php.sh — host-side driver (usage: ./host_php.sh [good|bad|both])
container_php.sh — runs inside the container, builds libxml2, runs the PHP test
Usage:
container_php.sh
host_php.sh
PHP Version
PHP 8.3.x / 8.4.x (any version linked against libxml2 >= 2.13.0)
Operating System
Arch Linux, also confirmed on Alpine edge, Debian sid, and inside Fedora 41 containers with custom-built libxml2.