Describe the bug
When a subdocument generated by pandoc is used in a DocxTemplate, the final document becomes invalid because elements such as <a:graphic> use namespace prefixes that are never declared. The resulting .docx file fails to open in Word and fails to load with python-docx. The docxtpl rendering process does not appear to preserve or propagate the XML namespace declarations from the subdocument into the main document.
To Reproduce
Here is a minimal standalone example to reproduce the issue:
External library required: pip install pypandoc_binary
from docx import Document
from docxtpl import DocxTemplate
import io
import pypandoc
# Create a main template with a placeholder
main_template = Document()
main_template.add_paragraph("{{p sd }}")
main_template_stream = io.BytesIO()
main_template.save(main_template_stream)
# Convert HTML to DOCX using pandoc
html = '<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAACAAAAAgCAYAAABzenr0AAAALklEQVR42u3OMQEAAAgDoJnP/nk0xh5IwOzlUjQCAgICAgICAgICAgICAgLtwANzwElBO5ixlAAAAABJRU5ErkJggg==">'
pypandoc.convert_text(html, 'docx', format='html', outputfile='1.docx')
Document('1.docx') # No error
# Load and use the subdoc in the template
template = DocxTemplate(main_template_stream)
template.render(context={"sd": template.new_subdoc('1.docx')})
final_stream = io.BytesIO()
template.save(final_stream)
# The final document will fail to load due to XML namespace errors
Document(final_stream) # Raises lxml.etree.XMLSyntaxError
Error Message:
lxml.etree.XMLSyntaxError: Namespace prefix a on graphic is not defined, line 2, column 1414
Expected behavior
When rendering a subdocument created by pandoc, the final .docx file should be valid and load without XML namespace errors. The XML elements in the merged document should retain the necessary xmlns declarations, either from the root or explicitly attached to the elements, as expected by Word and the python-docx library.
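To illustrate the two acceptable outcomes described above, here is a minimal sketch that is independent of docxtpl and pandoc. It uses the standard-library parser rather than lxml, so the exception type and message differ from the traceback in this report; the helper name `parses` and the `<root>` wrapper element are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# DrawingML namespace URI, as used in the workaround in this report.
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"


def parses(xml: str) -> bool:
    """Return True if the string is namespace-well-formed XML."""
    try:
        ET.fromstring(xml)
        return True
    except ET.ParseError:
        return False


# Prefix "a" is used but never declared: this is the failing case.
undeclared = "<root><a:graphic/></root>"
# Declaration kept on the root element.
on_root = f'<root xmlns:a="{A_NS}"><a:graphic/></root>'
# Declaration attached directly to the element that uses the prefix.
on_element = f'<root><a:graphic xmlns:a="{A_NS}"/></root>'

print(parses(undeclared), parses(on_root), parses(on_element))
# prints: False True True
```

Either placement of the declaration is enough for a parser; only the undeclared form is rejected.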
Screenshots
N/A
Additional context
- Pandoc behavior: The .docx file generated by pandoc declares its XML namespaces on the root element (e.g., xmlns:a="http://..."). However, when this file is used as a subdocument in DocxTemplate, those declarations are not preserved in the final document.
- Workaround: Manually adding xmlns attributes to the affected elements in word/document.xml (e.g., <a:graphic xmlns:a="http://...">) resolves the error, but this is a fragile "monkey patch" solution.
# unpack docx
docx_unpacked["word/document.xml"] = docx_unpacked["word/document.xml"].replace(b"<a:graphic>", b'<a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">')
docx_unpacked["word/document.xml"] = docx_unpacked["word/document.xml"].replace(b"<pic:pic>", b'<pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">')
# repack docx
# the document is now valid
- Root cause: The DocxTemplate rendering process likely strips or fails to propagate XML namespace declarations from the subdocument during the merge, leading to invalid XML in the final document.
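The workaround above leaves the unpack and repack steps as comments. A self-contained sketch of the whole round trip could look like the following, using only the standard library. The function name `patch_docx_namespaces` and the `PATCHES` table are hypothetical names chosen for this example; the two replacements are the ones from the workaround, and the same fragility applies (any start tag that carries attributes or a different prefix will not match).

```python
import io
import zipfile

# Start tags emitted without a namespace declaration, mapped to patched tags.
# Both URIs are taken from the workaround in this report.
PATCHES = {
    b"<a:graphic>": (
        b'<a:graphic xmlns:a="http://schemas.openxmlformats.org/'
        b'drawingml/2006/main">'
    ),
    b"<pic:pic>": (
        b'<pic:pic xmlns:pic="http://schemas.openxmlformats.org/'
        b'drawingml/2006/picture">'
    ),
}


def patch_docx_namespaces(docx_bytes: bytes) -> bytes:
    """Return a copy of the .docx archive with word/document.xml patched."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            # Only the main document part is rewritten; every other
            # archive member is copied through unchanged.
            if item.filename == "word/document.xml":
                for old, new in PATCHES.items():
                    data = data.replace(old, new)
            dst.writestr(item, data)
    return out.getvalue()
```

With the reproduction script above, this would be called as `patch_docx_namespaces(final_stream.getvalue())`, and the returned bytes wrapped in `io.BytesIO` before being passed to `Document`.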
Suggested Fix
Ensure that the DocxTemplate renderer properly handles XML namespaces when merging subdocuments. Specifically, when inserting content from pandoc-generated .docx files, the necessary xmlns declarations should either:
- Be preserved in the root element of the final document, or
- Be explicitly attached to the relevant XML elements (e.g., <a:graphic>).
This will align the output with standards expected by Word and python-docx, avoiding syntax errors.
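As a rough illustration of the first option (declarations preserved on the root element), the sketch below copies any xmlns:prefix declaration found on the subdocument's root start tag onto the main document's root start tag when the main document lacks that prefix. This is not how docxtpl works internally, and the names `root_namespaces` and `propagate_namespaces` are hypothetical. It operates on the XML text of word/document.xml with regular expressions, so it assumes no attribute value contains a ">" character and does not handle a prefix bound to different URIs in the two documents.

```python
import re

# Matches declarations such as xmlns:a="..." inside a start tag.
XMLNS_RE = re.compile(r'xmlns:([A-Za-z_][\w.\-]*)="([^"]*)"')
# Matches the first element start tag, skipping <?xml ...?> and comments.
ROOT_TAG_RE = re.compile(r"<[A-Za-z_][^>]*>")


def root_namespaces(xml: str) -> dict:
    """Return {prefix: uri} for declarations on the root start tag."""
    match = ROOT_TAG_RE.search(xml)
    if match is None:
        raise ValueError("no root element found")
    return dict(XMLNS_RE.findall(match.group(0)))


def propagate_namespaces(main_xml: str, sub_xml: str) -> str:
    """Add to main_xml's root tag the prefixes declared only in sub_xml."""
    have = root_namespaces(main_xml)
    missing = {
        prefix: uri
        for prefix, uri in root_namespaces(sub_xml).items()
        if prefix not in have
    }
    if not missing:
        return main_xml
    match = ROOT_TAG_RE.search(main_xml)
    tag = match.group(0)
    extra = "".join(f' xmlns:{p}="{u}"' for p, u in missing.items())
    # Insert before the closing ">" (or "/>" for an empty root element).
    end = -2 if tag.endswith("/>") else -1
    new_tag = tag[:end] + extra + tag[end:]
    return main_xml[:match.start()] + new_tag + main_xml[match.end():]
```

A real fix inside the renderer would more likely work on the parsed tree than on text, but the idea is the same: collect the declarations before the subdocument's body is detached from its root, and reattach them where the merged content can see them.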