Add encoding option to marcxml.record_to_xml #105
Description
Unless I am missing something about the marcxml functionality, the record_to_xml function returns text encoded as us-ascii, which causes problems when systems expect utf-8. I traced this to xml.etree.ElementTree.tostring, which takes an optional encoding parameter that defaults to us-ascii. I propose adding an optional encoding parameter to marcxml.record_to_xml that is passed through to its ET.tostring call.
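To make the default behaviour concrete, here is a minimal sketch using only the standard library. The element below is a hypothetical stand-in for a MARCXML subfield, not something produced by pymarc, and the sketch assumes Python 3, where ET.tostring returns bytes.

```python
import xml.etree.ElementTree as ET

# Hypothetical element standing in for a MARCXML subfield (not built by pymarc).
node = ET.Element("subfield", {"code": "a"})
node.text = "Nouvelles-Hébrides"

# Default encoding is "us-ascii": non-ASCII characters become
# numeric character references such as &#233;.
ascii_bytes = ET.tostring(node)

# With an explicit encoding, the characters are emitted as UTF-8 bytes.
utf8_bytes = ET.tostring(node, encoding="utf-8")

print(ascii_bytes)
print(utf8_bytes)
```

Both results are well-formed XML; the difference is only whether non-ASCII characters are escaped or written directly.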
In my local fork, I have made the following change:
def record_to_xml(record, quiet=False, namespace=False, encoding='us-ascii'):
    node = record_to_xml_node(record, quiet, namespace)
    return ET.tostring(node, encoding=encoding)
Without the change, my output for record_to_xml on UTF-8 strings that contain diacritics looks like this:
<record>
<leader> 22 4500</leader>
<datafield ind1=" " ind2=" " tag="246">
<subfield code="a">Nouvelles-H&#233;brides, communiqu&#233;s par le gouvernement de la France et par le gouvernement du Royaume-Uni de Grande-Bretagne et d'Irlande du Nord :</subfield>
<subfield code="b">Lois et r&#232;glements promulgu&#233;s pour donner effet aux dispositions de la Convention du 13 juillet 1931 pour limiter la fabrication et r&#233;glementer la distribution des stup&#233;fiants, amend&#233;e par le Protocole du 11 d&#233;cembre 1946</subfield>
</datafield>
</record>
The resulting file ends up with a us-ascii encoding, which causes the record import to fail on the MARC-based system we are using.
With the change, I get output that looks like this when I pass the optional encoding:
<record>
<leader> 22 4500</leader>
<datafield ind1=" " ind2=" " tag="246">
<subfield code="a">Nouvelles-Hébrides, communiqués par le gouvernement de la France et par le gouvernement du Royaume-Uni de Grande-Bretagne et d'Irlande du Nord :</subfield>
<subfield code="b">Lois et règlements promulgués pour donner effet aux dispositions de la Convention du 13 juillet 1931 pour limiter la fabrication et réglementer la distribution des stupéfiants, amendée par le Protocole du 11 décembre 1946</subfield>
</datafield>
</record>
I invoke as follows:
out_file.write(marcxml.record_to_xml(record, encoding='utf-8'))
And the resulting file ends up with a utf-8 encoding.
Note that I tried forcing encoding to utf-8 at each successive level beginning with the open() function and working backward to the record itself. The only thing I found that actually works is to pass an encoding parameter in this particular function. If I am missing something (obvious or not), I'd be interested in correcting my oversight.
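One plausible reason that forcing utf-8 at the file level has no effect: by the time the output reaches open(), the non-ASCII characters have already been replaced with ASCII-only character references, so there is nothing left for the file encoding to change. A minimal sketch, assuming Python 3 and using a hypothetical stand-in element rather than a real pymarc record:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a record node (not built by pymarc).
node = ET.Element("subfield", {"code": "a"})
node.text = "règlements"

# Default us-ascii serialization: "è" is already escaped as &#232;.
ascii_xml = ET.tostring(node)

# Writing through a text file opened as UTF-8 cannot undo the escaping.
path = os.path.join(tempfile.mkdtemp(), "record.xml")
with open(path, "w", encoding="utf-8") as out_file:
    out_file.write(ascii_xml.decode("ascii"))

# Read the raw bytes back to inspect what actually landed on disk.
with open(path, "rb") as in_file:
    raw = in_file.read()

print(raw)
```

The file on disk contains only ASCII bytes, which is why tools that sniff the encoding report us-ascii regardless of how the file was opened.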
The change looks trivial to me and preserves the default functionality, but I don't know if there are tests that depend on it.
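As a rough check that the default is preserved, something like the following could be adapted into a test. The to_xml function is a hypothetical stand-in for the proposed record_to_xml that takes an Element directly, so the sketch runs without pymarc; it is not the project's actual test code.

```python
import xml.etree.ElementTree as ET


def to_xml(node, encoding="us-ascii"):
    """Hypothetical stand-in for the proposed record_to_xml signature."""
    return ET.tostring(node, encoding=encoding)


# Hypothetical element with a non-ASCII character.
node = ET.Element("subfield", {"code": "a"})
node.text = "stupéfiants"

# Omitting the new parameter should match the current behaviour exactly.
default_output = to_xml(node)
assert default_output == ET.tostring(node)

# Passing utf-8 should match calling ET.tostring with that encoding.
utf8_output = to_xml(node, encoding="utf-8")
assert utf8_output == ET.tostring(node, encoding="utf-8")
```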