This repository was archived by the owner on Feb 4, 2020. It is now read-only.

Add encoding option to marcxml.record_to_xml #105

@aaronhelton

Unless I am missing something in the marcxml functionality, record_to_xml returns text encoded as us-ascii, which causes problems when downstream systems expect utf-8. I traced this to xml.etree.ElementTree.tostring, which takes an optional encoding parameter that defaults to us-ascii. I propose adding an optional encoding parameter to marcxml.record_to_xml that is passed through to its ET.tostring call.
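A minimal sketch of the ET.tostring behavior described above, using only the standard library on Python 3. The single subfield element is a hypothetical stand-in for a full MARCXML record, not pymarc output:

```python
import xml.etree.ElementTree as ET

# Hypothetical one-subfield element standing in for a full MARCXML record.
subfield = ET.Element("subfield", {"code": "a"})
subfield.text = "Nouvelles-H\u00e9brides"  # \u00e9 is "é"

# With no encoding argument, tostring serializes as us-ascii and writes
# non-ASCII characters as numeric character references such as &#233;.
default_bytes = ET.tostring(subfield)

# With encoding="utf-8", the same character is written as raw UTF-8 bytes.
utf8_bytes = ET.tostring(subfield, encoding="utf-8")

print(default_bytes)
print(utf8_bytes)
```

Both calls return bytes on Python 3, so a file receiving either result would need to be opened in binary mode.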

In my local fork, I have made the following change:

def record_to_xml(record, quiet=False, namespace=False, encoding='us-ascii'):
    # The default matches ET.tostring's own default, so existing callers see no change.
    node = record_to_xml_node(record, quiet, namespace)
    return ET.tostring(node, encoding=encoding)

Without the change, my output for record_to_xml on UTF-8 strings that contain diacritics looks like this:

<record>
    <leader>          22        4500</leader>
    <datafield ind1=" " ind2=" " tag="246">
      <subfield code="a">Nouvelles-H&#233;brides, communiqu&#233;s par le gouvernement de la France et par le gouvernement du Royaume-Uni de Grande-Bretagne et d'Irlande du Nord :</subfield>
      <subfield code="b">Lois et r&#232;glements promulgu&#233;s pour donner effet aux dispositions de la Convention du 13 juillet 1931 pour limiter la fabrication et r&#233;glementer la distribution des stup&#233;fiants, amend&#233;e par le Protocole du 11 d&#233;cembre 1946</subfield>
    </datafield>
</record>

The resulting file ends up with a us-ascii encoding, which causes import of the record to fail on the MARC-based system we are using.

With the change, I get output that looks like this when I pass the optional encoding:

<record>
	<leader>          22        4500</leader>
	<datafield ind1=" " ind2=" " tag="246">
		<subfield code="a">Nouvelles-Hébrides, communiqués par le gouvernement de la France et par le gouvernement du Royaume-Uni de Grande-Bretagne et d'Irlande du Nord :</subfield>
		<subfield code="b">Lois et règlements promulgués pour donner effet aux dispositions de la Convention du 13 juillet 1931 pour limiter la fabrication et réglementer la distribution des stupéfiants, amendée par le Protocole du 11 décembre 1946</subfield>
	</datafield>
</record>

I invoke as follows:

out_file.write(marcxml.record_to_xml(record, encoding='utf-8'))

And the resulting file ends up with a utf-8 encoding.

Note that I tried forcing the encoding to utf-8 at each successive level, beginning with the open() function and working backward to the record itself. The only approach that actually worked was passing an encoding parameter in this particular function. If I am missing something (obvious or not), I'd be interested in correcting my oversight.
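One plausible reason that setting the encoding at the file level has no effect: once the record is serialized as us-ascii, the diacritics are already spelled out as ASCII character references, so re-encoding that output as utf-8 yields the same bytes. A small sketch of this, assuming Python 3 and a hypothetical stand-in element rather than a real record:

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for one subfield of a record.
node = ET.Element("subfield", {"code": "b"})
node.text = "r\u00e8glements promulgu\u00e9s"

# Serialized with the default encoding: every byte is plain ASCII.
ascii_xml = ET.tostring(node)

# Re-encoding the result as UTF-8 (which is what writing it through a
# UTF-8 file would do) cannot restore the original characters.
reencoded = ascii_xml.decode("us-ascii").encode("utf-8")

print(reencoded == ascii_xml)
print(reencoded)
```

This is consistent with the observation that the encoding has to be chosen at the ET.tostring call itself.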

The change looks trivial to me and preserves the default functionality, but I don't know if there are tests that depend on it.
