[BUG] extxyz / quip/gap/xyz reader silently ignores stress= headers #973

@SchrodingersCattt

Summary

dpdata's native extxyz reader (dpdata/formats/xyz/quip_gap_xyz.py, used by quip/gap/xyz, extxyz, mace/xyz, nequip/xyz, gpumd/xyz) only parses a virial="..." header field. Files that follow the very common ASE convention of writing a stress="..." field (eV/Å³) instead are silently parsed without any virials in the resulting LabeledSystem. The user is given no warning and downstream training/eval pipelines therefore lose label information.

The ASE-based plugin (dpdata/plugins/ase.py::from_labeled_system) already does the right thing: read virial first, otherwise read stress and convert to virial = -V * stress. The native xyz reader should match that behaviour.

Versions checked

Minimal reproducer

from pathlib import Path
import dpdata
from ase.io import read

p = Path("/tmp/stress_only.xyz")
p.write_text(
    '3\n'
    'Lattice="3.0 0 0 0 3.0 0 0 0 3.0" energy=-1.5 '
    'stress="0.01 0 0 0 0.02 0 0 0 0.03" '
    'Properties=species:S:1:pos:R:3:force:R:3\n'
    'H 0.0 0.0 0.0 0.1 0.0 0.0\n'
    'H 1.0 0.0 0.0 -0.1 0.0 0.0\n'
    'O 2.0 0.0 0.0 0.0 0.0 0.0\n'
)

ms = dpdata.MultiSystems()
ms.from_quip_gap_xyz(str(p))
for k, s in ms.systems.items():
    print(k, "keys:", sorted(s.data.keys()))
    print(k, "has 'virials':", 'virials' in s.data)

atoms = read(str(p), format="extxyz")
print("ASE stress:\n", atoms.get_stress(voigt=False))
print("ASE volume:", atoms.get_volume())
print("expected virial = -V * stress:\n",
      -atoms.get_volume() * atoms.get_stress(voigt=False))

Output:

H2O1 keys: ['atom_names', 'atom_numbs', 'atom_types', 'cells', 'coords', 'energies', 'forces', 'orig']
H2O1 has 'virials': False
ASE stress:
 [[0.01 0.   0.  ]
 [0.   0.02 0.  ]
 [0.   0.   0.03]]
ASE volume: 27.0
expected virial = -V * stress:
 [[-0.27 -0.   -0.  ]
 [-0.   -0.54 -0.  ]
 [-0.   -0.   -0.81]]

Note that LabeledSystem("file.xyz", fmt="quip/gap/xyz") additionally hits a separate bug: it raises TypeError: 'str' object is not a mapping in system.py:1230, because QuipGapXYZFormat.from_labeled_system is a no-op that returns the filename. That is why the reproducer uses MultiSystems.from_quip_gap_xyz; it is incidental to the stress= problem reported here.

Real-world impact

This routinely affects datasets exported from DFT pipelines that use ASE's default extxyz writer (which emits stress=, never virial=). Concrete examples I just hit:

  • CALYPSO structure-search snapshots
  • polymer DFT trajectories with VASP→ASE export
  • mixed bulk-crystal mpmd.xyz data

In all three cases the upstream .xyz files carry only stress=, and dpdata 1.0.1 produces a LabeledSystem with no virials, silently dropping the label. Downstream dpdata.MultiSystems().to_deepmd_npy_mixed(...) then writes no virial.npy, and any training/eval that depends on the virial loss silently degrades. The workaround today is a custom extxyz parser.

Proposed fix

In dpdata/formats/xyz/quip_gap_xyz.py::handle_single_xyz_frame, after the existing if field_dict.get("virial", None): block, add a stress→virial fallback that mirrors the ASE plugin's logic:

if field_dict.get("virial", None):
    virials = np.array(
        [
            np.array(
                list(filter(bool, field_dict["virial"].split(" ")))
            ).reshape(3, 3)
        ]
    ).astype(np.float64)
elif field_dict.get("stress", None):
    # ASE-style extxyz: stress in eV/A^3; DeePMD virial = -V * stress (eV).
    stress = np.array(
        list(filter(bool, field_dict["stress"].split(" ")))
    ).reshape(3, 3).astype(np.float64)
    cell = info_dict_cell  # the 3x3 Lattice already parsed above
    volume = float(abs(np.linalg.det(cell)))
    virials = np.array([-volume * stress])
else:
    virials = None

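(In the current code the Lattice is only parsed a few lines below this block, so info_dict_cell above is a placeholder. Either a small refactor that parses the cell earlier is needed, or the stress→virial conversion can be done as a post-processing step once info_dict["cells"] is built.)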
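The post-processing variant could look roughly like the following. This is a sketch under assumptions, not the actual dpdata code: virials_from_stress is a hypothetical helper name, and cells is assumed to have shape (1, 3, 3), matching how info_dict["cells"] is described above.

```python
import numpy as np


def virials_from_stress(stress_field: str, cells: np.ndarray) -> np.ndarray:
    """Convert an extxyz stress="..." value (eV/A^3) into virials (eV).

    stress_field: the raw nine-number string from the header.
    cells: array of shape (1, 3, 3) holding the frame's lattice vectors
           (assumed layout of info_dict["cells"]).
    Returns an array of shape (1, 3, 3).
    """
    stress = np.array(
        [float(x) for x in stress_field.split()], dtype=np.float64
    ).reshape(3, 3)
    volume = abs(np.linalg.det(cells[0]))
    # DeePMD convention used in this issue: virial = -V * stress.
    return (-volume * stress)[np.newaxis, :, :]
```

With this shape, the frame handler would only need to call the helper after the cell is available, and only when the header has stress= but no virial=, so the existing virial= path stays untouched.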

Two further nice-to-haves while in the area:

  1. Loosen the strict Properties loop (currently raise RuntimeError("unknown field ...")) to a warning + skip, so headers with extra per-atom fields (e.g. magmom, charges) don't crash; ASE handles them gracefully.
  2. Either fix or document from_labeled_system for these formats — currently dpdata.LabeledSystem("file.xyz", fmt="quip/gap/xyz") raises a confusing TypeError because the plugin's from_labeled_system returns the filename string instead of a system dict.
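For the first nice-to-have, a warning-and-skip version of the Properties loop might be sketched as follows. The function name column_layout and the KNOWN_FIELDS set are assumptions made for illustration and do not reflect the names or the exact field list in dpdata. The point it shows is that a skipped field must still advance the column offset, otherwise every later field is read from the wrong columns.

```python
import warnings

# Hypothetical set of per-atom fields the reader knows how to handle.
KNOWN_FIELDS = frozenset({"species", "pos", "Z", "force", "forces"})


def column_layout(properties: str, known=KNOWN_FIELDS) -> dict:
    """Map each known per-atom field to its (start, stop) column range.

    properties is the value of the extxyz Properties= key, a sequence of
    name:type:ncols triples such as "species:S:1:pos:R:3:force:R:3".
    Unknown fields trigger a warning and are skipped instead of raising.
    """
    tokens = properties.split(":")
    if len(tokens) % 3 != 0:
        raise ValueError(f"malformed Properties string: {properties!r}")
    layout = {}
    col = 0
    for name, _kind, ncols in zip(tokens[0::3], tokens[1::3], tokens[2::3]):
        width = int(ncols)
        if name in known:
            layout[name] = (col, col + width)
        else:
            warnings.warn(
                f"skipping unknown per-atom field {name!r}", stacklevel=2
            )
        col += width  # advance even for skipped fields
    return layout
```

A header with an extra magmom column would then yield ranges for species, pos and force only, plus one warning, rather than a RuntimeError.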

——Co-authored with cursor
