Summary
dpdata's native extxyz reader (dpdata/formats/xyz/quip_gap_xyz.py, used by quip/gap/xyz, extxyz, mace/xyz, nequip/xyz, gpumd/xyz) only parses a virial="..." header field. Files that follow the very common ASE convention of writing a stress="..." field (eV/ų) instead are silently parsed without any virials in the resulting LabeledSystem. The user is given no warning and downstream training/eval pipelines therefore lose label information.
The ASE-based plugin (dpdata/plugins/ase.py::from_labeled_system) already does the right thing: read virial first, otherwise read stress and convert to virial = -V * stress. The native xyz reader should match that behaviour.
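Versions checked
- dpdata == 1.0.1 (PyPI latest)
- master @ 2026-05-18: confirmed identical behaviour in dpdata/formats/xyz/quip_gap_xyz.py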
Minimal reproducer
from pathlib import Path

import dpdata
from ase.io import read

# Write a one-frame extxyz file whose header carries stress= but no virial=.
p = Path("/tmp/stress_only.xyz")
p.write_text(
    '3\n'
    'Lattice="3.0 0 0 0 3.0 0 0 0 3.0" energy=-1.5 '
    'stress="0.01 0 0 0 0.02 0 0 0 0.03" '
    'Properties=species:S:1:pos:R:3:force:R:3\n'
    'H 0.0 0.0 0.0 0.1 0.0 0.0\n'
    'H 1.0 0.0 0.0 -0.1 0.0 0.0\n'
    'O 2.0 0.0 0.0 0.0 0.0 0.0\n'
)

# Read it with dpdata's native reader and list the parsed keys.
ms = dpdata.MultiSystems()
ms.from_quip_gap_xyz(str(p))
for k, s in ms.systems.items():
    print(k, "keys:", sorted(s.data.keys()))
    print(k, "has 'virials':", 'virials' in s.data)

# Read the same file with ASE to show the stress and the expected virial.
atoms = read(str(p), format="extxyz")
print("ASE stress:\n", atoms.get_stress(voigt=False))
print("ASE volume:", atoms.get_volume())
print("expected virial = -V * stress:\n",
      -atoms.get_volume() * atoms.get_stress(voigt=False))
Output:
H2O1 keys: ['atom_names', 'atom_numbs', 'atom_types', 'cells', 'coords', 'energies', 'forces', 'orig']
H2O1 has 'virials': False
ASE stress:
 [[0.01 0.   0.  ]
 [0.   0.02 0.  ]
 [0.   0.   0.03]]
ASE volume: 27.0
expected virial = -V * stress:
 [[-0.27 -0.   -0.  ]
 [-0.   -0.54 -0.  ]
 [-0.   -0.   -0.81]]
Note that LabeledSystem("file.xyz", fmt="quip/gap/xyz") hits a separate bug (TypeError: 'str' object is not a mapping in system.py:1230, because QuipGapXYZFormat.from_labeled_system is a no-op that returns the filename). That is why the reproducer uses MultiSystems.from_quip_gap_xyz; it is incidental to the stress issue.
Real-world impact
This routinely affects datasets exported from DFT pipelines that use ASE's default extxyz writer (which emits stress=, never virial=). Concrete examples I just hit:
- CALYPSO structure-search snapshots
- polymer DFT trajectories with VASP→ASE export
- mixed bulk-crystal mpmd.xyz data
In all three cases the upstream .xyz files carry only stress=, and dpdata 1.0.1 produces a LabeledSystem with no virials, silently dropping the label. Downstream dpdata.MultiSystems().to_deepmd_npy_mixed(...) then writes no virial.npy, and training/eval that depends on virial loss silently degrades. Workaround today is a custom extxyz parser.
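As a rough illustration of the first step of such a workaround, the sketch below pulls the key=value fields out of an extxyz comment line with a regular expression, so that a stress= entry can be recovered by hand. The name parse_extxyz_header is hypothetical and not part of dpdata, and the pattern only handles bare and double-quoted values, which is an assumption that covers the reproducer header but not necessarily every extxyz file.

```python
import re

# Matches key="quoted value" or key=bare_value.
# Hypothetical helper for illustration; not dpdata or ASE API.
_FIELD_RE = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')


def parse_extxyz_header(line):
    """Return the header fields of one extxyz comment line as a dict of strings."""
    fields = {}
    for key, quoted, bare in _FIELD_RE.findall(line):
        # findall yields an empty string for the alternative that did not
        # match, so the bare value is used only when nothing was quoted.
        fields[key] = quoted if quoted else bare
    return fields


header = (
    'Lattice="3.0 0 0 0 3.0 0 0 0 3.0" energy=-1.5 '
    'stress="0.01 0 0 0 0.02 0 0 0 0.03" '
    'Properties=species:S:1:pos:R:3:force:R:3'
)
fields = parse_extxyz_header(header)
print(sorted(fields))
print("has stress:", "stress" in fields, "| has virial:", "virial" in fields)
```

The values stay as strings here on purpose; converting them to numbers is left to whatever consumes the dict.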
Proposed fix
In dpdata/formats/xyz/quip_gap_xyz.py::handle_single_xyz_frame, after the existing if field_dict.get("virial", None): block, add a stress→virial fallback that mirrors the ASE plugin's logic:
if field_dict.get("virial", None):
    virials = np.array(
        [
            np.array(
                list(filter(bool, field_dict["virial"].split(" ")))
            ).reshape(3, 3)
        ]
    ).astype(np.float64)
elif field_dict.get("stress", None):
    # ASE-style extxyz: stress in eV/A^3; DeePMD virial = -V * stress (eV).
    stress = np.array(
        list(filter(bool, field_dict["stress"].split(" ")))
    ).reshape(3, 3).astype(np.float64)
    # Placeholder name: the 3x3 Lattice, which must be parsed before this point.
    cell = info_dict_cell
    volume = float(abs(np.linalg.det(cell)))
    virials = np.array([-volume * stress])
else:
    virials = None
(The Lattice is currently parsed a few lines below this point, so a small refactor that parses it earlier is needed; alternatively, compute the virial in a post-processing step after info_dict["cells"] is built.)
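To try the fallback order outside dpdata, a standalone sketch along these lines may help. It takes an already-parsed header dict, prefers virial, and otherwise converts stress using the cell volume from Lattice. The names virial_from_fields, _parse_3x3 and _volume are hypothetical. The code avoids NumPy only so that it stands alone, and it assumes stress is given as nine components in eV/Å³, as in the reproducer.

```python
def _parse_3x3(text):
    """Parse nine whitespace-separated numbers into a 3x3 nested list."""
    values = [float(x) for x in text.split()]
    if len(values) != 9:
        raise ValueError(f"expected 9 components, got {len(values)}")
    return [values[0:3], values[3:6], values[6:9]]


def _volume(cell):
    """Cell volume as |a . (b x c)| for row vectors a, b, c."""
    a, b, c = cell
    cross = (
        b[1] * c[2] - b[2] * c[1],
        b[2] * c[0] - b[0] * c[2],
        b[0] * c[1] - b[1] * c[0],
    )
    return abs(sum(ai * xi for ai, xi in zip(a, cross)))


def virial_from_fields(fields):
    """Return a 3x3 virial (eV), or None if neither virial nor stress is present."""
    if fields.get("virial"):
        return _parse_3x3(fields["virial"])
    if fields.get("stress"):
        # Assumption: ASE-style stress in eV/A^3, so virial = -V * stress.
        stress = _parse_3x3(fields["stress"])
        volume = _volume(_parse_3x3(fields["Lattice"]))
        return [[-volume * s for s in row] for row in stress]
    return None


fields = {
    "Lattice": "3.0 0 0 0 3.0 0 0 0 3.0",
    "stress": "0.01 0 0 0 0.02 0 0 0 0.03",
}
virial = virial_from_fields(fields)
print(virial)
```

Checking virial before stress keeps an explicit virial= authoritative when a file happens to carry both fields.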
Two further nice-to-haves while in the area:
- Loosen the strict Properties loop (currently raise RuntimeError("unknown field ...")) to a warning + skip, so headers with extra per-atom fields (e.g. magmom, charges) don't crash; ASE handles them gracefully.
- Either fix or document from_labeled_system for these formats: currently dpdata.LabeledSystem("file.xyz", fmt="quip/gap/xyz") raises a confusing TypeError because the plugin's from_labeled_system returns the filename string instead of a system dict.
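The warning + skip idea for the Properties loop could look roughly like the sketch below. It walks the name:type:ncols triples, keeps the fields it recognises, and warns about the rest while still advancing the column offset. split_properties and KNOWN_FIELDS are hypothetical names, and the set of recognised fields is an assumption taken from the reproducer header rather than from dpdata's actual list.

```python
import warnings

# Assumption: the per-atom fields the reader understands. The real list lives
# in dpdata and may differ; these three come from the reproducer header.
KNOWN_FIELDS = {"species", "pos", "force"}


def split_properties(prop_string, known=KNOWN_FIELDS):
    """Split 'name:type:ncols:...' into kept and skipped per-atom fields.

    Each returned entry is (name, type, first_column, ncols).
    """
    tokens = prop_string.split(":")
    if len(tokens) % 3 != 0:
        raise ValueError(f"malformed Properties string: {prop_string!r}")
    kept, skipped = [], []
    col = 0
    for i in range(0, len(tokens), 3):
        name, kind, ncols = tokens[i], tokens[i + 1], int(tokens[i + 2])
        entry = (name, kind, col, ncols)
        if name in known:
            kept.append(entry)
        else:
            # Warn and skip instead of raising; the column offset still
            # advances below so later fields are read from the right columns.
            warnings.warn(f"skipping unknown per-atom field {name!r}")
            skipped.append(entry)
        col += ncols
    return kept, skipped


# Capture the warning so the demo shows it was raised exactly once.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kept, skipped = split_properties("species:S:1:pos:R:3:magmom:R:1:force:R:3")

print("kept:", kept)
print("skipped:", skipped)
print("warnings:", [str(w.message) for w in caught])
```

Tracking the first column of each field is what makes skipping safe: an unknown field in the middle of the line must not shift the columns read for the fields after it.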
——Co-authored with cursor