38 commits
c297234
test
yiyunw3 Apr 7, 2026
4f99e87
add: ecg-qa dataset
jovianw Apr 7, 2026
743265d
Merge pull request #1 from jovianw/feature/jovian/add-qa-dataset
jovianw Apr 15, 2026
b65918c
feat: implement ECG-QA dataset download capability and testing
jovianw Apr 15, 2026
9aefcf4
remove: reference to unused base task
jovianw Apr 15, 2026
5c2ed4e
add new signal dataset for PTB-XL
yiyunw3 Apr 15, 2026
10af813
rm read me
yiyunw3 Apr 15, 2026
74d70c7
refactor: remove extra caching logic
jovianw Apr 15, 2026
2c1c2e9
add: ecgqa dataset docs
jovianw Apr 15, 2026
b248772
Merge pull request #3 from jovianw/feature/jovian/add-qa-dataset
jovianw Apr 15, 2026
7c85f6b
add task
matthew-pham Apr 15, 2026
a8c70c8
rename and added rst
yiyunw3 Apr 15, 2026
74b7c59
added actual test file, grammar fixes
yiyunw3 Apr 15, 2026
674f05c
add doc file and add to init, add unit test
yiyunw3 Apr 15, 2026
02bf9d9
Merge pull request #2 from jovianw/user/yiyunw3/dataset
yiyunw3 Apr 15, 2026
054dbbb
Merge pull request #4 from jovianw/matthewpham/addtask
jovianw Apr 16, 2026
b5fd754
modify task to do resampling
matthew-pham Apr 16, 2026
43fcfbd
add comment
matthew-pham Apr 16, 2026
13b5dc0
add author name, re-arranged text order
yiyunw3 Apr 16, 2026
1f48cb7
perform resampling
matthew-pham Apr 16, 2026
6f0ed8b
rename ptb-xl dataset
yiyunw3 Apr 16, 2026
19516a0
change task to downsample
matthew-pham Apr 16, 2026
eb1e650
rebase dataset to BaseDataset
yiyunw3 Apr 16, 2026
c1995dc
fix xor
yiyunw3 Apr 16, 2026
f14bb6f
Merge pull request #6 from jovianw/user/yiyunw3/dataset-base-fix
jovianw Apr 16, 2026
ab74e43
Merge branch 'master' into matthewpham/addtask
jovianw Apr 16, 2026
8bcc1ec
Merge pull request #5 from jovianw/matthewpham/addtask
jovianw Apr 16, 2026
9d9d8fb
modify test files, meet character line length
yiyunw3 Apr 16, 2026
20297dd
uncomment ecg task
yiyunw3 Apr 16, 2026
c7fff4b
Merge remote-tracking branch 'origin/user/yiyunw3/dataset-base-fix'
yiyunw3 Apr 16, 2026
d8cfd23
rm trailing spaces
yiyunw3 Apr 16, 2026
bbebd73
minor
yiyunw3 Apr 16, 2026
704c376
add: ECGQA example and update datasets and tasks
jovianw Apr 16, 2026
562f016
improve code a little
yiyunw3 Apr 20, 2026
29e9629
Merge pull request #7 from jovianw/feature/jovian/ecgqa-task-and-example
yiyunw3 Apr 21, 2026
04bd963
Rename files and task classes for consistency
jovianw Apr 21, 2026
d19fcc0
Merge pull request #8 from jovianw/feature/jovian/ecgqa-task-and-example
jovianw Apr 21, 2026
0539b7b
Merge branch 'master' into master
yiyunw3 May 29, 2026
2 changes: 2 additions & 0 deletions docs/api/datasets.rst
@@ -238,6 +238,7 @@ Available Datasets
    datasets/pyhealth.datasets.BMDHSDataset
    datasets/pyhealth.datasets.COVID19CXRDataset
    datasets/pyhealth.datasets.ChestXray14Dataset
    datasets/pyhealth.datasets.ECGQADataset
    datasets/pyhealth.datasets.PhysioNetDeIDDataset
    datasets/pyhealth.datasets.TUABDataset
    datasets/pyhealth.datasets.TUEVDataset
@@ -246,3 +247,4 @@ Available Datasets
    datasets/pyhealth.datasets.TCGAPRADDataset
    datasets/pyhealth.datasets.splitter
    datasets/pyhealth.datasets.utils
    datasets/pyhealth.datasets.PTBXLDataset
9 changes: 9 additions & 0 deletions docs/api/datasets/pyhealth.datasets.ECGQADataset.rst
@@ -0,0 +1,9 @@
pyhealth.datasets.ECGQADataset
===================================

The ECG-QA dataset (Oh et al., 2024) provides natural-language question-answer pairs grounded in ECG recordings from PTB-XL or MIMIC-IV-ECG. This loader uses the version restructured for few-shot learning by Tang et al. (CHIL 2025). For more information, see the `FSL_ECG_QA repository <https://github.com/Tang-Jia-Lu/FSL_ECG_QA>`_.

.. autoclass:: pyhealth.datasets.ECGQADataset
    :members:
    :undoc-members:
    :show-inheritance:
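
Since the dataset is described as restructured for few-shot learning, it may help to see what drawing one N-way K-shot episode looks like. The sketch below is only an illustration under assumptions: `sample_episode` is a hypothetical helper (not part of PyHealth or FSL_ECG_QA), and grouping by the `episode_class` field is assumed from the sample keys printed by the example script in this PR.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Dict[str, str]


def sample_episode(
    samples: List[Sample],
    n_way: int,
    k_shot: int,
    n_query: int,
    rng: random.Random,
) -> Tuple[List[Sample], List[Sample]]:
    """Draw one N-way K-shot episode (hypothetical helper, not the PR's code)."""
    # Group samples by their class key (assumed field name: "episode_class").
    by_class: Dict[str, List[Sample]] = defaultdict(list)
    for sample in samples:
        by_class[sample["episode_class"]].append(sample)

    # Only classes with enough samples for support + query are eligible.
    needed = k_shot + n_query
    eligible = sorted(c for c, items in by_class.items() if len(items) >= needed)
    if len(eligible) < n_way:
        raise ValueError("not enough classes with sufficient samples")

    support: List[Sample] = []
    query: List[Sample] = []
    for cls in rng.sample(eligible, n_way):
        picked = rng.sample(by_class[cls], needed)
        support.extend(picked[:k_shot])   # first K samples form the support set
        query.extend(picked[k_shot:])     # the rest form the query set
    return support, query


# Toy data: 4 classes with 5 questions each.
toy = [
    {"episode_class": f"class_{c}", "question": f"q{c}-{i}"}
    for c in range(4)
    for i in range(5)
]
support, query = sample_episode(toy, n_way=2, k_shot=2, n_query=1, rng=random.Random(0))
print(len(support), len(query))
```

Passing an explicit `random.Random` instance keeps episode sampling reproducible without touching the global random state.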
9 changes: 9 additions & 0 deletions docs/api/datasets/pyhealth.datasets.PTBXLDataset.rst
@@ -0,0 +1,9 @@
pyhealth.datasets.PTBXLDataset
===================================

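The PTB-XL dataset (version 1.0.3) contains 12-lead clinical ECG recordings, distributed at 500 Hz (``records500/``) and 100 Hz (``records100/``). The original dataset is available from `PhysioNet <https://physionet.org/content/ptb-xl/1.0.3/>`_.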

.. autoclass:: pyhealth.datasets.PTBXLDataset
    :members:
    :undoc-members:
    :show-inheritance:
2 changes: 2 additions & 0 deletions docs/api/tasks.rst
@@ -226,6 +226,8 @@ Available Tasks
    ChestX-ray14 Binary Classification <tasks/pyhealth.tasks.ChestXray14BinaryClassification>
    De-Identification NER <tasks/pyhealth.tasks.DeIDNERTask>
    ChestX-ray14 Multilabel Classification <tasks/pyhealth.tasks.ChestXray14MultilabelClassification>
    ECG Question Answering <tasks/pyhealth.tasks.ECGQAPreprocessing>
    PTB-XL Signal Resampling <tasks/pyhealth.tasks.PTBXLResampling>
    Variant Classification (ClinVar) <tasks/pyhealth.tasks.VariantClassificationClinVar>
    Mutation Pathogenicity (COSMIC) <tasks/pyhealth.tasks.MutationPathogenicityPrediction>
    Cancer Survival Prediction (TCGA) <tasks/pyhealth.tasks.CancerSurvivalPrediction>
7 changes: 7 additions & 0 deletions docs/api/tasks/pyhealth.tasks.ECGQAPreprocessing.rst
@@ -0,0 +1,7 @@
pyhealth.tasks.ECGQAPreprocessing
=======================================

.. autoclass:: pyhealth.tasks.ECGQAPreprocessing
    :members:
    :undoc-members:
    :show-inheritance:
11 changes: 11 additions & 0 deletions docs/api/tasks/pyhealth.tasks.PTBXLResampling.rst
@@ -0,0 +1,11 @@
PTB-XL Signal Resampling
========================

.. currentmodule:: pyhealth.tasks.ptbxl_resampling

.. autoclass:: PTBXLResampling
    :members:
    :show-inheritance:
    :exclude-members: __init__

    .. automethod:: __call__
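
The example script in this PR reports resampled signals of shape (12, 2500). Assuming the input is the 500 Hz PTB-XL recording (10 seconds, so 5000 samples per lead), that shape corresponds to halving the sample count. The sketch below illustrates the idea with naive decimation, that is, keeping every second sample. `decimate` is a hypothetical helper, not the `PTBXLResampling` implementation, and a real resampler would normally apply a low-pass (anti-aliasing) filter before dropping samples.

```python
from typing import List

Signal = List[List[float]]  # shape: (leads, samples)


def decimate(signal: Signal, factor: int = 2) -> Signal:
    """Keep every `factor`-th sample of each lead (no anti-aliasing filter)."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return [lead[::factor] for lead in signal]


# Hypothetical 12-lead record: 10 s at 500 Hz gives 5000 samples per lead.
record = [[float(i) for i in range(5000)] for _ in range(12)]
resampled = decimate(record, factor=2)
print(len(resampled), len(resampled[0]))
```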
135 changes: 135 additions & 0 deletions examples/ecgqa_fsl.py
@@ -0,0 +1,135 @@
"""ECG Question Answering with Few-Shot Learning — PyHealth example.

This script demonstrates the full data pipeline for the FSL_ECG_QA
project (Tang et al., CHIL 2025) using PyHealth datasets and tasks.

Pipeline:
    1. PTBXLDataset → PTBXLResampling task → resampled ECG signals (12, 2500)
    2. ECGQADataset → ECGQAPreprocessing task (with signal_loader) → multimodal QA samples

For the full meta-learning training loop, see:
    https://github.com/Tang-Jia-Lu/FSL_ECG_QA/blob/main/train.py

Requirements:
    - PTB-XL dataset (https://physionet.org/content/ptb-xl/1.0.3/)
    - ECG-QA data (https://github.com/Tang-Jia-Lu/FSL_ECG_QA/tree/main/ecgqa)
    - pip install wfdb

Authors:
    Jovian Wang (jovianw2@illinois.edu)
    Matthew Pham (mdpham2@illinois.edu)
    Yiyun Wang (yiyunw3@illinois.edu)
"""

import json
import tempfile
from pathlib import Path

import torch
from pyhealth.datasets import PTBXLDataset, ECGQADataset
from pyhealth.tasks import PTBXLResampling, ECGQAPreprocessing

# ---------- Configuration ----------
# Update these paths to match your local setup
PTBXL_ROOT = "/path/to/ptb-xl/1.0.3/" # contains records500/, records100/
ECGQA_ROOT = "/path/to/ecgqa/ptbxl/paraphrased/" # contains train/, valid/, test/

# Set to True for a quick test run (loads a small matched subset).
# Set to False to process the full dataset.
DEV_MODE = True


def _load_dev_subset(ptbxl_root, ecgqa_root):
    """Load a small matched subset for quick testing.

    PTBXLDataset dev mode picks 5 random patients from the full 21K
    range. To guarantee every QA sample has a matching signal, this
    helper pre-filters the QA JSON files to only include records whose
    ecg_id appears in the loaded PTB-XL signals, then loads the filtered
    data through ECGQADataset.
    """
    # Load + resample PTB-XL signals (5 in dev mode)
    print(" Loading PTB-XL signals...")
    ptb = PTBXLDataset(root=ptbxl_root, downsampled=False, dev=True)
    signal_ds = ptb.set_task(PTBXLResampling(root=ptbxl_root))
    # Key the lookup by integer ecg_id so it matches the integer ids in the QA JSON
    signal_lookup = {int(s["record_id"]): s["signal"] for s in signal_ds}
    matched_ecg_ids = set(signal_lookup)
    print(f" PTB-XL: {len(signal_lookup)} signals loaded (ecg_ids: {matched_ecg_ids})")

    # Pre-filter QA JSONs to only records with matching ecg_ids.
    # Note: the temporary directory is not removed automatically.
    print(" Filtering QA data to matched ecg_ids...")
    tmp_dir = tempfile.mkdtemp()
    src = Path(ecgqa_root)
    total_kept = 0
    for split in ("train", "valid", "test"):
        dst = Path(tmp_dir) / split
        dst.mkdir()
        split_records = []
        for fpath in sorted((src / split).glob("*.json")):
            with open(fpath) as f:
                records = json.load(f)
            split_records.extend(
                r for r in records if r["ecg_id"][0] in matched_ecg_ids
            )
        if not split_records:
            # Write a dummy record so _verify_data passes; it gets filtered
            # out by prepare_metadata (question_type won't start with "single-")
            split_records = [{"ecg_id": [0], "question": "", "answer": [""],
                              "question_type": "dummy", "attribute_type": "",
                              "template_id": 0, "question_id": 0,
                              "sample_id": 0, "attribute": [""]}]
        else:
            total_kept += len(split_records)
        with open(dst / "00.json", "w") as f:
            json.dump(split_records, f)
    print(f" Kept {total_kept} QA records for {len(matched_ecg_ids)} ecg_ids")

    # Load filtered QA data with signal loader
    def signal_loader(ecg_id):
        # Normalize the id so integer and string ids resolve to the same key
        return torch.as_tensor(signal_lookup[int(ecg_id)], dtype=torch.float32)

    qa = ECGQADataset(root=tmp_dir)
    samples = qa.set_task(ECGQAPreprocessing(signal_loader=signal_loader))
    print(f" Created {len(samples)} matched QA samples")
    return samples, signal_lookup


def main():
    if DEV_MODE:
        samples, signal_lookup = _load_dev_subset(PTBXL_ROOT, ECGQA_ROOT)
    else:
        # ---------- Full pipeline ----------
        # Step 1: Load + resample all PTB-XL signals
        print("Loading PTB-XL dataset...")
        ptb = PTBXLDataset(root=PTBXL_ROOT, downsampled=False)
        signal_ds = ptb.set_task(PTBXLResampling(root=PTBXL_ROOT))
        # Key the lookup by integer ecg_id so it matches the ids in the QA data
        signal_lookup = {int(s["record_id"]): s["signal"] for s in signal_ds}
        print(f" Loaded {len(signal_lookup)} signal samples")

        # Step 2: Build signal loader
        def signal_loader(ecg_id: int) -> torch.Tensor:
            return torch.as_tensor(signal_lookup[int(ecg_id)], dtype=torch.float32)

        # Step 3: Load ECG-QA data with signals
        print("Loading ECG-QA dataset...")
        qa = ECGQADataset(root=ECGQA_ROOT)
        samples = qa.set_task(ECGQAPreprocessing(signal_loader=signal_loader))
        print(f" Created {len(samples)} QA samples")

    # ---------- Inspect a sample ----------
    if len(samples) == 0:
        print("\nNo matched samples found. "
              "Check that PTBXL_ROOT and ECGQA_ROOT are correct.")
        return

    sample = samples[0]
    print("\n=== Sample ===")
    print(f" patient_id: {sample['patient_id']}")
    print(f" question: {sample['question'][:80]}...")
    print(f" answer: {sample['answer']}")
    print(f" question_type: {sample['question_type']}")
    print(f" episode_class: {sample['episode_class']}")
    if "signal" in sample:
        print(f" signal shape: {sample['signal'].shape}")


if __name__ == "__main__":
    main()
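
The pre-filtering step in `_load_dev_subset` can be tried in isolation, without PTB-XL or ECG-QA on disk. The sketch below is a standalone approximation using toy records: `write_filtered_split` is a hypothetical helper that mirrors the loop in the script, and the toy questions are made up for illustration. It also uses `tempfile.TemporaryDirectory` so the scratch directory is removed on exit, which is a design choice of this sketch rather than something the script does.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List, Set

# Placeholder record mirroring the dummy entry written by the example script.
DUMMY_RECORD: Dict[str, object] = {
    "ecg_id": [0], "question": "", "answer": [""],
    "question_type": "dummy", "attribute_type": "",
    "template_id": 0, "question_id": 0,
    "sample_id": 0, "attribute": [""],
}


def write_filtered_split(
    records: List[Dict[str, object]], matched_ids: Set[int], dst_dir: Path
) -> int:
    """Write records whose first ecg_id is in `matched_ids`; return the kept count.

    Hypothetical helper: falls back to a single dummy record when nothing matches,
    so the split file is never empty.
    """
    kept = [r for r in records if r["ecg_id"][0] in matched_ids]
    dst_dir.mkdir(parents=True, exist_ok=True)
    with open(dst_dir / "00.json", "w") as f:
        json.dump(kept if kept else [DUMMY_RECORD], f)
    return len(kept)


# Toy QA records (invented for illustration only).
toy_records = [
    {"ecg_id": [17], "question": "Is there a normal rhythm?", "answer": ["yes"]},
    {"ecg_id": [42], "question": "Is there noise in lead I?", "answer": ["no"]},
]

with tempfile.TemporaryDirectory() as tmp:
    kept_train = write_filtered_split(toy_records, {17}, Path(tmp) / "train")
    kept_valid = write_filtered_split(toy_records, {99}, Path(tmp) / "valid")
    with open(Path(tmp) / "valid" / "00.json") as f:
        valid_payload = json.load(f)

print(kept_train, kept_valid, valid_payload[0]["question_type"])
```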
2 changes: 2 additions & 0 deletions pyhealth/datasets/__init__.py
@@ -53,6 +53,7 @@ def __init__(self, *args, **kwargs):
from .cosmic import COSMICDataset
from .covid19_cxr import COVID19CXRDataset
from .dreamt import DREAMTDataset
from .ecgqa import ECGQADataset
from .ehrshot import EHRShotDataset
from .eicu import eICUDataset
from .isruc import ISRUCDataset
@@ -61,6 +62,7 @@ def __init__(self, *args, **kwargs):
from .mimic4 import MIMIC4CXRDataset, MIMIC4Dataset, MIMIC4EHRDataset, MIMIC4NoteDataset
from .mimicextract import MIMICExtractDataset
from .omop import OMOPDataset
from .ptbxl import PTBXLDataset
from .physionet_deid import PhysioNetDeIDDataset
from .sample_dataset import SampleBuilder, SampleDataset, create_sample_dataset
from .shhs import SHHSDataset
16 changes: 16 additions & 0 deletions pyhealth/datasets/configs/ecgqa.yaml
@@ -0,0 +1,16 @@
version: "3.0.0"
tables:
  ecg_qa:
    file_path: "ecg-qa-pyhealth.csv"
    patient_id: "patient_id"
    timestamp: null
    attributes:
      - "ecg_id"
      - "question"
      - "answer"
      - "question_type"
      - "attribute_type"
      - "template_id"
      - "question_id"
      - "sample_id"
      - "attribute"
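
The config above points at a flat CSV (`ecg-qa-pyhealth.csv`) whose columns match the listed attributes, while the source QA records are JSON objects with list-valued fields such as `ecg_id` and `answer`. How the PR builds that CSV is not shown in this diff, so the following is only a sketch under assumptions: `to_row` is a hypothetical helper, JSON-encoding the list fields is one possible serialization, and the `patient_id` value is a made-up placeholder.

```python
import csv
import io
import json
from typing import Dict, List

# Column order: patient_id plus the attributes listed in the config.
COLUMNS: List[str] = [
    "patient_id", "ecg_id", "question", "answer", "question_type",
    "attribute_type", "template_id", "question_id", "sample_id", "attribute",
]


def to_row(record: Dict[str, object], patient_id: str) -> Dict[str, object]:
    """Flatten one QA record into a CSV row (hypothetical helper)."""
    row: Dict[str, object] = {"patient_id": patient_id}
    for col in COLUMNS[1:]:
        value = record[col]
        # Assumption: list-valued fields are stored as JSON strings.
        row[col] = json.dumps(value) if isinstance(value, list) else value
    return row


# Toy record (invented for illustration only).
record = {
    "ecg_id": [17], "question": "Is there a normal rhythm?", "answer": ["yes"],
    "question_type": "single-verify", "attribute_type": "rhythm",
    "template_id": 1, "question_id": 2, "sample_id": 3, "attribute": ["rhythm"],
}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerow(to_row(record, patient_id="p0"))

# Read the row back to confirm the round trip.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows[0]["ecg_id"], json.loads(rows[0]["answer"]))
```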
11 changes: 11 additions & 0 deletions pyhealth/datasets/configs/ptbxl.yaml
@@ -0,0 +1,11 @@
version: "1.0.3"
tables:
  ptb-xl:
    file_path: "ptbxl.csv"
    patient_id: "patient_id"
    timestamp: null
    attributes:
      - "load_from_path"
      - "signal_file"
      - "label_file"
      - "save_to_path"