Open
Commits
24 commits
49a2920
feat(datasets): add MIMIC-IV FHIR NDJSON ingest and CEHR sequences
evanfebrianto Mar 21, 2026
b7476a6
feat(tasks): add MPF clinical prediction task for FHIR timelines
evanfebrianto Mar 21, 2026
c1b2f80
feat(models): add CEHR embeddings and EHRMambaCEHR
evanfebrianto Mar 21, 2026
bfa6ec3
test: add synthetic FHIR, MPF, and EHRMambaCEHR coverage
evanfebrianto Mar 21, 2026
cacfe35
feat(examples): add MIMIC-IV FHIR MPF training script
evanfebrianto Mar 22, 2026
13a5a5d
docs: add EHRMambaCEHR API page and models toctree entry
evanfebrianto Mar 22, 2026
9a9ed59
chore: refresh pixi lock editable pyhealth package hash
evanfebrianto Mar 22, 2026
8d8e9e2
feat(datasets): make MIMIC4FHIRDataset extend BaseDataset
evanfebrianto Mar 22, 2026
aa31fb5
docs(tasks): expand MPF clinical prediction task docstrings
evanfebrianto Mar 23, 2026
05d5958
test: temp NDJSON root, global_event_df guard, MPF max_len edge
evanfebrianto Mar 23, 2026
3070cb1
feat(examples): synthetic ablation grid and setup docstring
evanfebrianto Mar 23, 2026
5c37d1d
docs: add MIMIC4FHIR and MPF task API pages
evanfebrianto Mar 23, 2026
f1223c4
fix(datasets): ensure exact encounter ID matching in CEHR sequences
evanfebrianto Mar 23, 2026
1419d4d
feat(datasets): add handling for unlinked events in CEHR sequences
evanfebrianto Mar 23, 2026
41e246d
fix(datasets): improve handling of max_len in CEHR sequences and upda…
evanfebrianto Mar 23, 2026
9b16810
refactor(datasets): rename and enhance visit index handling for unlin…
evanfebrianto Mar 23, 2026
4f4db9c
fix(datasets): enhance MIMIC4FHIRDataset to support gzip NDJSON files…
evanfebrianto Mar 23, 2026
9d1bf3f
docs(datasets): update MIMIC4FHIRDataset documentation and enhance ex…
evanfebrianto Mar 23, 2026
507d329
feat(datasets): add clinical concept key resolution for MedicationReq…
evanfebrianto Mar 23, 2026
4ff1cc1
feat(examples): update learning rate handling in training functions
evanfebrianto Mar 23, 2026
108687a
fix(datasets): improve handling of empty token_to_id in ConceptVocab …
evanfebrianto Mar 23, 2026
045e536
Refactor/ndjson-ingestion (#1)
evanfebrianto Mar 28, 2026
49a2363
Refactor FHIR ingest to flattened YAML-driven tables (#2)
evanfebrianto Apr 12, 2026
b1d169c
Merge branch 'sunlabuiuc:master' into feat/ehrmamba
evanfebrianto Apr 12, 2026
1 change: 1 addition & 0 deletions docs/api/datasets.rst
@@ -224,6 +224,7 @@ Available Datasets
datasets/pyhealth.datasets.SampleDataset
datasets/pyhealth.datasets.MIMIC3Dataset
datasets/pyhealth.datasets.MIMIC4Dataset
datasets/pyhealth.datasets.MIMIC4FHIRDataset
datasets/pyhealth.datasets.MedicalTranscriptionsDataset
datasets/pyhealth.datasets.CardiologyDataset
datasets/pyhealth.datasets.eICUDataset
70 changes: 70 additions & 0 deletions docs/api/datasets/pyhealth.datasets.MIMIC4FHIRDataset.rst
@@ -0,0 +1,70 @@
pyhealth.datasets.MIMIC4FHIRDataset
=====================================

Ingests `MIMIC-IV on FHIR <https://physionet.org/content/mimic-iv-fhir/>`_ NDJSON
exports and builds CEHR-style token sequences for use with
:class:`~pyhealth.tasks.mpf_clinical_prediction.MPFClinicalPredictionTask` and
:class:`~pyhealth.models.EHRMambaCEHR`.

YAML defaults live in ``pyhealth/datasets/configs/mimic4_fhir.yaml``. Unlike the
earlier nested-object approach, the YAML now declares a normal ``tables:``
schema for flattened FHIR resources (``patient``, ``encounter``, ``condition``,
``observation``, ``medication_request``, ``procedure``). The class subclasses
:class:`~pyhealth.datasets.BaseDataset` and builds a standard Polars
``global_event_df`` backed by cached Parquet (``global_event_df.parquet/part-*.parquet``),
so it follows the same tabular path as other datasets:
:meth:`~pyhealth.datasets.BaseDataset.set_task`, :meth:`iter_patients`,
:meth:`get_patient`, etc.

**Ingest (out-of-core).** Matching ``*.ndjson`` / ``*.ndjson.gz`` files are read
**line by line**; each resource is normalized into a flattened per-resource
Parquet table under ``cache/flattened_tables/``. Those tables are then fed
through the regular YAML-driven :class:`~pyhealth.datasets.BaseDataset` loader to
materialize ``global_event_df``. This keeps FHIR aligned with PyHealth's usual
table-first pipeline instead of reparsing nested JSON per patient downstream.
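
The line-by-line read described above can be sketched with the standard library
alone. This is a minimal illustration under stated assumptions, not the
dataset's actual implementation: the helper names ``iter_ndjson_resources`` and
``flatten_patient`` and the flattened column names are hypothetical.

```python
import gzip
import json
from pathlib import Path
from typing import Any, Dict, Iterator


def iter_ndjson_resources(path: Path) -> Iterator[Dict[str, Any]]:
    """Yield one parsed FHIR resource per non-empty line of an NDJSON file.

    Hypothetical helper: accepts plain ``.ndjson`` and gzip-compressed
    ``.ndjson.gz`` inputs and parses a single line at a time.
    """
    opener = gzip.open if path.name.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines between resources
                yield json.loads(line)


def flatten_patient(resource: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce a nested Patient resource to a flat row (illustrative columns)."""
    return {
        "patient_id": resource.get("id"),
        "gender": resource.get("gender"),
        "birth_date": resource.get("birthDate"),
    }
```

Because the reader is a generator, rows can be written to a per-resource table
in batches without the whole file ever being resident in memory.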

**Cohort limit** (``max_patients``). When set, the loader selects the first *N* patient ids after
a **sorted** ``unique`` over the flattened patient table, filters every
normalized table to that cohort, and then builds ``global_event_df`` from the
filtered tables. Ingest still scans all matching NDJSON once unless you also
override ``glob_patterns`` / ``glob_pattern`` (defaults skip non-flattened PhysioNet shards).
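
The selection rule can be illustrated in plain Python. This is only a sketch of
the "sorted unique, take the first *N*, then filter" logic; the real loader
works on the flattened Parquet tables, and the function names and the
``patient_id`` column used here are hypothetical.

```python
from typing import Dict, List


def select_cohort(patient_ids: List[str], max_patients: int) -> List[str]:
    """Return the first ``max_patients`` ids after a sorted de-duplication."""
    return sorted(set(patient_ids))[:max_patients]


def filter_rows(rows: List[Dict[str, str]], cohort: List[str]) -> List[Dict[str, str]]:
    """Keep only rows whose ``patient_id`` belongs to the cohort."""
    keep = set(cohort)
    return [row for row in rows if row["patient_id"] in keep]


patient_ids = ["p3", "p1", "p2", "p1"]
cohort = select_cohort(patient_ids, max_patients=2)
conditions = [
    {"patient_id": "p1", "code": "I10"},
    {"patient_id": "p3", "code": "E11"},
]
print(cohort)                           # ['p1', 'p2']
print(filter_rows(conditions, cohort))  # [{'patient_id': 'p1', 'code': 'I10'}]
```

Sorting before truncating means the cohort depends only on the set of ids, not
on the order in which files or lines happen to be read.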

**Downstream memory (still important).** Streaming ingest avoids loading the
entire NDJSON corpus into RAM at once, but other steps can still be heavy on
large cohorts: ``global_event_df`` materialization, MPF vocabulary warmup, and
:meth:`set_task` all walk patients and samples, and training needs RAM/VRAM for
the model and batches. For a **full** PhysioNet tree, plan for **large disk**
(flattened tables plus event cache) and **comfortable system RAM** for
Polars/PyArrow and task pipelines. When prototyping on a laptop, restrict
``glob_patterns`` / ``glob_pattern`` or set ``max_patients``.
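
To make the effect of narrowing the globs concrete, the sketch below matches
files under a temporary directory with :mod:`pathlib`. It assumes nothing about
how ``MIMIC4FHIRDataset`` resolves its patterns internally; ``match_files``, the
file names, and the patterns are all hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def match_files(root: Path, glob_patterns: Iterable[str]) -> List[Path]:
    """Return the sorted, de-duplicated files under ``root`` matching any pattern."""
    matched = set()
    for pattern in glob_patterns:
        matched.update(p for p in root.glob(pattern) if p.is_file())
    return sorted(matched)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Illustrative file names only; a real export will differ.
    for name in ("Patient.ndjson.gz", "Encounter.ndjson.gz", "Observation.ndjson"):
        (root / name).touch()

    everything = match_files(root, ["*.ndjson", "*.ndjson.gz"])
    narrow = match_files(root, ["Patient*.ndjson*", "Encounter*.ndjson*"])
    print([p.name for p in everything])
    print([p.name for p in narrow])
```

A narrower pattern list shrinks the set of files scanned at all, whereas
``max_patients`` only trims the cohort after the scan.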

**Recommended hardware (informal)**

Order-of-magnitude guides, not guarantees. Ingest footprint is **much smaller**
than “load everything into Python”; wall time still grows with **decompressed
NDJSON volume** and the amount of flattened table data produced.

* **Smoke / CI**
  Small on-disk fixtures (see tests and ``examples/mimic4fhir_mpf_ehrmamba.py``):
  a recent laptop is sufficient.

* **Laptop-scale real FHIR subset**
  A **narrow** ``glob_patterns`` / ``glob_pattern`` and/or ``max_patients`` in the hundreds keeps
  cache and task passes manageable. **≥ 16 GB** system RAM is a practical
  comfort target for Polars + trainer + OS; validate GPU **VRAM** for your
  ``max_len`` and batch size.

* **Full default globs on a complete export**
  Favor **workstations or servers** with **fast SSD**, **large disk**, and
  **ample RAM** for downstream steps, not because NDJSON is fully buffered in
  memory during ingest, but because total work and caches still scale with the
  full dataset.

.. autoclass:: pyhealth.datasets.MIMIC4FHIRDataset
    :members:
    :undoc-members:
    :show-inheritance:

.. autoclass:: pyhealth.datasets.ConceptVocab
    :members:
    :undoc-members:
    :show-inheritance:
1 change: 1 addition & 0 deletions docs/api/models.rst
@@ -185,6 +185,7 @@ API Reference
models/pyhealth.models.MoleRec
models/pyhealth.models.Deepr
models/pyhealth.models.EHRMamba
models/pyhealth.models.EHRMambaCEHR
models/pyhealth.models.JambaEHR
models/pyhealth.models.ContraWR
models/pyhealth.models.SparcNet
12 changes: 12 additions & 0 deletions docs/api/models/pyhealth.models.EHRMambaCEHR.rst
@@ -0,0 +1,12 @@
pyhealth.models.EHRMambaCEHR
===================================

EHRMambaCEHR applies CEHR-style embeddings (:class:`~pyhealth.models.cehr_embeddings.MambaEmbeddingsForCEHR`)
and a stack of :class:`~pyhealth.models.MambaBlock` layers to a single FHIR token stream, for use with
:class:`~pyhealth.tasks.mpf_clinical_prediction.MPFClinicalPredictionTask` and
:class:`~pyhealth.datasets.mimic4_fhir.MIMIC4FHIRDataset`.

.. autoclass:: pyhealth.models.EHRMambaCEHR
    :members:
    :undoc-members:
    :show-inheritance:
1 change: 1 addition & 0 deletions docs/api/tasks.rst
@@ -214,6 +214,7 @@ Available Tasks
Drug Recommendation <tasks/pyhealth.tasks.drug_recommendation>
Length of Stay Prediction <tasks/pyhealth.tasks.length_of_stay_prediction>
Medical Transcriptions Classification <tasks/pyhealth.tasks.MedicalTranscriptionsClassification>
MPF Clinical Prediction (FHIR) <tasks/pyhealth.tasks.mpf_clinical_prediction>
Mortality Prediction (Next Visit) <tasks/pyhealth.tasks.mortality_prediction>
Mortality Prediction (StageNet MIMIC-IV) <tasks/pyhealth.tasks.mortality_prediction_stagenet_mimic4>
Patient Linkage (MIMIC-III) <tasks/pyhealth.tasks.patient_linkage_mimic3_fn>
12 changes: 12 additions & 0 deletions docs/api/tasks/pyhealth.tasks.mpf_clinical_prediction.rst
@@ -0,0 +1,12 @@
pyhealth.tasks.mpf_clinical_prediction
======================================

Multitask Prompted Fine-tuning (MPF) style binary clinical prediction on FHIR
token timelines, paired with :class:`~pyhealth.datasets.MIMIC4FHIRDataset` and
:class:`~pyhealth.models.EHRMambaCEHR`. Based on CEHR / EHRMamba ideas; see the
paper linked in the course replication PR.

.. autoclass:: pyhealth.tasks.MPFClinicalPredictionTask
    :members:
    :undoc-members:
    :show-inheritance: