This document describes the current src/M2F codebase.
M2F provides:
- logging setup
- HUMAnN/UniProt mining helpers
- regex-based annotation cleaning
- sequence/text embedding utilities
- GO/EC and generic multi-hot encoders
- dataframe persistence (`save_df`/`load_df`)
- a PyG in-memory dataset interface for graph training
Exported from `src/M2F/__init__.py`:
- `configure_logging`
- `extract_accessions_from_humann`, `extract_all_accessions_from_dir`, `fetch_uniprotkb_fields`, `fetch_save_uniprotkb_batches`
- `clean_col`, `clean_cols`
- `AAChainEmbedder`, `FreeTXTEmbedder`, `MultiHotEncoder`, `GOEncoder`, `ECEncoder`, `encode_multihot`, `get_GODag`
- `embed_ft_domains`, `embed_AAsequences`, `embed_freetxt_cols`, `encode_go`, `encode_ec`, `empty_tuples_to_NaNs`, `save_df`, `load_df`
- `util`
fetch_uniprotkb_fields(...) performs rate-limited batched UniProt calls, retries HTTP failures with smaller batches, and returns one concatenated DataFrame.
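The retry-with-smaller-batches behavior can be sketched without any network access. This is a minimal illustration, not the M2F implementation: `fetch_in_batches` and `flaky_fetch` are hypothetical names, the parameter names `request_size`, `rps` and `max_retry` are borrowed from the dataset interface described later in this document, and rows are plain dicts rather than a DataFrame.

```python
import time
from typing import Callable, Dict, List, Sequence

Row = Dict[str, str]


def fetch_in_batches(
    ids: Sequence[str],
    fetch_batch: Callable[[Sequence[str]], List[Row]],
    request_size: int = 4,
    rps: float = 50.0,
    max_retry: int = 3,
) -> List[Row]:
    """Fetch rows for `ids`, splitting a batch in half each time its request fails."""
    min_interval = 1.0 / rps  # minimum spacing between requests
    rows: List[Row] = []
    # Work queue of (batch, retries_left); failed batches are split and re-queued.
    queue = [
        (list(ids[i:i + request_size]), max_retry)
        for i in range(0, len(ids), request_size)
    ]
    while queue:
        batch, retries_left = queue.pop(0)
        time.sleep(min_interval)  # crude rate limit
        try:
            rows.extend(fetch_batch(batch))
        except IOError:
            if retries_left == 0:
                raise
            if len(batch) == 1:
                # Cannot split further: retry the single ID as-is.
                queue.insert(0, (batch, retries_left - 1))
            else:
                mid = len(batch) // 2
                queue.insert(0, (batch[mid:], retries_left - 1))
                queue.insert(0, (batch[:mid], retries_left - 1))
    return rows


def flaky_fetch(batch: Sequence[str]) -> List[Row]:
    """Stand-in for an HTTP call: rejects any batch larger than two IDs."""
    if len(batch) > 2:
        raise IOError("simulated HTTP failure")
    return [{"accession": acc} for acc in batch]


result = fetch_in_batches(["P1", "P2", "P3", "P4", "P5"], flaky_fetch)
```

Re-queuing the two halves at the front of the queue keeps the output in the same order as the input IDs, which makes the concatenated result easier to align with the request.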
clean_col / clean_cols extract structured values with column-specific regexes, optionally normalize text, and convert cell payloads to tuples.
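As a rough illustration of column-specific regex cleaning, the sketch below extracts GO identifiers and EC numbers from raw cell text and returns them as tuples. The patterns, the `COLUMN_PATTERNS` mapping and the `clean_cell` helper are assumptions made for this example; the regexes actually used by `clean_col` may differ.

```python
import re
from typing import Optional, Tuple

# Hypothetical per-column patterns; the real M2F regexes may differ.
COLUMN_PATTERNS = {
    "go_f": re.compile(r"GO:\d{7}"),
    "ec": re.compile(r"\d+(?:\.(?:\d+|-)){3}"),
}


def clean_cell(column: str, cell: Optional[str], lowercase: bool = False) -> Tuple[str, ...]:
    """Extract every pattern match from one cell and return the matches as a tuple."""
    if not cell:
        return ()
    matches = COLUMN_PATTERNS[column].findall(cell)
    if lowercase:
        # Optional text normalization step.
        matches = [m.lower() for m in matches]
    return tuple(matches)


go_cell = "ATP binding [GO:0005524]; kinase activity [GO:0016301]"
go_terms = clean_cell("go_f", go_cell)
ec_terms = clean_cell("ec", "2.7.11.1; 3.4.-.-")
```

Tuples are hashable and immutable, so cleaned cells can be compared or deduplicated without surprises.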
- `AAChainEmbedder`: mean-pooled ESM2 embeddings.
- `FreeTXTEmbedder`: OpenAI embeddings with optional RAM+SQLite caching.
- `GOEncoder`/`ECEncoder`: depth-cut + label encoding.
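The RAM+SQLite caching idea can be sketched as a two-tier lookup: check an in-process dict first, then a SQLite table, and call the embedding backend only on a miss. `CachedEmbedder` and `fake_embed` below are hypothetical names, and the table layout is an assumption; `FreeTXTEmbedder` may organize its cache differently.

```python
import json
import sqlite3
from typing import Callable, Dict, List


class CachedEmbedder:
    """Look up RAM first, then SQLite, and only then call the embedding function."""

    def __init__(self, embed_fn: Callable[[str], List[float]], db_path: str = ":memory:"):
        self.embed_fn = embed_fn
        self.ram: Dict[str, List[float]] = {}
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache (text TEXT PRIMARY KEY, vector TEXT)"
        )

    def embed(self, text: str) -> List[float]:
        if text in self.ram:
            return self.ram[text]
        row = self.db.execute(
            "SELECT vector FROM cache WHERE text = ?", (text,)
        ).fetchone()
        if row is not None:
            vector = json.loads(row[0])
        else:
            vector = self.embed_fn(text)
            self.db.execute(
                "INSERT INTO cache (text, vector) VALUES (?, ?)",
                (text, json.dumps(vector)),
            )
            self.db.commit()
        self.ram[text] = vector  # promote to the RAM tier
        return vector


calls: List[str] = []


def fake_embed(text: str) -> List[float]:
    """Stand-in for a remote embedding call; records each invocation."""
    calls.append(text)
    return [float(len(text)), float(text.count(" "))]


embedder = CachedEmbedder(fake_embed)
first = embedder.embed("ATP binding")
second = embedder.embed("ATP binding")  # served from the RAM tier
embedder.ram.clear()
third = embedder.embed("ATP binding")   # served from the SQLite tier
```

Passing a file path instead of `":memory:"` would let the SQLite tier survive across processes, which is where a persistent cache saves repeated API calls.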
Current depth behavior:
- GO/EC terms that do not reach requested depth are dropped.
- Rows that become empty are represented as `NaN` in `encode_go`/`encode_ec` outputs.
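The depth behavior above can be illustrated for EC numbers, where depth is simply the number of specified levels. This is a minimal sketch under that assumption: `cut_ec_to_depth` is a hypothetical helper, not the `ECEncoder` API, and GO depth (which depends on the ontology graph) is not covered.

```python
import math
from typing import Iterable, Tuple, Union


def cut_ec_to_depth(ec_numbers: Iterable[str], depth: int) -> Union[Tuple[str, ...], float]:
    """Truncate EC numbers to `depth` levels and drop those that are shallower."""
    kept = []
    for ec in ec_numbers:
        # "-" marks an unspecified level, so it does not count toward depth.
        levels = [lvl for lvl in ec.split(".") if lvl != "-"]
        if len(levels) >= depth:
            truncated = ".".join(levels[:depth])
            if truncated not in kept:
                kept.append(truncated)
    # A row with no surviving terms is represented as NaN.
    return tuple(kept) if kept else math.nan


row_a = cut_ec_to_depth(("2.7.11.1", "2.7.10.2", "3.4.-.-"), depth=3)
row_b = cut_ec_to_depth(("3.4.-.-",), depth=3)
```

Here `3.4.-.-` has only two specified levels, so it is dropped at depth 3, and a row containing nothing else becomes `NaN`.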
Provides wrappers and higher-level transforms:
- `embed_AAsequences`
- `embed_ft_domains`
- `embed_freetxt_cols`
- `encode_go` (bound GO encoder)
- `encode_ec` (bound EC encoder)
`save_df` writes heterogeneous payloads to a `.zip` Zarr store. `load_df` reconstructs the DataFrame.
File: src/M2F/pyg_data_interfaces.py
This module is not exported from M2F.__init__; import directly.
```python
from M2F.pyg_data_interfaces import DatasetInput, ProteinGraphInMemoryDataset
```

`DatasetInput` is a validated contract with:
- accession index CSV (`uniref`, `i`)
- edge chunk directory (`chunk_<id>.csv` style by default)
- requested UniProt fields (`uniprot_features`)
- supervised fields (`X`, `Y`)
- UniProt request parameters (`request_size`, `rps`, `max_retry`)
- flexible edge schema:
  - destination column (`edge_dst_column`, default `j`)
  - optional fixed edge attribute columns (`edge_attr_columns`)
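To make the idea of a validated input contract concrete, here is a small sketch using a frozen dataclass. `DatasetInputSketch` is a hypothetical stand-in that covers only a few of the fields listed above, and both checks (supervised fields must be among the requested UniProt fields, and the destination column must be non-empty) are assumptions rather than documented `DatasetInput` behavior.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DatasetInputSketch:
    """Hypothetical stand-in for DatasetInput; the field checks are assumptions."""

    uniprot_features: Tuple[str, ...]
    X: Tuple[str, ...]
    Y: str
    edge_dst_column: str = "j"
    edge_attr_columns: Optional[Tuple[str, ...]] = None

    def __post_init__(self) -> None:
        # Assumed rule: every supervised field must also be requested from UniProt.
        missing = [f for f in (*self.X, self.Y) if f not in self.uniprot_features]
        if missing:
            raise ValueError(f"X/Y fields not requested from UniProt: {missing}")
        if not self.edge_dst_column:
            raise ValueError("edge_dst_column must be a non-empty column name")


ok = DatasetInputSketch(
    uniprot_features=("accession", "sequence", "go_f"),
    X=("sequence",),
    Y="go_f",
)

try:
    DatasetInputSketch(uniprot_features=("accession",), X=("sequence",), Y="go_f")
    error_message = ""
except ValueError as exc:
    error_message = str(exc)
```

Failing at construction time surfaces a misconfigured field list before any UniProt request is made.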
Current lifecycle:
- `download()`
  - queries UniProt and stores `raw/features.csv`
  - materializes index + edge chunks into `raw/`
- `process()`
  - reads `features.csv` and index file
  - normalizes UniProt accession alias (`Entry` -> `accession`)
  - aligns graph nodes to features by accession
  - applies dataset-level `pre_transform(node_df) -> DataFrame`
  - applies dataset-level `pre_filter(node_df) -> boolean mask`
  - removes rows missing required `X`/`Y`
  - reindexes nodes after filtering
  - builds `x`, `y`, `edge_index`, `edge_attr`
  - supports variable-dimensional `edge_attr`
  - writes single processed graph to `processed/data.pt`
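The "reindexes nodes after filtering" step matters because `edge_index` must refer to contiguous node positions. The sketch below shows one way to do this in plain Python. `reindex_after_filter` is a hypothetical helper, and dropping edges that touch a removed node is an assumption about the intended behavior, not something stated by the module.

```python
from typing import Dict, List, Tuple


def reindex_after_filter(
    keep_mask: List[bool],
    edges: List[Tuple[int, int]],
) -> Tuple[Dict[int, int], List[List[int]]]:
    """Map surviving node ids to 0..n-1 and rebuild a 2 x E edge index."""
    old_to_new: Dict[int, int] = {}
    for old_id, keep in enumerate(keep_mask):
        if keep:
            old_to_new[old_id] = len(old_to_new)
    sources: List[int] = []
    targets: List[int] = []
    for src, dst in edges:
        # Assumption: an edge is kept only if both endpoints survive the filter.
        if src in old_to_new and dst in old_to_new:
            sources.append(old_to_new[src])
            targets.append(old_to_new[dst])
    return old_to_new, [sources, targets]


mask = [True, False, True, True]  # node 1 was filtered out
edges = [(0, 1), (0, 2), (2, 3), (3, 0)]
mapping, edge_index = reindex_after_filter(mask, edges)
```

The two-row layout mirrors the `[2, num_edges]` shape that PyG expects for `edge_index`, so the result could be converted to a tensor directly.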
Stored metadata on Data includes:
- `node_id_to_accession`
- `x_fields`
- `y_field`
- `edge_attr_fields`
ProteinGraphOnDiskDataset has been removed from the codebase.
```python
from pathlib import Path

from M2F.pyg_data_interfaces import DatasetInput, ProteinGraphInMemoryDataset

# Describe the raw inputs and which UniProt fields serve as features (X) and target (Y).
inp = DatasetInput(
    path_to_accession_ids_csv_file=Path("untracked/test_data_subset/uniref_index_count.csv"),
    path_to_edge_csv_dir=Path("untracked/test_data_subset"),
    uniprot_features=("accession", "sequence", "go_f"),
    X=("sequence",),
    Y="go_f",
    edge_dst_column="j",
)

# my_dataframe_transform and my_dataframe_filter are user-defined callables:
# pre_transform(node_df) -> DataFrame, pre_filter(node_df) -> boolean mask.
ds = ProteinGraphInMemoryDataset(
    root=Path("untracked/prot_graph_root"),
    dataset_input=inp,
    pre_transform=my_dataframe_transform,
    pre_filter=my_dataframe_filter,
    force_reload=True,
)

data = ds[0]  # the single processed graph
```

- package: `src/M2F`
- active notebooks: `model_notebooks`
- scratch outputs: `untracked`
- legacy scripts/examples: `legacy_code_examples`