TFBindFormer is a hybrid cross-attention Transformer model for transcription factor (TF)–DNA binding prediction. The model explicitly integrates transcription factor protein representations derived from amino-acid sequence and protein structural context with genomic DNA sequence bins, enabling position-specific modeling of protein–DNA interactions beyond sequence-only approaches.
TFBindFormer is designed for genome-wide TF binding prediction under severe class imbalance and demonstrates improved ranking and enrichment of bona fide binding sites compared with representative state-of-the-art models.
Overview of TFBindFormer architecture.
Hybrid cross-attention module illustrating residue–nucleotide interactions.
- Hybrid cross-attention architecture for explicit residue–nucleotide interactions
- Integration of TF amino-acid sequence and protein structure information
- Genome-wide TF binding prediction under extreme class imbalance
- Modular design for ablation and extension
- Reproducible training and evaluation pipeline
TFBindFormer/
├── data/
│ ├── dna_data/
│ │ ├── train/
│ │ ├── val/
│ │ └── test/
│ └── tf_data/
│ ├── tf_sequence/
│ ├── tf_structure/
│ ├── 3di_out/
│ ├── tf_embeddings/
│ └── metadata_tfbs.tsv
├── analysis/
│ ├── ablation/
│ ├── attention_map/
│ └── data_distribution/
│
├── figures/
├── scripts/
│ ├── train.py
│ ├── eval.py
│ ├── extract_tf_embeddings.py
│ └── generate_3di_tokens.sh
│
├── src/
│ ├── architectures/
│ │ ├── binding_predictor.py
│ │ ├── cross_attention_encoder.py
│ │ └── tbinet_dna_encoder.py
│ │
│ ├── model.py
│ └── utils.py
│
├── LICENSE
├── README.md
└── environment.yml
- data/: DNA sequence data, TF protein data, and metadata. Download from https://doi.org/10.5281/zenodo.18362288
- scripts/: Training, evaluation, and preprocessing scripts
- analysis/: Post-training analyses, including ablation studies, attention map visualization, and data distribution analysis
- src/architectures/: Core model components and attention modules
- src/model.py: TFBindFormer model wrapper
- src/utils.py: Shared utilities and helper functions
git clone https://github.com/BioinfoMachineLearning/TFBindFormer.git
cd TFBindFormer
conda env create -f environment.yml
conda activate tfbindformer
TFBindFormer uses Foldseek-derived 3Di structural tokens to encode protein structural information. The 3Di tokens used in this work are included in the released dataset. To recompute 3Di representations from raw protein structures, or to apply the method to additional transcription factors, install Foldseek by following the official documentation: https://github.com/steineggerlab/foldseek
Ensure the foldseek executable is available in your $PATH.
All DNA sequence data and transcription factor (TF)–related data used in this project are publicly available on Zenodo:
DOI: 10.5281/zenodo.18362288
URL: https://doi.org/10.5281/zenodo.18362288
To download the dataset from the command line, run:
pwd
# .../TFBindFormer
wget -c https://zenodo.org/api/records/18362288/files/data.tar.gz/content -O data.tar.gz
tar -xzf data.tar.gz
Download the dataset and place it under the TFBindFormer/ directory of this repository, following the directory structure described above. The provided files include the preprocessed DNA inputs, the corresponding labels and metadata, and the TF-related data required to reproduce the training and evaluation experiments.
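After extraction, a quick layout check can catch an archive that was unpacked in the wrong place. The following is a minimal sketch, not part of the repository: the helper name `missing_paths` is hypothetical, and the path list is copied from the directory tree and commands in this README rather than from the archive itself.

```python
from pathlib import Path

# Relative paths taken from the directory tree and commands in this README.
EXPECTED_PATHS = [
    "data/dna_data/train",
    "data/dna_data/val",
    "data/dna_data/test",
    "data/tf_data/tf_sequence",
    "data/tf_data/3di_out",
    "data/tf_data/metadata_tfbs.tsv",
]


def missing_paths(repo_root, expected=EXPECTED_PATHS):
    """Return the expected paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    # Run from the TFBindFormer/ directory.
    missing = missing_paths(".")
    if missing:
        print("Missing after extraction:", ", ".join(missing))
    else:
        print("Data layout looks complete.")
```

An empty result means every listed path was found; anything else names the entries to look for before starting training.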
The 3Di tokens used in this study are included in the released dataset (../data/tf_data/3di_out/pdb_3Di_ss.fasta). To recompute 3Di tokens from protein structure files (e.g., PDB) or to generate 3Di representations for additional transcription factors, use the helper script below. Internally, the script runs Foldseek to convert protein structures into sequence-like 3Di tokens.
pwd
# .../TFBindFormer
chmod +x scripts/generate_3di_tokens.sh
./scripts/generate_3di_tokens.sh <pdb_dir> <output_dir>
Arguments:
- <pdb_dir>: Directory containing TF protein structure files in PDB format
- <output_dir>: Directory where the generated 3Di token FASTA files will be saved
Example:
pwd
# .../TFBindFormer
./scripts/generate_3di_tokens.sh \
data/tf_data/tf_structure \
data/tf_data/3di_out
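To inspect the resulting 3Di FASTA file, a small parser is enough. This is a minimal sketch that assumes a standard FASTA layout (a `>` header line followed by one or more sequence lines); `read_fasta` is a hypothetical helper, not a function from this repository.

```python
def read_fasta(path):
    """Parse a FASTA file into {identifier: sequence}.

    The identifier is the header text after '>' up to the first whitespace.
    """
    chunks = {}
    name = None
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                parts = line[1:].split()
                name = parts[0] if parts else ""
                chunks[name] = []
            elif name is not None:
                chunks[name].append(line)  # sequences may span several lines
    return {key: "".join(parts) for key, parts in chunks.items()}
```

For example, `read_fasta("data/tf_data/3di_out/pdb_3Di_ss.fasta")` returns one entry per structure, and `len()` of each value gives the number of 3Di tokens for that transcription factor.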
TFBindFormer represents transcription factors using embeddings derived from amino acid sequences and 3Di structural tokens. The TF protein embeddings used in this study are included in the released dataset (../data/tf_data/tf_embeddings_512).
To recompute TF protein embeddings from the provided amino acid sequences and 3Di tokens, or to generate embeddings for additional transcription factors, run:
pwd
# .../TFBindFormer
nohup python scripts/extract_tf_embeddings.py \
--aa_dir data/tf_data/tf_sequence \
--di_fasta data/tf_data/3di_out/pdb_3Di_ss.fasta \
--out_dir data/tf_data/tf_embeddings \
> extract_tf_embeddings.log 2>&1 &
This command loads transcription factor amino-acid sequences from tf_sequence, integrates the corresponding precomputed 3Di structural token sequences, and writes TF protein embeddings to tf_embeddings.
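Because Foldseek emits one 3Di state per residue of the input structure, the amino-acid and 3Di sequences of a transcription factor have the same length whenever the structure covers the full protein. The sketch below flags entries where that does not hold. It is an illustrative pre-check only: `length_mismatches` and the identifiers in the example are hypothetical, and this README does not state whether extract_tf_embeddings.py itself requires equal lengths.

```python
def length_mismatches(aa_seqs, di_seqs):
    """Compare amino-acid and 3Di sequences keyed by TF identifier.

    Returns (missing, mismatched):
      missing    - identifiers that have no 3Di entry
      mismatched - {identifier: (aa_length, di_length)} where lengths differ
    """
    missing = sorted(set(aa_seqs) - set(di_seqs))
    mismatched = {
        name: (len(aa_seqs[name]), len(di_seqs[name]))
        for name in sorted(set(aa_seqs) & set(di_seqs))
        if len(aa_seqs[name]) != len(di_seqs[name])
    }
    return missing, mismatched


# Toy input with made-up identifiers and sequences.
aa = {"TF_A": "MKTAY", "TF_B": "MSEQL", "TF_C": "MAAA"}
di = {"TF_A": "DDVVL", "TF_B": "DPV"}
missing, mismatched = length_mismatches(aa, di)
```

Entries reported here are worth checking by hand before computing embeddings for new transcription factors.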
Run training from the scripts/ directory:
pwd
# .../TFBindFormer/scripts
nohup python train.py \
--train_dna_npy ../data/dna_data/train/train_oneHot.npy \
--train_labels_npy ../data/dna_data/train/train_labels.npy \
--train_metadata_tsv ../data/tf_data/metadata_tfbs.tsv \
--val_dna_npy ../data/dna_data/val/valid_oneHot.npy \
--val_labels_npy ../data/dna_data/val/valid_labels.npy \
--val_metadata_tsv ../data/tf_data/metadata_tfbs.tsv \
--embedding_dir ../data/tf_data/tf_embeddings \
--epochs 20 \
--batch_size 1024 \
--num_workers 6 \
--lr 1e-4 \
--neg_fraction 0.015 \
--wandb_project tfbind-train \
--run_name tfbind_train \
--output_dir ./checkpoints/tfbind_train \
> tfbind_train.log 2>&1 &
Description
This command trains TFBindFormer using preprocessed genomic DNA inputs and pretrained TF protein embeddings. The model learns TF-conditioned DNA representations under extreme class imbalance.
- DNA sequence inputs are loaded from NumPy arrays (*_oneHot.npy)
- Binding labels are loaded from the corresponding label matrices (*_labels.npy)
- TF metadata is shared between training and validation splits
- Precomputed TF protein embeddings are loaded from --embedding_dir
- Training is performed with downsampled negatives (--neg_fraction)
- Model checkpoints are written to the directory given by --output_dir (here checkpoints/tfbind_train, relative to scripts/)
- Logs are saved to tfbind_train.log
- Training metrics are tracked with Weights & Biases (W&B)
- The job is launched with nohup to allow long-running background execution
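Negative downsampling can be sketched as follows. This illustrates one common scheme (keep every positive example and a random fraction of the negatives); how train.py actually applies --neg_fraction is not described in this README, so the function name, the seed handling, and the per-example label layout are all assumptions.

```python
import random


def downsample_negatives(labels, neg_fraction, seed=0):
    """Return sorted indices of all positives plus a random subset of negatives.

    labels       - sequence of 0/1 binding labels, one per example
    neg_fraction - fraction of negatives to keep, e.g. 0.015
    """
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    n_keep = round(len(negatives) * neg_fraction)
    kept = rng.sample(negatives, n_keep)
    return sorted(positives + kept)


# Toy example: 10 positives and 1000 negatives.
toy_labels = [1] * 10 + [0] * 1000
kept_indices = downsample_negatives(toy_labels, neg_fraction=0.015)
```

With the value used in the command above (0.015), the toy set keeps all 10 positives and 15 of the 1000 negatives, which changes the positive rate seen during training from about 1% to 40%.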
The following command evaluates a trained TFBindFormer model on the test dataset using a saved checkpoint:
pwd
# .../TFBindFormer/scripts
nohup python eval.py \
--test_dna_npy ../data/dna_data/test/test_oneHot.npy \
--test_labels_npy ../data/dna_data/test/test_labels.npy \
--test_metadata_tsv ../data/tf_data/metadata_tfbs.tsv \
--embedding_dir ../data/tf_data/tf_embeddings \
--ckpt_path ../checkpoints/---.ckpt \
--batch_size 1024 \
--wandb_project tfbind_eval \
--run_name tfbind_eval \
> tfbind_eval.log 2>&1 &
This command loads the specified model checkpoint, runs evaluation on the test set, and reports performance metrics. When enabled, results are logged to Weights & Biases.
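Under severe class imbalance, ranking metrics such as average precision (an estimate of the area under the precision-recall curve) are usually more informative than accuracy. This README does not list which metrics eval.py reports, so the sketch below is only an illustration of that one metric, written without external dependencies; it ignores tied scores, which library implementations handle more carefully.

```python
def average_precision(labels, scores):
    """Average precision: mean of precision@rank over the ranks of positives.

    labels - sequence of 0/1 ground-truth labels
    scores - predicted scores, higher means more likely bound
    """
    n_pos = sum(1 for y in labels if y == 1)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    # Rank examples from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / n_pos


# Toy example: positives ranked 1st and 3rd out of four.
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

In the toy example the two positives contribute precisions of 1/1 and 2/3, so the result is 5/6.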
Post-training analyses are organized under the analysis/ directory to
evaluate model design choices, interpret learned interactions, and
characterize data properties.
This directory contains scripts for systematic ablation of transcription factor protein representations and model components:
- aaOnlyEmbedder.py: Evaluates model performance using amino-acid sequence embeddings only.
- 3diOnlyEmbedder.py: Evaluates model performance using structure-derived 3Di embeddings only.
- protst5_embedder.py: Baseline embedding pipeline using combined amino-acid and 3Di representations from ProtST5.
- ablation_figure.py: Generates summary figures comparing ablation results across embedding variants.
These experiments quantify the relative contribution of sequence and structure information to TF–DNA binding prediction performance.
Scripts for extracting and visualizing learned cross-attention patterns:
- get_att_weights.py: Extracts residue–nucleotide cross-attention weights from trained models.
- 1dHeatMap.py: Generates one-dimensional heatmaps illustrating attention intensity across TF residues or DNA positions.
These analyses enable interpretation of residue–nucleotide interaction preferences learned by the hybrid cross-attention module.
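Reducing a two-dimensional attention matrix to a one-dimensional profile can be sketched as below. The layout (rows are DNA positions, columns are TF residues) and both function names are assumptions for illustration; the reductions used by 1dHeatMap.py are not described in this README.

```python
def residue_profile(weights):
    """Mean attention received by each TF residue, averaged over DNA positions."""
    n_rows = len(weights)
    n_cols = len(weights[0])
    return [sum(row[j] for row in weights) / n_rows for j in range(n_cols)]


def position_profile(weights):
    """Peak attention at each DNA position (maximum over TF residues).

    The maximum is used because, when each row is a softmax over residues,
    the row mean is the same constant for every position.
    """
    return [max(row) for row in weights]


# Toy matrix: 2 DNA positions x 3 TF residues, each row sums to 1.
toy_weights = [
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
]
per_residue = residue_profile(toy_weights)
per_position = position_profile(toy_weights)
```

Either list can then be drawn as a single-row heatmap with any plotting library.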
This directory contains scripts for analyzing the distribution of positive genomic bins across transcription factors in different dataset splits:
- plot_train.py: Generates violin plots with embedded boxplots showing the distribution of positive genomic bins per transcription factor in the training set.
- plot_val.py: Generates analogous distribution plots for the validation set.
- plot_test.py: Generates analogous distribution plots for the test set.
- counts_labels.py: Computes label counts and summary statistics used in the distribution analyses.
These analyses characterize inter-TF variability and class imbalance across training, validation, and test splits.
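Counting positive bins per transcription factor amounts to a column sum over the label matrix. The sketch below assumes a layout of one row per genomic bin and one column per TF, which this README does not confirm; `positives_per_tf` is a hypothetical helper, and counts_labels.py may compute additional statistics.

```python
from statistics import median


def positives_per_tf(label_rows):
    """Count positive bins for each TF (column) in a bins-by-TFs 0/1 matrix."""
    counts = [0] * len(label_rows[0])
    for row in label_rows:
        for tf_index, label in enumerate(row):
            counts[tf_index] += int(label)
    return counts


# Toy matrix: 4 genomic bins x 3 TFs.
toy_matrix = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 0],
]
counts = positives_per_tf(toy_matrix)
positive_rates = [c / len(toy_matrix) for c in counts]
median_count = median(counts)
```

The per-TF counts are the values a violin or box plot would summarize, and the positive rates make the degree of class imbalance explicit.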
The stacked Cross Attention Blocks illustrated above are implemented in:
src/architectures/cross_attention_encoder.py
src/architectures/binding_predictor.py
The number of Cross Attention Blocks (k), as well as their internal configuration (hidden dimension, number of heads, dropout, etc.), can be adjusted by modifying the corresponding initialization parameters and module definitions in those files.
In particular:
- The depth of the hybrid cross-attention module controls how many cross-attention blocks are stacked sequentially.
- Each block models residue–nucleotide interactions via cross-attention, followed by feed-forward transformations.
- Advanced users may change the number of blocks or block-level hyperparameters to explore alternative model capacities or ablation variants.
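The core operation inside each block, and the effect of the depth k, can be sketched in plain Python. This is a deliberately reduced illustration and not the code in cross_attention_encoder.py: it uses a single head, omits learned projections, layer normalization, dropout, and the feed-forward sublayer, and it assumes DNA positions act as queries while TF residues supply keys and values.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries - n_q vectors of size d (assumed: DNA positions)
    keys    - n_k vectors of size d (assumed: TF residues)
    values  - n_k vectors of size d_v
    Returns (outputs, weights); weights[i][j] is attention from query i to key j.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        w = softmax(scores)
        # Weighted sum of the value vectors, one output component at a time.
        out = [sum(wj * v[c] for wj, v in zip(w, values))
               for c in range(len(values[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


def stack_blocks(dna, tf, k):
    """Apply k cross-attention steps with residual connections."""
    x = dna
    for _ in range(k):
        attended, _ = cross_attention(x, tf, tf)
        x = [[a + b for a, b in zip(row, att)] for row, att in zip(x, attended)]
    return x


# Toy inputs: 3 DNA positions and 2 TF residues, feature size 2.
dna_toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tf_toy = [[1.0, 0.0], [0.0, 1.0]]
out_toy, attn_toy = cross_attention(dna_toy, tf_toy, tf_toy)
deep_toy = stack_blocks(dna_toy, tf_toy, k=3)
```

Each row of the attention matrix sums to 1, and increasing k only repeats the same update, which is why the depth can be changed without altering tensor shapes.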
If you use TFBindFormer in your work, please cite the associated manuscript:
@unpublished{TFBindFormer,
title = {TFBindFormer: A hybrid cross-attention Transformer for transcription factor--DNA binding prediction},
author = {Liu, Ping and others},
note = {Manuscript in preparation},
year = {2026}
}
For questions or issues related to the code or dataset, please open an issue in this repository. Additional inquiries may be directed to the corresponding author:
Ping Liu
Email: pl5vw@missouri.edu

