TFBindFormer

TFBindFormer is a hybrid cross-attention Transformer model for transcription factor (TF)–DNA binding prediction. The model explicitly integrates transcription factor protein representations derived from amino-acid sequence and protein structural context with genomic DNA sequence bins, enabling position-specific modeling of protein–DNA interactions beyond sequence-only approaches.

TFBindFormer is designed for genome-wide TF binding prediction under severe class imbalance and demonstrates improved ranking and enrichment of bona fide binding sites compared with representative state-of-the-art models.

Model Architecture

Overview of TFBindFormer architecture.

Hybrid cross-attention module illustrating residue–nucleotide interactions.

Features

Hybrid cross-attention architecture for explicit residue–nucleotide interactions
Integration of TF amino-acid sequence and protein structure information
Genome-wide TF binding prediction under extreme class imbalance
Modular design for ablation and extension
Reproducible training and evaluation pipeline

Repository Structure

TFBindFormer/
├── data/
│   ├── dna_data/
│   │   ├── train/
│   │   ├── val/
│   │   └── test/
│   └── tf_data/
│       ├── tf_sequence/
│       ├── tf_structure/
│       ├── 3di_out/
│       ├── tf_embeddings/
│       └── metadata_tfbs.tsv
├── analysis/
│   ├── ablation/
│   ├── attention_map/
│   └── data_distribution/
│  
├── figures/
├── scripts/
│   ├── train.py
│   ├── eval.py
│   ├── extract_tf_embeddings.py
│   └── generate_3di_tokens.sh
│
├── src/
│   ├── architectures/
│   │   ├── binding_predictor.py
│   │   ├── cross_attention_encoder.py
│   │   └── tbinet_dna_encoder.py
│   │
│   ├── model.py
│   └── utils.py
│
├── LICENSE
├── README.md
└── environment.yml

data/: DNA sequence data, TF protein data, and metadata (download from https://doi.org/10.5281/zenodo.18362288)
scripts/: Training, evaluation, and preprocessing scripts
analysis/: Post-training analyses, including ablation studies, attention map visualization, and data distribution analysis
src/architectures/: Core model components and attention modules
src/model.py: TFBindFormer model wrapper
src/utils.py: Shared utilities and helper functions

Quick Start

1. Create environment and install dependencies

git clone https://github.com/BioinfoMachineLearning/TFBindFormer.git
cd TFBindFormer
conda env create -f environment.yml
conda activate tfbindformer

2. External Dependencies

TFBindFormer uses Foldseek-derived 3Di structural tokens to encode protein structural information. The 3Di tokens used in this work are included in the released dataset. Users interested in recomputing 3Di representations from raw protein structures or applying the method to additional transcription factors may install Foldseek following the official documentation: https://github.com/steineggerlab/foldseek

Ensure the foldseek executable is available in your $PATH.
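The PATH requirement can be checked from Python before running the helper script. The following is a minimal sketch using only the standard library; the function name find_executable is illustrative and is not part of this repository.

```python
import shutil


def find_executable(name="foldseek"):
    """Return the absolute path of `name` if it is on PATH, else None."""
    return shutil.which(name)


path = find_executable()
if path is None:
    print("foldseek not found on PATH; install it or update PATH")
else:
    print(f"foldseek found at {path}")
```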

3. Download dataset

All DNA sequence data and transcription factor (TF)–related data used in this project are publicly available on Zenodo:

DOI: 10.5281/zenodo.18362288

URL: https://doi.org/10.5281/zenodo.18362288

To download the dataset from the command line, run:

pwd
# .../TFBindFormer

wget -c https://zenodo.org/api/records/18362288/files/data.tar.gz/content -O data.tar.gz
tar -xzf data.tar.gz

After extraction, the data/ directory should sit directly under TFBindFormer/, following the directory structure described above. If you download the archive manually from Zenodo instead, extract it in the same location. The provided files include the preprocessed DNA inputs, corresponding labels and metadata, and the TF-related data required to reproduce the training and evaluation experiments.
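To confirm that the extracted data matches the expected layout, a small check such as the one below may help. It is a sketch only: the list of expected paths is copied from the Repository Structure listing in this README, the names EXPECTED and missing_entries are illustrative, and the actual archive contents may differ from that listing.

```python
from pathlib import Path

# Expected entries, taken from the Repository Structure listing (assumed complete).
EXPECTED = [
    "data/dna_data/train",
    "data/dna_data/val",
    "data/dna_data/test",
    "data/tf_data/tf_sequence",
    "data/tf_data/tf_structure",
    "data/tf_data/3di_out",
    "data/tf_data/tf_embeddings",
    "data/tf_data/metadata_tfbs.tsv",
]


def missing_entries(repo_root="."):
    """Return the expected relative paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


# Run from the TFBindFormer/ directory.
missing = missing_entries(".")
print("layout OK" if not missing else f"missing: {missing}")
```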

4. Generate 3Di structural tokens

The 3Di tokens used in this study are included in the released dataset (data/tf_data/3di_out/pdb_3Di_ss.fasta). To recompute 3Di tokens from protein structure files (e.g., PDB) or to generate 3Di representations for additional transcription factors, use the helper script below, which runs Foldseek to convert protein structures into sequence-like 3Di tokens.

pwd
# .../TFBindFormer

chmod +x scripts/generate_3di_tokens.sh
./scripts/generate_3di_tokens.sh <pdb_dir> <output_dir>

Arguments:

<pdb_dir>: Directory containing TF protein structure files in PDB format

<output_dir>: Directory where generated 3Di token FASTA files will be saved

Example:

pwd
# .../TFBindFormer

./scripts/generate_3di_tokens.sh \
  data/tf_data/tf_structure \
  data/tf_data/3di_out
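To inspect the resulting 3Di FASTA output, a generic FASTA reader is usually enough. The sketch below assumes standard FASTA formatting (a header line starting with ">" followed by one or more sequence lines); the function name parse_fasta and the toy records are illustrative and are not taken from the dataset.

```python
def parse_fasta(lines):
    """Parse FASTA-formatted lines into a dict mapping header to sequence."""
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines of one record may be wrapped over several lines.
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Toy records for illustration only.
example = [">TF_A", "DVVLVVLC", "VVSC", ">TF_B", "DPPD"]
for name, seq in parse_fasta(example).items():
    print(name, len(seq))
```

To read a real file, pass an open file handle instead of a list, for example the handle returned by open("data/tf_data/3di_out/pdb_3Di_ss.fasta"). Because 3Di assigns one structural token per residue, each record should have the same length as the corresponding amino-acid sequence.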

5. Generate TF protein embeddings

TFBindFormer represents transcription factors using embeddings derived from amino-acid sequences and 3Di structural tokens. The TF protein embeddings used in this study are included in the released dataset (data/tf_data/tf_embeddings_512).

To recompute TF protein embeddings from the provided amino acid sequences and 3Di tokens, or to generate embeddings for additional transcription factors, run:

pwd
# .../TFBindFormer

nohup python scripts/extract_tf_embeddings.py \
  --aa_dir data/tf_data/tf_sequence \
  --di_fasta data/tf_data/3di_out/pdb_3Di_ss.fasta \
  --out_dir data/tf_data/tf_embeddings \
  > extract_tf_embeddings.log 2>&1 &

This command loads transcription factor amino-acid sequences from tf_sequence, integrates the corresponding precomputed 3Di structural token sequences, and outputs TF protein embeddings to tf_embeddings.

6. Train TFBindFormer

Run training from the scripts/ directory:

pwd
# .../TFBindFormer/scripts

nohup python train.py \
  --train_dna_npy ../data/dna_data/train/train_oneHot.npy \
  --train_labels_npy ../data/dna_data/train/train_labels.npy \
  --train_metadata_tsv ../data/tf_data/metadata_tfbs.tsv \
  --val_dna_npy ../data/dna_data/val/valid_oneHot.npy \
  --val_labels_npy ../data/dna_data/val/valid_labels.npy \
  --val_metadata_tsv ../data/tf_data/metadata_tfbs.tsv \
  --embedding_dir ../data/tf_data/tf_embeddings \
  --epochs 20 \
  --batch_size 1024 \
  --num_workers 6 \
  --lr 1e-4 \
  --neg_fraction 0.015 \
  --wandb_project tfbind-train \
  --run_name tfbind_train \
  --output_dir ./checkpoints/tfbind_train \
  > tfbind_train.log 2>&1 &

Description

This command trains TFBindFormer using preprocessed genomic DNA inputs and precomputed TF protein embeddings. The model learns TF-conditioned DNA representations under extreme class imbalance.

DNA sequence inputs are loaded from NumPy arrays (*_oneHot.npy)
Binding labels are loaded from corresponding label matrices (*_labels.npy)
TF metadata is shared between training and validation splits
Precomputed TF protein embeddings are loaded from --embedding_dir
Training is performed with downsampled negatives (--neg_fraction)
Model checkpoints are written to checkpoints/tfbind_train
Logs are saved to tfbind_train.log
Training metrics are tracked with Weights & Biases (W&B)
The job is launched with nohup to allow long-running background execution
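One plausible reading of --neg_fraction is that every positive example is kept while only a random fraction of the negatives is retained. The sketch below illustrates that idea with the standard library; it is an assumption about the sampling scheme, not a description of what train.py actually does, and the function name downsample_negatives is illustrative.

```python
import random


def downsample_negatives(labels, neg_fraction, seed=0):
    """Keep every positive example and a random fraction of the negatives.

    labels: list of 0/1 ints, one per (DNA bin, TF) pair.
    Returns the sorted indices of the retained examples.
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    n_keep = round(len(negatives) * neg_fraction)
    kept_negatives = rng.sample(negatives, n_keep)
    return sorted(positives + kept_negatives)


# Toy example: 10 positives among 1,000 pairs (1% positive rate).
labels = [1] * 10 + [0] * 990
kept = downsample_negatives(labels, neg_fraction=0.015)
n_pos = sum(labels[i] for i in kept)
print(len(kept), n_pos)  # 25 retained examples, 10 of them positive
```

In this toy case the positive rate among retained examples rises from 1% to 40%, which is the kind of rebalancing that negative downsampling is meant to achieve.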

7. Evaluation

The following command evaluates a trained TFBindFormer model on the test dataset using a saved checkpoint:

pwd
# .../TFBindFormer/scripts

nohup python eval.py \
  --test_dna_npy ../data/dna_data/test/test_oneHot.npy \
  --test_labels_npy ../data/dna_data/test/test_labels.npy \
  --test_metadata_tsv ../data/tf_data/metadata_tfbs.tsv \
  --embedding_dir ../data/tf_data/tf_embeddings \
  --ckpt_path ../checkpoints/---.ckpt \
  --batch_size 1024 \
  --wandb_project tfbind_eval \
  --run_name tfbind_eval \
  > tfbind_eval.log 2>&1 &

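This command loads the model checkpoint given by --ckpt_path, runs evaluation on the test set, and reports performance metrics. Replace the --- placeholder with the name of a checkpoint file saved during training; with the training command above, which is run from scripts/ with --output_dir ./checkpoints/tfbind_train, checkpoints are written under scripts/checkpoints/tfbind_train, so adjust the path accordingly. When enabled, results are logged to Weights & Biases.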
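This README does not list the metrics that eval.py reports. As background, average precision is a ranking metric commonly used under class imbalance; the sketch below shows how it can be computed from labels and scores. It is an illustration only, not the evaluation code of this repository, and it breaks ties between equal scores by input order.

```python
def average_precision(labels, scores):
    """Average precision: mean of the precision at the rank of each positive."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Toy example: positives are ranked 1st and 3rd out of 4.
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
print(round(ap, 4))  # (1/1 + 2/3) / 2 = 0.8333
```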

Analysis

Post-training analyses are organized under the analysis/ directory to evaluate model design choices, interpret learned interactions, and characterize data properties.

1. Ablation Studies (`analysis/ablation/`)

This directory contains scripts for systematic ablation of transcription factor protein representations and model components:

aaOnlyEmbedder.py
Evaluates model performance using amino-acid sequence embeddings only.
3diOnlyEmbedder.py
Evaluates model performance using structure-derived 3Di embeddings only.
protst5_embedder.py
Baseline embedding pipeline using combined amino-acid and 3Di representations from ProstT5.
ablation_figure.py
Generates summary figures comparing ablation results across embedding variants.

These experiments quantify the relative contribution of sequence and structure information to TF–DNA binding prediction performance.

2. Attention Map Analysis (`analysis/attention_map/`)

Scripts for extracting and visualizing learned cross-attention patterns:

get_att_weights.py
Extracts residue–nucleotide cross-attention weights from trained models.
1dHeatMap.py
Generates one-dimensional heatmaps illustrating attention intensity across TF residues or DNA positions.

These analyses enable interpretation of residue–nucleotide interaction preferences learned by the hybrid cross-attention module.
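As an illustration of how a two-dimensional attention matrix can be reduced to a one-dimensional profile for a heatmap, consider the sketch below. The matrix orientation (rows are DNA positions, columns are TF residues) and the choice of reductions (mean over DNA positions, maximum over residues) are assumptions made for the example; 1dHeatMap.py may use a different convention.

```python
def attention_profiles(weights):
    """Collapse a 2D attention matrix into two 1D profiles.

    weights[i][j] is the attention from DNA position i to TF residue j
    (orientation assumed for this example).
    Returns (per_residue, per_position).
    """
    n_rows = len(weights)
    n_cols = len(weights[0])
    # Mean attention each residue receives across all DNA positions.
    per_residue = [sum(weights[i][j] for i in range(n_rows)) / n_rows
                   for j in range(n_cols)]
    # Strongest attention each DNA position assigns to any residue.
    per_position = [max(row) for row in weights]
    return per_residue, per_position


# Toy matrix: 2 DNA positions x 3 residues, each row sums to 1.
toy = [[0.7, 0.2, 0.1],
       [0.1, 0.8, 0.1]]
per_residue, per_position = attention_profiles(toy)
print(per_residue, per_position)
```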

3. Data Distribution Analysis (`analysis/data_distribution/`)

This directory contains scripts for analyzing the distribution of positive genomic bins across transcription factors in different dataset splits:

plot_train.py
Generates violin plots with embedded boxplots showing the distribution of positive genomic bins per transcription factor in the training set.
plot_val.py
Generates analogous distribution plots for the validation set.
plot_test.py
Generates analogous distribution plots for the test set.
counts_labels.py
Computes label counts and summary statistics used in the distribution analyses.

These analyses characterize inter-TF variability and class imbalance across training, validation, and test splits.
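The per-TF counts behind these plots amount to column sums over a binary label matrix. The sketch below assumes a bins x TFs layout with 0/1 entries, which is an assumption about the label format; the function name positives_per_tf is illustrative and the toy matrix is not real data.

```python
import statistics


def positives_per_tf(label_matrix):
    """Count positive bins per TF for a bins x TFs matrix of 0/1 labels."""
    n_tfs = len(label_matrix[0])
    return [sum(row[j] for row in label_matrix) for j in range(n_tfs)]


# Toy matrix: 4 genomic bins x 3 TFs.
label_matrix = [
    [1, 0, 0],
    [1, 0, 1],
    [0, 0, 1],
    [1, 0, 0],
]
counts = positives_per_tf(label_matrix)
print(counts, statistics.median(counts))  # [3, 0, 2] 2
```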

Hybrid Cross-Attention Module Configuration (Advanced)

The stacked Cross Attention Blocks illustrated above are implemented in:

src/architectures/cross_attention_encoder.py
src/architectures/binding_predictor.py

The number of Cross Attention Blocks (k), as well as their internal configuration (hidden dimension, number of heads, dropout, etc.), can be adjusted by modifying the corresponding initialization parameters and module definitions in those files.

In particular:

The depth of the hybrid cross-attention module controls how many cross-attention blocks are stacked sequentially.

Each block models residue–nucleotide interactions via cross-attention, followed by feed-forward transformations.

Advanced users may change the number of blocks or block-level hyperparameters to explore alternative model capacities or ablation variants.
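To make the role of the block count k concrete, the following is a deliberately simplified sketch of stacked cross-attention in plain Python. It is not the implementation in cross_attention_encoder.py: it uses a single head, omits learned projections, layer normalization, dropout, and the feed-forward sublayer, and it assumes that DNA positions act as queries while TF residues provide keys and values. All function names are illustrative.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: one vector per DNA position (assumed role).
    keys, values: one vector per TF residue (assumed role).
    Returns (outputs, weights); weights[i][j] is attention from query i to key j.
    """
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights


def stack_blocks(dna, tf, num_blocks):
    """Apply num_blocks cross-attention steps, each with a residual connection."""
    for _ in range(num_blocks):
        attended, _ = cross_attention(dna, tf, tf)
        dna = [[x + a for x, a in zip(row, att)]
               for row, att in zip(dna, attended)]
    return dna


# Toy inputs: 2 DNA positions and 3 TF residues, feature dimension 2.
dna = [[1.0, 0.0], [0.0, 1.0]]
tf = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(stack_blocks(dna, tf, num_blocks=2))
```

Increasing num_blocks lets the DNA representation be refined repeatedly against the same TF representation, which is the sense in which depth controls model capacity here.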

Citation

If you use TFBindFormer in your work, please cite the associated manuscript:

@unpublished{TFBindFormer,
  title   = {TFBindFormer: A hybrid cross-attention Transformer for transcription factor--DNA binding prediction},
  author  = {Liu, Ping and others},
  note    = {Manuscript in preparation},
  year    = {2026}
}

Contact

For questions or issues related to the code or dataset, please open an issue in this repository. Additional inquiries may be directed to the corresponding author:

Ping Liu
Email: pl5vw@missouri.edu
