Seal: A Transfer Learning Framework for Sequence-Based Expression Modeling with Spatiotemporal Cell State Resolution
Yun Hao, Christopher Y. Park, Chandra L. Theesfeld, and Olga G. Troyanskaya
Flatiron Institute, Princeton University
Seal is an interpretable deep learning framework that enables high-resolution, context-specific prediction of the transcriptional effects of genetic variants, with a particular focus on brain development and neuropsychiatric conditions. Seal addresses a core challenge in systems biology and precision medicine: how to decode the functional effects of genomic variation in highly specific cellular and developmental contexts, such as early fetal brain states, where data are sparse and cell states are transient. This problem has remained unsolved because of the complexity of these systems, which demand both a high level of biological resolution and models able to overcome extreme data limitations. Seal overcomes both barriers with a novel transfer-learning-based neural architecture that integrates abundant expression data from general brain contexts with limited data from rare, developmentally specific cell types. As a result, Seal achieves unprecedented coverage for modeling these transcriptional systems, accurately modeling gene expression and variant effects across 802 brain-specific contexts spanning 26 regions, 30 cell types, and seven developmental stages. Seal is not only accurate but also mechanistically insightful: it links sequence variants to specific regulatory mechanisms, such as transcription factors and histone modifications, allowing predicted biological effects to be interpreted directly through their underlying biological drivers.
The Seal framework is described in the following manuscript: Link
Seal requires Python 3.6+ and the Python package PyTorch (>=1.9); follow the PyTorch installation steps here. The remaining dependencies can be installed by running `pip install -r requirements.txt`. Seal also relies on the `closest-features` utility from BEDOPS to find the closest representative TSS for each variant; follow the installation steps here.
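For example, a typical setup might look like the following (a minimal sketch; the environment name and package channels are assumptions, and the exact PyTorch install command depends on your platform and CUDA version):

```
# create and activate an isolated environment (name is arbitrary)
conda create -n seal python=3.9
conda activate seal

# install PyTorch (>=1.9); see the PyTorch site for platform-specific commands
pip install torch

# install the remaining Python dependencies
pip install -r requirements.txt

# install BEDOPS, which provides the closest-features utility
conda install -c bioconda bedops
```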
Clone the repository, then download and extract the necessary resource files:
```
git clone https://github.com/FunctionLab/Seal.git
cd Seal
sh ./download_resources.sh
```

Variant effect prediction

Command line (example bash script):
```
python seal_predict.py --vcf_file <variant vcf file> --model_info_file <Seal model summary file> --out_file <Output model prediction file>
```

Arguments:
- `--vcf_file`: input VCF file (hg19-based coordinates; example)
- `--model_info_file`: input Seal model info file (contains pre-trained and fine-tuned model file locations and hidden layer info; example)
- `--out_file`: output result file of variant effect predictions (example)
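As a concrete illustration, a prediction run might look like this (the file names are hypothetical placeholders; substitute your own VCF and one of the model info files listed in the Notes below):

```
python seal_predict.py \
  --vcf_file my_variants.vcf \
  --model_info_file models/tissue_model_info.tsv \
  --out_file my_variants_predictions.tsv
```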
Notes:
- We provide three trained and evaluated Seal models for predicting gene expression during brain development at both tissue and cell-type resolution. The first model predicts variant effects on gene expression across 122 tissue states spanning 7 developmental stages, from early fetal to adulthood (model info file). The second model predicts variant effects on gene expression across 598 cell states of the early fetal stage (model info file). The third model predicts variant effects on gene expression across 82 cell states spanning 6 developmental stages, from mid-fetal to adulthood (model info file). For detailed information about the cell states, please check the annotation file.
- Our models were trained on sequence from the hg19 reference genome assembly. Users can use the UCSC LiftOver tool to convert coordinates from other assemblies to hg19. Alternatively, users can replace the input gene annotation BED file (`--gene_bed_file` argument) and the input reference genome FASTA file (`--ref_genome_file` argument) with files for the preferred assembly.
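For coordinate conversion, one possible route with the UCSC liftOver command-line tool is sketched below (file names are placeholders; note that liftOver operates on BED intervals, so VCF coordinates need to be converted to BED and back, or handled with a VCF-aware tool such as CrossMap):

```
# map hg38 intervals to hg19 using the UCSC chain file
liftOver variants_hg38.bed hg38ToHg19.over.chain.gz variants_hg19.bed unmapped.bed
```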
Model training (pre-training plus fine-tuning)

Command line (example bash script):
```
python seal_train.py --general_exp_file <general context expression file> --n_latent <number of hidden neurons> --lr_pretrain <pre-training learning rate> --l2_pretrain <pre-training L2 regularization factor> --general_context_group <general context group info file> --spec_gene_weight <weight assigned to specific genes> --finetune_exp_file <specific context expression file> --lr_finetune <fine-tuning learning rate> --l2_finetune <fine-tuning L2 regularization factor> --specific_context_group <specific context group info file> --out_name <output file location>
```

Arguments:
- `--general_exp_file`: expression matrix .tsv file of general contexts for pre-training. The first column contains the gene ID; the second to last columns contain normalized expression values. (example)
- `--n_latent`: number of hidden neurons for the Module 3 transfer learning neural network model of the Seal framework
- `--lr_pretrain`: learning rate in pre-training for the Module 3 transfer learning neural network model
- `--l2_pretrain`: L2 regularization factor in pre-training for the Module 3 transfer learning neural network model
- `--general_context_group`: group info .tsv file of general contexts for gene weighting of the neural network loss function. The first column contains the group name; the second column contains the column ID among the expression matrix columns. If provided, gene weights are assigned separately for each context group, based on the expression variation within each group. If not provided (default setting), gene weights are assigned based on the expression variation across all contexts (example).
- `--spec_gene_weight`: weight score assigned to genes with high expression variation (by default, a score of 1/spec_gene_weight is assigned to genes with low expression variation)
- `--finetune_exp_file`: expression matrix .tsv file of specific contexts for fine-tuning. Same format as `--general_exp_file`
- `--lr_finetune`: learning rate in fine-tuning for the Module 3 transfer learning neural network model
- `--l2_finetune`: L2 regularization factor in fine-tuning for the Module 3 transfer learning neural network model
- `--specific_context_group`: group info .tsv file of specific contexts for gene weighting of the neural network loss function. Same format as `--general_context_group`
- `--out_name`: path for output. All output files are named and stored based on the specified path.
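For instance, a full pre-training plus fine-tuning run might look like the following (all file names and hyperparameter values are hypothetical placeholders, not recommended settings):

```
python seal_train.py \
  --general_exp_file data/general_expression.tsv \
  --n_latent 100 \
  --lr_pretrain 1e-4 \
  --l2_pretrain 1e-5 \
  --general_context_group data/general_context_groups.tsv \
  --spec_gene_weight 5 \
  --finetune_exp_file data/fetal_expression.tsv \
  --lr_finetune 1e-5 \
  --l2_finetune 1e-5 \
  --specific_context_group data/fetal_context_groups.tsv \
  --out_name output/seal_fetal
```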
Model training from an existing pre-trained model

Command line (example bash script):
```
python seal_train.py --general_exp_file <general context expression file> --pretrained <True> --pretrained_name <pre-trained file location> --n_latent <number of hidden neurons> --spec_gene_weight <weight assigned to specific genes> --finetune_exp_file <specific context expression file> --lr_finetune <fine-tuning learning rate> --l2_finetune <fine-tuning L2 regularization factor> --specific_context_group <specific context group info file> --out_name <output file location>
```

Additional arguments:
- `--pretrained`: bool specifying whether a pre-trained model already exists (True in this case)
- `--pretrained_name`: path where the pre-trained files are stored. Same format as `--out_name`. Pre-trained files are loaded based on the specified path and our naming scheme.
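Continuing the hypothetical example above, a fine-tuning run that reuses the pre-trained files written under the earlier `--out_name` path might look like this (paths and values are again placeholders):

```
python seal_train.py \
  --general_exp_file data/general_expression.tsv \
  --pretrained True \
  --pretrained_name output/seal_fetal \
  --n_latent 100 \
  --spec_gene_weight 5 \
  --finetune_exp_file data/fetal_expression.tsv \
  --lr_finetune 1e-5 \
  --l2_finetune 1e-5 \
  --specific_context_group data/fetal_context_groups.tsv \
  --out_name output/seal_fetal_v2
```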
Model interpretation

Command line (example bash script):
```
python seal_interpret.py --vcf_file <variant vcf file> --model_info_file <Seal model summary file> --interpret_method <interpretation method> --outcome_id <outcome index> --out_file <output feature attribution file>
```

Arguments:
- `--vcf_file`: input VCF file (hg19-based coordinates; example)
- `--model_info_file`: input Seal model info file (contains pre-trained and fine-tuned model file locations and hidden layer info; example)
- `--interpret_method`: interpretation method to be implemented (method options: 'saliency', 'integratedGradients', 'deeplift', 'kernalShap', 'gradientShap', 'lime')
- `--outcome_id`: column index of the outcome to be interpreted
- `--out_file`: output feature attribution file (example)
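For example, attributing the first outcome column with integrated gradients might look like this (file names and the outcome index are hypothetical placeholders):

```
python seal_interpret.py \
  --vcf_file my_variants.vcf \
  --model_info_file models/tissue_model_info.tsv \
  --interpret_method integratedGradients \
  --outcome_id 0 \
  --out_file my_variants_attributions.tsv
```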
Please post in the GitHub issues or e-mail Yun Hao (yhao@flatironinstitute.org) with any questions about the repository, requests for more data, etc.