Author: Jacob Mitchell
Date: 3/11/26
This project preprocesses DNase‑seq accessibility data for machine learning models.
The script reads a BED or narrowPeak file containing accessible
chromosomal regions and automatically determines an optimal sequence
length X that balances data retention and truncation.
Using this optimal value, the pipeline generates:
- positive.txt -- DNA sequences from accessible regions
- negative.txt -- DNA sequences sampled from inaccessible genomic gaps
These files can then be used as training inputs for an AI model that predicts chromatin accessibility.
The program expects a BED‑style file (such as ENCODE narrowPeak)
containing genomic coordinates.
Example format:
chr1 10000 10120 peak1 500
chr1 10500 10720 peak2 450
chr1 11000 11100 peak3 600
Only the first three columns are used:
- Chromosome
- Start position
- End position
These define accessible genomic intervals.
Accessible regions vary in length. Machine learning models require
sequences of equal length, so the program determines an optimal length
X
Instead of simply using the median length, the program evaluates
multiple candidate values and scores them based on:
- Number of regions retained
- Amount of truncation required
- Overall information preserved
For each candidate X:
- kept = number of intervals with length ≥ X
- trim_loss = total bases removed when truncating longer intervals
- score = (kept * X) − penalty * trim_loss
The X with the highest score is selected.
This ensures:
- Most accessible regions are preserved
- Excessive truncation is avoided
- Training data size remains large
Accessible intervals are processed as follows:
- Regions shorter than X are discarded.
- Regions longer than X are truncated to
[start, start + X]. - The resulting BED file is converted to sequences using:
bedtools getfasta
Example positive.txt output:
ATGCGTACGTTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGC
GCTAGCTAGCTAGCTAACGTTAGCGTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTA
TTGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA
Properties:
- One DNA sequence per line
- All sequences are length X
- Only characters A T G C
- All sequences converted to uppercase
Negative samples are drawn from gaps between accessible regions
Steps:
- Identify genomic gaps between merged accessible intervals.
- From each gap, sample windows of length X.
- Sampling occurs near the middle of the gap to avoid edges close to accessible regions.
- Negative samples are generated so that:
|negative| ≈ |positive|
(within ~5--10% difference, or a given tolerance value)
Example negative.txt:
CGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA
TTTGGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
AACCGGTTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG
- Python 3
- bedtools
- Reference genome FASTA (e.g., GRCh38)
Example:
GRCh38.fa
Example command:
python preprocess_dnase.py \
--bed experiment1.bed \
--genome /path/to/GRCh38.fa \
--outdir results
Parameter Description
--bed Input BED or narrowPeak file
--genome Reference genome FASTA
--outdir Output directory
--min_x Minimum allowed X length
--neg_ratio Ratio of negative to positive sequences
The program produces:
results/
├── positive.bed
├── negative.bed
├── positive.fa
├── negative.fa
├── positive.txt
└── negative.txt
File Purpose
positive.bed Accessible regions trimmed to X
negative.bed Sampled inaccessible windows
positive.fa FASTA sequences of positives
negative.fa FASTA sequences of negatives
positive.txt Training sequences (accessible)
negative.txt Training sequences (inaccessible)
After execution, the following checks should hold:
wc -l positive.txt negative.txt
Counts should be approximately equal.
Check sequence length consistency:
awk '{print length($0)}' positive.txt | sort -u
Should return a single value = X.
This preprocessing pipeline converts DNase accessibility data into
balanced, fixed‑length DNA sequence datasets suitable for machine learning.
Key features:
- Automatic optimal sequence length selection
- Balanced positive and negative datasets
- Clean DNA sequence output
- Compatible with standard genomic tools such as bedtools