DNase Accessible Region Preprocessing

Author: Jacob Mitchell
Date: 3/11/26

Overview

This project preprocesses DNase‑seq accessibility data for machine learning models.
The script reads a BED or narrowPeak file containing accessible
chromosomal regions and automatically determines an optimal sequence
length X that balances data retention and truncation.

Using this optimal value, the pipeline generates:

positive.txt -- DNA sequences from accessible regions
negative.txt -- DNA sequences sampled from inaccessible genomic gaps

These files can then be used as training inputs for an AI model that predicts chromatin accessibility.

How It Works

Input Data

The program expects a BED‑style file (such as ENCODE narrowPeak)
containing genomic coordinates.

Example format:

chr1    10000   10120   peak1   500   
chr1    10500   10720   peak2   450   
chr1    11000   11100   peak3   600

Only the first three columns are used:

Chromosome
Start position
End position

These define accessible genomic intervals.
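The parsing step can be illustrated with a minimal sketch. The function name `parse_bed_lines` and the choice to skip header lines and malformed rows are assumptions for illustration, not necessarily what preprocess_dnase.py does.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end)


def parse_bed_lines(lines: Iterable[str]) -> List[Interval]:
    """Return (chrom, start, end) tuples, ignoring any columns after the third."""
    intervals = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and common BED header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) < 3:
            continue  # malformed row: fewer than three columns
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if end > start:
            intervals.append((chrom, start, end))
    return intervals


example = """chr1    10000   10120   peak1   500
chr1    10500   10720   peak2   450
chr1    11000   11100   peak3   600"""

intervals = parse_bed_lines(example.splitlines())
lengths = [end - start for _, start, end in intervals]
print(intervals)
print(lengths)  # interval lengths: 120, 220, 100
```

BED coordinates are 0-based and half-open, so the length of an interval is simply end minus start.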

Optimal Sequence Length Selection

Accessible regions vary in length, but machine learning models require
sequences of equal length, so the program determines an optimal length X.

Instead of simply using the median length, the program evaluates
multiple candidate values and scores them based on:

Number of regions retained
Amount of truncation required
Overall information preserved

For each candidate X:

kept = number of intervals with length ≥ X
trim_loss = total bases removed when truncating longer intervals
score = (kept * X) − penalty * trim_loss

The X with the highest score is selected.

This scoring favors a length at which:

Most accessible regions are preserved
Excessive truncation is avoided
Training data size remains large
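The scoring rule above can be sketched in a few lines. This is an illustration under stated assumptions: the candidate set (here, the observed interval lengths) and the penalty value of 0.5 are hypothetical choices, and the function names are not taken from the script.

```python
from typing import Iterable, Sequence


def score_candidate(lengths: Sequence[int], x: int, penalty: float) -> float:
    """score = (kept * X) - penalty * trim_loss, as defined above."""
    kept = [n for n in lengths if n >= x]      # intervals with length >= X
    trim_loss = sum(n - x for n in kept)       # bases removed by truncation
    return len(kept) * x - penalty * trim_loss


def choose_optimal_x(lengths: Sequence[int], candidates: Iterable[int],
                     penalty: float = 0.5) -> int:
    """Return the candidate X with the highest score (first one wins on ties)."""
    return max(candidates, key=lambda x: score_candidate(lengths, x, penalty))


lengths = [120, 220, 100]
candidates = sorted(set(lengths))  # assumption: try each observed length
for x in candidates:
    print(x, score_candidate(lengths, x, penalty=0.5))
best_x = choose_optimal_x(lengths, candidates, penalty=0.5)
print("best X:", best_x)
```

For these three lengths, X = 100 keeps all three intervals (3 * 100 = 300) and loses 140 bases to truncation, giving 300 - 0.5 * 140 = 230, which beats X = 120 (190) and X = 220 (220). A larger penalty shifts the choice toward longer X values that truncate less.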

Positive Dataset Creation

Accessible intervals are processed as follows:

Regions shorter than X are discarded.
Regions longer than X are truncated to [start, start + X), using BED's 0-based, half-open coordinates so that each region spans exactly X bases.
The resulting BED file is converted to sequences using:

bedtools getfasta
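As a rough sketch of these steps, the snippet below filters and truncates intervals, writes them as BED, and builds a `bedtools getfasta` command using its `-fi`, `-bed`, and `-fo` options. The helper names and file paths are hypothetical; the actual script may organize this differently.

```python
import subprocess
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end)


def truncate_intervals(intervals: List[Interval], x: int) -> List[Interval]:
    """Drop intervals shorter than x; trim the rest to [start, start + x)."""
    return [(c, s, s + x) for c, s, e in intervals if e - s >= x]


def write_bed(intervals: List[Interval], path: str) -> None:
    """Write intervals as a three-column, tab-separated BED file."""
    with open(path, "w") as handle:
        for chrom, start, end in intervals:
            handle.write(f"{chrom}\t{start}\t{end}\n")


def getfasta_command(genome_fa: str, bed_path: str, out_fa: str) -> List[str]:
    """Build the bedtools getfasta invocation as an argument list."""
    return ["bedtools", "getfasta", "-fi", genome_fa, "-bed", bed_path, "-fo", out_fa]


def run_getfasta(genome_fa: str, bed_path: str, out_fa: str) -> None:
    """Run bedtools (must be installed and on PATH); raises on failure."""
    subprocess.run(getfasta_command(genome_fa, bed_path, out_fa), check=True)


intervals = [("chr1", 10000, 10120), ("chr1", 10500, 10720), ("chr1", 11000, 11100)]
trimmed = truncate_intervals(intervals, 100)
print(trimmed)
print(" ".join(getfasta_command("GRCh38.fa", "positive.bed", "positive.fa")))
```

Passing the command as a list rather than a shell string avoids quoting problems when paths contain spaces.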

Example positive.txt output:

ATGCGTACGTTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGC   
GCTAGCTAGCTAGCTAACGTTAGCGTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTA   
TTGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA

Properties:

One DNA sequence per line
All sequences are length X
Only characters A T G C
All sequences converted to uppercase
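One way to obtain these properties from a FASTA file is sketched below. It assumes that sequences containing other characters (such as N) or having the wrong length are dropped; whether the script drops or otherwise handles such sequences is not stated here, so treat that behavior as an assumption.

```python
from typing import Iterable, List


def fasta_to_sequences(fasta_lines: Iterable[str], x: int) -> List[str]:
    """Collect uppercase sequences of length x that contain only A, C, G, T."""
    sequences: List[str] = []
    current: List[str] = []  # lines of the record being read

    def flush() -> None:
        # Finish the current record and keep it only if it passes the checks.
        if current:
            seq = "".join(current).upper()
            if len(seq) == x and set(seq) <= set("ACGT"):
                sequences.append(seq)
            current.clear()

    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            flush()  # a header starts a new record
        elif line:
            current.append(line)
    flush()  # do not forget the last record
    return sequences


fasta = [
    ">chr1:0-8", "acgtACGT",      # soft-masked bases are uppercased
    ">chr1:8-16", "ACGTNNNN",     # contains N: dropped (assumption)
    ">chr1:16-20", "ACGT",        # wrong length: dropped
]
cleaned = fasta_to_sequences(fasta, x=8)
print(cleaned)
```

Writing `"\n".join(cleaned)` to a file would then give one sequence per line, matching the layout of positive.txt.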

Negative Dataset Creation

Negative samples are drawn from gaps between accessible regions.

Steps:

Identify genomic gaps between merged accessible intervals.
From each gap, sample windows of length X.
Sampling occurs near the middle of the gap to avoid edges close to accessible regions.
Negative samples are generated so that:

|negative| ≈ |positive|

(within roughly 5-10%, or within a given tolerance value)
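The sampling steps can be sketched as follows. This is a simplified illustration: it takes at most one centered window per gap, considers only gaps between intervals on the same chromosome, and stops once a target count is reached. The function names and these simplifications are assumptions, not a description of the script's exact sampling scheme.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end)


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals on the same chromosome."""
    merged: List[Interval] = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def gaps_between(merged: List[Interval]) -> List[Interval]:
    """Return the gaps between consecutive merged intervals."""
    gaps = []
    for (c1, _, e1), (c2, s2, _) in zip(merged, merged[1:]):
        if c1 == c2 and s2 > e1:
            gaps.append((c1, e1, s2))
    return gaps


def centered_windows(gaps: List[Interval], x: int, target: int) -> List[Interval]:
    """Take one window of length x from the middle of each sufficiently long gap."""
    windows: List[Interval] = []
    for chrom, start, end in gaps:
        if len(windows) >= target:
            break
        if end - start >= x:
            left = start + (end - start - x) // 2  # center the window in the gap
            windows.append((chrom, left, left + x))
    return windows


positives = [("chr1", 10000, 10120), ("chr1", 10500, 10720), ("chr1", 11000, 11100)]
gaps = gaps_between(merge_intervals(positives))
negatives = centered_windows(gaps, x=100, target=len(positives))
print(gaps)
print(negatives)
```

With only two gaps available, this toy input yields two negatives for three positives. Taking several non-overlapping windows per gap would be one way to bring the counts within the tolerance.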

Example negative.txt:

CGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA    
TTTGGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG    
AACCGGTTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG

Requirements

Python 3
bedtools
Reference genome FASTA (e.g., GRCh38)

Example:

GRCh38.fa

Running the Script

Example command:

python preprocess_dnase.py \
--bed experiment1.bed \
--genome /path/to/GRCh38.fa \
--outdir results

Parameters:

Parameter      Description
--bed          Input BED or narrowPeak file
--genome       Reference genome FASTA
--outdir       Output directory
--min_x        Minimum allowed X length
--neg_ratio    Ratio of negative to positive sequences
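For orientation, a command-line interface with these parameters could be declared with `argparse` roughly as shown below. The flag names come from the table above; the types, the required/optional split, and the default values (50 and 1.0) are placeholders chosen for illustration and are not the script's documented defaults.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Preprocess DNase-seq accessible regions."
    )
    parser.add_argument("--bed", required=True, help="Input BED or narrowPeak file")
    parser.add_argument("--genome", required=True, help="Reference genome FASTA")
    parser.add_argument("--outdir", required=True, help="Output directory")
    # Placeholder defaults below: assumptions, not the script's real values.
    parser.add_argument("--min_x", type=int, default=50,
                        help="Minimum allowed X length")
    parser.add_argument("--neg_ratio", type=float, default=1.0,
                        help="Ratio of negative to positive sequences")
    return parser


# Parse the example command from above instead of reading sys.argv.
args = build_parser().parse_args([
    "--bed", "experiment1.bed",
    "--genome", "/path/to/GRCh38.fa",
    "--outdir", "results",
])
print(args.bed, args.genome, args.outdir, args.min_x, args.neg_ratio)
```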

Output Files

The program produces:

results/
├── positive.bed
├── negative.bed
├── positive.fa
├── negative.fa
├── positive.txt
└── negative.txt

Descriptions:

File            Purpose
positive.bed    Accessible regions trimmed to X
negative.bed    Sampled inaccessible windows
positive.fa     FASTA sequences of positives
negative.fa     FASTA sequences of negatives
positive.txt    Training sequences (accessible)
negative.txt    Training sequences (inaccessible)

Data Validation

After execution, the following checks should hold:

wc -l positive.txt negative.txt

Counts should be approximately equal.

Check sequence length consistency:

awk '{print length($0)}' positive.txt | sort -u

This should return a single value equal to X.
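The same checks can be expressed in Python, which is convenient inside a test suite. This is a sketch: the helper names are hypothetical, and the 10% default tolerance is an assumption taken from the upper end of the range mentioned earlier.

```python
from typing import List, Sequence, Set


def read_sequences(path: str) -> List[str]:
    """Read one sequence per line, skipping blank lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def distinct_lengths(sequences: Sequence[str]) -> Set[int]:
    """Return the set of sequence lengths; it should contain exactly one value, X."""
    return {len(seq) for seq in sequences}


def counts_balanced(n_pos: int, n_neg: int, tolerance: float = 0.10) -> bool:
    """True if the two counts differ by at most `tolerance` of the larger count."""
    larger = max(n_pos, n_neg)
    if larger == 0:
        return True
    return abs(n_pos - n_neg) / larger <= tolerance


# In practice: positives = read_sequences("results/positive.txt"), and so on.
positives = ["ACGT", "TTGA", "CCGA"]
negatives = ["GGCA", "ATAT", "CGCG"]

print(distinct_lengths(positives), distinct_lengths(negatives))
print(counts_balanced(len(positives), len(negatives)))
print(counts_balanced(100, 80))  # 20% difference exceeds the 10% tolerance
```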

Summary

This preprocessing pipeline converts DNase accessibility data into
balanced, fixed‑length DNA sequence datasets suitable for machine learning.

Key features:

Automatic optimal sequence length selection
Balanced positive and negative datasets
Clean DNA sequence output
Compatible with standard genomic tools such as bedtools
