Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
File renamed without changes.
122 changes: 122 additions & 0 deletions cookbook/02-alignments.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,122 @@
+++
title = "Pairwise alignment"
rss_descr = "Align a gene against a reference genome using BioAlignments.jl"
+++

# Pairwise Alignment

On the most basic level, aligners take two sequences and use algorithms to try to "line them up"
and look for regions of similarity.

Pairwise alignment differs from multiple sequence alignment (MSA) because
it only aligns two sequences, while MSAs align three or more.

### Running the Alignment
There are two main parameters for determining how we want to perform our alignment:
the alignment type and score/cost model.

The alignment type specifies the alignment range (local vs global alignment)
and the score/cost model explains how to score insertions and deletions.

#### Alignment Types
Currently, four types of alignments are supported:
- `GlobalAlignment`: global-to-global alignment
- Aligns sequences end-to-end
- Best for sequences that are already very similar
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be nice to give examples here - an example here might be you have a particular gene from two closely related bacteria, or you're comparing alleles of a gene between two individuals.

- All of query is aligned to all of reference
- Example use case:
- Comparing a particular gene from two closely related bacteria
- Comparing alleles of a gene between two individuals
- Not ideal when only one conserved region is shared

- `SemiGlobalAlignment`: local-to-global alignment
- A modification of global alignment that allows the user to specify that gaps are penalty-free at the beginning of one of the sequences and/or at the end of one of the sequences (more information can be found [here](https://www.cs.cmu.edu/~durand/03-711/2023/Lectures/20231001_semi-global.pdf)).
- Example use case:
- Aligning a contig to a chromosome to see where that contig belongs
- Aligning a 150 bp Illumina read to a longer reference gene or chromosome segment
- A simple way to think about it: “the query should align completely, but the reference may have unaligned flanks.”
- `LocalAlignment`: local-to-local alignment
- Identifies high-similarity, conserved sub-regions within divergent sequences
- Can occur anywhere in the alignment matrix
- Maps the query sequence to the most similar region on the reference
- Example use case:
- Finding a conserved protein domain inside two otherwise divergent proteins
- Aligning a short resistance-gene fragment to a genome to see whether that region is present
- This is the right choice when you care about “where is the best shared region?” rather than “do these two full sequences match end-to-end?”
- `OverlapAlignment`: end-free alignment
- A modification of global alignment where gaps at the beginning or end of sequences are permitted
- Best when the biologically meaningful match is an end-to-end overlap between the two sequences, and terminal overhangs should not be penalized
- Example use case:
- Merging paired-end reads when the forward and reverse reads overlap
- Stitching amplicons or long reads that share an overlapping end region
- The key distinction from semi-global is that overlap alignment is especially for suffix/prefix-style overlaps between sequence ends, which is why it is so useful in assembly workflows.

The alignment type should be selected based on what is already known about the sequences the user is comparing:
- Are the two sequences very similar and we're looking for a couple of small differences?
- Is the query expected to be a nearly exact match within the reference?
- Are we looking at two sequences from wildly divergent organisms?


### Cost Model

The cost model provides a way to calculate penalties for differences between the two sequences,
and then finds the alignment that minimizes the total penalty.
`AffineGapScoreModel` is the scoring model currently supported by `BioAlignments.jl`.
It imposes an affine gap penalty for insertions and deletions,
which means that it penalizes the opening of a gap more than a gap extending.
Deletions are rare mutations, but if there's a deletion, the length of the deletion is variable.
Longer deletions are less likely than short ones only because they change the structure of the encoded protein more.

A user can also define their own `CostModel` instead of using `AffineGapScoreModel`.
This will allow the user to define their own scoring scheme for penalizing insertions, deletions, and substitutions.

After the cost model is defined, a distance metric is used to quantify and minimize the "cost" (difference) between the two sequences.

These distance metrics are currently supported:
- `EditDistance`
- `LevenshteinDistance`
- `HammingDistance`

This is a complicated topic, and more information can be found in the BioAlignments documentation about the cost model [here](https://biojulia.dev/BioAlignments.jl/stable/pairalign/).

Just like alignment type, the cost model should be selected based on what the user is optimizing for
and what is known about the two sequences.


### Calling BioAlignments to Run the Alignment

Now that we have a good understanding of how `pairalign` works, let's run an example!

```julia
using BioAlignments

s1 =
s2 =
scoremodel = AffineGapScoreModel(EDNAFULL, gap_open=-5, gap_extend=-1);

res = pairalign(GlobalAlignment(), s1, s2, scoremodel) # run pairwise alignment

res
```


### Understanding how alignments are represented
The output of an alignment is a series of `AlignmentAnchor` objects.
This data structure gives information on the position of the start of the alignment,
sections where nucleotides match, as well as where there may be deletions or insertions.

Below is an example alignment:
```julia
julia> Alignment([
AlignmentAnchor(0, 4, 0, OP_START),
AlignmentAnchor(4, 8, 4, OP_MATCH),
AlignmentAnchor(4, 12, 8, OP_DELETE)
])
```
In this example, the alignment starts at position 0 for the query sequence and position 4 for the reference sequence.
Although the Julia programming language typically uses 1-based indexing,
this package uses position 0 to refer to the first position.
The next nucleotides are a match in the query and reference sequences.
The last 8 nucleotides in the alignment are deleted in the query sequence.

To learn more about the output of the alignment created using BioAlignments.jl, see [here](https://biojulia.dev/BioAlignments.jl/stable/alignments/).
Loading