diff --git a/cookbook/sequences.md b/cookbook/01-sequences.md similarity index 100% rename from cookbook/sequences.md rename to cookbook/01-sequences.md diff --git a/cookbook/02-alignments.md b/cookbook/02-alignments.md new file mode 100644 index 0000000..f42f7b4 --- /dev/null +++ b/cookbook/02-alignments.md @@ -0,0 +1,122 @@ ++++ +title = "Pairwise alignment" +rss_descr = "Align a gene against a reference genome using BioAlignments.jl" ++++ + +# Pairwise Alignment + +On the most basic level, aligners take two sequences and use algorithms to try to "line them up" +and look for regions of similarity. + +Pairwise alignment differs from multiple sequence alignment (MSA) because +it only aligns two sequences, while MSAs align three or more. + +### Running the Alignment +There are two main parameters for determining how we want to perform our alignment: +the alignment type and score/cost model. + +The alignment type specifies the alignment range (local vs global alignment) +and the score/cost model explains how to score insertions and deletions. + +#### Alignment Types +Currently, four types of alignments are supported: +- `GlobalAlignment`: global-to-global alignment + - Aligns sequences end-to-end + - Best for sequences that are already very similar + - All of query is aligned to all of reference + - Example use case: + - Comparing a particular gene from two closely related bacteria + - Comparing alleles of a gene between two individuals + - Not ideal when only one conserved region is shared + +- `SemiGlobalAlignment`: local-to-global alignment + - A modification of global alignment that allows the user to specify that gaps are penalty-free at the beginning of one of the sequences and/or at the end of one of the sequences (more information can be found [here](https://www.cs.cmu.edu/~durand/03-711/2023/Lectures/20231001_semi-global.pdf)). + - Example use case: + - Aligning a contig to a chromosome to see where that contig belongs + - Aligning a 150 bp Illumina read to a longer reference gene or chromosome segment + - A simple way to think about it: “the query should align completely, but the reference may have unaligned flanks.” +- `LocalAlignment`: local-to-local alignment + - Identifies high-similarity, conserved sub-regions within divergent sequences + - Can occur anywhere in the alignment matrix + - Maps the query sequence to the most similar region on the reference + - Example use case: + - Finding a conserved protein domain inside two otherwise divergent proteins + - Aligning a short resistance-gene fragment to a genome to see whether that region is present + - This is the right choice when you care about “where is the best shared region?” rather than “do these two full sequences match end-to-end?” +- `OverlapAlignment`: end-free alignment + - A modification of global alignment where gaps at the beginning or end of sequences are permitted + - Best when the biologically meaningful match is an end-to-end overlap between the two sequences, and terminal overhangs should not be penalized + - Example use case: + - Merging paired-end reads when the forward and reverse reads overlap + - Stitching amplicons or long reads that share an overlapping end region + - The key distinction from semi-global is that overlap alignment is especially for suffix/prefix-style overlaps between sequence ends, which is why it is so useful in assembly workflows. + +The alignment type should be selected based on what is already known about the sequences the user is comparing: +- Are the two sequences very similar and we're looking for a couple of small differences? +- Is the query expected to be a nearly exact match within the reference? +- Are we looking at two sequences from wildly divergent organisms? + + +### Cost Model + +The cost model provides a way to calculate penalties for differences between the two sequences, +and then finds the alignment that minimizes the total penalty. +`AffineGapScoreModel` is the scoring model currently supported by `BioAlignments.jl`. +It imposes an affine gap penalty for insertions and deletions, +which means that it penalizes the opening of a gap more than a gap extending. +Deletions are rare mutations, but if there's a deletion, the length of the deletion is variable. +Longer deletions are less likely than short ones only because they change the structure of the encoded protein more. + +A user can also define their own `CostModel` instead of using `AffineGapScoreModel`. +This will allow the user to define their own scoring scheme for penalizing insertions, deletions, and substitutions. + +After the cost model is defined, a distance metric is used to quantify and minimize the "cost" (difference) between the two sequences. + +These distance metrics are currently supported: +- `EditDistance` +- `LevenshteinDistance` +- `HammingDistance` + +This is a complicated topic, and more information can be found in the BioAlignments documentation about the cost model [here](https://biojulia.dev/BioAlignments.jl/stable/pairalign/). + +Just like alignment type, the cost model should be selected based on what the user is optimizing for +and what is known about the two sequences. + + +### Calling BioAlignments to Run the Alignment + +Now that we have a good understanding of how `pairalign` works, let's run an example! + +```julia +using BioAlignments + +s1 = +s2 = +scoremodel = AffineGapScoreModel(EDNAFULL, gap_open=-5, gap_extend=-1); + +res = pairalign(GlobalAlignment(), s1, s2, scoremodel) # run pairwise alignment + +res +``` + + +### Understanding how alignments are represented +The output of an alignment is a series of `AlignmentAnchor` objects. +This data structure gives information on the position of the start of the alignment, +sections where nucleotides match, as well as where there may be deletions or insertions. + +Below is an example alignment: +```julia +julia> Alignment([ + AlignmentAnchor(0, 4, 0, OP_START), + AlignmentAnchor(4, 8, 4, OP_MATCH), + AlignmentAnchor(4, 12, 8, OP_DELETE) + ]) +``` +In this example, the alignment starts at position 0 for the query sequence and position 4 for the reference sequence. +Although the Julia programming language typically uses 1-based indexing, +this package uses position 0 to refer to the first position. +The next nucleotides are a match in the query and reference sequences. +The last 8 nucleotides in the alignment are deleted in the query sequence. + +To learn more about the output of the alignment created using BioAlignments.jl, see [here](https://biojulia.dev/BioAlignments.jl/stable/alignments/). \ No newline at end of file