From c017a7da3b17dd52c9b19ab840e135fb9f6ef581 Mon Sep 17 00:00:00 2001 From: Danielle Pinto Date: Sun, 8 Mar 2026 21:37:47 -0400 Subject: [PATCH 1/3] add rough draft alignments --- cookbook/02-alignments.md | 75 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 75 insertions(+) create mode 100644 cookbook/02-alignments.md diff --git a/cookbook/02-alignments.md b/cookbook/02-alignments.md new file mode 100644 index 0000000..2241b55 --- /dev/null +++ b/cookbook/02-alignments.md @@ -0,0 +1,75 @@ ++++ +title = "Pairwise alignment" +rss_descr = "Align a gene against a reference genome using BioAlignments.jl" ++++ + +# Pairwise Alignment + +On the most basic level, aligners take two sequences and use algorithms to try and "line them up" +and look for regions of similarity. + +Pairwise alignment differs from multiple sequence alignment (MSA) because. +it only aligns two sequences, while MSA's align three or more. +In a pairwise alignment, there is one reference sequence, and one query sequence, +though this may not always be specified by the user. + + +### Running the Alignment +There are two main parameters for determining how we wish to perform our alignment: +the alignment type and score/cost model. + +The alignment type specifies the alignment range (is the alignment local or global?) +and the score/cost model explains how to score insertions and deletions. + +#### Alignment Types +Currently, four types of alignments are supported: +- GlobalAlignment: global-to-global alignment + - Aligns sequences end-to-end + - Best for sequences that are already very similar +- SemiGlobalAlignment: local-to-global alignment + - a modification of global alignment that allows the user to specify that gaps will be penalty-free at the beginning of one of the sequences and/or at the end of one of the sequences (more information can be found [here](https://www.cs.cmu.edu/~durand/03-711/2023/Lectures/20231001_semi-global.pdf)). +- LocalAlignment: local-to-local alignment + - Identifies high-similarity, conserved sub-regions within divergent sequences + - Can occur anywhere in the alignment matrix +- OverlapAlignment: end-free alignment + - a modification of global alignment where gaps at the beginning or end of sequences are permitted + +Alignment type can also be a distance of two sequences: +- EditDistance +- LevenshteinDistance +- HammingDistance + +The alignment type should be selected based on what is already known about the sequences the user is comparing +(Are they very similar and we're looking for a couple of small differences? +Are we expecting the query to be a nearly exact match within the reference?). +and what you may be optimizing for +(Speed for a quick and dirty analysis? +Or do we want to use more resources to do a fine-grained comparison?). + +Now that we have a good understanding of how `pairalign` works, + +```julia +res = pairalign(GlobalAlignment(), s1, s2, scoremodel) # run pairwise alignment + +``` + + +### Understanding how alignments are represented +The output of an alignment is a series of `AlignmentAnchor` objects. +This data structure gives information on the position of the start of the alignment, +sections where nucleotides match, as well as where there may be deletions or insertions. + +Below is an example Alignment: +```julia +julia> Alignment([ + AlignmentAnchor(0, 4, 0, OP_START), + AlignmentAnchor(4, 8, 4, OP_MATCH), + AlignmentAnchor(4, 12, 8, OP_DELETE) + ]) +``` +In this example, the alignment starts at the 0 position for the query sequence and 4th position for the reference sequence. +The next nucleotides are a match in the query and reference sequence. +The last 8 nucleotides in the alignment are missing/deleted in the query sequence. + +To understand more about the output of the alignment created using BioAlignments.jl, +more information can be found [here](https://biojulia.dev/BioAlignments.jl/stable/alignments/). \ No newline at end of file From 33dfbd41f389871e1866503be5304a938f3d4a51 Mon Sep 17 00:00:00 2001 From: Danielle Pinto Date: Mon, 9 Mar 2026 21:29:37 -0400 Subject: [PATCH 2/3] finish draft of BioAlignments package explanation --- cookbook/02-alignments.md | 90 ++++++++++++++++++++++++++------------- 1 file changed, 61 insertions(+), 29 deletions(-) diff --git a/cookbook/02-alignments.md b/cookbook/02-alignments.md index 2241b55..5d4863f 100644 --- a/cookbook/02-alignments.md +++ b/cookbook/02-alignments.md @@ -5,52 +5,83 @@ rss_descr = "Align a gene against a reference genome using BioAlignments.jl" # Pairwise Alignment -On the most basic level, aligners take two sequences and use algorithms to try and "line them up" +On the most basic level, aligners take two sequences and use algorithms to try to "line them up" and look for regions of similarity. -Pairwise alignment differs from multiple sequence alignment (MSA) because. -it only aligns two sequences, while MSA's align three or more. -In a pairwise alignment, there is one reference sequence, and one query sequence, +Pairwise alignment differs from multiple sequence alignment (MSA) because +it only aligns two sequences, while MSAs align three or more. +In a pairwise alignment, there is one reference sequence and one query sequence, though this may not always be specified by the user. ### Running the Alignment -There are two main parameters for determining how we wish to perform our alignment: +There are two main parameters for determining how we want to perform our alignment: the alignment type and score/cost model. -The alignment type specifies the alignment range (is the alignment local or global?) +The alignment type specifies the alignment range (local vs global alignment) and the score/cost model explains how to score insertions and deletions. #### Alignment Types Currently, four types of alignments are supported: -- GlobalAlignment: global-to-global alignment +- `GlobalAlignment`: global-to-global alignment - Aligns sequences end-to-end - Best for sequences that are already very similar -- SemiGlobalAlignment: local-to-global alignment - - a modification of global alignment that allows the user to specify that gaps will be penalty-free at the beginning of one of the sequences and/or at the end of one of the sequences (more information can be found [here](https://www.cs.cmu.edu/~durand/03-711/2023/Lectures/20231001_semi-global.pdf)). -- LocalAlignment: local-to-local alignment + - All of query is aligned to all of reference +- `SemiGlobalAlignment`: local-to-global alignment + - A modification of global alignment that allows the user to specify that gaps are penalty-free at the beginning of one of the sequences and/or at the end of one of the sequences (more information can be found [here](https://www.cs.cmu.edu/~durand/03-711/2023/Lectures/20231001_semi-global.pdf)). +- `LocalAlignment`: local-to-local alignment - Identifies high-similarity, conserved sub-regions within divergent sequences - Can occur anywhere in the alignment matrix -- OverlapAlignment: end-free alignment - - a modification of global alignment where gaps at the beginning or end of sequences are permitted + - Maps the query sequence to the most similar region on the reference +- `OverlapAlignment`: end-free alignment + - A modification of global alignment where gaps at the beginning or end of sequences are permitted -Alignment type can also be a distance of two sequences: -- EditDistance -- LevenshteinDistance -- HammingDistance +The alignment type should be selected based on what is already known about the sequences the user is comparing: +- Are the two sequences very similar and we're looking for a couple of small differences? +- Is the query expected to be a nearly exact match within the reference? +- Are we looking at two sequences from wildly divergent organisms? -The alignment type should be selected based on what is already known about the sequences the user is comparing -(Are they very similar and we're looking for a couple of small differences? -Are we expecting the query to be a nearly exact match within the reference?). -and what you may be optimizing for -(Speed for a quick and dirty analysis? -Or do we want to use more resources to do a fine-grained comparison?). -Now that we have a good understanding of how `pairalign` works, +### Cost Model + +The cost model provides a way to calculate penalties for differences between the two sequences, +and then finds the alignment that minimizes the total penalty. +`AffineGapScoreModel` is the scoring model currently supported by `BioAlignments.jl`. +It imposes an affine gap penalty for insertions and deletions, +which means that it penalizes the opening of a gap more than a gap extending. +This aligns (pun intended!!) with the biological principle that creating a gap is a rare event, +while extending an already existing gap is less so. + +A user can also define their own `CostModel` instead of using `AffineGapScoreModel`. +This will allow the user to define their own scoring scheme for penalizing insertions, deletions, and substitutions. + +After the cost model is defined, a distance metric is used to quantify and minimize the "cost" (difference) between the two sequences. + +These distance metrics are currently supported: +- `EditDistance` +- `LevenshteinDistance` +- `HammingDistance` + +This is a complicated topic, and more information can be found in the BioAlignments documentation about the cost model [here](https://biojulia.dev/BioAlignments.jl/stable/pairalign/). + +Just like alignment type, the cost model should be selected based on what the user is optimizing for +and what is known about the two sequences. + + +### Calling BioAlignments to Run the Alignment + +Now that we have a good understanding of how `pairalign` works, let's run an example! ```julia +using BioAlignments + +s1 = +s2 = +scoremodel = AffineGapScoreModel(EDNAFULL, gap_open=-5, gap_extend=-1); + res = pairalign(GlobalAlignment(), s1, s2, scoremodel) # run pairwise alignment +res ``` @@ -59,7 +90,7 @@ The output of an alignment is a series of `AlignmentAnchor` objects. This data structure gives information on the position of the start of the alignment, sections where nucleotides match, as well as where there may be deletions or insertions. -Below is an example Alignment: +Below is an example alignment: ```julia julia> Alignment([ AlignmentAnchor(0, 4, 0, OP_START), @@ -67,9 +98,10 @@ julia> Alignment([ AlignmentAnchor(4, 12, 8, OP_DELETE) ]) ``` -In this example, the alignment starts at the 0 position for the query sequence and 4th position for the reference sequence. -The next nucleotides are a match in the query and reference sequence. -The last 8 nucleotides in the alignment are missing/deleted in the query sequence. +In this example, the alignment starts at position 0 for the query sequence and position 4 for the reference sequence. +Although the Julia programming language typically uses 1-based indexing, +this package uses position 0 to refer to the first position. +The next nucleotides are a match in the query and reference sequences. +The last 8 nucleotides in the alignment are deleted in the query sequence. -To understand more about the output of the alignment created using BioAlignments.jl, -more information can be found [here](https://biojulia.dev/BioAlignments.jl/stable/alignments/). \ No newline at end of file +To learn more about the output of the alignment created using BioAlignments.jl, see [here](https://biojulia.dev/BioAlignments.jl/stable/alignments/). \ No newline at end of file From d3892a3c60147ebaef21408c464ec52d78bda6db Mon Sep 17 00:00:00 2001 From: Danielle Pinto Date: Wed, 11 Mar 2026 22:16:18 -0400 Subject: [PATCH 3/3] rebase onto main and make edits according to kevin's review --- cookbook/{sequences.md => 01-sequences.md} | 0 cookbook/02-alignments.md | 25 +++++++++++++++++----- 2 files changed, 20 insertions(+), 5 deletions(-) rename cookbook/{sequences.md => 01-sequences.md} (100%) diff --git a/cookbook/sequences.md b/cookbook/01-sequences.md similarity index 100% rename from cookbook/sequences.md rename to cookbook/01-sequences.md diff --git a/cookbook/02-alignments.md b/cookbook/02-alignments.md index 5d4863f..f42f7b4 100644 --- a/cookbook/02-alignments.md +++ b/cookbook/02-alignments.md @@ -10,9 +10,6 @@ and look for regions of similarity. Pairwise alignment differs from multiple sequence alignment (MSA) because it only aligns two sequences, while MSAs align three or more. -In a pairwise alignment, there is one reference sequence and one query sequence, -though this may not always be specified by the user. - ### Running the Alignment There are two main parameters for determining how we want to perform our alignment: @@ -27,14 +24,32 @@ Currently, four types of alignments are supported: - Aligns sequences end-to-end - Best for sequences that are already very similar - All of query is aligned to all of reference + - Example use case: + - Comparing a particular gene from two closely related bacteria + - Comparing alleles of a gene between two individuals + - Not ideal when only one conserved region is shared + - `SemiGlobalAlignment`: local-to-global alignment - A modification of global alignment that allows the user to specify that gaps are penalty-free at the beginning of one of the sequences and/or at the end of one of the sequences (more information can be found [here](https://www.cs.cmu.edu/~durand/03-711/2023/Lectures/20231001_semi-global.pdf)). + - Example use case: + - Aligning a contig to a chromosome to see where that contig belongs + - Aligning a 150 bp Illumina read to a longer reference gene or chromosome segment + - A simple way to think about it: “the query should align completely, but the reference may have unaligned flanks.” - `LocalAlignment`: local-to-local alignment - Identifies high-similarity, conserved sub-regions within divergent sequences - Can occur anywhere in the alignment matrix - Maps the query sequence to the most similar region on the reference + - Example use case: + - Finding a conserved protein domain inside two otherwise divergent proteins + - Aligning a short resistance-gene fragment to a genome to see whether that region is present + - This is the right choice when you care about “where is the best shared region?” rather than “do these two full sequences match end-to-end?” - `OverlapAlignment`: end-free alignment - A modification of global alignment where gaps at the beginning or end of sequences are permitted + - Best when the biologically meaningful match is an end-to-end overlap between the two sequences, and terminal overhangs should not be penalized + - Example use case: + - Merging paired-end reads when the forward and reverse reads overlap + - Stitching amplicons or long reads that share an overlapping end region + - The key distinction from semi-global is that overlap alignment is especially for suffix/prefix-style overlaps between sequence ends, which is why it is so useful in assembly workflows. The alignment type should be selected based on what is already known about the sequences the user is comparing: - Are the two sequences very similar and we're looking for a couple of small differences? @@ -49,8 +64,8 @@ and then finds the alignment that minimizes the total penalty. `AffineGapScoreModel` is the scoring model currently supported by `BioAlignments.jl`. It imposes an affine gap penalty for insertions and deletions, which means that it penalizes the opening of a gap more than a gap extending. -This aligns (pun intended!!) with the biological principle that creating a gap is a rare event, -while extending an already existing gap is less so. +Deletions are rare mutations, but if there's a deletion, the length of the deletion is variable. +Longer deletions are less likely than short ones only because they change the structure of the encoded protein more. A user can also define their own `CostModel` instead of using `AffineGapScoreModel`. This will allow the user to define their own scoring scheme for penalizing insertions, deletions, and substitutions.