From faf2890a9b9d4f44c7f955a46e4a3df13ab48184 Mon Sep 17 00:00:00 2001
From: Danielle Pinto <danielle.peterson101@gmail.com>
Date: Wed, 4 Mar 2026 22:04:20 -0500
Subject: [PATCH 1/4] add initial draft

---
 cookbook/assets/mecA.fasta | 12 +++++
 cookbook/index.md          | 15 ++++++
 cookbook/sequences.md      | 99 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 126 insertions(+)
 create mode 100644 cookbook/assets/mecA.fasta
 create mode 100644 cookbook/index.md
 create mode 100644 cookbook/sequences.md

diff --git a/cookbook/assets/mecA.fasta b/cookbook/assets/mecA.fasta
new file mode 100644
index 0000000..5edd071
--- /dev/null
+++ b/cookbook/assets/mecA.fasta
@@ -0,0 +1,12 @@
+>NC_007795.1:907598-908317 SAOUHSC_00935 [organism=Staphylococcus aureus subsp. aureus NCTC 8325] [GeneID=3920764] [chromosome=]
+ATGAGAATAGAACGAGTAGATGATACAACTGTAAAATTGTTTATAACATATAGCGATATCGAGGCCCGTG
+GATTTAGTCGTGAAGATTTATGGACAAATCGCAAACGTGGCGAAGAATTCTTTTGGTCAATGATGGATGA
+AATTAACGAAGAAGAAGATTTTGTTGTAGAAGGTCCATTATGGATTCAAGTACATGCCTTTGAAAAAGGT
+GTCGAAGTCACAATTTCTAAATCTAAAAATGAAGATATGATGAATATGTCTGATGATGATGCAACTGATC
+AATTTGATGAACAAGTTCAAGAATTGTTAGCTCAAACATTAGAAGGTGAAGATCAATTAGAAGAATTATT
+CGAGCAACGAACAAAAGAAAAAGAAGCTCAAGGTTCTAAACGTCAAAAGTCTTCAGCACGTAAAAATACA
+AGAACAATCATTGTGAAATTTAACGATTTAGAAGATGTTATTAATTATGCATATCATAGCAATCCAATAA
+CTACAGAGTTTGAAGATTTGTTATATATGGTTGATGGTACTTATTATTATGCTGTATATTTTGATAGTCA
+TGTTGATCAAGAAGTCATTAATGATAGTTACAGTCAATTGCTTGAATTTGCTTATCCAACAGACAGAACA
+GAAGTTTATTTAAATGACTATGCTAAAATAATTATGAGTCATAACGTAACAGCTCAAGTTCGACGTTATT
+TTCCAGAGACAACTGAATAA
diff --git a/cookbook/index.md b/cookbook/index.md
new file mode 100644
index 0000000..45ebf95
--- /dev/null
+++ b/cookbook/index.md
@@ -0,0 +1,15 @@
++++
+using Dates
+date = Date("2026-03-04")
+title = "BioJulia Cookbook"
+rss_descr = "Recipes for performing basic bioinformatics in julia"
++++
+
+# BioJulia Cookbook
+
+This cookbook will provide a series of "recipes" that will help get started quickly with BioJulia so you can doing some bioinformatics!
+
+We have tutorials for reading in files, performing alignments, and using tools such as BLAST,    
+as well as links to more documentation about specific BioJulia packages.
+
+{{list_dir cookbook}}
diff --git a/cookbook/sequences.md b/cookbook/sequences.md
new file mode 100644
index 0000000..b095afd
--- /dev/null
+++ b/cookbook/sequences.md
@@ -0,0 +1,99 @@
++++
+title = "Sequence Input/Output"
+rss_descr = "Reading in fasta files using FASTX.jl"
++++
+
+# Sequence Input/Output
+
+In this chapter, we'll talk about how to read in sequence files using the `FASTX.jl` module.   
+More information about the `FASTX.jl` package can be found at https://biojulia.dev/FASTX.jl/stable/  
+and with the built-in documentation. 
+
+```julia
+julia> using FASTX
+julia> ?FASTX
+```
+
+If FASTX is not already in your environment,   
+it can be easily added from the Julia Registry.
+
+To demonstrate how to this package,   
+let's try to read in some real-world data!   
+
+The `FASTX` can read in 3 file types: fasta, fastq, and fai.
+
+### FASTA files
+FASTA files are text files containing biological sequence data.       
+They have three parts: name, description, and sequence.
+
+The template of a sequence record is:
+```
+>{description}
+{sequence}
+```
+
+### FASTQ files
+FASTQ files are also text-based files that contain sequences, along with a name and description. However, they also store sequence quality information (the Q is for quality!).
+
+The template of a sequence record is:
+```
+@{description}
+{sequence}
++{description}?
+{qualities}
+```
+
+### FAI files
+
+FAI (FASTA index) files are used in conjuction with FASTA/FASTQ files.   
+They are text files with TAB-delimited columns.  
+Each line contains information about each region sequence within the FASTA/FASTQ.     
+More information about fai index files can be found [here](https://www.htslib.org/doc/faidx.html).
+
+```
+NAME	Name of this reference sequence
+LENGTH	Total length of this reference sequence, in bases
+OFFSET	Offset in the FASTA/FASTQ file of this sequence's first base
+LINEBASES	The number of bases on each line
+LINEWIDTH	The number of bytes in each line, including the newline
+QUALOFFSET	Offset of sequence's first quality within the FASTQ file
+
+```
+
+
+
+
+
+We will read in a fasta file containing the _mecA_ gene.   
+This gene was taken from NCBI [here](https://www.ncbi.nlm.nih.gov/gene?Db=gene&Cmd=DetailsSearch&Term=3920764#).
+
+First we'll open the file, 
+then we'll iterate over every record in the file and  
+print out the sequence identifier, the sequence description and then the corresponding sequence.
+
+```julia
+julia> FASTAReader(open("assets/mecA.fasta")) do reader
+           for record in reader
+               println(identifier(record))
+               println(description(record))
+               println(sequence(record))
+           end
+       end
+
+NC_007795.1:907598-908317
+NC_007795.1:907598-908317 SAOUHSC_00935 [organism=Staphylococcus aureus subsp. aureus NCTC 8325] [GeneID=3920764] [chromosome=]
+ATGAGAATAGAACGAGTAGATGATACAACTGTAAAATTGTTTATAACATATAGCGATATCGAGGCCCGTGGATTTAGTCGTGAAGATTTATGGACAAATCGCAAACGTGGCGAAGAATTCTTTTGGTCAATGATGGATGAAATTAACGAAGAAGAAGATTTTGTTGTAGAAGGTCCATTATGGATTCAAGTACATGCCTTTGAAAAAGGTGTCGAAGTCACAATTTCTAAATCTAAAAATGAAGATATGATGAATATGTCTGATGATGATGCAACTGATCAATTTGATGAACAAGTTCAAGAATTGTTAGCTCAAACATTAGAAGGTGAAGATCAATTAGAAGAATTATTCGAGCAACGAACAAAAGAAAAAGAAGCTCAAGGTTCTAAACGTCAAAAGTCTTCAGCACGTAAAAATACAAGAACAATCATTGTGAAATTTAACGATTTAGAAGATGTTATTAATTATGCATATCATAGCAATCCAATAACTACAGAGTTTGAAGATTTGTTATATATGGTTGATGGTACTTATTATTATGCTGTATATTTTGATAGTCATGTTGATCAAGAAGTCATTAATGATAGTTACAGTCAATTGCTTGAATTTGCTTATCCAACAGACAGAACAGAAGTTTATTTAAATGACTATGCTAAAATAATTATGAGTCATAACGTAACAGCTCAAGTTCGACGTTATTTTCCAGAGACAACTGAATAA
+```
+
+In this case, there is only one sequence.
+
+Let's try reading in a larger fastq file. 
+
+```julia
+
+
+
+```
+
+
+

From 38c2d71276d1fc89429e1a48ddd830725c9cc26a Mon Sep 17 00:00:00 2001
From: Danielle Pinto <danielle.peterson101@gmail.com>
Date: Thu, 5 Mar 2026 22:07:29 -0500
Subject: [PATCH 2/4] add sequence tutorial

---
 cookbook/sequences.md | 76 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 75 insertions(+), 1 deletion(-)

diff --git a/cookbook/sequences.md b/cookbook/sequences.md
index b095afd..533ed58 100644
--- a/cookbook/sequences.md
+++ b/cookbook/sequences.md
@@ -33,7 +33,8 @@ The template of a sequence record is:
 ```
 
 ### FASTQ files
-FASTQ files are also text-based files that contain sequences, along with a name and description. However, they also store sequence quality information (the Q is for quality!).
+FASTQ files are also text-based files that contain sequences, along with a name and description.   
+However, they also store sequence quality information (the Q is for quality!).
 
 The template of a sequence record is:
 ```
@@ -43,6 +44,14 @@ The template of a sequence record is:
 {qualities}
 ```
 
+Here is an example record:
+```
+@FSRRS4401BE7HA
+tcagTTAAGATGGGAT
++
+###EEEEEEEEE##E#
+```
+
 ### FAI files
 
 FAI (FASTA index) files are used in conjuction with FASTA/FASTQ files.   
@@ -89,11 +98,76 @@ In this case, there is only one sequence.
 
 Let's try reading in a larger fastq file. 
 
+The raw reads for a Staphylococcus aureus isolate was sequenced with Pac-Bio and uploaded to NCBI [here](https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR12147540).   
+The link to download the raw fastq's can be found [here](https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR12147540).
+
+The biosample ID for this sample is `SAMN14830786`.
+This ID refers to the physical bacterial isolate.  
+
+The SRA sample accession number (an internal value used within Sequence Read Archive) is `SRS6947643`.    
+
+Both values correspond to one another and are helpful identifiers. 
+
+The SRR (sample run accession number) is the unique identifier within SRA   
+and corresponds to the specific sequencing run within SRA. 
+
+In a later tutorial, we will discuss how we can download this file within julia with the SRR.
+
+But for now, the file can be downloaded using curl 
+
+
+```
+curl -L --retry 5 --retry-delay 2 \
+  "https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=SRR12147540" \
+  | gzip -c > SRR12147540.fastq.gz
+```
+This file is gzipped, so we'll need to account for that as we are reading it in.
+
+Instead of printing out every record (this isn't super practical because it's a big file), let's save all the records into a vector.
+
 ```julia
+using CodecZlib
 
+records = []
 
+FASTQReader(GzipDecompressorStream(open("assets/SRR12147540.fastq.gz"))) do reader
+           for record in reader
+               push!(records, record)
+           end
+       end
+```
+
+We can see how many reads there are by looking at the length of `records`.
 
+```julia
+julia> length(records)
+163528
 ```
 
+Let's take a look at what the first 10 reads look like:
+
+```
+julia> records[1:10]
+10-element Vector{Any}:
+ FASTX.FASTQ.Record("SRR12147540.1  length=43", "TTTTTTTTCCTTTCTTTCT…", "$$$$$$$$$$$$$$$$$$$…")
+ FASTX.FASTQ.Record("SRR12147540.2  length=40", "TTTGTTTTTTTTTGTTTTC…", "$$$$$$$$$$$$$$$$$$$…")
+ FASTX.FASTQ.Record("SRR12147540.3  length=32", "TTTTTTTTTGTTCTTTGGT…", "$$$$$$$$$$$$$$$$$$$…")
+ FASTX.FASTQ.Record("SRR12147540.4  length=40", "TTTTCTTTTCCTTCTTTTC…", "$$$$$$$$$$$$$$$$$$$…")
+ FASTX.FASTQ.Record("SRR12147540.5  length=55", "TGTTGTTTGTGTCTCGTTT…", "$$$$$$$$$$$$$$$$$$$…")
+ FASTX.FASTQ.Record("SRR12147540.6  length=166", "TTTCCTTTTTTTTCCTCTC…", "$$$$$$$$$$$$$$$$$$$…")
+ FASTX.FASTQ.Record("SRR12147540.7  length=338", "TGACCACCTTAGAACTTGG…", "$$$$$$$$$$$$$$$$$$$…")
+ FASTX.FASTQ.Record("SRR12147540.8  length=245", "ACGCCGCGGCCAAAGAACG…", "$$$$$$$$$$$$$$$$$$$…")
+ FASTX.FASTQ.Record("SRR12147540.9  length=157", "TTTGTTTGCGCGGTCTCTT…", "$$$$$$$$$$$$$$$$$$$…")
+ FASTX.FASTQ.Record("SRR12147540.10  length=100", "TTCTTTCGCCTTTTTGCCT…", "$$$$$$$$$$$$$$$$$$$…")
+ ```
+
+ All of the nucleotides in all of the reads have a quality score of `$`, which corresponds to a probabilty of error of 0.50119.  
+ More information about how to convert ASCII values to quality scores [here](https://people.duke.edu/~ccc14/duke-hts-2018/bioinformatics/quality_scores.html).  
+ This would be quite poor if we were looking at Illumia data.  
+ However, because of how Pacbio chemistry works,  
+ the quality scores are often flattened and there is simply a placeholder value on this line.  
+ This does not mean our reads are trash!  
+ Now that we've learned how to read files in and manipulate them a bit,  
+let's see if we can align the mecA gene to the Staphylococcus aureus genome.  
 
 

From bdd640c5b3ec3f7b6ad0e17c7aa4b328243dcbc8 Mon Sep 17 00:00:00 2001
From: Danielle Pinto <danielle.peterson101@gmail.com>
Date: Fri, 6 Mar 2026 15:27:20 -0500
Subject: [PATCH 3/4] fix typos and update mecA.fasta

---
 cookbook/assets/mecA.fasta | 44 ++++++++++++++++------
 cookbook/sequences.md      | 76 +++++++++++++++++++++++---------------
 2 files changed, 78 insertions(+), 42 deletions(-)

diff --git a/cookbook/assets/mecA.fasta b/cookbook/assets/mecA.fasta
index 5edd071..1480bd5 100644
--- a/cookbook/assets/mecA.fasta
+++ b/cookbook/assets/mecA.fasta
@@ -1,12 +1,32 @@
->NC_007795.1:907598-908317 SAOUHSC_00935 [organism=Staphylococcus aureus subsp. aureus NCTC 8325] [GeneID=3920764] [chromosome=]
-ATGAGAATAGAACGAGTAGATGATACAACTGTAAAATTGTTTATAACATATAGCGATATCGAGGCCCGTG
-GATTTAGTCGTGAAGATTTATGGACAAATCGCAAACGTGGCGAAGAATTCTTTTGGTCAATGATGGATGA
-AATTAACGAAGAAGAAGATTTTGTTGTAGAAGGTCCATTATGGATTCAAGTACATGCCTTTGAAAAAGGT
-GTCGAAGTCACAATTTCTAAATCTAAAAATGAAGATATGATGAATATGTCTGATGATGATGCAACTGATC
-AATTTGATGAACAAGTTCAAGAATTGTTAGCTCAAACATTAGAAGGTGAAGATCAATTAGAAGAATTATT
-CGAGCAACGAACAAAAGAAAAAGAAGCTCAAGGTTCTAAACGTCAAAAGTCTTCAGCACGTAAAAATACA
-AGAACAATCATTGTGAAATTTAACGATTTAGAAGATGTTATTAATTATGCATATCATAGCAATCCAATAA
-CTACAGAGTTTGAAGATTTGTTATATATGGTTGATGGTACTTATTATTATGCTGTATATTTTGATAGTCA
-TGTTGATCAAGAAGTCATTAATGATAGTTACAGTCAATTGCTTGAATTTGCTTATCCAACAGACAGAACA
-GAAGTTTATTTAAATGACTATGCTAAAATAATTATGAGTCATAACGTAACAGCTCAAGTTCGACGTTATT
-TTCCAGAGACAACTGAATAA
+>NG_047945.1 Staphylococcus aureus TN/CN/1/12 mecA gene for ceftaroline-resistant PBP2a family peptidoglycan transpeptidase MecA, complete CDS
+ATGAAAAAGATAAAAATTGTTCCACTTATTTTAATAGTTGTAGTTGTCGGGTTTGGTATATATTTTTATG
+CTTCAAAAGATAAAGAAATTAATAATACTATTGATGCAATTGAAGATAAAAATTTCAAACAAGTTTATAA
+AGATAGCAGTTATATTTCTAAAAGCGATAATGGTGAAGTAGAAATGACTGAACGTCCGATAAAAATATAT
+AATAGTTTAGGCGTTAAAGATATAAACATTCAGGATCGTAAAATAAAAAAAGTATCTAAAAATAAAAAAC
+GAGTAGATGCTCAATATAAAATTAAAACAAACTACGGTAACATTGATCGCAACGTTCAATTTAATTTTGT
+TAAAGAAGATGGTATGTGGAAGTTAGATTGGGATCATAGCGTCATTATTCCAGGAATGCAGAAAGACCAA
+AGCATACATATTGAAAATTTAAAATCAGAACGTGGTAAAATTTTAGACCGAAACAATGTGGAATTGGCCA
+ATACAGGAACAGCATATGAGATAGGCATCGTTCCAAAGAATGTATCTAAAAAAGATTATAAAGCAATCGC
+TAAAGAACTAAGTATTTCTGAAGACTATATCAAACAACAAATGGATCAAAATTGGGTACAAGATGATACC
+TTCGTTCCACTTAAAACCGTTAAAAAAATGGATGAATATTTAAGTGATTTCGCAAAAAAATTTCATCTTA
+CAACTAATGAAACAAAAAGTCGTAACTATCCTCTAGAAAAAGCGACTTCACATCTATTAGGTTATGTTGG
+TCCCATTAACTCTGAAGAATTAAAACAAAAAGAATATAAAGGCTATAAAGATGATGCAGTTATTGGTAAA
+AAGGGACTCGAAAAACTTTACGATAAAAAGCTCCAACATGAAGATGGCTATCGTGTCACAATCGTTGACG
+ATAATAGCAATACAATCGCACATACATTAATAGAGAAAAAGAAAAAAGATGGCAAAGATATTCAACTAAC
+TATTGATGCTAAAGTTCAAAAGAGTATTTATAACAACATGAAAAATGATTATGGCTCAGGTACTGCTATC
+CACCCTCAAACAGGTGAATTATTAGCACTTGTAAGCACACCTTCATATGACGTCTATCCATTTATGTATG
+GCATGAGTAACGAAGAATATAATAAATTAACCGAAGATAAAAAAGAACCTCTGCTCAACAAGTTCCAGAT
+TACAACTTCACCAGGTTCAACTCAAAAAATATTAACAGCAATGATTGGGTTAAATAACAAAACATTAGAC
+GATAAAACAAGTTATAAAATCGATGGTAAAGGTTGGCAAAAAGATAAATCTTGGGGTGGTTACAACGTTA
+CAAGATATGAAGTGGTAAATGGTAATATCGACTTAAAACAAGCAATAGAATCATCAGATAACATTTTCTT
+TGCTAGAGTAGCACTCGAATTAGGCAGTAAGAAATTTGAAAAAGGCATGAAAAAACTAGGTGTTGGTGAA
+GATATACCAAGTGATTATCCATTTTATAATGCTCAAATTTCAAACAAAAATTTAGATAATGAAATATTAT
+TAGCTGATTCAGGTTACGGACAAGGTGAAATACTGATTAACCCAGTACAGATCCTTTCAATCTATAGCGC
+ATTAGAAAATAATGGCAATATTAACGCACCTCACTTATTAAAAGACACGAAAAACAAAGTTTGGAAGAAA
+AATATTATTTCCAAAGAAAATATCAATCTATTAACTGATGGTATGCAACAAGTCGTAAATAAAACACATA
+AAGAAGATATTTATAGATCTTATGCAAACTTAATTGGCAAATCCGGTACTGCAGAACTCAAAATGAAACA
+AGGAGAAACTGGCAGACAAATTGGGTGGTTTATATCATATGATAAAGATAATCCAAACATGATGATGGCT
+ATTAATGTTAAAGATGTACAAGATAAAGGAATGGCTAGCTACAATGCCAAAATCTCAGGTAAAGTGTATG
+ATGAGCTATATGAGAACGGTAATAAAAAATACGATATAGATGAATAACAAAACAGTGAAGCAATCCGTAA
+CGATGGTTGCTTCACTGTTTTATTATGAATTATTAATAAGTGCTGTTACTTCTCCCTTAAATACAATTTC
+TTCATTT
\ No newline at end of file
diff --git a/cookbook/sequences.md b/cookbook/sequences.md
index 533ed58..534e28f 100644
--- a/cookbook/sequences.md
+++ b/cookbook/sequences.md
@@ -1,26 +1,26 @@
 +++
 title = "Sequence Input/Output"
-rss_descr = "Reading in fasta files using FASTX.jl"
+rss_descr = "Reading in FASTA, FASTQ, and FAI files using FASTX.jl"
 +++
 
 # Sequence Input/Output
 
 In this chapter, we'll talk about how to read in sequence files using the `FASTX.jl` module.   
 More information about the `FASTX.jl` package can be found at https://biojulia.dev/FASTX.jl/stable/  
-and with the built-in documentation. 
+and with the built-in documentation you can access directly within the Julia REPL. 
 
 ```julia
 julia> using FASTX
 julia> ?FASTX
 ```
 
-If FASTX is not already in your environment,   
+If `FASTX` is not already in your environment,   
 it can be easily added from the Julia Registry.
 
-To demonstrate how to this package,   
+To demonstrate how to use this package,   
 let's try to read in some real-world data!   
 
-The `FASTX` can read in 3 file types: fasta, fastq, and fai.
+The `FASTX` package can read in three file types: `fasta`, `fastq`, and `fai`.
 
 ### FASTA files
 FASTA files are text files containing biological sequence data.       
@@ -32,6 +32,16 @@ The template of a sequence record is:
 {sequence}
 ```
 
+The identifier is the first part of the description until the first whitespace.     
+If there is no white space, the name and description are the same.
+
+Here is an example fasta:
+```
+>chrI chromosome 1
+CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACC
+CACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTA
+```
+
 ### FASTQ files
 FASTQ files are also text-based files that contain sequences, along with a name and description.   
 However, they also store sequence quality information (the Q is for quality!).
@@ -54,9 +64,9 @@ tcagTTAAGATGGGAT
 
 ### FAI files
 
-FAI (FASTA index) files are used in conjuction with FASTA/FASTQ files.   
+FAI (FASTA index) files are used in conjunction with FASTA/FASTQ files.   
 They are text files with TAB-delimited columns.  
-Each line contains information about each region sequence within the FASTA/FASTQ.     
+They allow the user to access specific regions of the reference FASTA/FASTQ without reading in the entire sequence into memory.      
 More information about fai index files can be found [here](https://www.htslib.org/doc/faidx.html).
 
 ```
@@ -69,15 +79,14 @@ QUALOFFSET	Offset of sequence's first quality within the FASTQ file
 
 ```
 
+We will read in a FASTA file containing the _mecA_ gene.   
+_mecA_ is an antibiotic resistance gene commonly found in Methicillin-resistant _Staphylococcus aureus_ (MRSA). 
+It helps bacteria to break down beta-lactam antibiotics like methicillin.    
+It is typically 2.1 kB long.  
+This specific reference fasta was downloaded from NCBI [here](https://www.ncbi.nlm.nih.gov/nuccore/NG_047945.1?report=fasta). More information about this reference sequence can be found [here](https://www.ncbi.nlm.nih.gov/nuccore/NG_047945.1).  
 
-
-
-
-We will read in a fasta file containing the _mecA_ gene.   
-This gene was taken from NCBI [here](https://www.ncbi.nlm.nih.gov/gene?Db=gene&Cmd=DetailsSearch&Term=3920764#).
-
-First we'll open the file, 
-then we'll iterate over every record in the file and  
+First we'll open the file.   
+Then we'll iterate over every record in the file and  
 print out the sequence identifier, the sequence description and then the corresponding sequence.
 
 ```julia
@@ -86,32 +95,38 @@ julia> FASTAReader(open("assets/mecA.fasta")) do reader
                println(identifier(record))
                println(description(record))
                println(sequence(record))
+               println(length(sequence(record)))
            end
        end
 
-NC_007795.1:907598-908317
-NC_007795.1:907598-908317 SAOUHSC_00935 [organism=Staphylococcus aureus subsp. aureus NCTC 8325] [GeneID=3920764] [chromosome=]
-ATGAGAATAGAACGAGTAGATGATACAACTGTAAAATTGTTTATAACATATAGCGATATCGAGGCCCGTGGATTTAGTCGTGAAGATTTATGGACAAATCGCAAACGTGGCGAAGAATTCTTTTGGTCAATGATGGATGAAATTAACGAAGAAGAAGATTTTGTTGTAGAAGGTCCATTATGGATTCAAGTACATGCCTTTGAAAAAGGTGTCGAAGTCACAATTTCTAAATCTAAAAATGAAGATATGATGAATATGTCTGATGATGATGCAACTGATCAATTTGATGAACAAGTTCAAGAATTGTTAGCTCAAACATTAGAAGGTGAAGATCAATTAGAAGAATTATTCGAGCAACGAACAAAAGAAAAAGAAGCTCAAGGTTCTAAACGTCAAAAGTCTTCAGCACGTAAAAATACAAGAACAATCATTGTGAAATTTAACGATTTAGAAGATGTTATTAATTATGCATATCATAGCAATCCAATAACTACAGAGTTTGAAGATTTGTTATATATGGTTGATGGTACTTATTATTATGCTGTATATTTTGATAGTCATGTTGATCAAGAAGTCATTAATGATAGTTACAGTCAATTGCTTGAATTTGCTTATCCAACAGACAGAACAGAAGTTTATTTAAATGACTATGCTAAAATAATTATGAGTCATAACGTAACAGCTCAAGTTCGACGTTATTTTCCAGAGACAACTGAATAA
+NG_047945.1
+NG_047945.1 Staphylococcus aureus TN/CN/1/12 mecA gene for ceftaroline-resistant PBP2a family peptidoglycan transpeptidase MecA, complete CDS
+ATGAAAAAGATAAAAATTGTTCCACTTATTTTAATAGTTGTAGTTGTCGGGTTTGGTATATATTTTTATGCTTCAAAAGATAAAGAAATTAATAATACTATTGATGCAATTGAAGATAAAAATTTCAAACAAGTTTATAAAGATAGCAGTTATATTTCTAAAAGCGATAATGGTGAAGTAGAAATGACTGAACGTCCGATAAAAATATATAATAGTTTAGGCGTTAAAGATATAAACATTCAGGATCGTAAAATAAAAAAAGTATCTAAAAATAAAAAACGAGTAGATGCTCAATATAAAATTAAAACAAACTACGGTAACATTGATCGCAACGTTCAATTTAATTTTGTTAAAGAAGATGGTATGTGGAAGTTAGATTGGGATCATAGCGTCATTATTCCAGGAATGCAGAAAGACCAAAGCATACATATTGAAAATTTAAAATCAGAACGTGGTAAAATTTTAGACCGAAACAATGTGGAATTGGCCAATACAGGAACAGCATATGAGATAGGCATCGTTCCAAAGAATGTATCTAAAAAAGATTATAAAGCAATCGCTAAAGAACTAAGTATTTCTGAAGACTATATCAAACAACAAATGGATCAAAATTGGGTACAAGATGATACCTTCGTTCCACTTAAAACCGTTAAAAAAATGGATGAATATTTAAGTGATTTCGCAAAAAAATTTCATCTTACAACTAATGAAACAAAAAGTCGTAACTATCCTCTAGAAAAAGCGACTTCACATCTATTAGGTTATGTTGGTCCCATTAACTCTGAAGAATTAAAACAAAAAGAATATAAAGGCTATAAAGATGATGCAGTTATTGGTAAAAAGGGACTCGAAAAACTTTACGATAAAAAGCTCCAACATGAAGATGGCTATCGTGTCACAATCGTTGACGATAATAGCAATACAATCGCACATACATTAATAGAGAAAAAGAAAAAAGATGGCAAAGATATTCAACTAACTATTGATGCTAAAGTTCAAAAGAGTATTTATAACAACATGAAAAATGATTATGGCTCAGGTACTGCTATCCACCCTCAAACAGGTGAATTATTAGCACTTGTAAGCACACCTTCATATGACGTCTATCCATTTATGTATGGCATGAGTAACGAAGAATATAATAAATTAACCGAAGATAAAAAAGAACCTCTGCTCAACAAGTTCCAGATTACAACTTCACCAGGTTCAACTCAAAAAATATTAACAGCAATGATTGGGTTAAATAACAAAACATTAGACGATAAAACAAGTTATAAAATCGATGGTAAAGGTTGGCAAAAAGATAAATCTTGGGGTGGTTACAACGTTACAAGATATGAAGTGGTAAATGGTAATATCGACTTAAAACAAGCAATAGAATCATCAGATAACATTTTCTTTGCTAGAGTAGCACTCGAATTAGGCAGTAAGAAATTTGAAAAAGGCATGAAAAAACTAGGTGTTGGTGAAGATATACCAAGTGATTATCCATTTTATAATGCTCAAATTTCAAACAAAAATTTAGATAATGAAATATTATTAGCTGATTCAGGTTACGGACAAGGTGAAATACTGATTAACCCAGTACAGATCCTTTCAATCTATAGCGCATTAGAAAATAATGGCAATATTAACGCACCTCACTTATTAAAAGACACGAAAAACAAAGTTTGGAAGAAAAATATTATTTCCAAAGAAAATATCAATCTATTAACTGATGGTATGCAACAAGTCGTAAATAAAACACATAAAGAAGATATTTATAGATCTTATGCAAACTTAATTGGCAAATCCGGTACTGCAGAACTCAAAATGAAACAAGGAGAAACTGGCAGACAAATTGGGTGGTTTATATCATATGATAAAGATAATCCAAACATGATGATGGCTATTAATGTTAAAGATGTACAAGATAAAGGAATGGCTAGCTACAATGCCAAAATCTCAGGTAAAGTGTATGATGAGCTATATGAGAACGGTAATAAAAAATACGATATAGATGAATAACAAAACAGTGAAGCAATCCGTAACGATGGTTGCTTCACTGTTTTATTATGAATTATTAATAAGTGCTGTTACTTCTCCCTTAAATACAATTTCTTCATTT
+2107
 ```
+We confirmed that the length of the gene matches what we'd expect for _mecA_. 
+In this case, there is only one sequence that spans the entire length of the gene.   
+After this gene was sequenced, all of the reads were assembled together into a single consensus sequence.      
+_mecA_ is a well-characterized gene, so there are no ambiguous regions, and there is no need for there to be multiple contigs  
+(AKA for the gene to be broken up into multiple pieces, as we know exactly how the reads should be assembled together.).  
 
-In this case, there is only one sequence.
 
-Let's try reading in a larger fastq file. 
+Let's try reading in a larger FASTQ file. 
 
-The raw reads for a Staphylococcus aureus isolate was sequenced with Pac-Bio and uploaded to NCBI [here](https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR12147540).   
-The link to download the raw fastq's can be found [here](https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR12147540).
+The raw reads for a _Staphylococcus aureus_ isolate were sequenced with PacBio and uploaded to NCBI [here](https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR12147540).     
+The link to download the raw FASTQ files can be found [here](https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR12147540).
 
-The biosample ID for this sample is `SAMN14830786`.
+The BioSample ID for this sample is `SAMN14830786`.
 This ID refers to the physical bacterial isolate.  
 
-The SRA sample accession number (an internal value used within Sequence Read Archive) is `SRS6947643`.    
+The SRA sample accession number (an internal value used within the Sequence Read Archive) is `SRS6947643`.    
 
 Both values correspond to one another and are helpful identifiers. 
 
 The SRR (sample run accession number) is the unique identifier within SRA   
-and corresponds to the specific sequencing run within SRA. 
+and corresponds to the specific sequencing run. 
 
-In a later tutorial, we will discuss how we can download this file within julia with the SRR.
+In a later tutorial, we will discuss how to download this file in Julia using the SRR.
 
 But for now, the file can be downloaded using curl 
 
@@ -164,10 +179,11 @@ julia> records[1:10]
  All of the nucleotides in all of the reads have a quality score of `$`, which corresponds to a probabilty of error of 0.50119.  
  More information about how to convert ASCII values to quality scores [here](https://people.duke.edu/~ccc14/duke-hts-2018/bioinformatics/quality_scores.html).  
  This would be quite poor if we were looking at Illumia data.  
- However, because of how Pacbio chemistry works,  
- the quality scores are often flattened and there is simply a placeholder value on this line.  
- This does not mean our reads are trash!  
+ However, because of how PacBio chemistry works,  
+ quality scores are often flattened and there is simply a placeholder value on this line.  
+ This does not mean our reads are low quality!  
  Now that we've learned how to read files in and manipulate them a bit,  
-let's see if we can align the mecA gene to the Staphylococcus aureus genome.  
+let's see if we can align the _mecA_ gene to the _Staphylococcus aureus_ genome.  
+This will tell us if this _S. aureus_ is MRSA.  
 
 

From 06552476e187f72f06807d8872ef4fdfeba3e7df Mon Sep 17 00:00:00 2001
From: Danielle Pinto <danielle.peterson101@gmail.com>
Date: Tue, 10 Mar 2026 22:13:07 -0400
Subject: [PATCH 4/4] switch example fastq to illumina sample

---
 cookbook/sequences.md | 111 ++++++++++++++++++++++++++++++------------
 1 file changed, 80 insertions(+), 31 deletions(-)

diff --git a/cookbook/sequences.md b/cookbook/sequences.md
index 534e28f..4b6e9d8 100644
--- a/cookbook/sequences.md
+++ b/cookbook/sequences.md
@@ -113,28 +113,42 @@ _mecA_ is a well-characterized gene, so there are no ambiguous regions, and ther
 
 Let's try reading in a larger FASTQ file. 
 
-The raw reads for a _Staphylococcus aureus_ isolate were sequenced with PacBio and uploaded to NCBI [here](https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR12147540).     
-The link to download the raw FASTQ files can be found [here](https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR12147540).
+The raw reads for a _Staphylococcus aureus_ isolate were sequenced with Illumina and uploaded to NCBI [here](https://trace.ncbi.nlm.nih.gov/Traces/index.html?run=SRR1050625).     
+The link to download the raw FASTQ files can be found [here](https://trace.ncbi.nlm.nih.gov/Traces/index.html?run=SRR1050625).
 
-The BioSample ID for this sample is `SAMN14830786`.
+The BioSample ID for this sample is `SAMN02360768`.
 This ID refers to the physical bacterial isolate.  
-
-The SRA sample accession number (an internal value used within the Sequence Read Archive) is `SRS6947643`.    
-
+The SRA sample accession number (an internal value used within the Sequence Read Archive) is `SRS515580`.    
 Both values correspond to one another and are helpful identifiers. 
 
 The SRR (sample run accession number) is the unique identifier within SRA   
-and corresponds to the specific sequencing run. 
-
-In a later tutorial, we will discuss how to download this file in Julia using the SRR.
-
-But for now, the file can be downloaded using curl 
+and corresponds to the specific sequencing run.   
+There are two sequencing runs and accession numbers for this sample,   
+but we'll select one to download: `SRX392511`.  
 
+This file can be downloaded on the command line using `curl`.  
+> One of the nice things about julia is that it is 
+> super easy to toggle between the julia REPL and 
+> bash shell by typing `;` to access shell mode 
+> from julia.  
+> You can use the `backspace`/`delete` key to 
+> quickly toggle back to the julia REPL.  
 
+To download using the command line, type:
 ```
 curl -L --retry 5 --retry-delay 2 \
-  "https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=SRR12147540" \
-  | gzip -c > SRR12147540.fastq.gz
+  "https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=SRX392511" \
+  | gzip -c > SRX392511.fastq.gz
+```
+Alternatively, command line code can be executed from within julia:
+
+```julia
+run(pipeline(
+    `curl -L --retry 5 --retry-delay 2 "https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=SRX392511"`,
+    `gzip -c`,
+    "SRX392511.fastq.gz"
+    )
+)
 ```
 This file is gzipped, so we'll need to account for that as we are reading it in.
 
@@ -145,7 +159,7 @@ using CodecZlib
 
 records = []
 
-FASTQReader(GzipDecompressorStream(open("assets/SRR12147540.fastq.gz"))) do reader
+FASTQReader(GzipDecompressorStream(open("assets/SRX392511.fastq.gz"))) do reader
            for record in reader
                push!(records, record)
            end
@@ -156,7 +170,7 @@ We can see how many reads there are by looking at the length of `records`.
 
 ```julia
 julia> length(records)
-163528
+9852716
 ```
 
 Let's take a look at what the first 10 reads look like:
@@ -164,24 +178,59 @@ Let's take a look at what the first 10 reads look like:
 ```
 julia> records[1:10]
 10-element Vector{Any}:
- FASTX.FASTQ.Record("SRR12147540.1  length=43", "TTTTTTTTCCTTTCTTTCT…", "$$$$$$$$$$$$$$$$$$$…")
- FASTX.FASTQ.Record("SRR12147540.2  length=40", "TTTGTTTTTTTTTGTTTTC…", "$$$$$$$$$$$$$$$$$$$…")
- FASTX.FASTQ.Record("SRR12147540.3  length=32", "TTTTTTTTTGTTCTTTGGT…", "$$$$$$$$$$$$$$$$$$$…")
- FASTX.FASTQ.Record("SRR12147540.4  length=40", "TTTTCTTTTCCTTCTTTTC…", "$$$$$$$$$$$$$$$$$$$…")
- FASTX.FASTQ.Record("SRR12147540.5  length=55", "TGTTGTTTGTGTCTCGTTT…", "$$$$$$$$$$$$$$$$$$$…")
- FASTX.FASTQ.Record("SRR12147540.6  length=166", "TTTCCTTTTTTTTCCTCTC…", "$$$$$$$$$$$$$$$$$$$…")
- FASTX.FASTQ.Record("SRR12147540.7  length=338", "TGACCACCTTAGAACTTGG…", "$$$$$$$$$$$$$$$$$$$…")
- FASTX.FASTQ.Record("SRR12147540.8  length=245", "ACGCCGCGGCCAAAGAACG…", "$$$$$$$$$$$$$$$$$$$…")
- FASTX.FASTQ.Record("SRR12147540.9  length=157", "TTTGTTTGCGCGGTCTCTT…", "$$$$$$$$$$$$$$$$$$$…")
- FASTX.FASTQ.Record("SRR12147540.10  length=100", "TTCTTTCGCCTTTTTGCCT…", "$$$$$$$$$$$$$$$$$$$…")
+ FASTX.FASTQ.Record("SRX392511.1 1 length=101", "AGGATTTTGTTATATTGTA…", "$$$$$$$$$$$$$$$$$$$…")
+ FASTX.FASTQ.Record("SRX392511.1 1 length=101", "AGCATAAATTTAAAAAAAA…", "$$$$$$$$$$$$$$$$$$$…")
+ FASTX.FASTQ.Record("SRX392511.2 2 length=101", "TAGATTAAAATTCTCGTAT…", "???????????????????…")
+ FASTX.FASTQ.Record("SRX392511.2 2 length=101", "TAGATTAAAATTCTCGTAG…", "???????????????????…")
+ FASTX.FASTQ.Record("SRX392511.3 3 length=101", "GAGCAGTAGTATAAAATGA…", "???????????????????…")
+ FASTX.FASTQ.Record("SRX392511.3 3 length=101", "CCGACACAATTACAAGCCA…", "???????????????????…")
+ FASTX.FASTQ.Record("SRX392511.4 4 length=101", "GATTATCTAGTCATAATTC…", "???????????????????…")
+ FASTX.FASTQ.Record("SRX392511.4 4 length=101", "TTCGCCCCCCCAAAAGGCT…", "???????????????????…")
+ FASTX.FASTQ.Record("SRX392511.5 5 length=101", "ATAAAATGAACTTGCGTTA…", "???????????????????…")
+ FASTX.FASTQ.Record("SRX392511.5 5 length=101", "TAACTTGTGGATAATTATT…", "???????????????????…")
  ```
 
- All of the nucleotides in all of the reads have a quality score of `$`, which corresponds to a probabilty of error of 0.50119.  
- More information about how to convert ASCII values to quality scores [here](https://people.duke.edu/~ccc14/duke-hts-2018/bioinformatics/quality_scores.html).  
- This would be quite poor if we were looking at Illumia data.  
- However, because of how PacBio chemistry works,  
- quality scores are often flattened and there is simply a placeholder value on this line.  
- This does not mean our reads are low quality!  
+ All of the nucleotides in these two reads have a quality score of `$`, which corresponds to a probabilty of error of 0.50119.
+ This is not concerning, as it is typical for the first couple of reads to be low quality.    
+
+ Let's calculate the average quality across all reads. 
+ We'll do this by converting each PHRED score to its probability error,   
+ and summing these values across all sequences in all reads.  
+ Then, we can divide this sum by the total number of bases  
+ to get the average probability error for the entire sequence. 
+ It's important that we first convert the PHRED scores to the probability errors before averaging,   
+because PHRED is a logarithmic transform of error probability.  
+Therefore, simply averaging PHRED and then converting to error probability is not the same as converting PHRED to error probability and averaging that sum.  
+
+```julia
+function phred_to_prob(Q)
+    # The formula is Pe = 10^(-Q/10)
+    return 10^(-Q / 10.0)
+end
+
+# get sum of all error probabilities
+total_error_prob = sum(
+    sum(phred_to_prob(q) for q in FASTQ.quality_scores(record))
+    for record in records
+)
+4.5100356107423276e7
+# get total number of base pairs
+total_bases = sum(length(FASTQ.quality_scores(record)) for record in records)
+995124316
+
+# get average error probability
+total_error_prob/total_bases
+0.045321328584059316
+```
+
+The average error probability is just 4.53%,   
+meaning the majority of reads are high quality.  
+
+ More information about how to convert ASCII values/PHRED scores to quality scores [here](https://people.duke.edu/~ccc14/duke-hts-2018/bioinformatics/quality_scores.html).  
+ 
+
+
+
  Now that we've learned how to read files in and manipulate them a bit,  
 let's see if we can align the _mecA_ gene to the _Staphylococcus aureus_ genome.  
 This will tell us if this _S. aureus_ is MRSA.