From faf2890a9b9d4f44c7f955a46e4a3df13ab48184 Mon Sep 17 00:00:00 2001 From: Danielle Pinto Date: Wed, 4 Mar 2026 22:04:20 -0500 Subject: [PATCH 1/4] add initial draft --- cookbook/assets/mecA.fasta | 12 +++++ cookbook/index.md | 15 ++++++ cookbook/sequences.md | 99 ++++++++++++++++++++++++++++++++++++++ 3 files changed, 126 insertions(+) create mode 100644 cookbook/assets/mecA.fasta create mode 100644 cookbook/index.md create mode 100644 cookbook/sequences.md diff --git a/cookbook/assets/mecA.fasta b/cookbook/assets/mecA.fasta new file mode 100644 index 0000000..5edd071 --- /dev/null +++ b/cookbook/assets/mecA.fasta @@ -0,0 +1,12 @@ +>NC_007795.1:907598-908317 SAOUHSC_00935 [organism=Staphylococcus aureus subsp. aureus NCTC 8325] [GeneID=3920764] [chromosome=] +ATGAGAATAGAACGAGTAGATGATACAACTGTAAAATTGTTTATAACATATAGCGATATCGAGGCCCGTG +GATTTAGTCGTGAAGATTTATGGACAAATCGCAAACGTGGCGAAGAATTCTTTTGGTCAATGATGGATGA +AATTAACGAAGAAGAAGATTTTGTTGTAGAAGGTCCATTATGGATTCAAGTACATGCCTTTGAAAAAGGT +GTCGAAGTCACAATTTCTAAATCTAAAAATGAAGATATGATGAATATGTCTGATGATGATGCAACTGATC +AATTTGATGAACAAGTTCAAGAATTGTTAGCTCAAACATTAGAAGGTGAAGATCAATTAGAAGAATTATT +CGAGCAACGAACAAAAGAAAAAGAAGCTCAAGGTTCTAAACGTCAAAAGTCTTCAGCACGTAAAAATACA +AGAACAATCATTGTGAAATTTAACGATTTAGAAGATGTTATTAATTATGCATATCATAGCAATCCAATAA +CTACAGAGTTTGAAGATTTGTTATATATGGTTGATGGTACTTATTATTATGCTGTATATTTTGATAGTCA +TGTTGATCAAGAAGTCATTAATGATAGTTACAGTCAATTGCTTGAATTTGCTTATCCAACAGACAGAACA +GAAGTTTATTTAAATGACTATGCTAAAATAATTATGAGTCATAACGTAACAGCTCAAGTTCGACGTTATT +TTCCAGAGACAACTGAATAA diff --git a/cookbook/index.md b/cookbook/index.md new file mode 100644 index 0000000..45ebf95 --- /dev/null +++ b/cookbook/index.md @@ -0,0 +1,15 @@ ++++ +using Dates +date = Date("2026-03-04") +title = "BioJulia Cookbook" +rss_descr = "Recipes for performing basic bioinformatics in julia" ++++ + +# BioJulia Cookbook + +This cookbook will provide a series of "recipes" that will help get started quickly with BioJulia so you can doing some bioinformatics! + +We have tutorials for reading in files, performing alignments, and using tools such as BLAST, +as well as links to more documentation about specific BioJulia packages. + +{{list_dir cookbook}} diff --git a/cookbook/sequences.md b/cookbook/sequences.md new file mode 100644 index 0000000..b095afd --- /dev/null +++ b/cookbook/sequences.md @@ -0,0 +1,99 @@ ++++ +title = "Sequence Input/Output" +rss_descr = "Reading in fasta files using FASTX.jl" ++++ + +# Sequence Input/Output + +In this chapter, we'll talk about how to read in sequence files using the `FASTX.jl` module. +More information about the `FASTX.jl` package can be found at https://biojulia.dev/FASTX.jl/stable/ +and with the built-in documentation. + +```julia +julia> using FASTX +julia> ?FASTX +``` + +If FASTX is not already in your environment, +it can be easily added from the Julia Registry. + +To demonstrate how to this package, +let's try to read in some real-world data! + +The `FASTX` can read in 3 file types: fasta, fastq, and fai. + +### FASTA files +FASTA files are text files containing biological sequence data. +They have three parts: name, description, and sequence. + +The template of a sequence record is: +``` +>{description} +{sequence} +``` + +### FASTQ files +FASTQ files are also text-based files that contain sequences, along with a name and description. However, they also store sequence quality information (the Q is for quality!). + +The template of a sequence record is: +``` +@{description} +{sequence} ++{description}? +{qualities} +``` + +### FAI files + +FAI (FASTA index) files are used in conjuction with FASTA/FASTQ files. +They are text files with TAB-delimited columns. +Each line contains information about each region sequence within the FASTA/FASTQ. +More information about fai index files can be found [here](https://www.htslib.org/doc/faidx.html). + +``` +NAME Name of this reference sequence +LENGTH Total length of this reference sequence, in bases +OFFSET Offset in the FASTA/FASTQ file of this sequence's first base +LINEBASES The number of bases on each line +LINEWIDTH The number of bytes in each line, including the newline +QUALOFFSET Offset of sequence's first quality within the FASTQ file + +``` + + + + + +We will read in a fasta file containing the _mecA_ gene. +This gene was taken from NCBI [here](https://www.ncbi.nlm.nih.gov/gene?Db=gene&Cmd=DetailsSearch&Term=3920764#). + +First we'll open the file, +then we'll iterate over every record in the file and +print out the sequence identifier, the sequence description and then the corresponding sequence. + +```julia +julia> FASTAReader(open("assets/mecA.fasta")) do reader + for record in reader + println(identifier(record)) + println(description(record)) + println(sequence(record)) + end + end + +NC_007795.1:907598-908317 +NC_007795.1:907598-908317 SAOUHSC_00935 [organism=Staphylococcus aureus subsp. aureus NCTC 8325] [GeneID=3920764] [chromosome=] +ATGAGAATAGAACGAGTAGATGATACAACTGTAAAATTGTTTATAACATATAGCGATATCGAGGCCCGTGGATTTAGTCGTGAAGATTTATGGACAAATCGCAAACGTGGCGAAGAATTCTTTTGGTCAATGATGGATGAAATTAACGAAGAAGAAGATTTTGTTGTAGAAGGTCCATTATGGATTCAAGTACATGCCTTTGAAAAAGGTGTCGAAGTCACAATTTCTAAATCTAAAAATGAAGATATGATGAATATGTCTGATGATGATGCAACTGATCAATTTGATGAACAAGTTCAAGAATTGTTAGCTCAAACATTAGAAGGTGAAGATCAATTAGAAGAATTATTCGAGCAACGAACAAAAGAAAAAGAAGCTCAAGGTTCTAAACGTCAAAAGTCTTCAGCACGTAAAAATACAAGAACAATCATTGTGAAATTTAACGATTTAGAAGATGTTATTAATTATGCATATCATAGCAATCCAATAACTACAGAGTTTGAAGATTTGTTATATATGGTTGATGGTACTTATTATTATGCTGTATATTTTGATAGTCATGTTGATCAAGAAGTCATTAATGATAGTTACAGTCAATTGCTTGAATTTGCTTATCCAACAGACAGAACAGAAGTTTATTTAAATGACTATGCTAAAATAATTATGAGTCATAACGTAACAGCTCAAGTTCGACGTTATTTTCCAGAGACAACTGAATAA +``` + +In this case, there is only one sequence. + +Let's try reading in a larger fastq file. + +```julia + + + +``` + + + From 38c2d71276d1fc89429e1a48ddd830725c9cc26a Mon Sep 17 00:00:00 2001 From: Danielle Pinto Date: Thu, 5 Mar 2026 22:07:29 -0500 Subject: [PATCH 2/4] add sequence tutorial --- cookbook/sequences.md | 76 ++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 75 insertions(+), 1 deletion(-) diff --git a/cookbook/sequences.md b/cookbook/sequences.md index b095afd..533ed58 100644 --- a/cookbook/sequences.md +++ b/cookbook/sequences.md @@ -33,7 +33,8 @@ The template of a sequence record is: ``` ### FASTQ files -FASTQ files are also text-based files that contain sequences, along with a name and description. However, they also store sequence quality information (the Q is for quality!). +FASTQ files are also text-based files that contain sequences, along with a name and description. +However, they also store sequence quality information (the Q is for quality!). The template of a sequence record is: ``` @@ -43,6 +44,14 @@ The template of a sequence record is: {qualities} ``` +Here is an example record: +``` +@FSRRS4401BE7HA +tcagTTAAGATGGGAT ++ +###EEEEEEEEE##E# +``` + ### FAI files FAI (FASTA index) files are used in conjuction with FASTA/FASTQ files. @@ -89,11 +98,76 @@ In this case, there is only one sequence. Let's try reading in a larger fastq file. +The raw reads for a Staphylococcus aureus isolate was sequenced with Pac-Bio and uploaded to NCBI [here](https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR12147540). +The link to download the raw fastq's can be found [here](https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR12147540). + +The biosample ID for this sample is `SAMN14830786`. +This ID refers to the physical bacterial isolate. + +The SRA sample accession number (an internal value used within Sequence Read Archive) is `SRS6947643`. + +Both values correspond to one another and are helpful identifiers. + +The SRR (sample run accession number) is the unique identifier within SRA +and corresponds to the specific sequencing run within SRA. + +In a later tutorial, we will discuss how we can download this file within julia with the SRR. + +But for now, the file can be downloaded using curl + + +``` +curl -L --retry 5 --retry-delay 2 \ + "https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=SRR12147540" \ + | gzip -c > SRR12147540.fastq.gz +``` +This file is gzipped, so we'll need to account for that as we are reading it in. + +Instead of printing out every record (this isn't super practical because it's a big file), let's save all the records into a vector. + ```julia +using CodecZlib +records = [] +FASTQReader(GzipDecompressorStream(open("assets/SRR12147540.fastq.gz"))) do reader + for record in reader + push!(records, record) + end + end +``` + +We can see how many reads there are by looking at the length of `records`. +```julia +julia> length(records) +163528 ``` +Let's take a look at what the first 10 reads look like: + +``` +julia> records[1:10] +10-element Vector{Any}: + FASTX.FASTQ.Record("SRR12147540.1 length=43", "TTTTTTTTCCTTTCTTTCT…", "$$$$$$$$$$$$$$$$$$$…") + FASTX.FASTQ.Record("SRR12147540.2 length=40", "TTTGTTTTTTTTTGTTTTC…", "$$$$$$$$$$$$$$$$$$$…") + FASTX.FASTQ.Record("SRR12147540.3 length=32", "TTTTTTTTTGTTCTTTGGT…", "$$$$$$$$$$$$$$$$$$$…") + FASTX.FASTQ.Record("SRR12147540.4 length=40", "TTTTCTTTTCCTTCTTTTC…", "$$$$$$$$$$$$$$$$$$$…") + FASTX.FASTQ.Record("SRR12147540.5 length=55", "TGTTGTTTGTGTCTCGTTT…", "$$$$$$$$$$$$$$$$$$$…") + FASTX.FASTQ.Record("SRR12147540.6 length=166", "TTTCCTTTTTTTTCCTCTC…", "$$$$$$$$$$$$$$$$$$$…") + FASTX.FASTQ.Record("SRR12147540.7 length=338", "TGACCACCTTAGAACTTGG…", "$$$$$$$$$$$$$$$$$$$…") + FASTX.FASTQ.Record("SRR12147540.8 length=245", "ACGCCGCGGCCAAAGAACG…", "$$$$$$$$$$$$$$$$$$$…") + FASTX.FASTQ.Record("SRR12147540.9 length=157", "TTTGTTTGCGCGGTCTCTT…", "$$$$$$$$$$$$$$$$$$$…") + FASTX.FASTQ.Record("SRR12147540.10 length=100", "TTCTTTCGCCTTTTTGCCT…", "$$$$$$$$$$$$$$$$$$$…") + ``` + + All of the nucleotides in all of the reads have a quality score of `$`, which corresponds to a probabilty of error of 0.50119. + More information about how to convert ASCII values to quality scores [here](https://people.duke.edu/~ccc14/duke-hts-2018/bioinformatics/quality_scores.html). + This would be quite poor if we were looking at Illumia data. + However, because of how Pacbio chemistry works, + the quality scores are often flattened and there is simply a placeholder value on this line. + This does not mean our reads are trash! + Now that we've learned how to read files in and manipulate them a bit, +let's see if we can align the mecA gene to the Staphylococcus aureus genome. From bdd640c5b3ec3f7b6ad0e17c7aa4b328243dcbc8 Mon Sep 17 00:00:00 2001 From: Danielle Pinto Date: Fri, 6 Mar 2026 15:27:20 -0500 Subject: [PATCH 3/4] fix typos and update mecA.fasta --- cookbook/assets/mecA.fasta | 44 ++++++++++++++++------ cookbook/sequences.md | 76 +++++++++++++++++++++++--------------- 2 files changed, 78 insertions(+), 42 deletions(-) diff --git a/cookbook/assets/mecA.fasta b/cookbook/assets/mecA.fasta index 5edd071..1480bd5 100644 --- a/cookbook/assets/mecA.fasta +++ b/cookbook/assets/mecA.fasta @@ -1,12 +1,32 @@ ->NC_007795.1:907598-908317 SAOUHSC_00935 [organism=Staphylococcus aureus subsp. aureus NCTC 8325] [GeneID=3920764] [chromosome=] -ATGAGAATAGAACGAGTAGATGATACAACTGTAAAATTGTTTATAACATATAGCGATATCGAGGCCCGTG -GATTTAGTCGTGAAGATTTATGGACAAATCGCAAACGTGGCGAAGAATTCTTTTGGTCAATGATGGATGA -AATTAACGAAGAAGAAGATTTTGTTGTAGAAGGTCCATTATGGATTCAAGTACATGCCTTTGAAAAAGGT -GTCGAAGTCACAATTTCTAAATCTAAAAATGAAGATATGATGAATATGTCTGATGATGATGCAACTGATC -AATTTGATGAACAAGTTCAAGAATTGTTAGCTCAAACATTAGAAGGTGAAGATCAATTAGAAGAATTATT -CGAGCAACGAACAAAAGAAAAAGAAGCTCAAGGTTCTAAACGTCAAAAGTCTTCAGCACGTAAAAATACA -AGAACAATCATTGTGAAATTTAACGATTTAGAAGATGTTATTAATTATGCATATCATAGCAATCCAATAA -CTACAGAGTTTGAAGATTTGTTATATATGGTTGATGGTACTTATTATTATGCTGTATATTTTGATAGTCA -TGTTGATCAAGAAGTCATTAATGATAGTTACAGTCAATTGCTTGAATTTGCTTATCCAACAGACAGAACA -GAAGTTTATTTAAATGACTATGCTAAAATAATTATGAGTCATAACGTAACAGCTCAAGTTCGACGTTATT -TTCCAGAGACAACTGAATAA +>NG_047945.1 Staphylococcus aureus TN/CN/1/12 mecA gene for ceftaroline-resistant PBP2a family peptidoglycan transpeptidase MecA, complete CDS +ATGAAAAAGATAAAAATTGTTCCACTTATTTTAATAGTTGTAGTTGTCGGGTTTGGTATATATTTTTATG +CTTCAAAAGATAAAGAAATTAATAATACTATTGATGCAATTGAAGATAAAAATTTCAAACAAGTTTATAA +AGATAGCAGTTATATTTCTAAAAGCGATAATGGTGAAGTAGAAATGACTGAACGTCCGATAAAAATATAT +AATAGTTTAGGCGTTAAAGATATAAACATTCAGGATCGTAAAATAAAAAAAGTATCTAAAAATAAAAAAC +GAGTAGATGCTCAATATAAAATTAAAACAAACTACGGTAACATTGATCGCAACGTTCAATTTAATTTTGT +TAAAGAAGATGGTATGTGGAAGTTAGATTGGGATCATAGCGTCATTATTCCAGGAATGCAGAAAGACCAA +AGCATACATATTGAAAATTTAAAATCAGAACGTGGTAAAATTTTAGACCGAAACAATGTGGAATTGGCCA +ATACAGGAACAGCATATGAGATAGGCATCGTTCCAAAGAATGTATCTAAAAAAGATTATAAAGCAATCGC +TAAAGAACTAAGTATTTCTGAAGACTATATCAAACAACAAATGGATCAAAATTGGGTACAAGATGATACC +TTCGTTCCACTTAAAACCGTTAAAAAAATGGATGAATATTTAAGTGATTTCGCAAAAAAATTTCATCTTA +CAACTAATGAAACAAAAAGTCGTAACTATCCTCTAGAAAAAGCGACTTCACATCTATTAGGTTATGTTGG +TCCCATTAACTCTGAAGAATTAAAACAAAAAGAATATAAAGGCTATAAAGATGATGCAGTTATTGGTAAA +AAGGGACTCGAAAAACTTTACGATAAAAAGCTCCAACATGAAGATGGCTATCGTGTCACAATCGTTGACG +ATAATAGCAATACAATCGCACATACATTAATAGAGAAAAAGAAAAAAGATGGCAAAGATATTCAACTAAC +TATTGATGCTAAAGTTCAAAAGAGTATTTATAACAACATGAAAAATGATTATGGCTCAGGTACTGCTATC +CACCCTCAAACAGGTGAATTATTAGCACTTGTAAGCACACCTTCATATGACGTCTATCCATTTATGTATG +GCATGAGTAACGAAGAATATAATAAATTAACCGAAGATAAAAAAGAACCTCTGCTCAACAAGTTCCAGAT +TACAACTTCACCAGGTTCAACTCAAAAAATATTAACAGCAATGATTGGGTTAAATAACAAAACATTAGAC +GATAAAACAAGTTATAAAATCGATGGTAAAGGTTGGCAAAAAGATAAATCTTGGGGTGGTTACAACGTTA +CAAGATATGAAGTGGTAAATGGTAATATCGACTTAAAACAAGCAATAGAATCATCAGATAACATTTTCTT +TGCTAGAGTAGCACTCGAATTAGGCAGTAAGAAATTTGAAAAAGGCATGAAAAAACTAGGTGTTGGTGAA +GATATACCAAGTGATTATCCATTTTATAATGCTCAAATTTCAAACAAAAATTTAGATAATGAAATATTAT +TAGCTGATTCAGGTTACGGACAAGGTGAAATACTGATTAACCCAGTACAGATCCTTTCAATCTATAGCGC +ATTAGAAAATAATGGCAATATTAACGCACCTCACTTATTAAAAGACACGAAAAACAAAGTTTGGAAGAAA +AATATTATTTCCAAAGAAAATATCAATCTATTAACTGATGGTATGCAACAAGTCGTAAATAAAACACATA +AAGAAGATATTTATAGATCTTATGCAAACTTAATTGGCAAATCCGGTACTGCAGAACTCAAAATGAAACA +AGGAGAAACTGGCAGACAAATTGGGTGGTTTATATCATATGATAAAGATAATCCAAACATGATGATGGCT +ATTAATGTTAAAGATGTACAAGATAAAGGAATGGCTAGCTACAATGCCAAAATCTCAGGTAAAGTGTATG +ATGAGCTATATGAGAACGGTAATAAAAAATACGATATAGATGAATAACAAAACAGTGAAGCAATCCGTAA +CGATGGTTGCTTCACTGTTTTATTATGAATTATTAATAAGTGCTGTTACTTCTCCCTTAAATACAATTTC +TTCATTT \ No newline at end of file diff --git a/cookbook/sequences.md b/cookbook/sequences.md index 533ed58..534e28f 100644 --- a/cookbook/sequences.md +++ b/cookbook/sequences.md @@ -1,26 +1,26 @@ +++ title = "Sequence Input/Output" -rss_descr = "Reading in fasta files using FASTX.jl" +rss_descr = "Reading in FASTA, FASTQ, and FAI files using FASTX.jl" +++ # Sequence Input/Output In this chapter, we'll talk about how to read in sequence files using the `FASTX.jl` module. More information about the `FASTX.jl` package can be found at https://biojulia.dev/FASTX.jl/stable/ -and with the built-in documentation. +and with the built-in documentation you can access directly within the Julia REPL. ```julia julia> using FASTX julia> ?FASTX ``` -If FASTX is not already in your environment, +If `FASTX` is not already in your environment, it can be easily added from the Julia Registry. -To demonstrate how to this package, +To demonstrate how to use this package, let's try to read in some real-world data! -The `FASTX` can read in 3 file types: fasta, fastq, and fai. +The `FASTX` package can read in three file types: `fasta`, `fastq`, and `fai`. ### FASTA files FASTA files are text files containing biological sequence data. @@ -32,6 +32,16 @@ The template of a sequence record is: {sequence} ``` +The identifier is the first part of the description until the first whitespace. +If there is no white space, the name and description are the same. + +Here is an example fasta: +``` +>chrI chromosome 1 +CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACC +CACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTA +``` + ### FASTQ files FASTQ files are also text-based files that contain sequences, along with a name and description. However, they also store sequence quality information (the Q is for quality!). @@ -54,9 +64,9 @@ tcagTTAAGATGGGAT ### FAI files -FAI (FASTA index) files are used in conjuction with FASTA/FASTQ files. +FAI (FASTA index) files are used in conjunction with FASTA/FASTQ files. They are text files with TAB-delimited columns. -Each line contains information about each region sequence within the FASTA/FASTQ. +They allow the user to access specific regions of the reference FASTA/FASTQ without reading in the entire sequence into memory. More information about fai index files can be found [here](https://www.htslib.org/doc/faidx.html). ``` @@ -69,15 +79,14 @@ QUALOFFSET Offset of sequence's first quality within the FASTQ file ``` +We will read in a FASTA file containing the _mecA_ gene. +_mecA_ is an antibiotic resistance gene commonly found in Methicillin-resistant _Staphylococcus aureus_ (MRSA). +It helps bacteria to break down beta-lactam antibiotics like methicillin. +It is typically 2.1 kB long. +This specific reference fasta was downloaded from NCBI [here](https://www.ncbi.nlm.nih.gov/nuccore/NG_047945.1?report=fasta). More information about this reference sequence can be found [here](https://www.ncbi.nlm.nih.gov/nuccore/NG_047945.1). - - - -We will read in a fasta file containing the _mecA_ gene. -This gene was taken from NCBI [here](https://www.ncbi.nlm.nih.gov/gene?Db=gene&Cmd=DetailsSearch&Term=3920764#). - -First we'll open the file, -then we'll iterate over every record in the file and +First we'll open the file. +Then we'll iterate over every record in the file and print out the sequence identifier, the sequence description and then the corresponding sequence. ```julia @@ -86,32 +95,38 @@ julia> FASTAReader(open("assets/mecA.fasta")) do reader println(identifier(record)) println(description(record)) println(sequence(record)) + println(length(sequence(record))) end end -NC_007795.1:907598-908317 -NC_007795.1:907598-908317 SAOUHSC_00935 [organism=Staphylococcus aureus subsp. aureus NCTC 8325] [GeneID=3920764] [chromosome=] -ATGAGAATAGAACGAGTAGATGATACAACTGTAAAATTGTTTATAACATATAGCGATATCGAGGCCCGTGGATTTAGTCGTGAAGATTTATGGACAAATCGCAAACGTGGCGAAGAATTCTTTTGGTCAATGATGGATGAAATTAACGAAGAAGAAGATTTTGTTGTAGAAGGTCCATTATGGATTCAAGTACATGCCTTTGAAAAAGGTGTCGAAGTCACAATTTCTAAATCTAAAAATGAAGATATGATGAATATGTCTGATGATGATGCAACTGATCAATTTGATGAACAAGTTCAAGAATTGTTAGCTCAAACATTAGAAGGTGAAGATCAATTAGAAGAATTATTCGAGCAACGAACAAAAGAAAAAGAAGCTCAAGGTTCTAAACGTCAAAAGTCTTCAGCACGTAAAAATACAAGAACAATCATTGTGAAATTTAACGATTTAGAAGATGTTATTAATTATGCATATCATAGCAATCCAATAACTACAGAGTTTGAAGATTTGTTATATATGGTTGATGGTACTTATTATTATGCTGTATATTTTGATAGTCATGTTGATCAAGAAGTCATTAATGATAGTTACAGTCAATTGCTTGAATTTGCTTATCCAACAGACAGAACAGAAGTTTATTTAAATGACTATGCTAAAATAATTATGAGTCATAACGTAACAGCTCAAGTTCGACGTTATTTTCCAGAGACAACTGAATAA +NG_047945.1 +NG_047945.1 Staphylococcus aureus TN/CN/1/12 mecA gene for ceftaroline-resistant PBP2a family peptidoglycan transpeptidase MecA, complete CDS +ATGAAAAAGATAAAAATTGTTCCACTTATTTTAATAGTTGTAGTTGTCGGGTTTGGTATATATTTTTATGCTTCAAAAGATAAAGAAATTAATAATACTATTGATGCAATTGAAGATAAAAATTTCAAACAAGTTTATAAAGATAGCAGTTATATTTCTAAAAGCGATAATGGTGAAGTAGAAATGACTGAACGTCCGATAAAAATATATAATAGTTTAGGCGTTAAAGATATAAACATTCAGGATCGTAAAATAAAAAAAGTATCTAAAAATAAAAAACGAGTAGATGCTCAATATAAAATTAAAACAAACTACGGTAACATTGATCGCAACGTTCAATTTAATTTTGTTAAAGAAGATGGTATGTGGAAGTTAGATTGGGATCATAGCGTCATTATTCCAGGAATGCAGAAAGACCAAAGCATACATATTGAAAATTTAAAATCAGAACGTGGTAAAATTTTAGACCGAAACAATGTGGAATTGGCCAATACAGGAACAGCATATGAGATAGGCATCGTTCCAAAGAATGTATCTAAAAAAGATTATAAAGCAATCGCTAAAGAACTAAGTATTTCTGAAGACTATATCAAACAACAAATGGATCAAAATTGGGTACAAGATGATACCTTCGTTCCACTTAAAACCGTTAAAAAAATGGATGAATATTTAAGTGATTTCGCAAAAAAATTTCATCTTACAACTAATGAAACAAAAAGTCGTAACTATCCTCTAGAAAAAGCGACTTCACATCTATTAGGTTATGTTGGTCCCATTAACTCTGAAGAATTAAAACAAAAAGAATATAAAGGCTATAAAGATGATGCAGTTATTGGTAAAAAGGGACTCGAAAAACTTTACGATAAAAAGCTCCAACATGAAGATGGCTATCGTGTCACAATCGTTGACGATAATAGCAATACAATCGCACATACATTAATAGAGAAAAAGAAAAAAGATGGCAAAGATATTCAACTAACTATTGATGCTAAAGTTCAAAAGAGTATTTATAACAACATGAAAAATGATTATGGCTCAGGTACTGCTATCCACCCTCAAACAGGTGAATTATTAGCACTTGTAAGCACACCTTCATATGACGTCTATCCATTTATGTATGGCATGAGTAACGAAGAATATAATAAATTAACCGAAGATAAAAAAGAACCTCTGCTCAACAAGTTCCAGATTACAACTTCACCAGGTTCAACTCAAAAAATATTAACAGCAATGATTGGGTTAAATAACAAAACATTAGACGATAAAACAAGTTATAAAATCGATGGTAAAGGTTGGCAAAAAGATAAATCTTGGGGTGGTTACAACGTTACAAGATATGAAGTGGTAAATGGTAATATCGACTTAAAACAAGCAATAGAATCATCAGATAACATTTTCTTTGCTAGAGTAGCACTCGAATTAGGCAGTAAGAAATTTGAAAAAGGCATGAAAAAACTAGGTGTTGGTGAAGATATACCAAGTGATTATCCATTTTATAATGCTCAAATTTCAAACAAAAATTTAGATAATGAAATATTATTAGCTGATTCAGGTTACGGACAAGGTGAAATACTGATTAACCCAGTACAGATCCTTTCAATCTATAGCGCATTAGAAAATAATGGCAATATTAACGCACCTCACTTATTAAAAGACACGAAAAACAAAGTTTGGAAGAAAAATATTATTTCCAAAGAAAATATCAATCTATTAACTGATGGTATGCAACAAGTCGTAAATAAAACACATAAAGAAGATATTTATAGATCTTATGCAAACTTAATTGGCAAATCCGGTACTGCAGAACTCAAAATGAAACAAGGAGAAACTGGCAGACAAATTGGGTGGTTTATATCATATGATAAAGATAATCCAAACATGATGATGGCTATTAATGTTAAAGATGTACAAGATAAAGGAATGGCTAGCTACAATGCCAAAATCTCAGGTAAAGTGTATGATGAGCTATATGAGAACGGTAATAAAAAATACGATATAGATGAATAACAAAACAGTGAAGCAATCCGTAACGATGGTTGCTTCACTGTTTTATTATGAATTATTAATAAGTGCTGTTACTTCTCCCTTAAATACAATTTCTTCATTT +2107 ``` +We confirmed that the length of the gene matches what we'd expect for _mecA_. +In this case, there is only one sequence that spans the entire length of the gene. +After this gene was sequenced, all of the reads were assembled together into a single consensus sequence. +_mecA_ is a well-characterized gene, so there are no ambiguous regions, and there is no need for there to be multiple contigs +(AKA for the gene to be broken up into multiple pieces, as we know exactly how the reads should be assembled together.). -In this case, there is only one sequence. -Let's try reading in a larger fastq file. +Let's try reading in a larger FASTQ file. -The raw reads for a Staphylococcus aureus isolate was sequenced with Pac-Bio and uploaded to NCBI [here](https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR12147540). -The link to download the raw fastq's can be found [here](https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR12147540). +The raw reads for a _Staphylococcus aureus_ isolate were sequenced with PacBio and uploaded to NCBI [here](https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR12147540). +The link to download the raw FASTQ files can be found [here](https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR12147540). -The biosample ID for this sample is `SAMN14830786`. +The BioSample ID for this sample is `SAMN14830786`. This ID refers to the physical bacterial isolate. -The SRA sample accession number (an internal value used within Sequence Read Archive) is `SRS6947643`. +The SRA sample accession number (an internal value used within the Sequence Read Archive) is `SRS6947643`. Both values correspond to one another and are helpful identifiers. The SRR (sample run accession number) is the unique identifier within SRA -and corresponds to the specific sequencing run within SRA. +and corresponds to the specific sequencing run. -In a later tutorial, we will discuss how we can download this file within julia with the SRR. +In a later tutorial, we will discuss how to download this file in Julia using the SRR. But for now, the file can be downloaded using curl @@ -164,10 +179,11 @@ julia> records[1:10] All of the nucleotides in all of the reads have a quality score of `$`, which corresponds to a probabilty of error of 0.50119. More information about how to convert ASCII values to quality scores [here](https://people.duke.edu/~ccc14/duke-hts-2018/bioinformatics/quality_scores.html). This would be quite poor if we were looking at Illumia data. - However, because of how Pacbio chemistry works, - the quality scores are often flattened and there is simply a placeholder value on this line. - This does not mean our reads are trash! + However, because of how PacBio chemistry works, + quality scores are often flattened and there is simply a placeholder value on this line. + This does not mean our reads are low quality! Now that we've learned how to read files in and manipulate them a bit, -let's see if we can align the mecA gene to the Staphylococcus aureus genome. +let's see if we can align the _mecA_ gene to the _Staphylococcus aureus_ genome. +This will tell us if this _S. aureus_ is MRSA. From 06552476e187f72f06807d8872ef4fdfeba3e7df Mon Sep 17 00:00:00 2001 From: Danielle Pinto Date: Tue, 10 Mar 2026 22:13:07 -0400 Subject: [PATCH 4/4] switch example fastq to illumina sample --- cookbook/sequences.md | 111 ++++++++++++++++++++++++++++++------------ 1 file changed, 80 insertions(+), 31 deletions(-) diff --git a/cookbook/sequences.md b/cookbook/sequences.md index 534e28f..4b6e9d8 100644 --- a/cookbook/sequences.md +++ b/cookbook/sequences.md @@ -113,28 +113,42 @@ _mecA_ is a well-characterized gene, so there are no ambiguous regions, and ther Let's try reading in a larger FASTQ file. -The raw reads for a _Staphylococcus aureus_ isolate were sequenced with PacBio and uploaded to NCBI [here](https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR12147540). -The link to download the raw FASTQ files can be found [here](https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR12147540). +The raw reads for a _Staphylococcus aureus_ isolate were sequenced with Illumina and uploaded to NCBI [here](https://trace.ncbi.nlm.nih.gov/Traces/index.html?run=SRR1050625). +The link to download the raw FASTQ files can be found [here](https://trace.ncbi.nlm.nih.gov/Traces/index.html?run=SRR1050625). -The BioSample ID for this sample is `SAMN14830786`. +The BioSample ID for this sample is `SAMN02360768`. This ID refers to the physical bacterial isolate. - -The SRA sample accession number (an internal value used within the Sequence Read Archive) is `SRS6947643`. - +The SRA sample accession number (an internal value used within the Sequence Read Archive) is `SRS515580`. Both values correspond to one another and are helpful identifiers. The SRR (sample run accession number) is the unique identifier within SRA -and corresponds to the specific sequencing run. - -In a later tutorial, we will discuss how to download this file in Julia using the SRR. - -But for now, the file can be downloaded using curl +and corresponds to the specific sequencing run. +There are two sequencing runs and accession numbers for this sample, +but we'll select one to download: `SRX392511`. +This file can be downloaded on the command line using `curl`. +> One of the nice things about julia is that it is +> super easy to toggle between the julia REPL and +> bash shell by typing `;` to access shell mode +> from julia. +> You can use the `backspace`/`delete` key to +> quickly toggle back to the julia REPL. +To download using the command line, type: ``` curl -L --retry 5 --retry-delay 2 \ - "https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=SRR12147540" \ - | gzip -c > SRR12147540.fastq.gz + "https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=SRX392511" \ + | gzip -c > SRX392511.fastq.gz +``` +Alternatively, command line code can be executed from within julia: + +```julia +run(pipeline( + `curl -L --retry 5 --retry-delay 2 "https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=SRX392511"`, + `gzip -c`, + "SRX392511.fastq.gz" + ) +) ``` This file is gzipped, so we'll need to account for that as we are reading it in. @@ -145,7 +159,7 @@ using CodecZlib records = [] -FASTQReader(GzipDecompressorStream(open("assets/SRR12147540.fastq.gz"))) do reader +FASTQReader(GzipDecompressorStream(open("assets/SRX392511.fastq.gz"))) do reader for record in reader push!(records, record) end @@ -156,7 +170,7 @@ We can see how many reads there are by looking at the length of `records`. ```julia julia> length(records) -163528 +9852716 ``` Let's take a look at what the first 10 reads look like: @@ -164,24 +178,59 @@ Let's take a look at what the first 10 reads look like: ``` julia> records[1:10] 10-element Vector{Any}: - FASTX.FASTQ.Record("SRR12147540.1 length=43", "TTTTTTTTCCTTTCTTTCT…", "$$$$$$$$$$$$$$$$$$$…") - FASTX.FASTQ.Record("SRR12147540.2 length=40", "TTTGTTTTTTTTTGTTTTC…", "$$$$$$$$$$$$$$$$$$$…") - FASTX.FASTQ.Record("SRR12147540.3 length=32", "TTTTTTTTTGTTCTTTGGT…", "$$$$$$$$$$$$$$$$$$$…") - FASTX.FASTQ.Record("SRR12147540.4 length=40", "TTTTCTTTTCCTTCTTTTC…", "$$$$$$$$$$$$$$$$$$$…") - FASTX.FASTQ.Record("SRR12147540.5 length=55", "TGTTGTTTGTGTCTCGTTT…", "$$$$$$$$$$$$$$$$$$$…") - FASTX.FASTQ.Record("SRR12147540.6 length=166", "TTTCCTTTTTTTTCCTCTC…", "$$$$$$$$$$$$$$$$$$$…") - FASTX.FASTQ.Record("SRR12147540.7 length=338", "TGACCACCTTAGAACTTGG…", "$$$$$$$$$$$$$$$$$$$…") - FASTX.FASTQ.Record("SRR12147540.8 length=245", "ACGCCGCGGCCAAAGAACG…", "$$$$$$$$$$$$$$$$$$$…") - FASTX.FASTQ.Record("SRR12147540.9 length=157", "TTTGTTTGCGCGGTCTCTT…", "$$$$$$$$$$$$$$$$$$$…") - FASTX.FASTQ.Record("SRR12147540.10 length=100", "TTCTTTCGCCTTTTTGCCT…", "$$$$$$$$$$$$$$$$$$$…") + FASTX.FASTQ.Record("SRX392511.1 1 length=101", "AGGATTTTGTTATATTGTA…", "$$$$$$$$$$$$$$$$$$$…") + FASTX.FASTQ.Record("SRX392511.1 1 length=101", "AGCATAAATTTAAAAAAAA…", "$$$$$$$$$$$$$$$$$$$…") + FASTX.FASTQ.Record("SRX392511.2 2 length=101", "TAGATTAAAATTCTCGTAT…", "???????????????????…") + FASTX.FASTQ.Record("SRX392511.2 2 length=101", "TAGATTAAAATTCTCGTAG…", "???????????????????…") + FASTX.FASTQ.Record("SRX392511.3 3 length=101", "GAGCAGTAGTATAAAATGA…", "???????????????????…") + FASTX.FASTQ.Record("SRX392511.3 3 length=101", "CCGACACAATTACAAGCCA…", "???????????????????…") + FASTX.FASTQ.Record("SRX392511.4 4 length=101", "GATTATCTAGTCATAATTC…", "???????????????????…") + FASTX.FASTQ.Record("SRX392511.4 4 length=101", "TTCGCCCCCCCAAAAGGCT…", "???????????????????…") + FASTX.FASTQ.Record("SRX392511.5 5 length=101", "ATAAAATGAACTTGCGTTA…", "???????????????????…") + FASTX.FASTQ.Record("SRX392511.5 5 length=101", "TAACTTGTGGATAATTATT…", "???????????????????…") ``` - All of the nucleotides in all of the reads have a quality score of `$`, which corresponds to a probabilty of error of 0.50119. - More information about how to convert ASCII values to quality scores [here](https://people.duke.edu/~ccc14/duke-hts-2018/bioinformatics/quality_scores.html). - This would be quite poor if we were looking at Illumia data. - However, because of how PacBio chemistry works, - quality scores are often flattened and there is simply a placeholder value on this line. - This does not mean our reads are low quality! + All of the nucleotides in these two reads have a quality score of `$`, which corresponds to a probabilty of error of 0.50119. + This is not concerning, as it is typical for the first couple of reads to be low quality. + + Let's calculate the average quality across all reads. + We'll do this by converting each PHRED score to its probability error, + and summing these values across all sequences in all reads. + Then, we can divide this sum by the total number of bases + to get the average probability error for the entire sequence. + It's important that we first convert the PHRED scores to the probability errors before averaging, +because PHRED is a logarithmic transform of error probability. +Therefore, simply averaging PHRED and then converting to error probability is not the same as converting PHRED to error probability and averaging that sum. + +```julia +function phred_to_prob(Q) + # The formula is Pe = 10^(-Q/10) + return 10^(-Q / 10.0) +end + +# get sum of all error probabilities +total_error_prob = sum( + sum(phred_to_prob(q) for q in FASTQ.quality_scores(record)) + for record in records +) +4.5100356107423276e7 +# get total number of base pairs +total_bases = sum(length(FASTQ.quality_scores(record)) for record in records) +995124316 + +# get average error probability +total_error_prob/total_bases +0.045321328584059316 +``` + +The average error probability is just 4.53%, +meaning the majority of reads are high quality. + + More information about how to convert ASCII values/PHRED scores to quality scores [here](https://people.duke.edu/~ccc14/duke-hts-2018/bioinformatics/quality_scores.html). + + + + Now that we've learned how to read files in and manipulate them a bit, let's see if we can align the _mecA_ gene to the _Staphylococcus aureus_ genome. This will tell us if this _S. aureus_ is MRSA.