Hello,
I have a question about the position and read lengths in the read headers produced by NanoSim. I am using the human_giab_hg002_sub1M_kitv14_dorado_v3.2.1 pretrained model to simulate sequences from hg002, hg003 and hg004. I observed some interesting behaviour when mapping the simulated reads and wanted to plot the amount of coverage the reads should provide based on where they are sampled from and compare that to what I get when I actually map the reads.
I thought I could use the position in the header as a start point, and the sum of the position and the length of the alignable middle region as the end point of where this read was obtained from in the original genome. When I try that, however, I get end positions for some of the reads that are larger than the chromosome they were sampled from. I can also see positions that are higher than the entire chromosome size.
Is this expected behaviour? If so, can you help me better understand what the position and read lengths mean?
I am also wondering what the position means for reverse-strand reads: is it the start or the end point?
Below are the first few lines of some of my reads that show this pattern:

I created this in R using the code below. The input comb table was created by reading in all the headers from a NanoSim file and processing them, and chr_sizes is another table with the lengths of the chromosomes in my reference genome (hg38).
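library(dplyr)

comb %>%
  # add each chromosome's length (column size); joins on the shared column(s)
  left_join(chr_sizes) %>%
  # keep reads whose computed end lies past the end of the chromosome
  filter(pos + len_middle_region > size) %>%
  # negative len_diff = number of bases past the chromosome end
  mutate(len_diff = size - pos - len_middle_region) %>%
  arrange(len_diff)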
I'd really appreciate any insights you could give me. Thanks!