|
1 | | -ROOT scripts to convert a SAM file to a RAM (ROOT Alignment/Map) file and to work on RAM files. |
| 1 | +# RAMTools - ROOT Alignment/Map Format Tools |
2 | 2 |
|
3 | | - - To convert a SAM file to a RAM file do: |
| 3 | +RAMTools provides tools for converting SAM files to RAM (ROOT Alignment/Map), a format built on ROOT's columnar RNTuple, and for working with the resulting genomic alignment data. |
4 | 4 |
|
| 5 | +## Features |
| 6 | + |
| 7 | +- High-performance SAM to RAM conversion with an RNTuple backend |
| 8 | +- Chromosome-based splitting for parallel processing |
| 9 | +- Region-based querying capabilities |
| 10 | + |
| 11 | +## Requirements |
| 12 | + |
| 13 | +- ROOT 6.26+ |
| 14 | +- C++17 compatible compiler |
| 15 | +- CMake 3.16+ |
| 16 | + |
| 17 | +## Quick Start |
| 18 | +```bash |
| 19 | +# 1. Build the tools |
| 20 | +mkdir build && cd build |
| 21 | +cmake .. |
| 22 | +make -j$(nproc) |
| 23 | + |
| 24 | +# 2. Convert a SAM file to the RAM format |
| 25 | +./tools/samtoramntuple ../test/samexample.sam output.root |
| 26 | + |
| 27 | +# 3. Query a specific region from the command line |
| 28 | +./tools/ramntupleview output.root "chr1:15700-15800" |
| 29 | +``` |
| 30 | + |
| 31 | +## Command-Line Tools |
| 32 | + |
| 33 | +The primary way to interact with RAMTools is through these command-line executables. |
| 34 | + |
| 35 | +### SAM to RAM Conversion |
| 36 | + |
| 37 | +Convert a standard SAM file into the optimized RNTuple-based RAM format. |
5 | 38 | ```bash |
6 | | - $ root |
7 | | - root [0] .x samtoram.C |
8 | | - root [1] .q |
| 39 | +# Basic conversion |
| 40 | +./tools/samtoramntuple input.sam output.root |
| 41 | + |
| 42 | +# Split by chromosome for parallel processing |
| 43 | +# (Creates output-chr1.root, output-chr2.root, etc.) |
| 44 | +./tools/samtoramntuple input.sam output -split |
9 | 45 | ``` |
10 | 46 |
|
11 | | - - To test read a RAM file do: |
| 47 | +### Region Querying (RNTuple) |
| 48 | + |
| 49 | +Query a specific genomic region from a RAM file, similar to `samtools view`. |
| 50 | +```bash |
| 51 | +# Usage: ./tools/ramntupleview <input.root> "<chromosome>:<start>-<end>" |
| 52 | +./tools/ramntupleview output.root "chr1:10150-10300" |
| 53 | +``` |
| 54 | + |
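The region argument follows the `chromosome:start-end` shape shown above. As a minimal sketch of how a wrapper script might validate such a string before calling `ramntupleview`, the helper below splits it into its three fields. `parse_region` is a hypothetical function written for illustration and is not part of RAMTools; it assumes the chromosome name contains no colon and that both coordinates are non-negative integers.

```shell
# Hypothetical helper (not part of RAMTools): split "<chromosome>:<start>-<end>"
# into three space-separated fields, or fail with a message on stderr.
parse_region() {
    region=$1
    # Require the overall shape <something>:<something>-<something>.
    case $region in
        *:*-*) ;;
        *) echo "error: expected <chromosome>:<start>-<end>, got '$region'" >&2
           return 1 ;;
    esac
    chrom=${region%%:*}   # text before the first colon
    range=${region#*:}    # text after the first colon
    start=${range%%-*}    # text before the first dash
    end=${range#*-}       # text after the first dash
    if [ -z "$chrom" ]; then
        echo "error: empty chromosome name in '$region'" >&2
        return 1
    fi
    # Both coordinates must be non-empty and contain digits only.
    for coord in "$start" "$end"; do
        case $coord in
            ''|*[!0-9]*)
                echo "error: non-numeric coordinate in '$region'" >&2
                return 1 ;;
        esac
    done
    if [ "$start" -gt "$end" ]; then
        echo "error: start is greater than end in '$region'" >&2
        return 1
    fi
    echo "$chrom $start $end"
}

parse_region "chr1:10150-10300"   # prints: chr1 10150 10300
```

A wrapper could call `parse_region "$2" >/dev/null || exit 1` before passing the original string on to `./tools/ramntupleview` unchanged.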
| 55 | +## Benchmark Results |
| 56 | + |
| 57 | +Tested with the HG00154 sample from the 1000 Genomes Project (196M reads, 72 GB SAM file): |
| 58 | + |
| 59 | + |
| 60 | + |
| 61 | +### Region Query Performance (LZMA Compression) |
| 62 | + |
| 63 | +| Format | Region | Time (s) | CPU (s) | Reads/sec | Total Reads | |
| 64 | +|--------|--------|----------|---------|-----------|-------------| |
| 65 | +| **TTree** | Small (100bp) | 4.18 | 1.50 | 4.7 | 7 | |
| 66 | +| | Gene (BRCA2) | 1.87 | 1.79 | 24,010 | 42,961 | |
| 67 | +| | 10Mb | 41.2 | 39.7 | 75,105 | 2,977,922 | |
| 68 | +| | 100Mb | 36.3 | 35.4 | 80,492 | 2,852,438 | |
| 69 | +| **RNTuple** | Small (100bp) | 4.10 | 2.55 | 2.7 | 7 | |
| 70 | +| | Gene (BRCA2) | 1.20 | 1.15 | 37,308 | 42,961 | |
| 71 | +| | 10Mb | 7.40 | 6.67 | 446,331 | 2,977,922 | |
| 72 | +| | 100Mb | 6.46 | 6.36 | 448,688 | 2,852,438 | |
| 73 | + |
| 74 | +### Region Query Performance (LZ4 Compression) |
| 75 | + |
| 76 | +| Format | Region | Time (s) | CPU (s) | Reads/sec | Total Reads | |
| 77 | +|--------|--------|----------|---------|-----------|-------------| |
| 78 | +| **TTree** | Small (100bp) | 3.23 | 1.86 | 3.8 | 7 | |
| 79 | +| | Gene (BRCA2) | 0.941 | 0.842 | 51,030 | 42,961 | |
| 80 | +| | 10Mb | 14.5 | 11.0 | 269,797 | 2,977,922 | |
| 81 | +| | 100Mb | 9.05 | 9.05 | 315,088 | 2,852,438 | |
| 82 | +| **RNTuple** | Small (100bp) | 2.98 | 1.67 | 4.2 | 7 | |
| 83 | +| | Gene (BRCA2) | 1.01 | 0.948 | 45,321 | 42,961 | |
| 84 | +| | 10Mb | 7.43 | 6.68 | 445,709 | 2,977,922 | |
| 85 | +| | 100Mb | 6.47 | 6.29 | 453,736 | 2,852,438 | |
| 86 | + |
| 87 | +### Region Query Performance (ZLIB Compression) |
| 88 | + |
| 89 | +| Format | Region | Time (s) | CPU (s) | Reads/sec | Total Reads | |
| 90 | +|--------|--------|----------|---------|-----------|-------------| |
| 91 | +| **TTree** | Small (100bp) | 4.19 | 2.31 | 3.0 | 7 | |
| 92 | +| | Gene (BRCA2) | 1.22 | 1.11 | 38,661 | 42,961 | |
| 93 | +| | 10Mb | 18.2 | 16.3 | 183,021 | 2,977,922 | |
| 94 | +| | 100Mb | 14.4 | 14.4 | 197,815 | 2,852,438 | |
| 95 | +| **RNTuple** | Small (100bp) | 2.85 | 1.73 | 4.0 | 7 | |
| 96 | +| | Gene (BRCA2) | 1.19 | 1.14 | 37,529 | 42,961 | |
| 97 | +| | 10Mb | 7.40 | 6.62 | 449,599 | 2,977,922 | |
| 98 | +| | 100Mb | 6.49 | 6.41 | 445,148 | 2,852,438 | |
| 99 | + |
| 100 | +**Key Findings**: |
| 101 | +- For large regions (10Mb and 100Mb), RNTuple is **1.4-2.5x faster** than TTree with LZ4 and ZLIB, and roughly **5.6x faster** in wall-clock time with LZMA |
| 102 | +- LZ4 gives TTree its best query performance; RNTuple throughput on large regions is nearly the same under all three algorithms (about 445,000-454,000 reads/sec) |
| 103 | +- For a 100Mb region query: RNTuple with LZ4 processes **453,736 reads/sec** vs **197,815 reads/sec** for TTree with ZLIB |
| 104 | + |
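The TTree-to-RNTuple speedup for any row can be recomputed from the wall-clock times in the tables above. The sketch below does this for the 100Mb row of each table; `speedup` is a hypothetical helper name, and the only inputs are the "Time (s)" values copied from the tables.

```shell
# Hypothetical helper: ratio of TTree time to RNTuple time, one decimal place.
speedup() {
    # $1 = TTree wall-clock time (s), $2 = RNTuple wall-clock time (s)
    awk -v t="$1" -v r="$2" 'BEGIN { printf "%.1f\n", t / r }'
}

speedup 36.3 6.46   # LZMA, 100Mb region: prints 5.6
speedup 9.05 6.47   # LZ4,  100Mb region: prints 1.4
speedup 14.4 6.49   # ZLIB, 100Mb region: prints 2.2
```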
| 105 | +## TTree Implementation (Legacy) |
| 106 | + |
| 107 | +ROOT scripts to convert a SAM file to a RAM (ROOT Alignment/Map) file using the older TTree format and to work with those files. |
| 108 | + |
| 109 | +### Convert SAM to RAM with TTree |
| 110 | +```bash |
| 111 | +$ root |
| 112 | +root [0] .x samtoram.C |
| 113 | +root [1] .q |
| 114 | +``` |
12 | 115 |
|
| 116 | +### Read a RAM file (TTree) |
13 | 117 | ```bash |
14 | | - $ root |
15 | | - root [0] .x ramreader.C |
16 | | - root [1] .q |
| 118 | +$ root |
| 119 | +root [0] .x ramreader.C |
| 120 | +root [1] .q |
17 | 121 | ``` |
18 | 122 |
|
19 | | - - To view a region, the equivalent of `samtools view bamexample.bam chr1:10150-10300`, do: |
| 123 | +### View a specific region (TTree) |
20 | 124 |
|
| 125 | +To view a region, the equivalent of `samtools view bamexample.bam chr1:10150-10300`: |
21 | 126 | ```bash |
22 | | - $ root |
23 | | - root [0] .x ramview.C("ramexample.root","chr1:10150-10300") |
24 | | - root [1] .q |
| 127 | +$ root |
| 128 | +root [0] .x ramview.C("ramexample.root","chr1:10150-10300") |
| 129 | +root [1] .q |
25 | 130 | ``` |
26 | 131 |
|