  # Multi-omics Knowledge Graph QA Generation

- This example demonstrates how to build knowledge graphs from multi-omics data (DNA, RNA, protein) and generate question-answer pairs using different QA generation methods.
+ This example demonstrates how to build knowledge graphs from multi-omics data (DNA, RNA, protein) and generate question-answer pairs using the unified `omics_qa` method.

  ## Pipeline Overview

  The pipeline includes the following steps:

- 1. **read**: Read input files (JSONL format with sequence queries)
- 2. **search**: Search biological databases (NCBI for DNA, RNAcentral for RNA, UniProt for protein)
+ 1. **read**: Read input files (JSON/JSONL format with sequence queries or protein data)
+ 2. **search**: Search biological databases (NCBI for DNA, RNAcentral for RNA, UniProt for protein) - *optional if input already contains search results*
  3. **chunk**: Chunk sequences and metadata
  4. **build_kg**: Extract entities and relationships to build knowledge graph
- 5. **quiz** (optional): Generate quiz questions for KG nodes and edges
- 6. **judge** (optional): Judge the correctness of KG descriptions
- 7. **partition**: Partition the knowledge graph into communities
- 8. **generate**: Generate QA pairs from partitioned communities
-
- ## Available QA Generation Methods
-
- This example provides configurations for different QA generation methods:
-
- ### 1. Atomic QA (`omics_atomic_config.yaml`)
- - **Method**: `atomic`
- - **Format**: Alpaca
- - **Partition**: DFS with max_units_per_community=1
- - **Use case**: Simple, single-fact questions
- - **Run**: `./generate_omics_atomic.sh`
-
- ### 2. Aggregated QA (`omics_aggregated_config.yaml`)
- - **Method**: `aggregated`
- - **Format**: ChatML
- - **Partition**: ECE with comprehension loss
- - **Includes**: quiz and judge steps
- - **Use case**: Comprehensive questions covering multiple facts
- - **Run**: `./generate_omics_aggregated.sh`
-
- ### 3. Chain of Thought (CoT) QA (`omics_cot_config.yaml`)
- - **Method**: `cot`
- - **Format**: ShareGPT
- - **Partition**: Leiden algorithm
- - **Use case**: Questions requiring step-by-step reasoning
- - **Run**: `./generate_omics_cot.sh`
-
- ### 4. Multi-hop QA (`omics_multi_hop_config.yaml`)
- - **Method**: `multi_hop`
- - **Format**: ChatML
- - **Partition**: ECE with random sampling
- - **Use case**: Questions requiring reasoning across multiple KG relationships
- - **Run**: `./generate_omics_multi_hop.sh`
+ 5. **partition**: Partition the knowledge graph into communities using anchor-based BFS
+ 6. **generate**: Generate QA pairs from partitioned communities with automatic molecule caption extraction
+
+ ## Key Features
+
+ - **Unified QA Generation**: Single `omics_qa` method supports DNA, RNA, and Protein
+ - **Automatic Caption Extraction**: Automatically extracts and attaches molecule-specific information (dna/rna/protein captions) to each QA pair
+ - **Flexible Configuration**: Easy to switch between DNA, RNA, and Protein by changing input file and data source
+ - **Anchor-based Partitioning**: Uses molecule type as anchor for BFS partitioning (dna/rna/protein)

  ## Quick Start

  ### 1. Configure Input Data

- Edit the config file to set:
- - **Input file**: Change `input_path` in the `read_files` node
-   - DNA: `examples/input_examples/search_dna_demo.jsonl`
-   - RNA: `examples/input_examples/search_rna_demo.jsonl`
-   - Protein: `examples/input_examples/search_protein_demo.jsonl`
+ Edit `omics_qa_config.yaml` to set the input file path:
+
+ **For DNA:**
+ ```yaml
+ input_path:
+   - examples/input_examples/search_dna_demo.jsonl
+ ```
+
+ **For RNA:**
+ ```yaml
+ input_path:
+   - examples/input_examples/search_rna_demo.jsonl
+ ```
+
+ **For Protein:**
+ ```yaml
+ input_path:
+   - examples/input_examples/search_protein_demo.jsonl
+ ```

  ### 2. Configure Data Source

- Set the appropriate data source and parameters:
+ Set the appropriate data source and parameters in the `search_data` node:

  **For DNA (NCBI):**
  ```yaml
  data_sources: [ncbi]
  ncbi_params:
    email: your_email@example.com # Required!
    tool: GraphGen
-   use_local_blast: false
+   use_local_blast: true
+   local_blast_db: refseq_release/refseq_release
+   blast_num_threads: 2
    max_concurrent: 5
  ```

  **For RNA (RNAcentral):**
  ```yaml
  data_sources: [rnacentral]
  rnacentral_params:
-   use_local_blast: false
+   use_local_blast: true
+   local_blast_db: rnacentral_ensembl_gencode_YYYYMMDD/ensembl_gencode_YYYYMMDD
+   blast_num_threads: 2
    max_concurrent: 5
  ```

  **For Protein (UniProt):**
  ```yaml
  data_sources: [uniprot]
  uniprot_params:
-   use_local_blast: false
+   use_local_blast: true
+   local_blast_db: /your_path/2024_01/uniprot_sprot
+   blast_num_threads: 2
    max_concurrent: 5
  ```

- ### 3. Run the Pipeline
+ ### 3. Configure Anchor Type

- Use individual scripts for each QA method:
-
- ```bash
- # Atomic QA
- ./generate_omics_atomic.sh
+ Set the `anchor_type` in the `partition` node to match your molecule type:

- # Aggregated QA (includes quiz & judge)
- ./generate_omics_aggregated.sh
+ ```yaml
+ partition:
+   params:
+     method: anchor_bfs
+     method_params:
+       anchor_type: protein # Change to "dna" or "rna" as needed
+       max_units_per_community: 10
+ ```

- # Chain of Thought QA
- ./generate_omics_cot.sh
+ ### 4. Run the Pipeline

- # Multi-hop QA
- ./generate_omics_multi_hop.sh
+ ```bash
+ ./generate_omics_qa.sh
  ```

- #### Direct Python Command
-
  Or run directly with Python:

  ```bash
  python3 -m graphgen.run \
-   --config_file examples/generate/generate_omics_qa/omics_atomic_config.yaml \
+   --config_file examples/generate/generate_omics_qa/omics_qa_config.yaml \
    --output_dir cache/
  ```

  ## Input Format

- Input files should be JSONL format with one query per line:
-
+ ### For DNA/RNA (JSONL format):
  ```jsonl
  {"type": "text", "content": "BRCA1"}
  {"type": "text", "content": ">query\nATGCGATCG..."}
  {"type": "text", "content": "ATGCGATCG..."}
  ```

+ ### For Protein (JSONL format):
+ ```jsonl
+ {"type": "text", "content": "P01308"}
+ {"type": "text", "content": "insulin"}
+ {"type": "text", "content": "MHHHHHHSSGVDLGTENLYFQSNAMDFPQQLEACVKQANQALSRFIAPLPFQNTPVVETMQYGALLGGKRLRPFLVYATGHMFGVSTNTLDAPAAAVECIHAYSLIHDDLPAMDDDDLRRGLPTCHVKFGEANAILAGDALQTLAFSILSDANMPEVSDRDRISMISELASASGIAGMCGGQALDLDAEGKHVPLDALERIHRHKTGALIRAAVRLGALSAGDKGRRALPVLDKYAESIGLAFQVQDDILDVVGDTATLGKRQGADQQLGKSTYPALLGLEQARKKARDLIDDARQALKQLAEQSLDTSALEALADYIIQRNK"}
+ ```
+
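A minimal sketch of producing and checking an input file in the JSONL shape shown above. The helper names `write_queries` and `load_queries` and the file name `my_queries.jsonl` are illustrative and not part of GraphGen; the only thing taken from the examples is that each line is a JSON object with `type` and `content` keys.

```python
import json
import os
import tempfile


def write_queries(path, queries):
    """Write one {"type": "text", "content": ...} object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for query in queries:
            handle.write(json.dumps({"type": "text", "content": query}) + "\n")


def load_queries(path):
    """Read the file back, rejecting lines that lack the expected keys."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            if not {"type", "content"} <= record.keys():
                raise ValueError(f"line {line_number}: missing 'type' or 'content'")
            records.append(record)
    return records


# An accession, a keyword, and a (shortened) raw sequence, mirroring the protein example.
demo_path = os.path.join(tempfile.mkdtemp(), "my_queries.jsonl")
write_queries(demo_path, ["P01308", "insulin", "MHHHHHHSSGVDLGTENLYFQ"])
demo_records = load_queries(demo_path)
print(len(demo_records), demo_records[0])
```

Running a check like this before the `read` step can catch malformed lines early, before any database search is attempted.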
+ ## Output Format
+
+ The `omics_qa` method automatically extracts and attaches molecule-specific captions to QA pairs:
+
+ ### Alpaca Format:
+ ```json
+ {
+   "instruction": "What is the function of this protein?",
+   "input": "",
+   "output": "The protein functions as...",
+   "dna": {...}, # DNA caption (if molecule_type is DNA)
+   "rna": {...}, # RNA caption (if molecule_type is RNA)
+   "protein": {...} # Protein caption (if molecule_type is protein)
+ }
+ ```
+
+ ### ChatML Format:
+ ```json
+ {
+   "messages": [
+     {
+       "role": "user",
+       "content": [
+         {
+           "text": "What is the function of this protein?",
+           "dna": {...},
+           "rna": {...},
+           "protein": {...}
+         }
+       ]
+     },
+     {
+       "role": "assistant",
+       "content": "The protein functions as..."
+     }
+   ]
+ }
+ ```
+
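A minimal sketch of reading one Alpaca-format record and picking out whichever caption is attached. It assumes each record is a single JSON object and that captions for other molecule types are either absent or empty; neither detail is confirmed by the README. The helper `extract_caption` is hypothetical and the caption values are placeholders, not real pipeline output.

```python
import json

MOLECULE_KEYS = ("dna", "rna", "protein")


def extract_caption(record):
    """Return (molecule_type, caption) for the first non-empty caption key."""
    for key in MOLECULE_KEYS:
        caption = record.get(key)
        if caption:
            return key, caption
    return None, None


# Illustrative record only; field values are placeholders.
line = json.dumps({
    "instruction": "What is the function of this protein?",
    "input": "",
    "output": "The protein functions as...",
    "protein": {"protein_name": "EXAMPLE", "organism": "EXAMPLE"},
})
molecule_type, caption = extract_caption(json.loads(line))
print(molecule_type, sorted(caption))
```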
+ ## Caption Information
+
+ The generator automatically extracts relevant caption information based on molecule type:
+
+ - **DNA**: gene_name, gene_description, organism, chromosome, genomic_location, function, gene_type, etc.
+ - **RNA**: rna_type, description, organism, related_genes, gene_name, so_term, modifications, etc.
+ - **Protein**: protein_name, gene_names, organism, function, sequence, entry_name, etc.
+
  ## Configuration Options

  ### Chunking Parameters
@@ -134,34 +178,39 @@ Input files should be JSONL format with one query per line:
  - `sequence_chunk_size`: Size for sequence chunks (default: 1000)
  - `sequence_chunk_overlap`: Overlap for sequence chunks (default: 100)

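To illustrate how a chunk size and an overlap interact, here is a generic sliding-window sketch using the two defaults listed above. It is not GraphGen's chunker; `chunk_sequence` is a hypothetical helper, and the real implementation may handle boundaries differently.

```python
def chunk_sequence(sequence, chunk_size=1000, chunk_overlap=100):
    """Split a sequence into windows of chunk_size that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(sequence), step):
        chunks.append(sequence[start:start + chunk_size])
        if start + chunk_size >= len(sequence):
            break  # the last window already reaches the end
    return chunks


chunks = chunk_sequence("ACGT" * 600)  # a 2400-base toy sequence
print([len(chunk) for chunk in chunks])
```

With the defaults, each window starts 900 positions after the previous one, so the last 100 characters of one chunk are repeated at the start of the next.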
- ### Partition Methods
- - `dfs`: Depth-first search partitioning
- - `bfs`: Breadth-first search partitioning
- - `ece`: Error Comprehension Estimation (requires quiz & judge)
- - `leiden`: Leiden community detection algorithm
+ ### Partition Parameters
+ - `method`: `anchor_bfs` (recommended for omics data)
+ - `anchor_type`: `dna`, `rna`, or `protein` (must match your data type)
+ - `max_units_per_community`: Maximum nodes and edges per community (default: 10)

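The idea behind anchor-based BFS partitioning can be sketched as follows: every node whose type equals `anchor_type` seeds one community, which then grows breadth-first until adding another node or edge would exceed `max_units_per_community`. This is an assumption-laden illustration, not GraphGen's `anchor_bfs` implementation: the graph representation, the function name, and the choice to let communities overlap are all made up for the example.

```python
from collections import deque


def anchor_bfs(nodes, edges, anchor_type, max_units_per_community=10):
    """Grow one community per anchor node; a unit is one node or one edge.

    nodes: dict mapping node id -> node type (e.g. "protein", "organism")
    edges: iterable of (source, target) pairs, treated as undirected
    """
    adjacency = {node: [] for node in nodes}
    for source, target in edges:
        adjacency[source].append(target)
        adjacency[target].append(source)

    communities = []
    for anchor, node_type in nodes.items():
        if node_type != anchor_type:
            continue
        members, member_edges = [anchor], []
        seen_nodes, seen_edges = {anchor}, set()
        queue = deque([anchor])
        while queue:
            current = queue.popleft()
            for neighbor in adjacency[current]:
                edge = frozenset((current, neighbor))
                if edge in seen_edges:
                    continue
                # A new neighbor costs two units (node + edge); a known one costs one.
                cost = 1 if neighbor in seen_nodes else 2
                if len(members) + len(member_edges) + cost > max_units_per_community:
                    continue
                seen_edges.add(edge)
                member_edges.append((current, neighbor))
                if neighbor not in seen_nodes:
                    seen_nodes.add(neighbor)
                    members.append(neighbor)
                    queue.append(neighbor)
        communities.append({"anchor": anchor, "nodes": members, "edges": member_edges})
    return communities


# Toy graph: two proteins sharing an organism, each linked to its own gene.
toy_nodes = {"P1": "protein", "G1": "gene", "O1": "organism", "P2": "protein", "G2": "gene"}
toy_edges = [("P1", "G1"), ("P1", "O1"), ("P2", "O1"), ("P2", "G2")]
small = anchor_bfs(toy_nodes, toy_edges, "protein", max_units_per_community=3)
print([community["nodes"] for community in small])
```

In this sketch a mismatched `anchor_type` (for example `dna` on a protein graph) simply yields no communities, which gives one way to picture why the anchor type has to match the data.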
- ### QA Generation Methods
- - `atomic`: Single-fact questions
- - `aggregated`: Multi-fact comprehensive questions
- - `cot`: Chain of thought reasoning questions
- - `multi_hop`: Multi-hop reasoning questions
- - `vqa`: Visual question answering (not applicable for sequences)
-
- ### Output Formats
- - `Alpaca`: Alpaca instruction format
- - `ChatML`: ChatML conversation format
- - `Sharegpt`: ShareGPT format
-
- ## Output
-
- The pipeline generates:
- - Knowledge graph with biological entities (genes, RNAs, proteins, organisms, etc.) and relationships
- - QA pairs in the specified format (ChatML, Alpaca, or ShareGPT)
- - Output location: `cache/` directory (configurable via `working_dir`)
+ ### Generation Parameters
+ - `method`: `omics_qa` (unified method for DNA/RNA/Protein)
+ - `data_format`: `Alpaca`, `ChatML`, or `Sharegpt`

  ## Notes

  - **NCBI requires an email address** - Make sure to set `email` in `ncbi_params`
- - **Quiz & Judge steps** are only included in the aggregated config (required for ECE partition with loss-based sampling)
+ - **Anchor type must match molecule type** - Set `anchor_type` to match your data (dna/rna/protein)
  - **Local BLAST** can be enabled if you have local databases set up (see `examples/search/build_db/`)
+ - **Caption extraction** is automatic - The generator detects molecule type and extracts relevant caption information
  - Adjust `max_concurrent` based on your system resources and API rate limits
+
+ ## Examples
+
+ ### Generate QA for Protein Data
+ 1. Set `input_path` to `examples/input_examples/search_protein_demo.jsonl`
+ 2. Set `data_sources: [uniprot]`
+ 3. Set `anchor_type: protein`
+ 4. Run `./generate_omics_qa.sh`
+
+ ### Generate QA for DNA Data
+ 1. Set `input_path` to `examples/input_examples/search_dna_demo.jsonl`
+ 2. Set `data_sources: [ncbi]`
+ 3. Set `anchor_type: dna`
+ 4. Run `./generate_omics_qa.sh`
+
+ ### Generate QA for RNA Data
+ 1. Set `input_path` to `examples/input_examples/search_rna_demo.jsonl`
+ 2. Set `data_sources: [rnacentral]`
+ 3. Set `anchor_type: rna`
+ 4. Run `./generate_omics_qa.sh`
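The three recipes above differ only in three settings, which can be summarised as a lookup table. The dictionary and the `settings_for` helper are hypothetical conveniences for keeping the values consistent; the paths, data sources, and anchor types are the ones listed in the steps, and the YAML config still has to be edited separately.

```python
OMICS_PRESETS = {
    "dna": {
        "input_path": "examples/input_examples/search_dna_demo.jsonl",
        "data_sources": ["ncbi"],
        "anchor_type": "dna",
    },
    "rna": {
        "input_path": "examples/input_examples/search_rna_demo.jsonl",
        "data_sources": ["rnacentral"],
        "anchor_type": "rna",
    },
    "protein": {
        "input_path": "examples/input_examples/search_protein_demo.jsonl",
        "data_sources": ["uniprot"],
        "anchor_type": "protein",
    },
}


def settings_for(molecule_type):
    """Return the three values that must change together for a molecule type."""
    key = molecule_type.lower()
    if key not in OMICS_PRESETS:
        raise ValueError(f"unknown molecule type: {molecule_type!r}")
    return OMICS_PRESETS[key]


for name, value in settings_for("RNA").items():
    print(f"{name}: {value}")
```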