Commit dab4f1e

feat: expand protein-qa to omics (dna rna prot)
1 parent 5948b83 commit dab4f1e

23 files changed: +608 -759 lines changed
Lines changed: 139 additions & 90 deletions
@@ -1,131 +1,175 @@
11
# Multi-omics Knowledge Graph QA Generation
22

3-
This example demonstrates how to build knowledge graphs from multi-omics data (DNA, RNA, protein) and generate question-answer pairs using different QA generation methods.
3+
This example demonstrates how to build knowledge graphs from multi-omics data (DNA, RNA, protein) and generate question-answer pairs using the unified `omics_qa` method.
44

55
## Pipeline Overview
66

77
The pipeline includes the following steps:
88

9-
1. **read**: Read input files (JSONL format with sequence queries)
10-
2. **search**: Search biological databases (NCBI for DNA, RNAcentral for RNA, UniProt for protein)
9+
1. **read**: Read input files (JSON/JSONL format with sequence queries or protein data)
10+
2. **search**: Search biological databases (NCBI for DNA, RNAcentral for RNA, UniProt for protein) - *optional if input already contains search results*
1111
3. **chunk**: Chunk sequences and metadata
1212
4. **build_kg**: Extract entities and relationships to build knowledge graph
13-
5. **quiz** (optional): Generate quiz questions for KG nodes and edges
14-
6. **judge** (optional): Judge the correctness of KG descriptions
15-
7. **partition**: Partition the knowledge graph into communities
16-
8. **generate**: Generate QA pairs from partitioned communities
17-
18-
## Available QA Generation Methods
19-
20-
This example provides configurations for different QA generation methods:
21-
22-
### 1. Atomic QA (`omics_atomic_config.yaml`)
23-
- **Method**: `atomic`
24-
- **Format**: Alpaca
25-
- **Partition**: DFS with max_units_per_community=1
26-
- **Use case**: Simple, single-fact questions
27-
- **Run**: `./generate_omics_atomic.sh`
28-
29-
### 2. Aggregated QA (`omics_aggregated_config.yaml`)
30-
- **Method**: `aggregated`
31-
- **Format**: ChatML
32-
- **Partition**: ECE with comprehension loss
33-
- **Includes**: quiz and judge steps
34-
- **Use case**: Comprehensive questions covering multiple facts
35-
- **Run**: `./generate_omics_aggregated.sh`
36-
37-
### 3. Chain of Thought (CoT) QA (`omics_cot_config.yaml`)
38-
- **Method**: `cot`
39-
- **Format**: ShareGPT
40-
- **Partition**: Leiden algorithm
41-
- **Use case**: Questions requiring step-by-step reasoning
42-
- **Run**: `./generate_omics_cot.sh`
43-
44-
### 4. Multi-hop QA (`omics_multi_hop_config.yaml`)
45-
- **Method**: `multi_hop`
46-
- **Format**: ChatML
47-
- **Partition**: ECE with random sampling
48-
- **Use case**: Questions requiring reasoning across multiple KG relationships
49-
- **Run**: `./generate_omics_multi_hop.sh`
13+
5. **partition**: Partition the knowledge graph into communities using anchor-based BFS
14+
6. **generate**: Generate QA pairs from partitioned communities with automatic molecule caption extraction
15+
16+
## Key Features
17+
18+
- **Unified QA Generation**: Single `omics_qa` method supports DNA, RNA, and Protein
19+
- **Automatic Caption Extraction**: Automatically extracts and attaches molecule-specific information (dna/rna/protein captions) to each QA pair
20+
- **Flexible Configuration**: Switch between DNA, RNA, and Protein by changing the input file and the data source
21+
- **Anchor-based Partitioning**: Uses molecule type as anchor for BFS partitioning (dna/rna/protein)
5022

5123
## Quick Start
5224

5325
### 1. Configure Input Data
5426

55-
Edit the config file to set:
56-
- **Input file**: Change `input_path` in the `read_files` node
57-
- DNA: `examples/input_examples/search_dna_demo.jsonl`
58-
- RNA: `examples/input_examples/search_rna_demo.jsonl`
59-
- Protein: `examples/input_examples/search_protein_demo.jsonl`
27+
Edit `omics_qa_config.yaml` to set the input file path:
28+
29+
**For DNA:**
30+
```yaml
31+
input_path:
32+
- examples/input_examples/search_dna_demo.jsonl
33+
```
34+
35+
**For RNA:**
36+
```yaml
37+
input_path:
38+
- examples/input_examples/search_rna_demo.jsonl
39+
```
40+
41+
**For Protein:**
42+
```yaml
43+
input_path:
44+
- examples/input_examples/search_protein_demo.jsonl
45+
```
6046
6147
### 2. Configure Data Source
6248
63-
Set the appropriate data source and parameters:
49+
Set the appropriate data source and parameters in the `search_data` node:
6450

6551
**For DNA (NCBI):**
6652
```yaml
6753
data_sources: [ncbi]
6854
ncbi_params:
6955
email: your_email@example.com # Required!
7056
tool: GraphGen
71-
use_local_blast: false
57+
use_local_blast: true
58+
local_blast_db: refseq_release/refseq_release
59+
blast_num_threads: 2
7260
max_concurrent: 5
7361
```
7462

7563
**For RNA (RNAcentral):**
7664
```yaml
7765
data_sources: [rnacentral]
7866
rnacentral_params:
79-
use_local_blast: false
67+
use_local_blast: true
68+
local_blast_db: rnacentral_ensembl_gencode_YYYYMMDD/ensembl_gencode_YYYYMMDD
69+
blast_num_threads: 2
8070
max_concurrent: 5
8171
```
8272

8373
**For Protein (UniProt):**
8474
```yaml
8575
data_sources: [uniprot]
8676
uniprot_params:
87-
use_local_blast: false
77+
use_local_blast: true
78+
local_blast_db: /your_path/2024_01/uniprot_sprot
79+
blast_num_threads: 2
8880
max_concurrent: 5
8981
```
9082

91-
### 3. Run the Pipeline
83+
### 3. Configure Anchor Type
9284

93-
Use individual scripts for each QA method:
94-
95-
```bash
96-
# Atomic QA
97-
./generate_omics_atomic.sh
85+
Set the `anchor_type` in the `partition` node to match your molecule type:
9886

99-
# Aggregated QA (includes quiz & judge)
100-
./generate_omics_aggregated.sh
87+
```yaml
88+
partition:
89+
params:
90+
method: anchor_bfs
91+
method_params:
92+
anchor_type: protein # Change to "dna" or "rna" as needed
93+
max_units_per_community: 10
94+
```
10195

102-
# Chain of Thought QA
103-
./generate_omics_cot.sh
96+
### 4. Run the Pipeline
10497

105-
# Multi-hop QA
106-
./generate_omics_multi_hop.sh
98+
```bash
99+
./generate_omics_qa.sh
107100
```
108101

109-
#### Direct Python Command
110-
111102
Or run directly with Python:
112103

113104
```bash
114105
python3 -m graphgen.run \
115-
--config_file examples/generate/generate_omics_qa/omics_atomic_config.yaml \
106+
--config_file examples/generate/generate_omics_qa/omics_qa_config.yaml \
116107
--output_dir cache/
117108
```
118109

119110
## Input Format
120111

121-
Input files should be JSONL format with one query per line:
122-
112+
### For DNA/RNA (JSONL format):
123113
```jsonl
124114
{"type": "text", "content": "BRCA1"}
125115
{"type": "text", "content": ">query\nATGCGATCG..."}
126116
{"type": "text", "content": "ATGCGATCG..."}
127117
```
128118

119+
### For Protein (JSONL format):
120+
```jsonl
121+
{"type": "text", "content": "P01308"}
122+
{"type": "text", "content": "insulin"}
123+
{"type": "text", "content": "MHHHHHHSSGVDLGTENLYFQSNAMDFPQQLEACVKQANQALSRFIAPLPFQNTPVVETMQYGALLGGKRLRPFLVYATGHMFGVSTNTLDAPAAAVECIHAYSLIHDDLPAMDDDDLRRGLPTCHVKFGEANAILAGDALQTLAFSILSDANMPEVSDRDRISMISELASASGIAGMCGGQALDLDAEGKHVPLDALERIHRHKTGALIRAAVRLGALSAGDKGRRALPVLDKYAESIGLAFQVQDDILDVVGDTATLGKRQGADQQLGKSTYPALLGLEQARKKARDLIDDARQALKQLAEQSLDTSALEALADYIIQRNK"}
124+
```
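
The records above all share the same two keys, so an input file can be produced with a few lines of Python. The following is a minimal sketch, not part of GraphGen: the helper names `write_queries` and `read_queries` and the output file name are hypothetical. Using `json.dumps` ensures that a FASTA header followed by a newline is escaped as `\n` inside the JSON string, as in the DNA/RNA example.

```python
import json
from pathlib import Path


def write_queries(path, queries):
    """Write one {"type": "text", "content": ...} record per line (JSONL)."""
    lines = [json.dumps({"type": "text", "content": q}) for q in queries]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


def read_queries(path):
    """Read the records back, skipping blank lines."""
    text = Path(path).read_text(encoding="utf-8")
    return [json.loads(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    # Hypothetical file name; point `input_path` at it in the config.
    count = write_queries("my_protein_queries.jsonl", ["P01308", "insulin"])
    print(f"wrote {count} queries")
```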
125+
126+
## Output Format
127+
128+
The `omics_qa` method automatically extracts and attaches molecule-specific captions to QA pairs:
129+
130+
### Alpaca Format:
131+
```json
132+
{
133+
"instruction": "What is the function of this protein?",
134+
"input": "",
135+
"output": "The protein functions as...",
136+
"dna": {...}, # DNA caption (if molecule_type is DNA)
137+
"rna": {...}, # RNA caption (if molecule_type is RNA)
138+
"protein": {...} # Protein caption (if molecule_type is protein)
139+
}
140+
```
141+
142+
### ChatML Format:
143+
```json
144+
{
145+
"messages": [
146+
{
147+
"role": "user",
148+
"content": [
149+
{
150+
"text": "What is the function of this protein?",
151+
"dna": {...},
152+
"rna": {...},
153+
"protein": {...}
154+
}
155+
]
156+
},
157+
{
158+
"role": "assistant",
159+
"content": "The protein functions as..."
160+
}
161+
]
162+
}
163+
```
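
To make the relationship between the two layouts concrete, here is a small sketch that reshapes an Alpaca-style record into the ChatML-style layout shown above. It only illustrates the field mapping; it is not GraphGen's own formatter, the function name `alpaca_to_chatml` is hypothetical, and the caption contents in the usage example are made up.

```python
CAPTION_KEYS = ("dna", "rna", "protein")


def alpaca_to_chatml(record):
    """Reshape an Alpaca-style QA record into the ChatML-style layout.

    Caption fields present on the record are moved into the user message.
    """
    captions = {key: record[key] for key in CAPTION_KEYS if key in record}
    question = record["instruction"]
    if record.get("input"):
        # Alpaca keeps optional extra context in "input"; append it if set.
        question = f"{question}\n{record['input']}"
    user_content = {"text": question, **captions}
    return {
        "messages": [
            {"role": "user", "content": [user_content]},
            {"role": "assistant", "content": record["output"]},
        ]
    }


if __name__ == "__main__":
    example = {
        "instruction": "What is the function of this protein?",
        "input": "",
        "output": "The protein functions as...",
        "protein": {"protein_name": "example"},  # made-up caption
    }
    print(alpaca_to_chatml(example))
```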
164+
165+
## Caption Information
166+
167+
The generator automatically extracts relevant caption information based on molecule type:
168+
169+
- **DNA**: gene_name, gene_description, organism, chromosome, genomic_location, function, gene_type, etc.
170+
- **RNA**: rna_type, description, organism, related_genes, gene_name, so_term, modifications, etc.
171+
- **Protein**: protein_name, gene_names, organism, function, sequence, entry_name, etc.
172+
129173
## Configuration Options
130174

131175
### Chunking Parameters
@@ -134,34 +178,39 @@ Input files should be JSONL format with one query per line:
134178
- `sequence_chunk_size`: Size for sequence chunks (default: 1000)
135179
- `sequence_chunk_overlap`: Overlap for sequence chunks (default: 100)
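
As an illustration of how a chunk size and an overlap interact, the sketch below splits a sequence with a sliding window. It assumes that consecutive chunks share exactly `overlap` characters; whether GraphGen's chunker windows sequences in precisely this way is an assumption, and `chunk_sequence` is a hypothetical helper.

```python
def chunk_sequence(seq, chunk_size=1000, overlap=100):
    """Split a sequence into windows of chunk_size that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(seq), step):
        chunks.append(seq[start:start + chunk_size])
        if start + chunk_size >= len(seq):
            break  # the last window already reaches the end of the sequence
    return chunks


if __name__ == "__main__":
    sequence = "ACGT" * 625  # 2500 bases
    for i, chunk in enumerate(chunk_sequence(sequence)):
        print(i, len(chunk))
```

With the defaults, a 2500-base sequence yields windows starting at positions 0, 900, and 1800.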
136180

137-
### Partition Methods
138-
- `dfs`: Depth-first search partitioning
139-
- `bfs`: Breadth-first search partitioning
140-
- `ece`: Error Comprehension Estimation (requires quiz & judge)
141-
- `leiden`: Leiden community detection algorithm
181+
### Partition Parameters
182+
- `method`: `anchor_bfs` (recommended for omics data)
183+
- `anchor_type`: `dna`, `rna`, or `protein` (must match your data type)
184+
- `max_units_per_community`: Maximum nodes and edges per community (default: 10)
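
The idea behind anchor-based BFS can be sketched on a toy graph: start from each node whose type equals `anchor_type` and grow a community breadth-first until the unit budget is used up. This is a simplified illustration, not GraphGen's implementation. It assumes that each node and each edge counts as one unit and only records the edges that BFS traverses; the function name `anchor_bfs` and the toy data are hypothetical.

```python
from collections import deque


def anchor_bfs(node_types, edges, anchor_type, max_units=10):
    """Grow one community per anchor node, capped at max_units nodes plus edges."""
    adjacency = {node: [] for node in node_types}
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    assigned = set()
    communities = []
    for anchor, node_type in node_types.items():
        if node_type != anchor_type or anchor in assigned:
            continue
        nodes, community_edges = [anchor], []
        assigned.add(anchor)
        queue = deque([anchor])
        while queue and len(nodes) + len(community_edges) < max_units:
            current = queue.popleft()
            for neighbor in adjacency[current]:
                if neighbor in assigned:
                    continue
                # Adding a neighbor costs two units: the node and its edge.
                if len(nodes) + len(community_edges) + 2 > max_units:
                    break
                assigned.add(neighbor)
                nodes.append(neighbor)
                community_edges.append((current, neighbor))
                queue.append(neighbor)
        communities.append({"nodes": nodes, "edges": community_edges})
    return communities


if __name__ == "__main__":
    toy_types = {"P1": "protein", "G1": "gene", "O1": "organism", "P2": "protein"}
    toy_edges = [("P1", "G1"), ("P1", "O1"), ("P2", "O1")]
    for community in anchor_bfs(toy_types, toy_edges, "protein", max_units=5):
        print(community)
```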
142185

143-
### QA Generation Methods
144-
- `atomic`: Single-fact questions
145-
- `aggregated`: Multi-fact comprehensive questions
146-
- `cot`: Chain of thought reasoning questions
147-
- `multi_hop`: Multi-hop reasoning questions
148-
- `vqa`: Visual question answering (not applicable for sequences)
149-
150-
### Output Formats
151-
- `Alpaca`: Alpaca instruction format
152-
- `ChatML`: ChatML conversation format
153-
- `Sharegpt`: ShareGPT format
154-
155-
## Output
156-
157-
The pipeline generates:
158-
- Knowledge graph with biological entities (genes, RNAs, proteins, organisms, etc.) and relationships
159-
- QA pairs in the specified format (ChatML, Alpaca, or ShareGPT)
160-
- Output location: `cache/` directory (configurable via `working_dir`)
186+
### Generation Parameters
187+
- `method`: `omics_qa` (unified method for DNA/RNA/Protein)
188+
- `data_format`: `Alpaca`, `ChatML`, or `Sharegpt`
161189

162190
## Notes
163191

164192
- **NCBI requires an email address** - Make sure to set `email` in `ncbi_params`
165-
- **Quiz & Judge steps** are only included in the aggregated config (required for ECE partition with loss-based sampling)
193+
- **Anchor type must match molecule type** - Set `anchor_type` to match your data (dna/rna/protein)
166194
- **Local BLAST** can be enabled if you have local databases set up (see `examples/search/build_db/`)
195+
- **Caption extraction** is automatic - The generator detects molecule type and extracts relevant caption information
167196
- Adjust `max_concurrent` based on your system resources and API rate limits
197+
198+
## Examples
199+
200+
### Generate QA for Protein Data
201+
1. Set `input_path` to `examples/input_examples/search_protein_demo.jsonl`
202+
2. Set `data_sources: [uniprot]`
203+
3. Set `anchor_type: protein`
204+
4. Run `./generate_omics_qa.sh`
205+
206+
### Generate QA for DNA Data
207+
1. Set `input_path` to `examples/input_examples/search_dna_demo.jsonl`
208+
2. Set `data_sources: [ncbi]`
209+
3. Set `anchor_type: dna`
210+
4. Run `./generate_omics_qa.sh`
211+
212+
### Generate QA for RNA Data
213+
1. Set `input_path` to `examples/input_examples/search_rna_demo.jsonl`
214+
2. Set `data_sources: [rnacentral]`
215+
3. Set `anchor_type: rna`
216+
4. Run `./generate_omics_qa.sh`
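
The three recipes above differ only in three settings, which can be captured in a small lookup table. The sketch below is a convenience illustration, not part of GraphGen: `PRESETS`, `settings_for`, and `is_consistent` are hypothetical names, while the paths, data sources, and anchor types are the ones listed in the examples.

```python
PRESETS = {
    "dna": {
        "input_path": "examples/input_examples/search_dna_demo.jsonl",
        "data_sources": ["ncbi"],
        "anchor_type": "dna",
    },
    "rna": {
        "input_path": "examples/input_examples/search_rna_demo.jsonl",
        "data_sources": ["rnacentral"],
        "anchor_type": "rna",
    },
    "protein": {
        "input_path": "examples/input_examples/search_protein_demo.jsonl",
        "data_sources": ["uniprot"],
        "anchor_type": "protein",
    },
}


def settings_for(molecule_type):
    """Return the three settings to change for a given molecule type."""
    key = molecule_type.lower()
    if key not in PRESETS:
        raise ValueError(f"unknown molecule type: {molecule_type!r}")
    return dict(PRESETS[key])


def is_consistent(data_sources, anchor_type):
    """Check that a data source list and an anchor type belong to the same preset."""
    expected = PRESETS.get(anchor_type.lower())
    return expected is not None and list(data_sources) == expected["data_sources"]


if __name__ == "__main__":
    print(settings_for("protein"))
    print(is_consistent(["ncbi"], "protein"))
```

A check like `is_consistent` can catch the mismatch called out in the Notes section, where `anchor_type` does not match the molecule type of the data.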

examples/generate/generate_omics_qa/generate_omics_aggregated.sh

Lines changed: 0 additions & 6 deletions
This file was deleted.

examples/generate/generate_omics_qa/generate_omics_atomic.sh

Lines changed: 0 additions & 6 deletions
This file was deleted.

examples/generate/generate_omics_qa/generate_omics_cot.sh

Lines changed: 0 additions & 6 deletions
This file was deleted.

examples/generate/generate_omics_qa/generate_omics_multi_hop.sh

Lines changed: 0 additions & 6 deletions
This file was deleted.
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
1+
python3 -m graphgen.run \
2+
--config_file examples/generate/generate_omics_qa/omics_qa_config.yaml \
3+
--output_dir cache/
