
Commit cb04c79 (1 parent: 737f45d)

refactor: refactor pipeline engine using ray data (#110)

* feat: add config and operator node types
* refactor: refactor readers with ray data
* fix: delete param parallelism for readers
* fix: fix import error
* refactor read and chunk operators with no side effects
* fix: fix import error
* fix: fix return logic
* refactor: rename operator split to chunk
* refactor: refactor build_kg to accommodate ray data
* feat: add StorageFactory & global params
* refactor: refactor quiz to accommodate ray data engine
* fix: reload graph before quizzing
* Potential fix for pull request finding 'Unreachable code'
  Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
* fix: fix quiz params
* refactor: refactor quiz & judge to ray actors
* fix: fix transferring quizzed data to JudgeService
* refactor: refactor partition to accommodate ray data
* fix: fix lint problem
* refactor: refactor op generate
* feat: write results in output folder
* fix: raise error when no dataset is created
* fix: return generator in ece_partitioner
* fix: return generator in ece_partitioner
* refactor: refactor data format to support multi-modal input
* fix: delete fetching schema to avoid ray's duplicate execution
* fix: fix operators' registry
* feat: refactor schema_guided_extraction & add examples
* feat: separate ray logs and service logs
* feat: use storage actor
* feat: add kuzu graph database
* feat: add llm as actors
* refactor: delete old runner
* fix: fix vllm wrapper
* docs: update .env.example
* fix: use kuzudb in quiz_service
* fix: update webui
* feat: make storage backend configurable
* docs: update README

---

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>

File tree: 157 files changed, +3252 −2311 lines changed


.env.example — 20 additions & 0 deletions

@@ -35,3 +35,23 @@ TRAINEE_API_KEY=
 #
 # TRAINEE_BACKEND=huggingface
 # TRAINEE_MODEL=Qwen/Qwen2.5-0.5B-Instruct
+
+# # sglang
+# SYNTHESIZER_BACKEND=sglang
+# SYNTHESIZER_MODEL=Qwen/Qwen2.5-0.5B-Instruct
+# SYNTHESIZER_TP_SIZE=1
+# SYNTHESIZER_NUM_GPUS=1
+
+# TRAINEE_BACKEND=sglang
+# TRAINEE_MODEL=Qwen/Qwen2.5-0.5B-Instruct
+# SYNTHESIZER_TP_SIZE=1
+# SYNTHESIZER_NUM_GPUS=1
+
+# # vllm
+# SYNTHESIZER_BACKEND=vllm
+# SYNTHESIZER_MODEL=Qwen/Qwen2.5-0.5B-Instruct
+# SYNTHESIZER_NUM_GPUS=1
+
+# TRAINEE_BACKEND=vllm
+# TRAINEE_MODEL=Qwen/Qwen2.5-0.5B-Instruct
+# TRAINEE_NUM_GPUS=1
README.md — 84 additions & 21 deletions

@@ -193,42 +193,105 @@ For any questions, please check [FAQ](https://github.com/open-sciencelab/GraphGe
 ```
 - Set the following environment variables:
 ```bash
-# Synthesizer is the model used to construct KG and generate data
-SYNTHESIZER_MODEL=your_synthesizer_model_name
-SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model
-SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model
-# Trainee is the model used to train with the generated data
-TRAINEE_MODEL=your_trainee_model_name
-TRAINEE_BASE_URL=your_base_url_for_trainee_model
-TRAINEE_API_KEY=your_api_key_for_trainee_model
+# Tokenizer
+TOKENIZER_MODEL=
+
+# LLM
+# Support different backends: http_api, openai_api, ollama_api, ollama, huggingface, tgi, sglang, tensorrt
+# Synthesizer is the model used to construct KG and generate data
+# Trainee is the model used to train with the generated data
+
+# http_api / openai_api
+SYNTHESIZER_BACKEND=openai_api
+SYNTHESIZER_MODEL=gpt-4o-mini
+SYNTHESIZER_BASE_URL=
+SYNTHESIZER_API_KEY=
+TRAINEE_BACKEND=openai_api
+TRAINEE_MODEL=gpt-4o-mini
+TRAINEE_BASE_URL=
+TRAINEE_API_KEY=
+
+# azure_openai_api
+# SYNTHESIZER_BACKEND=azure_openai_api
+# The following is the same as your "Deployment name" in Azure
+# SYNTHESIZER_MODEL=<your-deployment-name>
+# SYNTHESIZER_BASE_URL=https://<your-resource-name>.openai.azure.com/openai/deployments/<your-deployment-name>/chat/completions
+# SYNTHESIZER_API_KEY=
+# SYNTHESIZER_API_VERSION=<api-version>
+
+# # ollama_api
+# SYNTHESIZER_BACKEND=ollama_api
+# SYNTHESIZER_MODEL=gemma3
+# SYNTHESIZER_BASE_URL=http://localhost:11434
+#
+# Note: TRAINEE with ollama_api backend is not supported yet as ollama_api does not support logprobs.
+
+# # huggingface
+# SYNTHESIZER_BACKEND=huggingface
+# SYNTHESIZER_MODEL=Qwen/Qwen2.5-0.5B-Instruct
+#
+# TRAINEE_BACKEND=huggingface
+# TRAINEE_MODEL=Qwen/Qwen2.5-0.5B-Instruct
+
+# # sglang
+# SYNTHESIZER_BACKEND=sglang
+# SYNTHESIZER_MODEL=Qwen/Qwen2.5-0.5B-Instruct
+# SYNTHESIZER_TP_SIZE=1
+# SYNTHESIZER_NUM_GPUS=1
+
+# TRAINEE_BACKEND=sglang
+# TRAINEE_MODEL=Qwen/Qwen2.5-0.5B-Instruct
+# SYNTHESIZER_TP_SIZE=1
+# SYNTHESIZER_NUM_GPUS=1
+
+# # vllm
+# SYNTHESIZER_BACKEND=vllm
+# SYNTHESIZER_MODEL=Qwen/Qwen2.5-0.5B-Instruct
+# SYNTHESIZER_NUM_GPUS=1
+
+# TRAINEE_BACKEND=vllm
+# TRAINEE_MODEL=Qwen/Qwen2.5-0.5B-Instruct
+# TRAINEE_NUM_GPUS=1
 ```
-2. (Optional) Customize generation parameters in `graphgen/configs/` folder.
+2. (Optional) Customize generation parameters in `config.yaml`.
 
    Edit the corresponding YAML file, e.g.:
 
    ```yaml
-   # configs/cot_config.yaml
-   input_file: resources/input_examples/jsonl_demo.jsonl
-   output_data_type: cot
-   tokenizer: cl100k_base
+   # examples/generate/generate_aggregated_qa/aggregated_config.yaml
+   global_params:
+     working_dir: cache
+     graph_backend: kuzu # graph database backend, support: kuzu, networkx
+     kv_backend: rocksdb # key-value store backend, support: rocksdb, json_kv
+
+   nodes:
+     - id: read_files # id is unique in the pipeline, and can be referenced by other steps
+       op_name: read
+       type: source
+       dependencies: []
+       params:
+         input_path:
+           - examples/input_examples/jsonl_demo.jsonl # input file path, support json, jsonl, txt, pdf. See examples/input_examples for examples
+
   # additional settings...
   ```
 
 3. Generate data
 
    Pick the desired format and run the matching script:
-
-   | Format       | Script to run                                  | Notes                                                              |
-   |--------------|------------------------------------------------|-------------------------------------------------------------------|
-   | `cot`        | `bash scripts/generate/generate_cot.sh`        | Chain-of-Thought Q\&A pairs                                        |
-   | `atomic`     | `bash scripts/generate/generate_atomic.sh`     | Atomic Q\&A pairs covering basic knowledge                         |
-   | `aggregated` | `bash scripts/generate/generate_aggregated.sh` | Aggregated Q\&A pairs incorporating complex, integrated knowledge  |
-   | `multi-hop`  | `bash scripts/generate/generate_multihop.sh`   | Multi-hop reasoning Q\&A pairs                                     |
+
+   | Format       | Script to run                                                           | Notes                                                                        |
+   | ------------ | ----------------------------------------------------------------------- | ---------------------------------------------------------------------------- |
+   | `cot`        | `bash examples/generate/generate_cot_qa/generate_cot.sh`                | Chain-of-Thought Q\&A pairs                                                  |
+   | `atomic`     | `bash examples/generate/generate_atomic_qa/generate_atomic.sh`          | Atomic Q\&A pairs covering basic knowledge                                   |
+   | `aggregated` | `bash examples/generate/generate_aggregated_qa/generate_aggregated.sh`  | Aggregated Q\&A pairs incorporating complex, integrated knowledge            |
+   | `multi-hop`  | `examples/generate/generate_multi_hop_qa/generate_multi_hop.sh`         | Multi-hop reasoning Q\&A pairs                                               |
+   | `vqa`        | `bash examples/generate/generate_vqa/generate_vqa.sh`                   | Visual Question Answering pairs combining visual and textual understanding   |
 
 
 4. Get the generated data
    ```bash
-   ls cache/data/graphgen
+   ls cache/output
    ```
 
 ### Run with Docker
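The new pipeline config above declares each node with an `id` and a `dependencies` list, which implies the engine must run nodes in dependency order. A minimal sketch of how such an order can be derived with the standard library (the `execution_order` helper and the `chunk`/`build_kg` node ids are illustrative assumptions, not GraphGen's actual engine):

```python
from graphlib import TopologicalSorter

# Given pipeline nodes shaped like the YAML above ({"id": ..., "dependencies": [...]}),
# compute one valid execution order: every node appears after its dependencies.
def execution_order(nodes: list[dict]) -> list[str]:
    graph = {node["id"]: set(node.get("dependencies", [])) for node in nodes}
    return list(TopologicalSorter(graph).static_order())

nodes = [
    {"id": "chunk", "dependencies": ["read_files"]},   # hypothetical downstream step
    {"id": "read_files", "dependencies": []},          # the source node from the YAML
    {"id": "build_kg", "dependencies": ["chunk"]},     # hypothetical downstream step
]
print(execution_order(nodes))  # → ['read_files', 'chunk', 'build_kg']
```

`TopologicalSorter` also raises `graphlib.CycleError` on circular dependencies, which is a cheap way to validate a pipeline config before running it.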

README_zh.md — 83 additions & 19 deletions

@@ -190,42 +190,106 @@ GraphGen first builds a fine-grained knowledge graph from the source text, then leverages the expected
 ```
 - Set the following environment variables:
 ```bash
-# Synthesizer is the model used to build the knowledge graph and generate data
-SYNTHESIZER_MODEL=your_synthesizer_model_name
-SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model
-SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model
-# Trainee is the model trained on the generated data
-TRAINEE_MODEL=your_trainee_model_name
-TRAINEE_BASE_URL=your_base_url_for_trainee_model
-TRAINEE_API_KEY=your_api_key_for_trainee_model
+# Tokenizer
+TOKENIZER_MODEL=
+
+# LLM
+# Supported backends: http_api, openai_api, ollama_api, ollama, huggingface, tgi, sglang, tensorrt
+# Synthesizer is the model used to build the knowledge graph and generate data
+# Trainee is the model trained on the generated data
+
+# http_api / openai_api
+SYNTHESIZER_BACKEND=openai_api
+SYNTHESIZER_MODEL=gpt-4o-mini
+SYNTHESIZER_BASE_URL=
+SYNTHESIZER_API_KEY=
+TRAINEE_BACKEND=openai_api
+TRAINEE_MODEL=gpt-4o-mini
+TRAINEE_BASE_URL=
+TRAINEE_API_KEY=
+
+# azure_openai_api
+# SYNTHESIZER_BACKEND=azure_openai_api
+# The following is the same as your "Deployment name" in Azure
+# SYNTHESIZER_MODEL=<your-deployment-name>
+# SYNTHESIZER_BASE_URL=https://<your-resource-name>.openai.azure.com/openai/deployments/<your-deployment-name>/chat/completions
+# SYNTHESIZER_API_KEY=
+# SYNTHESIZER_API_VERSION=<api-version>
+
+# # ollama_api
+# SYNTHESIZER_BACKEND=ollama_api
+# SYNTHESIZER_MODEL=gemma3
+# SYNTHESIZER_BASE_URL=http://localhost:11434
+#
+# Note: TRAINEE with ollama_api backend is not supported yet as ollama_api does not support logprobs.
+
+# # huggingface
+# SYNTHESIZER_BACKEND=huggingface
+# SYNTHESIZER_MODEL=Qwen/Qwen2.5-0.5B-Instruct
+#
+# TRAINEE_BACKEND=huggingface
+# TRAINEE_MODEL=Qwen/Qwen2.5-0.5B-Instruct
+
+# # sglang
+# SYNTHESIZER_BACKEND=sglang
+# SYNTHESIZER_MODEL=Qwen/Qwen2.5-0.5B-Instruct
+# SYNTHESIZER_TP_SIZE=1
+# SYNTHESIZER_NUM_GPUS=1
+
+# TRAINEE_BACKEND=sglang
+# TRAINEE_MODEL=Qwen/Qwen2.5-0.5B-Instruct
+# SYNTHESIZER_TP_SIZE=1
+# SYNTHESIZER_NUM_GPUS=1
+
+# # vllm
+# SYNTHESIZER_BACKEND=vllm
+# SYNTHESIZER_MODEL=Qwen/Qwen2.5-0.5B-Instruct
+# SYNTHESIZER_NUM_GPUS=1
+
+# TRAINEE_BACKEND=vllm
+# TRAINEE_MODEL=Qwen/Qwen2.5-0.5B-Instruct
+# TRAINEE_NUM_GPUS=1
 ```
 2. (Optional) To change the default generation settings, edit the YAML files in the `graphgen/configs/` folder.
 
    For example:
 
    ```yaml
-   # configs/cot_config.yaml
-   input_file: resources/input_examples/jsonl_demo.jsonl
-   output_data_type: cot
-   tokenizer: cl100k_base
+   # examples/generate/generate_aggregated_qa/aggregated_config.yaml
+   global_params:
+     working_dir: cache
+     graph_backend: kuzu # graph database backend, support: kuzu, networkx
+     kv_backend: rocksdb # key-value store backend, support: rocksdb, json_kv
+
+   nodes:
+     - id: read_files # id is unique in the pipeline, and can be referenced by other steps
+       op_name: read
+       type: source
+       dependencies: []
+       params:
+         input_path:
+           - examples/input_examples/jsonl_demo.jsonl # input file path, support json, jsonl, txt, pdf. See examples/input_examples for examples
+
   # other settings...
   ```
 
 3. Generate data
 
    Pick the desired format and run the matching script:
 
-   | Format       | Script to run                                  | Notes                                            |
-   |--------------|------------------------------------------------|--------------------------------------------------|
-   | `cot`        | `bash scripts/generate/generate_cot.sh`        | Chain-of-Thought Q&A pairs                       |
-   | `atomic`     | `bash scripts/generate/generate_atomic.sh`     | Atomic Q&A pairs covering basic knowledge        |
-   | `aggregated` | `bash scripts/generate/generate_aggregated.sh` | Aggregated Q&A pairs integrating complex knowledge |
-   | `multi-hop`  | `bash scripts/generate/generate_multihop.sh`   | Multi-hop reasoning Q&A pairs                    |
+
+   | Format       | Script to run                                                           | Notes                                                        |
+   | ------------ | ----------------------------------------------------------------------- | ------------------------------------------------------------ |
+   | `cot`        | `bash examples/generate/generate_cot_qa/generate_cot.sh`                | Chain-of-Thought Q&A pairs                                   |
+   | `atomic`     | `bash examples/generate/generate_atomic_qa/generate_atomic.sh`          | Atomic Q&A pairs covering basic knowledge                    |
+   | `aggregated` | `bash examples/generate/generate_aggregated_qa/generate_aggregated.sh`  | Aggregated Q&A pairs integrating complex knowledge           |
+   | `multi-hop`  | `bash examples/generate/generate_multi_hop_qa/generate_multi_hop.sh`    | Multi-hop reasoning Q&A pairs                                |
+   | `vqa`        | `bash examples/generate/generate_vqa/generate_vqa.sh`                   | Visual Q&A pairs combining visual and textual understanding  |
 
 
 4. View the generated results
    ```bash
-   ls cache/data/graphgen
+   ls cache/output
   ```
 
 ### Run with Docker

baselines/BDS/bds.py — 2 additions & 4 deletions

@@ -8,8 +8,8 @@
 from tqdm.asyncio import tqdm as tqdm_async
 
 from graphgen.bases import BaseLLMWrapper
+from graphgen.common import init_llm
 from graphgen.models import NetworkXStorage
-from graphgen.operators import init_llm
 from graphgen.utils import create_event_loop
 
 QA_GENERATION_PROMPT = """
@@ -54,9 +54,7 @@ def _post_process(text: str) -> dict:
 
 class BDS:
     def __init__(self, llm_client: BaseLLMWrapper = None, max_concurrent: int = 1000):
-        self.llm_client: BaseLLMWrapper = llm_client or init_llm(
-            "synthesizer"
-        )
+        self.llm_client: BaseLLMWrapper = llm_client or init_llm("synthesizer")
         self.max_concurrent: int = max_concurrent
 
     def generate(self, tasks: List[dict]) -> List[dict]:
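The constructor in the diff above uses the common inject-or-default pattern: accept an LLM client if one is passed in, otherwise build one via `init_llm`. A self-contained sketch of the same pattern, with `StubLLM` and `default_llm` as illustrative stand-ins for `BaseLLMWrapper` / `init_llm` (which require a real model backend):

```python
from typing import Optional

class StubLLM:
    """Stand-in for BaseLLMWrapper; only records which role it serves."""
    def __init__(self, role: str):
        self.role = role

def default_llm(role: str) -> StubLLM:
    # Stand-in for init_llm("synthesizer") / init_llm("trainee").
    return StubLLM(role)

class BDSLike:
    def __init__(self, llm_client: Optional[StubLLM] = None, max_concurrent: int = 1000):
        # `llm_client or default_llm(...)` mirrors the diff. Note that `or`
        # replaces any falsy value, not only None; that is harmless here
        # because a real client object is always truthy.
        self.llm_client = llm_client or default_llm("synthesizer")
        self.max_concurrent = max_concurrent
```

With this shape, tests can inject a stub client while production code gets a working default without extra wiring.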
File renamed without changes.
