Commit 52c23f0

Merge branch 'main' into feature/protein-qa
2 parents: cffaf5d + 02adac3

File tree: 12 files changed, +48 −45 lines


README.md

Lines changed: 11 additions & 6 deletions
@@ -62,13 +62,16 @@ After data generation, you can use [LLaMA-Factory](https://github.com/hiyouga/LL
 
 ## 📌 Latest Updates
 
-- **2025.12.1**: Added search support for the [NCBI](https://www.ncbi.nlm.nih.gov/) and [RNAcentral](https://rnacentral.org/) databases, enabling extraction of DNA and RNA data from these bioinformatics databases.
-- **2025.10.30**: We support several new LLM clients and inference backends, including [Ollama_client](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/api/ollama_client.py), [http_client](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/api/http_client.py), [HuggingFace Transformers](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/local/hf_wrapper.py) and [SGLang](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/local/sglang_wrapper.py).
-- **2025.10.23**: We now support VQA (Visual Question Answering) data generation. Run script: `bash scripts/generate/generate_vqa.sh`.
+- **2025.12.16**: Added [rocksdb](https://github.com/facebook/rocksdb) as a key-value storage backend and [kuzudb](https://github.com/kuzudb/kuzu) as a graph database backend.
+- **2025.12.16**: Added [vllm](https://github.com/vllm-project/vllm) as a local inference backend.
+- **2025.12.16**: Refactored the data generation pipeline with [ray](https://github.com/ray-project/ray) to improve the efficiency of distributed execution and resource management.
 
 <details>
 <summary>History</summary>
 
+- **2025.12.1**: Added search support for the [NCBI](https://www.ncbi.nlm.nih.gov/) and [RNAcentral](https://rnacentral.org/) databases, enabling extraction of DNA and RNA data from these bioinformatics databases.
+- **2025.10.30**: We support several new LLM clients and inference backends, including [Ollama_client](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/api/ollama_client.py), [http_client](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/api/http_client.py), [HuggingFace Transformers](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/local/hf_wrapper.py) and [SGLang](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/local/sglang_wrapper.py).
+- **2025.10.23**: We now support VQA (Visual Question Answering) data generation. Run script: `bash scripts/generate/generate_vqa.sh`.
 - **2025.10.21**: We now support PDF as an input format for data generation via [MinerU](https://github.com/opendatalab/MinerU).
 - **2025.09.29**: We auto-update the Gradio demo on [Hugging Face](https://huggingface.co/spaces/chenzihong/GraphGen) and [ModelScope](https://modelscope.cn/studios/chenzihong/GraphGen).
 - **2025.08.14**: We added support for community detection in knowledge graphs using the Leiden algorithm, enabling the synthesis of Chain-of-Thought (CoT) data.
@@ -84,13 +87,14 @@ We support various LLM inference servers, API servers, inference clients, input
 Users can flexibly configure according to the needs of synthetic data.
 
-| Inference Server | Api Server | Inference Client | Data Source | Data Modal | Data Type |
-|----------------------------------------------|--------------------------------------------------------------------------------|------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|-------------------------------------------------|
-| [![hf-icon]HF][hf]<br>[![sg-icon]SGLang][sg] | [![sif-icon]Silicon][sif]<br>[![oai-icon]OpenAI][oai]<br>[![az-icon]Azure][az] | HTTP<br>[![ol-icon]Ollama][ol]<br>[![oai-icon]OpenAI][oai] | Files(CSV, JSON, PDF, TXT, etc.)<br>Databases([![uniprot-icon]UniProt][uniprot], [![ncbi-icon]NCBI][ncbi], [![rnacentral-icon]RNAcentral][rnacentral])<br>Search Engines([![bing-icon]Bing][bing], [![google-icon]Google][google])<br>Knowledge Graphs([![wiki-icon]Wikipedia][wiki]) | TEXT<br>IMAGE | Aggregated<br>Atomic<br>CoT<br>Multi-hop<br>VQA |
+| Inference Server | Api Server | Inference Client | Data Source | Data Modal | Data Type |
+|--------------------------------------------------------------------------|--------------------------------------------------------------------------------|------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|-------------------------------------------------|
+| [![hf-icon]HF][hf]<br>[![sg-icon]SGLang][sg]<br>[![vllm-icon]vllm][vllm] | [![sif-icon]Silicon][sif]<br>[![oai-icon]OpenAI][oai]<br>[![az-icon]Azure][az] | HTTP<br>[![ol-icon]Ollama][ol]<br>[![oai-icon]OpenAI][oai] | Files(CSV, JSON, PDF, TXT, etc.)<br>Databases([![uniprot-icon]UniProt][uniprot], [![ncbi-icon]NCBI][ncbi], [![rnacentral-icon]RNAcentral][rnacentral])<br>Search Engines([![bing-icon]Bing][bing], [![google-icon]Google][google])<br>Knowledge Graphs([![wiki-icon]Wikipedia][wiki]) | TEXT<br>IMAGE | Aggregated<br>Atomic<br>CoT<br>Multi-hop<br>VQA |
 
 <!-- links -->
 [hf]: https://huggingface.co/docs/transformers/index
 [sg]: https://docs.sglang.ai
+[vllm]: https://github.com/vllm-project/vllm
 [sif]: https://siliconflow.cn
 [oai]: https://openai.com
 [az]: https://azure.microsoft.com/en-us/services/cognitive-services/openai-service/
@@ -106,6 +110,7 @@ Users can flexibly configure according to the needs of synthetic data.
 <!-- icons -->
 [hf-icon]: https://www.google.com/s2/favicons?domain=https://huggingface.co
 [sg-icon]: https://www.google.com/s2/favicons?domain=https://docs.sglang.ai
+[vllm-icon]: https://www.google.com/s2/favicons?domain=https://docs.vllm.ai
 [sif-icon]: https://www.google.com/s2/favicons?domain=siliconflow.com
 [oai-icon]: https://www.google.com/s2/favicons?domain=https://openai.com
 [az-icon]: https://www.google.com/s2/favicons?domain=https://azure.microsoft.com

README_zh.md

Lines changed: 11 additions & 6 deletions
@@ -62,13 +62,16 @@ GraphGen first builds a fine-grained knowledge graph from the source text, then uses the exp…
 After data generation, you can use [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) or [xtuner](https://github.com/InternLM/xtuner) to fine-tune large language models.
 
 ## 📌 Latest Updates
-- **2025.12.1**: Added search support for the [NCBI](https://www.ncbi.nlm.nih.gov/) and [RNAcentral](https://rnacentral.org/) databases; DNA and RNA data can now be extracted from these bioinformatics databases.
-- **2025.10.30**: We support several new LLM clients and inference backends, including [Ollama_client](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/api/ollama_client.py), [http_client](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/api/http_client.py), [HuggingFace Transformers](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/local/hf_wrapper.py) and [SGLang](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/local/sglang_wrapper.py).
-- **2025.10.23**: We now support VQA (Visual Question Answering) data generation. Run script: `bash scripts/generate/generate_vqa.sh`
+- **2025.12.16**: Added [rocksdb](https://github.com/facebook/rocksdb) as a key-value storage backend and [kuzudb](https://github.com/kuzudb/kuzu) as a graph database backend
+- **2025.12.16**: Added [vllm](https://github.com/vllm-project/vllm) as a local inference backend
+- **2025.12.16**: Refactored the data generation pipeline with [ray](https://github.com/ray-project/ray), improving the efficiency of distributed execution and resource management
 
 <details>
 <summary>History</summary>
 
+- **2025.12.1**: Added search support for the [NCBI](https://www.ncbi.nlm.nih.gov/) and [RNAcentral](https://rnacentral.org/) databases; DNA and RNA data can now be extracted from these bioinformatics databases.
+- **2025.10.30**: We support several new LLM clients and inference backends, including [Ollama_client](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/api/ollama_client.py), [http_client](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/api/http_client.py), [HuggingFace Transformers](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/local/hf_wrapper.py) and [SGLang](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/local/sglang_wrapper.py)
+- **2025.10.23**: We now support VQA (Visual Question Answering) data generation. Run script: `bash scripts/generate/generate_vqa.sh`
 - **2025.10.21**: We now support PDF as an input format for data generation via [MinerU](https://github.com/opendatalab/MinerU).
 - **2025.09.29**: We auto-update the Gradio app on [Hugging Face](https://huggingface.co/spaces/chenzihong/GraphGen) and [ModelScope](https://modelscope.cn/studios/chenzihong/GraphGen).
 - **2025.08.14**: Added support for community detection on knowledge graphs using the Leiden algorithm to synthesize CoT data.
@@ -82,13 +85,14 @@ GraphGen first builds a fine-grained knowledge graph from the source text, then uses the exp…
 We support a variety of LLM inference servers, API servers, inference clients, input file formats, data modalities, output data formats, and output data types.
 These can be flexibly configured according to the needs of the synthetic data.
 
-| Inference Server | API Server | Inference Client | Input File Format | Data Modality | Output Data Type |
-|----------------------------------------------|--------------------------------------------------------------------------------|------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|-------------------------------------------------|
-| [![hf-icon]HF][hf]<br>[![sg-icon]SGLang][sg] | [![sif-icon]Silicon][sif]<br>[![oai-icon]OpenAI][oai]<br>[![az-icon]Azure][az] | HTTP<br>[![ol-icon]Ollama][ol]<br>[![oai-icon]OpenAI][oai] | Files (CSV, JSON, JSONL, PDF, TXT, etc.)<br>Databases([![uniprot-icon]UniProt][uniprot], [![ncbi-icon]NCBI][ncbi], [![rnacentral-icon]RNAcentral][rnacentral])<br>Search Engines([![bing-icon]Bing][bing], [![google-icon]Google][google])<br>Knowledge Graphs([![wiki-icon]Wikipedia][wiki]) | TEXT<br>IMAGE | Aggregated<br>Atomic<br>CoT<br>Multi-hop<br>VQA |
+| Inference Server | API Server | Inference Client | Input File Format | Data Modality | Output Data Type |
+|--------------------------------------------------------------------------|--------------------------------------------------------------------------------|------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|-------------------------------------------------|
+| [![hf-icon]HF][hf]<br>[![sg-icon]SGLang][sg]<br>[![vllm-icon]vllm][vllm] | [![sif-icon]Silicon][sif]<br>[![oai-icon]OpenAI][oai]<br>[![az-icon]Azure][az] | HTTP<br>[![ol-icon]Ollama][ol]<br>[![oai-icon]OpenAI][oai] | Files (CSV, JSON, JSONL, PDF, TXT, etc.)<br>Databases([![uniprot-icon]UniProt][uniprot], [![ncbi-icon]NCBI][ncbi], [![rnacentral-icon]RNAcentral][rnacentral])<br>Search Engines([![bing-icon]Bing][bing], [![google-icon]Google][google])<br>Knowledge Graphs([![wiki-icon]Wikipedia][wiki]) | TEXT<br>IMAGE | Aggregated<br>Atomic<br>CoT<br>Multi-hop<br>VQA |
 
 <!-- links -->
 [hf]: https://huggingface.co/docs/transformers/index
 [sg]: https://docs.sglang.ai
+[vllm]: https://github.com/vllm-project/vllm
 [sif]: https://siliconflow.cn
 [oai]: https://openai.com
 [az]: https://azure.microsoft.com/en-us/services/cognitive-services/openai-service/
@@ -104,6 +108,7 @@ GraphGen first builds a fine-grained knowledge graph from the source text, then uses the exp…
 <!-- icons -->
 [hf-icon]: https://www.google.com/s2/favicons?domain=https://huggingface.co
 [sg-icon]: https://www.google.com/s2/favicons?domain=https://docs.sglang.ai
+[vllm-icon]: https://www.google.com/s2/favicons?domain=https://docs.vllm.ai
 [sif-icon]: https://www.google.com/s2/favicons?domain=siliconflow.com
 [oai-icon]: https://www.google.com/s2/favicons?domain=https://openai.com
 [az-icon]: https://www.google.com/s2/favicons?domain=https://azure.microsoft.com

examples/evaluate/evaluate.sh

Lines changed: 0 additions & 1 deletion
@@ -1,4 +1,3 @@
 python3 -m graphgen.evaluate --folder cache/data \
-  --output cache/output \
   --reward "OpenAssistant/reward-model-deberta-v3-large-v2,BAAI/IndustryCorpus2_DataRater" \
   --uni MingZhong/unieval-sum \
Lines changed: 1 addition & 2 deletions
@@ -1,3 +1,2 @@
 python3 -m graphgen.run \
-  --config_file examples/extract/extract_schema_guided/schema_guided_extraction_config.yaml \
-  --output_dir cache/
+  --config_file examples/extract/extract_schema_guided/schema_guided_extraction_config.yaml
Lines changed: 1 addition & 2 deletions
@@ -1,3 +1,2 @@
 python3 -m graphgen.run \
-  --config_file examples/generate/generate_aggregated_qa/aggregated_config.yaml \
-  --output_dir cache/
+  --config_file examples/generate/generate_aggregated_qa/aggregated_config.yaml
Lines changed: 1 addition & 2 deletions
@@ -1,3 +1,2 @@
 python3 -m graphgen.run \
-  --config_file examples/generate/generate_atomic_qa/atomic_config.yaml \
-  --output_dir cache/
+  --config_file examples/generate/generate_atomic_qa/atomic_config.yaml
Lines changed: 1 addition & 2 deletions
@@ -1,3 +1,2 @@
 python3 -m graphgen.run \
-  --config_file examples/generate/generate_cot_qa/cot_config.yaml \
-  --output_dir cache/
+  --config_file examples/generate/generate_cot_qa/cot_config.yaml
Lines changed: 1 addition & 2 deletions
@@ -1,3 +1,2 @@
 python3 -m graphgen.run \
-  --config_file examples/generate/generate_multi_hop_qa/multi_hop_config.yaml \
-  --output_dir cache/
+  --config_file examples/generate/generate_multi_hop_qa/multi_hop_config.yaml
Lines changed: 1 addition & 2 deletions
@@ -1,3 +1,2 @@
 python3 -m graphgen.run \
-  --config_file examples/generate/generate_vqa/vqa_config.yaml \
-  --output_dir cache/
+  --config_file examples/generate/generate_vqa/vqa_config.yaml

graphgen/common/init_llm.py

Lines changed: 1 addition & 1 deletion
@@ -131,7 +131,7 @@ def create_llm(
         ray.get_actor(actor_name)
     except ValueError:
         print(f"Creating Ray actor for LLM {model_type} with backend {backend}.")
-        num_gpus = int(config.pop("num_gpus", 0))
+        num_gpus = float(config.pop("num_gpus", 0))
         actor = (
             ray.remote(LLMServiceActor)
             .options(
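The one-character cast change above matters because Ray's `.options(num_gpus=...)` accepts fractional GPU shares (for example `0.5` to pack two actors onto one GPU), and `int()` silently truncates such a request to `0`. A minimal sketch of the difference in plain Python (no Ray required; the `config` dict here is hypothetical, standing in for the `config` passed to `create_llm`):

```python
# Ray allows fractional GPU requests per actor, e.g. num_gpus=0.5.
# Casting with int() truncates the fraction, so a half-GPU request
# would silently become a request for no GPU at all.
config = {"num_gpus": 0.5}  # hypothetical config asking for half a GPU

truncated = int(config["num_gpus"])    # 0   -> actor scheduled with no GPU
preserved = float(config["num_gpus"])  # 0.5 -> half-GPU share passed to Ray

print(truncated, preserved)  # prints "0 0.5"
```

With the old `int()` cast, any fractional `num_gpus` below 1 in a user config left the actor with no GPU at all (and a string value like `"0.5"` would raise `ValueError`); `float()` forwards the share to Ray unchanged.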

0 commit comments
