
Commit 679dd0d: Model handlers support and config updates (#114)

* batch_size common for all metrics
* Fix max_length=32 for gibberish
* fix: probe
* ruff fixes
* Cleanup
* fix MUSE citation

Co-authored-by: Anmol Mekala <49127549+molereddy@users.noreply.github.com>
Co-authored-by: molereddy <m.anmolreddy@gmail.com>

1 parent: 31d826c

31 files changed (+202, −54 lines)

README.md (9 additions, 8 deletions)

````diff
@@ -18,7 +18,7 @@
 ## 📖 Overview
 
-We provide efficient and streamlined implementations of the TOFU, MUSE unlearning benchmarks while supporting 6 unlearning methods, 3+ datasets, 9+ evaluation metrics, and 6+ LLM architectures. Each of these can be easily extended to incorporate more variants.
+We provide efficient and streamlined implementations of the TOFU, MUSE and WMDP unlearning benchmarks while supporting 6 unlearning methods, 5+ datasets, 10+ evaluation metrics, and 7+ LLM architectures. Each of these can be easily extended to incorporate more variants.
 
 We invite the LLM unlearning community to collaborate by adding new benchmarks, unlearning methods, datasets and evaluation metrics here to expand OpenUnlearning's features, gain feedback from wider usage and drive progress in the field.
@@ -64,7 +64,7 @@ We provide several variants for each of the components in the unlearning pipelin
 | **Benchmarks** | [TOFU](https://arxiv.org/abs/2401.06121), [MUSE](https://muse-bench.github.io/), [WMDP](https://www.wmdp.ai/) |
 | **Unlearning Methods** | GradAscent, GradDiff, NPO, SimNPO, DPO, RMU |
 | **Evaluation Metrics** | Verbatim Probability, Verbatim ROUGE, Knowledge QA-ROUGE, Model Utility, Forget Quality, TruthRatio, Extraction Strength, Exact Memorization, 6 MIA attacks, [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) |
-| **Datasets** | MUSE-News (BBC), MUSE-Books (Harry Potter), TOFU (different splits) |
+| **Datasets** | MUSE-News (BBC), MUSE-Books (Harry Potter), TOFU (different splits), WMDP-Bio, WMDP-Cyber |
 | **Model Families** | TOFU: LLaMA-3.2, LLaMA-3.1, LLaMA-2; MUSE: LLaMA-2; Additional: Phi-3.5, Phi-1.5, Gemma, Zephyr |
 
 ---
@@ -209,13 +209,14 @@ If you use OpenUnlearning in your research, please cite OpenUnlearning and the b
   booktitle={First Conference on Language Modeling},
   year={2024}
 }
-@inproceedings{
-shi2025muse,
-title={{MUSE}: Machine Unlearning Six-Way Evaluation for Language Models},
+@article{shi2024muse,
+  title={MUSE: Machine Unlearning Six-Way Evaluation for Language Models},
   author={Weijia Shi and Jaechan Lee and Yangsibo Huang and Sadhika Malladi and Jieyu Zhao and Ari Holtzman and Daogao Liu and Luke Zettlemoyer and Noah A. Smith and Chiyuan Zhang},
-booktitle={The Thirteenth International Conference on Learning Representations},
-year={2025},
-url={https://openreview.net/forum?id=TArmA033BU}
+  year={2024},
+  eprint={2407.06460},
+  archivePrefix={arXiv},
+  primaryClass={cs.CL},
+  url={https://arxiv.org/abs/2407.06460},
 }
 ```
 </details>
````

configs/eval/tofu.yaml (3 additions, 1 deletion)

```diff
@@ -26,4 +26,6 @@ metrics: {} # lists a mapping from each evaluation metric to its config
 overwrite: false
 forget_split: forget10
 holdout_split: holdout10
-retain_logs_path: null
+retain_logs_path: null
+question_key: "question" # Specifies which key to use during forget and retain evaluations (e.g., "question" or "paraphrased_question")
+batch_size: 32
```
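The new `question_key` option lets the same TOFU record be evaluated against either its original question or its paraphrase. A minimal Python sketch of the idea; `build_prompt` and the row fields are hypothetical stand-ins, not the repository's actual code:

```python
# Invented TOFU-style record; the real dataset rows come from HuggingFace.
row = {
    "question": "Where was the author born?",
    "paraphrased_question": "What is the author's birthplace?",
    "answer": "In a small coastal town.",
}

def build_prompt(row: dict, question_key: str = "question") -> str:
    # question_key plays the role of eval.tofu.question_key in the configs:
    # it selects which field of the record becomes the evaluation prompt.
    return f"Question: {row[question_key]}\nAnswer:"

print(build_prompt(row, question_key="question"))
print(build_prompt(row, question_key="paraphrased_question"))
```

Switching the single `eval.tofu.question_key` value flips every metric below between original and paraphrased questions without touching each metric config.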

configs/eval/tofu_metrics/exact_memorization.yaml (3 additions, 2 deletions)

```diff
@@ -5,10 +5,11 @@ defaults:
 # ^ get default dataset and generation config information
 
 handler: exact_memorization
-batch_size: 32
+batch_size: ${eval.tofu.batch_size}
 
 datasets:
   TOFU_QA_forget:
     args:
       hf_args:
-        name: ${eval.tofu.forget_split}
+        name: ${eval.tofu.forget_split}_perturbed
+      question_key: ${eval.tofu.question_key}
```

configs/eval/tofu_metrics/extraction_strength.yaml (3 additions, 2 deletions)

```diff
@@ -5,10 +5,11 @@ defaults:
 # ^ get default dataset and generation config information
 
 handler: extraction_strength
-batch_size: 32
+batch_size: ${eval.tofu.batch_size}
 
 datasets:
   TOFU_QA_forget:
     args:
       hf_args:
-        name: ${eval.tofu.forget_split}
+        name: ${eval.tofu.forget_split}_perturbed
+      question_key: ${eval.tofu.question_key}
```

configs/eval/tofu_metrics/forget_Q_A_PARA_Prob.yaml (3 additions, 2 deletions)

```diff
@@ -6,10 +6,11 @@ defaults:
 # ^ get default dataset and generation config information
 
 handler: probability
-batch_size: 32
+batch_size: ${eval.tofu.batch_size}
 
 datasets:
   TOFU_QA_forget_para:
     args:
       hf_args:
-        name: ${eval.tofu.forget_split}_perturbed
+        name: ${eval.tofu.forget_split}_perturbed
+      question_key: ${eval.tofu.question_key}
```

configs/eval/tofu_metrics/forget_Q_A_PARA_ROUGE.yaml (2 additions, 1 deletion)

```diff
@@ -8,13 +8,14 @@ defaults:
 
 handler: rouge
 rouge_type: rougeL_recall
-batch_size: 32
+batch_size: ${eval.tofu.batch_size}
 
 datasets: # override as needed
   TOFU_QA_forget_para:
     args:
       hf_args:
         name: ${eval.tofu.forget_split}_perturbed
+      question_key: ${eval.tofu.question_key}
       predict_with_generate: True
 collators:
   DataCollatorForSupervisedDataset:
```

configs/eval/tofu_metrics/forget_Q_A_PERT_Prob.yaml (3 additions, 2 deletions)

```diff
@@ -5,10 +5,11 @@ defaults:
 # ^ get default dataset and generation config information
 
 handler: probability
-batch_size: 32
+batch_size: ${eval.tofu.batch_size}
 
 datasets:
   TOFU_QA_forget_pert:
     args:
       hf_args:
-        name: ${eval.tofu.forget_split}_perturbed
+        name: ${eval.tofu.forget_split}_perturbed
+      question_key: ${eval.tofu.question_key}
```

configs/eval/tofu_metrics/forget_Q_A_PERT_ROUGE.yaml (2 additions, 1 deletion)

```diff
@@ -7,13 +7,14 @@ defaults:
 
 handler: rouge
 rouge_type: rougeL_recall
-batch_size: 32
+batch_size: ${eval.tofu.batch_size}
 
 datasets: # override as needed
   TOFU_QA_forget_pert:
     args:
       hf_args:
         name: ${eval.tofu.forget_split}_perturbed
+      question_key: ${eval.tofu.question_key}
       predict_with_generate: True
 collators:
   DataCollatorForSupervisedDataset:
```

configs/eval/tofu_metrics/forget_Q_A_Prob.yaml (3 additions, 2 deletions)

```diff
@@ -5,10 +5,11 @@ defaults:
 # ^ get default dataset and generation config information
 
 handler: probability
-batch_size: 32
+batch_size: ${eval.tofu.batch_size}
 
 datasets:
   TOFU_QA_forget:
     args:
       hf_args:
-        name: ${eval.tofu.forget_split}
+        name: ${eval.tofu.forget_split}_perturbed
+      question_key: ${eval.tofu.question_key}
```

configs/eval/tofu_metrics/forget_Q_A_ROUGE.yaml (3 additions, 2 deletions)

```diff
@@ -8,13 +8,14 @@ defaults:
 
 handler: rouge
 rouge_type: rougeL_recall
-batch_size: 32
+batch_size: ${eval.tofu.batch_size}
 
 datasets: # override as needed
   TOFU_QA_forget:
     args:
       hf_args:
-        name: ${eval.tofu.forget_split}
+        name: ${eval.tofu.forget_split}_perturbed
+      question_key: ${eval.tofu.question_key}
       predict_with_generate: True
 collators:
   DataCollatorForSupervisedDataset:
```
